Text Fu

A Little Background

One of the things I've loved about working at Blue Parabola thus far is that my bosses not only allow me time for professional development, but value it and make it part of the company's policies and culture. With the demands that exist outside of my work, such as being a father of three, this makes it significantly easier to find time for sharpening my skills and improving myself as a developer.

When first getting into PHP or open source in general, something that a lot of people don't realize up front is that both programming and using software require reading. Lots and lots of reading. The latest instance of said reading is none other than The Pragmatic Programmer. There are a number of things to learn from this book, but it was suggested to me that I zone in on one particular section.

Back to Basics

Being a developer is somewhat of a juggling act. Communication, requirements gathering, documentation, implementation, testing -- at some level and in some fashion, these things are all involved in a web development project. In handling each piece of the puzzle, perhaps for multiple projects at a time, it can be easy to lose focus. Being a geek to the core, I have a favorite quote from the late Pat Morita in the film The Karate Kid Part II: "When you feel life out of focus, always return to basic of life."

Our ultimate goal as developers is to obtain requirements to fulfill a perceived need, derive a solution that meets those requirements, and convert that solution into something functional that both fellow developers can understand and a computer can execute. High-level programming languages provide a way to accomplish the last part of that objective via a common form of communication suited to our purposes. Source code in these languages is to us as web pages are to crawlers: just plain text. There are times when it's easy to forget that simple fact.

Why does that matter? Several reasons.

  • Versatility: Text files can contain data in formats ranging in nature from free-form and delimited text to more complex formats like XML and JSON.
  • Malleability: From tools that originated in UNIX and continue to live on in its variants, all the way to present-day programming languages like PHP and Python, a plethora of tools exist to allow you to manipulate text.
  • Universality: All operating systems support it, and most programs, regardless of what type of data they deal with, have a way of accepting input from a text file.

So I thought I'd take a blog post to review a few of the CLI tools that I use on a semiregular basis. (If you're curious, I'm running the latest version of Ubuntu these days.) You may have already read about my beginnings with bash, so I'll simply refer to that here as an example of what cool things you can do when you're familiar with the shell you use.

Of Diffs and Patches

When contributing to an open source project, access to its version control repository often has to be earned over the course of several contributions. Whether contributing to documentation, tests, or the main source code, you will most likely have to submit a patch at some point. diff is often used for this by calling it like so.

diff originalfile modifiedfile > originalfile.patch

This patch file can then be used on another copy of the original file, perhaps by a developer, to automatically apply any changes that were made to result in the modified version. This is done using patch.

patch originalfile originalfile.patch
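To make that round trip concrete, here's a minimal sketch you can run in a scratch directory. The file contents and the name "copy" are made up for illustration, and I've added the -u flag (unified format) on the assumption that the project you're contributing to prefers it, as many do.

```shell
# Work in a throwaway directory so nothing real gets touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\nbanana\ncherry\n' > originalfile
printf 'apple\nblueberry\ncherry\n' > modifiedfile

# -u produces a unified diff.
# diff exits with status 1 when the files differ, which is expected here.
diff -u originalfile modifiedfile > originalfile.patch || true

# Apply the patch to a fresh copy of the original.
cp originalfile copy
patch copy originalfile.patch

# cmp is silent and exits 0 when the two files are identical.
cmp copy modifiedfile && echo "copy now matches modifiedfile"
```

Naming the target file explicitly, as above, tells patch which file to change regardless of the file names recorded in the patch header.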

Slicing and Dicing

You've likely already used head and tail to get the first or last few lines (configurable using the -n flag) of output from a command or file. Same with grep and its variants for filtering lines based on a substring or regular expression.
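As a quick refresher, here's a small sketch of all three against a ten-line sample file. The file and its contents are invented for the example.

```shell
# Build a ten-line sample file: "line 1" through "line 10".
sample=$(mktemp)
seq 1 10 | sed 's/^/line /' > "$sample"

first_three=$(head -n 3 "$sample")   # line 1, line 2, line 3
last_two=$(tail -n 2 "$sample")      # line 9, line 10

# -E enables extended regular expressions; this matches lines ending in " 1" or " 10".
matches=$(grep -E ' (1|10)$' "$sample")

printf '%s\n' "$first_three" "$last_two" "$matches"
```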

There are also ways to combine files. Want to concatenate the contents of one file to another in sequence? Use cat. Want to do the same, but horizontally in a columnar fashion? Try paste. How about grouping the contents of lines in multiple files together based on a common field value? join can do that.
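Here's a rough sketch of the three side by side. The two files (names.txt and roles.txt) are hypothetical and share an ID in the first column; note that join expects both inputs to be sorted on that field.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small files keyed by an ID in the first column, already sorted.
printf '1 alice\n2 bob\n3 carol\n' > names.txt
printf '1 admin\n2 editor\n3 viewer\n' > roles.txt

stacked=$(cat names.txt roles.txt)          # six lines, one file after the other
side_by_side=$(paste names.txt roles.txt)   # lines paired up, separated by a tab
joined=$(join names.txt roles.txt)          # lines matched on the first field

printf '%s\n' "$joined"
```

The last command prints "1 alice admin", "2 bob editor", and "3 carol viewer", one per line.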

Looking to extract data from a file instead? Take a look at cut. It can be used to pull out specific character ranges from each line (using the -c flag) as well as columns in a delimited file (using the -d flag to specify a delimiter and -f to specify column ranges). I often use it in conjunction with grep to do quick analysis of web server logs.
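For instance, a log analysis along those lines might look like the sketch below. The log lines are made up (using documentation-range IP addresses) and assume the common access log layout, where splitting on spaces puts the client address in field 1 and the request path in field 7.

```shell
# Three invented log lines in the common access log layout.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.php HTTP/1.1" 200 2326
192.0.2.7 - - [10/Oct/2023:13:55:40 +0000] "GET /missing.php HTTP/1.1" 404 512
192.0.2.1 - - [10/Oct/2023:13:56:02 +0000] "POST /login.php HTTP/1.1" 200 128
EOF

# Keep only the 404 responses, then cut out the client address and the path.
not_found=$(grep ' 404 ' "$log" | cut -d ' ' -f 1,7)

# -c works on character positions instead: the first 9 characters of each line.
prefixes=$(cut -c 1-9 "$log")

printf '%s\n' "$not_found"
```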

Maybe you're looking to split one file up into multiple files? Check out split to do it by byte or line count or csplit to do it by regular expression.
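A small sketch of both, using a hypothetical numbers.txt and output prefixes (chunk_ and part_) that I picked for the example:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 10 > numbers.txt

# Split into pieces of at most 4 lines each: chunk_aa, chunk_ab, chunk_ac.
split -l 4 numbers.txt chunk_

# Split just before the first line matching the regular expression ^7$.
# -s silences the size report, -f sets the output prefix (part_00, part_01).
csplit -s -f part_ numbers.txt '/^7$/'

wc -l chunk_* part_*
```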

OK, you probably get the point by now. There are a lot of ways to put text files together and take them apart using these utilities.

Having It Your Way

If you work in an environment with multiple developers and don't have a coding standard (shame on you!), or if you're taking on a legacy project from a previous developer, you may find that source files use tabs where you'd rather have spaces or vice versa. expand and unexpand perform these conversions and let you specify how many spaces each tab should correspond to.
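Here's a minimal sketch of a round trip using 4-column tab stops. The PHP-ish snippet is invented, and --first-only is a GNU option (which is what I'd expect on Ubuntu), so treat that part as an assumption about your system.

```shell
indented=$(mktemp)
printf '\tif ($x) {\n\t\treturn;\n\t}\n' > "$indented"

# Tabs to spaces, with a tab stop every 4 columns.
spaced=$(expand -t 4 "$indented")

# Spaces back to tabs. --first-only limits the conversion to leading
# whitespace, so spacing inside a line is left alone.
retabbed=$(printf '%s\n' "$spaced" | unexpand --first-only -t 4)

printf '%s\n' "$spaced"
```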

Maybe you need to sort a list of items? Try sort! (Yeah, I know, there's a stretch.) You can use the -u flag of that utility to eliminate duplicate lines. Or, if the duplicates already sit on adjacent lines and you want to leave the line order alone, you can use uniq, which only collapses adjacent repeats.
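The difference between the two is easiest to see on a small list. This is just a sketch with made-up contents:

```shell
list=$(mktemp)
printf 'pear\napple\napple\npear\nfig\n' > "$list"

# Sorts the lines and drops every duplicate.
sorted_unique=$(sort -u "$list")

# Keeps the original order, but only collapses duplicates that are adjacent,
# so the second "pear" survives.
adjacent_only=$(uniq "$list")

printf '%s\n' "$sorted_unique" "--" "$adjacent_only"
```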

fmt is great for taking in a few paragraphs of text with each paragraph on a single line and converting them to lines of a certain width. nl can output a file with line numbers. tr performs a series of character-to-character translations and supports character ranges and classes much like those found in regular expressions.
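A quick sketch of each, with sample text I made up. I haven't listed the exact output of fmt since where it breaks lines depends on the implementation; the point is only that no line ends up wider than the limit.

```shell
long_line='The quick brown fox jumps over the lazy dog and then keeps running through the forest until nightfall.'

# Re-wrap to lines of at most 30 characters.
wrapped=$(printf '%s\n' "$long_line" | fmt -w 30)

# Number the lines; -b a numbers every line, including blank ones.
numbered=$(printf 'first\nsecond\n' | nl -b a)

# Translate the range a-z to A-Z, character for character.
shouted=$(printf 'text fu\n' | tr 'a-z' 'A-Z')

printf '%s\n' "$wrapped" "$numbered" "$shouted"
```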

Finally, for more multipurpose modifications, there are utilities with their own accompanying languages like awk and sed. I use the latter occasionally for doing quick regular expression-based replacements. (Be careful when editing files in place!)
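Here's a rough sketch of the kind of replacement I mean, run against a hypothetical two-line config file. Previewing the result first and asking sed to keep a backup (the .bak suffix here is my choice) are two cheap ways to stay safe.

```shell
config=$(mktemp)
printf 'host=localhost\nport=8080\n' > "$config"

# Preview: the result goes to stdout and the file itself is untouched.
preview=$(sed 's/^port=[0-9]*$/port=9090/' "$config")

# Same replacement applied to the file, keeping the original as "$config.bak".
sed -i.bak 's/^port=[0-9]*$/port=9090/' "$config"

# awk splits each line into fields; with "=" as the separator, $2 is the value.
values=$(awk -F '=' '{ print $2 }' "$config")

printf '%s\n' "$values"
```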

If you know these tools well, you can spend seconds on a task that might otherwise take you minutes of writing logic in a general purpose language.

Sifting Through

You may have also used more and its slightly improved variant less to be able to scroll through large amounts of text. But maybe you need more information about the contents of a text file?

wc can return statistics about a file like character, word, and line count. uniq, mentioned earlier, can be used with its -c flag to get a count of how many times each line occurs within a file (sort the input first so that identical lines are adjacent).
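A small sketch of both, using an invented list of HTTP methods:

```shell
methods=$(mktemp)
printf 'get\npost\nget\nget\npost\n' > "$methods"

line_count=$(wc -l < "$methods")   # number of lines
word_count=$(wc -w < "$methods")   # number of words
char_count=$(wc -c < "$methods")   # number of bytes

# Sort first so identical lines are adjacent, then count each group.
counts=$(sort "$methods" | uniq -c)

printf '%s\n' "$counts"
```

The counts come out as 3 for "get" and 2 for "post", each preceded by some padding.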

Want to eyeball line-by-line differences between two files? diff can work for that, or you can use comm to get a columnar list of lines unique to each file and lines common to both. Note that comm expects both files to be sorted.
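With comm, you pick which of the three columns to hide. A minimal sketch with two made-up, already sorted files:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'apple\nbanana\ncherry\n' > left.txt
printf 'banana\ncherry\ndate\n' > right.txt

# -12 hides columns 1 and 2, leaving only the lines common to both files.
common=$(comm -12 left.txt right.txt)

# -23 hides columns 2 and 3, leaving only the lines unique to the first file.
only_left=$(comm -23 left.txt right.txt)

printf '%s\n' "$common" "--" "$only_left"
```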

One great thing about all these tools is that they can be used in conjunction with the UNIX pipeline in many useful ways.
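As one sketch of what that can look like, here's a pipeline that finds the busiest client in a web server log. The log lines are invented and assume the client address is the first space-separated field.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.php HTTP/1.1" 200 2326
192.0.2.7 - - [10/Oct/2023:13:55:40 +0000] "GET /missing.php HTTP/1.1" 404 512
192.0.2.1 - - [10/Oct/2023:13:56:02 +0000] "POST /login.php HTTP/1.1" 200 128
192.0.2.1 - - [10/Oct/2023:13:57:15 +0000] "GET /about.php HTTP/1.1" 200 940
EOF

# Extract the address, group identical ones, count each group,
# order by count (highest first), and keep the top entry.
busiest=$(cut -d ' ' -f 1 "$log" | sort | uniq -c | sort -rn | head -n 1)

printf '%s\n' "$busiest"
```

Each stage does one small job and hands plain text to the next, which is why swapping a stage out (say, field 7 instead of field 1 to rank paths) is a one-character change.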

Beyond The Command Line

For more flexibility and power in dealing with text, you generally turn to text editors. You may have only ever used very simple editors up to this point, or only IDEs. Take the time to learn an editor like vim or emacs. I was born and bred on GUIs and I'll be the first one to say that I'm significantly more productive using a plain text editor.

So peruse the available options. Compare them to find which meets your needs in terms of feature set, OS compatibility, and cost. Pick one and learn to use it to the fullest extent possible. These editors are our tools just as much as a hammer or saw is a carpenter's tool.