On Writing CDC
Submitted by Matthew Turland on Wed, 03/25/2009 - 10:24
No, I'm not talking about change data capture, the Cult of the Dead Cow, or any other equivalent meaning for the acronym "CDC" that probably springs to mind when you see it. I'm talking about a modest project of mine that I decided to take a blog post to plug: Ceres Document Checker.
Let's Start From the Beginning
I've had a pretty fair amount of involvement with php|architect in the past few years.
- I wrote two articles for the magazine and have another two slated for the June issue this year.
- I also wrote a two-part article for the C7Y web site reflecting on my experiences in creating a PHP-based IRC bot, Phergie.
- In January of this year, I began serving as a technical editor for the magazine.
- Finally, I'm working on a book on web scraping to be added to the fine line of books available from php|architect.
If you've written or done editing for php|architect before, you're probably familiar with the custom markup format they use called Ceres, which looks a bit like Markdown. Both articles and books use it, though each has slightly different formatting requirements. Some of these requirements can be tedious to check for and easy to miss. Since I've been working with documents in the format so much, I decided to write a tool to help me out.
Here are some of the things that I wanted this tool to do.
- Be executable as a shell script from the command line
- Support checking multiple Ceres files at once and use of directories/recursion and wildcards when specifying which files to check
- Detect inline code examples of more than 6 lines in length
- Detect lines in inline code examples with a width greater than 60 characters for magazine articles and 70 for books
- Perform lint checks on inline code examples to confirm that they are free of syntax errors and return line numbers relative to those of the original document
- Return a rough word count estimate with the option to exclude inline code samples
- Check any referenced URLs to confirm that they're still accessible (mainly for books)
I planned to execute this script mainly from a command line, so that reexecuting was a simple matter of hitting up and enter, done. The only feature of PHP I used that was specific to this was the $argv variable, which is a predefined variable populated with any command line arguments received by the PHP interpreter. This combined with use of a while loop and the current() and next() functions to iterate over $argv gave me all I really needed for basic argument handling.
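Here's a rough sketch of that kind of argument loop, assuming PHP 7 or later. The --book and --width options and the parse_args() name are made up for illustration; they aren't necessarily CDC's actual flags.

```php
<?php
// Hypothetical argument parser: --book, --width and the returned keys
// are illustrative assumptions, not CDC's real options.
function parse_args(array $argv): array
{
    $options = ['book' => false, 'width' => null, 'targets' => []];

    // $argv[0] is the script name, so move the pointer past it first.
    reset($argv);
    next($argv);

    // current() returns false once the pointer runs off the end.
    while (($arg = current($argv)) !== false) {
        if ($arg === '--book') {
            $options['book'] = true;
        } elseif ($arg === '--width') {
            // next() advances the pointer and hands back the option's value.
            $value = next($argv);
            if ($value === false) {
                break; // the value is missing, nothing left to read
            }
            $options['width'] = (int) $value;
        } else {
            // Anything else is treated as a file, directory, or pattern.
            $options['targets'][] = $arg;
        }
        next($argv);
    }

    return $options;
}
```

The nice part of current() and next() over a plain foreach is that an option can pull in the argument that follows it without any index bookkeeping.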
To make the script executable as a shell script on my local system, I included a shebang or sha-bang line. After that, with the script file in my PATH, I could just enter its name from any directory on my filesystem and execute it.
Check This Out
I wanted to be able to specify a file to be evaluated, a directory to be recursed and each file within it to be evaluated, or a wildcard pattern to be applied and any matching files evaluated. The logic to determine which I was dealing with went something like this.
Run the argument containing the target through is_file(). If it passes, add it to the list of files to evaluate and move onto the evaluation phase.
If it's not a file, run it through is_dir(). If it passes, initialize an instance of the SPL's RecursiveDirectoryIterator class with that directory path. If it's not a directory, assume it's a pattern and initialize the RecursiveDirectoryIterator instance with a path to the current directory instead.
Now iterate over the list of files. Each element returned by the iterator will be an instance of the SplFileInfo class. Call its getPath() method and check the return value to ensure it doesn't start with a period. This will eliminate the current and parent directories (which the DirectoryFilterDots iterator can also do) as well as directories like .svn or .git (which DirectoryFilterDots doesn't handle to my knowledge) when working with a local version control copy.
Next, call isFile() to check that the current element is not a directory or symlink. Finally, if a pattern was specified rather than a directory, use fnmatch() to evaluate the return value of the current element's getPathname() method against the specified pattern. If the element has made it this far, stick that getPathname() return value in the list of files to analyze.
Once the directory recursion ends, pass the list through sort() for better readability.
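The steps above can be sketched roughly like this, assuming PHP 7 or later. The collect_files() name is hypothetical, and the hidden-directory test is one interpretation of the period check: it looks at each path segment below the starting directory rather than at the raw getPath() value. It also leans on the SKIP_DOTS flag to drop the . and .. entries.

```php
<?php
// Hypothetical helper: turns one command line target into a sorted file list.
function collect_files(string $target): array
{
    if (is_file($target)) {
        return [$target];
    }

    // A directory is recursed as-is; anything else is treated as a
    // wildcard pattern applied beneath the current directory.
    $pattern = null;
    $root = $target;
    if (!is_dir($target)) {
        $pattern = $target;
        $root = '.';
    }

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    $files = [];
    foreach ($iterator as $entry) {
        // Path relative to the starting directory, e.g. "sub/.svn/entries".
        $relative = ltrim(substr($entry->getPathname(), strlen($root)), '/\\');

        // Skip anything inside (or named as) a dot directory or dot file.
        $hidden = false;
        foreach (preg_split('#[/\\\\]#', $relative) as $segment) {
            if ($segment !== '' && $segment[0] === '.') {
                $hidden = true;
                break;
            }
        }
        if ($hidden || !$entry->isFile()) {
            continue;
        }

        // Only patterns need the extra fnmatch() filter.
        if ($pattern !== null && !fnmatch($pattern, $entry->getPathname())) {
            continue;
        }
        $files[] = $entry->getPathname();
    }

    sort($files);
    return $files;
}
```

Without the FNM_PATHNAME flag, the * in a pattern like *.txt also matches directory separators, so the pattern picks up matching files at any depth.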
What's That Code For
At this point, files are just retrieved using file() and analyzed line by line. Word counts and line lengths and widths are just counters and string length checks, not really anything to write home about. When a line beginning a code block is encountered, subsequent lines are buffered in a string until a line indicating the end of the code block is reached. At that point, it starts to get a bit more interesting.
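As a loose illustration of the buffering and the length and width checks, here's a sketch assuming PHP 7 or later. The [code] and [/code] markers are placeholders only, since the real Ceres delimiters aren't covered here, and scan_code_blocks() is a hypothetical name. The defaults of 60 characters and 6 lines come from the magazine limits listed earlier.

```php
<?php
// Hypothetical scanner. $open and $close are placeholder markers,
// NOT the real Ceres syntax.
function scan_code_blocks(
    array $lines,
    int $maxWidth = 60,
    int $maxLines = 6,
    string $open = '[code]',
    string $close = '[/code]'
): array {
    $blocks = [];     // document line of first code line => buffered code
    $warnings = [];
    $buffer = null;   // null while outside a code block
    $start = 0;
    $count = 0;

    foreach ($lines as $index => $line) {
        $number = $index + 1;             // file() arrays are zero-based
        $text = rtrim($line, "\r\n");

        if ($buffer === null) {
            if (trim($text) === $open) {
                $buffer = '';
                $start = $number + 1;     // code begins on the next line
                $count = 0;
            }
            continue;
        }

        if (trim($text) === $close) {
            if ($count > $maxLines) {
                $warnings[] = "Line $start: code block has $count lines (limit $maxLines)";
            }
            $blocks[$start] = $buffer;
            $buffer = null;
            continue;
        }

        if (strlen($text) > $maxWidth) {
            $warnings[] = "Line $number: code line is " . strlen($text)
                . " characters wide (limit $maxWidth)";
        }
        $buffer .= $text . "\n";
        $count++;
    }

    return ['blocks' => $blocks, 'warnings' => $warnings];
}
```

Keying each buffered block by the document line it starts on keeps that number around for later, when lint errors need to be mapped back to the original file.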
Once a code block has been buffered, its contents are analyzed using preg_match_all() to pull out the PHP code blocks. This is the regular expression used for that: '/<\?php.*(?:\?>|$)/UsS'. All it does is find substrings that start with <?php and end with either ?> or the end of the string. It uses a few modifiers to do this.
- U (ungreedy): This makes matching stop at the first ?> that might be found. Without this modifier, .* would cause the pattern to match as much as possible rather than as little as possible.
- s (lowercase, dot all): This makes the . in .* include newlines when matching, as they aren't normally included.
- S (uppercase): Because this pattern is likely to be used multiple times within the script's runtime, this modifier is used to perform a little extra analysis so that matches beyond the first execute faster.
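To see the expression at work, here's a small sketch, assuming PHP 7 or later, that wraps it in a hypothetical extract_php() helper. The pattern is the one quoted above; only the function around it is made up.

```php
<?php
// Hypothetical wrapper around the pattern quoted in the post.
function extract_php(string $block): array
{
    // U: stop at the first ?> rather than the last one.
    // s: let . match newlines so multi-line code is captured.
    // S: study the pattern, since it is reused for every block.
    preg_match_all('/<\?php.*(?:\?>|$)/UsS', $block, $matches);

    // $matches[0] holds the full text of every match, in order.
    return $matches[0];
}

$snippets = extract_php("intro <?php echo 1; ?> middle <?php\necho 2;");
// $snippets[0] is "<?php echo 1; ?>"
// $snippets[1] is "<?php\necho 2;" (no closing tag, so it runs to the end)
```

Drop the U modifier and the same input produces a single match running from the first <?php to the end of the string, which is exactly what the ungreedy flag is there to prevent.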
Once the code has been isolated, it has to be put through a lint check. php_check_syntax() is an option, but its manual page notes that it's deprecated and that the preferred way of conducting a lint check is running PHP via command line and using the -l flag.
shell_exec() is useful for doing this and returning the command's output as a string. In order to pass the code as input to the PHP interpreter, two things have to happen. First, it has to be escaped using escapeshellcmd() because it may contain shell metacharacters, which would throw the shell off when it attempts to parse the command. Second, because the code isn't contained within a file, it can only get to the PHP interpreter via the pipeline. To facilitate this, echo must be used as part of the issued command in order to send the code to stdout so that it can be passed to the PHP interpreter via stdin.
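A minimal sketch of that pipeline follows, with two deliberate swaps worth naming. It quotes the code with escapeshellarg(), which wraps the whole string as a single shell argument, instead of escapeshellcmd(). It also uses printf '%s' instead of echo, because some shells' echo interprets backslash sequences in the code. The lint_php() name is hypothetical, and the sketch assumes a POSIX shell and PHP 5.4 or later for the PHP_BINARY constant.

```php
<?php
// Hypothetical helper: lint a code string by piping it to "php -l".
// Assumes a POSIX shell; escapeshellarg() and printf are substitutions
// for the escapeshellcmd() and echo combination described above.
function lint_php(string $code): string
{
    $command = 'printf %s ' . escapeshellarg($code)
        . ' | ' . escapeshellarg(PHP_BINARY) . ' -l 2>&1';

    // shell_exec() returns null when there is no output or the command
    // could not run, so normalise the result to a string.
    return (string) shell_exec($command);
}
```

With no file argument, php -l reads the code from stdin, and the 2>&1 folds any error text into the returned string so it can be inspected in one place.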
Lastly, the output of the command is checked to see if any syntax errors occurred in the lint check. If so, the line numbers have to be adjusted so that they're relative to the original document rather than to the individual code example. An easy way to do this is by using preg_replace() in combination with a /e modifier in the expression so that the replacement string is (after backreference substitution) evaluated as PHP code. (Yes, it's evil, but it works.)
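The /e modifier was deprecated in PHP 5.5 and removed in PHP 7, so on current versions the same adjustment has to go through preg_replace_callback(). Here's a sketch of that alternative, assuming PHP 7 or later. The adjust_line_numbers() name and the "on line N" wording it looks for are assumptions about the lint output rather than CDC's actual code.

```php
<?php
// Hypothetical helper: rewrite "on line N" so N is relative to the
// document. $blockStart is the document line holding the code's line 1.
function adjust_line_numbers(string $lintOutput, int $blockStart): string
{
    return preg_replace_callback(
        '/on line (\d+)/',
        function (array $m) use ($blockStart): string {
            // Line 1 of the example sits on $blockStart, hence the -1.
            return 'on line ' . ((int) $m[1] + $blockStart - 1);
        },
        $lintOutput
    );
}

echo adjust_line_numbers('Parse error: syntax error in - on line 3', 40);
// prints "Parse error: syntax error in - on line 42"
```

The closure does the same arithmetic the evaluated replacement string would have done, without handing any text to the PHP evaluator.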
Whenever a URL is encountered (using preg_match() again), fopen() can be used to open a read-only stream to it. Assuming that doesn't return false, stream_get_meta_data() can be used to analyze the response received from the stream and determine whether the document no longer exists or is otherwise inaccessible.
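One way this check might look is sketched below, assuming PHP 7.1 or later, http or https URLs, and allow_url_fopen being enabled. The function names are hypothetical. The ignore_errors context option is an addition so that fopen() still returns a stream for 4xx and 5xx responses and the status line can be read.

```php
<?php
// Hypothetical helper: pull the final HTTP status code out of the header
// lines that the http wrapper exposes under 'wrapper_data'.
function status_from_headers(array $headers): ?int
{
    $status = null;
    foreach ($headers as $header) {
        // Redirects add further status lines, so keep the last one seen.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $header, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: true when the URL answers with a status below 400.
function url_is_accessible(string $url): bool
{
    $context = stream_context_create([
        'http' => ['ignore_errors' => true, 'timeout' => 10],
    ]);

    // @ silences the warning raised for unreachable hosts.
    $stream = @fopen($url, 'r', false, $context);
    if ($stream === false) {
        return false;
    }

    $meta = stream_get_meta_data($stream);
    fclose($stream);

    $status = status_from_headers($meta['wrapper_data'] ?? []);
    return $status !== null && $status < 400;
}
```

Splitting out the header parsing keeps the network call thin and lets the status logic be exercised with canned header arrays.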
I find it quite satisfying that I was able to come up with a relatively lean but robust solution to this problem using a variety of the array (no pun intended) of features that PHP has to offer. If you're writing or editing an article or book for php|architect, I'd appreciate it if you gave CDC a spin. You can let me know about any issues you encounter or features you'd like to see added by e-mailing me at email@example.com.