Coding Standard Analysis using PHP_CodeSniffer

Some Background

A recent recurring task of mine has been to help with a code audit of an application for one of our clients. The application is based on the Kohana framework. One part of the audit has been to evaluate how well the code adheres to coding standards. For the sake of consistency, the development team stuck with the coding standard used by the framework itself. However, evaluating the code manually is tedious and time-consuming.

There's a solution to this type of problem: the PHP_CodeSniffer package from PEAR. It builds an infrastructure around tokenizers for PHP, CSS, and JavaScript, along with utilities to detect coding standard violations in code written in any of those languages. A unit of code that checks one individual aspect of a coding standard is referred to as a "sniff". Sniffs are organized into directories first by their associated coding standard (Zend, PEAR, Kohana, etc.) and then by what area of analysis they deal with (operators, whitespace, control structures, etc.).

Kohana currently doesn't offer a set of PHP_CodeSniffer standard files and the sets that come with PHP_CodeSniffer itself are limited to a small handful (though most other standards are based on one of them). I spent some time this week working on getting a Kohana standard together because I figured it would help me complete code audits more easily, the application's development team could use it themselves, and it could be contributed to the Kohana project to help all developers that build applications on the framework.

Tokenization

To start, PHP_CodeSniffer does offer a fairly good tutorial on how to write your own coding standard. However, a few things that proved very useful while I was writing the Kohana standard are either not included in it or buried deep in its content, so I thought I'd share them in a blog post. PHP_CodeSniffer is based on the tokenizer extension. If you aren't familiar, tokenization is the process of splitting code into individual units called tokens, which can include things like operators and keywords. An example might be the best way to showcase this.

$ php -r 'print_r(token_get_all("<?php echo \"Hello world!\", PHP_EOL;"));'
Array
(
    [0] => Array
        (
            [0] => 367
            [1] => <?php
            [2] => 1
        )

    [1] => Array
        (
            [0] => 316
            [1] => echo
            [2] => 1
        )

    [2] => Array
        (
            [0] => 370
            [1] =>
            [2] => 1
        )

    [3] => Array
        (
            [0] => 315
            [1] => "Hello world!"
            [2] => 1
        )

    [4] => ,
    [5] => Array
        (
            [0] => 370
            [1] =>
            [2] => 1
        )

    [6] => Array
        (
            [0] => 307
            [1] => PHP_EOL
            [2] => 1
        )

    [7] => ;
)

Each item in the array that is returned is either an array with three elements or a string. Let's look at the former case first.

The first element of such an array is the value of a tokenizer constant. For example, the array returned for the first token in this example has a value of 367 for its first element. This corresponds to the constant T_OPEN_TAG. The numeric values can differ between PHP versions, so it is safer to rely on the constant names than on the numbers. PHP_CodeSniffer resolves the names for you automatically; if you aren't using it, you can get the name corresponding to a numeric value using the token_name function.

The second array element is the actual content of the token. In this example, the content for the first token is <?php.

Finally, the third array element is the line number on which the code corresponding to the token is located. In this example, since all code is entered on one line via command line, the value of all third array elements is 1.

In cases where an element of the returned array is a string, that string is simply the token content.
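To make the two token shapes concrete, here is a minimal sketch that walks the same one-line snippet and labels each token with its constant name via token_name. The helper name describeTokens is my own invention for illustration; only token_get_all and token_name come from the tokenizer extension.

```php
<?php
// Hypothetical helper: describe each token of a snippet as "NAME: content".
function describeTokens($code)
{
    $described = array();

    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Array form: [0] constant value, [1] content, [2] line number.
            $described[] = token_name($token[0]) . ': ' . $token[1];
        } else {
            // String form: just the content, e.g. "," or ";".
            $described[] = 'string: ' . $token;
        }
    }

    return $described;
}

print_r(describeTokens('<?php echo "Hello world!", PHP_EOL;'));
```

Because the output uses names rather than numbers, it reads the same regardless of which numeric values your PHP version assigns to the constants.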

The PHP manual has a list of tokens that it supports natively. If you'd like to get a list specific to your version of PHP, you can run the command below in addition to reviewing the relevant manual section. PHP_CodeSniffer nicely adds more tokens to handle the cases where token_get_all would return a string for a token instead of an array, adding consistency to its interface.

$ php -r '$constants = get_defined_constants(true);
print_r($constants["tokenizer"]);'

Initial Setup

Directories shown in the following examples are relative to the root directory of your PEAR installation. Since PHP_CodeSniffer is hosted on PEAR's servers, installing it is as simple as running pear install PHP_CodeSniffer to get the current stable version (1.1.0, though 1.2.0RC1 is available). Once that's done, PHP/CodeSniffer should exist in your PEAR installation. Within PHP/CodeSniffer, there's a Standards directory that contains a subdirectory for each supported standard. I started by creating a Kohana directory there.

Within that directory, you have to create a Sniffs subdirectory and a class file for the standard. The format for the class filename is MyStandardCodingStandard.php where MyStandard should be the same as the name of the standard's directory. In that file, create a class with a name of the form PHP_CodeSniffer_Standards_MyStandard_MyStandardCodingStandard and have it extend PHP_CodeSniffer_Standards_CodingStandard.

At this point, you have the option of overriding methods of the base class to either include or exclude one or more sets of sniff files from other standards. Note that sniff files specific to the standard (i.e. included in the Sniffs subdirectory) are automatically included, so you don't need to explicitly specify that directory in the getIncludedSniffs method (and in fact PHP_CodeSniffer will end up throwing an exception if you do). In my case, since Kohana uses the BSD/Allman style of indentation and the Generic standard already has a sniff file for that, I included it.
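As a rough sketch of what such a class file might look like for Kohana: the included sniff path below is my assumption about which Generic sniff covers BSD/Allman bracing, so check the Generic directory of your own installation before relying on it. The guarded stand-in class at the top exists only so the snippet can be run outside a PHP_CodeSniffer installation; it is not the real base class and should be dropped in an actual standard.

```php
<?php
// Stand-in so this sketch runs without PHP_CodeSniffer installed.
// NOT the real base class; remove this block in an actual standard.
if (!class_exists('PHP_CodeSniffer_Standards_CodingStandard')) {
    class PHP_CodeSniffer_Standards_CodingStandard
    {
        public function getIncludedSniffs()
        {
            return array();
        }

        public function getExcludedSniffs()
        {
            return array();
        }
    }
}

// Would live in PHP/CodeSniffer/Standards/Kohana/KohanaCodingStandard.php.
class PHP_CodeSniffer_Standards_Kohana_KohanaCodingStandard extends PHP_CodeSniffer_Standards_CodingStandard
{
    // Sniffs borrowed from other standards. Files under Kohana/Sniffs are
    // picked up automatically and must not be listed here.
    public function getIncludedSniffs()
    {
        return array(
            // Assumed path; verify it exists in your installation.
            'Generic/Sniffs/Functions/OpeningFunctionBraceBsdAllmanSniff.php',
        );
    }
}
```

Overriding only getIncludedSniffs keeps the inherited behaviour for exclusions, which is enough when nothing needs to be filtered out.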

Creating Sniffs

Within the Sniffs subdirectory, create additional subdirectories to categorize your sniffs. I'm not entirely certain whether or not this is required, but other standards do it and it does help to make files more easily navigable, so I followed suit. For example, a categorization subdirectory Category will contain files named in the form FooBarSniff.php and each file will contain a class named in the form MyStandard_Sniffs_Category_FooBarSniff.

In some cases, it may be helpful to extend a specialized abstract class like PHP_CodeSniffer_Standards_AbstractVariableSniff or a sniff class from an existing standard like PEAR_Sniffs_NamingConventions_ValidFunctionNameSniff. In general, however, sniff classes will only need to implement the PHP_CodeSniffer_Sniff interface.

The Sniff interface has two public methods: register() and process(). register() simply returns a numerically indexed array of token constant values. When run, PHP_CodeSniffer will instantiate each sniff class and call register() on each instance to figure out which sniff classes are interested in which tokens.

It will then tokenize one or more source files and iterate over the resulting tokens of each. When it encounters a token for which a particular sniff class has registered itself, it will call process() and pass it an object representing the source file and the position of the token within that file. process() can then perform any analysis needed and, if necessary, add an error or warning to the file that will be displayed once analysis on all source files is complete.
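The register()/process() flow described above can be sketched with a small sniff. The sniff itself is hypothetical (it is not one of the sniffs from my Kohana standard) and flags && and || in favour of AND and OR purely as an illustration. The guarded stand-ins at the top are not the real PHP_CodeSniffer classes; they implement just enough of getTokens() and addError() for the snippet to run on its own, and should be removed inside a real standard.

```php
<?php
// Stand-ins so this sketch runs without PHP_CodeSniffer installed.
// NOT the real interface/class; remove this block in an actual standard.
if (!interface_exists('PHP_CodeSniffer_Sniff')) {
    interface PHP_CodeSniffer_Sniff
    {
        public function register();
        public function process(PHP_CodeSniffer_File $phpcsFile, $stackPtr);
    }

    class PHP_CodeSniffer_File
    {
        public $errors = array();
        private $tokens;

        public function __construct(array $tokens)
        {
            $this->tokens = $tokens;
        }

        public function getTokens()
        {
            return $this->tokens;
        }

        public function addError($error, $stackPtr)
        {
            $this->errors[$stackPtr] = $error;
        }
    }
}

// Hypothetical sniff: would live in Kohana/Sniffs/Operators/.
class Kohana_Sniffs_Operators_SymbolicLogicalOperatorSniff implements PHP_CodeSniffer_Sniff
{
    // Tell PHP_CodeSniffer which tokens this sniff wants to be called for.
    public function register()
    {
        return array(T_BOOLEAN_AND, T_BOOLEAN_OR);
    }

    // Called once for every registered token found in a source file.
    public function process(PHP_CodeSniffer_File $phpcsFile, $stackPtr)
    {
        $tokens   = $phpcsFile->getTokens();
        $operator = $tokens[$stackPtr]['content'];
        $expected = ($operator === '&&') ? 'AND' : 'OR';

        $phpcsFile->addError("Use $expected instead of $operator", $stackPtr);
    }
}
```

Since the sniff only registers for the two operator tokens, process() never has to check the token type itself; every call it receives is already a violation.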

Familiarity with the file class can make writing sniffs a lot easier. getTokens() is used fairly often to get a numerically indexed array containing a copy of the token stack, and getTokensAsString() a little less so to get the content of a range of tokens as a string. findNext() and findPrevious() are immensely useful for finding the tokens nearest to the current token that are or aren't of certain types. Looking at sniffs for other standards is a good way to seek out approaches for dealing with particular validation situations.

Testing Sniffs

Like anything else, sniffs should be tested on both positive and negative cases to confirm that they function as expected. Luckily, PHP_CodeSniffer includes a PHPUnit-based infrastructure to make this easy. From the root directory of your PEAR installation, look in tests/PHP_CodeSniffer/CodeSniffer/Standards. Create a directory with the same name as the one created under PHP/CodeSniffer/Standards. Within that directory, create a directory called Tests. Finally, within Tests, mirror the directory structure under PHP/CodeSniffer/Standards/MyStandard. For each sniff file you've created, create two files in the corresponding directory under Tests: FooBarUnitTest.inc and FooBarUnitTest.php.

The INC file should contain PHP code appropriate to test its associated sniff file for conditions that both do and don't cause errors or warnings to be emitted.

The PHP file should contain a class named with the form MyStandard_Tests_Category_FooBarUnitTest that extends AbstractSniffUnitTest. The class should contain two public methods, getErrorList() and getWarningList(). Both methods are intended to return associative arrays which map a line number in the INC file to the number of errors and warnings respectively that it should generate when the sniff is executed. Only lines that generate errors or warnings need be included in either array. If no errors or warnings are generated by the entire INC file, simply have the appropriate method return an empty array.
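Here is a minimal sketch of such a test class for a hypothetical sniff that reports one error per && or || operator. The class name, the INC file contents shown in the comment, and the line counts are all made up for illustration. The guarded stand-in for AbstractSniffUnitTest is not the real base class; it is only there so the snippet can be run without the PHP_CodeSniffer test suite.

```php
<?php
// Stand-in so this sketch runs without the PHP_CodeSniffer test suite.
// NOT the real base class; remove this block in an actual test file.
if (!class_exists('AbstractSniffUnitTest')) {
    abstract class AbstractSniffUnitTest
    {
    }
}

// Assumed contents of SymbolicLogicalOperatorUnitTest.inc (hypothetical):
//   line 1: <?php
//   line 2: if ($a AND $b) {}
//   line 3: if ($a && $b) {}
//   line 4: if ($a || $b && $c) {}
class Kohana_Tests_Operators_SymbolicLogicalOperatorUnitTest extends AbstractSniffUnitTest
{
    // Map of INC line number => number of errors expected on that line.
    public function getErrorList()
    {
        return array(
            3 => 1,
            4 => 2,
        );
    }

    // The hypothetical sniff never emits warnings, so nothing is expected.
    public function getWarningList()
    {
        return array();
    }
}
```

Line 2 is deliberately absent from the error list: it is the negative case, and the test fails if the sniff reports anything on a line that isn't listed.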

Finally, to actually execute the tests, run the command below from the root PEAR directory.

phpunit PHP_CodeSniffer_AllTests tests/PHP_CodeSniffer/AllTests.php

Parting Thoughts

I did run into a minor annoyance where the $_magicMethods and $_magicFunctions properties of PEAR_Sniffs_NamingConventions_ValidFunctionNameSniff, which I started to extend for one of my own sniffs, had private access modifiers and the methods using them were checking names for camel case, which the Kohana standard doesn't use. So, I had to instead extend that class's parent class and copy the property declarations into my own class. I hated duplicating this information, but couldn't avoid it.

There are two features that PHP_CodeSniffer doesn't currently appear to support that I'd like to see. The first would involve adding a regular expression-like utility in addition to the existing iterator-like utilities that it currently makes available for accessing tokens. The latter are the equivalent of basic string functions; they don't make searching for token patterns very easy. The second feature I'd like to see would add token bounds for the current statement to each token entry. This is already done for enclosing parentheses and brackets and the addition of the same for statements would make analysis of single- versus double-line statements easier.

I haven't completed the Kohana standard files yet, but hope to wrap them up fairly soon. Once I do, I plan to have the primary developer of the Kohana project review them before they are contributed. If you use Kohana, keep an eye out for them. Otherwise, I hope my experiences prove useful for someone attempting to write a PHP_CodeSniffer standard for their own project.

Very interesting!

Please make sure you do post the Kohana stuff when you have it working; I was just beginning to get round to developing the same thing myself! If you want any help beta testing it, please shout.

It's up

You can grab the Kohana PHP_CodeSniffer standard using git. The directory structure is relative to the root directory of your PEAR installation. Using it is a simple matter of entering phpcs --standard=Kohana /path/to/code.

Nice work

Thanks very much for sharing this /Matt