Code Elegance: Learning from Brian Kernighan's Regular Expression Matcher

| | 3 min read

Brian Kernighan is one of the most prominent computer scientists, known for his contribution to developing the UNIX operating system and his work in developing the C programming language. In his article "A Regular Expression Matcher," Kernighan describes how he and his colleague Rob Pike created a concise yet powerful regular expression package.

Brian Kernighan
Brian Kernighan in 2012 at Bell Labs By Ben Lowe from Wikimedia

In the article, Kernighan outlines the challenges they faced while attempting to create the regular expression package, including the fact that existing packages were too large and the need to find a package that would be both pedagogical and efficient. 

The regular expression matcher presented in the article is implemented in just 30 lines of C code. It can handle basic regular expressions and is optimized for grep-style patterns. This matcher is small and efficient, and easy to understand and modify, making it an excellent example of elegant code.

Code

    /* match: search for regexp anywhere in text */
	int match(char *regexp, char *text)
	{
	    if (regexp[0] == '^')
	        return matchhere(regexp+1, text);
	    do {    /* must look even if string is empty */
	        if (matchhere(regexp, text))
	            return 1;
	    } while (*text++ != '\0');
	    return 0;
	}
	/* matchhere: search for regexp at beginning of text */
	int matchhere(char *regexp, char *text)
	{
	   if (regexp[0] == '\0')
	       return 1;
	   if (regexp[1] == '*')
	       return matchstar(regexp[0], regexp+2, text);
	   if (regexp[0] == '$' && regexp[1] == '\0')
	       return *text == '\0';
	   if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
	       return matchhere(regexp+1, text+1);
	   return 0;
	}
	/* matchstar: search for c*regexp at beginning of text */
	int matchstar(int c, char *regexp, char *text)
	{
	   do {   /* a * matches zero or more instances */
	       if (matchhere(regexp, text))
	           return 1;
	   } while (*text != '\0' && (*text++ == c || c == '.'));
	   return 0;
	}

The significance of this effort lies in the fact that Kernighan and Pike's code not only created a streamlined approach to regular expressions but also highlighted the importance of elegant and efficient code, which is now recognized as one of the key pillars of good programming practices.

Rob Pike
Rob Pike giving a keynote at OSCON 2010 By Kevin Shockey from Wikimedia

Kernighan's article is an excellent resource for software developers and engineers who seek to improve their coding skills and those who wish to adopt best practices in programming. The code illustrated in the article talks to us by itself. It is one of the best examples of recursion.

Kernighan highlights the key design principles used in creating the matcher. 

  1. Keep the code simple and easy to understand
  2. Avoid unnecessary features or complexity
  3. Optimize for the most common use case

The first principle is to keep the code simple and easy to understand, while the second principle is to avoid unnecessary features or complexity. The third principle is to optimize for the most common use case, in this case, grep-style patterns, while also making sure that the code is flexible enough to handle other types of regular expressions.

The article provides a detailed explanation of the code, line by line, and describes the role of each function in the regular expression matching process. The three key functions used in the matcher are match(), matchhere(), and matchstar().

  • The match() function searches for a regular expression pattern in the given text. If the pattern starts with '^', it searches only at the beginning of the text; otherwise, it looks for the pattern anywhere within the text.
  • The matchhere() function checks if the regular expression pattern matches at the current position in the text. It can handle cases like matching any character with '.' and ensures that the pattern correctly matches till the end if it concludes with '$'.
  • The matchstar() function specifically handles cases where a character in the pattern is followed by a '*'. It checks for zero or more occurrences of the preceding character in the text.

Conclusion

The Practice of Programming

"A Regular Expression Matcher" is an excellent example of how focusing on good programming practices can lead to developing efficient, streamlined, and robust code. By adopting the approaches outlined in "The Practice of Programming," software developers and engineers can become better coders and create more effective and efficient code.

We highly recommend you read "The Practice of Programming" 1 and the essay "A Regular Expression Matcher" 2 by Brian Kernighan to learn more about coding practices that will help you become a better software engineer.