Things Have History

code-editors

QED, or how fifteen years of mathematics became a search command

Listen

Stephen Kleene invented regular expressions in 1951 for a mathematics paper, and they spent the next fifteen years being perfectly useless. You cannot type a regular expression. You cannot compile one or run one. You can only prove things about them — which is satisfying work if you are a logician and somewhat less satisfying if you are a programmer staring at a file full of text you need to search.

In 1965, Butler Lampson and L. Peter Deutsch at UC Berkeley built QED for the Berkeley Timesharing System, running on an SDS 940 mainframe. It was a clean, capable line editor: address lines by number or pattern, insert, delete, rearrange, print. It solved the problem of editing text without physically handling tape or cards. A year later, Ken Thompson — who had used the Berkeley version as an undergraduate — rewrote QED from scratch for MIT’s Compatible Time-Sharing System on an IBM 7094, and in the process crossed the border that Kleene had left uncrossed. He put regular expressions in the search command.

The addition was technically modest: a new syntax for the /pattern/ address. Type /[0-9]+/ and QED would find every line containing a number. Type /^import/ and it found every import statement. Under the hood, Thompson compiled each pattern on the fly into a nondeterministic finite automaton — a machine that chased every possible match in parallel — and ran it across the file in a single linear pass. He wrote the method up for the June 1968 issue of Communications of the ACM under the title “Regular expression search algorithm,” and then, in a move that tells you something about 1968, patented the technique.

Dennis Ritchie was simultaneously building another QED variant for General Electric’s timesharing system, and the two of them eventually carried the design into Unix. Thompson’s descendant, the line editor ed, shipped with the first Unix release in 1971. It kept regular expressions and trimmed away QED’s more elaborate features — including, somewhat cryptically, the ability to execute editor commands that lived inside a buffer, which amounted to a macro language that no one could quite use responsibly.

Here is the part that stings. When Brian Kernighan needed a quick pattern-search tool in 1973 and asked Thompson to factor the g/re/p command out of ed — global, regular expression, print — both ed and the new grep inherited a simple backtracking engine rather than the NFA method Thompson had published and patented five years earlier. The inventor of the faster approach shipped the slower one, and for decades the tools that introduced regular expressions to practicing programmers quietly did them wrong. Russ Cox, revisiting the history in 2007, observed that the people most responsible for spreading regular expressions had also led the field toward the inferior implementation.

Regular expressions spread anyway, the way useful ideas do. From ed to sed, sed to awk, awk to Perl, Perl to every language that followed. The notation Kleene invented for logicians now lives in the standard libraries of Python, JavaScript, Ruby, and a few dozen others.

He called the paper “Regular expression search algorithm.” The algorithm took forty years to become the standard. The notation was ready on day one.

Sources