Think about the find and replace dialogue in Microsoft Word for a minute. You enter a word or series of words and the program scans through the document to find the word, right? Well, actually, since the program can’t read, it looks for patterns of letters that match the pattern you’ve given it. Now you may know that in Word, you can use a wildcard like “*” to match any character(s) or “?” to match a single character. For instance typing w*y would return why, whinny and even webmonkey.
Such pattern matching is the tip of the iceberg of true regular expression searches. But we don’t want to explore only the tip of the iceberg. We want to be able to search very complex patterns that go beyond Word’s simplistic wildcard characters.
Just what does regular expression mean in this context? The term “expressions” is handed down from algebra, which you may remember from high school. But the “regular” part often befuddles newcomers. What is regular about these expressions? In this case, think of the term “regular” the way a laxative manufacturer thinks of regular, as in “occurring repeatedly” or “following a pattern.”
You will often see the term “regular expression” used interchangeably with the term “grep.” Grep is actually an acronym of sorts and refers to a specific program that uses regular expressions. Long ago in the land of Unix, there lived a line editor named “ed,” and the command
g/re/p
would search a file for lines containing the given regular expression (the letters stand for “global,” “regular expression” and “print”). Because terseness was prized above all in the land of Unix, the term “grep” was coined. As time went on, grep became semi-synonymous with regular expression. However, if you’re looking for a shorthand for regular expression, “regex” is more accurate (and sounds less like a debilitating disease).
I point this out only so that you can look really cool the next time you’re at one of those Unix-rehab mixers and you’re trying to impress. Or perhaps I point it out because, technically speaking, grep is only one of the many forms of regular expression searching. Regular expression engines have been written into all kinds of programs on virtually every platform, not just the one all the cool people seem to use.
OK enough semantics; on to the patterns.
Let’s start nice and simple. The good news is that, when crafting regular expressions, most characters match themselves. That is, searching for the pattern “webmonkey” will find the word webmonkey, just like any find and replace dialogue. But obviously that’s not what we’re interested in. What we want to look at are the meta characters, because that is where the power of grep lies.
Going back to our Word example, what Word calls “wildcards” the rest of the world calls meta characters. They are characters which are not interpreted literally. They have special meanings within the context of the search pattern. Meta characters are the heart of any regular expression. And here comes the big snag: Meta characters vary somewhat from program to program.
So, what programs have regular expression engines in them? Well, I don’t have space to list them all, so I’m going to pick some well-known text editors from each major platform and let you investigate the full potential of each on your own (Then there is Perl. Perl is covered in a follow up article, Use Regex in Perl).
Editors with regex support
On Unix and its siblings, there are two major editors, Emacs and Vim, both of which support roughly the same meta characters. Emacs meta characters can be varied according to what modules you load, but we’ll leave that alone for now. For Mac OS X users, Bare Bones Software’s BBEdit and TextMate are both popular and powerful. OS X also comes with egrep installed. Fire up your terminal and type “man egrep” for more info.
Windows users have plenty of options, too. Vim and Emacs have both been ported to Windows, and countless other freeware/shareware editors exist as well. jEdit is a popular cross platform editor, as is Eclipse. If I left out your favorite editor, don’t feel slighted. Just get the documentation, look up the regular expression meta characters and follow along while making the necessary substitutions.
Learning regular expressions is like learning any programming language. At first, everything looks like gibberish surrounding some familiar items. But, as you go on and keep encountering the gibberish, eventually it starts to make sense.
Let’s look at some meta characters. First off, we’ll examine the “.” which I’ll call the dot meta character. A dot is a shorthand character class that matches any character. Don’t worry if you don’t know what a character class is, we’ll get to that in a minute. So, just remember that a dot matches anything, including spaces and tab characters. The behavior of a dot meta character varies slightly from application to application. For instance, in BBEdit, a dot does not match carriage returns.
Another handy one is the “?” meta character, which means optional. This can be useful in situations where you want to match something that might have additional characters, but doesn’t necessarily have to. An example would be where the spelling of a word may differ. For instance, the pattern “labou?r” will match the American word “labor” as well as the English “labour.”
Two very important meta characters are the ^ (caret) and the $ (dollar sign), which match the start and end of a line respectively. Searching for the pattern ^banana$ will find the word banana, but only if it’s on a line by itself.
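To see these meta characters in action, here is a minimal sketch using Python's built-in re module. Python is only a stand-in here, since the article's editors each have their own dialect, and the sample strings are made up for illustration.

```python
import re

# The dot stands in for any one character (but not a line break,
# which is Python's default behavior).
dot_hits = re.findall(r"w.y", "why way w y w\ny")

# "?" makes the immediately preceding item optional.
spellings = re.findall(r"labou?r", "labor and labour")

# "^" and "$" anchor to the start and end of a line.
# re.MULTILINE tells Python to apply them to every line, not just the whole string.
text = "banana\nbanana split\nbanana"
lone_bananas = re.findall(r"^banana$", text, flags=re.MULTILINE)
```

Here dot_hits picks up "why", "way" and "w y" but skips the pair split by a line break, and lone_bananas holds only the two lines that contain nothing but the word banana.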
There are also meta characters to control how many things are matched. For instance a plus sign (+) means “match one or more of the immediately-preceding item.” An asterisk on the other hand means “match any number, including none, of the immediately preceding item.” The difference between the two can be a little confusing, so let’s put it in literal terms the way grep thinks of it.
In regex, the + sign means find one or more of this character, otherwise the match does not exist. This is useful in situations where you have a long pattern with a unique combination of characters at the beginning. If regex fails to find one or more of the selected characters, then it stops looking for the rest of the pattern. On the other hand, the same pattern written with * instead will keep looking at the rest of pattern because * is perfectly happy with no results, and will allow the search to continue.
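The difference is easiest to see side by side. A small Python sketch, with a throwaway pattern and sample strings chosen purely for illustration:

```python
import re

samples = ["ab", "aab", "b"]

# "a+b" insists on at least one "a" before the "b".
plus_matches = [s for s in samples if re.fullmatch(r"a+b", s)]

# "a*b" is happy with zero "a" characters, so a bare "b" also qualifies.
star_matches = [s for s in samples if re.fullmatch(r"a*b", s)]
```

With + the lone "b" is rejected; with * all three strings match.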
A Search Example
To illustrate the difference between * and +, let’s look at an example. Suppose you were told to scan some old e-mails and compile a list of subject lines. Mbox files generally look something like this (give or take a few dozen headers):
To:[email protected]
From:[email protected]
Subject:The subject of my novel is essentially my haircut.
CC:
BCC:
Body:
We want to match everything in the subject line, which could be of any length and contain just about any character. So we could write:
Regrettably this matches too much. Our fictional subject line is not there simply because it’s funny (by the way, that line is from a poem by Paul Killebrew). I slipped the word subject in the subject line to make our example more complicated. We can’t just search for subject and everything after it.
But we know that “subject” will be at the beginning of the line so we can use the ^ (caret) to constrain the matches to only those which occur at the beginning of a line. Granted, we could have also included the colon in our original pattern and we wouldn’t have had a problem. So, putting together what we know about meta characters we could search using this pattern:
^Subject: ?(.+)
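Broken down, the pattern ^Subject: ?(.+) works like this: the ^ anchors the match to the start of a line; Subject: matches those literal characters; the space followed by ? matches an optional space; and (.+) matches one or more of any character.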
This would find our subject line above just fine, but suppose our e-mail had no subject. That is, suppose the header looked like this:
To:[email protected]
From:pa[email protected]
Subject:
CC:
BCC:
The pattern ^Subject: ?(.+) would fail in this case because there are not one or more characters after the optional space. In fact, there are no characters after the colon, so our pattern gives up and complains “no matches found.” If we change the “+” to an “*” we’ll have the pattern we actually want:
^Subject: ?(.*)
This pattern will find all the subject lines in our document, even the blank ones.
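The same experiment can be run in Python as a rough sketch. The header text below is a simplified stand-in for a real mbox file, the example.com addresses are hypothetical, and re.MULTILINE is needed so that ^ applies to each line.

```python
import re

# Simplified, made-up headers: one message with a subject, one without.
with_subject = (
    "To:someone@example.com\n"
    "Subject:The subject of my novel is essentially my haircut.\n"
    "CC:\n"
)
without_subject = "To:someone@example.com\nSubject:\nCC:\n"

PLUS = re.compile(r"^Subject: ?(.+)", re.MULTILINE)  # needs at least one character
STAR = re.compile(r"^Subject: ?(.*)", re.MULTILINE)  # accepts an empty subject

def subjects(pattern, text):
    """Return the captured subject text for every Subject line found."""
    return pattern.findall(text)
```

Calling subjects(PLUS, without_subject) comes back empty, while subjects(STAR, without_subject) returns a single empty string: the blank subject line was found. Both patterns ignore the word "subject" in the middle of the line because of the ^ anchor.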
Using Pipe (or)
Another meta character you’ll use quite a bit is | (pipe) which means “or.” But the pipe character isn’t all that useful without parentheses to constrain the scope of our statement. Parentheses have several uses. One important one is to limit the scope of an “or” statement. For instance, we could write:
This sequence would find any instance of the word webmonkey or the capitalized version, Webmonkey. But we could also have written it as:
or even more tersely:
The results will be the same. Obviously, the last method is much easier on the fingers. Parentheses have other powers as well, which we’ll investigate shortly.
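Here is a short Python sketch of how parentheses limit the reach of the pipe. The three patterns are illustrative guesses at the kind of thing described above, not necessarily the exact ones the article used, and matches is a helper invented for this example.

```python
import re

def matches(pattern, text):
    """Return the full text of every match, ignoring what the parentheses captured."""
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "Webmonkey and webmonkey"

# Spelled out in full: either whole word.
long_form = matches(r"(webmonkey|Webmonkey)", text)

# Tighter: the "or" covers only the first letter.
short_form = matches(r"(w|W)ebmonkey", text)

# Without parentheses the "or" splits the whole pattern in two:
# a lone "w", or the word "Webmonkey".
unscoped = matches(r"w|Webmonkey", text)
```

The first two give identical results. The unscoped version finds "Webmonkey" and then just the letter "w", which is why the parentheses matter.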
So far, our searches have been fairly useless. Let’s do something useful.
The “capital I” example
If you’re like me, you type too quickly to bother with the shift keys and you often end up with dozens of non-capitalized i’s floating around in your text. True, Microsoft Word can take care of this for you on the fly. But, if you’re like me, you do your coding in a text editor, and after a while you begin writing everything in a text editor: e-mail to friends, letters, web content, you name it. But the text editor doesn’t compensate for your aversion to the shift key, so all of your i’s need to be fixed.
To get that satisfying feeling of having all one’s i’s in order, we can create a regex pattern to search through some text, find all of the places an “i” would need to be capitalized, then replace those instances with “I”. The first thing to do is to figure out what those instances are. Obviously, there is I by itself, but there is also I’ll, I’m, I’ve, and I’d.
Our pattern would look like this:
Literally: match ‘i’ followed by a space character or an apostrophe or a period or a comma or a semi-colon or a colon. Because the “dot” character is a meta character, we must escape it with a backslash so that our regex engines know that we want a period and not the “dot” meta character. This is true of any meta character that you want to search for. For example, to find a question mark you would need to enter “\?” so the regex engine knows to look for a literal question mark. Note that the escape sequence (the backslash) may vary between regex implementations.
This regex pattern also demonstrates a second use for parentheses — as placeholders. We found all the places where an i needs to be made into an I. For our replacement sequence, we can say:
Now what does that mean? Well, the “\u” is BBEdit’s meta character for “make uppercase”. The exact meta character for this functionality will vary from editor to editor, so you’ll have to look up the specific syntax in your manual. So, we’re telling regex to make all the “i”s we found into “I”s. Next, we need to do something with the rest of our matches. Luckily for us, parentheses also serve as placeholders, so our extra characters have been stored and may be recalled using the syntax:
The progression is sequential. The first set of parentheses is recalled with \1, the second \2 and so on.
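For comparison, here is roughly how the same search and replace might look in Python. This is a sketch under a few assumptions: Python has no “\u” in replacement strings, so the capital I is typed literally; the pattern is reconstructed from the description above; and a \b word boundary (covered later in this article) is added so the “i” at the end of a word like “hi” is left alone.

```python
import re

# "i" on its own, followed by a space, apostrophe, period, comma,
# semi-colon or colon. The period is escaped so it is taken literally.
# The leading \b is an addition for this sketch, not part of the description.
LOWERCASE_I = re.compile(r"\bi( |'|\.|,|;|:)")

def capitalize_i(text):
    """Replace each lone lowercase i with I, restoring the stored character via \\1."""
    return LOWERCASE_I.sub(r"I\1", text)
```

For example, capitalize_i("i think i'll go, said i.") returns "I think I'll go, said I." and the letters inside “think” and “said” are untouched.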
If you ever purchase a book on regular expressions, you will almost always get this example because it’s simple to write and understand. Part of learning regular expressions is memorization (mainly the meta characters), but moving from learning to understanding to becoming capable of writing complex patterns depends on thinking like regex. Regex doesn’t understand English or any other language. Rather, it looks at characters and finds patterns in them. This becomes very handy when dealing with documents that contain both human language and machine language, such as a markup document like HTML or XML. To really use regex patterns, you have to stop thinking in terms of words and think at a character level. For instance, it’s very common to accidentally type the same word twice.
Let’s look at a regular expression that could find one pattern of characters followed by the exact same pattern of characters. While that sentence may read awkwardly, it accurately describes what a regular expression engine would do to your text.
The “find repeated words” example
Here is our regex example that finds repeated words in a document.
The first thing we need to do is explain the concept of a character class, represented in regex terms by brackets, the [ and ] characters. The brackets are meta characters used for grouping a character class. Within a character class, we can define any set or range of characters. We could write [aeiou] and find every vowel in our document. The pattern [a-z89] finds every lowercase letter and the digits 8 and 9. Here’s a starting pattern:
([a-z]+)
As we know, the plus sign means “find one or more of the immediately preceding pattern”. In this case, the immediately preceding pattern is any lowercase character. The parentheses tell regex to store that pattern for recall. We’ve taken the first step towards our solution, but here we encounter a bit of divergence between regex implementations. Most regular expression implementations have a meta character to match word boundaries. The trouble is that the syntax varies. What we want to say with regex is “find the start of a word, then find and store all the characters that make up that word.” Assuming the word boundary meta character is “\b” we could write:
\b([a-z]+)
In this case, I’ve used BBEdit’s syntax since I happen to be using BBEdit to write this sentence at this moment. What we’ve added here is the \b, which is BBEdit’s meta character for a word boundary. You might be tempted to think that it’s simply a space, but putting a space in our command would match spaces, which regex considers characters like any other. We don’t want to actually match the space, we just want to acknowledge that it exists. In this sense, a word boundary meta character isn’t technically matching anything.
Thinking like regex, what we’ve said is, “find all word boundaries and match all lowercase characters that occur one or more times immediately following the word boundary, then store that information”. Basically, this is regex-speak for “find and remember every word”. Next, let’s use the recall function to recall the stored word and see if the word immediately after it is the same. To do that, we use the following notation:
\b([a-z]+) \1\b
Notice that we used a space to separate our words, but the word boundary meta character at the end. The reason for this is that we want to get rid of the space between the words, but we want to retain the space after the second word. So, a space in the middle, but not at the end. Our replacement pattern is simple. Just print out the first word using \1. Since we stored the first occurrence of the word, but we returned both words, replacing our returned selection eliminates the second word. Bam! No double words.
To give you a sense of a different program’s syntax, here is the same example written for egrep, a command line utility. The expression would be:
\<([a-z]+) \1\>
In egrep the word boundary meta characters are “\<” for the beginning of a word and “\>” for the end.
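As a third dialect for comparison, here is a hedged Python sketch of the repeated-word search and replace. Python uses \b for word boundaries, like BBEdit, and does not understand egrep's \< and \>. The function name is invented for this example.

```python
import re

# A word boundary, a stored run of lowercase letters, one space,
# the same run again (\1), and a closing word boundary.
DOUBLED_WORD = re.compile(r"\b([a-z]+) \1\b")

def collapse_doubles(text):
    """Replace each doubled word with a single copy of the stored word."""
    return DOUBLED_WORD.sub(r"\1", text)
```

So collapse_doubles("this is is a test of the the pattern") gives "this is a test of the pattern". The closing \b is what keeps "the theme" from being mistaken for a doubled "the".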
The “strip out HTML markup” example
Now we’ll return to the little teaser that we started out with at the beginning of today’s lesson: converting older HTML to CSS. Suppose you’re a webmaster presiding over a site full of pages that contain code like this:
<font size="20" face="Arial,Helvetica" color="#000000">some text</font>
You know why that’s bad, right? If you don’t, I’ll tell you: Web standards dictate that you remove all of the formatting from your HTML and use CSS for font declarations, layout, color and all of that fancy stuff. It’s cleaner, it’s easier, and, most of all, it’s correct.
So, being good web builders, we’d like to move all of those font definitions to a stylesheet. Using regex, we need to extract the “some text” portion and replace the font tags with <p> tags that we can style with CSS. Here’s some more regex:
This will find all our font tags, and, using the placeholder properties of parentheses, store the characters between them. Then, to drop in the <p> tag, our replacement pattern would look like this:
Next, we can add a “myclass” definition to our stylesheet and format our text however we’d like. Now a couple of caveats. This pattern finds every font tag, which may not be what we want. To narrow the search, we can make the pattern more specific. For instance, by changing the opening-tag portion of the pattern to
<font size="20" .*>
we can ensure that this pattern will only touch those font tags with a size parameter of 20, which is an attribute of our original target.
The other caveat is that this pattern will only match text without hard returns. In most implementations of regex, the dot character matches every character except a line break.
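To make the whole exercise concrete, here is a minimal Python sketch of the font-to-paragraph conversion. The patterns are assumptions reconstructed from the description above, and they use the non-greedy “.*?” form (a choice made for this sketch) so that two font tags on one line are handled separately. The function and constant names are invented for illustration.

```python
import re

# Any font tag, with the text between the tags stored for recall.
ANY_FONT = re.compile(r"<font.*?>(.*?)</font>")

# Only font tags whose first attribute is size="20".
SIZE_20_FONT = re.compile(r'<font size="20" .*?>(.*?)</font>')

# Wrap the stored text (\1) in a paragraph tag we can style with CSS.
REPLACEMENT = r'<p class="myclass">\1</p>'

def font_to_p(html, pattern=ANY_FONT):
    """Swap matching font tags for <p class="myclass"> tags."""
    return pattern.sub(REPLACEMENT, html)
```

Running font_to_p on the sample tag above yields `<p class="myclass">some text</p>`. As the second caveat warns, text containing a hard return is left alone, because the dot stops at line breaks; in Python, passing the re.DOTALL flag when compiling the pattern lifts that restriction.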
Regular expressions should look a little less like Klingon now that we’ve seen a few places where they can come in handy. Once you have learned to think in regex-speak, you’ll find a near infinite number of ways to save time and smooth out your workflow.
Tips & advice
Tip: Most regex implementations include a method for applying a search pattern across multiple files. Consult the documentation for your favorite program for the appropriate syntax and you’ve got a quick method of changing code in a whole website’s worth of files.
Tip: As long as your text editor supports them, regular expressions can work on any language — human, machine, alien, what have you. Be it poetry, prose, HTML, XML, Perl, PHP, ASP, or C, if it has patterns (and it does) regex can handle it.
If today’s lesson has whetted your appetite and you’d like more info on regex, there are a number of books worth picking up.
- Jeffrey Friedl’s Mastering Regular Expressions from O’Reilly Press is the de facto bible of regular expressions. This book, along with the BBEdit and Emacs documentation files, taught me everything I ever wanted to know, and more.
- O’Reilly also publishes a pocket guide, which is handy for looking up metacharacters at a glance.
And, of course, any good search engine will lead you to more resources.