Alright, so this is my attempt at explaining regular expressions, or regexes for short. This is a massive topic in computer science that a lot of people struggle with. There are entire courses dedicated to learning how to analyze and create regular expressions, and it is a very complex subject. This tutorial explains a few of the basic building blocks and works up to a pattern that matches a single character. Usernames, social security numbers, passwords, and e-mail addresses are planned for later parts.
So first, what is a regular expression?
Wikipedia (remember when you would say Webster's Dictionary?) defines regular expressions as:
In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
In the most basic of terms, regular expressions are a more powerful version of the find function in your favorite text editor. They search for patterns within a string. They are also excellent for data validation, which is what you will most likely use them for in PHP.
What are regular expressions used for?
Regular expressions are used for many different tasks in the computing world. For those of you studying computer science, you will often run into them in your systems software course. They can be used while creating compilers to ensure that variables are named properly. For instance, if your compiler requires that all variables begin with an alphabetic character, you would check for that with a regular expression.
Regular expressions are also used for searching large bodies of text for certain words and patterns. For instance, let's say that you wanted to search a giant flat file database of a company's employees, and you only want to pull out the social security numbers for insertion into a separate database. You would write a regular expression that runs on the text to search for a combination of 3 digits, followed by a dash, followed by 2 digits, followed by a dash, followed by 4 digits. If that doesn't make sense, stay tuned, as we'll be covering those.
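Here is a rough sketch of that search. It is written in Python purely for illustration, and the sample data and names are made up. The pattern is spelled out one digit at a time, so it uses nothing beyond digit ranges and literal dashes:

```python
import re

# Three digits, a dash, two digits, a dash, four digits, written out
# one character class at a time (no repetition syntax needed).
SSN_PATTERN = re.compile(r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]")

# Hypothetical flat-file content, invented for this example.
employees = """\
Jane Doe|123-45-6789|Accounting
John Roe|987-65-4321|Shipping
Sam Poe|no number on file|Sales
"""

def find_ssns(text):
    """Return every substring of text shaped like ddd-dd-dddd."""
    return SSN_PATTERN.findall(text)

print(find_ssns(employees))  # ['123-45-6789', '987-65-4321']
```

Note that this sketch would also pick the pattern out of the middle of a longer run of digits. Tightening that up is beyond what this part covers.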
Finally, regular expressions can be used for form validation. There are an endless number of regexes for validating forms, some more effective than others. There are regexes to check for valid zip codes, e-mail addresses, usernames, passwords, and pretty much anything else you could put into a form.
Can you show me an example of a regular expression?
Sure, here's one of the simplest regular expressions available:
^[a-zA-Z]$
Can you tell what that regex does? It returns true if the string being tested is a single alphabetic character. I know what you're thinking. "Holy crap, that's a lot of stuff for searching for a single character!" My response? Indeed it is.
This brings me to a very important point. Regular expressions are very confusing the first time you take a look at them. It took me a while to understand them when I started studying them, and even now they're rough to look at. There's nothing like staring at something like
and wondering, WTF? For the record that is a regular expression that checks for a valid e-mail address and includes some of the newer domain endings. We will cover e-mail validation in the next tutorial, as that's a beast all of its own.
Can we break down a couple of expressions?
Yeah, let's do it! We'll work our way towards understanding the one we used above:
^[a-zA-Z]$
Regular expressions have their own syntax that is used for analyzing strings. There are many different symbols, so we'll cover a few as we come across them, rather than me just throwing a whole bunch at you.
First, let's note the '^' and '$' symbols at the beginning and end of the regex. The caret '^' symbol means that the regex will search for the pattern at the beginning of a string. Searching for
^cat
will return true if the string begins with 'cat'. This would return true for 'cat', 'catastrophic', 'catch', and many others. However, it will return false for a string like 'fatcat'. This is because 'fatcat' does not start with the string 'cat'.
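To try the start anchor yourself, here is a minimal Python sketch. The function name is made up for illustration:

```python
import re

def starts_with_cat(text):
    """True if text begins with 'cat' (the ^ anchors to the start)."""
    return re.search(r"^cat", text) is not None

for word in ["cat", "catastrophic", "catch", "fatcat"]:
    print(word, starts_with_cat(word))
# cat True
# catastrophic True
# catch True
# fatcat False
```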
Next, look at the dollar '$' symbol. That means that the regex will search for the pattern at the end of the string. Searching for
cat$
will return true only if 'cat' is at the end of the string. This will return false for 'catastrophic' and 'catch', but will return true for 'fatcat'. It will also return true for 'cat' itself. This is very important to keep in mind.
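The end anchor can be checked the same way. Again, this is a small Python sketch with a made-up function name:

```python
import re

def ends_with_cat(text):
    """True if 'cat' sits at the end of text (the $ anchors to the end)."""
    return re.search(r"cat$", text) is not None

for word in ["catastrophic", "catch", "fatcat", "cat"]:
    print(word, ends_with_cat(word))
# catastrophic False
# catch False
# fatcat True
# cat True
```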
Placing your text between a '^' and a '$' will search for that string alone. Searching for
^cat$
will return true if the string is exactly 'cat', but will return false for 'catch' and 'fatcat'.
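Putting both anchors together might look like this in Python (a sketch only, with a hypothetical function name):

```python
import re

def is_exactly_cat(text):
    """Test text against ^cat$, i.e. 'cat' with nothing before or after."""
    return re.search(r"^cat$", text) is not None

for word in ["cat", "catch", "fatcat"]:
    print(word, is_exactly_cat(word))
# cat True
# catch False
# fatcat False
```

One Python-specific wrinkle: '$' also matches just before a trailing newline, so re.fullmatch is the stricter choice if that matters to you.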
The next symbols we will look at are the square brackets. A pair of square brackets '[ ]' matches exactly one character, which can be any of the characters listed between the brackets. Searching for
b[ea]t
will return true for 'bet' or 'bat'. It will not match 'beat', because it is searching for only one instance of either letter. We will get to repetitions in a later tutorial.
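A quick way to convince yourself of the 'beat' case is to run it. This is a Python sketch, and the function name is invented for the example:

```python
import re

def matches_bet_or_bat(text):
    """True if text contains 'b', then exactly one of 'e' or 'a', then 't'."""
    return re.search(r"b[ea]t", text) is not None

for word in ["bet", "bat", "beat", "bit"]:
    print(word, matches_bet_or_bat(word))
# bet True
# bat True
# beat False
# bit False
```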
Finally, inside square brackets, the hyphen '-' character defines a range of characters. For instance, searching for
[0-9]
will search for one instance of a digit between 0 and 9. You can combine ranges and single characters in brackets as well. Searching for
^[0-9A]$
will return true if the string is a single character that is a digit between 0 and 9, or is the capital letter 'A'.
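Both bracket forms can be sketched in Python like so. The function names are hypothetical and only there for illustration:

```python
import re

def contains_digit(text):
    """True if any single character in text falls in the range 0-9."""
    return re.search(r"[0-9]", text) is not None

def is_digit_or_capital_a(text):
    """Test text against ^[0-9A]$: one character, a digit or capital 'A'."""
    return re.search(r"^[0-9A]$", text) is not None

print(contains_digit("abc5"))       # True
print(contains_digit("abc"))        # False
print(is_digit_or_capital_a("7"))   # True
print(is_digit_or_capital_a("A"))   # True
print(is_digit_or_capital_a("a"))   # False (lowercase is not in the set)
print(is_digit_or_capital_a("42"))  # False (two characters)
```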
Take a moment to look back at everything that I've just covered. There's a lot more to the information above than you may first realize, so if there's something you don't understand, look over it again. It's very easy to get frustrated with regexes, but stick with it. It's totally worth it in the end. Plus, if you can tell your employers that you can successfully write regexes, it'll really help your chances at a job. Plus the ladies love the regexes. Ha ha. Surely I kid.
Knowing what you know now, could you analyze the regex we started with, ^[a-zA-Z]$? First, it will search for the pattern as an entire string. This is because of the ^ and $ symbols. Second, we know that we are searching for a single character, because the only thing between the ^ and $ symbols is one set of square brackets. Finally, we are searching for a range of alphabetic characters from lowercase a to z or capital A to Z.
Basically, the regular expression will return true if the string that was tested is a single lowercase or capital letter.
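If you want to check that reading against real input, here is a minimal Python sketch. The function name is made up, and Python is used only for illustration. The same pattern should carry over to PHP's preg_match once it is wrapped in delimiters:

```python
import re

def is_single_letter(text):
    """Test text against ^[a-zA-Z]$: a single lowercase or capital letter."""
    return re.search(r"^[a-zA-Z]$", text) is not None

print(is_single_letter("q"))   # True
print(is_single_letter("Q"))   # True
print(is_single_letter("7"))   # False (a digit is outside both ranges)
print(is_single_letter("ab"))  # False (two characters)
print(is_single_letter(""))    # False (no character at all)
```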
Wow, that was an unbelievably long tutorial, and I've covered so little. Depending on reader response, I will write the second part. I plan on the second part covering some more complicated patterns, such as usernames, social security numbers, and passwords. We'll also touch on e-mail validation, and understand some of the complications of designing regexes.
I hope this tutorial has been helpful for someone. If you are interested in more information, and don't want to wait on my next tutorial, I can highly recommend the sites below. Take care and happy coding!
This post has been edited by akozlik: 03 June 2008 - 02:01 AM