If you haven't looked at Part 1 of this tutorial series, do so now. That knowledge will be needed to follow along with this tutorial. If you have already read it, let's jump in!
How will this tutorial be laid out?
The best way to learn regular expressions is to take a look at examples, and to break them down. For the rest of this tutorial series, that is what we are going to do. I will give you a regular expression, explain what it does, introduce any new symbols that you may need to know, and then we'll analyze the expression. Hopefully by doing this enough times you will learn to analyze expressions on your own. This tutorial is going to cover validating social security number formats. Later tutorials will cover restrictions on usernames and passwords and e-mail validations, and introduce developing your own regular expressions.
Can we just get to the meat?
Most definitely! As always, let's take a look at the first regular expression we will be breaking down.
CODE
^\d{3}-\d{2}-\d{4}$
Can you tell what that regular expression does? In case you haven't been following along, that is one regular expression that is used to verify that a social security number has been entered into a form using the correct format.
It's very important that you note the precise phrasing of what that regex does. I'll repeat:
QUOTE
It is used to verify that a social security number has been entered into a form using the correct format
Notice that I did not say that that expression checks for a valid social security number. I will cover why this is important after we break down the expression.
Well, let's begin breaking down our regex. First, we notice that everything is contained between ^ and $ characters. Do you remember what that does? The ^ anchors the pattern to the start of the string and the $ anchors it to the end, so the pattern has to account for the entire string. Anytime you want to verify that an entire string meets the given pattern, you must enclose it within ^ and $ characters.
The next symbol that we need to look at is the '\d' symbol. That is an escape symbol much like '\t' or '\n' in many of the programming languages you may have used. Any idea what that does? It is the symbol for any numerical digit. Essentially, '\d' is shorthand for typing [0-9]. Either way will have the same end result. It's typically best practice, though, to use the shorthand \d for any digit, and [x-x] for narrower ranges of numbers.
Now that we know what '\d' does, we can look at the next symbol, '{3}'. Any idea what this does? This will check for 3 instances of whatever immediately precedes it. This is about the point that some of the concepts get to be a bit abstract, so we'll take this part slow.
Basically, what we're looking for with '\d{3}' is 3 instances of a numerical digit. Anchored as ^\d{3}$, that would return true against '353', '543' and '000'. It would return false against '4', '46' and '5423'. Seems quite obvious now, doesn't it?
Pop quiz: What does the following regex do?
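If you'd like to try that yourself, here is a minimal sketch in Python's re module (my choice of language for illustration; this tutorial isn't tied to one). The helper name is_three_digits is made up for this example. It also shows why the anchors matter: without them, '5423' would pass because it contains three digits in a row.

```python
import re

# Anchored, so the whole string has to be exactly three digits.
THREE_DIGITS = re.compile(r"^\d{3}$")

def is_three_digits(text):
    """Return True when text is exactly three digits (hypothetical helper)."""
    return THREE_DIGITS.search(text) is not None

for sample in ["353", "543", "000", "4", "46", "5423"]:
    print(sample, is_three_digits(sample))

# Without ^ and $, three digits anywhere in the string are enough.
unanchored_hit = re.search(r"\d{3}", "5423") is not None
print("unanchored match on 5423:", unanchored_hit)
```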
CODE
[A-Z]{3}$
Guesses? It checks any given string and sees if the last three characters are capitalized letters of the alphabet. It would return true on 'CAT', 'interNET', 'hELP'. It would also return true on 'DREAMINCODE', because the last three characters of that string are capital letters. The expression would flag false for 'DS' (not enough characters) and 'DoG' (not all capital letters).
Notice that the expression flagged false for 'DS'. This shows the importance of making your regular expressions as specific as possible. As you get to building more complex expressions, you will find that you need to be as specific as possible, or else your regex will be wrong. Another thing to note is that there is often more than one correct way to write a regular expression. Our original expression:
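The quiz answer can be checked with a short Python sketch. The function name ends_with_three_caps is just something I picked for the example. Note that it uses re.search, since Python's re.match would only look at the start of the string.

```python
import re

# No ^ here: only the end of the string is pinned down by $.
LAST_THREE_CAPS = re.compile(r"[A-Z]{3}$")

def ends_with_three_caps(text):
    """Return True when the last three characters are capital letters."""
    return LAST_THREE_CAPS.search(text) is not None

for sample in ["CAT", "interNET", "hELP", "DREAMINCODE", "DS", "DoG"]:
    print(sample, ends_with_three_caps(sample))
```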
CODE
^\d{3}-\d{2}-\d{4}$
Could also be written as
CODE
^[0-9]{3}-[0-9]{2}-[0-9]{4}$
Both of those expressions check for the exact same thing. Developing efficient and accurate regular expressions is an art form in itself, and we'll cover that next tutorial.
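To convince yourself that the two spellings agree, you could run both against the same inputs. This is only a sketch in Python with sample strings I made up. One assumption to flag: in Python, \d can also match digits from other scripts unless the re.ASCII flag is given, so I pass that flag to keep \d equal to [0-9].

```python
import re

# re.ASCII restricts \d to the characters 0-9 in Python.
SHORTHAND = re.compile(r"^\d{3}-\d{2}-\d{4}$", re.ASCII)
EXPLICIT = re.compile(r"^[0-9]{3}-[0-9]{2}-[0-9]{4}$")

# Made-up sample inputs, some well formed and some not.
samples = ["123-45-6789", "000-14-5832", "123456789", "12-345-6789", "abc-de-fghi"]

results = []
for sample in samples:
    short_ok = SHORTHAND.search(sample) is not None
    explicit_ok = EXPLICIT.search(sample) is not None
    results.append((sample, short_ok, explicit_ok))
    print(sample, short_ok, explicit_ok)
```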
Now, our original regular expression was designed to ensure that a social security number was inputted in the proper format. The proper format of a SSN for this form is three digits, followed by a '-', followed by two digits, followed by another '-', followed by four more digits. Notice that our next character to investigate is a dash.
CODE
^\d{3}-
Here's another spot where regexes get to be confusing. Do you remember what else a dash signifies? If you said a range, you'd be correct! However, a dash only signifies a range if it is enclosed between two square brackets '[]'. This dash is not within any brackets, so it is read as a regular dash character. The regular expression will check that the character at that position in the string is a dash. So far, we have analyzed the expression to look for a string whose first three characters are digits, followed by a dash. Entries that get past this first part of the pattern (though only the first of them fits the complete expression) would be:
000-14-5832, 999-349-683, 789-234-2343
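Here is a small Python sketch comparing the piece analyzed so far with the complete expression, using the three entries above. All three start with three digits and a dash, but only the first has the full xxx-xx-xxxx layout.

```python
import re

PREFIX = re.compile(r"^\d{3}-")            # the part analyzed so far
FULL = re.compile(r"^\d{3}-\d{2}-\d{4}$")  # the complete expression

entries = ["000-14-5832", "999-349-683", "789-234-2343"]

prefix_ok = [PREFIX.search(entry) is not None for entry in entries]
full_ok = [FULL.search(entry) is not None for entry in entries]

for entry, p, f in zip(entries, prefix_ok, full_ok):
    print(entry, "prefix:", p, "full:", f)
```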
However, what if we wanted to adjust this form to allow for either spaces or dashes? Do you know how we would write that regular expression?
CODE
^\d{3}[ \-]\d{2}[ \-]\d{4}$
Is that what you expected? You may not have known to include the backslash before the dash in the square brackets. Remember how we said that a dash would indicate a range? Inside square brackets, a dash between two characters is read as a range, so a stray dash can change what the brackets match or cause an error. Many engines do treat a dash placed at the very start or end of the brackets, as in [ -], as a literal dash, but escaping it removes any doubt. Much like when writing in different programming languages, the backslash '\' is used to escape the character after it. Because - has a special meaning in square brackets, we escape it to search for that particular character. Hopefully that makes sense, as I know it's a difficult concept to explain, but an easy one to understand.
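A quick Python sketch of the space-or-dash version follows. The function name has_ssn_layout is hypothetical. One thing worth noticing: because each separator is checked on its own, a mix such as '123-45 6789' is accepted too.

```python
import re

# Each [ \-] accepts either one space or one dash.
SPACE_OR_DASH = re.compile(r"^\d{3}[ \-]\d{2}[ \-]\d{4}$")

def has_ssn_layout(text):
    """Return True for xxx-xx-xxxx with spaces or dashes as separators."""
    return SPACE_OR_DASH.search(text) is not None

for sample in ["123-45-6789", "123 45 6789", "123-45 6789", "123.45.6789", "123456789"]:
    print(repr(sample), has_ssn_layout(sample))
```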
Back to the original regex
CODE
^\d{3}-\d{2}-\d{4}$
The rest of the expression is pretty much the same. It searches for two more digits, followed by another dash, followed by another four digits. If the string is in the form 'xxx-xx-xxxx', where 'x' is actually a digit, the expression will flag true. Anything else will flag false.
Now, I'm going to come back to what I had discussed in the beginning of this tutorial, and it's a very important point to make. This regex does not check for a valid social security number. It checks for a valid formatting pattern of a social security number. Actual US social security numbers have various restrictions on them. The first group cannot be 000, the second group cannot be 00, and the last group cannot be 0000. The first three numbers also cannot be above a certain number range. One regular expression that makes a better attempt at checking for a valid number is
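As a sketch of how the finished pattern might be used, here is a Python version. The helper name has_ssn_format is my own invention. One caveat that applies to Python (and that I'm flagging as an aside, not something from this tutorial): $ also matches just before a newline at the very end of the string. Python's fullmatch, used without ^ and $, does not allow that.

```python
import re

ANCHORED = re.compile(r"^\d{3}-\d{2}-\d{4}$")
BARE = re.compile(r"\d{3}-\d{2}-\d{4}")

def has_ssn_format(text):
    """Return True when text is laid out as xxx-xx-xxxx (x being a digit)."""
    # fullmatch requires the pattern to cover the whole string,
    # so it plays the role of ^ and $ here.
    return BARE.fullmatch(text) is not None

for sample in ["123-45-6789", "1234-56-789", "123-45-678", "12a-45-6789"]:
    print(sample, has_ssn_format(sample))

# In Python, $ tolerates one trailing newline; fullmatch does not.
newline_with_anchors = ANCHORED.search("123-45-6789\n") is not None
newline_with_fullmatch = has_ssn_format("123-45-6789\n")
print(newline_with_anchors, newline_with_fullmatch)
```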
CODE
^(?=((0[1-9]0)|([1-7][1-7]\d)|(00[1-9])|(0[1-9][1-9]))-(?=(([1-9]0)|(0[1-9])|([1-9][1-9]))-(?=((\d{3}[1-9])$|([1-9]\d{3})$|(\d[1-9]\d{2})$|(\d{2}[1-9]\d)$))))
Don't worry if that looks all crazy to you. It's because it is. There are also quite a few symbols in there that we have not yet covered, including '?', '=', '(', and '|'. We'll cover these eventually as well.
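If that monster is more than you want to maintain, one alternative (my own suggestion, not a replacement for the expression above) is to let the simple regex check the format and do the extra checks in ordinary code. The sketch below assumes only the all-zero restrictions: no 000 in the first group, no 00 in the second, no 0000 in the last. It deliberately leaves out the upper limits on the first group. The function name is hypothetical.

```python
import re

SSN_FORMAT = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def passes_basic_ssn_rules(text):
    """Format check plus the all-zero restrictions only (a partial check)."""
    if SSN_FORMAT.search(text) is None:
        return False
    # The format is known to be xxx-xx-xxxx, so fixed slices are safe.
    first, second, last = text[0:3], text[4:6], text[7:11]
    return first != "000" and second != "00" and last != "0000"

for sample in ["123-45-6789", "000-45-6789", "123-00-6789", "123-45-0000", "12-345-6789"]:
    print(sample, passes_basic_ssn_rules(sample))
```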
One more thing that I want to cover is the use of the curly braces {}. We used them to ensure that there were 3 instances of a given pattern. The following regex:
CODE
^[a-zA-Z]{3}$
Flags true for 'hat', 'box', 'and'. Flags false for 'dream', 'in' and 'h4x'.
That checks to ensure that a string is a three letter word with only alphabetic characters in it. But what if we wanted to see if a word was an alphabetic word that was 3-5 characters long? Well we can specify a character length range with the {} symbols. Our modified code would read
CODE
^[a-zA-Z]{3,5}$
QUOTE
By separating the minimum and maximum counts with a comma, the regex will search for a string that is 3 to 5 characters in length.
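Here is a short Python sketch of the {3,5} version, with a made-up helper name and sample words of my own. Words of 3, 4 or 5 letters pass, while anything shorter, longer, or containing a non-letter fails.

```python
import re

# Between 3 and 5 letters, and nothing else in the string.
THREE_TO_FIVE_LETTERS = re.compile(r"^[a-zA-Z]{3,5}$")

def is_short_word(text):
    """Return True for alphabetic strings of length 3, 4 or 5."""
    return THREE_TO_FIVE_LETTERS.search(text) is not None

for sample in ["hat", "code", "dream", "in", "dreams", "h4x"]:
    print(sample, is_short_word(sample))
```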
Finally, I wanted to cover something that I forgot to mention in the last tutorial. We specified what happens when you use the ^ and $ characters, but we did not discuss what would happen if we did not use either character.
CODE
cat
What would that return true for? That regex would return true for any string that contained 'cat', regardless of its position.
Returns true for 'cat', 'caterpillar', 'subcat', 'blackcat3'. Hopefully that clears up any issues that people may be having.
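To see the difference the anchors make, here is a Python sketch that runs the same four strings through the bare pattern and its anchored variants. It uses re.search throughout, because Python's re.match would quietly pin the pattern to the start of the string.

```python
import re

# The same literal with and without anchors.
patterns = {
    "cat": re.compile(r"cat"),      # anywhere in the string
    "^cat": re.compile(r"^cat"),    # must start the string
    "cat$": re.compile(r"cat$"),    # must end the string
    "^cat$": re.compile(r"^cat$"),  # must be the whole string
}

samples = ["cat", "caterpillar", "subcat", "blackcat3"]

# table[pattern_text][sample] is True when the pattern finds a match.
table = {
    name: {sample: regex.search(sample) is not None for sample in samples}
    for name, regex in patterns.items()
}

for name, row in table.items():
    print(name, row)
```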
Well, I feel like we covered a lot in this tutorial. We covered the {} symbols, \d symbol, searching for specific characters, escaping characters, and the importance of detail when using regexes. In addition, if you've read my previous tutorial, you know how to use the ^, $, [], and - symbols. Using those you can begin generating your own very basic regexes.
This tutorial series is going to wind up being a few parts longer than I had originally intended, but I want to avoid throwing too much information at you at one time. The next tutorial will cover more basic validations. Specifically, we will impose restrictions on the generation of usernames and passwords. The fourth tutorial will be on validating e-mail addresses and go into more detail on the importance of being specific with your regexes. Finally, the fifth tutorial will discuss how to incorporate regular expressions using standard PHP functions. Thanks for being patient while I write these out, and hopefully they're not too long for everybody. As always, questions, comments, criticisms, and thanks are welcome. :-D
