Page 1 of 1

Regular Expressions, Part 2
Seriously, it's not too scary. Come in and check it out!

#1 akozlik

Posted 04 June 2008 - 01:54 PM

Welcome to part two of my regular expressions tutorial! If you're reading this, then I haven't scared you away yet. Hopefully you've been learning a thing or two from my tutorials, and have perhaps begun looking at other regexes to analyze and understand. This second part will go into the importance of detail, escaping special characters, checking for string lengths, and a few other things along the way. I originally planned on this series being three parts, but after doing some writing and discussing, it's clear that we're only beginning to scratch the surface. Because of this, it looks like the tutorial is going to be expanded to five sections. I need your help while developing it, so if anything is unclear or you have suggestions or comments, it's very important to voice them. I read all comments and take everything to heart, so your voice will be heard!

If you haven't looked at Part 1 of this tutorial series, do so now. That knowledge will be needed to follow along with this tutorial. If you have already read it, let's jump in!

How will this tutorial be laid out?

The best way to learn regular expressions is to take a look at examples, and to break them down. For the rest of this tutorial series, that is what we are going to do. I will give you a regular expression, explain what it does, introduce any new symbols that you may need to know, and then we'll analyze the expression. Hopefully by doing this enough times you will learn to analyze expressions on your own. This tutorial is going to cover validating social security number formats. Later tutorials will cover creating restrictions on usernames and passwords and validating e-mail addresses, and will introduce developing your own regular expressions.

Can we just get to the meat?

Most definitely! As always, let's take a look at the first regular expression we will be breaking down.

^\d{3}-\d{2}-\d{4}$



Can you tell what that regular expression does? In case you haven't been following along, it is a regular expression used to verify that a social security number has been entered into a form using the correct format.

It's very important that you note the precise phrasing of what that regex does. I'll repeat:

Quote

It is used to verify that a social security number has been entered into a form using the correct format


Notice that I did not say that that expression checks for a valid social security number. I will cover why this is important after we break down the expression.

Well, let's begin breaking down our regex. First, we notice that everything is contained between ^ and $ characters. Do you remember what that does? The ^ anchors the pattern to the start of the string and the $ anchors it to the end, so the entire string has to fit the pattern. Anytime you want to verify that an entire string meets the given pattern, you must enclose it within ^ and $ characters.

The next symbol that we need to look at is the '\d' symbol. That is an escape symbol much like '\t' or '\n' in many of the programming languages you may have used. Any idea what that does? It is the symbol for any numerical digit. Essentially, '\d' is shorthand for typing [0-9]. Either way will have the same end result. It's typically best practice, though, to use the shorthand \d for any digit, and [x-x] for narrower ranges of numbers.

Now that we know what '\d' does, we can look at the next symbol, '{3}'. Any idea what this does? This will check for exactly 3 consecutive instances of whatever immediately precedes it. This is about the point that some of the concepts get to be a bit abstract, so we'll take this part slow.

Basically, what we're looking for with '\d{3}' is 3 instances of a numerical digit. Enclosed between ^ and $, as in ^\d{3}$, that would return true against '353', '543' and '000'. It would return false against '4', '46' and '5423'. Seems quite obvious now, doesn't it?

Pop quiz: What does the following regex do?
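If you'd like to try the three-digit check yourself, here is a minimal sketch. It assumes Python and its standard `re` module, which this tutorial does not otherwise use, and the helper name `is_three_digits` is made up for illustration. The pattern is wrapped in ^ and $ so that the whole string has to be three digits.

```python
import re

# Assumption: Python's standard re module stands in for whatever engine you use.
# The pattern is anchored so the entire string must be exactly three digits.
THREE_DIGITS = re.compile(r"^\d{3}$")

def is_three_digits(text):
    """Hypothetical helper: True when text is exactly three digits."""
    return THREE_DIGITS.search(text) is not None

# The sample strings are the ones used in the tutorial text.
for sample in ["353", "543", "000", "4", "46", "5423"]:
    print(sample, is_three_digits(sample))
```

Without the anchors, '5423' would pass as well, because three digits in a row can be found inside it.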

[A-Z]{3}$



Guesses? It checks any given string and sees if the last three characters are capitalized letters of the alphabet. It would return true on 'CAT', 'interNET', 'hELP'. It would also return true on 'DREAMINCODE', because the last three characters of that string are capital letters. The expression would flag false for 'DS' (not enough characters) and 'DoG' (not all capital letters).

Notice that the expression flagged false for 'DS'. This shows the importance of making your regular expressions as specific as possible. As you get to building more complex expressions, you will find that an expression that is not specific enough will be wrong. Another thing to note is that there is often more than one way to write a regular expression, each of which is correct. Our original expression:
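The pop quiz answer can be checked with a short sketch. Python's `re` module is an assumption here (any engine with the same syntax should behave alike), and `ends_with_three_caps` is a hypothetical name.

```python
import re

# No ^ at the front, so only the last three characters are constrained.
ENDS_WITH_THREE_CAPS = re.compile(r"[A-Z]{3}$")

def ends_with_three_caps(text):
    """Hypothetical helper: True when the final three characters are A-Z."""
    return ENDS_WITH_THREE_CAPS.search(text) is not None

# Sample strings taken from the pop quiz discussion.
for sample in ["CAT", "interNET", "hELP", "DREAMINCODE", "DS", "DoG"]:
    print(sample, ends_with_three_caps(sample))
```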

  
^\d{3}-\d{2}-\d{4}$



Could also be written as

^[0-9]{3}-[0-9]{2}-[0-9]{4}$



Both of those expressions check for the exact same thing. Developing efficient and accurate regular expressions is an art form in itself, and we'll cover that next tutorial.
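As a rough way to convince yourself that the two spellings agree, the sketch below runs both against the same inputs. It assumes Python's `re` module, and the names are illustrative. One Python-specific caveat: for ordinary text strings Python's \d also accepts digits from other scripts, so the sketch passes the `re.ASCII` flag to make \d mean exactly [0-9].

```python
import re

# re.ASCII restricts \d to 0-9 in Python, matching the explicit range.
SHORTHAND = re.compile(r"^\d{3}-\d{2}-\d{4}$", re.ASCII)
EXPLICIT = re.compile(r"^[0-9]{3}-[0-9]{2}-[0-9]{4}$")

def same_verdict(text):
    """Hypothetical helper: True when both patterns agree about text."""
    return bool(SHORTHAND.search(text)) == bool(EXPLICIT.search(text))

samples = ["000-14-5832", "999-349-683", "789-234-2343", "abc-de-fghi", ""]
for sample in samples:
    print(repr(sample), bool(EXPLICIT.search(sample)), same_verdict(sample))
```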

Now, our original regular expression was designed to ensure that a social security number was inputted in the proper format. The proper format of a SSN for this form is three digits, followed by a '-', followed by two digits, followed by another '-', followed by four more digits. Notice that our next character to investigate is a dash.

^\d{3}-



Here's another spot where regexes get to be confusing. Do you remember what else a dash signifies? If you said a range, you'd be correct! However, a dash only signifies a range if it is enclosed between two square brackets '[]'. This dash is not within any brackets, so it is read as a regular dash character. The regular expression will check that the character at that position in the string is a dash. So far, we have analyzed the expression to look for a string whose first three characters are digits, followed by a dash. Strings that pass this much of the expression include:

000-14-5832, 999-349-683, 789-234-2343

However, what if we wanted to adjust this form to allow for either spaces or dashes? Do you know how we would write that regular expression?

^\d{3}[ \-]\d{2}[ \-]\d{4}$



Is that what you expected? You may not have known to include the backslash before the dash in the square brackets. Remember how we said that a dash would indicate a range? Much like when writing in different programming languages, the backslash '\' is used to escape the character after it. Because - has a special meaning in square brackets, escaping it guarantees that it is read as that particular character. Strictly speaking, most engines also read a dash placed first or last inside the brackets, as in [ -], as a literal dash, because there is nothing on one side of it to form a range; escaping it simply makes the intent explicit and keeps the expression correct if more characters are added to the brackets later. Hopefully that makes sense, as I know it's a difficult concept to explain, but an easy one to understand.
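Here is a small sketch of the space-or-dash version in action, assuming Python's `re` module; `has_ssn_layout` is a made-up name. It also shows a side effect worth knowing about: because each separator is checked on its own, a mix of one dash and one space is accepted too.

```python
import re

# Each separator position accepts either a space or an escaped dash.
SSN_SPACE_OR_DASH = re.compile(r"^\d{3}[ \-]\d{2}[ \-]\d{4}$")

def has_ssn_layout(text):
    """Hypothetical helper: True for ddd?dd?dddd with space or dash separators."""
    return SSN_SPACE_OR_DASH.search(text) is not None

# Sample inputs are invented for illustration.
for sample in ["123-45-6789", "123 45 6789", "123-45 6789",
               "123.45.6789", "123456789"]:
    print(repr(sample), has_ssn_layout(sample))
```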

Back to the original regex

^\d{3}-\d{2}-\d{4}$



The rest of the expression is pretty much the same. It searches for two more digits, followed by another dash, followed by another four digits. If the string is in the form 'xxx-xx-xxxx', where 'x' is actually a digit, the expression will flag true. Anything else will flag false.

Now, I'm going to come back to what I discussed at the beginning of this tutorial, and it's a very important point to make. This regex does not check for a valid social security number. It checks for a valid formatting pattern of a social security number. Actual US social security numbers have various restrictions on them. The first group cannot be 000, and the second group cannot be 00. The first three numbers also cannot be above a certain number range. One regular expression that makes a better check for a valid number is
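To see the difference between passing the first piece of the expression and passing all of it, the sketch below runs both against the three sample entries listed earlier. It assumes Python's `re` module, and the constant names are illustrative.

```python
import re

# The first piece analyzed above, and the complete expression.
PREFIX_ONLY = re.compile(r"^\d{3}-")
FULL_FORMAT = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def check(text):
    """Hypothetical helper: (passes prefix, passes full format)."""
    return (PREFIX_ONLY.search(text) is not None,
            FULL_FORMAT.search(text) is not None)

for sample in ["000-14-5832", "999-349-683", "789-234-2343"]:
    print(sample, check(sample))
```

All three strings pass the prefix, but only the first has the full 'xxx-xx-xxxx' layout. One Python detail to be aware of: `$` also matches just before a newline at the very end of the string, so `re.fullmatch` is the stricter choice when input may carry a trailing newline.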

^(?=((0[1-9]0)|([1-7][1-7]\d)|(00[1-9])|(0[1-9][1-9]))-(?=(([1-9]0)|(0[1-9])|([1-9][1-9]))-(?=((\d{3}[1-9])$|([1-9]\d{3})$|(\d[1-9]\d{2})$|(\d{2}[1-9]\d)$))))



Don't worry if that looks all crazy to you. It's because it is. There are also quite a few symbols in there that we have not yet covered, including '?', '=', '(', and '|'. We'll cover these eventually as well.
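An alternative to packing every rule into one expression is to let the simple regex check the layout and then apply the extra rules in ordinary code. The sketch below is only an illustration under stated assumptions: it uses Python's `re` module, the function name is hypothetical, it relies on parentheses to capture each group of digits (a symbol not yet covered in this series), and the only rules it applies are that no group may be all zeros. It is not a complete validity check.

```python
import re

# Parentheses capture the three groups so they can be inspected separately.
SSN_FORMAT = re.compile(r"^(\d{3})-(\d{2})-(\d{4})$")

def passes_basic_ssn_rules(text):
    """Hypothetical helper: layout check plus a partial, assumed rule set."""
    match = SSN_FORMAT.search(text)
    if match is None:
        return False  # wrong layout, no need to look further
    area, group, serial = match.groups()
    # Assumed subset of the real restrictions: reject all-zero groups only.
    return area != "000" and group != "00" and serial != "0000"

# Sample inputs are invented for illustration.
for sample in ["123-45-6789", "000-14-5832", "123-00-6789", "123-45-0000"]:
    print(sample, passes_basic_ssn_rules(sample))
```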

One more thing that I want to cover is the use of the curly braces {}. We used them to ensure that there were 3 instances of a given pattern. The following regex:

^[a-zA-Z]{3}$



Flags true for 'hat', 'box', 'and'. Flags false for 'dream', 'in' and 'h4x'.

That checks to ensure that a string is a three letter word with only alphabetic characters in it. But what if we wanted to see if a word was an alphabetic word that was 3-5 characters long? Well, we can specify a character length range with the {} symbols. Our modified code would read

^[a-zA-Z]{3,5}$



Quote

By separating the minimum and maximum with a comma, the regex will search for a string that is 3 to 5 characters in length.
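A quick sketch of the length range, assuming Python's `re` module and a made-up helper name:

```python
import re

# {3,5} means at least 3 and at most 5 of the preceding character class.
WORD_3_TO_5 = re.compile(r"^[a-zA-Z]{3,5}$")

def is_short_word(text):
    """Hypothetical helper: True for alphabetic strings 3 to 5 letters long."""
    return WORD_3_TO_5.search(text) is not None

# "regexp" is an extra sample added here to show the upper limit.
for sample in ["hat", "dream", "in", "h4x", "regexp"]:
    print(sample, is_short_word(sample))
```

Note that 'dream', which failed the three-letter version, passes here because five letters is within the range.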

Finally, I wanted to cover something that I forgot to mention in the last tutorial. We specified what happens when you use the ^ and $ characters, but we did not discuss what would happen if we did not use either character.

cat



What would that return true for? That regex would return true for any string that contained 'cat', regardless of its position.

Returns true for 'cat', 'caterpillar', 'subcat', 'blackcat3'. Hopefully that clears up any issues that people may be having.
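The unanchored case looks like this in a short sketch, assuming Python's `re` module; `contains_cat` is a hypothetical name.

```python
import re

def contains_cat(text):
    """Hypothetical helper: True when 'cat' appears anywhere in text."""
    # re.search scans the whole string for the pattern.
    return re.search(r"cat", text) is not None

# The last two samples are extras added here for contrast.
for sample in ["cat", "caterpillar", "subcat", "blackcat3", "dog", "Cat"]:
    print(sample, contains_cat(sample))
```

'Cat' fails because matching is case-sensitive by default. If you try this in Python, use `re.search` as above: `re.match` only looks at the beginning of the string, so it would behave as if the pattern started with ^.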

Well, I feel like we covered a lot in this tutorial. We covered the {} symbols, \d symbol, searching for specific characters, escaping characters, and the importance of detail when using regexes. In addition, if you've read my previous tutorial, you know how to use the ^, $, [], and - symbols. Using those you can begin generating your own very basic regexes.

This tutorial is going to wind up being a few sections longer than I had originally intended, but I want to avoid throwing too much information at you at one time. The next tutorial will cover more basic validations. Specifically, we will impose restrictions on the generation of usernames and passwords. The fourth tutorial will be on validating e-mail addresses and go into more detail on the importance of being specific with your regexes. Finally, the fifth tutorial will discuss how to incorporate regular expressions using standard PHP functions. Thanks for being patient while I write these out, and hopefully they're not too long for everybody. As always, questions, comments, criticisms, and thanks are welcome. :-D

This post has been edited by akozlik: 08 June 2008 - 10:10 AM



Replies To: Regular Expressions, Part 2

#2 silverblaze

Posted 08 June 2008 - 08:08 AM

Hello akozlik,

It's a great tutorial. After reading it and doing some research of my own on this topic, I felt like a newborn baby. Some months back I built a clone of urlcash. At that time, to get the URLs out of a string of text, I had to write some crazy substr and strpos code. It was really dirty code and it cost me a lot of time. After learning this, it took only half a minute to write code that took me hours earlier. Truly, regular expressions are so powerful. I did try to learn this once, but when I saw the complex patterns I just left it there. Today I realize I made a big mistake, and also that it's not at all complex once you've learned how to write it, even though it seems too tough to understand.

Really, bro, this tutorial is really helpful. Thank you for your great work.


But bro, I had a doubt about the following segment.

akozlik, on 4 Jun, 2008 - 01:54 PM, said:

Pop quiz: What does the following regex do

[A-Z]$



Guesses? It checks any given string and sees if the last three characters are capitlized letters of the alphabet. It would return true on 'CAT', 'interNET', 'hELP'. It would also return true on 'DREAMINCODE', because the last three characters of that string are capital letters. The expression would flag false for 'DS' (not enough characters) and 'DoG' (not all capital letters).


Doesn't the pattern [A-Z]$ check only the last letter? To check that the last 3 letters are capitals, don't we have to use the pattern [A-Z]{3}$ or something?

Please take a look at it and correct me if I'm wrong in any way.

Thank you, it's been a great tutorial.

Take care.

#3 PsychoCoder

Posted 08 June 2008 - 08:38 AM

Excellent tutorial akozlik, keep up the great work! :)

#4 akozlik

Posted 08 June 2008 - 10:09 AM

Quote

Doesn't the pattern [A-Z]$ check only the last letter? To check that the last 3 letters are capitals, don't we have to use the pattern [A-Z]{3}$ or something?


You're absolutely right. I'm going to pretend like I put that in there to trick someone, but really it was my own thoughtlessness. Ha ha. I'm editing it now. You're exactly right. Maybe that's a good thing though because it shows I'm actually teaching! Ha ha.

#5 silverblaze

Posted 08 June 2008 - 12:56 PM

Haha, dear akozlik, you are definitely teaching, and I'm always here ready to learn new things. :) Keep up the good work, buddy.

Take care.
