Regular Expression understanding their meaning and usage

#1 Anarion

Posted 29 November 2009 - 05:00 AM

What is a Regular Expression?
A regular expression, usually called a "regex" or "regexp", describes a text pattern. Assume you're looking for a piece of text that starts with two, three or four letters 'a', followed by exactly three letters 'b'. This pattern can be described with the regex a{2,4}b{3}.
From the above regex you can see that not all characters are interpreted literally. The accolades (or curly braces, take your pick) clearly have a special meaning. Characters with such a special meaning are called meta characters. Regular expressions thus have their own particular syntax, so you could speak of a regex language.
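
To see the regex above in action, here is a small sketch. It is my own illustration, not something from the original post: the sample strings are made up, and it assumes a grep that accepts -E (the modern spelling of egrep).

```shell
# Feed five made-up lines to grep and keep those that contain a match.
matches=$(printf '%s\n' abbb aabbb aaaabbb aabb aaaaabbb | grep -E 'a{2,4}b{3}')
echo "$matches"
```

Note that the line with five a's is printed too. The regex is not tied to the start of the line, so the engine simply finds four of the five a's followed by bbb.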

As with most human languages the regex language has many dialects; regexes written for perl aren't automatically suited for sed, awk or grep, to name just a few standard UNIX tools.

I've chosen to write all the regexes in this tutorial in the POSIX dialect. This is because POSIX is slowly gaining ground in the world of regexes, and because a fair number of dialects are similar to it (well, actually it's the other way around). But this doesn't mean I'll be covering all the features of the POSIX 1003.2 regular expression standard. Another reason for using the POSIX dialect as opposed to the Perl dialect is that the Perl documentation does a much better job of explaining the Perl dialect than I ever will. Also, this way you won't be locked into any particular tool's regex extensions. In a way, the POSIX dialect can be considered the lowest common denominator.

A regex by itself does very little. Only by applying such a description of a text pattern to a piece of text does anything happen. The actual applying is done by a piece of software called a regex engine. The text is searched from the start until a piece of text is found that matches the pattern description (the regex), or until it runs out of text. Such a match is called a pattern match.

There are basically two ways of using regular expressions. One is by using special-purpose tools that were built specifically to apply regexes to text, like grep, egrep and sed. The other way is by using the regex capabilities built into a programming or scripting language. These days, most languages, like C, C++, JavaScript, Python and PHP, provide functions or methods that can apply a regex to a piece of text.
awk, and particularly perl, don't fit neatly into either category. Once you get the hang of perl, you'll notice how tightly the concept of applying a regex to data is integrated into the whole design of perl.

Meta Characters
To be able to discuss meta characters, we first have to determine what "ordinary" characters mean to a regex. The regex cat does indeed find the "cat" in the text The neighbour's cat pees on my lawn but also the "cat" in the winter catalog. So, regular expressions work purely on text, and don't look at the semantics. It's important to realise that the above regex doesn't mean anything more to the regex engine than a 'c', followed by an 'a', followed by a 't', wherever it may be in the text to which the regex is applied.
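
A quick sketch of that point, assuming a grep with -E. The third input line is made up so that there is something that does not match:

```shell
# The regex "cat" matches any line that has c, a, t in a row, word or not.
hits=$(printf '%s\n' "The neighbour's cat pees on my lawn" "the winter catalog" "a dog" | grep -E 'cat')
echo "$hits"
```

Both the "cat" line and the "catalog" line come back; only the "a dog" line is dropped.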

Now is a good time for an example. Let's create a text file and put these lines in it:

I saved it under the name "mylist". OK, now let's run an egrep command on this list. Type this in your terminal:
egrep pear mylist
The output will be the line containing "pear", so this command outputs only "pear".
Now, do this:
egrep ea mylist
What do you see? Both "pear" and "peach".
These are easy, so just play with them a little. One note: if you write egrep -v a mylist, you will see the lines that do not contain "a". That is what -v does: it inverts the results.
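
If you want to replay these commands in one go, here is a sketch. The six fruit names are a hypothetical stand-in for mylist, chosen by me so that the results line up with the ones described above; grep -E is used as the modern spelling of egrep.

```shell
# Hypothetical stand-in for "mylist" (an assumption, not the original file).
list=$(mktemp)
printf '%s\n' apple banana blueberry peach pear plum > "$list"

pear_lines=$(grep -E pear "$list")     # lines containing "pear"
ea_lines=$(grep -E ea "$list")         # lines containing "ea"
no_a_lines=$(grep -E -v a "$list")     # -v: lines that do NOT contain "a"

printf '%s\n' "$pear_lines" "==" "$ea_lines" "==" "$no_a_lines"
rm -f "$list"
```

grep prints matching lines in file order, so with this stand-in list "peach" comes before "pear".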

Using ^ and $ you can force a regex to match only at the start or end of a line, respectively. So ^cat matches only those lines that start with cat, and cat$ only matches lines ending with cat.

Now look at this:
egrep ^b mylist
What do you see now? It will output both banana and blueberry.
One note when using $: put the regex in single quotes to avoid variable substitution by the shell, as in egrep 'e$' mylist.
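
Here is a small sketch of both anchors. The fruit list is again a made-up stand-in, and both regexes are single-quoted so the shell leaves them alone:

```shell
# Made-up sample lines (an assumption, not the original mylist).
fruits=$(printf '%s\n' apple banana blueberry peach pear plum)

starts_b=$(printf '%s\n' "$fruits" | grep -E '^b')   # lines starting with b
ends_e=$(printf '%s\n' "$fruits" | grep -E 'e$')     # lines ending with e

printf '%s\n' "$starts_b" "==" "$ends_e"
```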

Moving on, ^cat$ only matches lines that contain exactly cat. You can find empty lines in a similar way with ^$. If you're having trouble understanding that last one, just apply the definitions. The regex basically says: "Match a start-of-line, followed by an end-of-line".
Also, a regex with only a start-of-line anchor ^ always matches, since every line has a start. The same obviously goes for the end-of-line anchor.
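
The three cases above can be checked by counting lines with grep -c. This is only a sketch on six made-up lines, two of which are empty:

```shell
# Six lines: cat, (empty), brown cat, catalog, (empty), cat
text=$(printf 'cat\n\nbrown cat\ncatalog\n\ncat\n')

exact=$(printf '%s\n' "$text" | grep -E -c '^cat$')   # lines that are exactly "cat"
empty=$(printf '%s\n' "$text" | grep -E -c '^$')      # empty lines
every=$(printf '%s\n' "$text" | grep -E -c '^')       # every line has a start

echo "exact=$exact empty=$empty every=$every"
```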

A lot of regex implementations offer the ability to use word anchors (they are an extension, not part of the POSIX standard itself). As you saw, a regex like cat not only finds the word cat, but also all those cases where cat is "hidden" in other, longer words. In such cases you can use the start-of-word and end-of-word anchors, \< and \>, respectively. These meta characters don't match characters, but the positions between them.
So if you were looking only for occurrences of the word cat, you could use the regex \<cat\>.

OK, now let's change the contents of our list file to this:
brown cat

Now, type this command in your terminal:
egrep '\<cat' mylist
What do you see? This will output cat, brown cat and catalog. Now, try this one:
egrep 'cat\>' mylist
This will output cat and brown cat.
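
A sketch of the word anchors on a made-up word list. It assumes GNU grep, since \< and \> are an extension that not every tool supports:

```shell
# Made-up sample lines; "concatenate" hides "cat" in the middle of a word.
words=$(printf '%s\n' cat 'brown cat' catalog concatenate)

word_start=$(printf '%s\n' "$words" | grep -E '\<cat')    # "cat" at the start of a word
word_end=$(printf '%s\n' "$words" | grep -E 'cat\>')      # "cat" at the end of a word
whole_word=$(printf '%s\n' "$words" | grep -E '\<cat\>')  # "cat" as a whole word

printf '%s\n' "$word_start" "==" "$word_end" "==" "$whole_word"
```

The "concatenate" line never shows up, because the "cat" inside it touches neither a word start nor a word end.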

Character Classes
With the [] construct you can indicate that at a certain position within the pattern one of several characters may appear. Suppose for instance that you're trying to find both cake and coke. In that case you can use the regex c[ao]ke.
Another example, to recognise hexadecimal digits, is [0123456789abcdefABCDEF]. This quickly becomes impractical though. Fortunately you can use a hyphen to specify a range: [0-9]. More than one range in a character class is also allowed: [0-9a-fA-F].
Just make sure you don't write [A-z] when you mean [A-Za-z]. Though it might look convenient, the first regex also catches the six characters between 'Z' and 'a' (if you're using the ASCII character set, that is).
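
A sketch of these classes, including the [A-z] pitfall. The sample strings are made up, and LC_ALL=C is set on grep because ranges are only predictable in the plain ASCII ("C") locale:

```shell
# One of two letters at a fixed position.
drinks=$(printf '%s\n' cake coke cuke | LC_ALL=C grep -E 'c[ao]ke')

# Lines made up entirely of hexadecimal digits.
hex=$(printf '%s\n' 1F beef xyz | LC_ALL=C grep -E '^[0-9a-fA-F]{1,}$')

# [A-z] also catches characters that sit between 'Z' and 'a' in ASCII.
sloppy=$(printf '%s\n' _ '^' a 5 | LC_ALL=C grep -E '[A-z]')
strict=$(printf '%s\n' _ '^' a 5 | LC_ALL=C grep -E '[A-Za-z]')

printf '%s\n' "$drinks" "==" "$hex" "==" "$sloppy" "==" "$strict"
```

The sloppy class lets the underscore and the caret through, while the strict one only accepts the letter.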

You can also specify a negated character class by placing a caret (^) directly after the opening bracket: [^...]. This inverts the sense of the character class: [^0-9] matches any character except a digit.
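
A short sketch with made-up input. Keep in mind that [^0-9] still has to match a character, so it finds lines that contain at least one non-digit:

```shell
# Lines containing at least one character that is not a digit.
has_non_digit=$(printf '%s\n' 12345 123a5 abc | LC_ALL=C grep -E '[^0-9]')

# Inverting with -v leaves the lines that consist of digits only.
digits_only=$(printf '%s\n' 12345 123a5 abc | LC_ALL=C grep -E -v '[^0-9]')

printf '%s\n' "$has_non_digit" "==" "$digits_only"
```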

Fine, but what if you want those brackets, hyphen and caret to appear as characters inside a character class? In Perl-style dialects you can escape them with a backslash: [\^\]]. In the POSIX dialect that doesn't work, because a backslash inside a character class is just a literal backslash. The portable way is to put the characters in places in the character class where they have no special meaning; the regex engine will then treat them as literals. So, place the dash first or last within the character class, the caret anywhere but the first place, and the closing bracket right after the opening bracket: []^[-] is a valid character class containing four characters.
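
Here is a sketch that tries the class []^[-] against five made-up one-character lines, assuming a POSIX-style grep -E:

```shell
# The four special characters, plus an ordinary letter that should not match.
punct=$(printf '%s\n' ']' '^' '[' '-' a | grep -E '[]^[-]')
echo "$punct"
```

All four punctuation lines are printed and the "a" line is not, so each of the four characters is being treated as a literal.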

The dot, ., can be considered a special case of a character class, in that it matches any single character. th.s for instance matches both this and thus, but also thqs, th#s, etc.
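
A tiny sketch with made-up input. The last line, ths, is there to show that the dot must match exactly one character:

```shell
# "ths" has nothing between the h and the s, so the dot has nothing to match.
dot_hits=$(printf '%s\n' this thus 'th#s' ths | grep -E 'th.s')
echo "$dot_hits"
```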

With quantifiers, you can specify how many times a character may appear in a given place, so ap{1,2} says there must be an "a" followed by one or two "p"s. To match any sequence of zero or more vowels, the regex looks like [aeiou]{0,}.
Just note that a quantifier only applies to the item that directly precedes it.
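
A sketch of the quantifiers on made-up input. The ^ and $ anchors are added by me so that the counts apply to the whole line instead of to any piece of it:

```shell
# An "a" followed by one or two "p"s, and nothing else on the line.
bounded=$(printf '%s\n' a ap app appp | grep -E '^ap{1,2}$')

# Lines consisting of zero or more vowels only.
vowels=$(printf '%s\n' xyz aeiou queue | grep -E '^[aeiou]{0,}$')

# Without anchors, "zero or more" is satisfied by every line.
anything=$(printf '%s\n' xyz | grep -E '[aeiou]{0,}')

# The quantifier binds to the b only: ab{2} means "abb", not "abab".
binds=$(printf '%s\n' abb abab | grep -E 'ab{2}')

printf '%s\n' "$bounded" "==" "$vowels" "==" "$anything" "==" "$binds"
```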

With the | meta character, the "or", you can merge several regexes into a single regex. This supplies the regex engine with alternatives. Jack and Jill are two separate regexes, whereas Jack|Jill is one regex that will match either.
Further back I mentioned the regex c[ao]ke. Using alternation you can write it (less efficiently) as c(a|o)ke, where the parentheses (which are therefore meta characters too) are used to limit the effect of the alternation.
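
A sketch of alternation with and without parentheses. The names and words are made up, and the regexes are single-quoted so the shell doesn't treat | as a pipe:

```shell
# Either name matches.
names=$(printf '%s\n' Jack Jill John | grep -E 'Jack|Jill')

# Parentheses limit the alternation to the middle letter.
grouped=$(printf '%s\n' cake coke cuke | grep -E 'c(a|o)ke')

# Without them, the alternatives are "ca" and "oke".
ungrouped=$(printf '%s\n' car joke cuke | grep -E 'ca|oke')

printf '%s\n' "$names" "==" "$grouped" "==" "$ungrouped"
```

The last regex shows why the parentheses matter: ca|oke happily matches car and joke, which have nothing to do with cake or coke.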

Besides limiting the effect of alternation, parentheses () have another function: grouping for quantifiers. Everything about quantifiers that applies to characters and character classes also applies to groups.
An example is (apple *){2,3}, where * is shorthand for {0,}, so each "apple" may be followed by any number of spaces. It matches apple apple as well as apple apple apple.
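
A sketch of a quantified group, using the space-tolerant form (apple *){2,3}, where the * after the space means "zero or more spaces". The input lines are made up; the second command is there to contrast a quantifier on a group with a quantifier on a single character:

```shell
# The whole group "apple plus optional spaces" must appear two or three times in a row.
repeated=$(printf '%s\n' apple 'apple apple' 'apple apple apple' | grep -E '(apple *){2,3}')

# Without the group, {2} applies to the final "e" only.
single_char=$(printf '%s\n' applee appleapple | grep -E 'apple{2}')

printf '%s\n' "$repeated" "==" "$single_char"
```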

It seems this tutorial got long... but as you can see, Regular Expressions are quite useful in scripts :D I hope you learned them well!
