An Introduction to Regular Expressions (Grep) in Linux

#1 macosxnerd101

Posted 23 December 2012 - 04:43 PM

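Regular expressions are a powerful and efficient way to find key pieces of text. This tutorial explores the tools used to construct basic regular expressions compatible with grep on Linux. Note that with GNU grep, the plus, question mark, pipe, parentheses, and curly braces are meta-characters only in extended mode (grep -E); plain grep treats them as literal characters unless they are preceded by a backslash.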

Period
The period meta-character is a wildcard, matching any single character. For example, the regular expression c.d will match ced, cdd, c8d, etc. The middle character can be anything. Taken as whole strings, the following are not matched by c.d: ceed, c45d, cddd. (grep searches within each line, so a line such as cddd is still printed because it contains cdd; use grep -x to require that the entire line match.) Those strings are matched in full by the regular expression c..d. Now there are two periods, each acting as a wildcard for its particular spot in the string.
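The period examples can be tried directly from a shell. The following is a minimal sketch, assuming a POSIX shell and grep; the variable names are made up for illustration, and -x is used so that each line must match as a whole.

```shell
# Candidate strings, one per line (taken from the examples in the text).
candidates='ced
cdd
c8d
ceed
c45d
cddd'

# -x requires the whole line to match the pattern.
one_dot=$(printf '%s\n' "$candidates" | grep -x 'c.d')
two_dots=$(printf '%s\n' "$candidates" | grep -x 'c..d')

echo "Whole-line matches for c.d:"
printf '%s\n' "$one_dot"
echo "Whole-line matches for c..d:"
printf '%s\n' "$two_dots"
```

Dropping -x changes the result for c.d: the line cddd is then printed too, because grep finds cdd inside it.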

Asterisk, Plus, and Question Mark
These three meta-characters match strings based on the desired number of times the preceding regular expression appears.

The asterisk character matches zero or more occurrences of the preceding regular expression. An example would be fg*h, where the asterisk applies to the g. So fh, fgh, fgggggh, etc., would all be matched by the regular expression. Note that ffffgh as a whole would not be considered a match, as the asterisk does not apply to the f; only its trailing fgh matches.

The plus character works similarly to the asterisk character, except that it matches at least one occurrence of the preceding regular expression. So using the previous example, fg+h matches fgh and fgggggh, but not fh, nor ffffgh as a whole.

The question mark character matches 0 or 1 occurrence of the preceding regular expression. So in similar fashion as the asterisk and plus characters: fg?h will only match fh and fgh. Another example would be f.?h, which would match fh or anything matched by f.h.
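To compare the three quantifiers side by side, here is a small sketch. It assumes grep -E (extended syntax), since with GNU grep the plus and question mark are not meta-characters in plain mode; the variable names are illustrative only.

```shell
# Sample strings from the text, one per line.
samples='fh
fgh
fgggggh
ffffgh'

# -E enables extended syntax; -x requires the whole line to match.
star=$(printf '%s\n' "$samples" | grep -Ex 'fg*h')      # zero or more g
plus=$(printf '%s\n' "$samples" | grep -Ex 'fg+h')      # one or more g
optional=$(printf '%s\n' "$samples" | grep -Ex 'fg?h')  # zero or one g

echo "fg*h:"; printf '%s\n' "$star"
echo "fg+h:"; printf '%s\n' "$plus"
echo "fg?h:"; printf '%s\n' "$optional"
```

The line ffffgh appears in none of the results because -x compares the entire line; without -x it would be printed by all three patterns, since it contains fgh.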

Logical Or
The logical or meta-character works much like the or operator in traditional programming languages. It matches text that satisfies either of the regular expressions provided. The pipe character is used as the operator. An example would be ab|cd, which matches strings containing ab, cd, or both. Also, note that the pipe symbol has a special use in both the Linux and Windows shells. Therefore, the regular expression should be enclosed in single quotes in Linux, and in double quotes in Windows.
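A short sketch of the alternation example in a Linux shell follows. It assumes grep -E, and the sample lines are invented for illustration.

```shell
# Invented sample lines, one per line.
lines='xxabxx
cdcd
abcd
efgh'

# Single quotes keep the shell from treating | as a pipeline operator.
either=$(printf '%s\n' "$lines" | grep -E 'ab|cd')

printf '%s\n' "$either"
```

Only efgh is left out, since it contains neither ab nor cd.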

Beginning and End of Line Meta-Characters
It is sometimes necessary to match a pattern at either the beginning or end of the line. The caret and dollar sign meta-characters are used for these purposes. Some examples:

^test would match the following lines:
testing 123
test test test
tested

And test$ would match:
test
new test

However, tested would not be matched by test$, because that line does not end with test.
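The anchor examples can be checked with a quick sketch like the one below, which feeds the sample lines from this section to plain grep. The variable names are illustrative.

```shell
# Sample lines from this section, one per line.
lines='testing 123
test test test
tested
test
new test'

starts=$(printf '%s\n' "$lines" | grep '^test')  # lines beginning with test
ends=$(printf '%s\n' "$lines" | grep 'test$')    # lines ending with test

echo "^test:"; printf '%s\n' "$starts"
echo "test\$:"; printf '%s\n' "$ends"
```

The lines test and test test test satisfy both patterns, so they show up in both results.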

Escaping
In order to match certain meta-characters literally, it is necessary to escape them with a backslash. Some examples include:
test\. matches test.
equal\^ matches equal^

Note that the escaped period and caret are matched as literal characters here; they no longer act as meta-characters.
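The effect of escaping is easiest to see by comparing an escaped and an unescaped period. This is a minimal sketch with invented sample lines.

```shell
# Invented sample lines, one per line.
lines='test.
tests
equal^
equals'

literal_dot=$(printf '%s\n' "$lines" | grep 'test\.')      # only a real period
any_char_count=$(printf '%s\n' "$lines" | grep -c 'test.') # period as wildcard
literal_caret=$(printf '%s\n' "$lines" | grep 'equal\^')   # only a real caret

echo "test\\. matched: $literal_dot"
echo "test. matched $any_char_count lines"
echo "equal\\^ matched: $literal_caret"
```

The unescaped pattern test. matches both test. and tests, while the escaped form matches only the line with an actual period.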

Groups and Sets
Parentheses are used to group characters together, to be treated as a single unit. The unit is then evaluated by other meta-characters. As an example, (ab)+c will match abc, ababc, ababababababc, etc. It will not match abbbbbc or aabbc.
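To see what the parentheses change, the sketch below runs the grouped pattern next to an ungrouped one. It assumes grep -E, since with GNU grep unescaped parentheses group only in extended mode; the variable names are illustrative.

```shell
# Sample strings from the text, one per line.
lines='abc
ababc
abbbbbc
aabbc'

grouped=$(printf '%s\n' "$lines" | grep -E '(ab)+c')  # + repeats the unit ab
ungrouped=$(printf '%s\n' "$lines" | grep -E 'ab+c')  # + repeats only the b

echo "(ab)+c:"; printf '%s\n' "$grouped"
echo "ab+c:"; printf '%s\n' "$ungrouped"
```

Without the parentheses, the plus applies only to b, so abbbbbc and aabbc are matched as well.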

Brackets are used to define sets of characters which will be matched. Exactly one character from the set is matched at that position. Some examples include:
[A-Z] - Capital letters
[A-Za-z] - All letters
[AEIOUae] - Capital vowels, plus lowercase a and e

Sets can also be negated by placing the caret operator first inside the brackets. An example would be:
[^A-Z] - Matches any single character except a capital letter.

Curly braces are used to specify how many times the preceding element, such as a set, must be matched. They can also handle lower and upper bounds. Some examples:
[A-Z]{3} - Matches 3 uppercase letters
[0-9]{0,3} - Matches up to 3 numeric characters
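The set, negation, and curly-brace examples can be sketched as follows. This assumes grep -E for the braces, and sets LC_ALL=C as a precaution, because in some locales a range such as [A-Z] may not be limited to ASCII capitals. The sample lines and variable names are invented.

```shell
# Invented sample lines, one per line.
lines='ABC
AB
abc
123
12345'

# LC_ALL=C makes the ranges plain ASCII; -x requires the whole line to match.
three_caps=$(printf '%s\n' "$lines" | LC_ALL=C grep -Ex '[A-Z]{3}')
up_to_three_digits=$(printf '%s\n' "$lines" | LC_ALL=C grep -Ex '[0-9]{0,3}')
no_caps=$(printf '%s\n' "$lines" | LC_ALL=C grep -Ex '[^A-Z]+')

echo "[A-Z]{3}:"; printf '%s\n' "$three_caps"
echo "[0-9]{0,3}:"; printf '%s\n' "$up_to_three_digits"
echo "[^A-Z]+:"; printf '%s\n' "$no_caps"
```

Because {0,3} allows zero repetitions, the second pattern would also match an empty line if the input contained one.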

Word Bounds
The last tool to discuss is the word bounds. The syntax for this tool is:
\<Word Here\>

Some examples include:
\<equal\> - Matches equal, but not equally.

\<test\> - Matches test, but not tested or testing.
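A brief sketch of the word-bound examples follows. It assumes GNU grep, where \< and \> are supported as an extension to the standard syntax; the sample lines are invented.

```shell
# Invented sample lines, one per line.
lines='equal
equally
all are equal here
test
tested
testing'

whole_equal=$(printf '%s\n' "$lines" | grep '\<equal\>')
whole_test=$(printf '%s\n' "$lines" | grep '\<test\>')

echo "Lines containing the word equal:"; printf '%s\n' "$whole_equal"
echo "Lines containing the word test:"; printf '%s\n' "$whole_test"
```

The word only has to appear somewhere in the line, which is why all are equal here is included.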

Putting It All Together
Let's look at a couple of examples for putting a regular expression together.

First, let's design a regular expression to match a vowel, followed by at most one numeric character, then any number of lowercase letters. The regular expression for this would be: [AEIOUaeiou][0-9]?[a-z]*. The first set defines the vowels, [0-9]? allows at most one numeric character, and [a-z]* matches any number of lowercase letters.

Next, let's design a regular expression to match all five-digit numbers whose second digit is 4 and whose fifth digit is even. The regular expression would be: \<[0-9]4[0-9]{2}[02468]\>. Here, word bounds are used because we are only interested in numbers of exactly five digits. The first set [0-9] matches any first digit, the literal 4 fixes the second digit, [0-9]{2} picks any two numeric characters to fill the 3rd and 4th digit slots, and the last set [02468] defines the set of even digits.
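Both combined patterns can be tried with a sketch like this one. It assumes GNU grep with -E (for ?, the braces, and the word bounds) and uses LC_ALL=C so the ranges stay ASCII-only; the sample data and variable names are invented for illustration.

```shell
# Invented sample words for the vowel pattern.
words='a1bc
E
u12
x9
Ab'

# -x: the whole line must be vowel, optional digit, then lowercase letters.
vowel_words=$(printf '%s\n' "$words" | LC_ALL=C grep -Ex '[AEIOUaeiou][0-9]?[a-z]*')

# Invented sample lines for the five-digit pattern.
numbers='14322
14323
24008
1432
143220
id 94006 ok'

# Word bounds keep longer digit runs such as 143220 from matching.
five_digit=$(printf '%s\n' "$numbers" | LC_ALL=C grep -E '\<[0-9]4[0-9]{2}[02468]\>')

echo "Vowel pattern:"; printf '%s\n' "$vowel_words"
echo "Five-digit pattern:"; printf '%s\n' "$five_digit"
```

In the first result, u12 is rejected because only one digit is allowed after the vowel. In the second, 14323 fails on its odd last digit, and 1432 and 143220 fail on length.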

Conclusion
I hope this tutorial provided some insight and intuition on building regular expressions using the grep flavor.


Replies To: An Introduction to Regular Expressions (Grep) in Linux

#2 JonBernal

Posted 23 December 2012 - 05:04 PM

Quite great!