10 GOTO 10
-----

Why Should I Learn RegEx?

Coding with RegEx Enabled Editors

Many programmers H A T E regular expressions. Regular expressions are an alphabet/symbol soup: it's hard to tell what is going on, and harder still to figure out how to do anything with them. Even the name RegEx is confusing. They are not well standardized across tools, and the theory behind them is technical and boring (to some).

However, these little symbol-soup regular expressions can be a life saver, not to mention a time saver, for a programmer. One way regex can be very useful is in taking data from some source and formatting it into valid syntax for a programming language. I tend to do a great deal of re-formatting of data to fit it into my programs.

Here are some examples.

Example #1: C++ Keywords.


Various web pages list the C++ keywords, but importing such a list into a parser program can be quite a boring task. I needed to take a list like this:

alignas continue friend reinterpret_cast typedef
alignof decltype goto return typeid
asm default if short typename
auto delete inline signed union
bool double int sizeof unsigned
break do long static_assert using
case dynamic_cast mutable static_cast virtual
catch else namespace static void
char enum new struct volatile
char16_t explicit nullptr switch wchar_t
char32_t export operator template while
class extern private this
const false protected throw
constexpr float public true
const_cast for register try


and format this as an array of strings.

I use Programmer's Notepad 2.0 as my go-to editor because it has some wonderful RegEx abilities (not to mention it is script-able). I began by pasting this list into a new file. Then Ctrl-H brings up the regex search-and-replace.

Search: (\S+)
Replace: "\1",

Then I just added the braces, gave the array a name, dropped the last comma, and tada, I was done.
const char *cppkeywords[] = {
    "alignas", "continue", "friend", "reinterpret_cast", "typedef",
    "alignof", "decltype", "goto", "return", "typeid",
    "asm", "default", "if", "short", "typename",
    "auto", "delete", "inline", "signed", "union",
    "bool", "double", "int", "sizeof", "unsigned",
    "break", "do", "long", "static_assert", "using",
    "case", "dynamic_cast", "mutable", "static_cast", "virtual",
    "catch", "else", "namespace", "static", "void",
    "char", "enum", "new", "struct", "volatile",
    "char16_t", "explicit", "nullptr", "switch", "wchar_t",
    "char32_t", "export", "operator", "template", "while",
    "class", "extern", "private", "this",
    "const", "false", "protected", "throw",
    "constexpr", "float", "public", "true",
    "const_cast", "for", "register", "try" 
};


Total time, about 3 seconds.
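
For readers who want to try the same substitution outside an editor, here is a minimal sketch in Python. Python is used only because its `re` module accepts the `(\S+)` pattern and the `\1` backreference exactly as written above; the function name `quote_words` is made up for this illustration and is not part of any tool mentioned in the post.

```python
import re

def quote_words(text):
    """Wrap every run of non-whitespace characters in double quotes plus a comma."""
    # Same search and replace as in the editor: (\S+)  ->  "\1",
    return re.sub(r'(\S+)', r'"\1",', text)

# Two rows from the keyword list above
sample = "alignas continue friend\nalignof decltype goto"
print(quote_words(sample))
```

The whitespace between the words is left untouched, which is why the quoted keywords stay separated by spaces and keep their original line breaks.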

Example #2: Binary Data.


I was working on a snippet today that would print out some large block letters (derived from this topic). I didn't really want the font to live in a standalone file that would have to follow the program/code around. But the font file contains binary data. I could have written a quick little script to read in the binary data and print it out in a C++-friendly hex format.
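
For comparison, such a script might look roughly like the following Python sketch. The function name `bytes_to_cpp_rows`, the 16-byte row width, and the stand-in data are all assumptions made for this illustration; the post does not show an actual script.

```python
def bytes_to_cpp_rows(data, width=16):
    """Format raw bytes as brace-wrapped rows of C++ hex literals."""
    rows = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        literals = ", ".join(f"0x{b:02X}" for b in chunk)
        rows.append("    {" + literals + "},")
    return "\n".join(rows)

# Stand-in data; a real run would read the font file with open(path, "rb").read()
demo = bytes([0x00, 0x7E, 0x81, 0xA5])
print(bytes_to_cpp_rows(demo, width=4))
```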

However, I really did not want to waste all of that time writing/debugging such a program. So regex to the rescue: I started with a hex dump of the file:

00000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010 | 00 00 7E 81 A5 81 81 BD 99 81 81 7E 00 00 00 00 ..~.....~....
00000020 | 00 00 7E FF DB FF FF C3 E7 FF FF 7E 00 00 00 00 ..~.....~....
00000030 | 00 00 00 00 6C FE FE FE FE 7C 38 10 00 00 00 00 ....l|8.....
...
...
...


Then I used three steps to get me from the hex dump text to my final data.

First I wanted to clean off the data so I only had the hex values in the middle:

Search: ^\S+\s\|\s((\S\S\s)+).*$
Replace: \1

00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 7E 81 A5 81 81 BD 99 81 81 7E 00 00 00 00 
00 00 7E FF DB FF FF C3 E7 FF FF 7E 00 00 00 00 
00 00 00 00 6C FE FE FE FE 7C 38 10 00 00 00 00 
...
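
As a rough check of this first step, the same pattern can be run through Python's `re` module. This is only a sketch: the helper name `strip_dump` is hypothetical, and the `re.MULTILINE` flag is needed so that `^` and `$` match at every line the way they do in the editor.

```python
import re

# The first two lines of the hex dump shown above
dump = (
    "00000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................\n"
    "00000010 | 00 00 7E 81 A5 81 81 BD 99 81 81 7E 00 00 00 00 ..~.....~....\n"
)

def strip_dump(text):
    """Keep only the hex byte column of each dump line."""
    # Offset, bar, then greedy "two characters plus whitespace" groups; the rest is dropped
    return re.sub(r'^\S+\s\|\s((\S\S\s)+).*$', r'\1', text, flags=re.MULTILINE)

print(strip_dump(dump))
```

Each output line keeps its trailing space, because the last `\s` inside the group is part of what `\1` captures.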


Next I wanted to add the 0x and add a comma into the data.

Search: (\S\S)
Replace: 0x\1,

0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x7E, 0x81, 0xA5, 0x81, 0x81, 0xBD, 0x99, 0x81, 0x81, 0x7E, 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x7E, 0xFF, 0xDB, 0xFF, 0xFF, 0xC3, 0xE7, 0xFF, 0xFF, 0x7E, 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, 0x00, 0x6C, 0xFE, 0xFE, 0xFE, 0xFE, 0x7C, 0x38, 0x10, 0x00, 0x00, 0x00, 0x00, 
...
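
The second step is the simplest of the three. A minimal Python sketch (the name `add_hex_prefix` is an assumption for this example):

```python
import re

def add_hex_prefix(text):
    """Turn every two-character token into a C++ hex literal followed by a comma."""
    # Same search and replace as above: (\S\S)  ->  0x\1,
    return re.sub(r'(\S\S)', r'0x\1,', text)

print(add_hex_prefix("00 00 7E 81 A5 "))
```

Note that the pattern matches any two non-space characters, not just hex digits, so it should only be run over lines that contain nothing but byte values.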


And then finally I wanted to make this an array of 16-element char arrays, so I used:

Search: ^(.*),\s*$
Replace: {\1},

Add the declaration at the top and the closing brace and semicolon at the bottom: DONE
const unsigned char LargePrint::font[256][16] = {  // unsigned: values such as 0xFF do not fit in a signed char
    {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x00, 0x00, 0x7E, 0x81, 0xA5, 0x81, 0x81, 0xBD, 0x99, 0x81, 0x81, 0x7E, 0x00, 0x00, 0x00, 0x00},
    {0x00, 0x00, 0x7E, 0xFF, 0xDB, 0xFF, 0xFF, 0xC3, 0xE7, 0xFF, 0xFF, 0x7E, 0x00, 0x00, 0x00, 0x00},
    {0x00, 0x00, 0x00, 0x00, 0x6C, 0xFE, 0xFE, 0xFE, 0xFE, 0x7C, 0x38, 0x10, 0x00, 0x00, 0x00, 0x00},
    ...
    {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}
};


Time taken: About 1 min. (I did have to try a couple of things to get exactly what I wanted).
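
The third step can be sketched in Python as well. The helper name `wrap_rows` is hypothetical. One deliberate difference from the editor: the pattern is applied line by line, because in Python `\s*` would otherwise be free to swallow a line break along with the trailing spaces.

```python
import re

def wrap_rows(text):
    """Turn each comma-terminated line into a brace-wrapped C++ row."""
    # Applied per line so the trailing-whitespace part can never eat a newline
    return "\n".join(
        re.sub(r'^(.*),\s*$', r'{\1},', line) for line in text.splitlines()
    )

rows = "0x00, 0x7E, 0x81, \n0x00, 0x6C, 0xFE, "
print(wrap_rows(rows))
```

The greedy `(.*)` backs off to the last comma on the line, so only that final comma and the trailing space are replaced by the closing brace.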

A good RegEx search and replace feature is a must for any text editor used for coding. Programmers should really take some time to learn how to use these features in their favorite IDE/Editor.

Learning how to use sed can also be a life-saver.

Like so many tools, regular expressions become more and more valuable the more familiar you are with them. The more you use them, the more uses you will find for them.

7 Comments On This Entry

westforduk 

24 February 2010 - 02:10 AM
Great post. I have always avoided using RegEx at all costs but it has some very nice applications it would seem.

NickDMax 

24 February 2010 - 10:47 AM
I even use RegEx a good bit when posting to DIC. For example when formatting something for display. Say you wanted to display the output of a program that did this:
....*
.. * *
..* * *
.* * * *
..* * *
.. * *
....*


But when you paste it into your post you get this instead:
*
* *
* * *
* * * *
* * *
* *
*


The forum software compresses spaces and removes leading spaces... this leads to problems when trying to display formatted text. One can use the code tags but even that can have problems.

My solution has been to convert the spaces to dots and then set the color to white, which makes the dots nearly invisible (the DIC background color is not exactly white; I am sure I could look it up in the CSS, but I don't really care THAT much... white works for me).

So taking the original output and then using a regex to replace spaces (generally I replace only multiple spaces) with dots:

Search: "  "
(note: that is just two spaces (hit space-bar twice) with no quotations).
Replace: ..

That gives me:
....*
.. * *
..* * *
* * * *
..* * *
.. * *
....*

-- I need to add 1 dot in the 4th line (DIC ignores all leading spaces). Then I do a search and replace for dots...

Search: (\.+)
Replace: [color=white]\1[/color]

And then I have something ready to paste into the post (note: to preserve the formatting I also set the font to a fixed-width font like Courier).
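
The two replacements described above can be sketched in Python as follows. The function name `dots_for_forum` is invented for this example, and the `[color=white]` tag is simply copied from the replace string above.

```python
import re

def dots_for_forum(text):
    """Replace pairs of spaces with dots, then wrap each run of dots in a color tag."""
    # Step 1: every pair of spaces becomes a pair of dots
    dotted = text.replace("  ", "..")
    # Step 2: (\.+)  ->  [color=white]\1[/color]
    return re.sub(r'(\.+)', r'[color=white]\1[/color]', dotted)

print(dots_for_forum("    *"))
print(dots_for_forum("  * * *"))
```

A line with a single leading space is not touched by the first step, which is why the fourth line still needs its dot added by hand.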

Another example was updating the C++ Roll call post, where I needed to insert the member tag. I copied the list into my editor, did a simple RegEx search & replace, and then pasted it back. Took about 30 seconds.

Now what would be cool is a Firefox plugin that lets me do these RegEx search and replace operations directly within an editor control.

WolfCoder 

24 February 2010 - 12:10 PM
The reason I don't use RegEx is because it opens you up to all sorts of bugs.

NickDMax 

24 February 2010 - 12:40 PM
I will agree that using regex in a program can be a bit tricky since they are very dependent upon the input string. They are great for highly structured situations but when used in less structured situations (like in HTML or even doing a search and ReplaceAll in an editor) they can be problematic. I love RegEx, but I too am very careful where I choose to use them in code because they do indeed open you to bugs if you are not careful.

But using them as a tool in your favorite text editor or IDE is definitely a good place to use them.
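
A small Python sketch of that input dependence, using the hex dump pattern from the post. The dump line here is hypothetical: its bytes are chosen so that the ASCII column begins with "AB " and happens to look like one more byte. The stricter pattern is just one possible tightening, not something taken from the post.

```python
import re

LOOSE = r'^\S+\s\|\s((\S\S\s)+).*$'                       # pattern from the post
STRICT = r'^\S+\s\|\s((?:[0-9A-Fa-f]{2}\s){16}).*$'       # exactly 16 hex bytes

# Hypothetical dump line: bytes 41 42 20 render as "AB " in the ASCII column
line = "00000040 | 41 42 20 00 00 00 00 00 00 00 00 00 00 00 00 00 AB ............."

loose = re.sub(LOOSE, r'\1', line)
strict = re.sub(STRICT, r'\1', line)

print(len(loose.split()), "tokens with the loose pattern")
print(len(strict.split()), "tokens with the strict pattern")
```

The loose pattern picks up a 17th token from the ASCII column; pinning the byte count to 16 keeps the capture inside the hex column.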

Programmist 

25 February 2010 - 04:49 AM
I never quite understood why people don't love regular expressions, but the last couple of times I've used them to solve some string matching/replacing problem other devs have gotten this look on their faces like they smelled something bad. One time, I asked a guy what was the matter and he just said, "ugh..regular expressions. I suck at regular expressions." This blows my mind because they are not hard, yet seem to have this reputation of difficulty (like Math does in America), so that if you have some mastery of them you are considered a "wizard" or something. Pretty ridiculous.

moopet 

28 February 2010 - 12:25 PM

NickDMax, on 24 February 2010 - 06:40 PM, said:

I will agree that using regex in a program can be a bit tricky since they are very dependent upon the input string. They are great for highly structured situations but when used in less structured situations (like in HTML or even doing a search and ReplaceAll in an editor) they can be problematic. I love RegEx, but I too am very careful where I choose to use them in code because they do indeed open you to bugs if you are not careful.

But using them as a tool in your favorite text editor or IDE is definitely a good place to use them.


I disagree. I think they are fine in highly-structured situations, but less structured situations (like in HTML) are where they can really shine. There are limitations, of course, but they're ideal for taking something that roughly matches what you were expecting and sanitising it.

mostyfriedman 

17 May 2010 - 11:49 AM
good read Nick.. I haven't done a lot of work with regular expressions practically, but have designed many regular expressions in my theory of computation course. It was one of my favorite parts of the course, and I never found it to be difficult, it was just a little tricky sometimes because you gotta be really careful, your regex can generate strings that aren't supposed to be generated, your regex could also miss some strings that were supposed to be included in the language. I found those little problems to be challenging but very amusing.