Stuck in an Infiniteloop

Guess what rAWKs

I recently ran into a problem where I needed to convert a file from format A to XML. Essentially, take the lovely tidbits in their random format of 'A' and put them nice and neat into a well groomed XML file. I'm not going to lie, I started to do this by hand. After about five records (in a several thousand line file) I uttered an expletive out loud and started to search for a better way to go about this.

Lo and behold! Upon my searching did sed and awk then appear! Since the data in format A is consistent, I didn't really need sed-style stream editing; a little awk script could do the conversion. I took the next hour to teach myself awk, and that's when the magic started to happen. In 50 lines or so of awk scriptery I let the computer do in moments what it would have taken some poor bastard a day to do by hand. Huzzah!

Consider the overly simplified example:

I have the following input file:

Employee: Bob
Salary: $60000
Age: 45
Employee: Karen
Salary: $70000
Age: 33
Employee: Joel
Salary: $40000
Age: 54
Employee: Billy
Salary: $80000
Age: 23
Employee: Kermit
Salary: $35000
Age: 18


The company is finally deciding to upgrade their HR software and requires that this be rewritten in XML. You could do it by hand to get something like:

<employees>
<employee>
     <name></name>
     <!-- etc etc-->
</employee>
</employees>



Or you could whip up a quick awk script to do it for you. Awk works sequentially, meaning each line is fed into awk in the order it appears in the file. This works great for us since we know if we hit an "Employee" line, we need to create a new employee node.
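
To see that line-by-line ordering for yourself, here is a small command-line sketch. The sample text is just the first record from the input above, and the show_order function name is made up for illustration. NR is awk's built-in count of the lines read so far.

```shell
# Print each input line prefixed with its line number (NR),
# showing that awk visits lines in file order.
show_order() {
    awk '{ print NR ": " $0 }'
}

printf 'Employee: Bob\nSalary: $60000\nAge: 45\n' | show_order
# prints:
# 1: Employee: Bob
# 2: Salary: $60000
# 3: Age: 45
```

Because the "Employee" line always arrives before its "Salary" and "Age" lines, the script never has to look ahead or remember much.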

We could do this via the command line, but let's do a script instead:

#!/usr/bin/awk -f

# Each rule is anchored to the start of the line so a value can never trigger it.
BEGIN		{ print "<employees>" }
/^Employee:/	{ printf("\t<employee>\n\t\t<name>%s</name>\n", $2) }
/^Salary:/	{ printf("\t\t<salary>%s</salary>\n", substr($2, 2)) }	# drop the leading "$"
/^Age:/		{ printf("\t\t<age>%s</age>\n", $2); print "\t</employee>" }	# Age ends a record
END		{ print "</employees>" }



The first line (the shebang) says "run this file with awk", which lets the script run on its own, so chmod +x this file and run it as:

./nameOfScript inputFile > outputFile

so I did:

./formatter.awk inputData > test.xml

BEGIN and END are special patterns: their actions run once before the first line is read and once after the last line has been processed (who would have thunk it?), rather than "firing" on a keyword.
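
A tiny sketch of BEGIN and END in isolation, run from the command line. The count_employees function name and the shortened sample input are my own, not part of the script above.

```shell
# BEGIN runs before any input is read, END after the last line;
# the middle rule runs once for every line that contains "Employee".
count_employees() {
    awk 'BEGIN { n = 0 } /Employee/ { n++ } END { print n " employees" }'
}

printf 'Employee: Bob\nAge: 45\nEmployee: Karen\nAge: 33\n' | count_employees
# prints: 2 employees
```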

The /word/ notation is a regular expression. For each line that matches, awk runs whatever is in the braces. Inside the action, $0 holds the entire matching line. Whitespace is the default delimiter and the fields are indexed much like positional parameters in shell scripting ($1, $2, etc.), so the element we are interested in is the second field of the line in each instance.
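
As a quick illustration of the field numbering, this sketch prints the pieces awk hands you for a matching line. The show_fields name is hypothetical; the input is two lines from the sample file.

```shell
# $0 is the whole matching line; $1, $2, ... are its whitespace-separated fields.
show_fields() {
    awk '/Salary/ { print "line=" $0; print "first=" $1; print "second=" $2 }'
}

printf 'Employee: Bob\nSalary: $60000\n' | show_fields
# prints:
# line=Salary: $60000
# first=Salary:
# second=$60000
```

Note that the label, colon included, is the first field, which is why the script reaches for $2.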

print is simple: it prints whatever you give it and automatically appends a newline. printf behaves much like its C counterpart. There are quite a few built-in functions, but the one used here, substr, returns a substring given a string and a start position (we want the numerical salary only, without the dollar sign). WARNING: string indexes start at 1 in awk, so keep that in mind.
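
Here is the substr call on its own, as a small sketch (the strip_dollar function name is made up for this example). Since positions start at 1, a start position of 2 skips exactly one character.

```shell
# substr(s, start) returns s from position start to the end; positions begin at 1,
# so start=2 skips the first character (the "$" in the salary).
strip_dollar() {
    awk '{ printf("%s\n", substr($1, 2)) }'
}

printf '$60000\n' | strip_dollar
# prints: 60000
```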

So awk eats through the file starting with an opening tag for all employees, adding elements for each employee, and finally wrapping it up with a closing tag. The final product should look akin to:

<employees>
        <employee>
                <name>Bob</name>
                <salary>60000</salary>
                <age>45</age>
        </employee>
        <employee>
                <name>Karen</name>
                <salary>70000</salary>
                <age>33</age>
        </employee>
        <employee>
                <name>Joel</name>
                <salary>40000</salary>
                <age>54</age>
        </employee>
        <employee>
                <name>Billy</name>
                <salary>80000</salary>
                <age>23</age>
        </employee>
        <employee>
                <name>Kermit</name>
                <salary>35000</salary>
                <age>18</age>
        </employee>
</employees>
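
One caveat with the quick script: $2 only holds the first word of a name, and characters such as & or < would need escaping to keep the XML well-formed. Neither comes up in the sample data. If your data is messier, the following is a rough sketch of one way to handle the name line. The xml_escape helper, the to_xml_name wrapper and the two-word name are my own inventions for illustration.

```shell
to_xml_name() {
    awk '
    # Hypothetical helper: escape the three characters that are unsafe in XML text.
    function xml_escape(s) {
        gsub(/&/, "\\&amp;", s)   # must run first so later entities are not re-escaped
        gsub(/</, "\\&lt;", s)
        gsub(/>/, "\\&gt;", s)
        return s
    }
    /^Employee:/ {
        sub(/^Employee:[ \t]*/, "")   # strip the label, keep the whole name in $0
        printf("<name>%s</name>\n", xml_escape($0))
    }'
}

printf 'Employee: Bob & Karen Smith\n' | to_xml_name
# prints: <name>Bob &amp; Karen Smith</name>
```

The doubled backslash matters: in a gsub replacement a bare & means "the matched text", so it has to be escaped to get a literal ampersand.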



And that is how you automate yourself out of a job. Happy coding!

3 Comments On This Entry

Shane Hudson

29 January 2012 - 05:42 AM
Very nice, I have seen books on sed and awk before but have not read much about them... looks like I ought to!

skorned

29 January 2012 - 11:09 AM
Something like this should be pretty easy to do in Excel or a similar spreadsheet program, although the script you've written seems pretty simple too. I think there was also some dedicated app for data manipulations of exactly this sort; I wish I could remember what it was called.

The funniest experience I had of IT was when some administrators in our school would go around taking 2 numbers from each printer in the school - the pages printed at the start of the day, and the pages printed by the end of the day. Kind of like taking readings from the odometer on a car to calculate the miles travelled. They would then write this all out on a sheet of paper, calculate the differences using a calculator, and proceed to enter all three columns (A, B and A-B), into MS Excel!

Dogstopper

30 January 2012 - 08:05 AM
That's cool! I wish I had known about AWK this summer before I spent three weeks creating a CSV library from a whole bunch of parsable files.