
Introduction to Scraping

#1 G0rman

  • New D.I.C Head

Reputation: 6
  • Posts: 46
  • Joined: 16-October 11

Posted 31 August 2012 - 06:36 AM

Introduction to Scraping

What is “scraping”?
Scraping is simply parsing human-readable information to find data. You probably already do this without even thinking about it, but the technique can be very powerful if used consciously.

My personal opinion is that scraping is extremely important for all programmers, sysadmins, support staff, and, well, anyone else who has a computer! The heart and soul of scraping is automation: it takes a large piece of data and picks out anything important, saving you time. With enough scraping and scripting you can automate any task, probably.[Citation Needed]

Fair warning: I will be using Linux commands. If you use Windows you can still follow this tutorial, but it's going to be a lot harder. I'm not very familiar with Windows commands, but most Linux commands do have a Windows equivalent, for example ls = dir and grep = find. That's about as far as my Windows CLI knowledge goes, sorry!

Please note: this isn't a tutorial on regular expressions. I try to explain each pattern as I go, so it shouldn't matter too much whether they look like magic spells to you or read perfectly naturally. Regex is very important and interesting, so learn it if you haven't already! I won't mind if you ctrl+tab or alt+tab away and come back later; it isn't essential for this tutorial, but it will help you focus on the scraping instead of being confused by the patterns.


Scraping has three main steps:
  • Generate your input
  • Scrape the input
  • Reformat the data

To be fair, steps 1 and 3 are not actually scraping, but you do need something to scrape and some way to read the result, so we will talk about them anyway.

Simple scraping
Say you want to find all the directories in a target directory. Using ls we can list all the files, directories, links, and anything else that may be in a directory. A simple way to distinguish files from directories is the -F flag on ls, which appends one of */=>@| to each name depending on its type; an entry of the form <name>/ is a directory.
Here is some sample output:
$ ls -F
a.out
bubblesort.c
code/
dreamInCode/



Now that we have a command to generate the input, ls -F, we need to do some scraping! We need to find anything ending in '/'. The grep command is perfect for this problem, and in fact we will be using it a lot for scraping. We simply search for the '/' character followed by the end-of-line anchor '$', which gives us grep /$.
Piping our ls -F into grep /$ gives this output:
$ ls -F | grep /$
code/
dreamInCode/



If you wanted to remove that trailing '/' so you get just the directory names, you could use sed "s|\(.*\)/$|\1|". This pattern says "find any string of characters followed by a '/' at the end of the line, and replace the whole match with just that string of characters".
Putting it all together we have this:
$ ls -F | grep /$ | sed "s|\(.*\)/$|\1|"
code
dreamInCode


To break down the script, we have:
ls # generate input
grep # scrape the lines we need
sed # reformat
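
As a side note, if you end up running this a lot you could wrap the pipeline in a small shell function. Here is a minimal sketch (the name listdirs is made up for this example):

listdirs () {
	# print only the directories inside the given directory (default: current)
	ls -F "${1:-.}" | grep /$ | sed "s|\(.*\)/$|\1|"
}

Drop that in your .bashrc and listdirs on its own prints the directories in the current directory.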




What else can we do?
Let's scrape something a bit more exciting! Say we have an internal network (for this example I'll be using my university's network) and we want to find the name of every box that has a service running on port 22, so we can find boxes to SSH into. We will be using Netcat, a utility we can use to check whether a daemon is listening on a port. The command nc -z <host> <port> will print nothing if no daemon is listening, or a message such as Connection to <host> <port> port [<protocol>/<service>] succeeded! In our case the port will be 22, the protocol tcp, and the service ssh.

The first step is to iterate through all the IPs in our subnet, run nc -z <host> 22 against each one, and append the output to our output file, which we will appropriately name "output". Let's call this script "sshscan":
$ cat sshscan
# try port 22 on every address in the subnet and append any
# "succeeded" messages to the file "output"
for i in {0..254}
do
	for j in {0..254}
	do
		nc -z ###.###.$i.$j 22 >> output
	done
done


Now we will have a file filled with lines like these:
$ ./sshscan; cat output
Connection to ###.###.1.1 22 port [tcp/ssh] succeeded!
Connection to ###.###.254.254 22 port [tcp/ssh] succeeded!



So far it isn't bad, but we only want the IPs, not all the other junk. We will construct a regular expression to extract just that useful piece of information. An IP address is four blocks of 1 to 3 digits, delimited by '.'. We can match one block with [0-9]\{1,3\} (the braces are escaped because grep uses basic regular expressions by default), then join four blocks together with escaped dots, since an unescaped '.' would match any character. Grep also has a useful option, -o, which makes it print only the matched text instead of the entire line. Using this we can construct our scraping command:
$ ./sshscan; cat output | grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}"
###.###.1.1
###.###.254.254
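
If the backslash-escaped braces look noisy, grep's -E flag switches to extended regular expressions, where the braces and grouping don't need escaping; this is just an equivalent way of writing the same pattern:

$ grep -oE "([0-9]{1,3}\.){3}[0-9]{1,3}" output

It prints the same list of IPs as the command above.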



Now we're rolling! The next step is to find each host's name. To do this we will use nslookup. Here is some sample output from an nslookup:
$ nslookup ###.###.1.1
Server:	###.###.32.100
Address:	###.###.32.100#53

1.1.###.###.in-addr.arpa	name = server1.cs.university.edu.au.



The output has quite a bit of information: the server at ###.###.32.100 is the DNS server that answered the request, on port 53. The host name of our IP is "server1.cs.university.edu.au", which is the piece of information we want.

Time to scrape this output and extract the host name. First let's isolate the relevant line; we can find it simply by searching for "name = ", so we will just use grep "name = ". Now that we have the line 1.1.###.###.in-addr.arpa name = server1.cs.university.edu.au. we need to pull out only the host name, and for that we will use sed. The host name is preceded by "name = " and followed by a trailing '.', so a fairly simple sed pattern does the job: s/.*name = \(.*\)\./\1/, which means "replace everything up to and including 'name = ', followed by a substring, followed by a final '.', with just that substring". Running the whole pipeline now gives us:
$ ./sshscan; cat output | grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" | nslookup | grep "name = " | sed "s/.*name = \(.*\)\./\1/"
server1.cs.university.edu.au
server2.cs.university.edu.au



Now we can add all that to the sshscan script we wrote, or alias it so we don’t have to type all that again!
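
Here is a rough sketch of what the combined sshscan could look like; this is just my reconstruction of the pipeline we built above, with the intermediate output file dropped in favour of piping straight through:

$ cat sshscan
# sweep the subnet for port 22, then turn each responding IP into a host name
for i in {0..254}
do
	for j in {0..254}
	do
		nc -z ###.###.$i.$j 22
	done
done |
	grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" |
	nslookup |
	grep "name = " |
	sed "s/.*name = \(.*\)\./\1/"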

We can see a certain pattern emerging: first generate text, isolate the lines we need (if needed), then reformat the output. In our case we scraped twice; the second scrape was part of reformatting the output.
nc # generate text
grep # reformat the output so we can input it to nslookup
nslookup # generate more text
grep # isolate the useful lines
sed # reformat the output again



Our final output is easily human-readable, and is in a format that could be piped into another command, for example nmap, if need be.
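
As a quick illustration (this is a hypothetical follow-up, assuming the combined sshscan sketched above and that nmap is installed), you could hand the host names straight to nmap with xargs:

$ ./sshscan | xargs nmap -p 22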


Scraping anything else is really no different; it doesn't matter whether you generate your text with lynx or a browser or anything else. You don't even need to use grep: you could use sed or awk, or Java, C, Python, or PHP. The steps and theory remain the same, you won't find scraping where you have no input, or where nothing gets scraped! Often the scraped input will already be in the form you need and you won't have to reformat it (for example, if we had only wanted the IPs).

I hope you learnt something, or at least were inspired to automate some boring tasks, perhaps automate some integration tests or use watch, lynx and diff to check a blog for updates. It sure sounds like fun, doesn't it? :]
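
As a parting sketch of that last idea, here is roughly how a blog check could look; the URL, file names, and interval are all placeholders:

$ lynx -dump "http://example.com/blog" > old.txt
$ watch -n 600 'lynx -dump "http://example.com/blog" > new.txt; diff old.txt new.txt'

watch reruns the quoted command every 600 seconds, so whenever the page changes the diff against your original snapshot shows up on screen.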

Replies To: Introduction to Scraping

#2 sptechnolab

  • New D.I.C Head

Reputation: 0
  • Posts: 3
  • Joined: 16-May 13

Posted 24 August 2013 - 02:59 AM

Thank you for sharing this great piece of info. It will help me a lot.
