I have been given an interesting problem at work. We scrape the Dept. of Labor & Industries website to get information on contractors, which is then used to populate some fields in our web portal for our insurance agents. Recently, they changed their site. To access some of the data you now have to click a link on the page, which causes a postback, and a table with the additional information shows up.
The site is done in ASP.NET.
Now personally, I have never done any website scraping. Getting the HTML and parsing the data I need from it is not a problem. I am using the HttpWebRequest/HttpWebResponse objects to get the HTML.
The issue is how to cause the postback to occur for a specific control on the page. I imagine ViewState will come into play. Since I have never done anything like this, I am not sure how to approach the problem.
Just looking for any suggestions on how to best approach this problem.
You will need to store the credentials (if any) into the network stream and catch the value of the viewstate that is set. I have a small function that catches the viewstate that I wrote a while back, haven't done much with it, but if you would like I'll post it as a snippet here. Don't know how reliable it is either as I have only tested it on one site.
Thanks for the reply. The site does use HTTPS, however it doesn't require a person to login. It is just over a secure connection.
I have managed to be partially successful at accomplishing this task. By that, I mean that if I open a browser and visit the page, get the source code, and then manually copy the data from EVENTTARGET, EVENTARGUMENT, VIEWSTATE, and EVENTVALIDATION into my project and then execute a POST request with that data then everything works great and the returned HTML has all the data I want.
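The POST data described above can be sketched roughly as follows. This is only an illustration, not the poster's code: the class name `PostbackForm` and method `Build` are hypothetical, and the sketch assumes the page uses the standard ASP.NET hidden field names `__EVENTTARGET`, `__EVENTARGUMENT`, `__VIEWSTATE`, and `__EVENTVALIDATION`. The values are URL-encoded because Base64 ViewState data usually contains characters such as `+`, `/`, and `=`.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class PostbackForm
{
    // Builds an application/x-www-form-urlencoded body for a WebForms postback.
    // All four values are assumed to have been read from the page beforehand.
    public static string Build(string eventTarget, string eventArgument,
                               string viewState, string eventValidation)
    {
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("__EVENTTARGET", eventTarget),
            new KeyValuePair<string, string>("__EVENTARGUMENT", eventArgument),
            new KeyValuePair<string, string>("__VIEWSTATE", viewState),
            new KeyValuePair<string, string>("__EVENTVALIDATION", eventValidation)
        };

        var sb = new StringBuilder();
        foreach (var field in fields)
        {
            if (sb.Length > 0)
            {
                sb.Append('&');
            }
            // Encode both key and value; a missing value becomes an empty string.
            sb.Append(WebUtility.UrlEncode(field.Key));
            sb.Append('=');
            sb.Append(WebUtility.UrlEncode(field.Value ?? ""));
        }
        return sb.ToString();
    }
}
```

The resulting string would be written to the request stream of a POST request whose content type is `application/x-www-form-urlencoded`.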
However, if I try to do it all in code (initiate the first request, get the HTML, parse the needed data, then initiate a POST request with the EVENTTARGET, EVENTARGUMENT, VIEWSTATE, and EVENTVALIDATION data), it does not work.
I have debugged the code and compared the POST data from the manual operation with the POST data from the automated operation and they match character for character. Yet, it still doesn't work.
I am getting data back, but it is the same HTML that I got the first time.
I am certain I must be missing one small element when doing it all in code. If it works one way then it should work the other, right?
        // we will read data via the response stream
        resStream = response.GetResponseStream();
        string tempString = null;
        int count = 0;
        do
        {
            // fill the buffer with data
            count = resStream.Read(buf, 0, buf.Length);
            // make sure we read some data
            if (count != 0)
            {
                // translate from bytes to ASCII text
                tempString = Encoding.ASCII.GetString(buf, 0, count);
                // continue building the string
                sb.Append(tempString);
            }
        } while (count > 0); // any more data to read?
    }
    catch (Exception err)
    {
        String exc = err.Message;
    }
    finally
    {
        response.Close();
    }

    return sb.ToString();
}

private string[] ParseHTML(string html)
{
    string[] data = new string[4];
    string value = "";
    string temp = "";
    Match match;

    //Set the EVENTTARGET control
    data[0] = "lnkAll";

    //Set the EVENTARGUMENT, should be an empty string
    data[1] = "";

    //get the ViewState data
    Regex regex = new Regex("id=\"__VIEWSTATE\" value=\"/[a-zA-Z0-9\\W]+\"\\s/>");
    match = regex.Match(html);
    value = match.Value;
    temp = value.Remove(value.IndexOf("id"), 24);
    temp = temp.Remove(temp.LastIndexOf("\""), 4);
    txtViewState.Text = temp;
I'm kind of working on the same thing here, so maybe we can help each other get through this. In 2.0, try using the WebClient class to download the data from the target website:
csharp
/// <summary>
/// method for retrieving information from a specified URL
/// using the new WebClient Class in .Net 2.0
/// </summary>
/// <param name="url">url to retrieve data from</param>
/// <returns></returns>
public string LoadeSiteContent(string url)
{
    //create a new WebClient object
    WebClient client = new WebClient();

    //create a byte array for holding the returned data
    byte[] html = client.DownloadData(url);

    //use the UTF8Encoding object to convert the byte
    //array into a string
    UTF8Encoding utf = new UTF8Encoding();

    //return the converted string
    return utf.GetString(html);
}
I have found this to be a bit more convenient than working with HttpWebRequest directly, since it takes less code. If there are links you need to follow once you have the data, you can extract those using regular expressions (I add mine to an ArrayList) like so:
csharp
/// <summary>
/// method for extracting all URL's from the data being
/// passed to the method. The data being passed will be all
/// the data from a provided string
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
public ArrayList ExtractLinks(string str)
{
    try
    {
        //ArrayList to hold all the links
        ArrayList linksList = new ArrayList();
Thanks for the info, Rich. However, in my case the page just posts back to itself. So I don't need to get the links.
I guess I just needed to step away from it for a while. After I got home from work, I started playing with the test project I had set up and emailed to myself from work.
After 5 minutes, I finally determined what the problem was. I was using two different HttpWebRequest objects to grab the data. I decided to just use one by making it a class-level object, and it worked fantastically.
If it is of any help, here is the code that worked.
csharp
using System;
using System.Data;
using System.Configuration;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using System.Net;
using System.Text;
using System.IO;
using System.Xml;
using System.Text.RegularExpressions;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {

        try
        {
            // execute the request
            response = (HttpWebResponse)request.GetResponse();

            // we will read data via the response stream
            resStream = response.GetResponseStream();
            string tempString = null;
            int count = 0;
            do
            {
                // fill the buffer with data
                count = resStream.Read(buf, 0, buf.Length);
                // make sure we read some data
                if (count != 0)
                {
                    // translate from bytes to ASCII text
                    tempString = Encoding.ASCII.GetString(buf, 0, count);
                    // continue building the string
                    sb.Append(tempString);
                }
            } while (count > 0); // any more data to read?
        }
        catch (Exception err)
        {
            String exc = err.Message;
        }
        finally
        {
            response.Close();
            resStream.Close();
        }

        return sb.ToString();
    }

    private string[] ParseHTML(string html)
    {
        string[] data = new string[4];
        string value = "";
        string temp = "";
        Match match;

        //Set the EVENTTARGET control
        data[0] = "lnkAll";

        //Set the EVENTARGUMENT, should be an empty string
        data[1] = "";

        //get the ViewState data
        Regex regex = new Regex("id=\"__VIEWSTATE\" value=\"/[a-zA-Z0-9\\W]+\"\\s/>");
        match = regex.Match(html);
        value = match.Value;
        temp = value.Remove(value.IndexOf("id"), 24);
        temp = temp.Remove(temp.LastIndexOf("\""), 4);
        txtViewState.Text = temp;
I was going to reply with some code I've used before for scraping, but then saw the ASP.NET requirement. Ugly stuff. Note, if your ASP.NET pages are XHTML compliant, your parser would be simpler if you can treat the page as an XML document. I keep thinking there must be a way to use System.Web to do the heavy lifting of page inspection, but I haven't found it yet.
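To illustrate the XML idea, here is a minimal sketch that reads a hidden input's value with XmlDocument and XPath. It assumes the page is well-formed XHTML with no undeclared entities; the class name `XhtmlFields` and method `ReadInputValue` are hypothetical. The `local-name()` test is used so the query works whether or not the document declares the XHTML default namespace.

```csharp
using System;
using System.Xml;

public static class XhtmlFields
{
    // Returns the value attribute of the first <input> with the given name,
    // or null if no such element (or no value attribute) exists.
    // Assumes 'name' contains no single quotes, since it is placed inside
    // an XPath string literal.
    public static string ReadInputValue(string xhtml, string name)
    {
        var doc = new XmlDocument();
        doc.XmlResolver = null; // do not fetch external DTDs over the network
        doc.LoadXml(xhtml);     // throws XmlException if the markup is not well-formed

        XmlNode node = doc.SelectSingleNode(
            "//*[local-name()='input'][@name='" + name + "']");
        if (node == null || node.Attributes == null)
        {
            return null;
        }

        XmlAttribute valueAttribute = node.Attributes["value"];
        return valueAttribute == null ? null : valueAttribute.Value;
    }
}
```

If the markup is not well-formed, `LoadXml` throws, so this approach only helps with pages that really are valid XHTML.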
So how complex does it get trying to read things from sales web pages like newegg and such? I was thinking of playing around with this but think it might be way over my head.
FYI, I updated the code I posted to get away from using a global variable, instead just passing it by reference into the methods that need it.
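For comparison, here is a sketch of a related but different technique: passing a shared CookieContainer into helper methods so that the GET and the later POST carry the same cookies. This is not the code from this thread, and whether cookies were the reason the two-request version failed is an assumption, not something confirmed here. The names `ScrapeSession`, `CreateGet`, `CreatePost`, and `Send` are hypothetical.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ScrapeSession
{
    // Creates a GET request that stores and sends cookies via the shared container.
    public static HttpWebRequest CreateGet(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.CookieContainer = cookies;
        return request;
    }

    // Creates a form POST request that reuses the same cookie container.
    public static HttpWebRequest CreatePost(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;
        return request;
    }

    // Sends the request (writing formBody first when it is not null)
    // and returns the response body as text, decoded as UTF-8.
    public static string Send(HttpWebRequest request, string formBody)
    {
        if (formBody != null)
        {
            // A URL-encoded body contains only ASCII characters.
            byte[] payload = Encoding.ASCII.GetBytes(formBody);
            request.ContentLength = payload.Length;
            using (Stream body = request.GetRequestStream())
            {
                body.Write(payload, 0, payload.Length);
            }
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A caller would create one `CookieContainer`, call `Send(CreateGet(url, cookies), null)` to fetch the page, parse the hidden fields from the result, and then call `Send(CreatePost(url, cookies), body)` with the encoded form data.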
QUOTE(baavgai @ 11 Jul, 2008 - 04:34 AM)
Note, if your ASP.NET pages are XHTML compliant, your parser would be simpler if you can treat the page as an XML document. I keep thinking there must be a way to use System.Web to do the heavy lifting of page inspection, but I haven't found it yet.
Unfortunately, the site is not XHTML compliant, so using the XmlDocument class with XPath to get the data was not an option. I also wish there were a better way of parsing data from HTML, something similar to XmlDocument, but there isn't. At least not in the standard library, anyway.
QUOTE
So how complex does it get trying to read things from sales web pages like newegg and such? I was thinking of playing around with this but think it might be way over my head.
It really isn't that hard. Think of the page as one big long string of text from which you are simply extracting the data you need. There are plenty of ways to extract data from a string. As you can see, I chose to use Regex to get the data I needed, but there are many different ways of doing it.
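As one example of the string-extraction approach, here is a small sketch that pulls a hidden field's value out of an HTML string with a regular expression. It is not the code from this thread; the class name `HiddenFields` and method `Extract` are hypothetical, and the pattern assumes the `name` attribute appears before the `value` attribute inside the `<input>` tag and that attributes use double quotes.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HiddenFields
{
    // Returns the HTML-decoded value of the named input, or null if not found.
    public static string Extract(string html, string name)
    {
        // Match an <input> tag whose name attribute equals 'name' and
        // capture the contents of the value attribute that follows it.
        string pattern = "<input[^>]*\\bname=\"" + Regex.Escape(name) +
                         "\"[^>]*\\bvalue=\"(?<value>[^\"]*)\"";

        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return match.Success
            ? WebUtility.HtmlDecode(match.Groups["value"].Value)
            : null;
    }
}
```

Using a named capture group avoids the fixed-offset `Remove` calls, so the extraction does not depend on the exact length of the surrounding markup.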
What sucks is when they change their site design and then you have to go back and change your code to work with the new design.