Welcome to Dream.In.Code
Website scraping and postback

 

Jayman
post 9 Jul, 2008 - 03:00 PM
Post #1




I have been given an interesting problem at work. We scrape the Dept. of Labor & Industries website to get information on contractors, which is then used to populate some fields in our web portal for our insurance agents. Recently, they changed their site. To access some of the data, you now have to click a link on the page, which causes a postback, and a table with the additional information then shows up.

The site is done in ASP.NET.

Now personally, I have never done any website scraping. Getting the HTML and parsing the data I need from it is not a problem. I am using the HttpWebRequest/HttpWebResponse objects to get the HTML.

The issue is how do I cause the postback to occur for a specific control on the page. I imagine ViewState will come into play. Since I have never done anything like this, I am not sure how to approach the problem.

Just looking for any suggestions on how to best approach this problem.

Thanks.

tody4me
post 10 Jul, 2008 - 07:35 AM
Post #2




You will need to store the credentials (if any) in the network stream and capture the value of the ViewState that is set. I wrote a small function a while back that captures the ViewState. I haven't done much with it, but if you would like, I'll post it as a snippet here. I don't know how reliable it is, as I have only tested it on one site.
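
For illustration, here is a minimal sketch of what a ViewState-capturing helper could look like. It is not the function mentioned above, which was never posted. The class name `HiddenFieldReader` and the method `Read` are hypothetical, and the sketch assumes the page writes its hidden fields as `<input>` tags with double-quoted attributes, as ASP.NET normally does.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class HiddenFieldReader
{
    // Returns the value of the hidden <input> whose id or name equals
    // fieldName, or null when the field is not present in the HTML.
    public static string Read(string html, string fieldName)
    {
        // Locate the whole <input ...> tag first, so attribute order does not matter.
        string tagPattern = "<input\\b[^>]*\\b(?:id|name)=\""
            + Regex.Escape(fieldName) + "\"[^>]*>";
        Match tag = Regex.Match(html, tagPattern, RegexOptions.IgnoreCase);
        if (!tag.Success)
        {
            return null;
        }

        // Then pull the value attribute out of that one tag.
        Match value = Regex.Match(tag.Value, "\\bvalue=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        if (!value.Success)
        {
            return null;
        }

        // Attribute values are HTML-encoded in the page source.
        return WebUtility.HtmlDecode(value.Groups[1].Value);
    }
}
```

Calling `HiddenFieldReader.Read(html, "__VIEWSTATE")` on the downloaded page would return the raw ViewState string, which still has to be URL-encoded before it is posted back.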

This post has been edited by tody4me: 10 Jul, 2008 - 07:35 AM

Jayman
post 10 Jul, 2008 - 03:18 PM
Post #3




Thanks for the reply. The site does use HTTPS; however, it doesn't require a person to log in. It is just over a secure connection.

I have managed to be partially successful at accomplishing this task. By that, I mean that if I open a browser, visit the page, get the source code, manually copy the data from __EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE, and __EVENTVALIDATION into my project, and then execute a POST request with that data, everything works great and the returned HTML has all the data I want.

However, if I try to do it all in code (initiate the first request, get the HTML, parse the needed data, and initiate a POST request with the __EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE, and __EVENTVALIDATION data), then it does not work.

I have debugged the code and compared the POST data from the manual operation with the POST data from the automated operation and they match character for character. Yet, it still doesn't work.
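
As a point of comparison, the POST data described here can be assembled as in the sketch below. This is an illustration, not the code from this post: the class `PostbackBody` is hypothetical, and it uses `WebUtility.UrlEncode` (available from .NET Framework 4.5) in place of hand-written replacements, on the assumption that a newer framework is available.

```csharp
using System;
using System.Net;

// Hypothetical helper for illustration only.
public static class PostbackBody
{
    // Builds an application/x-www-form-urlencoded body for an ASP.NET postback.
    // Each value is percent-encoded, so Base64 characters such as '/', '+'
    // and '=' survive the trip to the server unchanged.
    public static string Build(string eventTarget, string eventArgument,
                               string viewState, string eventValidation)
    {
        return "__EVENTTARGET=" + WebUtility.UrlEncode(eventTarget)
            + "&__EVENTARGUMENT=" + WebUtility.UrlEncode(eventArgument)
            + "&__VIEWSTATE=" + WebUtility.UrlEncode(viewState)
            + "&__EVENTVALIDATION=" + WebUtility.UrlEncode(eventValidation);
    }
}
```

Encoding every value with a library call avoids missing a reserved character that a fixed list of replacements does not cover.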

I am getting data back, but it is the same HTML that I got the first time.

I am certain I must be missing one small element when doing it all in code. If it works one way then it should work the other, right?

Perhaps someone can see where I am going wrong.
csharp

protected void btnSubmit_Click(object sender, EventArgs e)
{
    txtEventTarget.Text = "";
    txtEventArgument.Text = "";
    txtViewState.Text = "";
    txtEventValidation.Text = "";

    string url = "https://fortress.wa.gov/lni/bbip/Detail.aspx?License=" + txtLicense.Text;
    string html = "";

    html = GetResponse(url);

    txtResponse1.Text = html;

    string[] data = ParseHTML(html);

    txtEventTarget.Text = data[0];
    txtEventArgument.Text = data[1];

    txtResponse2.Text = PostRequest(url, data);
}

private string PostRequest(string url, string[] args)
{
    ASCIIEncoding encoding = new ASCIIEncoding();
    HttpWebRequest request = null;

    string postData = "__EVENTTARGET=" + args[0] + "&__EVENTARGUMENT=" + args[1];
    postData += "&__VIEWSTATE=" + args[2] + "&__EVENTVALIDATION=" + args[3];

    txtPostBack.Text = postData;

    byte[] data = encoding.GetBytes(postData);

    request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.ContentLength = data.Length;
    request.Referer = url;

    Stream newStream = request.GetRequestStream();
    // Send the data.
    try
    {
        newStream.Write(data, 0, data.Length);
        newStream.Close();
    }
    catch (Exception ex)
    {
        Response.Write(ex.StackTrace);
    }
    finally
    {
        newStream.Close();
    }

    return GetResponse(url);
}

private string GetResponse(string url)
{
    StringBuilder sb = new StringBuilder();
    Stream resStream = null;
    HttpWebRequest request = null;
    HttpWebResponse response = null;
    byte[] buf = new byte[8192];

    request = (HttpWebRequest)WebRequest.Create(url);
    //request.Method = "GET";
    try
    {
        // execute the request
        response = (HttpWebResponse)request.GetResponse();

        // we will read data via the response stream
        resStream = response.GetResponseStream();
        string tempString = null;
        int count = 0;
        do
        {
            // fill the buffer with data
            count = resStream.Read(buf, 0, buf.Length);
            // make sure we read some data
            if (count != 0)
            {
                // translate from bytes to ASCII text
                tempString = Encoding.ASCII.GetString(buf, 0, count);
                // continue building the string
                sb.Append(tempString);
            }
        }
        while (count > 0); // any more data to read?
    }
    catch (Exception err)
    {
        String exc = err.Message;
    }
    finally
    {
        response.Close();
    }

    return sb.ToString();
}

private string[] ParseHTML(string html)
{
    string[] data = new string[4];
    string value = "";
    string temp = "";
    Match match;

    //Set the EVENTTARGET control
    data[0] = "lnkAll";

    //Set the EVENTARGUMENT, should be an empty string
    data[1] = "";

    //get the ViewState data
    Regex regex = new Regex("id=\"__VIEWSTATE\" value=\"/[a-zA-Z0-9\\W]+\"\\s/>");
    match = regex.Match(html);
    value = match.Value;
    temp = value.Remove(value.IndexOf("id"), 24);
    temp = temp.Remove(temp.LastIndexOf("\""), 4);
    txtViewState.Text = temp;

    temp = temp.Replace("/", "%2F");
    temp = temp.Replace("+", "%2B");
    temp = temp.Replace("=", "%3D");
    data[2] = temp;

    //get the EVENTVALIDATION data
    regex = new Regex("id=\"__EVENTVALIDATION\" value=\"/[a-zA-Z0-9\\W]+\"\\s/>");
    match = regex.Match(html);
    value = match.Value;
    temp = value.Remove(value.IndexOf("id"), 30);
    temp = temp.Remove(temp.LastIndexOf("\""), 4);
    txtEventValidation.Text = temp;

    temp = temp.Replace("/", "%2F");
    temp = temp.Replace("+", "%2B");
    temp = temp.Replace("=", "%3D");
    data[3] = temp;

    return data;
}


PsychoCoder
post 10 Jul, 2008 - 04:10 PM
Post #4




Jay,

I'm kind of working on the same thing here, so maybe we can help each other get through this. In 2.0, try using the WebClient class to download the data from the target website:


csharp

/// <summary>
/// method for retrieving information from a specified URL
/// using the WebClient class
/// </summary>
/// <param name="url">url to retrieve data from</param>
/// <returns>the downloaded content decoded as UTF-8 text</returns>
public string LoadeSiteContent(string url)
{
    //create a new WebClient object and dispose of it when done
    using (WebClient client = new WebClient())
    {
        //create a byte array for holding the returned data
        byte[] html = client.DownloadData(url);

        //use the UTF8Encoding object to convert the byte
        //array into a string
        UTF8Encoding utf = new UTF8Encoding();

        //return the converted string
        return utf.GetString(html);
    }
}


I have found this to be a bit more efficient than the HttpWebRequest. If there are links you need to follow once you have the data, you can extract them using regular expressions (I add mine to an ArrayList), like so:


csharp

/// <summary>
/// method for extracting all URLs from the data being
/// passed to the method. The data being passed will be all
/// the data from a provided string
/// </summary>
/// <param name="str">text to search for URLs</param>
/// <returns>the URLs found, or null if an error occurred</returns>
public ArrayList ExtractLinks(string str)
{
    try
    {
        //ArrayList to hold all the links
        ArrayList linksList = new ArrayList();

        //regular expression pattern for finding links; it is not anchored
        //with ^ and $, so it can match URLs anywhere inside the text
        string pattern =
            @"(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)?((25[0-5]|"
            + @"2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|"
            + @"0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}"
            + @"[0-9]{1}|[0-9])|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.[a-zA-Z]{2,4})(\:[0-9]+)?(/[^/][a-zA-Z0-9\.\,\?\'\\/\+&%\$#\=~_\-@]*)*";

        //create a new RegEx object
        Regex reg = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

        //put all the matches into a MatchCollection
        MatchCollection matches = reg.Matches(str);

        //loop through all the matches
        foreach (Match match in matches)
        {
            //add the whole matched URL to the list, not its sub-groups
            linksList.Add(match.Value);
        }

        //now return the populated ArrayList
        return linksList;
    }
    catch (Exception)
    {
        return null;
    }
}



I'm going to be doing some more work on this tonight, so I'll post what I come up with next :)

Jayman
post 10 Jul, 2008 - 05:19 PM
Post #5




Thanks for the info, Rich. However, in my case the page just posts back to itself. So I don't need to get the links.

I guess I just needed to step away from it for a while. After I got home from work, I started playing with the test project I had set up and emailed to myself from work.

After 5 minutes, I finally determined what the problem was. I was using two different HttpWebRequest objects to grab the data. I decided to just use one by making it a class level object and it worked fantastically.
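
Reduced to its core, the fix described above is that the response must be read from the same request object that carried the POST data. The sketch below shows that idea in isolation. The class `PostbackClient` and its method names are hypothetical, and decoding the reply as UTF-8 is an assumption about the target site. On current .NET versions `HttpClient` is the recommended API, but the sketch stays with `HttpWebRequest` to match this thread.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper for illustration only.
public static class PostbackClient
{
    // Configures a form POST; nothing is sent over the network yet.
    public static HttpWebRequest CreatePost(string url, int bodyLength)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = bodyLength;
        return request;
    }

    // Writes the body, then reads the reply from the SAME request object.
    // Creating a second request for the read would issue a plain GET and
    // return the original page again.
    public static string Send(string url, string postData)
    {
        byte[] body = Encoding.ASCII.GetBytes(postData);
        HttpWebRequest request = CreatePost(url, body.Length);

        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(body, 0, body.Length);
        }

        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The `using` blocks close the streams and the response even when an exception is thrown, so no explicit `Close` calls are needed.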

If it is of any help, here is the code that worked.

csharp

using System;
using System.Data;
using System.Configuration;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using System.Net;
using System.Text;
using System.IO;
using System.Xml;
using System.Text.RegularExpressions;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {

    }

    protected void btnSubmit_Click(object sender, EventArgs e)
    {
        txtEventTarget.Text = "";
        txtEventArgument.Text = "";
        txtViewState.Text = "";
        txtEventValidation.Text = "";

        string url = "https://fortress.wa.gov/lni/bbip/Detail.aspx?License=" + txtLicense.Text;
        string html = "";

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        html = GetResponse(url, ref request);

        txtResponse1.Text = html;

        string[] data = ParseHTML(html);

        txtEventTarget.Text = data[0];
        txtEventArgument.Text = data[1];

        txtResponse2.Text = PostRequest(url, data, ref request);
    }

    private string PostRequest(string url, string[] args, ref HttpWebRequest request)
    {
        ASCIIEncoding encoding = new ASCIIEncoding();

        string postData = "__EVENTTARGET=" + args[0] + "&__EVENTARGUMENT=" + args[1];
        postData += "&__VIEWSTATE=" + args[2] + "&__EVENTVALIDATION=" + args[3];

        txtPostBack.Text = postData;

        byte[] data = encoding.GetBytes(postData);

        // Create the POST request; GetResponse below reads the reply from this same object.
        request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = data.Length;
        request.Referer = url;

        Stream newStream = request.GetRequestStream();
        // Send the data.
        try
        {
            newStream.Write(data, 0, data.Length);
        }
        catch (Exception ex)
        {
            Response.Write(ex.StackTrace);
        }
        finally
        {
            // Closing the stream once here covers both success and failure.
            newStream.Close();
        }

        return GetResponse(url, ref request);
    }

    private string GetResponse(string url, ref HttpWebRequest request)
    {
        StringBuilder sb = new StringBuilder();
        Stream resStream = null;
        HttpWebResponse response = null;
        byte[] buf = new byte[8192];

        try
        {
            // execute the request
            response = (HttpWebResponse)request.GetResponse();

            // we will read data via the response stream
            resStream = response.GetResponseStream();
            string tempString = null;
            int count = 0;
            do
            {
                // fill the buffer with data
                count = resStream.Read(buf, 0, buf.Length);
                // make sure we read some data
                if (count != 0)
                {
                    // translate from bytes to ASCII text
                    tempString = Encoding.ASCII.GetString(buf, 0, count);
                    // continue building the string
                    sb.Append(tempString);
                }
            }
            while (count > 0); // any more data to read?
        }
        catch (Exception err)
        {
            String exc = err.Message;
        }
        finally
        {
            // either object is still null if the request itself failed
            if (resStream != null)
            {
                resStream.Close();
            }
            if (response != null)
            {
                response.Close();
            }
        }

        return sb.ToString();
    }

    private string[] ParseHTML(string html)
    {
        string[] data = new string[4];
        string value = "";
        string temp = "";
        Match match;

        //Set the EVENTTARGET control
        data[0] = "lnkAll";

        //Set the EVENTARGUMENT, should be an empty string
        data[1] = "";

        //get the ViewState data
        Regex regex = new Regex("id=\"__VIEWSTATE\" value=\"/[a-zA-Z0-9\\W]+\"\\s/>");
        match = regex.Match(html);
        value = match.Value;
        temp = value.Remove(value.IndexOf("id"), 24);
        temp = temp.Remove(temp.LastIndexOf("\""), 4);
        txtViewState.Text = temp;

        temp = temp.Replace("/", "%2F");
        temp = temp.Replace("+", "%2B");
        temp = temp.Replace("=", "%3D");
        data[2] = temp;

        //get the EVENTVALIDATION data
        regex = new Regex("id=\"__EVENTVALIDATION\" value=\"/[a-zA-Z0-9\\W]+\"\\s/>");
        match = regex.Match(html);
        value = match.Value;
        temp = value.Remove(value.IndexOf("id"), 30);
        temp = temp.Remove(temp.LastIndexOf("\""), 4);
        txtEventValidation.Text = temp;

        temp = temp.Replace("/", "%2F");
        temp = temp.Replace("+", "%2B");
        temp = temp.Replace("=", "%3D");
        data[3] = temp;

        return data;
    }

}

baavgai
post 11 Jul, 2008 - 03:34 AM
Post #6




Cool, thanks for sharing.

I was going to reply with some code I've used before for scraping, but then saw the ASP.NET requirement. Ugly stuff. Note, if your ASP.NET pages are XHTML compliant, your parser would be simpler if you can treat the page as an XML document. I keep thinking there must be a way to use System.Web to do the heavy lifting of page inspection, but I haven't found it yet.
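
To make the XML idea concrete, here is a minimal sketch, assuming the page really is well-formed XHTML with no DOCTYPE that needs resolving. The class `XhtmlFieldReader` is hypothetical, and `fieldId` is assumed to contain no quote characters because it is placed directly into the XPath expression.

```csharp
using System;
using System.Xml;

// Hypothetical helper for illustration only.
public static class XhtmlFieldReader
{
    // Returns the value attribute of the <input> with the given id,
    // or null when no such element exists.
    public static string Read(string xhtml, string fieldId)
    {
        XmlDocument document = new XmlDocument();
        // LoadXml throws XmlException if the markup is not well-formed.
        document.LoadXml(xhtml);

        // local-name() matches the element whether or not the page
        // declares the XHTML namespace on its root element.
        string xpath = "//*[local-name()='input'][@id='" + fieldId + "']/@value";
        XmlNode attribute = document.SelectSingleNode(xpath);
        return attribute == null ? null : attribute.Value;
    }
}
```

With this approach the `__VIEWSTATE` and `__EVENTVALIDATION` values come straight from attribute lookups, with no regular expressions or character offsets involved.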

killnine
post 11 Jul, 2008 - 06:44 AM
Post #7




So how complex does it get trying to read things from sales web pages like newegg and such? I was thinking of playing around with this but think it might be way over my head.

Jayman
post 11 Jul, 2008 - 10:46 AM
Post #8




FYI, I updated the code I posted to get away from using a global variable, instead just passing it by reference into the methods that need it.

QUOTE(baavgai @ 11 Jul, 2008 - 04:34 AM)

Note, if your ASP.NET pages are XHTML compliant, your parser would be simpler if you can treat the page as an XML document. I keep thinking there must be a way to use System.Web to do the heavy lifting of page inspection, but I haven't found it yet.

Unfortunately, the site is not XHTML compliant, so using the XmlDocument class with XPath to get the data was not an option. I also wish there was a better way of parsing data from HTML, something similar to XmlDocument, but there isn't. At least not in the standard library, anyway.

QUOTE
So how complex does it get trying to read things from sales web pages like newegg and such? I was thinking of playing around with this but think it might be way over my head.


It really isn't that hard. Think of the page as one big long string of text from which you are simply extracting the data that you need. There are plenty of ways to extract data from a string. As you can see, I chose to use Regex to get the data I needed, but there are many different ways of doing it.
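
As one example of a non-Regex approach, the sketch below pulls out the text between two known markers using only `IndexOf` and `Substring`. The class `TextSlicer` and the sample markup in the usage note are made up for illustration and do not come from any real shopping site.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class TextSlicer
{
    // Returns the text found between startMarker and the next endMarker,
    // or null when either marker is missing.
    public static string Between(string text, string startMarker, string endMarker)
    {
        int start = text.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0)
        {
            return null;
        }
        start += startMarker.Length;

        // Search for the end marker only after the start marker.
        int end = text.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0)
        {
            return null;
        }

        return text.Substring(start, end - start);
    }
}
```

Given a fragment such as `<td class="price">$19.99</td>`, calling `TextSlicer.Between(html, "<td class=\"price\">", "</td>")` returns the price text. The markers are tied to the page layout, so they have to be updated whenever the site's markup changes.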

What sucks is when they change their site design and then you have to go back and change your code to work with the new design. :(