3 Replies - 13832 Views - Last Post: 01 September 2008 - 10:09 AM

#1 FunkiMunky

how to get data from HTML page

Posted 01 September 2008 - 07:47 AM

I am trying to read data from an HTML page. The section that holds the data is:

<!-- begin content -->  <div class="box">
	<h2 class="title">Search results - [ <i>2997 businesses found </i>]</h2>
	<div class="content"><ul class="search-data">
<li><a href="?q=node/593">124 Facilities</a></li>
<li><a href="?q=node/597">2-0-2 Media</a></li>
<li><a href="?q=node/199">2.35 Research PLC</a></li>
<li><a href="?q=node/598">24-6 Cine & TV Services</a></li>
<li><a href="?q=node/599">27 Records</a></li>
<li><a href="?q=node/3029">2b Media Services</a></li>
<li><a href="?q=node/600">3 Bear Animations</a></li>
<li><a href="?q=node/6420">3-D Revolution Productions</a></li>
<li><a href="?q=node/580">3-D Revolution Productions</a></li>
<li><a href="?q=node/287">365Digital</a></li>
<li><a href="?q=node/601">3D Creations</a></li>
<li><a href="?q=node/603">3D Imaging</a></li>
<li><a href="?q=node/605">3D Jamie</a></li>
<li><a href="?q=node/7571">3D Orangepanda Digital Media</a></li>
<li><a href="?q=node/607">3DD Entertainment Ltd</a></li>
<li><a href="?q=node/289">3Dlabs</a></li>
<li><a href="?q=node/5846">3DRequest™</a></li>
<li><a href="?q=node/591">3p Underground Media UK Ltd</a></li>
<li><a href="?q=node/608">3rd Eye Broadcast Group</a></li>
<li><a href="?q=node/609">3rd Wave Graphics</a></li>
<li><a href="?q=node/610">3Sixty Media</a></li>
<li><a href="?q=node/613">422 South (Bristol)</a></li>
<li><a href="?q=node/612">422 South (Manchester)</a></li>
<li><a href="?q=node/310">7 Star Web Services</a></li>
<li><a href="?q=node/614">750mph</a></li>
<li><a href="?q=node/7197">A Bright Gem</a></li>
<li><a href="?q=node/582">A Double M Productions Ltd</a></li>
<li><a href="?q=node/615">A M Visualisation Ltd</a></li>
<li><a href="?q=node/616">A Productions</a></li>
<li><a href="?q=node/618">A Works TV Ltd</a></li>
<li><a href="?q=node/619">A. J. Murray</a></li>
<li><a href="?q=node/620">A.D. Modelmaking</a></li>
<li><a href="?q=node/621">A1 Vox Ltd</a></li>
<li><a href="?q=node/622">AAA 3D Imaging</a></li>
<li><a href="?q=node/65">Aardman Animations Ltd</a></li>
<li><a href="?q=node/625">Aardvark Swift Recruitment Ltd</a></li>
<li><a href="?q=node/626">AB Facility Vehicles</a></li>
<li><a href="?q=node/627">Abacus Film Productions Ltd</a></li>
<li><a href="?q=node/628">Abbey Home Media Group</a></li>
<li><a href="?q=node/629">About-Face Media Productions</a></li>
<li><a href="?q=node/630">Absolute Post</a></li>
<li><a href="?q=node/631">Absolute Studios</a></li>
<li><a href="?q=node/632">Absolutely Productions</a></li>
<li><a href="?q=node/633">Abstract Images</a></li>
<li><a href="?q=node/634">Acacia Productions Ltd</a></li>
<li><a href="?q=node/558">Academy</a></li>
<li><a href="?q=node/635">Academy Billiards</a></li>
<li><a href="?q=node/636">AccessMocap</a></li>
<li><a href="?q=node/637">Account - 4</a></li>
<li><a href="?q=node/638">ACE Accounting Ltd</a></li>
</ul>
</div>
 </div>

<div id="pager" class="container-inline"><div class="pager-first"> </div><div class="pager-previous"><div class="pager-first"> </div></div><div class="pager-list"><strong>1</strong> <div class="pager-next"><a href="?q=business/search_data&from=50">2</a></div> <div class="pager-next"><a href="?q=business/search_data&from=100">3</a></div> <div class="pager-next"><a href="?q=business/search_data&from=150">4</a></div> <div class="pager-next"><a href="?q=business/search_data&from=200">5</a></div> <div class="pager-next"><a href="?q=business/search_data&from=250">6</a></div> <div class="pager-next"><a href="?q=business/search_data&from=300">7</a></div> <div class="pager-next"><a href="?q=business/search_data&from=350">8</a></div> <div class="pager-next"><a href="?q=business/search_data&from=400">9</a></div> <div class="pager-list-dots-right">...</div></div><div class="pager-next"><a href="?q=business/search_data&from=50">next page</a></div><div class="pager-last"><a href="?q=business/search_data&from=2950">last page</a></div></div><!-- end content -->



As you can see, there are a number of div tags and, in particular, comments that read <!-- begin content --> and <!-- end content -->.
I want the hrefs and the href text. I have thought that maybe some straight string manipulation might do the job, splitting the text into parts. I have also been thinking that there might be a way to just get the ul HTML control directly.

Any help in the right direction would be appreciated.


Replies To: how to get data from HTML page

#2 Martyr2

Re: how to get data from HTML page

Posted 01 September 2008 - 08:07 AM

My recommendation would be to go with regular expressions. If you are unfamiliar with them, a quick search on the net will show you how to set up and execute a regular expression.

In case you don't know what they are, regular expressions are patterns for matching text that follows certain rules. For instance, in your example you have a bunch of links that follow the pattern <a href="?q=node/numberhere">text</a>. If I wanted to pull out all links from this page, I could set up a pattern that states: "Look for any text that starts with <a, followed by href="?q=node/, followed by some number, followed by the link text and a closing </a> tag."

The pattern may look something like this...

<a href="\?q=node/\d{1,5}">.*?</a>

You would then execute this using a regular expression object, and it would hand back a collection of matches, one for each time the pattern was found. In this instance it would return all of your <a> links.
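As a rough illustration of that step in VB.NET, the sketch below runs the pattern above through Regex.Matches. The module name PatternDemo and the function name FindNodeLinks are made up for this example, and the two list items are copied from the sample in the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PatternDemo
    ' Hypothetical helper: returns every substring of html that matches
    ' the link pattern suggested in the post.
    Function FindNodeLinks(ByVal html As String) As MatchCollection
        ' Inside a VB string literal, a double quote is written as two double quotes.
        Dim pattern As String = "<a href=""\?q=node/\d{1,5}"">.*?</a>"
        Return Regex.Matches(html, pattern)
    End Function

    Sub Main()
        ' Two list items taken from the sample HTML in the question.
        Dim html As String = "<li><a href=""?q=node/593"">124 Facilities</a></li>" & Environment.NewLine & _
                             "<li><a href=""?q=node/597"">2-0-2 Media</a></li>"

        ' Each Match holds one complete <a ...>...</a> element.
        For Each m As Match In FindNodeLinks(html)
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```

Each match is the whole <a> element, so the href and the link text still have to be separated afterwards.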

You can narrow or broaden the patterns to pull out any data you want. You could pull the whole <li> tag if you want or pull the whole div. Whatever you need. They are really useful and perfect for situations like this which have a bunch of data that matches a pattern.
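One possible way to broaden the pattern so that it returns the href and the link text separately is to use named groups. This is only a sketch: the names LinkTextDemo and ExtractLinkPairs are invented here, and the pattern assumes href is the first attribute of the tag and is wrapped in double quotes, as in the sample page.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkTextDemo
    ' Hypothetical helper: returns one (href, text) pair per <a href="...">...</a>.
    Function ExtractLinkPairs(ByVal html As String) As List(Of KeyValuePair(Of String, String))
        ' (?<url>...) and (?<text>...) are named groups.
        ' [^""]* in the VB literal is the regex [^"]*, i.e. everything up to the closing quote.
        Dim pattern As String = "<a\s+href=""(?<url>[^""]*)""[^>]*>(?<text>.*?)</a>"
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Dim pairs As New List(Of KeyValuePair(Of String, String))()

        For Each m As Match In Regex.Matches(html, pattern, options)
            pairs.Add(New KeyValuePair(Of String, String)(m.Groups("url").Value, m.Groups("text").Value))
        Next
        Return pairs
    End Function

    Sub Main()
        ' One list item taken from the sample HTML in the question.
        Dim html As String = "<li><a href=""?q=node/598"">24-6 Cine & TV Services</a></li>"

        For Each pair As KeyValuePair(Of String, String) In ExtractLinkPairs(html)
            Console.WriteLine(pair.Key & " -> " & pair.Value)
        Next
    End Sub
End Module
```

Named groups avoid a second pass of string trimming, at the cost of a pattern that is tied more closely to how this particular page writes its links.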

Hope this helps you out. Enjoy!

"At DIC we be pattern creating code ninjas... did I mention we also create destruction? Now you know." :snap:

#3 FunkiMunky

Re: how to get data from HTML page

Posted 01 September 2008 - 08:41 AM

Thanks Martyr2, I'm gonna find out how to implement this for string manipulation. The answer sounds so simple, but I would never have gotten to it in a million years.

#4 PsychoCoder

Re: how to get data from HTML page

Posted 01 September 2008 - 10:09 AM

Here's a function I use for extracting all hyperlinks from the data returned from an HTML page:

Imports System.Collections
Imports System.Text.RegularExpressions
Imports System.Windows.Forms

''' <summary>
''' method for extracting all URL's from the data being
''' passed to the method. The data being passed will be all
''' the data from a provided URL
''' </summary>
''' <param name="str"></param>
''' <returns></returns>
Public Function ExtractLinks(ByVal str As String) As ArrayList
	Try
		'ArrayList to hold all the links
		Dim linksList As New ArrayList()

		'regex pattern for searching
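		'the character class includes ? and = so that query-string
		'links such as ?q=node/593 match as well
		Dim pattern As String = "href=""[a-zA-Z./:&?=\d_-]+"""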

		'create a new RegEx object
		Dim reg As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.ExplicitCapture)

		'put all the matches into a MatchCollection
		Dim matches As MatchCollection = reg.Matches(str)

		'loop through all the matches
		For Each match As Match In matches
			For Each group As Group In match.Groups
				'now we do some string manipulation to pull the "href=" off the link
				Dim url As String = group.Value.Replace("href=""", "")
				url = url.Substring(0, url.IndexOf(""""))

				'add the URL to the list

				linksList.Add(url)
			Next
		Next

		'now return the populated ArrayList
		Return linksList
	Catch ex As Exception
		MessageBox.Show(ex.Message)
		Return Nothing
	End Try
End Function



Then you can use the WebClient class of the System.Net namespace to retrieve the source code from a URL, like so:

Imports System.Net
Imports System.Text

''' <summary>
''' method for retrieving information from a specified URL
''' using the new WebClient Class in .Net 2.0
''' </summary>
''' <param name="url">url to retrieve data from</param>
''' <returns></returns>
Public Function LoadSiteContent(ByVal url As String) As String
	'create a new WebClient object; Using disposes it when we are done
	Using client As New WebClient()
		'download the raw bytes of the page
		Dim html As Byte() = client.DownloadData(url)

		'convert the byte array into a string
		'(this assumes the page is UTF-8 encoded)
		Return Encoding.UTF8.GetString(html)
	End Using
End Function
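A rough sketch of how the two pieces might be combined in a small console program follows. The module name LinkDump is made up, the two functions are condensed stand-ins for the versions posted above rather than the originals, and the URL is read from the command line so that no particular address is assumed.

```vbnet
Imports System
Imports System.Collections
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions

Module LinkDump
    ' Condensed stand-in for LoadSiteContent: download the page and
    ' decode it as UTF-8 (an assumption about the page's encoding).
    Function LoadSiteContent(ByVal url As String) As String
        Using client As New WebClient()
            Return Encoding.UTF8.GetString(client.DownloadData(url))
        End Using
    End Function

    ' Condensed stand-in for ExtractLinks: collect the value of every
    ' double-quoted href attribute using a capture group.
    Function ExtractLinks(ByVal html As String) As ArrayList
        Dim links As New ArrayList()
        For Each m As Match In Regex.Matches(html, "href=""([^""]*)""", RegexOptions.IgnoreCase)
            links.Add(m.Groups(1).Value)
        Next
        Return links
    End Function

    Sub Main(ByVal args As String())
        ' The page address is supplied by the caller.
        If args.Length = 0 Then
            Console.WriteLine("usage: LinkDump <url>")
            Return
        End If

        ' Download the page, then print each href found in it.
        For Each link As String In ExtractLinks(LoadSiteContent(args(0)))
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

On a page like the one in the question this would also print the pager links, so the result may still need filtering down to the ?q=node/ entries.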



Hope that helps :)