3 Replies - 936 Views - Last Post: 12 December 2011 - 12:58 PM

#1 CSLexy

  • New D.I.C Head

Reputation: 0
  • Posts: 1
  • Joined: 06-December 11

Traversing the web using JSoup

Posted 07 December 2011 - 04:29 PM

I'm trying to use JSoup in Eclipse to crawl the web and collect a lot of information about the FIFA World Cup: its players and the participating countries for every tournament. Can anyone show me how? Is it possible to write code that goes to Google, enters a query (in plain English, not SQL), and goes through each resulting site looking for that information?

Replies To: Traversing the web using JSoup

#2 BetaWar

  • #include "soul.h"

Reputation: 772
  • Posts: 6,133
  • Joined: 07-September 06

Re: Traversing the web using JSoup

Posted 07 December 2011 - 04:38 PM

I don't think you'll run into anyone willing to write the code for you (after all, it is against the rules). However, it is possible to do a number of things with JSoup.

You can create a spider pretty quickly. The big thing is that you will need to parse the data in such a way that your program can tell whether it has actually found something relevant or just trash. This is often done with regular expressions, but it can also be done with an XML parser (or a JSON parser, if you are lucky enough to come across a JSON data file). XML is more likely, since most well-formed HTML can be handled that way.
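
For what it's worth, here is a minimal sketch of that spider loop using only the JDK. Everything named in it is an assumption made for illustration: the in-memory `pages` map stands in for actually fetching pages over HTTP, `isRelevant` is a hypothetical keyword check, and the `href` regex is deliberately naive (double-quoted attributes only). With JSoup on the classpath you would presumably fetch and parse real documents and pull links with a selector such as `a[href]` instead of `extractLinks`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpiderSketch {
    // Naive link extractor: only matches double-quoted href values on <a> tags.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Hypothetical relevance test: the page must mention every keyword.
    static boolean isRelevant(String html, List<String> keywords) {
        String lower = html.toLowerCase();
        for (String keyword : keywords) {
            if (!lower.contains(keyword.toLowerCase())) {
                return false;
            }
        }
        return true;
    }

    // Breadth-first crawl over 'pages' (url -> html), visiting at most maxPages pages.
    static List<String> crawl(Map<String, String> pages, String start,
                              List<String> keywords, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> relevant = new ArrayList<>();
        frontier.add(start);
        seen.add(start);
        int visited = 0;
        while (!frontier.isEmpty() && visited < maxPages) {
            String url = frontier.poll();
            String html = pages.get(url);
            if (html == null) {
                continue; // link points at a page we do not have
            }
            visited++;
            if (isRelevant(html, keywords)) {
                relevant.add(url);
            }
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // add() returns false for already-seen URLs
                    frontier.add(link);
                }
            }
        }
        return relevant;
    }

    public static void main(String[] args) {
        // Made-up pages, purely for demonstration.
        Map<String, String> pages = Map.of(
                "/index", "<a href=\"/cups/2010\">2010</a> <a href=\"/about\">About</a>",
                "/cups/2010", "<h1>FIFA World Cup 2010</h1> <a href=\"/index\">Home</a>",
                "/about", "<p>About this site</p>");
        System.out.println(crawl(pages, "/index", List.of("fifa", "world cup"), 10));
    }
}
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other, and the `maxPages` cap keeps a test run from wandering off across the whole site.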

The big thing you need to worry about is that using a bot to search Google is against their EULA, and I believe they reserve the right to simply stop serving you pages (basically banning you from their site). As you can expect, that probably isn't something you want to happen. Additionally, a lot of sites out there have license agreements that you, as a user, are held to even if you haven't read them, and the majority of those restrict using bots to scrape their data except through authorized means (like an XML or Atom feed).

Assuming you don't care about breaking EULAs, I would suggest you look into regular expressions and find a site that you believe will be able to provide the information you are looking for. Then make your bot and let it loose.
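
To give a rough idea of the regular-expression route, here is a small sketch with `java.util.regex`. The line shape it looks for ("<year> FIFA World Cup (<host>)") and the sample markup are assumptions made up for the example; a real site will format things differently, so the pattern would have to be adapted to whatever page you pick.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CupRegexSketch {
    // Assumed shape: a four-digit year, the literal text " FIFA World Cup ",
    // then the host country in parentheses.
    private static final Pattern CUP =
            Pattern.compile("\\b((?:19|20)\\d{2}) FIFA World Cup \\(([^)]+)\\)");

    // Returns year -> host for every match, in the order the matches appear.
    static Map<Integer, String> extractHosts(String text) {
        Map<Integer, String> hosts = new LinkedHashMap<>();
        Matcher m = CUP.matcher(text);
        while (m.find()) {
            hosts.put(Integer.parseInt(m.group(1)), m.group(2).trim());
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for a scraped page.
        String html = "<li>2006 FIFA World Cup (Germany)</li>"
                + "<li>2010 FIFA World Cup (South Africa)</li>";
        System.out.println(extractHosts(html)); // prints {2006=Germany, 2010=South Africa}
    }
}
```

A pattern like this breaks as soon as the page layout changes, which is the usual trade-off of scraping with regular expressions instead of a proper parser.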

#3 blackcompe

  • D.I.C Lover

Reputation: 719
  • Posts: 1,692
  • Joined: 05-May 05

Re: Traversing the web using JSoup

Posted 07 December 2011 - 05:13 PM

JSoup is an HTML parser. It won't find information, but a web crawler, such as crawler4j, will. If you want to start with Google, you have to use its API. I imagine you'd need a data storage solution as well. You might want to check out Lucene too.

#4 JackOfAllTrades

  • No Sugar Coding Here!

Reputation: 4678
  • Posts: 20,353
  • Joined: 23-August 08

Re: Traversing the web using JSoup

Posted 12 December 2011 - 12:58 PM

Not sure why a Java HTML parser question is in the Databases forum.

Moved to Java.
