9 Replies - 5237 Views - Last Post: 06 August 2010 - 10:56 AM

#1 Guest_sam*



web crawler

Posted 06 August 2010 - 08:58 AM

Can anyone help me develop a web crawler in Java?
I assure you that I will work hard.
Please help me.
Thanks in advance.

Replies To: web crawler

#2 macosxnerd101

  • Self-Trained Economist

Reputation: 10488
  • Posts: 38,875
  • Joined: 27-December 08

Re: web crawler

Posted 06 August 2010 - 09:00 AM

Sure. Show us your good faith efforts, and tell us what specific problems you are experiencing, and we will be happy to help. :)

#3 saisumanth

  • New D.I.C Head

Reputation: 0
  • Posts: 14
  • Joined: 16-May 10

Re: web crawler

Posted 06 August 2010 - 09:09 AM

Thank you very much for replying.
I have been studying this for two weeks, and I now understand what a web crawler does and the logic behind it. But I am a little confused about which data structure to use.

By the way, I am saisumanth; I prefer Sam.

#4 macosxnerd101

Re: web crawler

Posted 06 August 2010 - 09:11 AM

I would say a Tree would be the best data structure to use here, because of the recursive and tree-like nature of the search. You start off at the homepage, the root node. Then you traverse its n child nodes, then the n child nodes of each of those, and so on. So the crawl is organized in a tree-like manner.
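A minimal sketch of such a tree in Java, for illustration only. The class name `PageNode`, its fields, and the example.com URLs are all hypothetical, and nothing here fetches real pages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type: one node per page; children are the pages it links to.
class PageNode {
    final String url;
    final List<PageNode> children = new ArrayList<>();

    PageNode(String url) {
        this.url = url;
    }

    // Attach a linked page as a child and return the new node.
    PageNode addChild(String childUrl) {
        PageNode child = new PageNode(childUrl);
        children.add(child);
        return child;
    }

    // Count this node plus all of its descendants, recursing into each child.
    int size() {
        int total = 1;
        for (PageNode child : children) {
            total += child.size();
        }
        return total;
    }

    public static void main(String[] args) {
        PageNode root = new PageNode("http://example.com/"); // homepage = root node
        PageNode forums = root.addChild("http://example.com/forums");
        root.addChild("http://example.com/about");
        forums.addChild("http://example.com/forums/java");
        System.out.println(root.size()); // prints 4
    }
}
```

Pages on a real site often link back to one another, so a crawler that builds this tree would need to check for already-seen URLs before adding a child.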

#5 saisumanth

Re: web crawler

Posted 06 August 2010 - 09:17 AM

Thank you.
Which type of traversal should I use: BFS or DFS?

#6 macosxnerd101

Re: web crawler

Posted 06 August 2010 - 09:18 AM

That's completely up to you. Either will work. Personally, I prefer Depth-first searching with Trees.
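To make the difference concrete, here is a small sketch that runs both orders over a made-up link table. The names `TraversalDemo`, `sampleLinks`, and `traverse`, and the page labels, are hypothetical. The only difference between the two traversals is which end of the pending collection the next page is taken from.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TraversalDemo {
    // Hypothetical link table: page -> pages it links to.
    static Map<String, List<String>> sampleLinks() {
        Map<String, List<String>> links = new HashMap<>();
        links.put("home", Arrays.asList("a", "b"));
        links.put("a", Arrays.asList("a1", "a2"));
        links.put("b", Arrays.asList("b1"));
        return links;
    }

    // breadthFirst = true: take from the front (queue behaviour, BFS).
    // breadthFirst = false: take from the back (stack behaviour, DFS).
    static List<String> traverse(Map<String, List<String>> links, String start, boolean breadthFirst) {
        Deque<String> pending = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> order = new ArrayList<>();
        pending.addLast(start);
        while (!pending.isEmpty()) {
            String page = breadthFirst ? pending.pollFirst() : pending.pollLast();
            if (!seen.add(page)) {
                continue; // already visited via another link
            }
            order.add(page);
            for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (!seen.contains(next)) {
                    pending.addLast(next);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = sampleLinks();
        System.out.println(traverse(links, "home", true));  // [home, a, b, a1, a2, b1]
        System.out.println(traverse(links, "home", false)); // [home, b, b1, a, a2, a1]
    }
}
```

The stack-based version follows the last link of each page first; a recursive depth-first search would follow the first link first, but both go deep before going wide.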

#7 saisumanth

Re: web crawler

Posted 06 August 2010 - 09:23 AM

Actually, I read in a paper that there should be a frontier for storing the new URLs.
My question: www.dreamincode.net and http://www.dreaminco...1&#entry1084731 are URLs that refer to the same website. Should I store both DNS names, or is there a better solution?

Can I use hashing with the home URL as its key?
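One way the hashing idea could look, as a sketch under stated assumptions: URLs are grouped in a `HashMap` keyed by host name, and the `#fragment` part is dropped on the assumption that it only points to a position inside the same page. The class `HostIndex` and its methods are hypothetical names, and the example.com URLs are placeholders.

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class HostIndex {
    // Host name -> distinct page URLs seen on that host.
    private final Map<String, Set<String>> urlsByHost = new HashMap<>();

    // Returns true if the page was new, false if it was already stored or has no host.
    boolean add(String url) {
        int hash = url.indexOf('#');
        String page = hash >= 0 ? url.substring(0, hash) : url; // drop the fragment
        // URI.create throws IllegalArgumentException on malformed input (e.g. spaces).
        String host = URI.create(page).getHost();
        if (host == null) {
            return false; // no scheme/host, such as a bare "www.example.com"; skipped here
        }
        return urlsByHost
                .computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> new LinkedHashSet<>())
                .add(page);
    }

    Set<String> urlsFor(String host) {
        return urlsByHost.getOrDefault(host.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
    }

    int hostCount() {
        return urlsByHost.size();
    }

    public static void main(String[] args) {
        HostIndex index = new HostIndex();
        index.add("http://www.example.com/");
        index.add("http://www.example.com/forums?showtopic=1#entry1");
        index.add("http://www.example.com/forums?showtopic=1#entry2"); // same page, not stored again
        System.out.println(index.hostCount() + " host, pages: " + index.urlsFor("www.example.com"));
    }
}
```

With this layout the host is stored once as the key, and each distinct page on it is stored once in the set, so neither URL has to be kept twice.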

#8 macosxnerd101

Re: web crawler

Posted 06 August 2010 - 09:37 AM

I would check first to see if the link is already contained in the Tree. If not, add it in the appropriate spot. But if you get redundancies in Trees, you are looking at infinite recursion unless you utilize other checks (i.e., a depth attribute for your Nodes, and checking whether the depth of the child node is greater than that of the parent node).
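A rough sketch of both checks together. Everything named here (`LinkTree`, `Node`, `addChild`, `maxDepth`) is hypothetical. The "already in the Tree" test uses a `HashSet` kept alongside the tree so the lookup does not have to walk every node, and the depth check is interpreted, as an assumption, as a fixed cap on how deep a child may be.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LinkTree {
    static final class Node {
        final String url;
        final int depth; // root is 0; each child is parent.depth + 1
        final List<Node> children = new ArrayList<>();

        Node(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    final Node root;
    private final int maxDepth;                        // assumption: hard cap on depth
    private final Set<String> known = new HashSet<>(); // every URL already in the tree

    LinkTree(String rootUrl, int maxDepth) {
        this.root = new Node(rootUrl, 0);
        this.maxDepth = maxDepth;
        known.add(rootUrl);
    }

    boolean contains(String url) {
        return known.contains(url);
    }

    // Adds url under parent unless it is too deep or already in the tree.
    // Returns the new node, or null when nothing was added.
    Node addChild(Node parent, String url) {
        if (parent.depth + 1 > maxDepth || !known.add(url)) {
            return null;
        }
        Node child = new Node(url, parent.depth + 1);
        parent.children.add(child);
        return child;
    }

    public static void main(String[] args) {
        LinkTree tree = new LinkTree("http://example.com/", 2);
        Node a = tree.addChild(tree.root, "http://example.com/a");
        System.out.println(tree.addChild(a, "http://example.com/") == null); // back-link rejected: true
    }
}
```

Because a URL that is already known is never added a second time, a page linking back to its ancestor cannot restart the recursion.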

#9 saisumanth

Re: web crawler

Posted 06 August 2010 - 09:55 AM

Actually, here is what I think the logic is:
start with a seed URL and fetch its content (the HTML code), parse it with an HTML parser, and find each new URL referenced in an anchor tag. Add those URLs to my frontier (which contains the new URLs), keep them in a tree, and continue this procedure for the new URLs.
Do you think this procedure is correct?
If yes, how can I avoid getting into endless loops? There is a chance that many websites refer to the same website.
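The procedure above can be sketched roughly as follows. This is an illustration under several assumptions, not a production crawler: an in-memory `Map` of URL to HTML stands in for real fetching, a naive regular expression stands in for a real HTML parser, and the names `MiniCrawler` and `crawl` are hypothetical. A visited set is what keeps the loop from running forever when pages link to each other.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniCrawler {
    // Naive matcher for <a ... href="...">; a real HTML parser is far more robust.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Crawls from seed; pages maps url -> HTML and stands in for the network.
    // Returns every URL taken off the frontier, in the order it was visited.
    static Set<String> crawl(String seed, Map<String, String> pages) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>();
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            if (!visited.add(url)) {
                continue; // already processed: this is what breaks cycles
            }
            String html = pages.get(url);
            if (html == null) {
                continue; // nothing to parse (dead link in this toy setup)
            }
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.com/",
                "<a href=\"http://example.com/a\">A</a> <a href=\"http://example.com/b\">B</a>");
        pages.put("http://example.com/a", "<a href=\"http://example.com/\">home</a>"); // links back
        pages.put("http://example.com/b", "<a href=\"http://example.com/a\">A</a>");
        System.out.println(crawl("http://example.com/", pages));
    }
}
```

Even though page `a` links back to the homepage and `b` links to `a`, each URL is processed once and the loop ends when the frontier is empty.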

#10 macosxnerd101

Re: web crawler

Posted 06 August 2010 - 10:56 AM

That procedure is correct. Re-read my last post for thoughts on the problem.
