Can anyone help me develop a web crawler in Java?
I promise I will work hard on it.
Please help me.
Thanks in advance.
web crawler
Page 1 of 1
9 Replies - 4349 Views - Last Post: 06 August 2010 - 10:56 AM
Replies To: web crawler
#2
Re: web crawler
Posted 06 August 2010 - 09:00 AM
Sure. Show us your good faith efforts, and tell us what specific problems you are experiencing, and we will be happy to help.
#3
Re: web crawler
Posted 06 August 2010 - 09:09 AM
Thank you very much for replying.
I have been studying this for two weeks, and I now understand what a web crawler does and the logic behind it. But I am a little confused about which data structure to use.
By the way, I am saisumanth; I prefer Sam.
#4
Re: web crawler
Posted 06 August 2010 - 09:11 AM
I would say a Tree would be the best data structure to use here, because of the recursive and tree-like nature of the search. You start off at the homepage, the root node. Then you traverse its n child nodes, and the n child nodes of each of them, etc. So it is organized in a tree-like manner.
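To make the tree idea a bit more concrete, here is a rough sketch of what a node could look like. `PageNode` and its methods are made-up names for illustration, not part of any library, and the sketch assumes each page only needs its URL, its depth below the seed page, and its child pages.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical node type for illustration; not part of any library.
final class PageNode {
    private final String url;
    private final int depth;                       // 0 for the seed (root) page
    private final List<PageNode> children = new ArrayList<>();

    PageNode(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }

    // Adds a page linked from this one, one level deeper in the tree.
    PageNode addChild(String childUrl) {
        PageNode child = new PageNode(childUrl, depth + 1);
        children.add(child);
        return child;
    }

    String url() { return url; }
    int depth() { return depth; }
    List<PageNode> children() { return Collections.unmodifiableList(children); }

    // Counts this node plus all of its descendants, recursively.
    int size() {
        int total = 1;
        for (PageNode child : children) {
            total += child.size();
        }
        return total;
    }
}
```

The homepage would be `new PageNode(seedUrl, 0)`, and every link found on a page would become a child via `addChild`.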
#5
Re: web crawler
Posted 06 August 2010 - 09:17 AM
Thank you.
Which type of traversal should I use: BFS or DFS?
#6
Re: web crawler
Posted 06 August 2010 - 09:18 AM
That's completely up to you. Either will work. Personally, I prefer Depth-first searching with Trees.
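As a rough illustration of how small the difference is, the sketch below uses one `Deque` as the list of pending URLs and switches between queue order (breadth-first) and stack order (depth-first style) with a flag. The `links` map is an assumption standing in for "fetch the page and extract its links", and `TraversalOrder` is a made-up class name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TraversalOrder {
    // links: hypothetical in-memory stand-in for "fetch page, extract hrefs".
    static List<String> crawl(Map<String, List<String>> links, String seed, boolean breadthFirst) {
        Deque<String> pending = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> order = new ArrayList<>();
        pending.add(seed);
        seen.add(seed);
        while (!pending.isEmpty()) {
            // Taking the oldest entry gives BFS; taking the newest gives a DFS-style order.
            String url = breadthFirst ? pending.pollFirst() : pending.pollLast();
            order.add(url);
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                if (seen.add(next)) {          // add() returns false if already seen
                    pending.addLast(next);
                }
            }
        }
        return order;
    }
}
```

With links A -> B, C; B -> D; C -> E, the queue version visits A, B, C, D, E and the stack version visits A, C, E, B, D.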
#7
Re: web crawler
Posted 06 August 2010 - 09:23 AM
Actually, I read in a paper that there should be a frontier for storing the new URLs.
My question: www.dreamincode.net and http://www.dreaminco...1&#entry1084731 are URLs that refer to the same website. Should I store both DNS names? If not, please tell me the solution.
Can I use hashing with the home URL as its key?
#8
Re: web crawler
Posted 06 August 2010 - 09:37 AM
I would check first to see if the link is already contained in the Tree. If not, add it in the appropriate spot. But if you get redundancies in Trees, you are looking at infinite recursion unless you utilize other checks (i.e., a depth attribute for your Nodes and checking to see if the depth of the child node > parent node).
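One possible way to do the "already contained" check is sketched below. It swaps in a `HashSet` of normalized URL strings instead of searching the Tree itself, plus a depth limit. `VisitedUrls`, `normalize`, `shouldVisit`, and `maxDepth` are made-up names, and the normalization rules (treat a bare host name as http, lower-case the scheme and host, drop the `#fragment`) are assumptions you may want to change.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class VisitedUrls {
    private final Set<String> seen = new HashSet<>();
    private final int maxDepth;   // assumed crawl limit, chosen by the caller

    VisitedUrls(int maxDepth) { this.maxDepth = maxDepth; }

    // Canonical form: lower-case scheme and host, no fragment, "/" for an empty path.
    static String normalize(String raw) throws URISyntaxException {
        String text = raw.trim();
        if (!text.contains("://")) {
            text = "http://" + text;   // assumption: bare host names mean http
        }
        URI uri = new URI(text).normalize();
        if (uri.getHost() == null) {
            throw new URISyntaxException(raw, "URL has no host");
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();
        String rawPath = uri.getRawPath();
        String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return scheme + "://" + host + port + path + query;
    }

    // True only the first time a URL is offered, and only within the depth limit.
    boolean shouldVisit(String rawUrl, int depth) {
        if (depth > maxDepth) {
            return false;
        }
        try {
            return seen.add(normalize(rawUrl));
        } catch (URISyntaxException e) {
            return false;   // skip links that cannot be parsed
        }
    }
}
```

The `HashSet` already hashes the normalized string, so different spellings of the same address collapse into one entry. If you would rather group pages per site, a `Map` keyed by host name holding a set of paths would be a similar design.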
#9
Re: web crawler
Posted 06 August 2010 - 09:55 AM
Here is the logic as I understand it:
Start with a seed URL and fetch its content (the HTML code). Parse it with an HTML parser, find each new URL referenced in an anchor tag, add it to my frontier (which holds the new URLs), store it in a tree, and repeat this procedure for the new URLs.
Do you think this procedure is correct?
If so, how can I avoid getting into endless loops, since many websites may link to the same website?
#10
Re: web crawler
Posted 06 August 2010 - 10:56 AM
macosxnerd101, on 06 August 2010 - 12:37 PM, said:
I would check first to see if the link is already contained in the Tree. If not, add it in the appropriate spot. But if you get redundancies in Trees, you are looking at infinite recursion unless you utilize other checks (i.e., a depth attribute for your Nodes and checking to see if the depth of the child node > parent node).
That procedure is correct. Re-read my last post for thoughts on the problem.
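For reference, the procedure (seed URL, fetch, parse anchors, add new URLs to the frontier, repeat) might be sketched roughly like this. Everything named here is an assumption for illustration: `CrawlLoop` is a made-up class, `fetcher` is a hypothetical hook that returns the HTML for a URL (so the sketch runs without network access), `maxPages` is an arbitrary stop condition, and the regular expression is a simplistic stand-in for a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CrawlLoop {
    // Simplistic href matcher for illustration; a real HTML parser is more robust.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    // fetcher is a hypothetical hook: URL in, HTML out (null if the page is unavailable).
    static List<String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue;               // nothing to parse for this URL
            }
            visited.add(url);
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link;
                try {
                    // Resolve relative links against the page they were found on.
                    link = URI.create(url).resolve(m.group(1).trim()).toString();
                } catch (IllegalArgumentException e) {
                    continue;           // skip malformed links
                }
                if (seen.add(link)) {   // add() is false for URLs already queued
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Because a URL only enters the frontier the first time it is seen, pages that link back to each other do not cause an endless loop in this sketch.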