Welcome to Dream.In.Code




web crawler - help choosing language

 

web crawler - help choosing language, language needs good unicode support

chu
21 Sep, 2008 - 11:35 AM
Post #1

New D.I.C Head

Joined: 20 Sep, 2008
Posts: 7


Hi
I'm trying to choose a language to program a focused web crawler in. The purpose of this project is, more than anything, to serve as a learning experience, so things like memory usage and speed are not priorities for this crawler. I also realize that there are some open source crawlers out there, but again, I'm doing this for the learning experience. I'd like help choosing a language based on the following criteria:

- Good built-in functions or libraries for parsing HTML and XML. I'm still a relatively novice programmer, so anything that saves me time here would be helpful.
- Good support for the following character encodings: UTF-8, Shift-JIS/x-sjis, EUC-JP

Also, I already have some experience using Java, Python, and PHP, so if any of these languages fits into the criteria mentioned above, that language would be preferred.

From experience, PHP doesn't have very good support for Unicode (yet), and it doesn't seem like a well-suited language for writing a web crawler either.

While I like Python a lot, and I've heard good things about lxml and BeautifulSoup, I'm not too sure about its Unicode support after reading some of the comments here:
http://lowkster.blogspot.com/2008/06/pytho...code-sucks.html

From what I can tell, a lot of web crawlers are written in Java or C/C++. Any recommendations?

This post has been edited by chu: 21 Sep, 2008 - 11:37 AM

abgorn
RE: Web Crawler - Help Choosing Language
21 Sep, 2008 - 11:38 AM
Post #2

Hello Crap for Brains

Joined: 5 Jun, 2008
Posts: 912



Thanked: 5 times
Dream Kudos: 50
I think Java would suit this well. It seems to meet your criteria, and Java's a fairly simple and straightforward language.

Does anyone else think another language would be a good fit for this?
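
For what it's worth, here is a minimal sketch of how the encoding requirement could be checked in Java. It assumes the runtime ships the extended charsets (Shift_JIS and EUC-JP are not guaranteed on every JVM, hence the `Charset.isSupported` check); the class name `EncodingRoundTrip` and the helper `roundTrip` are made up for illustration.

```java
import java.nio.charset.Charset;

public class EncodingRoundTrip {
    // Encode text to bytes in the named charset, then decode those bytes back.
    // A lossless round trip suggests the charset can represent the text.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        byte[] raw = text.getBytes(cs);
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // Three kanji, written as Unicode escapes so the source file's own
        // encoding does not matter.
        String sample = "\u65E5\u672C\u8A9E";
        String[] names = {"UTF-8", "Shift_JIS", "EUC-JP"};
        for (String name : names) {
            // Assumption: the two Japanese charsets may be missing on minimal runtimes.
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available on this JVM");
                continue;
            }
            int byteCount = sample.getBytes(Charset.forName(name)).length;
            boolean lossless = sample.equals(roundTrip(sample, name));
            System.out.println(name + ": " + byteCount + " bytes, lossless = " + lossless);
        }
    }
}
```

In a crawler the charset name would usually come from the page's HTTP headers or meta tag rather than a hard-coded list, so treat this only as a way to see whether your JVM handles the three encodings you listed.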

xCraftyx
RE: Web Crawler - Help Choosing Language
22 Sep, 2008 - 03:59 PM
Post #3

New D.I.C Head

Joined: 13 Sep, 2008
Posts: 44



Thanked: 1 times
Here's an article about writing a web crawler in Java if you'd like to try it out: http://java.sun.com/developer/technicalArt...rty/WebCrawler/

chu
RE: Web Crawler - Help Choosing Language
23 Sep, 2008 - 09:26 AM
Post #4

New D.I.C Head

Joined: 20 Sep, 2008
Posts: 7


Hey, thanks for the replies. I guess I'll try programming the crawler in Java.

abgorn
RE: Web Crawler - Help Choosing Language
27 Sep, 2008 - 01:55 AM
Post #5

Hello Crap for Brains

Joined: 5 Jun, 2008
Posts: 912



Thanked: 5 times
Dream Kudos: 50
If you do go with Java, you could approach it like this:
http://www.java-tips.org/java-se-tips/java...-in-java-2.html
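
Separately from that link, here is a rough sketch of the basic crawl loop (a frontier queue plus a visited set), just to show the shape of the thing. It is not taken from the linked article. Everything here is hypothetical: the class name `TinyCrawler`, the `fetch` function (which stands in for real HTTP code), and the regex link extraction, which is a simplification. A real crawler would want a proper HTML parser and URL resolution.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {
    // Simplification: matches only double-quoted href attributes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Collect the raw href values found in a page, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl from a seed. `fetch` maps a URL to its HTML,
    // or null if the page is unavailable. Stops after maxPages URLs.
    static Set<String> crawl(String seed, Function<String, String> fetch, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already seen this URL
            }
            String html = fetch.apply(url);
            if (html == null) {
                continue; // recorded as visited, but nothing to parse
            }
            for (String link : extractLinks(html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // An in-memory "web" so the sketch runs without network access.
        Map<String, String> fakeWeb = new HashMap<>();
        fakeWeb.put("a", "<a href=\"b\">b</a> <a href=\"c\">c</a>");
        fakeWeb.put("b", "<a href=\"a\">back</a>");
        fakeWeb.put("c", "no links here");
        System.out.println(crawl("a", fakeWeb::get, 10));
    }
}
```

Passing the fetcher in as a function keeps the loop testable offline; swapping in real HTTP and charset-aware decoding would be the next step.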

arachnode.net
RE: Web Crawler - Help Choosing Language
6 Jan, 2009 - 09:47 AM
Post #6

New D.I.C Head

Joined: 6 Jan, 2009
Posts: 1

QUOTE(chu @ 21 Sep, 2008 - 11:35 AM) *

Hi
I'm trying to choose a language to program a focused web crawler in. The purpose of this project is, more than anything, to serve as a learning experience, so things like memory usage and speed are not priorities for this crawler. I also realize that there are some open source crawlers out there, but again, I'm doing this for the learning experience. I'd like help choosing a language based on the following criteria:

- Good built-in functions or libraries for parsing HTML and XML. I'm still a relatively novice programmer, so anything that saves me time here would be helpful.
- Good support for the following character encodings: UTF-8, Shift-JIS/x-sjis, EUC-JP

Also, I already have some experience using Java, Python, and PHP, so if any of these languages fits into the criteria mentioned above, that language would be preferred.

From experience, PHP doesn't have very good support for Unicode (yet), and it doesn't seem like a well-suited language for writing a web crawler either.

While I like Python a lot, and I've heard good things about lxml and BeautifulSoup, I'm not too sure about its Unicode support after reading some of the comments here:
http://lowkster.blogspot.com/2008/06/pytho...code-sucks.html

From what I can tell, a lot of web crawlers are written in Java or C/C++. Any recommendations?


You say you want to do this as a learning exercise... I would recommend trying a port if you really want to learn.

There are a heap of crawlers available that download text, and somewhat fewer that handle images as well. If you really want to get into the nuts and bolts of crawling and of reconstructing web pages, I would suggest checking out http://arachnode.net.

arachnode.net is a complete C# web crawler but is straightforward enough to port to Java.

Learning to extend Nutch might also be a good option.

