Hi
I'm trying to choose a language to program a focused web crawler in. The purpose of this project is, more than anything, to serve as a learning experience, so things like memory usage and speed are not priorities for this crawler. I also realize that there are some open source crawlers out there, but again, I'm doing this for the learning experience. I'd like help choosing a language based on the following criteria:
-good built-in functions or libraries for parsing HTML and XML. I'm still a relatively novice programmer, so if these can save me some time, it would be helpful.
-good support for the following character encodings: UTF-8, Shift-JIS/x-sjis, EUC-JP
Also, I already have some experience using Java, Python, and PHP, so if any of these languages fits the criteria mentioned above, that language would be preferred.
From experience, PHP doesn't have very good Unicode support (yet), and it doesn't seem like a language well suited to writing a web crawler either.
While I like Python a lot, and I've heard good things about lxml and BeautifulSoup, I'm not too sure about its Unicode support after reading some of the comments here:
http://lowkster.blogspot.com/2008/06/pytho...code-sucks.html
From what I can tell, a lot of web crawlers are written in Java or C/C++. Any recommendations?
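To make the two criteria concrete, here is a rough sketch of the kind of thing I'd want to work out of the box: decode a page from one of the three encodings, then pull the links out of it. It assumes Python 3 and uses only the standard library (`html.parser` and the built-in codecs) rather than lxml or BeautifulSoup. The `LinkCollector` class, the `extract_links` function, and the sample page are names and data I made up for illustration, and a real crawler would have to detect the encoding from the HTTP headers or a meta tag instead of being told it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(raw_bytes, encoding):
    """Decode raw page bytes with the given encoding and return its links.

    The encoding is passed in explicitly here (an assumption for the
    sketch); undecodable bytes are replaced instead of raising an error.
    """
    text = raw_bytes.decode(encoding, errors="replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


# Made-up sample page containing Japanese text.
PAGE = (
    '<html><body><a href="/news">ニュース</a> '
    '<a href="http://example.com/">例</a></body></html>'
)

if __name__ == "__main__":
    # Encode the same page three ways and check the links survive each one.
    for enc in ("utf-8", "shift_jis", "euc_jp"):
        print(enc, extract_links(PAGE.encode(enc), enc))
```

If the same list of links comes back for all three encodings, that would cover the basic case I care about. Whether the equivalent is as short in Java or PHP is part of what I'm asking.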
This post has been edited by chu: 21 Sep, 2008 - 11:37 AM