HELP: parsing unicode web sites

 

andrewwan1980
31 Jul, 2008 - 05:59 AM
Post #1

I need help parsing Unicode web pages and downloading JPEG image files with Perl scripts.

I read http://www.cs.utk.edu/cs594ipm/perl/crawltut.html, which covers fetching pages with the LWP and HTTP libraries and the get($url) function. When I use get($url) on a non-Unicode web page, the content comes back as clean ASCII, but on the pages I am interested in it always comes back garbled.

Now I want to parse http://www.tom365.com/movie_2004/html/5507.html, and the page I get back is garbled. I have read about the Encode module but don't know how to use it.

I need a Perl script that parses the page above and extracts the image URL from markup matching this pattern:

<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" />

If anyone knows how to parse Unicode web pages like this, I'd be very grateful.

Thank you

perfectly.insane
RE: HELP: Parsing Unicode Web Sites
1 Aug, 2008 - 06:15 PM
Post #2

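That page is not Unicode. It's GB2312, a character set for Simplified Chinese in which each Chinese character takes two bytes. That is why it comes out garbled when the bytes are displayed without being decoded. Unicode usually does not come out this garbled: it might have null bytes in the case of UCS-2 or UTF-16, but it isn't total garbage like undecoded GB2312. You should look for a module that decodes this; the Encode module you mentioned supports GB2312. There also has to be a way to detect the encoding. In practice the charset is normally declared in the HTTP Content-Type header or in the page's meta tag, and because GB2312 as served on the web (EUC-CN) leaves ASCII bytes unchanged, that declaration can be read from the raw bytes before anything else is decoded.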
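
A rough sketch of what that could look like with Perl's core Encode module, in case it helps. The helper names (detect_charset, extract_movie_image) and the sample page are made up for illustration, and the sample is built locally so the script runs without a network connection. In a real script the raw bytes would come from your HTTP fetch instead. I'm also assuming the page declares its charset in a meta tag, with gb2312 as the fallback.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: read the charset name from a <meta> tag in the raw bytes.
# This works before decoding because the tag itself is plain ASCII.
sub detect_charset {
    my ($raw_bytes) = @_;
    if ($raw_bytes =~ /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i) {
        return lc $1;
    }
    return;
}

# Hypothetical helper: decode the raw bytes, then return the first image URL
# that follows <div class="movie">, or undef if the pattern is not found.
sub extract_movie_image {
    my ($raw_bytes) = @_;

    # Assumption: fall back to gb2312 when no charset is declared.
    my $charset = detect_charset($raw_bytes);
    $charset = 'gb2312' unless defined $charset;

    # Convert the raw bytes into a Perl character string.
    my $html = decode($charset, $raw_bytes);

    # The pattern from the question: <div class="movie"><img src="..."
    if ($html =~ m{<div\s+class="movie">\s*<img\s+src="([^"]+)"}i) {
        return $1;
    }
    return;
}

# Stand-in for a fetched page: GB2312-encoded bytes with a Chinese title.
my $sample = encode('gb2312',
      qq{<html><head>}
    . qq{<meta http-equiv="Content-Type" content="text/html; charset=gb2312">}
    . qq{<title>\x{7535}\x{5F71}</title></head><body>}
    . qq{<div class="movie"><img src="http://pic.tom365.com/imgs/tongjifan.jpg" class="mp" /></div>}
    . qq{</body></html>});

my $url = extract_movie_image($sample);
print defined $url ? "$url\n" : "no image found\n";
```

A regex is enough for one fixed pattern like this. If the markup varies from page to page, an HTML parser module would be a sturdier choice.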