6 Replies - 2445 Views - Last Post: 20 December 2010 - 12:48 PM

#1 Captain M

Get a specific HTML tag from file

Posted 17 December 2010 - 11:06 PM

I'm trying to get an <f6> tag's contents from a web page. I've found code that will get the contents of a hyperlink (<a href>) tag, but I can't figure out how to modify it correctly. This is the code I'm working with:
import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

import urllib

# Get something to work with.
f = urllib.urlopen("http://www.waylink-english.co.uk/?page=11620&pw=1")
s = f.read()

# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the hyperlinks.
print myparser.get_hyperlinks()



The code is looking for href as an attribute of <a>, but I don't see where it specifies <a> as the tag to use. The tag I'm looking for (<f6>) has no attributes, so I just want to get the contents of all <f6> tags. Any help would be greatly appreciated!


Replies To: Get a specific HTML tag from file

#2 JackOfAllTrades

Posted 18 December 2010 - 06:43 AM

You might try using Beautiful Soup instead. It's built for this sort of thing.

#3 Captain M

Posted 18 December 2010 - 08:11 AM

Thanks a lot. I'm at work, but I'll look at it tonight. It looks perfect for what I need.

#4 LinuxFan

Posted 20 December 2010 - 12:07 PM

Alternatively, if you do not wish to use Beautiful Soup,
just rename the start_a(self, attributes) method so that its name matches the tag you want.

For example, if you want to parse image links and get the image urls, use:
#Change 'img' in 'start_img' to the HTML tag you need
def start_img(self, attributes):
    "Process an image tag and its 'attributes'."

    for name, value in attributes:
        if name == "src": #Change "src" to the attribute you need
            self.hyperlinks.append(value)



Now, I'm not sure which other HTML tags work, since I've only experimented with sgmllib for a short time, but other tags most likely work in the same fashion.

Hope I helped!
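To answer the earlier question of where the tag gets specified: sgmllib picks the handler by method name, so start_a handles <a> and start_img handles <img>, and unknown_starttag is called when no matching method exists. Below is a toy model of that lookup, written to run on Python 3. It is a simplified illustration, not sgmllib's actual source, and the names MiniDispatcher and dispatch_starttag are made up for this sketch.

```python
class MiniDispatcher:
    """Toy model of name-based tag dispatch (not sgmllib's real code)."""

    def __init__(self):
        self.seen = []  # records which handler was chosen for each tag

    def dispatch_starttag(self, tag, attributes):
        # Look for a method named start_<tag>; fall back if none exists.
        handler = getattr(self, "start_" + tag, None)
        if handler is not None:
            handler(attributes)
        else:
            self.unknown_starttag(tag, attributes)

    def start_f6(self, attributes):
        # Chosen for <f6> purely because of the method's name.
        self.seen.append("start_f6")

    def unknown_starttag(self, tag, attributes):
        # Fallback for any tag without a start_<tag> method.
        self.seen.append("unknown:" + tag)


d = MiniDispatcher()
d.dispatch_starttag("f6", [])
d.dispatch_starttag("img", [("src", "a.png")])
print(d.seen)  # ['start_f6', 'unknown:img']
```

So a method named start_f6 should be picked up for a custom <f6> tag in the same way start_a is for <a>.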

#5 Captain M

Posted 20 December 2010 - 12:18 PM

The problem is that I'm trying to get a custom tag <f6> and it has no attributes. I just want to get the contents of every f6 tag, and output it to a txt file.

#6 LinuxFan

Posted 20 December 2010 - 12:40 PM

Aaah, I'm sorry - I didn't fully read the bottom of the post. For a custom tag, start_a should be replaced with 'def unknown_starttag(self, tag, attributes):', with a check for tag == 'f6' inside it, but that's beside the point.

Anyway, sorry for the misinformation :(
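Detecting the start tag is only half of the job: to get the contents of each <f6>, the parser also has to collect the text between the start and end tags. Here is a minimal sketch of that idea. Note that it swaps in Python 3's html.parser.HTMLParser, because sgmllib was removed in Python 3. The names TagTextParser and extract_tag_contents and the sample string are assumptions made up for illustration, not something from the original page.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()  # HTMLParser reports tag names in lowercase
        self.depth = 0       # > 0 while we are inside the target tag
        self.chunks = []     # text pieces of the tag currently being read
        self.contents = []   # finished text, one entry per outermost tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            if self.depth == 0:
                self.chunks = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.contents.append("".join(self.chunks))

    def handle_data(self, data):
        # Keep text only while inside the target tag.
        if self.depth > 0:
            self.chunks.append(data)


def extract_tag_contents(html, tag_name="f6"):
    """Return a list with the text content of each <tag_name> element."""
    parser = TagTextParser(tag_name)
    parser.feed(html)
    parser.close()
    return parser.contents


# Made-up sample input; a real page would be downloaded first.
sample = "<html><body><f6>first</f6><p>skip</p><f6>second <b>bold</b></f6></body></html>"
print(extract_tag_contents(sample))  # ['first', 'second bold']
```

The returned list can then be written to a text file, one entry per line, with an ordinary open(..., "w") call. Under Python 2 with sgmllib the same pattern should apply, with start_f6, end_f6 and handle_data playing the roles of the three handlers above.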

#7 Captain M

Posted 20 December 2010 - 12:48 PM

No, you're good. That was helpful. Thanks!
