2 Replies - 2816 Views - Last Post: 13 March 2013 - 07:27 AM

#1 Malluce

Parsing XML File using DOM

Posted 12 March 2013 - 11:06 PM

I'm trying to parse an XML file using DOM, and I've run into problems handling nested nodes.
The XML file looks like this:
<record rid="3247">
    <requestor>[email protected]</requestor>
    <request_summary>John's message</request_summary>

I'm trying to parse the records and store each value in a tuple like (bu_fg, services_catalog_i, request_type, ...).
This is what my code looks like. I'm trying to find each "record" node and go inside it to parse my data:
from xml.dom.minidom import parse
dom = parse('result.xml')
record = dom.getElementsByTagName('record')
for node in record:
    bufg = node.getElementsByTagName("bu_fg")
    print bufg

However, this always prints [<DOM Element: bu_fg at 0x108ae39e0>] and I have no idea why. Any suggestions? Thanks!


Replies To: Parsing XML File using DOM

#2 baavgai

Re: Parsing XML File using DOM

Posted 13 March 2013 - 07:12 AM

This looked like the perfect opportunity to explore with Python. This is what I did:
[[email protected] ~]$ python
Python 2.6.6 (r266:84292, Sep 11 2012, 08:34:23) 
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parseString('''
... <record rid="3247">
...     <bu_fg>BU</bu_fg>
...     <services_catalog_i>Access</services_catalog_i>
...     <request_type>RTB</request_type>
...     <date_created>1363138047045</date_created>
...     <actual_start_date/>
...     <actual_completed_date/>
...     <duration/>
...     <record_owner>John</record_owner>
...     <request_id>3247</request_id>
...     <requestor>[email protected]</requestor>
...     <assigned_to/>
...     <request_summary>John's message</request_summary>
...     <update_id>1363138047045</update_id>
...   </record>
... ''')
>>> dom
<xml.dom.minidom.Document instance at 0x7f369dbf97a0>
>>> record = dom.getElementsByTagName('record')
>>> node = record[0]
>>> node
<DOM Element: record at 0x7f369dbf9878>
>>> e = node.getElementsByTagName("bu_fg")
>>> e
[<DOM Element: bu_fg at 0x7f369dbf9ab8>]
>>> dir(e)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__setstate__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_get_length', '_set_length', 'append', 'count', 'extend', 'index', 'insert', 'item', 'length', 'pop', 'remove', 'reverse', 'sort']
>>> e.item
<bound method NodeList.item of [<DOM Element: bu_fg at 0x7f369dbf9ab8>]>
>>> e.count
<built-in method count of NodeList object at 0x7f369dc00950>
>>> e.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: count() takes exactly one argument (0 given)
>>> len(e)
1
>>> e2 = e[0]
>>> e2
<DOM Element: bu_fg at 0x7f369dbf9ab8>
>>> dir(e2)
['ATTRIBUTE_NODE', 'CDATA_SECTION_NODE', 'COMMENT_NODE', 'DOCUMENT_FRAGMENT_NODE', 'DOCUMENT_NODE', 'DOCUMENT_TYPE_NODE', 'ELEMENT_NODE', 'ENTITY_NODE', 'ENTITY_REFERENCE_NODE', 'NOTATION_NODE', 'PROCESSING_INSTRUCTION_NODE', 'TEXT_NODE', '__doc__', '__init__', '__module__', '__nonzero__', '__repr__', '_attrs', '_attrsNS', '_call_user_data_handler', '_child_node_types', '_get_attributes', '_get_childNodes', '_get_firstChild', '_get_lastChild', '_get_localName', '_get_tagName', '_magic_id_nodes', 'appendChild', 'attributes', 'childNodes', 'cloneNode', 'firstChild', 'getAttribute', 'getAttributeNS', 'getAttributeNode', 'getAttributeNodeNS', 'getElementsByTagName', 'getElementsByTagNameNS', 'getInterface', 'getUserData', 'hasAttribute', 'hasAttributeNS', 'hasAttributes', 'hasChildNodes', 'insertBefore', 'isSameNode', 'isSupported', 'lastChild', 'localName', 'namespaceURI', 'nextSibling', 'nodeName', 'nodeType', 'nodeValue', 'normalize', 'ownerDocument', 'parentNode', 'prefix', 'previousSibling', 'removeAttribute', 'removeAttributeNS', 'removeAttributeNode', 'removeAttributeNodeNS', 'removeChild', 'replaceChild', 'schemaType', 'setAttribute', 'setAttributeNS', 'setAttributeNode', 'setAttributeNodeNS', 'setIdAttribute', 'setIdAttributeNS', 'setIdAttributeNode', 'setUserData', 'tagName', 'toprettyxml', 'toxml', 'unlink', 'writexml']
>>> e2.nodeName
>>> e2.nodeValue
>>> e2.hasChildNodes
<bound method Element.hasChildNodes of <DOM Element: bu_fg at 0x7f369dbf9ab8>>
>>> e2.hasChildNodes()
True
>>> e2c = e2.childNodes[0]
>>> e2c
<DOM Text node "u'BU'">
>>> dir(e2c)
['ATTRIBUTE_NODE', 'CDATA_SECTION_NODE', 'COMMENT_NODE', 'DOCUMENT_FRAGMENT_NODE', 'DOCUMENT_NODE', 'DOCUMENT_TYPE_NODE', 'ELEMENT_NODE', 'ENTITY_NODE', 'ENTITY_REFERENCE_NODE', 'NOTATION_NODE', 'PROCESSING_INSTRUCTION_NODE', 'TEXT_NODE', '__doc__', '__len__', '__module__', '__nonzero__', '__repr__', '__setattr__', '_call_user_data_handler', '_get_childNodes', '_get_data', '_get_firstChild', '_get_isWhitespaceInElementContent', '_get_lastChild', '_get_length', '_get_localName', '_get_nodeValue', '_get_wholeText', '_set_data', '_set_nodeValue', 'appendChild', 'appendData', 'attributes', 'childNodes', 'cloneNode', 'data', 'deleteData', 'firstChild', 'getInterface', 'getUserData', 'hasChildNodes', 'insertBefore', 'insertData', 'isSameNode', 'isSupported', 'isWhitespaceInElementContent', 'lastChild', 'length', 'localName', 'namespaceURI', 'nextSibling', 'nodeName', 'nodeType', 'nodeValue', 'normalize', 'ownerDocument', 'parentNode', 'prefix', 'previousSibling', 'removeChild', 'replaceChild', 'replaceData', 'replaceWholeText', 'setUserData', 'splitText', 'substringData', 'toprettyxml', 'toxml', 'unlink', 'wholeText', 'writexml']
>>> e2c.nodeValue
u'BU'

This was kind of interesting. The result of node.getElementsByTagName("bu_fg") told me it was a DOM Element, but it lied. It was a NodeList of one. I had to grab the first element out, but I still wasn't done. The Element itself does not have a text value (its nodeValue is None); it contains child nodes, one of which is a text node that holds the actual string.

See if you can worry it out from there.
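
For anyone landing here later, the chain described above (NodeList, then Element, then Text node) can be sketched roughly like this. It is only a sketch: the helper names text_of and records_as_tuples and the FIELDS tuple are made up for illustration, the sample XML is trimmed from the session above, and it is written with Python 3 in mind rather than the 2.6 shown in the transcript.

```python
from xml.dom.minidom import parseString

# Sample trimmed from the interactive session above.
XML = """<record rid="3247">
    <bu_fg>BU</bu_fg>
    <services_catalog_i>Access</services_catalog_i>
    <request_type>RTB</request_type>
    <duration/>
</record>"""

# Hypothetical field list; extend it with whatever tags you need.
FIELDS = ("bu_fg", "services_catalog_i", "request_type", "duration")


def text_of(parent, tag):
    """Return the text inside the first <tag> under parent, or None."""
    matches = parent.getElementsByTagName(tag)  # always a NodeList
    if not matches:
        return None  # no such tag under this parent
    element = matches[0]  # step from the NodeList to the Element
    # The Element holds its text in child Text nodes, not in nodeValue.
    parts = [child.data for child in element.childNodes
             if child.nodeType == child.TEXT_NODE]
    return "".join(parts) if parts else None  # empty element -> None


def records_as_tuples(dom, fields=FIELDS):
    """Build one tuple of field values per <record> element."""
    return [tuple(text_of(rec, name) for name in fields)
            for rec in dom.getElementsByTagName("record")]


rows = records_as_tuples(parseString(XML))
print(rows)
```

Empty elements such as <duration/> have no child nodes at all, which is why the helper returns None for them instead of indexing childNodes[0] directly.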

#3 JackOfAllTrades

Re: Parsing XML File using DOM

Posted 13 March 2013 - 07:27 AM

ElementTree is a better method for parsing XML, IIRC.
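
For comparison, here is roughly what the same extraction might look like with the standard library's xml.etree.ElementTree. Again just a sketch under assumptions: the sample XML is trimmed from the earlier post, and record_tuple and FIELDS are illustrative names, not anything from the original code.

```python
import xml.etree.ElementTree as ET

# Sample trimmed from the earlier post in this thread.
XML = """<record rid="3247">
    <bu_fg>BU</bu_fg>
    <services_catalog_i>Access</services_catalog_i>
    <request_type>RTB</request_type>
    <duration/>
</record>"""

# Hypothetical field list; extend it with whatever tags you need.
FIELDS = ("bu_fg", "services_catalog_i", "request_type", "duration")


def record_tuple(record, fields=FIELDS):
    """Collect the text of each named child of one <record> element."""
    # findtext() gives "" for an empty element and None for a missing tag.
    return tuple(record.findtext(name) for name in fields)


root = ET.fromstring(XML)  # for a file: ET.parse('result.xml').getroot()
# iter() also yields root itself when its tag matches, so this works
# whether <record> is the document root or nested under a wrapper.
rows = [record_tuple(rec) for rec in root.iter("record")]
rid = root.get("rid")  # attributes are read with get()

print(rows, rid)
```

Note the difference from the minidom version: there is no NodeList or Text node to step through, because each element exposes its text directly.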