Python ElementTree

rhexter · March 16, 2011, 3:58pm

so I’m a little puzzled about ElementTree:

We have a wonderful engineer-driven options file that I am finally getting round to reading all my settings from.
Said file is XML.
I need to parse this and have a little more control than minidom gives to locate data sets within this options file.
I’ve been looking at ElementTree.

when I fetch the root node I get no Namespace returned with the first Element, I get this:
<Element Node at 260af80>
and not something like
<Element {Atom Syndication Format namespace}title at e2b5d0>
as described in the Dive Into Python docs.

The file does not have a namespace:
<Node xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">

Am I sh*t out of luck with ElementTree or can I force an empty global namespace that allows me to use:
myNodes = root.findall('//{}Property')

BCrosbie · March 16, 2011, 5:16pm

It’s been a while since I’ve delved into XML, so take this with a grain of salt.

I think the way the XML spec is written the content of the node is actually a child of the element. The last time I used this was for HTML DOM parsing. Once I found the node I had to go ‘one more step’ to get to the content of the node.

“Element nodes does not have a text value.
The text of an element node is stored in a child node. This node is called a text node.
The way to get the text of an element, is to get the value of the child node (text node).”

Snippet comes from: http://www.w3schools.com/dom/dom_nodes_get.asp

Don’t know if it will be the same in this case, but since the XML spec is supposed to be rigidly adhered to, it’s probably a good first step.
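
For comparison, here is a minimal sketch of where the text ends up in each API, assuming a hypothetical one-element document. In the DOM the text sits in a child text node, whereas ElementTree exposes it as the `text` attribute of the element itself, so the 'one more step' is not needed there.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical one-element document, used only for illustration
XML = "<title>Metabolism</title>"

# DOM: the text lives in a child text node of the element
dom_title = minidom.parseString(XML).documentElement
dom_text = dom_title.firstChild.nodeValue

# ElementTree: the text is an attribute of the element itself
et_title = ET.fromstring(XML)
et_text = et_title.text

print(dom_text, et_text)
```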

btribble · March 16, 2011, 5:38pm

BTW, unless there is a reason why you can’t, you should switch to cElementTree instead. http://effbot.org/zone/celementtree.htm
Same API, but much faster and more compact.

Adam_Pletcher · March 16, 2011, 5:39pm

@rhexter:
Can you post the relevant parts of that XML file? It would make giving specific advice easier.

rhexter · March 17, 2011, 3:40pm

hmm… nah, posting development working files on a forum would be frowned on.

However the issue is within elementtree (and celementtree as well) and specifically around the ‘findall’ searching of elements. You need a namespace to be specified to search.

example:
myNodes = rootNode.findall('//{Atom Syndication Format namespace}Property')

will search for Nodes of type ‘Property’ within the namespace ‘Atom Syndication Format namespace’ within the whole open document ‘//’ starting from the ‘rootNode’.

This is of no use if your XML document has no namespace set, leaving you with the option of setting a namespace or using SAX, which can search without requiring a namespace.

I think ‘lxml.etree’ appears to suggest that it will search without the need for the namespace, but I need to try it out.

djTomServo · March 17, 2011, 4:35pm

[QUOTE=rhexter;9684]within elementtree (and celementtree as well) and specifically around the ‘findall’ searching of elements. You need a namespace to be specified to search.[/QUOTE]

Are you sure about that? We use cElementTree for all our XML needs and have never run into that limitation…

rhexter · March 17, 2011, 4:38pm

I pretty much get exactly this issue.

http://mail.python.org/pipermail/tutor/2005-December/044026.html

http://effbot.org/zone/element.htm#searching-for-subelements

and my file root Node looks like this:

<?xml version="1.0"?>
<Node xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Properties>
<Property name="name" value="Options" />
…

No namespace… so I can’t search my tree. If this limitation has been removed in cElementTree, I can’t see where it says so, given that it’s based on ElementTree. Am I missing something really obvious here? I guess you are using ‘iterparse’ extensively rather than ‘element.findall()’?
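
A minimal sketch of what ElementTree reports for a root like this, assuming a hypothetical cut-down version of the options file (the second `Property` is made up). The `xmlns:xsi` and `xmlns:xsd` declarations only bind prefixes; since no element uses those prefixes and there is no default `xmlns="..."`, the tags come back as plain names with no `{uri}` part.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down options file: prefixes are declared on the root,
# but no element uses them and there is no default namespace.
XML = """<Node xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Properties>
    <Property name="name" value="Options" />
    <Property name="example" value="hypothetical" />
  </Properties>
</Node>"""

root = ET.fromstring(XML)

# The tag is a plain name because the element is in no namespace
root_tag = root.tag

# iter() walks the whole subtree and matches on the plain tag name
# (on ElementTree 1.2 the same method is called getiterator())
property_names = [p.get("name") for p in root.iter("Property")]

print(root_tag, property_names)
```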

djTomServo · March 17, 2011, 4:56pm

[QUOTE=rhexter;9687]I guess you are using ‘iterparse’ extensively rather than the ‘element.findall()’ ?[/QUOTE]

Nope, 100% element.find and findall. I don’t think we use iterparse anywhere, actually.

rhexter · March 17, 2011, 8:16pm

import xml.etree.ElementTree as ET

def genData ():
    '''construct a test xml file'''

    strData = """<functions>
                    <molecular_class>Enzyme: Dehydrogenase</molecular_class>

                    <molecular_function>
                        <title>Catalytic activity</title>
                        <goid>0003824</goid>
                    </molecular_function>

                    <biological_processes>
                        <biological_process>
                            <title>Metabolism</title>
                            <goid>0008152</goid>
                        </biological_process>
                        <biological_process>
                            <title>Energy pathways</title>
                            <goid>0006091</goid>
                        </biological_process>
                    </biological_processes>
                </functions>"""

    return strData

def main():

    strData = genData()
    rootNode = ET.fromstring(strData)
    myNodes = rootNode.findall('//biological_process')

    for node in myNodes:
        print(node.tag)

if __name__ == '__main__':
    main()

replicates my issue and produces this stack trace:

Traceback (most recent call last):
  File "C:\Users\Genghis\My Dropbox\Rob\Python\ssgtools\xmltools\eltreeTest.py", line 52, in <module>
    main()
  File "C:\Users\Genghis\My Dropbox\Rob\Python\ssgtools\xmltools\eltreeTest.py", line 43, in main
    myNodes = rootNode.findall('//biological_process')
  File "C:\Python26\lib\xml\etree\ElementTree.py", line 355, in findall
    return ElementPath.findall(self, path)
  File "C:\Python26\lib\xml\etree\ElementPath.py", line 198, in findall
    return _compile(path).findall(element)
  File "C:\Python26\lib\xml\etree\ElementPath.py", line 176, in _compile
    p = Path(path)
  File "C:\Python26\lib\xml\etree\ElementPath.py", line 77, in __init__
    raise SyntaxError("cannot use absolute path on element")
SyntaxError: cannot use absolute path on element

I am running Python 2.6, and executing in the interpreter within PyScripter.

BCrosbie · March 17, 2011, 8:51pm

Let’s see if I can make up for my earlier unhelpful comment.

I ran this using Python 2.7 under IDLE and received the same stack trace you did.

The key error message is:
File "C:\Python27\lib\xml\etree\ElementPath.py", line 257, in iterfind
raise SyntaxError("cannot use absolute path on element")

In the source I changed line 32 from
myNodes = rootNode.findall('//biological_process')

to

myNodes = rootNode.findall('.//biological_process')

The addition of the . means to start at the current node and makes the path relative rather than absolute.

Once I made that change the code you posted gives me:
>>>
biological_process
biological_process
>>>

See if that fixes your parsing problem.
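
A minimal sketch of the two behaviours side by side, assuming a trimmed-down copy of the test data from the earlier post. The absolute path is rejected on an `Element`, while the relative one searches the subtree below it.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the test data, for illustration only
XML = """<functions>
  <biological_processes>
    <biological_process><title>Metabolism</title></biological_process>
    <biological_process><title>Energy pathways</title></biological_process>
  </biological_processes>
</functions>"""

root = ET.fromstring(XML)

# An absolute path ('//...') on an Element raises SyntaxError
try:
    root.findall('//biological_process')
    absolute_error = None
except SyntaxError as err:
    absolute_error = str(err)

# A relative path ('.//...') starts at the element it is called on
titles = [node.findtext('title')
          for node in root.findall('.//biological_process')]

print(absolute_error)
print(titles)
```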

rhexter · March 17, 2011, 9:19pm

:rolleyes:
Thank you!

Missing that explains why I ended up thinking I needed a namespace to give the search some sort of scope.

I don’t mind looking stupid about things I don’t know. :laugh:

djTomServo · March 18, 2011, 1:02am

You can throw in wildcards as well, so:

myNodes = rootNode.findall('.//biological_process')

could be

myNodes = rootNode.findall('*//biological_process')

BCrosbie · March 18, 2011, 6:05am

Seth’s wildcards are probably a better general purpose option, as long as you want to match as much as possible. I tried 4 different XPaths just to play around.

.//biological_process
./biological_process
*//biological_process
*/biological_process

.// <- starting at current node, show children at any depth
returned 2

./ <- starting at current node, show direct children
returned 0 because desired node enclosed in <biological_processes>

*// <- starting at any child of the current node (I think…), show children at any depth
returned 2

*/ <- match any child of the current node, and show the grandchildren named biological_process. The * matches biological_processes, and then finds the enclosed node
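
Here is a small sketch that counts the matches for all four paths, assuming a trimmed-down copy of the test data. It also includes a second, hypothetical document (`<a><x/><b><x/></b></a>`) to show where `.//` and `*//` differ: `*//` starts from the children, so a target that is itself a direct child of the current node is not matched.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the test data, for illustration only
XML = """<functions>
  <molecular_function><title>Catalytic activity</title></molecular_function>
  <biological_processes>
    <biological_process><title>Metabolism</title></biological_process>
    <biological_process><title>Energy pathways</title></biological_process>
  </biological_processes>
</functions>"""
root = ET.fromstring(XML)

paths = ['.//biological_process', './biological_process',
         '*//biological_process', '*/biological_process']
counts = dict((path, len(root.findall(path))) for path in paths)

# Hypothetical document where the target is also a direct child of the root
flat = ET.fromstring('<a><x/><b><x/></b></a>')
dot_count = len(flat.findall('.//x'))    # both <x/> elements
star_count = len(flat.findall('*//x'))   # only the <x/> nested inside <b>

for path in paths:
    print(path, counts[path])
print(dot_count, star_count)
```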

JasonB · March 18, 2011, 8:57am

I use cElementTree all the time …

There are different versions of it with slightly different functionality (although backwards compatible – not forwards.)
http://effbot.org/zone/element-xpath.htm <– great 1 page reference. The 1.3 is built into Python 2.7 for sure … but earlier versions of python have earlier versions of ET.

I can’t remember exactly – but I think you can only use namespaces properly with ElementTree – not cElementTree:
# setting up qualified namespaces used in the document
NS_MAP = {
    'urn:schemas-microsoft-com:office:spreadsheet': 'ss',
    'http://www.w3.org/TR/REC-html40': 'html',
}
etree._namespace_map.update(NS_MAP)

Here is a workaround tho that I used with cElementTree:

    URI = "{urn:schemas-microsoft-com:office:spreadsheet}"
    doc = parse(path)
    root = doc.getroot()
    rows = root.findall(".//" + URI + "Row")
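
A minimal sketch of the same idea on a self-contained document, assuming a hypothetical spreadsheet-like file that declares a default namespace. When a document really does use a namespace, every tag is stored as `{uri}name`, so the search has to spell the URI out. Current Python 3 versions of ElementTree also accept a prefix-to-URI dictionary through the `namespaces` argument of `findall`.

```python
import xml.etree.ElementTree as ET

SS = 'urn:schemas-microsoft-com:office:spreadsheet'

# Hypothetical document with a default namespace, for illustration only
XML = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet>
    <Table>
      <Row/>
      <Row/>
    </Table>
  </Worksheet>
</Workbook>"""
root = ET.fromstring(XML)

# Tags carry the namespace URI in braces
root_tag = root.tag

# Option 1: write the qualified name into the path
rows_by_uri = root.findall('.//{%s}Row' % SS)

# Option 2: use a prefix and pass the prefix-to-URI mapping
rows_by_prefix = root.findall('.//ss:Row', namespaces={'ss': SS})

print(root_tag, len(rows_by_uri), len(rows_by_prefix))
```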

rhexter · March 18, 2011, 10:18am

Thanks guys,

So I now jumped into trying to locate attributes on an element within the tree.

I modified one element:

<biological_process name="ass" value="mine">

Then my search:


myNodes = rootNode.findall(".//biological_process[@name='ass']")

now I get this,

Message	File Name	Line	Position	
Traceback				
    <module>	C:\data\Dropbox\Dropbox\Rob\Python\ssgtools\xmltools\eltreeTest.py	52		
    main	C:\data\Dropbox\Dropbox\Rob\Python\ssgtools\xmltools\eltreeTest.py	43		
    findall	C:\Python26\lib\xml\etree\ElementTree.py	355		
    findall	C:\Python26\lib\xml\etree\ElementPath.py	198		
    _compile	C:\Python26\lib\xml\etree\ElementPath.py	176		
    __init__	C:\Python26\lib\xml\etree\ElementPath.py	88		
SyntaxError: unsupported path syntax ([)

The example is suggesting:


[@attrib='value']
(New in 1.3) Selects all elements for which the given attribute has the given value.
For example, ".//div[@class='sidebar']" selects all "div" elements in the tree that has the class "sidebar".
In the current release, the value cannot contain quotes.

rhexter · March 18, 2011, 10:36am

Ah…

DATA
    VERSION = '1.2.6'
    __all__ = ['Comment', 'dump', 'Element', 'ElementTree', 'fromstring', ...

Damn
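
On an ElementTree that old, one possible workaround is to do the attribute test in Python instead of in the path. This is only a sketch, assuming hypothetical attribute values: the first query needs ElementTree 1.3 or later, the second uses nothing beyond a plain descendant search and `Element.get`.

```python
import xml.etree.ElementTree as ET

# Hypothetical data with made-up attribute values, for illustration only
XML = """<biological_processes>
  <biological_process name="first" value="one"/>
  <biological_process name="second" value="two"/>
</biological_processes>"""
root = ET.fromstring(XML)

# ElementTree 1.3 and later: attribute predicate inside the path
by_path = root.findall(".//biological_process[@name='first']")

# No predicate support needed: find by tag, then filter on the attribute
by_filter = [node for node in root.findall('.//biological_process')
             if node.get('name') == 'first']

print(len(by_path), len(by_filter))
```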

rhexter · March 18, 2011, 1:54pm

Should the 1.06 version of cElementTree allow for finding a parent node?

It appears that Maya 2009, which ships with Python 2.5, has ElementTree 1.2.6, but its cElementTree is 1.06, which should allow for a parent search…

this is confusing…

JasonB · March 18, 2011, 2:03pm

I can’t remember where I stole part of this – likely from the effbot.org site …

def getAEAncestors(root, element):
    """
    Iterative search for ancestors.
    returns a list of ancestors [ …, grandparent, parent, child]
    """

    # where c,p is child,parent
    # (getiterator() is the ElementTree 1.2 name; 1.3 and later call it iter())
    parent_map = dict((c, p) for p in root.getiterator() for c in p)

    ancestors = []
    # walk upwards while the current element still has a parent in the map
    while element in parent_map:
        ancestors.append(element.attrib["name"])
        element = parent_map[element]
    ancestors.reverse()

    return ancestors

The above function should start with the given element, and keep backing out – returning a list of parents to the root.

Notice that in the above example – my elements all had a “name” attribute – you may want to collect something else, like the actual elements, and then query the element’s name attribute.
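
A minimal sketch of the 'collect the actual elements' variant, assuming a trimmed-down document with no attributes at all. Unlike the function above it also includes the root in the result, and it uses `iter()`, which is the ElementTree 1.3+ name for `getiterator()`.

```python
import xml.etree.ElementTree as ET

# Trimmed-down document, for illustration only
XML = """<functions>
  <biological_processes>
    <biological_process/>
  </biological_processes>
</functions>"""
root = ET.fromstring(XML)

# child -> parent lookup, built once for the whole tree
parent_map = dict((child, parent) for parent in root.iter() for child in parent)

leaf = root.find('.//biological_process')

# Collect the elements themselves, walking up until there is no parent
lineage = []
node = leaf
while node is not None:
    lineage.append(node)
    node = parent_map.get(node)
lineage.reverse()

tags = [element.tag for element in lineage]
print(tags)
```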

rhexter · March 18, 2011, 2:47pm

Thanks, I turned this into a recursive function that steps up the tree as many levels as you want to go.


def getParent(childParentMap, element, num=1):
    '''
        Get the parent of the given node
        The parent of the given node will have a child that is the given node 'element'
        num: the number of steps up the tree you want to go, from where you are; it's optional
    '''
    # the value of the key (our element) will be its parent;
    # dict.get returns None (it never raises KeyError) when there is no parent
    element = childParentMap.get(element)

    if num > 1 and element is not None:
        # assume childParentMap is not dynamically changing
        element = getParent(childParentMap, element, num - 1)

    if num == 1:
        print('GetParent: %s' % element)

    return element
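
The same step-up can also be written as a loop. This is only a sketch: the helper name `ancestor` and the tiny document are hypothetical, and it returns `None` once it runs off the top of the tree.

```python
import xml.etree.ElementTree as ET


def ancestor(parent_map, element, steps=1):
    """Hypothetical helper: return the element `steps` levels above
    `element`, or None if the tree is not that deep."""
    for _ in range(steps):
        if element is None:
            break
        element = parent_map.get(element)
    return element


# Tiny document, for illustration only
root = ET.fromstring(
    '<functions><biological_processes><biological_process/>'
    '</biological_processes></functions>')
parent_map = dict((child, parent) for parent in root.iter() for child in parent)

leaf = root[0][0]
one_up = ancestor(parent_map, leaf)        # <biological_processes>
two_up = ancestor(parent_map, leaf, 2)     # <functions>
three_up = ancestor(parent_map, leaf, 3)   # past the root

print(one_up.tag, two_up.tag, three_up)
```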