We have a wonderful engineer-driven options file that I am finally getting round to reading all my settings from.
Said file is XML.
I need to parse this and have a little more control than minidom gives to locate data sets within this options file.
I’ve been looking at ElementTree.
When I fetch the root node I get no namespace returned with the first Element; I get this:
<Element Node at 260af80>
and not this, or something similar:
<Element {Atom Syndication Format namespace}title at e2b5d0>
as described in the Dive Into Python docs.
It’s been a while since I’ve delved into XML, so take this with a grain of salt.
I think that, the way the XML spec is written, the content of the node is actually a child of the element. The last time I used this was for HTML DOM parsing. Once I found the node I had to go ‘one more step’ to get to the content of the node.
“Element nodes does not have a text value.
The text of an element node is stored in a child node. This node is called a text node.
The way to get the text of an element, is to get the value of the child node (text node).”
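For what it's worth, the quote above describes the W3C DOM model, which is what minidom implements. ElementTree works differently: the text is exposed through the element's `.text` attribute rather than through a child text node. A minimal sketch of the difference (Python 3 syntax, using a made-up one-element document rather than the real options file):

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample document, not taken from the options file in question.
SAMPLE = "<option>value</option>"

# DOM (minidom): the text lives in a child text node of the element.
dom_root = minidom.parseString(SAMPLE).documentElement
dom_text = dom_root.firstChild.data

# ElementTree: the text is an attribute of the element itself.
et_root = ET.fromstring(SAMPLE)
et_text = et_root.text

print(dom_text, et_text)
```

So the ‘one more step’ is needed with minidom, but not with ElementTree.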
BTW, unless there is a reason why you can’t, you should switch to cElementTree instead: http://effbot.org/zone/celementtree.htm
Same API, but much faster and more compact.
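Because the API is the same, a common way to pick up cElementTree where it exists is a guarded import. This is only a sketch; the sample document is made up for illustration:

```python
# cElementTree was deprecated in Python 3.3 and removed in 3.9, where the
# plain ElementTree module uses the C accelerator automatically.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

root = ET.fromstring("<Node><Child/></Node>")  # hypothetical sample
child_tags = [child.tag for child in root]
print(child_tags)
```

The rest of the code only refers to `ET`, so it does not care which module was actually imported.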
hmm… nah, posting development working files on a forum would be frowned on.
However the issue is within elementtree (and celementtree as well) and specifically around the ‘findall’ searching of elements. You need a namespace to be specified to search.
will search for Nodes of type ‘Property’ within the namespace ‘Atom Syndication Format namespace’ within the whole open document ‘//’ starting from the ‘rootNode’.
This is of no use if your XML document has no namespace set, leaving you with the option of either setting a namespace or using SAX, which can search without requiring one.
‘lxml.etree’ appears to suggest that it will search without the need for a namespace, but I need to try it out.
rhexter wrote: "…within elementtree (and celementtree as well) and specifically around the ‘findall’ searching of elements. You need a namespace to be specified to search."
Are you sure about that? We use cElementTree for all our XML needs and have never run into that limitation…
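A quick way to check this is to run `findall` against the same structure with and without a namespace. The sketch below uses a made-up document (the element names, attribute names, and namespace URI are all hypothetical) and Python 3 syntax:

```python
import xml.etree.ElementTree as ET

# Hypothetical options file; the element and attribute names are made up.
PLAIN = "<Node><Group><Property name='a'/><Property name='b'/></Group></Node>"
# Same structure, but with a default namespace (the URI is also made up).
NAMESPACED = PLAIN.replace("<Node>", "<Node xmlns='urn:example:options'>")

plain_root = ET.fromstring(PLAIN)
ns_root = ET.fromstring(NAMESPACED)

# No namespace in the document: a bare tag name is enough.
plain_hits = plain_root.findall(".//Property")

# Namespace present: the tag has to be qualified as {uri}tag.
ns_miss = ns_root.findall(".//Property")
ns_hits = ns_root.findall(".//{urn:example:options}Property")

print(plain_root.tag, len(plain_hits))
print(ns_root.tag, len(ns_miss), len(ns_hits))
```

The `{…}` prefix only shows up in a tag when the document itself declares a namespace, which would also explain a root element that prints as plain `Node`.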
no namespace… so I can’t search my tree. If this limitation has been removed in cElementTree, I can’t see where that is stated, given that it is based on ElementTree. Am I missing something really obvious here? I guess you are using ‘iterparse’ extensively rather than ‘element.findall()’?
import xml.etree.ElementTree as ET

def genData():
    '''construct a test xml file'''
    strData = """<functions>
    <molecular_class>Enzyme: Dehydrogenase</molecular_class>
    <molecular_function>
        <title>Catalytic activity</title>
        <goid>0003824</goid>
    </molecular_function>
    <biological_processes>
        <biological_process>
            <title>Metabolism</title>
            <goid>0008152</goid>
        </biological_process>
        <biological_process>
            <title>Energy pathways</title>
            <goid>0006091</goid>
        </biological_process>
    </biological_processes>
</functions>"""
    return strData

def main():
    strData = genData()
    rootNode = ET.fromstring(strData)
    myNodes = rootNode.findall('//biological_process')
    for node in myNodes:
        print node.tag

if __name__ == '__main__':
    main()
This replicates my issue and produces this stack trace:
Traceback (most recent call last):
  File "C:\Users\Genghis\My Dropbox\Rob\Python\ssgtools\xmltools\eltreeTest.py", line 52, in <module>
    main()
  File "C:\Users\Genghis\My Dropbox\Rob\Python\ssgtools\xmltools\eltreeTest.py", line 43, in main
    myNodes = rootNode.findall('//biological_process')
  File "C:\Python26\lib\xml\etree\ElementTree.py", line 355, in findall
    return ElementPath.findall(self, path)
  File "C:\Python26\lib\xml\etree\ElementPath.py", line 198, in findall
    return _compile(path).findall(element)
  File "C:\Python26\lib\xml\etree\ElementPath.py", line 176, in _compile
    p = Path(path)
  File "C:\Python26\lib\xml\etree\ElementPath.py", line 77, in __init__
    raise SyntaxError("cannot use absolute path on element")
SyntaxError: cannot use absolute path on element
I am running Python 2.6, and executing in the interpreter within PyScripter.
Let’s see if I can make up for my earlier unhelpful comment.
I ran this using Python 2.7 under IDLE and received the same stack trace you did.
The key error message is:
  File "C:\Python27\lib\xml\etree\ElementPath.py", line 257, in iterfind
    raise SyntaxError("cannot use absolute path on element")
In the source I changed line 32 from
myNodes = rootNode.findall('//biological_process')
Seth’s wildcards are probably a better general-purpose option, as long as you want to match as much as possible. I tried 4 different XPaths just to play around.
.// <- starting at current node, show children at any depth
returned 2
./ <- starting at current node, show direct children
returned 0 because desired node enclosed in <biological_processes>
*// <- starting at any node at the current level in the hierarchy (I think…), show children at any depth
returned 2
*/ <- match any child of the current node, and show the grandchildren named biological_process. The * matches biological_processes, and then finds the enclosed node
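The four prefixes above can be compared side by side with a small script. This is a sketch in Python 3 syntax against a trimmed-down copy of the test document from earlier in the thread; the variable names are just for illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the test document from the earlier post.
XML = """<functions>
  <molecular_function><title>Catalytic activity</title></molecular_function>
  <biological_processes>
    <biological_process><title>Metabolism</title></biological_process>
    <biological_process><title>Energy pathways</title></biological_process>
  </biological_processes>
</functions>"""

root = ET.fromstring(XML)

# Count the matches for each relative prefix.
counts = {}
for prefix in (".//", "./", "*//", "*/"):
    counts[prefix] = len(root.findall(prefix + "biological_process"))
    print(prefix, counts[prefix])

# The absolute form is rejected when called on an element.
try:
    root.findall("//biological_process")
    absolute_error = None
except SyntaxError as exc:
    absolute_error = str(exc)
print(absolute_error)
```

On this document `.//`, `*//` and `*/` each find both nodes, while `./` finds none because the nodes are not direct children of the root.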
There are different versions of ElementTree with slightly different functionality (backwards compatible, but not forwards). http://effbot.org/zone/element-xpath.htm is a great one-page reference. Version 1.3 is built into Python 2.7 for sure, but earlier versions of Python ship earlier versions of ET.
I can’t remember exactly – but I think you can only use namespaces properly with ElementTree – not cElementTree:
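from xml.etree import ElementTree as etree  # assumed import; not shown in the original post

# setting up qualified namespaces used in the document
NS_MAP = {
    'urn:schemas-microsoft-com:office:spreadsheet': 'ss',
    'http://www.w3.org/TR/REC-html40': 'html',
}
# _namespace_map is a private ElementTree dictionary mapping URI -> prefix
etree._namespace_map.update(NS_MAP)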
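`_namespace_map` is a private dictionary. From ElementTree 1.3 (Python 2.7) onwards there is a public `register_namespace()` function that does the same job. Note that it only controls which prefix is written out when serialising; it does not change how `findall` matches tags. A small sketch in Python 3 syntax:

```python
import xml.etree.ElementTree as ET

SS_URI = "urn:schemas-microsoft-com:office:spreadsheet"

# Public equivalent of updating the private _namespace_map dictionary.
ET.register_namespace("ss", SS_URI)

# Build a namespaced element and serialise it; the registered prefix
# is used instead of an auto-generated one such as ns0.
row = ET.Element("{%s}Row" % SS_URI)
serialised = ET.tostring(row).decode("ascii")
print(serialised)
```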
Here is a workaround, though, that I used with cElementTree:
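from xml.etree.cElementTree import parse  # assumed import (Python 2.x); not shown in the original post

URI = "{urn:schemas-microsoft-com:office:spreadsheet}"
doc = parse(path)  # path: location of the XML file, defined elsewhere
root = doc.getroot()
# prefix the tag with the namespace URI in braces
rows = root.findall(".//" + URI + "Row")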
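For a self-contained version of the same idea, the sketch below parses a made-up, heavily trimmed spreadsheet document from a string (the real file layout is an assumption) and shows two equivalent ways of qualifying the tag. The `namespaces` argument to `findall` is available in ElementTree 1.3 and later; the code is Python 3 syntax:

```python
import xml.etree.ElementTree as ET

SS_URI = "urn:schemas-microsoft-com:office:spreadsheet"

# Hypothetical, heavily trimmed spreadsheet document for illustration.
DOC = """<Workbook xmlns="%s">
  <Worksheet><Table><Row/><Row/><Row/></Table></Worksheet>
</Workbook>""" % SS_URI

root = ET.fromstring(DOC)

# Form 1: splice the {uri} prefix into the path by hand.
rows_by_uri = root.findall(".//{%s}Row" % SS_URI)

# Form 2: pass a prefix-to-URI mapping and use the prefix in the path.
rows_by_prefix = root.findall(".//ss:Row", namespaces={"ss": SS_URI})

print(len(rows_by_uri), len(rows_by_prefix))
```

Both forms select the same elements; the second just keeps the long URI out of the path string.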
Message       File Name                                                           Line  Position
Traceback
  <module>    C:\data\Dropbox\Dropbox\Rob\Python\ssgtools\xmltools\eltreeTest.py  52
  main        C:\data\Dropbox\Dropbox\Rob\Python\ssgtools\xmltools\eltreeTest.py  43
  findall     C:\Python26\lib\xml\etree\ElementTree.py                            355
  findall     C:\Python26\lib\xml\etree\ElementPath.py                            198
  _compile    C:\Python26\lib\xml\etree\ElementPath.py                            176
  __init__    C:\Python26\lib\xml\etree\ElementPath.py                            88
SyntaxError: unsupported path syntax ([)
The example is suggesting:
[@attrib='value']
(New in 1.3) Selects all elements for which the given attribute has the given value.
For example, ".//div[@class='sidebar']" selects all “div” elements in the tree that has the class “sidebar”.
In the current release, the value cannot contain quotes.
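Since the predicate syntax is marked as new in 1.3, the "unsupported path syntax ([)" error is what you would expect from the older ElementTree bundled with Python 2.6. On an interpreter that ships 1.3 or later the same expression works. A minimal sketch in Python 3 syntax, with a made-up document that only borrows the div/class names from the quoted example:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the div/class names follow the quoted example.
HTML = """<body>
  <div class="sidebar"><p>one</p></div>
  <div class="main"><div class="sidebar"><p>two</p></div></div>
</body>"""

root = ET.fromstring(HTML)

# ET.VERSION reports the ElementTree version bundled with this Python.
print(ET.VERSION)

# Select every div, at any depth, whose class attribute equals 'sidebar'.
sidebars = root.findall(".//div[@class='sidebar']")
texts = [div.find("p").text for div in sidebars]
print(texts)
```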
I can’t remember where I stole part of this from, most likely the effbot.org site…
def getAEAncestors(root, element):
    """
    Iterative search for ancestors.
    returns a list of ancestors [ …, grandparent, parent, child]
    """
    # where c,p is child,parent
    # (getiterator() is the Python 2.6 spelling; later versions use root.iter())
    parent_map = dict((c, p) for p in root.getiterator() for c in p)
    ancestors = []
    # compare with None explicitly: an Element without children is falsy
    while parent_map.get(element) is not None:
        ancestors.append(element.attrib["name"])
        element = parent_map[element]
    ancestors.reverse()
    return ancestors
The above function should start with the given element and keep backing out, returning a list of parents up to the root.
Notice that in the above example my elements all had a “name” attribute. You may want to collect something else, like the actual elements, and then query each element's name attribute.
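As a sketch of that variation, the version below collects the ancestor elements themselves instead of their “name” attributes. It is written for Python 3 (`root.iter()` and a dict comprehension), and both the `get_ancestors` name and the sample document are made up for illustration:

```python
import xml.etree.ElementTree as ET

def get_ancestors(root, element):
    """Return [root, ..., grandparent, parent] for element (hypothetical helper)."""
    # ElementTree elements keep no parent pointer, so build a child -> parent map.
    parent_map = {child: parent for parent in root.iter() for child in parent}
    ancestors = []
    parent = parent_map.get(element)
    while parent is not None:
        ancestors.append(parent)
        parent = parent_map.get(parent)
    ancestors.reverse()
    return ancestors

# Hypothetical document for illustration.
root = ET.fromstring("<a><b><c><d/></c></b></a>")
target = root.find(".//d")
ancestor_tags = [elem.tag for elem in get_ancestors(root, target)]
print(ancestor_tags)
```

Returning elements leaves it to the caller to decide whether to read `.tag`, `.attrib["name"]`, or anything else from each ancestor.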
Thanks, I turned this into a recursive function that steps up the tree as many levels as you want to go.
def getParent(childParentMap, element, num=1):
    '''
    Get the parent of the given node.
    The parent of the given node will have a child that is the given node 'element'.
    num: optional number of steps up the tree to go from the given node
    '''
    # the value of the key (our element) will be its parent;
    # dict.get() returns None for a missing key, so no try/except is needed
    element = childParentMap.get(element)
    if num > 1 and element is not None:
        # assume childParentMap is not dynamically changing
        element = getParent(childParentMap, element, num - 1)
    if num == 1:
        print 'GetParent: ',
        print element
    return element
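The same walk can also be written as a loop, which avoids the recursion and makes the "ran off the top of the tree" case explicit. This is a sketch in Python 3 syntax, not a drop-in replacement; the `get_parent` name and the sample document are assumptions:

```python
import xml.etree.ElementTree as ET

def get_parent(child_parent_map, element, num=1):
    """Walk num steps up the tree; return None if the walk passes the root."""
    for _ in range(num):
        if element is None:
            break
        element = child_parent_map.get(element)
    return element

# Hypothetical document and child -> parent map for illustration.
root = ET.fromstring("<a><b><c><d/></c></b></a>")
child_parent_map = {c: p for p in root.iter() for c in p}
d = root.find(".//d")

print(get_parent(child_parent_map, d).tag)
print(get_parent(child_parent_map, d, 2).tag)
```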