Monday, February 9, 2009

Python and XML

AH..... XML. We store the shading data for our game assets as xml files, via the Collada schema. It's pretty much a custom, in-house implementation of what Collada has to offer. In the past I've had to read data from these files, but rarely did I have to edit them via code. I'd written a variety of MEL scripts to suck info out, but recently I've been required to edit them as well. Python to the rescue, right?

Weeeellll, sort of. Python seems to have many different modules for interacting with xml: xml.dom, xml.dom.minidom, xml.sax, etc. I started out with minidom (the "mini" part lulling me into a false sense of security), but it was nothing but pain. I'm new to this, and no expert, but it seemed to require a pile of extra code to do the simplest things. And when I'd save an xml file, it'd mangle the formatting: technically it'd still be a valid xml file, but it would start to put element\tag data on the wrong lines, reducing its human readability. I started querying many different development communities, but either no one could really explain what was going on, or they didn't understand the problem.

Enter xml.etree.ElementTree. SO much easier to use than minidom: much less code, and it just makes more sense. But even it had a formatting problem similar to minidom's. It just didn't make any sense.

Finally, I got some feedback: Apparently, those modules will treat the "return\tab\whitespace" chars in the xml files as pseudo-secretive xml 'text' nodes. If you add a new element, you need to jump through a bunch of hoops in relation to these mysterious\hidden text nodes, so when you print\write to your xml, it looks correct. I still can't believe this is the default behavior of those modules.
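To make those hidden text nodes visible, here's a minimal sketch. The tiny XML string is made up for illustration (it isn't from our Collada files); it just shows where each module stashes the whitespace:

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Made-up sample: a root with two tab-indented children.
XML = "<root>\n\t<item/>\n\t<item/>\n</root>"

# minidom: the whitespace between elements shows up as real Text nodes,
# sitting in the child list alongside the elements.
doc = minidom.parseString(XML)
node_names = [node.nodeName for node in doc.documentElement.childNodes]

# ElementTree: no separate nodes, but the same whitespace is stored in
# the parent's .text (before the first child) and in each child's .tail
# (after that child's closing tag).
root = ET.fromstring(XML)
root_text = root.text
tails = [child.tail for child in root]

print(node_names)        # ['#text', 'item', '#text', 'item', '#text']
print(repr(root_text))   # '\n\t'
print(tails)             # ['\n\t', '\n']
```

So the two elements are really five child nodes as far as minidom is concerned, and in ElementTree the indentation travels around as .text and .tail strings attached to the elements.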

What it boiled down to was this: If you make a new xml from scratch using minidom or ElementTree, the formatting will be a-ok. But if you edit a pre-existing xml, adding new elements, and you're not careful, you can mangle the formatting.
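Here's a rough sketch of that mangling with ElementTree, plus one way to avoid it. The function names and sample XML are hypothetical, and the fix assumes the parent already has at least one child to copy whitespace from; it isn't necessarily the approach written up on the wiki:

```python
import xml.etree.ElementTree as ET

# Made-up sample: a root with one tab-indented child.
XML = "<root>\n\t<item/>\n</root>"

def append_naive(xml_text):
    """Append a child and ignore whitespace: the new element has no
    tail, so it lands unindented, jammed against the closing tag."""
    root = ET.fromstring(xml_text)
    ET.SubElement(root, "item")
    return ET.tostring(root, encoding="unicode")

def append_indented(xml_text):
    """Append a child and shuffle the whitespace so it lines up.
    Assumes root already has at least one child."""
    root = ET.fromstring(xml_text)
    last = root[-1]
    new = ET.SubElement(root, "item")
    new.tail = last.tail   # newline before the closing tag moves to the new child
    last.tail = root.text  # old last child now gets the sibling indentation
    return ET.tostring(root, encoding="unicode")

print(repr(append_naive(XML)))     # '<root>\n\t<item />\n<item /></root>'
print(repr(append_indented(XML)))  # '<root>\n\t<item />\n\t<item />\n</root>'
```

On Python 3.9 and later, xml.etree.ElementTree.indent() can re-indent a whole tree in one call, which may be simpler if you don't need to preserve the file's original whitespace exactly.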

I documented the whole technical mess over on my Python Wiki, with source code examples for the solution.

2 comments:

Damian Cugley said...

This is often a problem with using XML to represent data structures -- the features of XML designed to be useful for marking up documents get in the way and require extra programming effort. There are other text-based formats for data that may be a better fit, such as JSON.

AkEric said...

Heh, tell me about it. And as an xml parsing noob, it was amazing how many groups of people I asked that really had no concept of what I was talking about. I got quite a few replies saying "is it really that important that it's human readable?" I thought that was part of the point of xml... I must have been asking the wrong groups! ;)