My God, it's Full of XML
22 September 2008
In recent posts I looked at a native XML database called DBXML and we looked at where XML came from.
You may find yourself in the situation that you are given a pile of XML documents, possibly broken, and it is up to you to make sense of them. This post explains some tools that can form your first-aid kit for dealing with problem XML documents.
Shine like a star(let)
xmlstarlet is available from your friendly neighbourhood package manager or from the xmlstarlet website.
xmlstarlet is a command line toolkit that provides various XML-related helpers. For details on all the xmlstarlet tools, type:
xmlstarlet --help
Brock wrote recently about xmlstarlet's select tool, which lets you use XPath expressions to query your XML.
Viewing the element structure
Another handy xmlstarlet tool is the element structure viewer, which provides a friendly, XPath-style view into the XML document.
xmlstarlet el filename.xml
I tend to use this with the -u option, which only shows the unique lines:
xmlstarlet el -u filename.xml
There is also -a for attributes and -v for the attribute values as well.
Checking for well-formed XML documents
The most useful xmlstarlet tool for me has been the XML validator, which tests whether your documents are well formed or not. You use the tool as follows:
xmlstarlet val xmlfile.xml
It also has a number of options. The main one I have used validates against a Document Type Definition:
xmlstarlet val -d dtdfile.dtd xmlfile.xml
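When there is a whole pile of files, it can help to list only the ones that fail. The following is a minimal sketch rather than a definitive recipe: it assumes xmlstarlet is on your PATH and that "val -q" stays quiet and signals failure through its exit status. The function name list_malformed_xml is made up for this example.

```shell
# Print the path of every *.xml file under a directory that fails
# the well-formedness check. Assumes xmlstarlet is installed and that
# "xmlstarlet val -q FILE" exits non-zero when FILE is not well formed.
list_malformed_xml() {
    find "${1:-.}" -type f -name '*.xml' | while IFS= read -r xml_file; do
        if ! xmlstarlet val -q "$xml_file" >/dev/null 2>&1; then
            printf '%s\n' "$xml_file"
        fi
    done
}
```

Calling list_malformed_xml . checks the current directory and everything below it. File names containing newlines would confuse the loop, which is unlikely to matter for a first-aid check.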
Tidying up your XML files
Sometimes programs output really ugly looking XML. So when you have made sure your document is well-formed with xmlstarlet, you might want to tidy it up a bit before letting anyone else see it.
Xmltidy is a handy little Java program that loads your XML document into memory and then outputs it in a nice looking form with linebreaks and indentation.
This is especially useful when you have a collection of XML files that are referencing each other. Xmltidy will combine them into a nice looking XML document.
Download the jar file from the xmltidy homepage, and then run:
java -jar xmltidy.jar --input oldfile.xml --output newfile.xml
Dealing with Unicode problems
Some of the most annoying problems with XML files occur when a file's encoding is not valid UTF-8 and some program rejects the file.
I found a really nice package called uniutils, which is again available from your friendly neighbourhood package manager or from the uniutils website.
Like xmlstarlet, this gives you various utilities, but the main one I use checks whether my XML files are valid UTF-8. It gives useful error messages when a file is not valid UTF-8. You can then check the file in a text editor and/or hex viewer (e.g. Ghex) to see what the problem is. So to validate an XML file, we simply run:
uniname -V filename.xml
If it contains bytes that are not valid UTF-8, you will receive errors such as:
Invalid UTF-8 code encountered at line 215, character 115037, byte 115036. The first byte, value 0x82, with bit pattern 10000010, is not a valid first byte of a UTF-8 sequence because its high bits are 10.
So the byte with hex value x82 is not valid at this position in UTF-8: a byte whose high bits are 10 can only continue a multi-byte sequence, never start one. In Emacs you can look at the character by typing
M-x goto-char 115037
Or you can open your hex editor. In Ghex, you can go to the Edit menu and use the "Goto byte" feature to jump to the problem character by entering the byte number from the error message (119, say).
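If no graphical hex editor is to hand, od can show the bytes around the reported position. This is only a sketch: it assumes the byte number in the error message is a zero-based offset, and the function name show_bytes_near is invented for illustration.

```shell
# Dump 16 bytes of a file, starting up to 8 bytes before a given offset.
# Assumes the offset is zero-based; uses only POSIX od options.
show_bytes_near() {
    dump_file=$1
    byte_offset=$2
    if [ "$byte_offset" -gt 8 ]; then
        start=$((byte_offset - 8))
    else
        start=0
    fi
    # -A d: decimal addresses, -t x1: one hex byte per column,
    # -j: number of bytes to skip, -N: number of bytes to show
    od -A d -t x1 -j "$start" -N 16 "$dump_file"
}
```

For the error shown above, show_bytes_near filename.xml 115036 should put the offending 82 in the middle of the first output line.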
That works for one character. If we want to recursively check all XML files within a directory, we can use find:
find . -name '*.xml' -print -exec uniname -V {} \;
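If uniutils is not installed, a rough substitute is to ask iconv to convert each file from UTF-8 to UTF-8, which fails when the input contains an invalid sequence. Note that this swaps in a different tool from the one used above and gives far less helpful error messages. The function name list_invalid_utf8 is made up for this sketch.

```shell
# Print the path of every *.xml file under a directory that is not valid UTF-8.
# Swapped-in technique: iconv instead of uniname. Converting UTF-8 to UTF-8
# exits non-zero when the input cannot be decoded.
list_invalid_utf8() {
    find "${1:-.}" -type f -name '*.xml' | while IFS= read -r xml_file; do
        if ! iconv -f UTF-8 -t UTF-8 "$xml_file" >/dev/null 2>&1; then
            printf '%s\n' "$xml_file"
        fi
    done
}
```

Once this has narrowed down the list, uniname -V on each reported file gives the line and byte position of the problem.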
So now let's imagine we find that the files contain the invalid byte x82 as above. We might then want to replace it with a character or entity. The following use of find and sed replaces all occurrences of the byte x82 with C:
find . -iname '*.xml' -exec sed -i 's/\x82/C/g' {} \;
This can help a lot, as most XML programs will reject files with inconsistent encoding.
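Editing files in place is easy to get wrong, so here is a slightly more cautious sketch of the same replacement. It assumes GNU sed (for the \x82 escape and for -i with a backup suffix). The function name replace_byte_82 and the .bak suffix are arbitrary choices for this example.

```shell
# Replace every raw 0x82 byte in one file with a replacement string,
# keeping the original as FILE.bak. Assumes GNU sed.
replace_byte_82() {
    target_file=$1
    replacement=${2:-C}   # defaults to "C", as in the command above
    # LC_ALL=C makes sed work on single bytes instead of decoding characters,
    # so the invalid byte can still be matched in a UTF-8 locale.
    LC_ALL=C sed -i.bak "s/\x82/$replacement/g" "$target_file"
}
```

For example, replace_byte_82 filename.xml ',' swaps each stray byte for a comma. The replacement must not contain a slash or an ampersand, since sed treats both specially here.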
Conclusion
These are my tips for dealing with a pile of broken XML files. If you have any tips or suggestions of your own, please share them by leaving a comment below.
In some future posts, we will look at using XML with Python, and with the Django web framework.
Thanks to Andy and Nick for help with this post, and the title was based on Tommi Virtanen's fantastic Europython talk.




1 Andrew West says...
O.k. time to be nit picky (but not on the subject you expect).
All due respect to Virtanen, I think you'll find the original was Arthur C. Clarke, 2001 A Space Odyssey, "Oh my God! It's full of stars"
carry on.
Posted at 8:31 p.m. on September 22, 2008
2 David Jones says...
See also xkcd: http://xkcd.com/224/ My God! It's full of 'car's.
Posted at 2:25 a.m. on September 23, 2008
3 sikanrong says...
I totally just read 2063: Odyssey Two -- f**king amazing blog title
Posted at 9:25 p.m. on October 30, 2008
4 Mike says...
>The most useful xmlstarlet tool for me has been the XML validator,
>which tests whether your documents are well formed or not. You
>use the tool as follows:
>xmlstarlet val xmlfile.xml
Also check the xmlwf linux command - it does the same thing. It's in the 'expat' package.
Posted at 12:44 p.m. on November 29, 2008