My God, it's Full of XML

22 September 2008

In recent posts I looked at a native XML database called DBXML and at where XML came from.

You may find yourself in the situation that you are given a pile of XML documents, possibly broken, and it is up to you to make sense of them. This post explains some tools that can form your first-aid kit for dealing with problem XML documents.

Shine like a star(let)

xmlstarlet is available from your friendly neighbourhood package manager or from the xmlstarlet website.

xmlstarlet is a command-line toolkit that provides various XML-related helpers. For details on all the xmlstarlet tools, type:

xmlstarlet --help

Brock wrote recently about xmlstarlet's select tool, which lets you use XPath expressions to query your XML.

Viewing the element structure

Another handy xmlstarlet tool is the element structure viewer. This provides a friendly, XPath-style view into the XML document.

xmlstarlet el filename.xml

I tend to use the -u option, which shows only the unique lines:

xmlstarlet el -u filename.xml

There is also -a for attributes and -v for the attribute values as well.
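
If xmlstarlet is not to hand, a similar listing of unique element paths can be sketched with Python's standard library. This is only an illustration of the idea, not a reimplementation of xmlstarlet: the function name and the sample document are made up for the example, and the ordering is simply first-seen order.

```python
import xml.etree.ElementTree as ET


def unique_element_paths(xml_text):
    """Return each distinct element path once, in first-seen order."""
    root = ET.fromstring(xml_text)
    seen = []

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}" if prefix else element.tag
        if path not in seen:
            seen.append(path)
        for child in element:
            walk(child, path)

    walk(root, "")
    return seen


# Hypothetical sample document, invented for illustration.
sample = "<library><book><title>A</title></book><book><title>B</title></book></library>"
for path in unique_element_paths(sample):
    print(path)
```

Note that ElementTree reports namespaced tags in the form {uri}name, so paths in documents that use namespaces will look different from the prefixed names in the file.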

Checking for well-formed XML documents

The most useful xmlstarlet tool for me has been the XML validator, which tests whether your documents are well formed or not. You use the tool as follows:

xmlstarlet val xmlfile.xml

It also has a number of options. The main one I have used validates a document against a Document Type Definition:

xmlstarlet val -d dtdfile.dtd xmlfile.xml
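
The well-formedness half of this check can also be sketched in Python, which may be handy inside a larger script. The helper name below is hypothetical. The standard library parser is non-validating, so this sketch covers only well-formedness and not the DTD check.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_bytes):
    """Return None if the document parses, else a short error description."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as error:
        # position is a (line, column) tuple supplied by the parser.
        line, column = error.position
        return f"line {line}, column {column}: {error}"
    return None


print(check_well_formed(b"<a><b/></a>"))  # parses cleanly
print(check_well_formed(b"<a><b></a>"))   # mismatched closing tag
```

For a file on disk, passing Path(name).read_bytes() from pathlib should work in the same way.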

Tidying up your XML files

Sometimes programs output really ugly looking XML. So when you have made sure your document is well-formed with xmlstarlet, you might want to tidy it up a bit before letting anyone else see it.

Xmltidy is a handy little Java program that loads your XML document into memory and then outputs it in a nice looking form with linebreaks and indentation.

This is especially useful when you have a collection of XML files that are referencing each other. Xmltidy will combine them into a nice looking XML document.

Download the jar file from the xmltidy homepage, and then run:

java -jar xmltidy.jar --input oldfile.xml --output newfile.xml
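
If you do not have Java available, a rough equivalent of the indentation step can be sketched with Python's xml.dom.minidom. This is a stand-in and not how Xmltidy itself works. The function name is invented for the example, and the sketch makes no attempt to combine files that reference each other.

```python
from xml.dom import minidom


def pretty_xml(xml_text):
    """Re-serialise a document with line breaks and two-space indentation."""
    document = minidom.parseString(xml_text)
    return document.toprettyxml(indent="  ")


# Hypothetical one-line input, as an ugly-printing program might emit it.
print(pretty_xml("<a><b>text</b><c/></a>"))
```

One caveat: if the input already contains line breaks and indentation, minidom keeps those whitespace nodes and the output gains extra blank lines, so this works best on XML that arrives on a single line.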

Dealing with Unicode problems

Some of the most annoying problems with XML files arise when a file's encoding is not valid UTF-8 and some program rejects the file.

I found a really nice package called uniutils, which is again available from your friendly neighbourhood package manager or from the uniutils website.

Like xmlstarlet, this gives you various utilities; the main one I use checks whether my XML files are valid UTF-8. It gives useful error messages when a file is not valid UTF-8. You can then check the file in a text editor and/or hex viewer (e.g. Ghex) to see what the problem is. To validate an XML file, we simply run:

uniname -V filename.xml

If the file contains bytes that are not valid UTF-8, you will receive errors such as:

Invalid UTF-8 code encountered at line 215, character 115037, byte 115036. The first byte, value 0x82, with bit pattern 10000010, is not a valid first byte of a UTF-8 sequence because its high bits are 10.

So the byte with hex value 0x82 cannot start a character in the UTF-8 encoding. In Emacs you can jump to the reported character position by typing:

M-x goto-char 115037

Or you can open your hex editor. In Ghex, go to the Edit menu and use the "Goto byte" feature to jump to the problem byte. For example, if the byte number were 119, it would look like this:

http://media.commandline.org.uk/images/posts/gnome/ghex.png
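
The same kind of check can be sketched in Python, which reports the offset of the first undecodable byte. The function name and the sample bytes are invented for the example. The offset Python reports is zero-based, so it may not line up exactly with the numbering uniname prints. A byte of 0x82 has the bits 10000010, so its two high bits are 10, which marks it as a continuation byte rather than the start of a sequence.

```python
def first_invalid_utf8(data):
    """Return (byte_offset, byte_value) of the first invalid byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        # error.start is the zero-based offset where decoding failed.
        return error.start, data[error.start]
    return None


# Hypothetical sample: otherwise valid text with a stray 0x82 byte.
sample = b"<name>caf\x82</name>"
problem = first_invalid_utf8(sample)
if problem is not None:
    offset, value = problem
    print(f"byte {offset}: 0x{value:02X} = {value:08b}")
```

For a real file, read it in binary mode first, for example with Path(name).read_bytes() from pathlib.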

That works for one file. If we want to recursively check all XML files within a directory, we can use find:

find . -name '*.xml' -print -exec uniname -V {} \;
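
A recursive check along the same lines can be sketched in Python with pathlib. This is an illustration under assumptions rather than a replacement for uniname: the function name and the file names in the demonstration are hypothetical, and only the first bad byte in each file is reported.

```python
import tempfile
from pathlib import Path


def invalid_utf8_files(directory):
    """Yield (path, byte_offset) for each .xml file that is not valid UTF-8."""
    for path in sorted(Path(directory).rglob("*.xml")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as error:
            # error.start is the zero-based offset of the first bad byte.
            yield path, error.start


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.xml").write_bytes(b"<a>ok</a>")
    Path(tmp, "nested").mkdir()
    Path(tmp, "nested", "bad.xml").write_bytes(b"<a>\x82</a>")
    for path, offset in invalid_utf8_files(tmp):
        print(f"{path.name}: invalid byte at offset {offset}")
```

To scan a real tree, call invalid_utf8_files with the directory you care about instead of the temporary one.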

Now let's imagine we find that the files contain the invalid byte 0x82, as above. We might then want to replace it with a character or an entity. The following use of find and sed replaces all occurrences of the byte 0x82 with C:

find . -iname '*.xml' -exec sed -i 's/\x82/C/g' {} \;

This can help a lot, as most XML programs will reject files with inconsistent encoding.
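
One thing to watch for, which is an assumption about your data and not something known about the files above: 0x82 can also appear legitimately as the second byte of a multi-byte UTF-8 character, and a blind byte-level replacement would damage those characters too. A more cautious approach can be sketched in Python using the surrogateescape error handler, which touches only the bytes that failed to decode. The function name is hypothetical.

```python
def replace_stray_byte(data, bad=0x82, replacement="C"):
    """Replace undecodable occurrences of one byte (0x80-0xFF) in UTF-8 data."""
    # surrogateescape maps each undecodable byte 0xNN to the code point
    # U+DCNN and leaves valid multi-byte sequences untouched.
    text = data.decode("utf-8", errors="surrogateescape")
    text = text.replace(chr(0xDC00 + bad), replacement)
    # Any other stray bytes are written back unchanged.
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical sample: a valid two-byte character (0xC5 0x82) followed by
# a stray 0x82 byte.
sample = b"\xc5\x82\x82"
print(replace_stray_byte(sample))
print(sample.replace(b"\x82", b"C"))  # blind replacement, for comparison
```

The first result keeps the valid character intact and replaces only the stray byte, while the blind replacement also rewrites the second byte of the valid character.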

Conclusion

These are my tips for dealing with a pile of broken XML files. If you have any tips or suggestions of your own, please share them by leaving a comment below.

In some future posts, we will look at using XML with Python, and with the Django web framework.

Thanks to Andy and Nick for help with this post. The title was based on Tommi Virtanen's fantastic EuroPython talk.


1 Andrew West says...

O.k. time to be nit picky (but not on the subject you expect).

All due respect to Virtanen, I think you'll find the original was Arthur C. Clarke, 2001 A Space Odyssey, "Oh my God! It's full of stars"

carry on.

Posted at 8:31 p.m. on September 22, 2008


2 David Jones says...

See also xkcd: http://xkcd.com/224/ My God! It's full of 'car's.

Posted at 2:25 a.m. on September 23, 2008


3 sikanrong says...

I totally just read 2063: Odyssey Two -- f**king amazing blog title

Posted at 9:25 p.m. on October 30, 2008


4 Mike says...

> The most useful xmlstarlet tool for me has been the XML validator, which tests whether your documents are well formed or not. You use the tool as follows:
> xmlstarlet val xmlfile.xml

Also check the xmlwf linux command - it does the same thing. It's in the 'expat' package.

Posted at 12:44 p.m. on November 29, 2008



About

Hello, my name is Zeth, I'll be your host here.
