James Henstridge
Most users of the libxml (aka gnome-xml) library tend to use the DOM style tree interface for reading documents. This is generally quite an easy interface to use, but can use quite a bit of memory. An alternative is to use the SAX interface in libxml, which is a port of the Simple API for XML library for Java.
This article is aimed at people who understand and have used the libxml DOM style interface and want to explore the SAX interface. Some examples are biased toward use in GTK+/GNOME programming.
Most people will agree that the most intuitive representation for arbitrary XML data is a tree. Libxml provides a nice API to construct a tree from an XML file, which makes it very easy to use. But in some cases, the tree representation does not match the internal representation you wish to use in your program, so it may not be the best choice. You may end up using libxml to construct the tree, extracting the data and then throw away the tree.
In these cases libxml's SAX API may be a better choice. Note that there are a few drawbacks to the use of this API though:
SAX parsers generally require you to write a bit more code than the DOM interface.
Unless you build a DOM style tree from your application's internal representation for the data, you can't as easily write the XML file back to disk. This is not a concern if your program only reads XML files, and does not write them.
You may have to redesign or rethink your file loading routines.
It is not all bad however. There are benefits to the use of the SAX API:
The DOM tree is not constructed, so there are potentially less memory allocations.
If you convert the data in the DOM tree to another format, the SAX API may help remove the intermediate step.
If you do not need all the XML data in memory, the SAX API allows you to process the data as it is parsed.
As you know, the DOM style interface (ie. what is returned by xmlParseFile or xmlParseMemory) produces a tree of xmlNode structures that can then be walked to extract the data.
The SAX interface works quite differently on the other hand. You instead pass a number of callback routines to the parser in a xmlSAXHandler structure. The parser will then parse the document and call the appropriate callback when certain conditions occur.
Some of the more useful callbacks are startDocument, endDocument, startElement, endElement, getEntity and characters. Most of these are quite self explanatory. The characters callback is called when characters outside a tag are parsed.
As you can see, the code for a SAX style parser seems almost inside out, when compared with the DOM style tree method. For this reason, a state machine aproach is useful.
It is interesting to note that the xmlParseFile function and friends are all implemented in terms of the SAX interface, so no power is lost by using this interface.
To give you some idea of where SAX may be useful, some examples may help.
Consider an XML document structured like follows:
<?xml version="1.0"?> <array> <number>0</number> <number>1</number> ... <number>42</number> </array> |
The file describes an array of numbers. In this case, the most appropriate data structure to represent the information as would be an array -- not a tree. By using the SAX interface, we can write the numbers directly to an array rather than using the DOM tree as an intermediate format.
As I said earlier, state machines make using the SAX interface much easier. At any one time, you are in one state. The startElement and endElement callbacks cause the state to change. You will also want to perform some other actions during state changes -- in this case, build the array.
For this example, we can use four states:
| START |
| INSIDE_ARRAY |
| INSIDE_NUMBER |
| FINISH |
In the startDocument callback, we would initialise any variables that are required during parsing. This would include the array to hold the numbers (a glib GArray would be useful), and a buffer to hold character data (a glib GString is a good choice here). It would also set the initial state to START. the endDocument callback, we can free these variables (we would leave the array though).
The interesting code is in the startElement and endElement callbacks. What they do will depend on the current state, and the element name that is being opened or closed.
For the array example, the startElement function would act as follows:
If we are in the START state and the element name is array, then change to the INSIDE_ARRAY state. Any other names would be an error
If we are in the INSIDE_ARRAY state, and the element name is number, change to the INSIDE_NUMBER state. We would also zero the character data buffer. Any other element would be an error in this state.
If we are in the INSIDE_NUMBER or FINISH states, it is an error.
The characters function would append the character data to the buffer if it was in the INSIDE_NUMBER state. If it is in any other states, the data could be discarded.
The endElement function would act as follows:
If we are in the INSIDE_NUMBER state, the character data buffer should hold the number. We convert the string to an integer and append it to the array. We then change to the INSIDE_ARRAY state.
If we are in the INSIDE_ARRAY state, change to the FINNISH state.
If we are in the START or FINNISH states, an error has occured
Note that we know what to do in the endElement function even without looking at the element name. By using a state machine model like this, we can narrow down the number of possible element names, which reduces string comparisons as well.
Sometimes we have XML files with many subtrees of the same format describing different things. An example of this is the fullIndex.rdf.gz. The file contains repeating sections like follows:
...
<RDF:Description about="ftp://rpmfind.net/linux/redhat/redhat-6.0/i386/RedHat/RPMS/emacs-X11-20.3-15.i386.rpm">
<RPM:Name>emacs-X11</RPM:Name>
<RPM:Summary>The Emacs text editor for the X Window System.</RPM:Summary>
</RDF:Description>
<RDF:Description about="ftp://rpmfind.net/linux/Mandrake/6.0/Mandrake/RPMS/emacs-el-20.3-14mdk.i586.rpm">
<RPM:Name>emacs-el</RPM:Name>
<RPM:Summary>The sources for elisp programs included with Emacs.</RPM:Summary>
</RDF:Description>
... |
One operation that we might want to do is to search for a package by name. For this type of operation, at any one time, we are only concerned with one subtree in the document -- the others need not be in memory. For this reason, a SAX based parser would be a good idea.
With the DOM tree aproach, the memory usage of the program will increase as the size of the index file increases. With the SAX based aproach, the memory usage should be fairly constant despite changes in the size of the index.
This benefit is particularly useful when parsing XML documents of sizes similar to the RDF dumps of the Open Directory Project. The overhead of the DOM tree can become unacceptable when parsing 35 megabyte XML files.
For a working example of this sort of situation, see the XML parser in /gnorpm/find/search.c.
Sometimes it may make sense to use the SAX interface even if the data in the XML file is inherently tree based.
The DOM style interface of libxml is designed to be able to represent any well formed XML document. But the save format of most applications generally uses only a subset of XML. For instance, GLADE does not use attributes for any elements, so you could argue that the support for attributes in the DOM interface is bloat when used with GLADE.
Since the XML you are reading in has some known constraints, it is usually possible to store the information in more compact structures.
I am looking at implementing something like this in libglade. In this situation, the changeover looks like it will decrease the memory requirements of the internal structures by a factor of four, and has increased the speed of the parser slightly.
You can look at the source for the new libglade SAX based parser at /libglade/test/saxp.c. There is also a diagram that describes the transitions between the different states in the parser at /libglade/test/states.dia.
In this particular type of situation, it is worth spending a bit more time deciding between DOM style and SAX interfaces. If you keep with the DOM interface, your program will probably use a bit more memory, but it has the advantage that you can modify the tree, and then write the XML file back to disk with a single function call. If you switch over to a different representation, you will need to either convert the information to the DOM style tree representation and then dump it to a file, or write your own output routines.
Another deciding factor is laziness. If you use the SAX interface, you will need to write some parsing code. On the other hand, the DOM interface does this for you.
In the libglade case, memory usage and speed were considered important, and the XML data did not have to be written back to a file, so choice of a SAX parser was relatively easy.
| Next | ||
| Implementing a SAX Based Parser |