You should now have a pretty good idea of what a SAX parser is, and how you might implement one using a state machine design. It is time to learn how to implement the parser using libxml's C translation of the SAX interface.
As stated before, you use the SAX parser by passing a number of callback routines, stored in an xmlSAXHandler structure, to one of the SAX parser routines. Here is the definition of the structure:
typedef struct xmlSAXHandler {
    internalSubsetSAXFunc internalSubset;
    isStandaloneSAXFunc isStandalone;
    hasInternalSubsetSAXFunc hasInternalSubset;
    hasExternalSubsetSAXFunc hasExternalSubset;
    resolveEntitySAXFunc resolveEntity;
    getEntitySAXFunc getEntity;
    entityDeclSAXFunc entityDecl;
    notationDeclSAXFunc notationDecl;
    attributeDeclSAXFunc attributeDecl;
    elementDeclSAXFunc elementDecl;
    unparsedEntityDeclSAXFunc unparsedEntityDecl;
    setDocumentLocatorSAXFunc setDocumentLocator;
    startDocumentSAXFunc startDocument;
    endDocumentSAXFunc endDocument;
    startElementSAXFunc startElement;
    endElementSAXFunc endElement;
    referenceSAXFunc reference;
    charactersSAXFunc characters;
    ignorableWhitespaceSAXFunc ignorableWhitespace;
    processingInstructionSAXFunc processingInstruction;
    commentSAXFunc comment;
    warningSAXFunc warning;
    errorSAXFunc error;
    fatalErrorSAXFunc fatalError;
} xmlSAXHandler;

typedef xmlSAXHandler *xmlSAXHandlerPtr;
To start off with, we can set all these functions to NULL. If we use an all-NULL SAX handler like this, then we will have a parser that only checks the well-formedness of a document. By adding a few callbacks, we can get it to do just about anything.
If you do not care about reentrancy of your parser, you can save state between callbacks in global variables. If you choose this approach, you could use one of the following functions:
#include <parser.h>
xmlDocPtr xmlSAXParseFile(xmlSAXHandlerPtr sax, const char *filename, int recovery);
xmlDocPtr xmlSAXParseMemory(xmlSAXHandlerPtr sax, char *buffer, int size, int recovery);
In these functions, you will most likely ignore the return value (it will probably be NULL or garbage). However, if you want to write a reentrant parser, you will need to make some changes. A user_data parameter is passed to all SAX callbacks, which can be used to pass state information between callbacks. Unfortunately, with the current libxml API, there is not an easy way to set this parameter, so you will need to include a rewritten version of xmlSAXParseFile in your program. Daniel says that he will probably change the above functions in a future version to be more useful in this case. Until then, you will probably want to use a function like this:
#include <parser.h>
#include <parserInternals.h>

int myXmlSAXParseFile(xmlSAXHandlerPtr sax, void *user_data, const char *filename) {
    int ret = 0;
    xmlParserCtxtPtr ctxt;

    ctxt = xmlCreateFileParserCtxt(filename);
    if (ctxt == NULL) return -1;

    /* install our callbacks and the state pointer they will receive */
    ctxt->sax = sax;
    ctxt->userData = user_data;

    xmlParseDocument(ctxt);

    /* report whether the document was well formed */
    if (ctxt->wellFormed)
        ret = 0;
    else
        ret = -1;

    /* detach our handler so that freeing the context leaves it alone */
    if (sax != NULL)
        ctxt->sax = NULL;
    xmlFreeParserCtxt(ctxt);

    return ret;
}
This function could then be used like so:
static xmlSAXHandler my_handler = {
    ...
};

struct ParserState {
    RetVal return_val;
    StatesEnum state;
    ...
};

RetVal parse_xml_file(const char *filename) {
    struct ParserState my_state;

    if (myXmlSAXParseFile(&my_handler, &my_state, filename) < 0) {
        free_ret_val(my_state.return_val);
        return NULL;
    } else
        return my_state.return_val;
}

In this example, we expect the startDocument SAX handler to initialise the ParserState structure passed to it, and the endDocument handler to free its members while leaving return_val intact so that it can be used later.
The startDocument and endDocument callbacks are generally used to perform some initialisation and deinitialisation for your parser callbacks. Their prototypes are as follows:
void startDocument(void *user_data);
void endDocument(void *user_data);
It should be fairly self explanatory how to write these functions.
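As a rough illustration, the pair might look something like the sketch below. The structure members, the state names and the use of an int result are all hypothetical stand-ins (the real ParserState would hold whatever your parser needs), and nothing here depends on libxml itself.

```c
#include <stdlib.h>

/* Hypothetical state values; a real parser would list its own. */
enum DocState { STATE_START, STATE_FINISH };

/* Hypothetical state structure handed to every callback via user_data. */
struct DocParserState {
    enum DocState state;
    char *text;        /* scratch buffer for character data */
    size_t text_len;
    int *return_val;   /* result that must outlive the parse */
};

/* Called before the other callbacks: put the state in a known condition. */
void my_startDocument(void *user_data)
{
    struct DocParserState *ps = user_data;

    ps->state = STATE_START;
    ps->text = NULL;
    ps->text_len = 0;
    ps->return_val = calloc(1, sizeof *ps->return_val);
}

/* Called at the end of the document: free scratch data, keep the result. */
void my_endDocument(void *user_data)
{
    struct DocParserState *ps = user_data;

    free(ps->text);
    ps->text = NULL;
    ps->text_len = 0;
    ps->state = STATE_FINISH;
    /* return_val is deliberately left alone so the caller can use it. */
}
```

The caller remains responsible for releasing return_val once it has finished with the result.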
The characters callback is called when character data outside of tags is parsed. Its prototype is as follows:
void characters(void *user_data, const CHAR *ch, int len);
The CHAR type is an alias for char. It is used so that it will be easier to add Unicode support to the parser at a later date. For all intents and purposes though, you can think of ch as an array of chars. Note that the character data is not necessarily nul terminated. This is so that libxml does not need to copy the character data out of its internal buffers before passing it to the callback.
In your callback, you will probably want to copy the characters to some other buffer so that they can be used from the endElement callback. To optimise this callback a bit, you might adjust the callback so that it only copies the characters if the parser is in a certain state. Note that the characters callback may be called more than once between calls to startElement and endElement.
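One possible way to handle both points (no nul terminator, and several calls per element) is to append each chunk to a growable buffer. The sketch below assumes a hypothetical TextBuffer structure passed as user_data and uses plain char in place of CHAR, so it can be tried without libxml.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulation buffer kept in the user_data structure. */
struct TextBuffer {
    char *data;   /* nul terminated whenever it is non-NULL */
    size_t len;   /* characters stored, not counting the nul */
};

/* Sketch of a characters callback: append len characters from ch.
 * char stands in for CHAR here. */
void my_characters(void *user_data, const char *ch, int len)
{
    struct TextBuffer *buf = user_data;
    char *grown;

    if (len <= 0)
        return;

    /* room for the old text, the new chunk and a terminator */
    grown = realloc(buf->data, buf->len + (size_t)len + 1);
    if (grown == NULL)
        return;   /* out of memory: keep what we already have */

    memcpy(grown + buf->len, ch, (size_t)len);
    buf->len += (size_t)len;
    grown[buf->len] = '\0';   /* add the terminator the parser does not supply */
    buf->data = grown;
}
```

The endElement callback can then read buf->data as an ordinary C string, and should free it and reset the buffer before the next element starts collecting text.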
The startElement and endElement callbacks are where most of the state machine logic will go. Their prototypes are:
void startElement(void *user_data, const CHAR *name, const CHAR **attrs);
void endElement(void *user_data, const CHAR *name);
In these callbacks, the name parameter is the name of the element. The attrs parameter contains the attributes for the start tag. The even indices in the array will be attribute names, the odd indices are the values, and the final index will contain a NULL.
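To make the layout concrete, here is a small lookup helper that walks the array two entries at a time. The name find_attr is made up for this example, plain char is used in place of CHAR, and the check for a NULL array is a defensive assumption rather than something stated above.

```c
#include <stddef.h>
#include <string.h>

/* Return the value of the attribute called name, or NULL if it is absent.
 * attrs is laid out as name, value, name, value, ..., NULL. */
const char *find_attr(const char **attrs, const char *name)
{
    size_t i;

    if (attrs == NULL)   /* defensive: allow for no array at all */
        return NULL;

    /* even indices are names, the following odd index is the value */
    for (i = 0; attrs[i] != NULL && attrs[i + 1] != NULL; i += 2) {
        if (strcmp(attrs[i], name) == 0)
            return attrs[i + 1];
    }
    return NULL;
}
```

A startElement callback could call find_attr(attrs, "id") and copy the result if it needs to keep it, since the returned pointer refers to memory owned by the parser.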
In most parsers, as well as making state transitions in these callbacks, you will probably also collect the data in the XML file. In the startElement callback, you will often allocate structures to hold the data. In the endElement callback, you will usually interpret the character data collected by the characters callback and put the data in one of the structures allocated by startElement. The endElement callback may also free any intermediate structures that are no longer needed.
You may have been wondering how entities (e.g. &lt;) are handled by the SAX interface. This is done by the getEntity callback:
xmlEntityPtr getEntity(void *user_data, const CHAR *name);
The xmlEntity structure holds some information about the entity. This structure will not be freed by the parser, so it makes sense to create these structures once, and then pass a pointer to the appropriate one when this function is called. After calling getEntity, the expansion of the entity is passed to the characters callback. This way, you do not need to worry about decoding entities anywhere else in your callback routines.
If your XML document only contains the standard entities (&lt;, &gt;, &apos;, &quot; and &amp;), then it is possible to write a very short implementation for this callback:
static xmlEntityPtr my_getEntity(void *user_data, const CHAR *name) {
    return xmlGetPredefinedEntity(name);
}
For most parsers, this will be sufficient.
If there are structural errors in the XML file, the parser will call one of three error callbacks: warning, error or fatalError.
If you want to pass these errors to the standard glib logging functions, you might want to use an implementation something like this:
#include <stdarg.h>
#include <glib.h>

static void my_warning(void *user_data, const char *msg, ...) {
    va_list args;

    va_start(args, msg);
    g_logv("XML", G_LOG_LEVEL_WARNING, msg, args);
    va_end(args);
}

static void my_error(void *user_data, const char *msg, ...) {
    va_list args;

    va_start(args, msg);
    g_logv("XML", G_LOG_LEVEL_CRITICAL, msg, args);
    va_end(args);
}

static void my_fatalError(void *user_data, const char *msg, ...) {
    va_list args;

    va_start(args, msg);
    /* G_LOG_LEVEL_ERROR is always fatal in glib: this call aborts the program */
    g_logv("XML", G_LOG_LEVEL_ERROR, msg, args);
    va_end(args);
}
Note that libxml is not a validating parser, so only structural errors will be picked up. Any validation of the format will have to be done by your parser routines.
With most applications, you will want to add to the XML file format as you add features to the application. For this reason, you will want to code your callbacks so that they don't barf on an unknown or unexpected tag.
With the DOM style interface, if you come to a node with an unexpected name, you will usually ignore it, and the subtree under it. It is probably a good idea to use a similar process for a SAX based parser.
To implement this sort of error recovery, we will need an extra state for the parser -- UNKNOWN. We will also need to pass two extra variables in the user_data parameter to the callbacks -- prev_state and unknown_depth.
When we hit an unknown element in the startElement callback, we can save the current state to prev_state, and then change the state to UNKNOWN, and set unknown_depth to 1. If startElement is called while in the UNKNOWN state, we increment the unknown_depth variable.
In the endElement callback, if we are in the UNKNOWN state, decrement unknown_depth. If unknown_depth is zero, change the state to prev_state. The characters callback should probably return immediately if in the UNKNOWN state as well.
Using this sort of logic, it should be possible to ignore unknown sections of the document quite easily. The UNKNOWN state is also useful when writing the parser. This way you can test out portions of the parser before it is complete.
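The recovery scheme described above can be sketched as follows. The document shape (a list element containing item elements), the state names and the SkipState structure are all invented for the illustration, and char stands in for CHAR so that the logic can be exercised without libxml.

```c
#include <string.h>

/* Hypothetical states for a document shaped like <list><item/>...</list>. */
enum SkipStateId { ST_START, ST_IN_LIST, ST_IN_ITEM, ST_FINISH, ST_UNKNOWN };

struct SkipState {
    enum SkipStateId state;
    enum SkipStateId prev_state;  /* state to restore after an unknown subtree */
    int unknown_depth;            /* nesting depth inside the unknown subtree */
    int items;                    /* number of <item> elements recognised */
};

/* Remember where we were and start ignoring a subtree. */
void enter_unknown(struct SkipState *ps)
{
    ps->prev_state = ps->state;
    ps->state = ST_UNKNOWN;
    ps->unknown_depth = 1;
}

void skip_startElement(void *user_data, const char *name, const char **attrs)
{
    struct SkipState *ps = user_data;

    (void)attrs;   /* attributes are not needed for this sketch */
    switch (ps->state) {
    case ST_START:
        if (strcmp(name, "list") == 0)
            ps->state = ST_IN_LIST;
        else
            enter_unknown(ps);
        break;
    case ST_IN_LIST:
        if (strcmp(name, "item") == 0) {
            ps->state = ST_IN_ITEM;
            ps->items++;
        } else {
            enter_unknown(ps);
        }
        break;
    case ST_UNKNOWN:
        ps->unknown_depth++;   /* one level deeper in the ignored subtree */
        break;
    default:                   /* no child elements expected in this state */
        enter_unknown(ps);
        break;
    }
}

void skip_endElement(void *user_data, const char *name)
{
    struct SkipState *ps = user_data;

    (void)name;
    switch (ps->state) {
    case ST_UNKNOWN:
        if (--ps->unknown_depth == 0)
            ps->state = ps->prev_state;   /* the ignored subtree is closed */
        break;
    case ST_IN_ITEM:
        ps->state = ST_IN_LIST;
        break;
    case ST_IN_LIST:
        ps->state = ST_FINISH;
        break;
    default:
        break;
    }
}
```

Note that an item element nested inside an unknown element is skipped along with everything else in that subtree, which matches the behaviour you would get from ignoring the node in a DOM style parser.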