Action against software patentsGnome2 LogoW3C logoRed Hat Logo
Made with Libxslt Logo

The XSLT C library for Gnome

Library internals

Main Menu
Related links
API Indexes

Table of contents

Introduction

This document describes the processing of libxslt, the XSLTC library developed for the Gnomeproject.

Note: this documentation is by definition incomplete and I am not good atspelling, grammar, so patches and suggestions are really welcome.

Basics

XSLT is a transformation language. It takes an input document and astylesheet document and generates an output document:

the XSLT processing model

Libxslt is written in C. It relies on libxml, the XML C library for Gnome, forthe following operations:

  • parsing files
  • building the in-memory DOM structure associated with the documentshandled
  • the XPath implementation
  • serializing back the result document to XML and HTML. (Text is handleddirectly.)

Keep it simple stupid

Libxslt is not very specialized. It is built under the assumption that allnodes from the source and output document can fit in the virtual memory ofthe system. There is a big trade-off there. It is fine for reasonably sizeddocuments but may not be suitable for large sets of data. The gain is that itcan be used in a relatively versatile way. The input or output may never beserialized, but the size of documents it can handle are limited by the sizeof the memory available.

More specialized memory handling approaches are possible, like buildingthe input tree from a serialization progressively as it is consumed,factoring repetitive patterns, or even on-the-fly generation of the output asthe input is parsed but it is possible only for a limited subset of thestylesheets. In general the implementation of libxslt follows the followingpattern:

  • KISS (keep it simple stupid)
  • when there is a clear bottleneck optimize on top of this simpleframework and refine only as much as is needed to reach the expectedresult

The result is not that bad, clearly one can do a better job but morespecialized too. Most optimization like building the tree on-demand wouldneed serious changes to the libxml XPath framework. An easy step would be toserialize the output directly (or call a set of SAX-like output handler tokeep this a flexible interface) and hence avoid the memory consumption of theresult.

The libxml nodes

DOM-like trees, as used and generated by libxml and libxslt, arerelatively complex. Most node types follow the given structure except a fewvariations depending on the node type:

description of a libxml node

Nodes carry a nameand the node typeindicates the kind of node it represents, the most common ones are:

  • document nodes
  • element nodes
  • text nodes

For the XSLT processing, entity nodes should not be generated (i.e. theyshould be replaced by their content). Most nodes also contains the following"navigation" informations:

  • the containing document
  • the parentnode
  • the first childrennode
  • the lastchildren node
  • the previous sibling
  • the following sibling (next)

Elements nodes carries the list of attributes in the properties, anattribute itself holds the navigation pointers and the children list (theattribute value is not represented as a simple string to allow usage ofentities references).

The nspoints to the namespace declaration for thenamespace associated to the node, nsDefis the linked listof namespace declaration present on element nodes.

Most nodes also carry an _privatepointer which can beused by the application to hold specific data on this node.

The XSLT processing steps

There are a few steps which are clearly decoupled at the interfacelevel:

  1. parse the stylesheet and generate a DOM tree
  2. take the stylesheet tree and build a compiled version of it (thecompilation phase)
  3. take the input and generate a DOM tree
  4. process the stylesheet against the input tree and generate an outputtree
  5. serialize the output tree

A few things should be noted here:

  • the steps 1/ 3/ and 5/ are optional
  • the stylesheet obtained at 2/ can be reused by multiple processing 4/(and this should also work in threaded programs)
  • the tree provided in 2/ should never be freed using xmlFreeDoc, but byfreeing the stylesheet.
  • the input tree 4/ is not modified except the _private field which maybe used for labelling keys if used by the stylesheet

The XSLT stylesheet compilation

This is the second step described. It takes a stylesheet tree, and"compiles" it. This associates to each node a structure stored in the_private field and containing information computed in the stylesheet:

a compiled XSLT stylesheet

One xsltStylesheet structure is generated per document parsed for thestylesheet. XSLT documents allow includes and imports of other documents,imports are stored in the importslist (hence keeping thetree hierarchy of includes which is very important for a proper XSLTprocessing model) and includes are stored in the doclistlist. An imported stylesheet has a parent link to allow browsing of thetree.

The DOM tree associated to the document is stored in doc.It is preprocessed to remove ignorable empty nodes and all the nodes in theXSLT namespace are subject to precomputing. This usually consist ofextracting all the context information from the context tree (attributes,namespaces, XPath expressions), and storing them in an xsltStylePreCompstructure associated to the _privatefield of the node.

A couple of notable exceptions to this are XSLT template nodes (more onthis later) and attribute value templates. If they are actually templates,the value cannot be computed at compilation time. (Some preprocessing couldbe done like isolation and preparsing of the XPath subexpressions but it'snot done, yet.)

The xsltStylePreComp structure also allows storing of the precompiled formof an XPath expression that can be associated to an XSLT element (more onthis later).

The XSLT template compilation

A proper handling of templates lookup is one of the keys of fast XSLTprocessing. (Given a node in the source document this is the process offinding which templates should be applied to this node.) Libxslt follows thehint suggested in the 5.2Patternssection of the XSLT Recommendation, i.e. it doesn't evaluate itas an XPath expression but tokenizes it and compiles it as a set of rules tobe evaluated on a candidate node. There usually is an indication of the nodename in the last step of this evaluation and this is used as a key check forthe match. As a result libxslt builds a relatively more complex set ofstructures for the templates:

The templates related structure

Let's describe a bit more closely what is built. First the xsltStylesheetstructure holds a pointer to the template hash table. All the XSLT patternscompiled in this stylesheet are indexed by the value of the the targetelement (or attribute, pi ...) name, so when a element or an attribute "foo"needs to be processed the lookup is done using the name as a key.

Each of the patterns is compiled into an xsltCompMatch structure. It holdsthe set of rules based on the tokenization of the pattern stored in reverseorder (matching is easier this way). It also holds some information about theprevious matches used to speed up the process when one iterates over a set ofsiblings. (This optimization may be defeated by trashing when runningthreaded computation, it's unclear that this is a big deal in practice.)Predicate expressions are not compiled at this stage, they may be at run-timeif needed, but in this case they are compiled as full XPath expressions (theuse of some fixed predicate can probably be optimized, they are not yet).

The xsltCompMatch are then stored in the hash table, the clash list isitself sorted by priority of the template to implement "naturally" the XSLTpriority rules.

Associated to the compiled pattern is the xsltTemplate itself containingthe information required for the processing of the pattern including, ofcourse, a pointer to the list of elements used for building the patternresult.

Last but not least a number of patterns do not fit in the hash tablebecause they are not associated to a name, this is the case for patternsapplying to the root, any element, any attributes, text nodes, pi nodes, keysetc. Those are stored independently in the stylesheet structure as separatelinked lists of xsltCompMatch.

The processing itself

The processing is defined by the XSLT specification (the basis of thealgorithm is explained in the Introductionsection). Basically it works by taking the root of the input document andapplying the following algorithm:

  1. Finding the template applying to it. This is a lookup in the templatehash table, walking the hash list until the node satisfies all the stepsof the pattern, then checking the appropriate(s) global templates to seeif there isn't a higher priority rule to apply
  2. If there is no template, apply the default rule (recurse on thechildren)
  3. else walk the content list of the selected templates, for each of them:
    • if the node is in the XSLT namespace then the node has a _privatefield pointing to the preprocessed values, jump to the specificcode
    • if the node is in an extension namespace, look up the associatedbehavior
    • otherwise copy the node.

    The closure is usually done through the XSLTapply-templatesconstruct recursing by applying theadequate template on the input node children or on the result of anassociated XPath selection lookup.

Note that large parts of the input tree may not be processed by a givenstylesheet and that on the opposite some may be processed multiple times.(This often is the case when a Table of Contents is built).

The module transform.cis the one implementing most of thislogic. xsltApplyStylesheet()is the entry point, itallocates an xsltTransformContext containing the following:

  • a pointer to the stylesheet being processed
  • a stack of templates
  • a stack of variables and parameters
  • an XPath context
  • the template mode
  • current document
  • current input node
  • current selected node list
  • the current insertion points in the output document
  • a couple of hash tables for extension elements and functions

Then a new document gets allocated (HTML or XML depending on the type ofoutput), the user parameters and global variables and parameters areevaluated. Then xsltProcessOneNode()which implements the1-2-3 algorithm is called on the root element of the input. Step 1/ isimplemented by calling xsltGetTemplate(), step 2/ isimplemented by xsltDefaultProcessOneNode()and step 3/ isimplemented by xsltApplyOneTemplate().

XPath expression compilation

The XPath support is actually implemented in the libxml module (where itis reused by the XPointer implementation). XPath is a relatively classicexpression language. The only uncommon feature is that it is working on XMLtrees and hence has specific syntax and types to handle them.

XPath expressions are compiled using xmlXPathCompile().It will take an expression string in input and generate a structurecontaining the parsed expression tree, for example the expression:

/doc/chapter[title='Introduction']

will be compiled as

Compiled Expression : 10 elements
  SORT
    COLLECT  'child' 'name' 'node' chapter
      COLLECT  'child' 'name' 'node' doc
        ROOT
      PREDICATE
        SORT
          EQUAL =
            COLLECT  'child' 'name' 'node' title
              NODE
            ELEM Object is a string : Introduction
              COLLECT  'child' 'name' 'node' title
                NODE

This can be tested using the testXPathcommand (in thelibxml codebase) using the --treeoption.

Again, the KISS approach is used. No optimization is done. This could bean interesting thing to add. MichaelKay describesa lot of possible and interesting optimizations done inSaxon which would be possible at this level. I'm unsure they would providemuch gain since the expressions tends to be relatively simple in general andstylesheets are still hand generated. Optimizations at the interpretationsounds likely to be more efficient.

XPath interpretation

The interpreter is implemented by xmlXPathCompiledEval()which is the front-end to xmlXPathCompOpEval()the functionimplementing the evaluation of the expression tree. This evaluation followsthe KISS approach again. It's recursive and callsxmlXPathNodeCollectAndTest()to collect nodes set whenevaluating a COLLECTnode.

An evaluation is done within the framework of an XPath context stored inan xmlXPathContextstructure, in the framework of atransformation the context is maintained within the XSLT context. Its contentfollows the requirements from the XPath specification:

  • the current document
  • the current node
  • a hash table of defined variables (but not used by XSLT)
  • a hash table of defined functions
  • the proximity position (the place of the node in the current nodelist)
  • the context size (the size of the current node list)
  • the array of namespace declarations in scope (there also is a namespacehash table but it is not used in the XSLT transformation).

For the purpose of XSLT an extrapointer has been addedallowing to retrieve the XSLT transformation context. When an XPathevaluation is about to be performed, an XPath parser context is allocatedcontaining and XPath object stack (this is actually an XPath evaluationcontext, this is a remain of the time where there was no separate parsing andevaluation phase in the XPath implementation). Here is an overview of the setof contexts associated to an XPath evaluation within an XSLTtransformation:

The set of contexts associated

Clearly this is a bit too complex and confusing and should be refactoredat the next set of binary incompatible releases of libxml. For example thexmlXPathCtxt has a lot of unused parts and should probably be merged withxmlXPathParserCtxt.

Description of XPath Objects

An XPath expression manipulates XPath objects. XPath defines the defaulttypes boolean, numbers, strings and node sets. XSLT adds the result treefragment type which is basically an unmodifiable node set.

Implementation-wise, libxml follows again a KISS approach, thexmlXPathObject is a structure containing a type description and the variouspossibilities. (Using an enum could have gained some bytes.) In the case ofnode sets (or result tree fragments), it points to a separate xmlNodeSetobject which contains the list of pointers to the document nodes:

An Node set object pointing to

The XPath API(andits 'internal'part) includes a number of functions to create, copy, compare, convert orfree XPath objects.

XPath functions

All the XPath functions available to the interpreter are registered in thefunction hash table linked from the XPath context. They all share the samesignature:

void xmlXPathFunc (xmlXPathParserContextPtr ctxt, int nargs);

The first argument is the XPath interpretation context, holding theinterpretation stack. The second argument defines the number of objectspassed on the stack for the function to consume (last argument is on top ofthe stack).

Basically an XPath function does the following:

  • check nargsfor proper handling of errors or functionswith variable numbers of parameters
  • pop the parameters from the stack using obj =valuePop(ctxt);
  • do the function specific computation
  • push the result parameter on the stack using valuePush(ctxt,res);
  • free up the input parameters withxmlXPathFreeObject(obj);
  • return

Sometime the work can be done directly by modifying in-situ the top objecton the stack ctxt->value.

The XSLT variables stack frame

Not to be confused with XPath object stack, this stack holds the XSLTvariables and parameters as they are defined through the recursive calls ofcall-template, apply-templates and default templates. This is used to definethe scope of variables being called.

This part seems to be the most urgent attention right now, first it isdone in a very inefficient way since the location of the variables andparameters within the stylesheet tree is still done at run time (it reallyshould be done statically at compile time), and I am still unsure that myunderstanding of the template variables and parameter scope is actuallyright.

This part of the documentation is still to be written once this part ofthe code will be stable. TODO

Extension support

There is a separate document explaining how theextension support works.

Further reading

Michael Kay wrote areally interesting article on Saxon internalsand the work he did onperformance issues. I wishes I had read it before starting libxslt design (Iwould probably have avoided a few mistakes and progressed faster). A lot ofthe ideas in his papers should be implemented or at least tried inlibxslt.

The libxml documentation, especially the I/O interfacesand the memory management.

TODOs

redesign the XSLT stack frame handling. Far too much work is done atexecution time. Similarly for the attribute value templates handling, atleast the embedded subexpressions ought to be precompiled.

Allow output to be saved to a SAX like output (this notion of SAX like APIfor output should be added directly to libxml).

Implement and test some of the optimization explained by Michael Kayespecially:

  • static slot allocation on the stack frame
  • specific boolean interpretation of an XPath expression
  • some of the sorting optimization
  • Lazy evaluation of location path. (this may require more changes butsounds really interesting. XT does this too.)
  • Optimization of an expression tree (This could be done as a completelyindependent module.)

Error reporting, there is a lot of case where the XSLT specificationspecify that a given construct is an error are not checked adequately bylibxslt. Basically one should do a complete pass on the XSLT spec again andadd all tests to the stylesheet compilation. Using the DTD provided in theappendix and making direct checks using the libxml validation API sounds agood idea too (though one should take care of not raising errors forelements/attributes in different namespaces).

Double check all the places where the stylesheet compiled form might bemodified at run time (extra removal of blanks nodes, hint on thexsltCompMatch).

Daniel Veillard