
The XML C parser and toolkit of Gnome

Encodings support


If you are not really familiar with Internationalization (the usual shortcut is I18N), Unicode, characters and glyphs, I suggest you read a presentation by Tim Bray on Unicode and why you should care about it.

If you don't understand why it does not make sense to have a string without knowing what encoding it uses, then as Joel Spolsky said, please do not write another line of code until you finish reading that article. It is a prerequisite to understand this page, and to avoid a lot of problems with libxml2, XML or text processing in general.

Table of Contents:

  1. What does internationalization support mean?
  2. The internal encoding, how and why
  3. How is it implemented?
  4. Default supported encodings
  5. How to extend the existing support

What does internationalization support mean?

XML was designed from the start to allow the support of any character set by using Unicode. Any conformant XML parser has to support the UTF-8 and UTF-16 default encodings, which can both express the full Unicode range. UTF-8 is a variable-length encoding whose greatest strengths are that it reuses the same encoding as ASCII and saves space for Western texts, but it is a bit more complex to handle in practice. UTF-16 uses 2 bytes per character (and sometimes combines two pairs), which makes implementation easier, but it looks a bit overkill for encoding Western languages. Moreover, the XML specification allows documents to be encoded in other encodings on the condition that they are clearly labeled as such. For example, the following is a well-formed XML document encoded in ISO-8859-1 and using the accented letters that we French like for both markup and content:

<?xml version="1.0" encoding="ISO-8859-1"?>
<très>là</très>

Having internationalization support in libxml2 means the following:

  • the document is properly parsed
  • information about its encoding is saved
  • it can be modified
  • it can be saved in its original encoding
  • it can also be saved in another encoding supported by libxml2 (for example straight UTF-8 or even an ASCII form)

Another very important point is that the whole libxml2 API, with the exception of a few routines to read with a specific encoding or save to a specific encoding, is completely agnostic about the original encoding of the document.

It should be noted too that the HTML parser embedded in libxml2 now obeys the same rules; the following document will be (as of 2.2.2) handled in an internationalized fashion by libxml2 too:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
                      "http://www.w3.org/TR/REC-html40/loose.dtd">
<html lang="fr">
<head>
  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
</head>
<body>
<p>W3C crée des standards pour le Web.</body>
</html>

The internal encoding, how and why

One of the core decisions was to force all documents to be converted to a default internal encoding, and for that encoding to be UTF-8. Here are the rationales for those choices:

  • keeping the native encoding in the internal form would force the libxml2 users (or the associated code) to be fully aware of the encoding of the original document: for example, when adding a text node to a document, the content would have to be provided in the document encoding, i.e. the client code would have to check it beforehand, make sure it is conformant to the encoding, etc. Very hard in practice, though in some specific cases this may make sense.
  • the second decision was which encoding. From the XML spec, only UTF-8 and UTF-16 really make sense, as they are the only two encodings for which support is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be considered an intelligent choice too since it is a direct Unicode mapping. I selected UTF-8 on the basis of efficiency and compatibility with surrounding software:
    • UTF-8, while a bit more complex to convert from/to (i.e. slightly more costly to import and export CPU-wise), is also far more compact than UTF-16 (and UCS-4) for the majority of the documents I see it used for right now (RPM RDF catalogs, advogato data, various configuration file formats, etc.), and the key point for today's computer architectures is the efficient use of caches. If one nearly doubles the memory requirement to store the same amount of data, this will thrash the caches (main memory/external caches/internal caches), and my take is that this harms the system far more than the CPU requirements needed for the conversion to UTF-8
    • Most libxml2 version 1 users were using it with straight ASCII most of the time; switching to an internal encoding that required all their code to be rewritten was a serious show-stopper for using UTF-16 or UCS-4.
    • UTF-8 is being used as the de-facto internal encoding standard for related code, like pango, the upcoming Gnome text widget, and a lot of Unix code (yet another place where the Unix programmer base takes a different approach from Microsoft - they are using UTF-16)

What does this mean in practice for the libxml2 user:

  • xmlChar, the libxml2 data type, is a byte; those bytes must be assembled as valid UTF-8 strings. The proper way to terminate an xmlChar * string is simply to append a 0 byte, as usual.
  • One just needs to make sure that, when using characters outside the ASCII set, the values have been properly converted to UTF-8 (see the sketch after this list).
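
For instance, here is a minimal sketch of feeding non-ASCII content to the tree API; the file name, element name and Latin-1 input bytes are made up for the example. The Latin-1 bytes are converted with isolat1ToUTF8() before being handed to libxml2, and the document can then be serialized to any supported encoding:

#include <libxml/tree.h>
#include <libxml/encoding.h>

int main(void) {
    /* "là" in ISO-8859-1: 0x6C 0xE0 - these bytes are not valid UTF-8. */
    const unsigned char latin1[] = { 0x6C, 0xE0 };
    unsigned char utf8[16];
    int inlen = (int) sizeof(latin1);
    int outlen = (int) sizeof(utf8) - 1;

    /* Convert to UTF-8 before handing the string to libxml2. */
    if (isolat1ToUTF8(utf8, &outlen, latin1, &inlen) < 0)
        return 1;
    utf8[outlen] = 0;                 /* xmlChar strings are 0-terminated */

    xmlDocPtr doc = xmlNewDoc(BAD_CAST "1.0");
    xmlNodePtr root = xmlNewNode(NULL, BAD_CAST "root");   /* placeholder name */
    xmlDocSetRootElement(doc, root);
    xmlNodeSetContent(root, utf8);    /* the tree only ever sees UTF-8 */

    /* Serialization can target any supported encoding, e.g. the original one. */
    xmlSaveFormatFileEnc("out.xml", doc, "ISO-8859-1", 1);

    xmlFreeDoc(doc);
    return 0;
}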

How is it implemented?

Let's describe how all this works within libxml2. Basically, the I18N (internationalization) support gets triggered only during I/O operations, i.e. when reading a document or saving one. Let's look first at the reading sequence:

  1. when a document is processed, we usually don't know the encoding; a simple heuristic allows detecting UTF-16 and UCS-4 from encodings where the ASCII range (0-0x7F) maps to ASCII
  2. the XML declaration, if available, is parsed, including the encoding declaration. At that point, if the autodetected encoding is different from the one declared, a call to xmlSwitchEncoding() is issued.
  3. If there is no encoding declaration, then the input has to be in either UTF-8 or UTF-16; if it is not, then at some point when processing the input, the converter/checker of the UTF-8 form will raise an encoding error. You may end up with a garbled document, or no document at all! Example:
    ~/XML -> ./xmllint err.xml 
    err.xml:1: error: Input is not proper UTF-8, indicate encoding !
    <très>là</très>
       ^
    err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
    <très>là</très>
       ^
  4. xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and then searches the default registered encoding converters for that encoding. If it is not within the default set and iconv() support has been compiled in, it will ask iconv for such an encoder. If this fails then the parser will report an error and stop processing:
    ~/XML -> ./xmllint err2.xml 
    err2.xml:1: error: Unsupported encoding UnsupportedEnc
    <?xml version="1.0" encoding="UnsupportedEnc"?>
                                                 ^
  5. From that point the encoder progressively processes the input (it is plugged as a front-end to the I/O module) for that entity. It captures and converts on-the-fly the document to be parsed to UTF-8. The parser itself just does UTF-8 checking of this input and processes it transparently. The only difference is that the encoding information has been added to the parsing context (more precisely to the input corresponding to this entity).
  6. The result (when using DOM) is an internal form completely in UTF-8 with just the encoding information on the document node (see the sketch after this list).
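
As a small illustration of what this looks like from the API side (the file name is a placeholder), the content retrieved from the tree is always UTF-8, while the original encoding name stays available on the document node:

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int main(void) {
    /* "doc.xml" is a placeholder for any well-formed document. */
    xmlDocPtr doc = xmlParseFile("doc.xml");
    if (doc == NULL)
        return 1;

    /* The original (declared or detected) encoding is kept on the doc node. */
    printf("encoding: %s\n",
           doc->encoding ? (const char *) doc->encoding : "(none, UTF-8/UTF-16)");

    /* Whatever that encoding was, content pulled from the tree is UTF-8. */
    xmlChar *content = xmlNodeGetContent(xmlDocGetRootElement(doc));
    printf("root content (UTF-8): %s\n", content ? (const char *) content : "");

    xmlFree(content);
    xmlFreeDoc(doc);
    return 0;
}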

Ok, then what happens when saving the document (assuming you collected/built an xmlDoc DOM-like structure)? It depends on the function called: xmlSaveFile() will just try to save in the original encoding, while xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given encoding:

  1. if no encoding is given, libxml2 will look for an encoding value associated with the document and, if it exists, will try to save to that encoding,

    otherwise everything is written in the internal form, i.e. UTF-8

  2. so if an encoding was specified, either at the API level or on the document, libxml2 will again canonicalize the encoding name and look up a converter in the registered set or through iconv. If none is found the function will return an error code
  3. the converter is placed before the I/O buffer layer, as another kind of buffer; libxml2 will then simply push the UTF-8 serialization through that buffer, which will progressively be converted and pushed onto the I/O layer.
  4. It is possible that the converter code fails on some input, for example trying to push a UTF-8 encoded Chinese character through the UTF-8 to ISO-8859-1 converter won't work. Since the encoders are progressive, they will just report the error and the number of bytes converted; at that point libxml2 will decode the offending character, remove it from the buffer, replace it with the associated charRef encoding (e.g. &#123;) and resume the conversion. This guarantees that any document will be saved without losses (except for markup names, where this is not legal; this is a problem in the current version, so in practice avoid using non-ASCII characters for tag or attribute names). A special "ascii" encoding name can be used to save documents to a pure ASCII form when portability is really crucial (see the sketch after this list).
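
Here is a minimal sketch of those two saving paths; the file names are placeholders and error handling is reduced to a comment:

#include <libxml/tree.h>

/* doc is assumed to be an already parsed or constructed xmlDocPtr. */
void save_both_ways(xmlDocPtr doc) {
    /* No encoding given: libxml2 reuses the encoding recorded on the
     * document, and falls back to UTF-8 if there is none.              */
    xmlSaveFile("copy-original.xml", doc);

    /* Explicit encoding: the UTF-8 internal form is pushed through the
     * matching converter before reaching the I/O layer; characters the
     * target encoding cannot express come out as character references. */
    if (xmlSaveFileEnc("copy-latin1.xml", doc, "ISO-8859-1") < 0) {
        /* unsupported encoding name or I/O failure */
    }
}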

Here are a few examples based on the same test document:

~/XML -> ./xmllint isolat1 
<?xml version="1.0" encoding="ISO-8859-1"?>
<très>là</très>
~/XML -> ./xmllint --encode UTF-8 isolat1 
<?xml version="1.0" encoding="UTF-8"?>
<très>là  </très>
~/XML -> 

The same processing is applied (and reuses most of the code) for HTML I18N processing. Looking up and modifying the content encoding is a bit more difficult since it is located in a <meta> tag under the <head>, so a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have been provided. The parser also attempts to switch encoding on the fly when detecting such a tag on input. Except for that, the processing is the same (and again reuses the same code).
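
A hedged sketch of how these two helpers can be used; the file names are placeholders:

#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/HTMLtree.h>

int main(void) {
    /* Passing NULL as encoding lets the parser detect it, e.g. from the
     * <meta http-equiv> tag.                                            */
    htmlDocPtr doc = htmlParseFile("page.html", NULL);
    if (doc == NULL)
        return 1;

    const xmlChar *enc = htmlGetMetaEncoding(doc);
    printf("meta encoding: %s\n", enc ? (const char *) enc : "(none)");

    /* Update the <meta> declaration and save in the matching encoding. */
    htmlSetMetaEncoding(doc, BAD_CAST "UTF-8");
    htmlSaveFileEnc("page-utf8.html", doc, "UTF-8");

    xmlFreeDoc(doc);
    return 0;
}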

Default supported encodings

libxml2 has a set of default converters for the following encodings (located in encoding.c):

  1. UTF-8 is supported by default (null handlers)
  2. UTF-16, both little and big endian
  3. ISO-Latin-1 (ISO-8859-1) covering most western languages
  4. ASCII, useful mostly for saving
  5. HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML predefined entities like &copy; for the copyright sign.

Moreover, when compiled on a Unix platform with iconv support, the full set of encodings supported by iconv can instantly be used by libxml2. On a Linux machine with glibc-2.1 the list of supported encodings and aliases fills 3 full pages, and includes UCS-4, the full set of ISO-Latin encodings, and the various Japanese ones.

To convert from the UTF-8 values returned by the API to another encoding, it is possible to use the functions provided by the encoding module, like UTF8Toisolat1, or to use the POSIX iconv() API directly.
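
For example, a minimal sketch using UTF8Toisolat1 (available when output support is compiled in); the input bytes are simply the "très" example from above encoded in UTF-8:

#include <stdio.h>
#include <string.h>
#include <libxml/encoding.h>

int main(void) {
    /* "très" as the UTF-8 bytes the libxml2 API would return. */
    const unsigned char utf8[] = "tr\xC3\xA8s";
    unsigned char latin1[16];
    int inlen = (int) strlen((const char *) utf8);
    int outlen = (int) sizeof(latin1) - 1;

    /* Negative return: lack of space, or a character with no Latin-1 mapping. */
    if (UTF8Toisolat1(latin1, &outlen, utf8, &inlen) < 0)
        return 1;
    latin1[outlen] = 0;

    printf("%d UTF-8 bytes became %d ISO-8859-1 bytes\n", inlen, outlen);
    return 0;
}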

Encoding aliases

Starting with 2.2.3, libxml2 has support for registering encoding name aliases. The goal is to be able to parse documents whose encoding is supported but where the name differs (for example from the default set of names accepted by iconv). The following functions allow registering and handling new aliases for existing encodings. Once registered, libxml2 will automatically look up the aliases when handling a document (a usage sketch follows the function list):

  • int xmlAddEncodingAlias(const char *name, const char *alias);
  • int xmlDelEncodingAlias(const char *alias);
  • const char * xmlGetEncodingAlias(const char *alias);
  • void xmlCleanupEncodingAliases(void);
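
A short usage sketch; the alias name and file name are made up for the example:

#include <libxml/encoding.h>
#include <libxml/parser.h>

int main(void) {
    /* Make the (made-up) name "LATIN1" resolve to the built-in
     * ISO-8859-1 converter.                                            */
    xmlAddEncodingAlias("ISO-8859-1", "LATIN1");

    /* A document declaring encoding="LATIN1" can now be parsed. */
    xmlDocPtr doc = xmlParseFile("legacy.xml");
    if (doc != NULL)
        xmlFreeDoc(doc);

    /* Remove a single alias, or all of them, when no longer needed. */
    xmlDelEncodingAlias("LATIN1");
    xmlCleanupEncodingAliases();
    return 0;
}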

How to extend the existing support

Well, adding support for a new encoding, or overriding one of the encoders (assuming it is buggy), should not be hard: just write input and output conversion routines to/from UTF-8, and register them using xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx), and they will be called automatically if the parser(s) encounter such an encoding name (register it uppercase, this will help). The encoders, their arguments and expected return values are described in the encoding.h header.
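
As an illustration only, here is a hedged sketch of such a registration for a hypothetical encoding "X-MYENC" that happens to be byte-identical to ASCII. The routines follow my reading of the encoding.h conventions (update *inlen/*outlen to the bytes consumed/produced and return the number of bytes written, or a negative value on error); check the header for the authoritative description:

#include <libxml/encoding.h>

/* Convert a block of "X-MYENC" bytes to UTF-8.  Since the hypothetical
 * encoding is plain ASCII, the bytes map directly; anything >= 0x80 is
 * treated as a transcoding error.                                      */
static int
myencToUTF8(unsigned char *out, int *outlen,
            const unsigned char *in, int *inlen) {
    int i, max = (*inlen < *outlen) ? *inlen : *outlen;

    for (i = 0; i < max; i++) {
        if (in[i] >= 0x80) {          /* not representable */
            *inlen = i;
            *outlen = i;
            return (-2);
        }
        out[i] = in[i];
    }
    *inlen = i;                       /* bytes consumed  */
    *outlen = i;                      /* bytes produced  */
    return (i);
}

/* Convert UTF-8 back to "X-MYENC": same trivial mapping the other way. */
static int
UTF8Tomyenc(unsigned char *out, int *outlen,
            const unsigned char *in, int *inlen) {
    return (myencToUTF8(out, outlen, in, inlen));
}

void register_myenc(void) {
    /* Register uppercase, as advised above; the parser will pick the
     * handler up when it meets encoding="X-MYENC".                     */
    xmlNewCharEncodingHandler("X-MYENC", myencToUTF8, UTF8Tomyenc);
}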

Daniel Veillard