LinuxDoc-Tools User's Guide: Writing Documents With LinuxDoc-Tools

3. Writing Documents With LinuxDoc-Tools

For the most part, writing documents using LinuxDoc-Tools is very simple, and rather like writing HTML. However, there are some caveats to watch out for. In this section we'll give an introduction to writing SGML documents. See the file example.sgml for an SGML example document (and tutorial) which you can use as a model when writing your own documents. Here we're just going to discuss the various features of LinuxDoc-Tools; the source of this guide is not very readable as an example. Instead, print out the source (as well as the formatted output) for example.sgml so you have a real live case to refer to.

3.1 Basic Concepts

Looking at the source of the example document, you'll notice right off that there are a number of ``tags'' marked within angle brackets (< and >). A tag simply specifies the beginning or end of an element, where an element is something like a section, a paragraph, a phrase of italicized text, an item in a list, and so on. Using a tag is like using an HTML tag, or a LaTeX command such as \item or \section{...}.

As a simple example, to produce this boldfaced text, you would type


As a simple example, to produce <bf>this boldfaced text</bf>, ...

in the source. <bf> begins the region of bold text, and </bf> ends it. Alternatively, you can use the abbreviated form


As a simple example, to produce <bf/this boldfaced text/, ...

which encloses the bold text within slashes. (Of course, you'll need to use the long form if the enclosed text contains slashes, as is the case with Unix filenames.)

There are other things to watch out for with respect to special characters (that's why you'll notice all of those bizarre-looking ampersand expressions if you look at the source; I'll talk about those shortly).

In some cases, the end-tag for a particular element is optional. For example, to begin a section, you use the <sect> tag; however, the end-tag for the section (which could appear at the end of the section body itself, not just after the name of the section!) is optional and implied when you start another section of the same depth. In general you needn't worry about these details; just follow the model used in the tutorial (example.sgml).
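As a rough sketch of the implied end-tag (the section names here are made up for illustration), two sections of the same depth can simply follow one another with no closing tag between them:

```sgml
<sect>Getting Started
<p>
Body of the first section.

<sect>Going Further
<p>
Starting this section implicitly ends the previous one.
```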

3.2 Special Characters

Obviously, the angle brackets are themselves special characters in the SGML source. There are others to watch out for. For example, let's say that you wanted to type an expression with angle brackets around it, as so: <foo>. In order to get the left angle bracket, you must use the &lt; entity, which is a ``macro'' that expands to the actual left-bracket character. Therefore, in the source, I typed


angle brackets around it, as so: <tt>&lt;foo&gt;</tt>.

Generally, anything beginning with an ampersand is a special character. For example, there's &percnt; to produce %, &verbar; to produce |, and so on. For every special character that might otherwise confuse LinuxDoc-Tools if typed by itself, there is an ampersand ``entity'' to represent it. The most commonly used are:

Use &amp; for the ampersand (&),
Use &lt; for a left bracket (<),
Use &gt; for a right bracket (>),
Use &etago; for a left bracket with a slash (</),
Use &dollar; for a dollar sign ($),
Use &num; for a hash (#),
Use &percnt; for a percent (%),
Use &tilde; for a tilde (~),
Use `` and '' for quotes, or use &dquot; for ".
Use &shy; for a soft hyphen (that is, an indication that this is a good place to break a word for horizontal justification).
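To illustrate, here is a short made-up passage that uses several of the entities above. The wording and the file name are only for illustration:

```sgml
Roughly 50&percnt; of the examples assume a &dollar; prompt, while the
rest assume &num;. Settings for the R&amp;D machines go in
<tt>&tilde;/.foorc</tt>, and you write &lt;foo&gt; when you mean
a literal tag name.
```

The long form of tt is used here because the enclosed text contains a slash.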

Here is a complete list of the entities recognized by 0.1. Note that not all back-ends will be able to make anything useful from every entity -- if you see parentheses with nothing between them in the list, it means that the back-end that generated what you're looking at has no replacement for the entity. The ``common'' ones listed above are pretty reliable.

&half (1/2): vertical 1/2 fraction
&frac12 (1/2): typeset 1/2 fraction
&frac14 (1/4): typeset 1/4 fraction
&frac34 (3/4): typeset 3/4 fraction
&frac18 (1/8): typeset 1/8 fraction
&frac38 (3/8): typeset 3/8 fraction
&frac58 (5/8): typeset 5/8 fraction
&frac78 (7/8): typeset 7/8 fraction
&sup1 (^1): superscript 1
&sup2 (^2): superscript 2
&sup3 (^3): superscript 3
&plus (+): plus sign
&plusmn (±): plus-or-minus sign
&lt (<): less-than sign
&equals (=): equals sign
&gt (>): greater-than sign
&divide (÷): division sign
&times (×): multiplication sign
&curren (¤): currency symbol
&pound (£): symbol for ``pounds''
&dollar ($): dollar sign
&cent (¢): cent sign
&yen (¥): yen sign
&num (#): number or hash sign
&percnt (%): percent sign
&amp (&): ampersand
&ast (*): asterisk
&commat (@): commercial-at sign
&lsqb ([): left square bracket
&bsol (\): backslash
&rsqb (]): right square bracket
&lcub ({): left curly brace
&horbar (―): horizontal bar
&verbar (|): vertical bar
&rcub (}): right curly brace
&micro (µ): greek mu (micro prefix)
&ohm (Ω): greek capital omega (Ohm sign)
&deg (°): small superscript circle sign (degree sign)
&ordm (º): masculine ordinal
&ordf (ª): feminine ordinal
&sect (§): section sign
&para (¶): paragraph sign
&middot (·): centered dot
&larr (←): left arrow
&rarr (->): right arrow
&uarr (↑): up arrow
&darr (↓): down arrow
&copy (©): copyright
&reg (®): r-in-circle mark
&trade (™): trademark sign
&brvbar (¦): broken vertical bar
&not (¬): logical-negation sign
&sung (♪): sung-note sign
&excl (!): exclamation point
&iexcl (¡): inverted exclamation point
&quot ("): double quote
&apos ('): apostrophe (single quote)
&lpar ((): left parenthesis
&rpar ()): right parenthesis
&comma (,): comma
&lowbar (_): under-bar
&hyphen (‐): hyphen
&period (.): period
&sol (/): solidus
&colon (:): colon
&semi (;): semicolon
&quest (?): question mark
&iquest (¿): inverted question mark
&laquo («): left guillemet
&raquo (»): right guillemet
&lsquo (‘): left single quote
&rsquo (’): right single quote
&ldquo (“): left double quote
&rdquo (”): right double quote
&nbsp ( ): non-breaking space
&shy (): soft hyphen

3.3 Verbatim and Code Environments

While we're on the subject of special characters, we might as well mention the verbatim ``environment'' used for including literal text in the output (with spaces and indentation preserved, and so on). The verb element is used for this; it looks like the following:


<verb>
 Some literal text to include as example output.
</verb>

The verb environment doesn't allow you to use everything within it literally. Specifically, you must do the following within verb environments:

Use &ero; to get an ampersand,
Use &etago; to get </,
Don't use \end{verbatim} within a verb environment, as this is what LaTeX uses to end the verbatim environment. (In the future, it should be possible to hide the underlying text formatter entirely, but the parser doesn't support this feature yet.)
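A small sketch of these rules, using a made-up shell example: the ampersand and the </ sequence are written as entities, while everything else is typed literally.

```sgml
<verb>
make all > build.log 2>&ero;1
echo "&etago;verb> ends a verbatim block"
</verb>
```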

The code environment is much like the verb environment, except that horizontal rules are added to the surrounding text, as so:

Here is an example code environment.
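The source for the example above is not shown in this rendering. By analogy with verb, it would presumably look something like the following sketch:

```sgml
<code>
Here is an example code environment.
</code>
```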

You should use the tscreen environment around any verb environments, as so:


<tscreen><verb>
Here is some example text.  
</verb></tscreen>

tscreen is an environment that simply indents the text and sets the default font to tt. This makes examples look much nicer, both in the LaTeX and plain text versions. You can use tscreen without verb; however, if you use any special characters in your example you'll need to use both of them. tscreen does nothing to special characters. See example.sgml for examples.

The quote environment is like tscreen, except that it does not set the default font to tt. So, you can use quote for non-computer-interaction quotes, as in:


<quote>
Here is some text to be indented, as in a quote.
</quote>

which will generate:

Here is some text to be indented, as in a quote.

3.4 Overall Document Structure

Before we get too in-depth with details, we're going to describe the overall structure of a LinuxDoc-Tools document. Look at example.sgml for a good example of how a document is set up.

The Preamble

In the document ``preamble'' you set up things such as the title information and document style:


<!doctype linuxdoc system>

<article>

<title>Linux Foo HOWTO
<author>Norbert Ebersol, <tt/norb@baz.com/
<date>v1.0, 9 March 1994
<abstract>
This document describes how to use the <tt/foo/ tools to frobnicate
bar libraries, using the <tt/xyzzy/ relinker.
</abstract>

<toc>

The elements should go more or less in this order. The first line tells the SGML parser to use the linuxdoc DTD. We'll explain that in the later section on How LinuxDoc-Tools Works; for now just treat it as a bit of necessary magic. The <article> tag forces the document to use the ``article'' document style.

The title, author, and date tags should be obvious; in the date tag include the version number and last modification time of the document.

The abstract tag sets up the text to be printed at the top of the document, before the table of contents. If you're not going to include a table of contents (the toc tag), you probably don't need an abstract.

Sectioning And Paragraphs

After the preamble, you're ready to dive into the document. The following sectioning commands are available:

sect: For top-level sections (i.e. 1, 2, and so on.)
sect1: For second-level subsections (i.e. 1.1, 1.2, and so on.)
sect2: For third-level subsubsections.
sect3: For fourth-level subsubsubsections.
sect4: For fifth-level subsubsubsubsections.

These are roughly equivalent to their LaTeX counterparts section, subsection, and so on.

After the sect (or sect1, sect2, etc.) tag comes the name of the section. For example, at the top of this document, after the preamble, comes the tag:


<sect>Introduction

And at the beginning of this section (Sectioning and paragraphs), there is the tag:


<sect2>Sectioning And Paragraphs

After the section tag, you begin the body of the section. However, you must start the body with a <p> tag, as so:


<sect>Introduction
<p>
This is a user's guide to the LinuxDoc-Tools document processing...

This is to tell the parser that you're done with the section title and are ready to begin the body. Thereafter, new paragraphs are started with a blank line (just as you would do in TeX). For example,


Here is the end of the first paragraph.

And we start a new paragraph here.

You don't need a <p> tag at the beginning of every paragraph, only at the beginning of the first paragraph after a sectioning command.

Ending The Document

At the end of the document, you must use the tag:


</article>

to tell the parser that you're done with the article element (which embodies the entire document).
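Putting the pieces of this section together, a complete minimal document might look like the following sketch. It reuses the made-up title and author from the preamble example; the section names and body text are likewise only placeholders.

```sgml
<!doctype linuxdoc system>

<article>

<title>Linux Foo HOWTO
<author>Norbert Ebersol, <tt/norb@baz.com/
<date>v1.0, 9 March 1994
<abstract>
This document describes how to use the <tt/foo/ tools.
</abstract>

<toc>

<sect>Introduction
<p>
Here is the end of the first paragraph.

And we start a new paragraph here.

<sect1>Background
<p>
The body of a subsection.

</article>
```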

3.5 Internal Cross-References

Now we're going to move on to other features of the system. Cross-references are easy. For example, if you want to make a cross-reference to a certain section, you need to label that section as so:


<sect1>Introduction<label id="sec-intro">

You can then refer to that section somewhere in the text using the expression:


See section <ref id="sec-intro" name="Introduction"> for an introduction.

This will replace the ref tag with the section number labeled as sec-intro. The name argument to ref is necessary for groff and HTML translations. The groff macro set used by LinuxDoc-Tools does not currently support cross-references, and it's often nice to refer to a section by name instead of number.

For example, this section is Cross-References.

Some back-ends may get upset about special characters in reference labels. In particular, latex2e chokes on underscores (though the latex back end used in older versions of this package didn't). Hyphens are safe.

3.6 Web References

There is also a url element for Uniform Resource Locators, or URLs, used on the World Wide Web. This element should be used to refer to other documents, files available for FTP, and so forth. For example,


You can get the Linux HOWTO documents from 
<url url="http://sunsite.unc.edu/mdw/HOWTO/" 
   name="The Linux HOWTO INDEX">.

The url argument specifies the actual URL itself. A link to the URL in question will be automatically added to the HTML document. The optional name argument specifies the text that should be anchored to the URL (for HTML conversion) or named as the description of the URL (for LaTeX and groff). If no name argument is given, the URL itself will be used.
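As a sketch of the case where the optional name argument is left out (reusing the URL from the example above), the URL itself then serves as the link text or description:

```sgml
The Linux HOWTO documents are available from
<url url="http://sunsite.unc.edu/mdw/HOWTO/">.
```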

A useful variant of this is htmlurl, which suppresses rendering of the URL part in every context except HTML. This is useful for things like a person's email address; you can write


<htmlurl url="mailto:esr@snark.thyrsus.com"
      name="esr@snark.thyrsus.com">

and get ``esr@snark.thyrsus.com'' in text output rather than the duplicative ``esr@snark.thyrsus.com <mailto:esr@snark.thyrsus.com>'' but still have a proper URL in HTML documents.

3.7 Fonts

Essentially, the same fonts supported by LaTeX are supported by LinuxDoc-Tools. Note, however, that the conversion to plain text (through groff) does away with the font information. So, you should use fonts for the benefit of the conversion to LaTeX, but don't depend on the fonts to get a point across in the plain text version.

In particular, the tt tag described above can be used to get constant-width ``typewriter'' font which should be used for all e-mail addresses, machine names, filenames, and so on. Example:


Here is some <tt>typewriter text</tt> to be included in the document.

Equivalently:


Here is some <tt/typewriter text/ to be included in the document.

Remember that you can only use this abbreviated form if the enclosed text doesn't contain slashes.

Other fonts can be achieved with bf for boldface and em for italics. Several other fonts are supported as well, but we don't suggest you use them, because we'll be converting these documents to other formats such as HTML which may not support them. Boldface, typewriter, and italics should be all that you need.
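For instance, boldface and italics follow the same pattern as tt. The sentence below is made up; the first line uses the long form and the second the abbreviated form:

```sgml
Be <bf>very</bf> careful with <em>recursive</em> deletes.
Be <bf/very/ careful with <em/recursive/ deletes.
```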

3.8 Lists

There are various kinds of supported lists. They are:

itemize for bulleted lists such as this one.
enum for numbered lists.
descrip for ``descriptive'' lists.

Each item in an itemize or enum list must be marked with an item tag. Items in a descrip are marked with tag. For example,


<itemize>
<item>Here is an item.
<item>Here is a second item.
</itemize>

looks like this:

Here is an item.
Here is a second item.

Or, for an enum,


<enum>
<item>Here is the first item.
<item>Here is the second item.
</enum>

You get the idea. Lists can be nested as well; see the example document for details.
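A nested list might be written along these lines. This is only a sketch, assuming that an inner list may be placed inside an item of the outer one; check example.sgml for the form the tutorial actually uses.

```sgml
<itemize>
<item>Here is an item in the outer list.
  <enum>
  <item>Here is the first nested item.
  <item>Here is the second nested item.
  </enum>
<item>Here is a second item in the outer list.
</itemize>
```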

A descrip list is slightly different, and slightly ugly, but you might want to use it for some situations:


<descrip>
<tag/Gnats./ Annoying little bugs that fly into your cooling fan.
<tag/Gnus./ Annoying little bugs that run on your CPU.
</descrip>

ends up looking like:

Gnats.: Annoying little bugs that fly into your cooling fan.
Gnus.: Annoying little bugs that run on your CPU.

3.9 Conditionalization

The overall goal of LinuxDoc-Tools is to be able to produce from one set of masters output that is semantically equivalent on all back ends. Nevertheless, it is sometimes useful to be able to produce a document in slightly different variants depending on back end and version. LinuxDoc-Tools supports this through the <#if> and <#unless> bracketing tags.

These tags allow you to selectively include or exclude portions of an SGML master in your output, depending on filter options set by your driver. Each tag may include a set of attribute/value pairs. The most common are ``output'' and ``version'' (though you are not restricted to these) so a typical example might look like this:


Some <#if output=latex2e version=drlinux>conditional</#if> text.

Everything from this <#if> tag to the following </#if> would be considered conditional, and would not be included in the document if either the filter option ``output'' were set to something that doesn't match ``latex2e'' or the filter option ``version'' were set to something that doesn't match ``drlinux''. The double negative is deliberate; if no ``output'' or ``version'' filter options are set, the conditional text will be included.

Filter options are set in one of two ways. Your format driver sets the ``output'' option to the name of the back end it uses; thus, in particular, ``linuxdoc -B latex'' sets ``output=latex2e''. Or you may set an attribute-value pair with the ``-D'' option of your format driver. Thus, if the above tag were part of a file named ``foo.sgml'', then formatting with either


% linuxdoc -B latex -D version=drlinux foo.sgml

or


% linuxdoc -B latex foo.sgml

would include the ``conditional'' part, but neither


% linuxdoc -B html -D version=drlinux foo.sgml

nor


% linuxdoc -B latex -D private=book foo.sgml

would do so.

So that you can have conditionals depending on one or more of several values matching, values support a simple alternation syntax using ``|''. Thus you could write:


Some <#if output="latex2e|html" version=drlinux>conditional</#if> text.

and formatting with either ``-B latex'' or ``-B html'' will include the ``conditional'' text (but formatting with, say, ``-B txt'' will not).

The <#unless> tag is the exact inverse of <#if>; it includes when <#if> would exclude, and vice-versa.
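By analogy with the <#if> examples above, a use of <#unless> might look like the following sketch. Going by the description, the word ``conditional'' would be included when formatting with, say, ``-B latex'' and omitted when formatting with ``-B html''.

```sgml
Some <#unless output=html>conditional</#unless> text.
```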

Note that these tags are implemented by a preprocessor which runs before the SGML parser ever sees the document. Thus they are completely independent of the document structure, are not in the DTD, and usage errors won't be caught by the parser. You can seriously confuse yourself by conditionalizing sections that contain unbalanced bracketing tags.

The preprocessor implementation also means that standalone SGML parsers will choke on LinuxDoc-Tools documents that contain conditionals. However, you can validity-check them with ``linuxdoc -B check''.

Also note that in order not to mess up the source line numbers in parser error messages, the preprocessor doesn't actually throw away everything when it omits a conditionalized section. It still passes through any newlines. This leads to behavior that may surprise you if you use <#if> or <#unless> within a <verb> environment, or any other kind of bracket that changes SGML's normal processing of whitespace.

These tags are called ``#if'' and ``#unless'' (rather than ``if'' and ``unless'') to remind you that they are implemented by a preprocessor and you need to be a bit careful about how you use them.

3.10 Index generation

To support automated generation of indexes for book publication of SGML masters, LinuxDoc-Tools supports the <idx> and <cdx> tags. These are bracketing tags which cause the text between them to be saved as an index entry, pointing to the page number on which it occurs in the formatted document. They are ignored by all backends except LaTeX, which uses them to build a .ind file suitable for processing by the TeX utility makeindex.

The two tags behave identically, except that <idx> sets the entry in a normal font and <cdx> in a constant-width one.

If you want to add an index entry that shouldn't appear in the text itself, use the <nidx> and <ncdx> tags.
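A sketch of how these tags might appear in running text follows. The sentences are made up, the long form of each tag is used to be safe, and the placement of the <nidx> entry on its own line is an assumption rather than a requirement.

```sgml
The <idx>preprocessor</idx> is invoked by the <cdx>linuxdoc</cdx> driver.
<nidx>conditionalization</nidx>
Conditional text is handled before the parser sees the document.
```

Here ``preprocessor'' and ``linuxdoc'' would appear both in the text and in the index, while ``conditionalization'' would appear in the index only.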

3.11 Controlling justification

In order to get proper justification and filling of paragraphs in typeset output, LinuxDoc-Tools includes the &shy; entity. This becomes an optional or `soft' hyphen in back ends like latex2e for which this is meaningful.

The bracketing tag <file> can be used to surround filenames in running text. It effectively inserts soft hyphens after each slash in the filename.

One of the advantages of using the <url> and <htmlurl> tags is that they do likewise for long URLs.
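As a brief sketch of both mechanisms (the path and the hyphenation point are hypothetical, chosen only for illustration):

```sgml
The tutorial lives in <file>/usr/doc/linuxdoc-tools/example.sgml</file>.
Long words such as inter&shy;operability can carry a soft hyphen.
```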