Using XML, Part 1
For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled.
– Richard Feynman
When Feynman said this in the conclusion of his Minority Report Appendix to the Rogers Commission Report on the Space Shuttle Challenger disaster, it was in reaction to what he saw as NASA’s unrealistic assessment of the reliability of the Space Shuttle. However, his conclusion applies equally well to any technology, especially those, like XML, that have been heavily touted.
While I’m a big supporter of XML technology (I’m writing this book using the latest available version of DocBook XML, version 5.0, which is still a Candidate Release as I write), I’m also the first to point out that XML is not a panacea. It will not cure male pattern baldness, feed the hungry, or make Britney Spears sing in tune. It also won’t organize your documentation, eliminate your production backend, or allow you to hire fewer, less skilled, or cheaper writers. But it does provide the best way to mark up nearly all technical documentation.
The Origins of XML
Before I justify that statement, let’s take a quick romp through XML’s history. XML is the latest in a line of markup languages that originated at IBM in the late 1960s. Charles Goldfarb, Edward Mosher, and Raymond Lorie developed GML (originally named using their initials, then later “generalized” to Generalized Markup Language), which became part of several IBM document processing products.
Various kinds of text markup had been common for many years, and GML was just one of many. The critical differentiator for GML was that the developers wanted to “… restrict markup within the document to identification of the document’s structure and other attributes.” Goldfarb and his colleagues recognized that if you embed processing commands in your content, you are forced to update the content whenever you change the processing or need to process the content in another environment. Therefore, they replaced explicit processing commands with mnemonic tags that described the content. This allowed them to process the same content with different applications or on different hardware.
GML evolved into SGML, the Standard Generalized Markup Language, which was published as an international standard, ISO 8879:1986. Once again, Charles Goldfarb was closely involved; so closely that he is often referred to as the “father of SGML.”
SGML is a “meta-language,” that is, a language used to define other languages. Unlike previous markup languages, you don’t use SGML directly. Instead, you create an SGML “Document Type Definition” (DTD) that defines a set of tags and a grammar.
In practice, most SGML documents are created using one of the many standard DTDs that have been created for particular purposes. The best known of these is HTML. In the area of technical documentation, the best known and most widely used is DocBook, which this book uses.
SGML was a huge advance from previously available markup languages. By providing independence from any single vendor’s processing applications, it spawned a wave of editing and processing applications. And, because several important standards were quickly built on its foundations (HTML and DocBook being among the earliest), it became a great choice whenever interoperability among organizations was needed. SGML also provides the means to create custom grammars for nearly any imaginable subject domain.
Despite its power, SGML in its full glory is complex and can be hard for both humans and software to work with. In the mid-1990s, a working group led by Jon Bosak set out to develop “SGML for the Web,” by which they meant a subset of SGML that would be easier to work with and better suited to the Internet. The result was XML (Extensible Markup Language), which is a W3C Recommendation. With some minor exceptions, XML is a proper subset of SGML that sheds SGML’s less-used and harder-to-implement features.
Unlike SGML, which was primarily used by documentation specialists for a relatively narrow range of applications, XML hit a sweet spot. It was complex enough to create interesting structures and applications, and simple enough that vendors and the open source community could build useful tools and applications to manage data marked up in XML grammars.
This has led to XML being used in an amazing array of different applications, many of them far from anything imagined when GML was originally conceived. A quick Internet search will yield hundreds of XML initiatives covering topics from aviation to weather. There are more than a dozen markup languages for representing music, dozens more for various health care related languages, and too many business information languages to count. To get a sense of the variety, check out the Cover Pages (http://xml.coverpages.org), which is an extensive resource for XML.
For our purposes, the most interesting XML languages are DocBook, which has moved from being an SGML standard to being an XML standard, and DITA, Darwin Information Typing Architecture, both of which are standards from the Organization for the Advancement of Structured Information Standards (OASIS).
What is it that makes XML such an important technology? The key lies in three concepts behind both SGML and XML: the Document Type Definition (DTD), descriptive markup, and data independence. I’ll take a look at these three concepts in turn, using the following example, which is a fragment of DocBook XML taken from the beginning of this section:
Example 7.1. DocBook Example
<epigraph role="quote">
  <attribution>
    <personname>
      <firstname>Richard</firstname>
      <surname>Feynman</surname>
    </personname>
  </attribution>
  <para>
    For a successful technology, reality must take precedence over
    public relations, for Nature cannot be fooled.
  </para>
</epigraph>
A Document Type Definition (DTD) defines an SGML grammar. With the introduction of XML, DTDs are still used, but the most common means of defining an XML grammar is a schema. There are several means for defining a schema, and good technical reasons for choosing one over another in particular situations, but for the purposes of this discussion they are more or less equivalent to each other and to a DTD, and I will use DTD and schema interchangeably.
To see what a DTD does, let’s start with the first line of Example 7.1, “DocBook Example”. This line, <epigraph role="quote">, is a start tag that marks the beginning of an element called “epigraph”. Elements are the basic structure in XML. Every start tag is matched by an end tag. The end tag for epigraph is: </epigraph>. There is one exception: an element with no content can be represented with a single tag that looks like this: <a/>.
Elements can be nested, but can’t straddle (i.e., “<a><b></b></a>” is ok, but “<a><b></a></b>” is not).
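The nesting rule can be checked mechanically. Here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser; the helper name is_well_formed is made up for this illustration and is not part of any XML toolkit.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested: <b> opens and closes inside <a>.
print(is_well_formed("<a><b></b></a>"))  # True
# Straddling: <a> is closed while <b> is still open.
print(is_well_formed("<a><b></a></b>"))  # False
# An element with no content, written as a single tag.
print(is_well_formed("<a/>"))  # True
```

Note that this checks only well-formedness, that is, the tag-matching rules described above. It says nothing about whether the elements are allowed by a particular schema.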
An XML document is a tree structure, with a single element at the top level surrounding the entire document. In context, our epigraph example is itself nested inside a <section> element, which is nested inside a <chapter> element, which is in turn nested inside a <book> element. The <book> element is the top level element for the document that represents this book. In practice, you can work with sub-trees for editing purposes, and in fact, the various parts of this book are separate XML documents, usually at the chapter or section level, which are combined to create the full book.
In our example, the nesting continues downward from epigraph, which has two elements, <attribution> and <para>, nested inside it. Nested inside <para> is content, in this case the Feynman quotation. Nested inside <attribution> is <personname>, which has nested within it <firstname> and <surname>, which contain Professor Feynman’s name.
The DocBook schema, which this example conforms to, defines each of these elements, plus many more, and the allowable ways in which they can be combined. It also defines a set of attributes, which are name/value pairs placed inside an element’s start tag to provide additional information about the element. There is an example of an attribute named “role” inside the start tag for epigraph. This role attribute has the value “quote”, which, somewhat redundantly, tells us that this epigraph is a quote.
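To make the tree structure concrete, the sketch below parses the epigraph fragment and prints one line per element, indented by nesting depth. It assumes Python's standard xml.etree.ElementTree module and uses the fragment exactly as printed in the example, with no namespace declaration; the function name outline is hypothetical.

```python
import xml.etree.ElementTree as ET

EPIGRAPH = """\
<epigraph role="quote">
  <attribution>
    <personname>
      <firstname>Richard</firstname>
      <surname>Feynman</surname>
    </personname>
  </attribution>
  <para>
    For a successful technology, reality must take precedence over
    public relations, for Nature cannot be fooled.
  </para>
</epigraph>
"""


def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(EPIGRAPH)
print("\n".join(outline(root)))
# epigraph
#   attribution
#     personname
#       firstname
#       surname
#   para
```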
As you might guess, there are numerous details—attributes may be optional or required, the order of elements in a particular context may or may not be specified, text may or may not be allowed inside an element, and so forth—but the basic idea is that the schema tells you how to construct the framework of a document and populate it with content. And, it tells applications what they can expect to find when they open an XML document that conforms to a particular schema.
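The idea that a schema tells applications what to expect can be illustrated with a toy checker. This is only a sketch: the ALLOWED_CHILDREN and ALLOWED_ATTRIBUTES tables are a hypothetical, drastically simplified content model invented for this illustration, not the real DocBook schema, and real validation would use a DTD or schema with a validating parser rather than hand-written rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified content model. The real DocBook schema
# allows far more than this; these rules cover only the example fragment.
ALLOWED_CHILDREN = {
    "epigraph": {"attribution", "para"},
    "attribution": {"personname"},
    "personname": {"firstname", "surname"},
    "firstname": set(),
    "surname": set(),
    "para": set(),
}
ALLOWED_ATTRIBUTES = {"epigraph": {"role"}}


def find_problems(element):
    """Return a list of messages for anything the toy model does not allow."""
    if element.tag not in ALLOWED_CHILDREN:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for name in element.attrib:
        if name not in ALLOWED_ATTRIBUTES.get(element.tag, set()):
            problems.append(f"attribute '{name}' not allowed on <{element.tag}>")
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(find_problems(child))
    return problems


good = ET.fromstring(
    '<epigraph role="quote"><attribution><personname>'
    "<firstname>Richard</firstname><surname>Feynman</surname>"
    "</personname></attribution><para>Nature cannot be fooled.</para></epigraph>"
)
bad = ET.fromstring("<epigraph><para><surname>Feynman</surname></para></epigraph>")

print(find_problems(good))  # []
print(find_problems(bad))   # ['<surname> not allowed inside <para>']
```

Both documents are well-formed XML. Only the first fits the toy model, which is the distinction a schema adds on top of the basic tag-matching rules.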
Schemas are the mechanism by which XML allows us to define domain specific languages. This is important because it enables XML to be used for a wide range of applications. Broadening the applicability of this technology has resulted in there being many more tools and applications that process XML than there ever would have been if its applicability had been restricted to a single language.
At the same time, this flexibility means that languages can be devised that meet the specific needs of narrow interests, but still take advantage of the wealth of tools built around XML. Overall, it’s a win both for vendors, who get more potential customers, and for users, who get the benefit of standard tools that will work on their customized schemas.
If you wanted to emphasize a piece of text in Microsoft Word, you’d probably just set the font style to italic. Or, in HTML you could use markup like this: <i>this is important</i>. In either case, you might think you’ve identified some text to be emphasized, but in fact you haven’t. Instead, you’ve simply identified some text to be rendered in italics. If you decide at some later time that you want emphasized text to be rendered in some other way, e.g., in bold or in quotations, you need to go back and change the markup for every instance of emphasized text to the new style. But, you can’t do that blindly; what if you also used italics for book titles, or variable names, or any of hundreds of other possible uses? You’d need to look at every occurrence of italics in the content to be sure you only changed those occurrences where italics were used for emphasis.
Descriptive Markup, also called “Semantic Markup,” tags the meaning of various pieces of content without regard to how they might be rendered in various contexts. For example, instead of using italics explicitly, suppose you used markup that describes the semantics, like this: <emphasis>this is important</emphasis>. This markup doesn’t constrain the rendering in any way, and because it is unique to text you want to emphasize, it won’t be used to mark up other content. Your processing engine can be set up to render this tag any way you want, and when you change the rendering, you don’t need to touch the content.
Descriptive Markup is not an inherent trait of XML languages. It is possible to create an XML grammar that specifies rendering in excruciating detail; XSL-FO, which is used to define detailed formatting for print media, is one example. However, separating rendering from markup is a critical part of languages like DocBook and DITA. It makes it possible to target the same content to different outputs and enables higher-level processing, like generating tables of contents and indexes.
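As a rough illustration of changing the rendering without touching the content, the sketch below turns the same source into HTML twice, once with emphasis shown in italics and once in bold. It assumes Python's standard library; the render function and its emphasis_tag parameter are hypothetical names for this example, and a real toolchain would normally do this with stylesheets rather than hand-written code.

```python
import xml.etree.ElementTree as ET
from html import escape

SOURCE = "<para>Remember, <emphasis>this is important</emphasis>.</para>"


def render(element, emphasis_tag):
    """Render a para as an HTML fragment, mapping <emphasis> to emphasis_tag."""
    parts = [escape(element.text or "")]
    for child in element:
        inner = render(child, emphasis_tag)
        if child.tag == "emphasis":
            parts.append(f"<{emphasis_tag}>{inner}</{emphasis_tag}>")
        else:
            parts.append(inner)
        # Text that follows the child's end tag belongs to the parent.
        parts.append(escape(child.tail or ""))
    return "".join(parts)


para = ET.fromstring(SOURCE)
print(render(para, "i"))  # Remember, <i>this is important</i>.
print(render(para, "b"))  # Remember, <b>this is important</b>.
```

The source string is never edited; only the argument that controls the rendering changes.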
Consider Example 7.1, “DocBook Example” again. The tags here, for example <surname>, clearly identify what the content is, but do not say anything about how that content should be rendered. If you decide that a person’s name should be rendered in the Palatino typeface, then decide later that it should be rendered in Arial, you do not need to touch the content. You can go further and render Professor Feynman’s name as “Richard Feynman” or as “Feynman, Richard.” Take it another step further and you can generate a Table of Epigraphs for the front matter of a book. If you had simply left the name unmarked, or marked it up with rendering specific markup, you could not have done any of these things.
Just as important, descriptive markup means that you can update the rendering software without changing the content. If you wanted a similar rendering change in a document that uses rendering-specific markup, such as HTML or Microsoft Word, you would need to change the source document itself.
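The two name formats mentioned above can be produced from the same marked-up name. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module; format_name and its style values are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

PERSON = """\
<personname>
  <firstname>Richard</firstname>
  <surname>Feynman</surname>
</personname>
"""


def format_name(personname, style="natural"):
    """Format a personname element; style is 'natural' or 'inverted'."""
    first = personname.findtext("firstname", default="").strip()
    last = personname.findtext("surname", default="").strip()
    if style == "inverted":
        return f"{last}, {first}"
    return f"{first} {last}"


person = ET.fromstring(PERSON)
print(format_name(person))              # Richard Feynman
print(format_name(person, "inverted"))  # Feynman, Richard
```

Because the markup says which part is the first name and which is the surname, the choice of format is left entirely to the processing code.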
Data independence means that the data is independent of the software and hardware used to process it. This also means that data can be shared easily between different environments. Just as descriptive markup separates the content of a document from instructions on how to process it, data independence separates the document from the applications that process it. Data independence comes from XML and SGML being international standards with an open architecture. It means that vendors can participate in the development of the standard, and that they can build applications on the standard without concern that it will arbitrarily change under their feet.
Many of the most widely used word processing and publishing packages are proprietary and closed. For example, Adobe FrameMaker and Microsoft Word have native formats that, while they are not completely hidden from view, are proprietary and under the sole control of the companies that own them. To move content from one of these products to another product requires conversion, which can be very expensive. On the other hand, moving XML content, especially if it uses an industry standard schema, is usually trivially easy. It’s not at all uncommon for an XML shop to have authors working on the same document using different editing software. As long as the software treats the XML according to the standard, this is easy to do.
Data independence gives you freedom in two dimensions. It makes it possible for you to use any tools that can handle XML data, and it lets you update your tool set over time. Unless you write press releases for pop stars, the chances are that you re-use content in one way or another. Either you update content as the service changes, or you borrow content from one product to use with another. Hmmm, maybe press releases aren’t that different either: “Mega-star <insert name here> has released <his/her> latest ground-breaking album, <insert album name here>.”
In any case, if you code your content using Microsoft Word or Adobe FrameMaker, you will have fewer tools to choose from and a much harder time moving to a new environment. That is, to a greater or lesser degree you’re “locked in” to your toolset if you pick a proprietary tool. If you code your content using XML, you can use anything from open source tools, like Emacs for editing or Saxon for processing, to high end commercial tools, like Arbortext Adept Editor.
 The Mythical Man-Month. Essays on Software Engineering, 20th Anniversary Edition. Addison-Wesley. 0-201-83595-9.
 Managing Your Documentation Projects. John Wiley & Sons, Inc. 0-471-59099-1.
 Power and Influence. Beyond Formal Authority. The Free Press. 0-02-918330-8.
 ISO 8879:1986 Information processing—Text and office systems—Standard Generalized Markup Language (SGML). 1986.
 Extensible Markup Language (XML) 1.0 (Fourth Edition). 16 August 2006. http://www.w3.org/TR/xml/.
 The Roots of SGML. A Personal Recollection. http://www.sgmlsource.com/history/roots.htm.
 Out of the Crisis. Massachusetts Institute of Technology, Center for Advanced Engineering Study. 0-911379-01-0.
 Now, Discover Your Strengths. The Free Press. 0-7432-0114-0.
 The Elements of Style. MacMillan Publishing Company. 0-02-418220-6.
 Closing Keynote, XML 2006. XML 2006 Conference. December 5-1, 2006. Boston, MA. Idealliance. http://2006.xmlconference.org/proceedings/162/presentation.html.
 A Gentle Introduction to SGML. http://www.isgmlug.org/sgmlhelp/g-index.htm.
 Report of the Presidential Commission on the Space Shuttle Challenger Accident. http://history.nasa.gov/rogersrep/genindex.htm.
- The audience is the group of people who will be using your product. See Also Product.
- The tangible things that writers deliver to the project. For example, User Guides, Administrator Guides, Manuals, etc.
- The people who will be designing and building the product. See Also Product.
- The environment is the set of tools, processes, and personnel that the writer works with.
- Generalized Markup Language. An early markup language developed at IBM by Charles Goldfarb, Edward Mosher, and Raymond Lorie. Information can be found at: GML (Wikipedia).
- The product is whatever you’re writing about, even if it’s not a product. It could be a service, software, hardware, an airplane, or a toaster.
- The schedule is the timeline and milestones for the project.
- Standard Generalized Markup Language. A high-level description can be found at: SGML (Wikipedia), or SGML (w3.org/MarkUp/SGML)
- The tasks are the set of things the audience will be doing with the product. See Also Audience, Product.
- Extensible Markup Language. A high-level description can be found at: XML (Wikipedia), or XML (w3.org/XML)
Copyright © 2007 Richard L. Hamilton