Introduction to SGML and XML

SGML is an acronym for "Standard Generalized Markup Language," an international standard for representing information. The term "markup" comes from the publishing world, originally referring to the instructions that an editor would give to a typesetter when producing a printed text from a manuscript. Today we use "markup" to mean the instructions that accompany an electronic text. Unlike traditional markup instructions from editor to typesetter, which had primarily to do with the way the text would appear on the printed page, SGML is used principally to convey information about the function and content of the text rather than its appearance.

Why is this important? Although computers are very powerful in certain ways, they have no ability to interpret text the way the human mind does. We can look at four documents and understand that the first is a personal letter, the second a prescription for medicine, the third a bank statement, and the fourth a multiple-choice quiz. Similarly, when we read the personal letter, we can tell that the writer is using a phrase ironically or inserting a quotation in a foreign language. Computers can't make judgements like this unless we add descriptive information in a systematic way, and that's what SGML is all about: adding labels (often called tags) to sections of text to explain what they are and how they function.

The "generalized" part of SGML means that the markup is not designed for any specific application such as Microsoft Word or a browser like Netscape, or any particular type of computer hardware or system. A document marked up in SGML can be read by human beings and shared between computers in a standard and straightforward way. This allows us to create very large collections of electronic documents which are still manageable because we can use the markup to identify just the ones we're interested in at the moment: only the bank statements, for example, or only the paragraphs within a document that include foreign terms.

SGML is sometimes called a "meta-language" because it isn't a single language, but rather a set of principles that guides the formation of various markup languages, all of which are applications of SGML designed to handle different kinds of texts. For example, the Text Encoding Initiative (TEI) is for texts in the humanities, and Encoded Archival Description (EAD) is for archival finding aids. Both sets of rules, called Document Type Definitions (DTDs), were written using the SGML "grammar."

The Extensible Markup Language (XML) is a simplified dialect, or subset, of SGML. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML was designed for ease of implementation, and for interoperability with both SGML and HTML. For most practical purposes, the differences between SGML and XML are minor, and TEI-encoded texts can easily be transferred from SGML to XML.
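To make the idea of descriptive markup concrete, here is a minimal sketch in Python that uses the XML parser from the standard library. The element and attribute names (collection, document, type, foreign, lang) are invented for illustration and do not come from TEI, EAD, or any other real DTD. It shows how tags let a program pick out only the bank statements, or only the paragraphs that include foreign terms.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element and attribute names are invented
# for this example and are not taken from any real DTD.
COLLECTION = """
<collection>
  <document type="letter">
    <p>Dear Anna, the concert was <foreign lang="fr">magnifique</foreign>.</p>
    <p>We had a quiet week otherwise.</p>
  </document>
  <document type="bank-statement">
    <p>Closing balance: 120.50</p>
  </document>
  <document type="letter">
    <p>See you soon, <foreign lang="it">ciao</foreign>!</p>
  </document>
</collection>
"""


def documents_of_type(root, doc_type):
    """Return every <document> element whose type attribute matches."""
    return [d for d in root.findall("document") if d.get("type") == doc_type]


def paragraphs_with_foreign_terms(root):
    """Return every <p> element that directly contains a <foreign> element."""
    return [p for p in root.iter("p") if p.find("foreign") is not None]


def foreign_terms(root):
    """Return (language, text) pairs for every <foreign> element."""
    return [(f.get("lang"), f.text) for f in root.iter("foreign")]


root = ET.fromstring(COLLECTION)
statements = documents_of_type(root, "bank-statement")
marked_paragraphs = paragraphs_with_foreign_terms(root)
terms = foreign_terms(root)

print(len(statements), "bank statement(s)")
print(len(marked_paragraphs), "paragraph(s) with foreign terms")
print(terms)
```

Without the tags, the program would have no reliable way to tell a letter from a bank statement, or a French word from an English one. With them, each selection is a one-line query.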

Extensible Stylesheet Language (XSL) is used to transform XML files into other types of documents. An XSL stylesheet can, for example, facilitate the creation of an HTML display of a source XML file.
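The kind of transformation a stylesheet performs can be imitated in a few lines of Python. This sketch is not XSL: Python's standard library does not include an XSL processor, so a plain dictionary of rules stands in for the stylesheet. The source element names and the RULES mapping are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the element names are invented.
SOURCE = (
    "<letter><title>A short note</title>"
    "<p>The concert was <foreign>magnifique</foreign>.</p></letter>"
)

# Stand-in for stylesheet rules: each source element maps to an HTML element.
RULES = {"letter": "div", "title": "h1", "p": "p", "foreign": "i"}


def to_html(elem):
    """Recursively rebuild the tree, renaming each element per RULES."""
    out = ET.Element(RULES.get(elem.tag, "span"))  # unknown tags become <span>
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_html(child))
    return out


html = ET.tostring(to_html(ET.fromstring(SOURCE)), encoding="unicode", method="html")
print(html)
```

The source file says only that a word is foreign. The decision to show foreign words in italics lives in the rules, so the same source could be displayed differently by swapping in another set of rules.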

If you're interested in learning more about SGML, the following web page contains a collection of introductory essays:

http://www.oasis-open.org/cover/general.html#overview

More information on XML may be found at:

http://www.oasis-open.org/cover/xml.html#overview

http://www.oasis-open.org/cover/xmlIntro.html
