Digital Library of Georgia

Introduction to SGML and XML

SGML is an acronym for "Standard Generalized Markup Language," an international standard for representing information. The term "markup" comes from the publishing world, originally referring to the instructions that an editor would give to a typesetter when producing a printed text from a manuscript. Today we use "markup" to mean the instructions that accompany an electronic text. Unlike traditional markup instructions from editor to typesetter, which had primarily to do with the way the text would appear on the printed page, SGML is used principally to convey information about the function and content of the text rather than its appearance.

Why is this important? Although computers are very powerful in certain ways, they have no ability to interpret text the way the human mind does. We can look at four documents and understand that the first is a personal letter, the second a prescription for medicine, the third a bank statement, and the fourth a multiple-choice quiz. Similarly, when we read the personal letter, we can tell that the writer is using a phrase ironically or inserting a quotation in a foreign language. Computers can't make judgements like this unless we add descriptive information in a systematic way, and that's what SGML is all about: adding labels (often called tags) to sections of text to explain what they are and how they function.
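As a minimal sketch of what such labels look like in practice, the Python snippet below parses a short letter tagged in XML syntax and lists each labelled phrase. The element names (letter, p, soCalled, foreign) and the letter itself are assumptions chosen for illustration; they are not necessarily the tags used in this project.

```python
import xml.etree.ElementTree as ET

# A hypothetical marked-up letter; the tag names are illustrative only.
LETTER = """<letter>
  <p>Our <soCalled>vacation</soCalled> was spent repairing the roof.</p>
  <p>Still, <foreign xml:lang="fr">c'est la vie</foreign>, as they say.</p>
</letter>"""

# ElementTree exposes the built-in xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def describe_labels(xml_text):
    """Return (tag, text, language) for each labelled phrase inside a paragraph."""
    root = ET.fromstring(xml_text)
    found = []
    for paragraph in root.iter("p"):
        for phrase in paragraph:  # direct children of the paragraph
            found.append((phrase.tag, phrase.text, phrase.get(XML_LANG)))
    return found

for tag, text, lang in describe_labels(LETTER):
    print(tag, repr(text), lang)
```

The program does not need to "understand" irony or French: the tags tell it which phrase is used ironically and which is in a foreign language.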

The "generalized" part of SGML means that the markup is not designed for any specific application such as Microsoft Word or a browser like Netscape, or any particular type of computer hardware or system. A document marked up in SGML can be read by human beings and shared between computers in a standard and straightforward way. This allows us to create very large collections of electronic documents which are still manageable because we can use the markup to identify just the ones we're interested in at the moment: only the bank statements, for example, or only the paragraphs within a document that include foreign terms.
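The kind of selection described above can be sketched in a few lines of Python. The collection, the tag names (collection, doc, p, foreign), and the type and id attributes are hypothetical, invented only to show the idea of using markup to pick out documents or paragraphs of interest.

```python
import xml.etree.ElementTree as ET

# A hypothetical three-document collection; tag and attribute names are illustrative.
COLLECTION = """<collection>
  <doc type="letter" id="d1">
    <p>We arrived <foreign>en masse</foreign> on Tuesday.</p>
    <p>The weather was fine.</p>
  </doc>
  <doc type="bankStatement" id="d2">
    <p>Closing balance: 120.50</p>
  </doc>
  <doc type="bankStatement" id="d3">
    <p>Closing balance: 87.10</p>
  </doc>
</collection>"""

root = ET.fromstring(COLLECTION)

# Only the bank statements: select doc elements by their type attribute.
statement_ids = [d.get("id") for d in root.findall("doc[@type='bankStatement']")]

# Only the paragraphs that contain a foreign term as a direct child.
foreign_paragraphs = ["".join(p.itertext()) for p in root.findall(".//p[foreign]")]

print(statement_ids)
print(foreign_paragraphs)
```

The same two queries would work unchanged on a collection of thousands of documents, which is what keeps a large marked-up collection manageable.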

SGML is sometimes called a "meta-language" because it isn't a single language, but rather a set of rules for defining markup languages, each of which is an application of SGML designed to handle a particular kind of text. For this project, we'll be working with TEI (Text Encoding Initiative), a set of tags and rules for using them.

When you mark up a document using the TEI version of SGML, you are "publishing" it in a certain sense. It will become part of a database of electronic documents within the Digital Library of Georgia for everyone in the world to see. For this reason it is very important to proofread the text carefully, do the markup correctly and consistently, and ask questions whenever you are uncertain how to proceed. The editing program you will use has many features to help you avoid making mistakes, and your work will be reviewed by at least one other person, but there is no substitute for getting it right the first time.

The Extensible Markup Language (XML) is a simplified dialect, or subset, of SGML. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML was designed for ease of implementation and for interoperability with both SGML and HTML. For most practical purposes, the differences between SGML and XML are minor, and TEI-encoded texts can easily be transferred from SGML to XML.
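One of those minor differences is that SGML can allow certain end tags to be left out when the document type permits it, while XML requires every element to be closed explicitly. The sketch below, which uses Python's standard XML parser and made-up sample text, shows an XML parser rejecting the first form and accepting the second. The helper name is_well_formed is introduced here for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Paragraph end tags omitted, as an SGML document type may allow.
sgml_style = "<text><p>First paragraph<p>Second paragraph</text>"

# The same content with every element explicitly closed, as XML requires.
xml_style = "<text><p>First paragraph</p><p>Second paragraph</p></text>"

print(is_well_formed(sgml_style))
print(is_well_formed(xml_style))
```

Converting a text from SGML to XML largely means making implicit structure like this explicit, which is why the transfer is usually straightforward.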

If you're interested in learning more about SGML, the following web page contains a collection of introductory essays:

http://www.oasis-open.org/cover/general.html#overview

More information on XML may be found at:

http://www.oasis-open.org/cover/xml.html#overview

http://www.oasis-open.org/cover/xmlIntro.html
