Digital Library of Georgia

Digital Library of Georgia Digitization Guide

Version 1.0 July 2001

This guide is designed to give an overview of the digitization process for historical documents such as manuscripts, photographs, books, printed materials, and other flat paper items. It is intended to cover the basics of digitization projects in a concise manner. Links to more in-depth information are found throughout the guide, and a list of resources appears at the end.

Table of Contents

Intro: What is Digitization?
Step 1: Questions to Ask
Step 2: Selection
Step 3: Equipment for Digitization
Step 4: Scanning
Step 5: Text Encoding
Step 6: Metadata
Step 7: Quality Control
Step 8: Delivery Methods
Selected Resources

What is Digitization?

Scanning is easy. Building a website is relatively easy. However, once you start to consider a digitization project of hundreds or thousands of items, you quickly discover that you are playing an entirely different ballgame.

"Digitization" refers to all of the steps involved in the process of making collections of historical materials available online. This includes all of the following steps and more depending on what you're digitizing:

As you can see, there are many things to consider, and the digitization process is not as simple as merely "scanning." This guide is designed to help with the process and point to other resources on digitizing historical materials.

Step 1. Questions to Ask

Before you move to the technical details of your project or this document, first consider the following questions:

For help in identifying your audience, look at the audiences identified for the Digital Library of Georgia. Criteria for selecting materials should be guided by the Collection Development Policy for the Digital Library of Georgia.

One of the most important things to consider is allowing an appropriate amount of time for your project. While some people assume that "digitizing stuff is easy and fast," the steps mentioned in the introductory section should give an idea of the real time and effort involved. The digitization process is more of a management issue than a technical one. Time for administration, reports, personnel management, file manipulation and organization, backups, and building an appropriate and usable user interface MUST be factored into any time estimate. Always be realistic in estimating how much time it will take: it will always be more than you think.

The Kentuckiana Digital Library, part of the Kentucky Virtual Library, provides an excellent Project Planning Guide for thinking about these questions and planning projects:

http://www.kyvl.org/kentuckiana/bpguide/projectplans.shtml

Step 2. Selection

At the beginning of the selection process, consider the Collection Policy Guidelines for the Digital Library of Georgia. These guidelines raise topics and questions that should be considered in any digitization project.

Copyright is an extremely important aspect of selection. Before items are digitized and distributed to the public on the Internet, be sure that they are in the public domain or that you are making fair use of materials still under copyright. Refer to the chart When Works Pass Into the Public Domain by Laura N. Gasaway for more information:

http://www.unc.edu/~unclng/public-d.htm

Step 3. Equipment for Digitization

Computers. Computer hardware and software are constantly changing. Fortunately, currently produced computers are fast enough to handle almost any scanning or other digitization task. Here are some minimum guidelines (current as of 7/2001) for computer hardware:

Scanners. Flatbed scanners vary widely by model and manufacturer. Like computers, many basic scanners now have enough quality to do adequate scanning. Generally, look for a model from a major manufacturer and consider reviews published by computer magazines and websites before purchase. A large bed scanner (13 x 17 is a normal size) is desirable for digitizing large special collections materials.

Look for a scanner with a minimum of 600 dpi optical resolution as opposed to interpolated resolution.

Note about Digital Cameras. Digital cameras are still fairly new on the market and are still generally for "snapshot" photography. For this reason most digital cameras are inadequate for reproduction of special collections materials. The Digital Library of Georgia has had some success with commercial digital cameras, but even the best of these (capturing 4 megapixels) can only do very small items to the same degree of resolution and quality as a flatbed scanner. Digital camera set-ups that are adequate for reproducing archival materials, such as those made by PhaseOne or JenOptik, are very expensive and require substantial expertise to use.

Software. For most scanning applications, Adobe Photoshop is the de facto standard and is recommended. Photoshop is more expensive than some other imaging packages, but the quality of images produced by Photoshop has been shown to be greater than that of images produced by other software.

Scanners also come with a driver application that operates the scanner and sends the image to Photoshop. The features included in these applications are dependent on the scanner manufacturer, but they usually include the ability to preview, descreening functionality for halftone (dot-based) images, and color correction.

While Photoshop can handle batch processing and conversions (such as creating derivative images from master images), another piece of software such as DeBabelizer may be desirable for intensive processing.

Storage. Currently the norm for storage is CD-R media (storage on CD-RW media is not recommended). The best CD-R media for long-term storage is "Gold" media, which should cost about $1 per disc. Stay away from very cheap CD-R media. Save your master images (and derivatives, to be safe) onto CD-R and make a second copy for backup purposes. Store your media in jewel cases rather than envelopes, and label them so that their contents can be identified.

When CD-R media are stored long term, the lifespan before data loss occurs is approximately 5-10 years. For ideal preservation of digital materials, plan to recopy all of your CD-R media after 5 years.

While DVD-based storage is up and coming, as of mid-2001 standards are not developed enough to use this format for long-term archival storage.

Step 4. Scanning

The process for scanning documents is:

  1. Create a master image, saved as a TIFF (Tagged Image File Format) file
  2. Generate derivative or access images from this master image
  3. Save the master images onto CD-R discs and mount the access images on a web server for public access

For manuscripts and printed materials use the following guidelines:

* Note that the descreening functionality provided by your scanner's image capture software may need to be used for materials containing halftone screens made up of visible dots (as found in newspapers and magazines) or detailed engravings or other art.

Photographs:

You may need to use a background made of white or black paper or cardboard to ensure that colors are correctly reproduced and that ink does not show through from the other side of multi-page documents.

Color. When installing your scanner and computer, be sure that the monitor and scanner are calibrated according to the manufacturer's instructions, so that colors are consistent from one image to another and match a standardized color bar. Most scanners come with a color target and instructions on calibrating the color settings. After calibrating, be sure not to change the settings of your monitor. It is a good idea to check the calibration of the scanner and monitor before each scanning session.

Scanners usually read and adjust for color automatically during the scanning process. Keep in mind that the final image on the screen should look like an exact duplicate of the original object. If scans come out too yellow, too dark, or too light, adjust the scan settings and try again.

Filenaming. If the photographs or manuscripts have an identifier or number associated with them, use this as the filename. For the sake of consistency, use only lowercase letters in filenames. Always use three-character file extensions (.tif, .jpg, .gif). Generally, the shorter the filename the better.

For example, if a photograph collection uses the identification number 55-JBC-2 for an individual photograph, the filename for that image can be 55-jbc-2.tif.

If you need to make up a file naming system, a good method is to use a lowercase letter or two followed by at least 3 numbers (if you expect more than 999 images, use 4 numbers). So for the fictional "Georgia Railroad Photograph Collection" we might use:

gr001.tif
gr002.tif
etc.

For multi-page documents like letters with fewer than 26 pages, adding a letter at the end is a good option:

br035a.tif
br035b.tif
br035c.tif

For books or long manuscripts, follow the file naming scheme shown above for photographs.

Always keep the same number of characters in the filename and do not vary the length. This makes automatic processing and manipulation of the images by programs much easier. Overall, be consistent! It will make things much more manageable.
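The naming scheme described above can be sketched in a few lines of Python. This is only an illustration, not a Digital Library of Georgia tool: the function names sequential_name and page_name are hypothetical, and the prefixes are taken from the fictional examples above.

```python
import string


def sequential_name(prefix, number, digits=3, ext="tif"):
    """Build a fixed-length, lowercase filename such as gr001.tif."""
    if number < 1 or number >= 10 ** digits:
        raise ValueError("number does not fit in the chosen digit count")
    return f"{prefix.lower()}{number:0{digits}d}.{ext.lower()}"


def page_name(prefix, number, page, digits=3, ext="tif"):
    """Add a page letter for multi-page items, e.g. br035a.tif.

    Letters a-z only cover 26 pages, so longer items need the
    purely numeric scheme instead.
    """
    if not 1 <= page <= 26:
        raise ValueError("letter suffixes only cover pages 1 through 26")
    letter = string.ascii_lowercase[page - 1]
    return f"{prefix.lower()}{number:0{digits}d}{letter}.{ext.lower()}"


print(sequential_name("gr", 1))    # gr001.tif
print(sequential_name("GR", 2))    # gr002.tif
print(page_name("br", 35, 3))      # br035c.tif
```

Because the number of digits is fixed in advance, every filename in a collection has the same length, which is what makes later batch processing straightforward.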

Cropping. Cropping depends on the material being scanned. For most images, any background is cropped out leaving just the object. Photographs can be cropped to the edges of the photograph, and printed books can be cropped to the edge of the page. It is desirable for items such as diaries to crop outside the edge of the object, so that your image looks like a picture of a book rather than disembodied pages.

Image Derivatives:

Several common types of derivative images may be created from master images.

GIF (Graphics Interchange Format) This format is currently used only for creating thumbnails and 1-bit bitonal (black & white) images.

JPEG (Joint Photographic Experts Group) This format is widely used for creating medium and high resolution images for Web delivery.

PDF (Portable Document Format) From Adobe, this compressed format requires users to have the Adobe Acrobat Reader software installed on their machine (a common default on newer machines and browsers). It offers the benefit of re-sizing on screen and easy printing of documents. This format is most commonly used for printed documents.

DjVu From LizardTech, the DjVu format allows for higher compression ratios than PDF files while providing many of the same benefits of re-sizing, image rotation, and easy printing. It requires a free plug-in, available from LizardTech.

MrSID Also from LizardTech, the MrSID format is most commonly used for large oversized items like maps and posters. Server software may be used with this image format to deliver JPEG images to the user's browser and allow them to zoom into and resize images while maintaining high levels of quality.

The most common practice for creating derivatives is to lower the resolution of the master TIFF image to 150 dpi and 72 dpi, letting the height and width of the image adjust to this resolution automatically in Photoshop (the image on the screen should be smaller than the master image, not the same size). This creates roughly a "2x" and a "1x" magnification of the original, respectively.
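To see what lowering the resolution does to the pixel dimensions, consider the arithmetic below. It is a sketch under stated assumptions: the 8 x 10 inch page and the 300 dpi master resolution are invented for illustration, and derivative_size is a hypothetical helper, not a Photoshop function.

```python
def derivative_size(width_px, height_px, master_dpi, target_dpi):
    """Return pixel dimensions after resampling to target_dpi,
    keeping the physical size (in inches) of the image unchanged."""
    scale = target_dpi / master_dpi
    return round(width_px * scale), round(height_px * scale)


# Hypothetical master: an 8 x 10 inch page scanned at 300 dpi.
master_dpi = 300
master_w, master_h = 8 * master_dpi, 10 * master_dpi   # 2400 x 3000 pixels

for dpi in (150, 72):
    w, h = derivative_size(master_w, master_h, master_dpi, dpi)
    print(f"{dpi} dpi derivative: {w} x {h} pixels")
```

Under these assumptions the 150 dpi derivative is 1200 x 1500 pixels and the 72 dpi derivative is 576 x 720 pixels, both far smaller than the 2400 x 3000 pixel master.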

For Digital Library of Georgia projects, it is standard practice to offer a 72 dpi JPEG and a DjVu image for manuscript materials. The MrSID format will also be used for large format and photographic materials.

To create thumbnails, the width of the master image is reduced to 100-200 pixels before saving in GIF or JPEG format (JPEG recommended). The most common reduction is to 150 pixels wide.

The JPEG format allows for various levels of compression. For most purposes, save JPEG derivatives at "High Quality" in Photoshop and examine them closely for artifacts of the compression process. These usually show up as "squiggles" around letters or sharp edges in the image. If these appear, decrease the level of compression (by increasing the quality value in Photoshop) when saving.

Step 5. Text Encoding

Documents, books, and other written and printed materials may also be transferred into text format as part of the digitization process. This can be done either through OCR (Optical Character Recognition) Technology or by hand transcription, commonly called "rekeying."

The resulting text may then be presented on a web page or encoded as SGML or XML for inclusion in a searchable database. See the Introduction to SGML and XML for more information.

Because text transcription and encoding are very expensive and time consuming, individual projects will want to consider whether or not to make the investment. If the documents have high historical value and searchability is an important criterion for digitization, then transcription and encoding are the preferred method of achieving this goal.

TEI (Text Encoding Initiative) Lite is the most commonly used SGML/XML DTD (Document Type Definition) for encoding historical documents.

More information on TEI may be found at The TEI Consortium Homepage. The Digital Library of Georgia recommends the use of TEI Text Encoding in Libraries Guidelines for Best Encoding Practices.

Step 6. Metadata

As a first step in collecting metadata, either copy the MARC record for the collection, obtain the finding aid, or collect the information that would otherwise be needed to create a collection-level MARC record.

Metadata for the Digital Library of Georgia is based on the Dublin Core Metadata Format. This format consists of a set of 15 repeatable elements identified as basic descriptive elements for electronic resources.

The following core elements are the desirable minimum for Digital Library of Georgia projects:

Title

The title of the item or resource. (See suggested guidelines)

Creator

The creator of the item, in Library of Congress authority form.

Subject

Subjects covered by the item. Use of controlled vocabularies such as Library of Congress Subject Headings, the Library of Congress Thesaurus for Graphic Materials, the Art and Architecture Thesaurus, etc., is recommended.

Description

A short description of the item. (See suggested guidelines)

Publisher

Enter the name of the repository where the original materials are held.

Identifier

Filename, image number, PURL, or other URI (Uniform Resource Identifier) for the image.

Date

The date the image or transcription was created. Note that the date of creation of the original material goes in "Coverage."

Coverage (Temporal)

The date of the creation of the material in the form YYYY-MM-DD. So for a letter written on November 20, 1864, enter the date as 1864-11-20.

Coverage (Spatial)

The name of the Georgia county and/or city in which the item originated. If none is indicated, "Georgia" will be sufficient. Use Kenneth K. Krakow, Georgia Place Names: Their History and Origins 2nd ed. (Macon: Winship Press, 1994), and the Thesaurus of Geographic Names where possible for a controlled list. If from another state, list the city, county, and state as available.

Format

The format of the image or digital file, expressed in the MIME format. For example: image/tiff, image/jpeg, text/html, etc.

Source

A reference to a resource from which the present resource is derived in whole or in part. For archival collections, this should be the name of the collection from which the item originates.
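As an illustration of the core elements above, the following sketch builds one record as a Python dictionary and checks that every core element has a value. All sample values (names, repository, collection, filename) are invented for illustration, and missing_core_elements is a hypothetical helper, not part of any Dublin Core tool.

```python
import mimetypes
from datetime import date

# Core elements listed above as the desirable minimum.
CORE_ELEMENTS = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Identifier", "Date", "Coverage (Temporal)", "Coverage (Spatial)",
    "Format", "Source",
]


def missing_core_elements(record):
    """List the core elements that are absent or empty in a record."""
    return [name for name in CORE_ELEMENTS if not record.get(name)]


image_file = "br035a.tif"  # hypothetical filename

record = {
    "Title": "Letter to Mary Smith",
    "Creator": "Smith, John, 1830-1900",
    "Subject": "United States--History--Civil War, 1861-1865",
    "Description": "Letter describing camp life.",
    "Publisher": "Example County Historical Society",
    "Identifier": image_file,
    # Date the digital image was made, not the date of the letter.
    "Date": date(2001, 7, 15).isoformat(),
    # Date the original letter was written, in YYYY-MM-DD form.
    "Coverage (Temporal)": date(1864, 11, 20).isoformat(),
    "Coverage (Spatial)": "Fulton County, Georgia",
    # MIME type guessed from the file extension.
    "Format": mimetypes.guess_type(image_file)[0],
    "Source": "John Smith papers",
}

print(record["Coverage (Temporal)"])
print(record["Format"])
print(missing_core_elements(record))
```

A check like this can be run over every record before loading, so that items with an empty core element are flagged for review rather than discovered later.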

The following additional Dublin Core elements are optional, but their use is encouraged:

Contributor

An entity responsible for making contributions to the content of the resource.

Relation

A reference to a related resource.

Language

A language of the intellectual content of the resource. Values should be taken from the ISO 639-2 standard and expressed as three-letter codes (e.g., eng, fre).

Type

The nature or genre of the content of the resource. Use standardized vocabularies such as the Library of Congress Thesaurus for Graphic Materials II: Genre and Physical Characteristic Terms (TGM II) or the Art and Architecture Thesaurus.

Rights

Information about rights held in and over the resource, such as a statement of copyright or other property rights.

Descriptions of all the Dublin Core Metadata Elements may be found in the Official Dublin Core Guidelines at:

http://dublincore.org/documents/dces/

General Input Guidelines

Format. See the Recommended Guidelines for Titles and Descriptions. Also refer to the rules for archival description outlined in Archives, Personal Papers, and Manuscripts by Steven L. Hensen for the proper format of titles, authors, descriptions, and other information.

Punctuation. Avoid extraneous punctuation or ending punctuation unless it is part of the content of the resource.

Abbreviations. Use only common or accepted abbreviations (such as "St." for "Saint"), designations of function (such as "ed." for "Editor"), terms used with dates (such as "b." "fl." or "ca."), compound words, or distinguishing terms added to names of persons if they are abbreviated on the source (such as "Mrs."). Abbreviations should not be used if they would make the record entry unclear. See Appendix B of the Anglo-American Cataloguing Rules 2nd ed. Revised (AACR2R) for more information.

Capitalization. In general, capitalize the first word (of a title, for example) and proper names (place, personal, and corporate names) only. When using a thesaurus, follow the capitalization format used in its vocabulary. Capitalize content in the description field according to normal rules of writing. Do not enter content in all caps except for acronyms.

Initial Articles. Leave off all initial articles such as: the, a, an, le, la, los, el, der, die, das, etc. in the title field.

Keywords. Keywords or words that can identify the resource with precision and are relevant to the resource should be used to aid in resource retrieval and discovery. Subject terms taken from a controlled vocabulary are also recommended.

Purpose. Keep in mind that the purpose of metadata is to allow for effective resource discovery by a variety of audiences. Continually ask the question, "Will this help someone find this resource?" as you assign metadata terms. Be sure to bring out any aspects related to Georgia as much as possible.

Recording Metadata. Depending on the final presentation, metadata can be entered and collected in a variety of ways. The simplest of these is to create a spreadsheet or database in a desktop software application such as Microsoft Excel or Access. More advanced methods include the CORC system from OCLC, full MARC records, or XML implementations.

What if I can't find all this data? Try to capture as much data as you can for every item. If working with a photographic collection (where not having information about each item may be common), try at the very minimum for a caption or description of each photograph if that is all that is available without extensive research. Also be sure to capture all of the information about the collection that would be needed for an archival MARC record, if one does not already exist.
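The guidelines above on initial articles and ending punctuation lend themselves to a small cleanup routine. The sketch below is one possible approach rather than a Digital Library of Georgia tool; the function name clean_title is hypothetical, and the article list is the one given under "Initial Articles."

```python
INITIAL_ARTICLES = {
    "the", "a", "an", "le", "la", "los", "el", "der", "die", "das",
}


def clean_title(title):
    """Drop one leading article and trailing punctuation from a title."""
    words = title.strip().split()
    if len(words) > 1 and words[0].lower() in INITIAL_ARTICLES:
        words = words[1:]
    cleaned = " ".join(words).rstrip(".,;: ")
    # Capitalize the new first word without altering the rest of the title.
    return cleaned[:1].upper() + cleaned[1:]


print(clean_title("The history of Georgia railroads."))
print(clean_title("A letter from Savannah"))
```

A routine like this cannot tell when ending punctuation is part of the content (for example, a title that ends in an abbreviation such as "St."), so its output should still be reviewed by a person.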

Step 7. Quality Control

For large projects, effective quality control is essential. Recommended practices include keeping a checklist or log of images scanned and records created, checking each image and record produced for quality, and thoroughly checking filenames and links. Building quality control into the digitization process is very important, since even the best-trained and most conscientious person doing digitization will be prone to errors. Often, having "another set of eyes" look things over catches many problems caused by human error. Automated methods of quality control, usually custom programs or scripts, are also encouraged if possible.
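As one example of an automated check, the sketch below tests a batch of filenames against the naming rules from Step 4 (lowercase letters, three-character extensions, constant length) and reports gaps in a numbered sequence. The function names and sample filenames are hypothetical, and the gap check assumes .tif master files.

```python
import re


def check_filenames(names):
    """Return a list of problems found in a batch of image filenames."""
    problems = []
    lengths = {len(name) for name in names}
    if len(lengths) > 1:
        problems.append(f"filename lengths vary: {sorted(lengths)}")
    for name in names:
        if name != name.lower():
            problems.append(f"{name}: contains uppercase letters")
        if not re.fullmatch(r".+\.[A-Za-z]{3}", name):
            problems.append(f"{name}: extension is not three characters")
    return problems


def missing_numbers(names, prefix):
    """Report gaps in a sequence such as gr001.tif, gr002.tif, ..."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)\.tif")
    numbers = set()
    for name in names:
        match = pattern.fullmatch(name)
        if match:
            numbers.add(int(match.group(1)))
    if not numbers:
        return []
    return [i for i in range(min(numbers), max(numbers) + 1)
            if i not in numbers]


batch = ["gr001.tif", "gr002.tif", "GR004.tif", "gr5.jpeg"]
for problem in check_filenames(batch):
    print(problem)

print(missing_numbers(["gr001.tif", "gr002.tif", "gr004.tif"], "gr"))
```

Checks like these do not replace looking at each image, but they catch naming slips early, before links and metadata records are built on top of the files.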

Step 8. Delivery Methods

Delivery methods for online projects span a wide range. Basic HTML Web pages are the simplest way to put digitized materials online. More advanced methods include Web-accessible databases, and XML or SGML solutions. The setup chosen will depend on the technical expertise available and the scope and goals of the individual project. The Digital Library of Georgia provides support for delivery of digital collections under GALILEO, and contribution of digitized materials for inclusion in the DLG is recommended.
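For the simplest case, a basic HTML page, a short script can generate the page from a list of images and captions. The sketch below is only an illustration: build_index is a hypothetical function, and the filenames and captions are invented.

```python
from html import escape


def build_index(title, items):
    """Return a minimal HTML page listing items as links to access images.

    items is a list of (image_filename, caption) pairs.
    """
    rows = []
    for filename, caption in items:
        link = escape(filename, quote=True)
        rows.append(f'<li><a href="{link}">{escape(caption)}</a></li>')
    return (
        "<html>\n<head><title>" + escape(title) + "</title></head>\n"
        "<body>\n<h1>" + escape(title) + "</h1>\n<ul>\n"
        + "\n".join(rows)
        + "\n</ul>\n</body>\n</html>\n"
    )


page = build_index(
    "Georgia Railroad Photograph Collection",
    [("gr001.jpg", "Depot & water tower"), ("gr002.jpg", "Locomotive")],
)
print(page)
```

Generating pages from a list in this way keeps links consistent with the filenames and makes it easy to rebuild the page when items are added.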

Selected Resources

Archives, Personal Papers, and Manuscripts compiled by Steven L. Hensen, Society of American Archivists, 1989.

Colorado Digitization Project Guidelines
http://coloradodigital.coalliance.org/glines.html

Digital Library Production Guide, Kentuckiana Digital Library
http://www.kyvl.org/kentuckiana/bpguide/about.shtml

Dublin Core Metadata Initiative
http://purl.org/DC/

Guides to Quality in Visual Resource Imaging, Research Libraries Group
http://www.rlg.org/visguides/

Handbook for Digital Projects: A Management Tool for Preservation and Access, Maxine K. Sitts, Ed., Northeast Document Conservation Center, 2000.

North Carolina ECHO (Exploring Cultural Heritage Online) Digitization Guide
http://www.ncecho.org/Guide/index.htm

Moving Theory into Practice: Digital Imaging for Libraries and Archives by Anne R. Kenney and Oya Y. Rieger, Research Libraries Group, 2000.
