Digital Library of Georgia Digitization Guide

Version 2.0 Sept. 2004

This guide is designed to give an overview of the digitization process for historical documents such as manuscripts, photographs, books, printed materials, and other flat paper items. It is intended to cover the basics of digitization projects in a concise manner. Links to more in-depth information are found throughout the guide, and a list of resources appears at the end.

What is Digitization?

Scanning is easy. Building a website is relatively easy. However, once you start to consider a digitization project of hundreds or thousands of items, you quickly discover that you are playing an entirely different ballgame.

"Digitization" refers to all of the steps involved in the process of making collections of historical materials available online. This includes all of the following steps and more depending on what you're digitizing:

  • Ask a lot of questions first
  • Get financial resources, staff, and equipment; build technical knowledge
  • Project management
  • Selection of collections/materials
  • Scan or digitize materials
  • Transcribe, markup, index
  • Create metadata
  • Quality control
  • Process images
  • Mount it all on the web
  • More quality control
  • Preserve and maintain archival media and online collections

As you can see, there are many things to consider, and the digitization process is not as simple as merely "scanning." This guide is designed to help with the process and point to other resources on digitizing historical materials.

Step 1. Questions to Ask.

Before you move to the technical details of your project or this document, first consider the following questions:

  • Why do you want to digitize materials?
  • Who is your audience?
  • Who owns the materials?
  • Who is the copyright holder for the intellectual property contained in the materials?
  • What is your timeframe for the project?
  • How is the project being funded?
  • Who will be responsible for different stages of the project?
  • How will you digitize the materials?
  • How will you describe the materials and what metadata scheme will be used?
  • How will you provide access to the collection?
  • How will you preserve and maintain the collection and digitized materials?

For help in considering who your audience is, consider the audiences identified for the Digital Library of Georgia. Criteria for selecting materials for the DLG are guided by its Collection Development Policy.

Allow an appropriate amount of time for your project. The digitization process requires considerable effort, and it is at least as much of a management issue as a technical one. Time for administration, reports, personnel management, file manipulation and organization, backups, quality control, and building an appropriate and usable user interface MUST be factored into any time estimate. Be realistic in projecting how much time the work will take - it always will be more than you think.

The Kentuckiana Digital Library, part of the Kentucky Virtual Library, provides an excellent Project Planning Guide for thinking about these questions and planning projects.

Step 2. Selection.

At the beginning of the selection process, consider the Collection Policy Guidelines for the Digital Library of Georgia. These guidelines raise topics and questions that should be considered in any digitization project.

Copyright is an extremely important aspect of selection. Before items are digitized and distributed to the public on the Internet, be sure that they are in the public domain, that you are making a fair use of the materials under U.S. copyright law, or that you have secured written permission of the copyright holder. Refer to the chart When Works Pass Into the Public Domain by Laura N. Gasaway for more information.

Step 3. Equipment for Digitization.

  • Computers. Computer hardware and software are constantly changing. Fortunately, currently produced computers are fast enough to handle most scanning and other digitization tasks. Select PCs with the upper range of RAM, disk storage, and processor speed available at the time of purchase. Also important are CD-R and CD-ROM drives (for creating and copying CDs) and a monitor that is 17 inches or larger.
     
  • Scanners. Flatbed scanners vary widely by model and manufacturer. Like computers, many basic scanners now have enough quality to do adequate scanning. Generally, look for a model from a major manufacturer and consider reviews published by computer magazines and websites before purchase. A large bed scanner (13 x 17 is a normal size) is desirable for digitizing large special collections materials.

    Look for a scanner with a minimum of 600 dpi optical resolution as opposed to interpolated resolution.
     
  • Note about Digital Cameras. Most consumer-level digital cameras are inadequate for reproduction of special collections materials. Even the best of them can only do very small items to the same degree of resolution and quality as a flatbed scanner. Digital camera set-ups that are adequate for reproducing archival materials, such as those made by PhaseOne or BetterLight, are very expensive and require substantial expertise to use.
     
  • Software. For most scanning applications, Adobe Photoshop is the de facto standard and is recommended. Photoshop is more expensive than some other imaging packages, but the quality of images produced by Photoshop has been shown to be greater than that of images produced by other software.

    Scanners also come with a driver application that operates the scanner and sends the image to Photoshop. The features included in these applications are dependent on the scanner manufacturer, but they usually include the ability to preview, descreening functionality for halftone (dot-based) images, and color correction.

    While Photoshop can handle batch processing and conversions (such as creating derivative images from master images), another piece of software such as Debabelizer may be desired for intensive processing.
     
  • Storage. The most economical form of offline storage is CD-R media (storage on CD-RW media is not recommended). The best CD-R media for long term storage is "Gold" media which should cost about $1 per disc. Stay away from very cheap CD-R media. Save your master images (and derivatives to be safe) onto CD-R and make a second copy for backup purposes. Store your media in jewel cases rather than envelopes, and label them for appropriate identification of their contents.

    CD-R media stored long term have an approximate lifespan of 5-10 years before data loss occurs. For ideal preservation of digital materials, plan to recopy all of your CD-R media after 5 years.

    While DVD-based storage is up and coming, as of 2004 standards are not developed enough to use this format for long-term archival storage.

    A more expensive, but preferable method for long term storage of master files is using hard drive arrays backed up to magnetic tape. Hard drive arrays offer the advantages of centralized storage, facilitating retrieval, backup, migration, and reformatting. This is the method currently used by the Digital Library of Georgia.

Step 4. Scanning

The process for scanning documents is:

  • Create a master image, saved as a TIFF (Tagged Image File Format) File
  • Generate derivative or access images from this master image
  • Save the master images onto CD-R discs or other long-term storage and mount the access images on a web server for public access

For manuscripts and printed materials use the following guidelines:

  • 300 dpi minimum
  • 24-bit RGB color
  • Sharpen lightly if needed using unsharp mask filter (be careful not to use too much!)
  • Save as TIFF

    * Note that the descreening functionality provided by your scanner's image capture software may need to be used for materials containing halftone screens made up of visible dots (as found in newspapers and magazines) or detailed engravings or other art.

Photographs:

  • 300 dpi minimum
  • B&W: 8-bit grayscale
  • Color: 24-bit RGB color
  • Cabinet Cards/Cartes de visite: 24-bit RGB color
  • Sharpen lightly if needed using unsharp mask filter (be careful not to use too much!)
  • Save as TIFF
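
The settings above determine how large each master file will be, which matters when planning storage. The following is a minimal sketch of that arithmetic in Python, assuming uncompressed images and ignoring TIFF header overhead; the function name and the example page and photograph sizes are hypothetical and chosen only for illustration.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size of an uncompressed scan, ignoring file header overhead."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical 8.5 x 11 inch manuscript page at 300 dpi, 24-bit RGB color.
page = uncompressed_size_bytes(8.5, 11, 300, 24)

# Hypothetical 8 x 10 inch black-and-white photograph at 300 dpi, 8-bit grayscale.
photo = uncompressed_size_bytes(8, 10, 300, 8)

print(f"Page:  {page / 1_000_000:.1f} MB")
print(f"Photo: {photo / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the file size, so scanning well above the 300 dpi minimum should be a deliberate choice.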

You may need to use a background made of white or black paper or cardboard to ensure that colors are correctly reproduced and that ink does not show through from the other side of multi-page documents.

Color. When installing your scanner and computer, be sure that the monitor and scanner are calibrated according to the manufacturer's instructions, so that colors are consistent from one image to another and match a standardized color bar. Most scanners come with a color target and instructions on calibrating the color settings. After calibrating, be sure not to change the settings of your monitor. It is a good idea to check the calibration of the scanner and monitor before each scanning session.

Scanners usually read and adjust for color automatically during the scanning process. Keep in mind that the final image on the screen should look like an exact duplicate of the original object. If scans come out too yellow, too dark, or too light, adjust the scan settings and try again.

Filenaming. If the photographs or manuscripts have an identifier or number associated with them, use this as the filename. For the sake of consistency, use only lowercase letters in filenames. Always use three character file extensions (.tif, .jpg, .gif). Generally the shorter the filename the better.

For example, if a photograph collection uses the identification number 55-JBC-2 for an individual photograph, the filename for that image can be 55-jbc-2.tif.

If you need to make up a file naming system, a good method is to use a lowercase letter or two followed by at least 3 digits (if you expect more than 999 images, use 4 digits). So for the fictional "Georgia Railroad Photograph Collection" we might use:
gr001.tif gr002.tif etc.

For multi-page documents like letters with fewer than 26 pages, adding a letter at the end is a good option:
br035a.tif br035b.tif br035c.tif

For books or long manuscripts, follow the file naming scheme shown above for photographs.

Always keep the same number of characters in the filename and do not vary the length. This makes automatic processing and manipulation of the images by programs much easier. Overall, be consistent! It will make things much more manageable.
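
The naming scheme described above can be generated by a short script rather than typed by hand, which helps keep names consistent. This is only a sketch in Python; the function names are hypothetical, and the prefixes "gr" and "br" are taken from the examples above.

```python
import string


def stem(prefix, number, digits=3):
    """Lowercase prefix plus a zero-padded number, e.g. gr001."""
    if not 1 <= number < 10 ** digits:
        raise ValueError(f"{number} does not fit in {digits} digits")
    return f"{prefix.lower()}{number:0{digits}d}"


def item_name(prefix, number, digits=3, ext="tif"):
    """Filename for a single-image item, e.g. gr001.tif."""
    return f"{stem(prefix, number, digits)}.{ext}"


def page_names(prefix, number, page_count, digits=3, ext="tif"):
    """Filenames for a multi-page item, e.g. br035a.tif, br035b.tif."""
    letters = string.ascii_lowercase
    if not 1 <= page_count <= len(letters):
        raise ValueError("letter suffixes cover at most 26 pages")
    base = stem(prefix, number, digits)
    return [f"{base}{letter}.{ext}" for letter in letters[:page_count]]


print(item_name("gr", 1))
print(page_names("br", 35, 3))
```

Raising an error when a number does not fit the chosen width keeps every filename in a collection the same length.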

Cropping. Cropping depends on the material being scanned. For most images, any background is cropped out leaving just the object. Photographs can be cropped to the edges of the photograph, and printed books can be cropped to the edge of the page. It is desirable for items such as diaries to crop outside the edge of the object, so that your image looks like a picture of a book rather than disembodied pages.

Image Derivatives:

Several common types of derivative images may be created from master images.

  • GIF (Graphics Interchange Format) This format is currently used only for creating thumbnails and 1-bit bitonal (black & white) images.
     
  • JPEG (Joint Photographic Experts Group) This format is widely used for creating medium and high resolution images for Web delivery.
     
  • PDF (Portable Document Format) From Adobe, this compressed format requires users to have the Adobe Acrobat Reader software installed on their machine (a common default on newer machines and browsers). It offers the benefit of re-sizing on screen and easy printing of documents. This format is most commonly used for printed documents.
     
  • DjVu From LizardTech, the DjVu format allows for higher compression ratios than PDF files while providing many of the same benefits of re-sizing, image rotation, and easy printing. It requires a free plug-in, available from LizardTech.
     
  • MrSID Also from LizardTech, the MrSID format is most commonly used for large oversized items like maps and posters. Server software may be used with this image format to deliver JPEG images to the user's browser and allow them to zoom into and resize images while maintaining high levels of quality. Some repositories are beginning to adopt the open JPEG 2000 format instead.

The most common practice for creating derivatives is to lower the resolution of the master TIFF image to 72 dpi and 150 dpi, letting the height and width of the image adjust to this resolution automatically in Photoshop (the image on the screen should be smaller than the master image). This creates roughly a "1x" and a "2x" magnification of the original.

For Digital Library of Georgia projects, it is standard practice to offer a 72dpi jpeg and sometimes a DjVu image for manuscript materials. The MrSID format is used for large format and photographic materials.

To create thumbnails, the width of the master image is reduced to 100-200 pixels before saving as GIF or JPEG format (JPEG recommended). The most common reduction is to 150 pixels wide.
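
To see what these reductions mean in pixels, the arithmetic can be sketched in Python. The example assumes a hypothetical 300 dpi master of 2400 x 3000 pixels and treats the thumbnail target as a width in pixels; the function names are illustrative. The actual resampling would still be done in Photoshop or other imaging software.

```python
def scaled_size(width_px, height_px, master_dpi, target_dpi):
    """Pixel dimensions after lowering the resolution of a master image."""
    return (
        round(width_px * target_dpi / master_dpi),
        round(height_px * target_dpi / master_dpi),
    )


def thumbnail_size(width_px, height_px, target_width=150):
    """Pixel dimensions of a fixed-width thumbnail that keeps the aspect ratio."""
    return target_width, round(height_px * target_width / width_px)


# Hypothetical master: an 8 x 10 inch photograph scanned at 300 dpi.
master = (2400, 3000)

print(scaled_size(*master, master_dpi=300, target_dpi=150))
print(scaled_size(*master, master_dpi=300, target_dpi=72))
print(thumbnail_size(*master))
```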

The JPEG format allows for various levels of compression. For most purposes, save JPEG derivatives at "High Quality" in Photoshop and examine them closely for artifacts of the compression process. These usually show up as "squiggles" around letters or sharp edges in the image. If these appear, decrease the level of compression (by increasing the quality value in Photoshop) when saving.

Step 5. Text Encoding

Documents, books, and other written and printed materials may also be transferred into text format as part of the digitization process. This can be done either through OCR (Optical Character Recognition) technology or by hand transcription, commonly called "rekeying."

The resulting text may then be presented on a web page or encoded as SGML or XML for inclusion in a searchable database. See the Introduction to SGML and XML for more information.

Because text transcription and encoding are very expensive and time consuming, individual projects will want to consider whether or not to make the investment. If the documents have high historical value and searchability is an important criterion for digitization, then transcription and encoding are the preferred method of achieving this goal.

TEI (Text Encoding Initiative) Lite is the most commonly used SGML/XML DTD (Document Type Definition) for encoding historical documents. More information on TEI may be found at The TEI Consortium Homepage. The Digital Library of Georgia recommends the use of TEI Text Encoding in Libraries Guidelines for Best Encoding Practices.

Step 6. Metadata

As a first step in collecting metadata, either copy the MARC record for the collection, obtain the finding aid, or collect the information that otherwise would be needed to create a collection-level MARC record.

Metadata for the Digital Library of Georgia is based on the Dublin Core metadata format. This format consists of a set of 15 repeatable elements identified as basic descriptive elements for electronic resources.

The following core elements are the minimum for Digital Library of Georgia projects:

Title
The title of the item or resource. (See suggested guidelines)
 
Creator
The creators of the item, in Library of Congress authority form.
 
Subject
Subjects covered by the item. Use of controlled vocabularies such as Library of Congress Subject Headings, Library of Congress Thesaurus for Graphic Materials, the Art and Architecture Thesaurus, etc., is recommended. Any personal name formation should follow RDA.
 
Description
A short description of the item. (See suggested guidelines)
 
Publisher
Enter the name of the repository responsible for publishing the item online.
 
Identifier
Filename, image number, PURL, or other URI (Uniform Resource Identifier) for the image.
 
Date
The date the image or transcription was created. Note the date of the creation of the material goes in "Coverage."
 
Coverage (Temporal)
The date of the creation of the material in the form YYYY-MM-DD. So for a letter written on November 20, 1864, enter the date as 1864-11-20. Use ISO 8601 to form all dates.
 
Coverage (Spatial)
The name of the Georgia county and/or city in which the item originated. If none is indicated, "Georgia" will be sufficient. Use Kenneth K. Krakow, Georgia Place Names: Their History and Origins 2nd ed. (Macon: Winship Press, 1994), and the Library of Congress Name Authority file where possible for a controlled list.
 
Format
The format of the image or digital file, expressed in the MIME format. For example: image/tiff, image/jpeg, text/html, etc.
 
Source
A reference to a resource from which the present resource is derived in whole or in part. For archival collections, this should be the name of the collection from which the item originates.
 
Contributor
The institution that holds the original object. Additional authors and contributors to the intellectual content of an object should be placed in the Creator field.
 
Type
The nature or genre of the content of the resource. At a minimum, you must use the Dublin Core Type Vocabulary. Optionally, you may add additional materials type using one of the following controlled vocabularies: Library of Congress Thesaurus for Graphic Materials II: Genre and Physical Characteristic Terms (TGM II) or the Art and Architecture Thesaurus.
 

The following additional Dublin Core elements are optional, but their use is encouraged:

Relation
A reference to a related resource.
 
Language
A language of the intellectual content of the resource. Values should be taken from the ISO 639-2 standard, and expressed as three letter codes (e.g., eng, fre).
 
Rights
Information about rights held in and over the resource, such as a statement of copyright or other property rights.
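
The element list above can be checked mechanically before records are loaded. The following Python sketch is one possible approach, not the Digital Library of Georgia's own tooling: the dictionary keys (such as "coverage_temporal") and all example values are hypothetical, and only the required elements and the YYYY-MM-DD date form come from this guide.

```python
from datetime import date

# Minimum elements listed in this guide; the key spellings are an assumption.
REQUIRED = [
    "title", "creator", "subject", "description", "publisher",
    "identifier", "date", "coverage_temporal", "coverage_spatial",
    "format", "source", "contributor", "type",
]


def missing_elements(record):
    """Return the required elements that are absent or empty in a record."""
    return [name for name in REQUIRED if not record.get(name)]


# Hypothetical record for a letter written on November 20, 1864.
record = {
    "title": "Example letter",
    "creator": "Example, Author",
    "subject": "Example subject heading",
    "description": "A short description of the example letter.",
    "publisher": "Example Repository",
    "identifier": "br035a.tif",
    "date": date(2004, 9, 1).isoformat(),              # date the image was created
    "coverage_temporal": date(1864, 11, 20).isoformat(),  # date of the original
    "coverage_spatial": "Georgia",
    "format": "image/tiff",
    "source": "Example Family Papers",
    "contributor": "Example Repository",
    "type": "Text",
}

print(missing_elements(record))
print(record["coverage_temporal"])
```

Using `date(...).isoformat()` rather than typing dates by hand guarantees the YYYY-MM-DD form and rejects impossible dates.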

Descriptions of all the Dublin Core Metadata Elements may be found in the Official Dublin Core Guidelines.

General Input Guidelines

  • Format. See the Recommended Guidelines for Titles and Descriptions. Also refer to the rules for archival description outlined in Archives, Personal Papers, and Manuscripts by Steven L. Hensen for the proper format of titles, authors, descriptions, and other information.
     
  • Punctuation. Avoid extraneous punctuation or ending punctuation unless it is part of the content of the resource.
     
  • Abbreviations. Avoid using abbreviations unless you are transcribing an element from the piece.
     
  • Capitalization. In general, capitalize the first word (of a title, for example) and proper names (place, personal, and corporate names) only. When using a thesaurus, follow the capitalization format used in its vocabulary. Capitalize content in the description field according to normal rules of writing. Do not enter content in all caps except for acronyms.
     
  • Initial Articles. Leave off all initial articles such as: the, a, an, le, la, los, el, der, die, das, etc. in the title field.
     
  • Keywords. Keywords or words that can identify the resource with precision and are relevant to the resource should be used to aid in resource retrieval and discovery. Subject terms taken from a controlled vocabulary are also recommended.
     
  • Purpose. Keep in mind that the purpose of metadata is to allow for effective resource discovery by a variety of audiences. Continually ask the question, "Will this help someone find this resource?" as you assign metadata terms. Be sure to bring out any aspects related to Georgia as much as possible.
     
  • Recording Metadata. Depending on the final presentation, metadata can be entered and collected in a variety of ways. The simplest of these is to create a spreadsheet or database in a desktop software application such as Microsoft Excel or Access. More advanced methods include full MARC records or XML implementations.
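
For the spreadsheet approach mentioned above, metadata can also be written as a CSV file, which Excel, Access, and most databases can open. This is a minimal Python sketch using the standard csv module; the column selection and the sample row are hypothetical.

```python
import csv
import io

# Hypothetical subset of columns; a real project would include every element it records.
FIELDS = ["identifier", "title", "description", "date", "format"]


def records_to_csv(records, fields=FIELDS):
    """Serialize metadata records to CSV text; missing values become empty cells."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


rows = [
    {
        "identifier": "gr001.tif",
        "title": "Depot, exterior view",  # the comma is quoted automatically
        "date": "2004-09-01",
        "format": "image/tiff",
    },
]

print(records_to_csv(rows))
```

To save the result, write the returned text to a file opened with `newline=""` so that line endings are not altered.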

What if I can't find all this data? Try to capture as much data as you can for every item. If working with a photographic collection (where not having information about each item may be common), try at the very minimum for a caption or description of each photograph if that is all that is available without extensive research. Also be sure to capture all of the information about the collection that would be needed for an archival MARC record, if one does not already exist.

Step 7. Quality Control

For large projects effective quality control is essential. Recommended practices include keeping a checklist or log of images scanned and records created, checking each image and record produced for quality, and thorough checking of filenames and links. Building quality control into the digitization process is very important, since even the most well trained and most conscientious person doing digitization will be prone to errors. Often having "another set of eyes" to look things over alleviates many problems created due to human error. Automated methods of quality control, usually custom programs or scripts, are also encouraged if possible.
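
As one illustration of an automated check, the Python sketch below compares a scanning log against the files actually produced and flags names that break the naming convention. The pattern assumes the letter-plus-digits scheme from Step 4, and the function name and sample data are hypothetical; a real project would adapt both to its own conventions.

```python
import re

# Assumed convention: 1-2 lowercase letters, 3-4 digits, optional page letter.
NAME_PATTERN = re.compile(r"[a-z]{1,2}\d{3,4}[a-z]?\.(tif|jpg|gif)")


def check_batch(logged, found):
    """Compare the scanning log with the files found and report discrepancies."""
    logged, found = set(logged), set(found)
    return {
        "missing": sorted(logged - found),    # in the log, but no file
        "unlogged": sorted(found - logged),   # file exists, but not in the log
        "badly_named": sorted(n for n in found if not NAME_PATTERN.fullmatch(n)),
    }


# In practice, `found` could come from os.listdir() on the scan directory.
log = ["gr001.tif", "gr002.tif", "gr003.tif"]
files = ["gr001.tif", "gr002.tif", "GR004.TIF"]

for problem, names in check_batch(log, files).items():
    print(problem, names)
```

A script like this supplements, but does not replace, a second person looking at the images themselves.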

Step 8. Delivery Methods

Delivery methods for online projects span a wide range. Basic HTML Web pages are the simplest way to put digitized materials online. More advanced methods include Web-accessible databases, and XML or SGML solutions. The setup chosen will depend on the technical expertise available and the scope and goals of the individual project. The Digital Library of Georgia provides support for delivery of digital collections under GALILEO, and contribution of digitized materials for inclusion in the DLG is recommended.
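
For the simplest case, a basic HTML page, the markup can be generated from a list of images and captions. The Python sketch below shows the idea only: the function name, file names, and captions are hypothetical, and a production site would add navigation, metadata display, and styling.

```python
from html import escape


def gallery_page(title, items):
    """Build a basic HTML page; items are (thumbnail, access_image, caption) tuples."""
    entries = []
    for thumb, access, caption in items:
        entries.append(
            f'<p><a href="{escape(access, quote=True)}">'
            f'<img src="{escape(thumb, quote=True)}" alt="{escape(caption, quote=True)}">'
            f"</a><br>{escape(caption)}</p>"
        )
    body = "\n".join(entries)
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head><title>{escape(title)}</title></head>\n"
        f"<body>\n<h1>{escape(title)}</h1>\n{body}\n</body>\n</html>\n"
    )


# Hypothetical items from the fictional Georgia Railroad Photograph Collection.
items = [
    ("gr001t.jpg", "gr001.jpg", "Depot & platform"),
    ("gr002t.jpg", "gr002.jpg", "Locomotive"),
]

print(gallery_page("Georgia Railroad Photograph Collection", items))
```

Escaping captions and filenames keeps characters such as "&" and quotation marks from breaking the page.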

Selected Resources
