Digital Library of Georgia Digitization Guide
Version 2.0 Sept. 2004
This guide is designed to give an overview of the digitization process for historical documents such as manuscripts, photographs, books, printed materials, and other flat paper items. It is intended to cover the basics of digitization projects in a concise manner. Links to more in-depth information are found throughout the guide, and a list of resources appears at the end.
Table of Contents
- Intro: What is Digitization?
- Step 1: Questions to Ask
- Step 2: Selection
- Step 3: Equipment for Digitization
- Step 4: Scanning
- Step 5: Text Encoding
- Step 6: Metadata
- Step 7: Quality Control
- Step 8: Delivery Methods
- Selected Resources
What is Digitization?
Scanning is easy. Building a website is relatively easy. However, once you start to consider a digitization project of hundreds or thousands of items, you quickly discover that you are playing an entirely different ballgame.
"Digitization" refers to all of the steps involved in the process of making collections of historical materials available online. This includes all of the following steps and more depending on what you're digitizing:
- Ask a lot of questions first
- Get financial resources, staff, equipment, build technical knowledge
- Project management
- Selection of collections/materials
- Scan or digitize materials
- Transcribe, markup, index
- Create metadata
- Quality control
- Process images
- Mount it all on the web
- More quality control
- Preserve and maintain archival media and online collections
As you can see, there are many things to consider, and the digitization process is not as simple as merely "scanning." This guide is designed to help with the process and point to other resources on digitizing historical materials.
Step 1. Questions to Ask.
Before you move to the technical details of your project or this document, first consider the following questions:
- Why do you want to digitize materials?
- Who is your audience?
- Who owns the materials?
- Who is the copyright holder for the intellectual property contained in the materials?
- What is your timeframe for the project?
- How is the project being funded?
- Who will be responsible for different stages of the project?
- How will you digitize the materials?
- How will you describe the materials and what metadata scheme will be used?
- How will you provide access to the collection?
- How will you preserve and maintain the collection and digitized materials?
For help in identifying your audience, see the audiences identified for the Digital Library of Georgia. Criteria for selecting materials for the DLG are guided by its Collection Development Policy.
Allow an appropriate amount of time for your project. The digitization process requires considerable effort, and it is at least as much of a management issue as a technical one. Time for administration, reports, personnel management, file manipulation and organization, backups, quality control, and building an appropriate and usable user interface MUST be factored into any time estimate. Be realistic in projecting how much time the work will take - it always will be more than you think.
The Kentuckiana Digital Library, part of the Kentucky Virtual Library, provides an excellent Project Planning Guide for thinking about these questions and planning projects.
Step 2. Selection.
At the beginning of the selection process, consider the Collection Policy Guidelines for the Digital Library of Georgia. These guidelines raise topics and questions that should be considered in any digitization project.
Copyright is an extremely important aspect of selection. In order for items to be digitized and distributed to the public on the Internet, you should be sure that they are in the public domain, that you are making a fair use of the materials under U.S. copyright law, or that you have secured written permission of the copyright holder. Refer to the chart of When Works Pass Into the Public Domain by Laura N. Gasaway for more information.
Step 3. Equipment for Digitization.
- Computers. Computer hardware and software are
constantly changing. Fortunately, currently produced computers are fast
enough to handle most scanning and other digitization tasks. Select PCs
with the upper range of RAM, disk storage, and processor speed
available at the time of purchase. Also important are CD-R and CD-ROM
drives (for creating and copying CDs) and a monitor that is 17 inches
- Scanners. Flatbed scanners vary widely by model
and manufacturer. Like computers, many basic scanners now have enough
quality to do adequate scanning. Generally, look for a model from a
major manufacturer and consider reviews published by computer magazines
and websites before purchase. A large bed scanner (13 x 17 is a normal
size) is desirable for digitizing large special collections materials.
Look for a scanner with a minimum of 600 dpi optical resolution as opposed to interpolated resolution.
- Note about Digital Cameras. Most consumer-level
digital cameras are inadequate for reproduction of special collections
materials. Even the best of them can only do very small items to the
same degree of resolution and quality as a flatbed scanner. Digital
camera set-ups that are adequate for reproducing archival materials,
such as those made by PhaseOne or BetterLight, are very expensive and
require substantial expertise to use.
- Software. For most scanning applications, Adobe
Photoshop is the de facto standard and is recommended. Photoshop is
more expensive than some other imaging packages, but the quality of
images produced by Photoshop has been shown to be greater than those
produced by other software.
Scanners also come with a driver application that operates the scanner and sends the image to Photoshop. The features included in these applications are dependent on the scanner manufacturer, but they usually include the ability to preview, descreening functionality for halftone (dot-based) images, and color correction.
While Photoshop can handle batch processing and conversions (such as creating derivative images from master images), another piece of software such as Debabelizer may be desired for intensive processing.
- Storage. The most economical form of offline
storage is CD-R media (storage on CD-RW media is not recommended). The
best CD-R media for long term storage is "Gold" media which should cost
about $1 per disc. Stay away from very cheap CD-R media. Save your
master images (and derivatives to be safe) onto CD-R and make a second
copy for backup purposes. Store your media in jewel cases rather than
envelopes, and label them for appropriate identification of their
When CD-R media are stored long term, the approximate lifespan before data loss occurs is 5-10 years. For ideal preservation of digital materials, plan to recopy all of your CD-R media after 5 years.
While DVD-based storage is up and coming, as of 2004 standards are not developed enough to use this format for long-term archival storage.
A more expensive, but preferable method for long term storage of master files is using hard drive arrays backed up to magnetic tape. Hard drive arrays offer the advantages of centralized storage, facilitating retrieval, backup, migration, and reformatting. This is the method currently used by the Digital Library of Georgia.
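The following is a minimal planning sketch for the CD-R approach described above: it estimates how many discs a project needs when every master image is written twice (one working copy and one backup). The 700 MB disc capacity, the function name, and the sample figures are assumptions for illustration, not values taken from this guide.

```python
import math

# Assumed nominal capacity of a single CD-R; check the media you actually buy.
CD_R_CAPACITY_BYTES = 700 * 1000 * 1000


def discs_needed(image_count, bytes_per_image, copies=2,
                 capacity_bytes=CD_R_CAPACITY_BYTES):
    """Return how many discs are needed to hold every image `copies` times.

    Files are never split across discs, so each disc holds a whole
    number of images.
    """
    images_per_disc = capacity_bytes // bytes_per_image
    if images_per_disc == 0:
        raise ValueError("a single image does not fit on one disc")
    discs_per_copy = math.ceil(image_count / images_per_disc)
    return discs_per_copy * copies


if __name__ == "__main__":
    # Hypothetical project: 1,000 master images of about 25 MB each,
    # stored on two sets of discs.
    print(discs_needed(1000, 25_000_000))
```

Under these assumptions 28 images fit on a disc, so 1,000 images need 36 discs per copy and 72 discs in total. Estimates like this can help when comparing CD-R storage with a hard drive array.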
Step 4. Scanning
The process for scanning documents is:
- Create a master image, saved as a TIFF (Tagged Image File Format) File
- Generate derivative or access images from this master image
- Save the master images onto CD-R disks or other long term storage and mount the access images on a web server for public access
For manuscripts and printed materials use the following guidelines:
- 300 dpi minimum
- 24-bit RGB color
- Sharpen lightly if needed using unsharp mask filter (be careful not to use too much!)
- Save as TIFF
* Note that the descreening functionality provided by your scanner's image capture software may need to be used for materials containing halftone screens made up of visible dots (as found in newspapers and magazines) or detailed engravings or other art.
For photographs use the following guidelines:
- 300 dpi minimum
- B&W: 8-bit grayscale
- Color: 24-bit RGB color
- Cabinet Cards/Cartes de visite: 24-bit RGB color
- Sharpen lightly if needed using unsharp mask filter (be careful not to use too much!)
- Save as TIFF
You may need to use a background made of white or black paper or cardboard to ensure that colors are correctly reproduced and that ink does not show through from the other side of multi-page documents.
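The resolution and bit depth settings above determine how large each master file will be. The sketch below works through that arithmetic for an uncompressed image; the function name and the sample page sizes are illustrative assumptions, and a real TIFF will be slightly larger because of its header and tags.

```python
def master_image_size(width_in, height_in, dpi=300, bits_per_pixel=24):
    """Return (width_px, height_px, uncompressed_bytes) for a scan.

    width_in and height_in are the physical dimensions of the item in
    inches; bits_per_pixel is 24 for RGB color and 8 for grayscale.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return width_px, height_px, total_bits // 8


if __name__ == "__main__":
    # A letter-size manuscript page scanned in 24-bit color at 300 dpi.
    print(master_image_size(8.5, 11))
    # An 8 x 10 inch black-and-white photograph in 8-bit grayscale.
    print(master_image_size(8, 10, bits_per_pixel=8))
```

A letter-size color page comes to 2550 x 3300 pixels and roughly 25 MB, while the grayscale photograph is about 7 MB, which is why storage should be planned before scanning begins.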
Color. When installing your scanner and computer, be sure that the monitor and scanner are calibrated according to the manufacturer's instructions, so that colors are consistent from one image to another and match a standardized color bar. Most scanners come with a color target and instructions on calibrating the color settings. After calibrating, be sure not to change the settings of your monitor. It is a good idea to check the calibration of the scanner and monitor before each scanning session.
Scanners usually automatically read and adjust for color during the scanning process. Keep in mind that the final image should look like an exact duplicate of the original object on the screen. If scans come out too yellow, too dark, or too light, adjust the scan settings and try again.
Filenaming. If the photographs or manuscripts have an identifier or number associated with them, use this as the filename. For the sake of consistency, use only lowercase letters in filenames. Always use three character file extensions (.tif, .jpg, .gif). Generally the shorter the filename the better.
For example, if a photograph collection uses the identification number 55-JBC-2 for an individual photograph, the filename for that image can be 55-jbc-2.tif.
If you need to make up a file naming system, a good method is to use
a lowercase letter or two followed by at least 3 numbers (if you expect
more than 999 images, use 4 numbers). So for the fictional "Georgia
Railroad Photograph Collection" we might use:
gr001.tif gr002.tif etc.
For multi-page documents like letters with fewer than 26 pages,
adding a letter at the end is a good option:
br035a.tif br035b.tif br035c.tif
For books or long manuscripts, follow the file naming scheme shown above for photographs.
Always keep the same number of characters in the filename and do not vary the length. This makes automatic processing and manipulation of the images by programs much easier. Overall, be consistent! It will make things much more manageable.
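The naming scheme above can be generated by a small helper rather than typed by hand, which helps keep filenames consistent. This is a minimal sketch; the function name and its parameters are hypothetical and only illustrate the pattern of a short lowercase prefix, a zero-padded number, and an optional page letter.

```python
import string


def image_filename(prefix, number, digits=3, page=None, extension="tif"):
    """Build a fixed-length lowercase filename such as gr001.tif.

    `page` is a 1-based page number for multi-page items. It becomes a
    trailing letter (1 -> a, 2 -> b, ...), so at most 26 pages fit.
    """
    if number < 0 or number >= 10 ** digits:
        raise ValueError("number does not fit in the chosen digit count")
    suffix = ""
    if page is not None:
        if not 1 <= page <= 26:
            raise ValueError("page letters only cover pages 1 through 26")
        suffix = string.ascii_lowercase[page - 1]
    return f"{prefix.lower()}{number:0{digits}d}{suffix}.{extension.lower()}"


if __name__ == "__main__":
    print(image_filename("gr", 1))           # gr001.tif
    print(image_filename("br", 35, page=3))  # br035c.tif
```

Raising an error when a number outgrows the digit count is a deliberate choice here: it forces the decision about three or four digits to be made once, at the start of the project.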
Cropping. Cropping depends on the material being scanned. For most images, any background is cropped out leaving just the object. Photographs can be cropped to the edges of the photograph, and printed books can be cropped to the edge of the page. It is desirable for items such as diaries to crop outside the edge of the object, so that your image looks like a picture of a book rather than disembodied pages.
Several common types of derivative images may be created from master images.
- GIF (Graphics Interchange Format) This format is
currently used only for creating thumbnails and 1-bit bitonal (black
& white) images.
- JPEG (Joint Photographic Experts Group) This
format is widely used for creating medium and high resolution images
for Web delivery.
- PDF (Portable Document Format) From Adobe, this
compressed format requires users to have the Adobe Acrobat Reader
software installed on their machine (a common default on newer machines
and browsers). It offers the benefit of re-sizing on screen and easy
printing of documents. This format is most commonly used for printed
- DjVu From LizardTech, the DjVu format allows for
higher compression ratios than PDF files while providing many of the
same benefits of re-sizing, image rotation, and easy printing. It
requires a free plug-in, available from LizardTech.
- MrSID Also from LizardTech, the MrSID format is most commonly used for large oversized items like maps and posters. Server software may be used with this image format to deliver JPEG images to the user's browser and allow users to zoom into and resize images while maintaining high levels of quality. Some repositories are beginning to adopt the open JPEG2000 format instead.
The most common practice for creating derivatives is to lower the resolution of the master TIFF image to 72 dpi and 150 dpi, letting the height and width of the image adjust to this resolution automatically in Photoshop (the image size on the screen should be reduced and not the same as the master image). This creates roughly a "1x" and a "2x" magnification of the original.
For Digital Library of Georgia projects, it is standard practice to offer a 72dpi jpeg and sometimes a DjVu image for manuscript materials. The MrSID format is used for large format and photographic materials.
To create thumbnails, the width of the master image is reduced to 100-200 pixels before saving as GIF or JPEG format (JPEG recommended). The most common reduction is 150 pixels wide.
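To see what these reductions mean in pixels, the sketch below computes the dimensions of a derivative resampled to a lower resolution and of a thumbnail with a fixed width. It only does the arithmetic and does not process image files; the function names and the sample master size are assumptions for illustration.

```python
def derivative_size(master_width_px, master_height_px, master_dpi, target_dpi):
    """Pixel dimensions after resampling to target_dpi at the same physical size."""
    width = round(master_width_px * target_dpi / master_dpi)
    height = round(master_height_px * target_dpi / master_dpi)
    return width, height


def thumbnail_size(master_width_px, master_height_px, thumb_width_px=150):
    """Pixel dimensions of a fixed-width thumbnail that keeps the aspect ratio."""
    height = round(master_height_px * thumb_width_px / master_width_px)
    return thumb_width_px, height


if __name__ == "__main__":
    # Hypothetical master: a letter-size page scanned at 300 dpi.
    master = (2550, 3300)
    print(derivative_size(*master, master_dpi=300, target_dpi=72))
    print(derivative_size(*master, master_dpi=300, target_dpi=150))
    print(thumbnail_size(*master))
```

For this master the 72 dpi derivative is 612 x 792 pixels, the 150 dpi derivative is 1275 x 1650 pixels, and a 150-pixel-wide thumbnail is 194 pixels tall.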
The JPEG format allows for various levels of compression. For most purposes, save JPEG derivatives at "High Quality" in Photoshop and examine them closely for artifacts of the compression process. These usually show up as "squiggles" around letters or sharp edges in the image. If these appear, decrease the level of compression (by increasing the quality value in Photoshop) when saving.
Step 5. Text Encoding
Documents, books, and other written and printed materials may also be transferred into text format as part of the digitization process. This can be done either through OCR (Optical Character Recognition) technology or by hand transcription, commonly called "rekeying."
The resulting text may then be presented on a web page or encoded as SGML or XML for inclusion in a searchable database. See the Introduction to SGML and XML for more information.
Because text transcription and encoding are very expensive and time consuming, individual projects will want to consider whether or not to make the investment. If the documents have high historical value and searchability is an important criterion of digitization, then transcription and encoding are the preferred method of achieving this goal.
TEI (Text Encoding Initiative) Lite is the most commonly used SGML/XML DTD (Document Type Definition) for encoding historical documents. More information on TEI may be found at The TEI Consortium Homepage. The Digital Library of Georgia recommends the use of TEI Text Encoding in Libraries Guidelines for Best Encoding Practices.
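As a rough illustration of what an encoded transcription looks like, the sketch below builds a bare TEI skeleton: a header with a title, publication statement, and source description, followed by the transcribed paragraphs. It assumes the XML form of TEI with the P5-style root element and namespace; older TEI Lite files use a different root element, and real projects should follow the best-practice guidelines named above. The function name and sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # namespace used by TEI P5 (assumed version)


def tei_skeleton(title, publication_note, source_note, paragraphs):
    """Return a minimal TEI document as an XML string."""
    ET.register_namespace("", TEI_NS)

    def q(tag):
        # Qualify a tag name with the TEI namespace.
        return f"{{{TEI_NS}}}{tag}"

    root = ET.Element(q("TEI"))
    header = ET.SubElement(root, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = publication_note
    source = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source, q("p")).text = source_note

    text = ET.SubElement(root, q("text"))
    body = ET.SubElement(text, q("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, q("p")).text = paragraph
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(tei_skeleton(
        "Letter, 1864 November 20",
        "Published online by a hypothetical repository",
        "Transcribed from the original manuscript",
        ["Dear Sir,", "I write to you from camp."],
    ))
```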
Step 6. Metadata
As a first step in collecting metadata, either copy the MARC record for the collection, obtain the finding aid, or collect the information that otherwise would be needed to create a collection-level MARC record.
Metadata for the Digital Library of Georgia is based on the Dublin Core metadata format. This format consists of a set of 15 repeatable elements identified as basic descriptive elements for electronic resources.
The following core elements are the minimum for Digital Library of Georgia projects:
- Title
- The title of the item or resource. (See suggested guidelines)
- Creator
- The creators of the item, in Library of Congress authority form.
- Subject
- Subjects covered by the item. Use of controlled vocabularies such as Library of Congress Subject Headings, Library of Congress Thesaurus for Graphic Materials, the Art and Architecture Thesaurus, etc., is recommended. Any personal name formation should follow AACR2.
- Description
- A short description of the item. (See
- Publisher
- Enter the name of the repository responsible for publishing the item online.
- Identifier
- Filename, image number, PURL, or other URI (Uniform Resource Identifier) for the image.
- Date
- The date the image or transcription was created. Note that the date of the creation of the material goes in "Coverage."
- Coverage (Temporal)
- The date of the creation of the material in the form YYYY-MM-DD. So for a letter written on November 20, 1864, enter the date as 1864-11-20. Use ISO 8601 to form all dates.
- Coverage (Spatial)
- The name of the Georgia county and/or city in which the item originated. If none is indicated, "Georgia" will be sufficient. Use Kenneth K. Krakow, Georgia Place Names: Their History and Origins 2nd ed. (Macon: Winship Press, 1994), and the Library of Congress Name Authority file where possible for a controlled list.
- Format
- The format of the image or digital file, expressed in the
MIME format. For example: image/tiff, image/jpeg, text/html,
- Source
- A reference to a resource from which the present resource is derived in whole or in part. For archival collections, this should be the name of the collection from which the item originates.
- Contributor
- The institution that holds the original object. Additional authors and contributors to the intellectual content of an object should be placed in the Creator field.
- Type
- The nature or genre of the content of the resource. At a minimum, you must use the Dublin Core Type Vocabulary. Optionally, you may add additional material types using one of the following controlled vocabularies: Library of Congress Thesaurus for Graphic Materials II: Genre and Physical Characteristic Terms (TGM II) or the Art and Architecture Thesaurus.
The following additional Dublin Core elements are optional, but their use is encouraged:
- Relation
- A reference to a related resource.
- Language
- A language of the intellectual content of the resource. Values should be taken from the ISO 639-2 standard, and expressed as three letter codes (e.g., eng, fre).
- Rights
- Information about rights held in and over the resource, such as a statement of copyright or other property rights.
Descriptions of all the Dublin Core Metadata Elements may be found in the Official Dublin Core Guidelines.
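A record built from these elements might be sketched as follows. The example serializes element and value pairs as simple XML in the Dublin Core namespace and forms the date in ISO 8601 style. The wrapper element name, the function name, and every sample value are hypothetical; this is one possible representation, not the format used by the Digital Library of Georgia.

```python
import xml.etree.ElementTree as ET
from datetime import date

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


def dublin_core_record(fields):
    """Serialize (element, value) pairs as a flat XML record.

    `fields` is a list of pairs rather than a dict because Dublin Core
    elements are repeatable (for example, two Coverage values).
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # hypothetical wrapper element
    for element, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(record, encoding="unicode")


# Hypothetical item: one page of a letter written on November 20, 1864.
example_fields = [
    ("title", "Letter, 1864 November 20"),
    ("identifier", "br035a.tif"),
    ("coverage", date(1864, 11, 20).isoformat()),  # temporal, YYYY-MM-DD
    ("coverage", "Georgia"),                       # spatial
    ("format", "image/tiff"),
    ("type", "Text"),
    ("language", "eng"),
]

if __name__ == "__main__":
    print(dublin_core_record(example_fields))
```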
General Input Guidelines
- Format. See the
Recommended Guidelines for Titles and
Also refer to the rules for archival description outlined in Archives,
Personal Papers, and Manuscripts by Steven L. Hensen for the proper
format of titles, authors, descriptions, and other information.
- Punctuation. Avoid extraneous punctuation or
ending punctuation unless it is part of the content of the resource.
- Abbreviations. Avoid using abbreviations unless you are transcribing an element from the piece.
- Capitalization. In general, capitalize the first
word (of a title, for example) and proper names (place, personal, and
corporate names) only. When using a thesaurus, follow the
capitalization format used in its vocabulary. Capitalize content in the
description field according to normal rules of writing. Do not enter
content in all caps except for acronyms.
- Initial Articles. Leave off all initial articles
such as: the, a, an, le, la, los, el, der, die, das, etc. in the title
- Keywords. Keywords or words that can identify the
resource with precision and are relevant to the resource should be used
to aid in resource retrieval and discovery. Subject terms taken from a
controlled vocabulary are also recommended.
- Purpose. Keep in mind that the purpose of metadata
is to allow for effective resource discovery by a variety of audiences.
Continually ask the question, "Will this help someone find this
resource?" as you assign metadata terms. Be sure to bring out any
aspects related to Georgia as much as possible.
- Recording Metadata. Depending on the final presentation, metadata can be entered and collected in a variety of ways. The simplest of these is to create a spreadsheet or database in a desktop software application such as Microsoft Excel or Access. More advanced methods include full MARC records or XML implementations.
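Some of the input guidelines above can be applied mechanically before records are saved to a spreadsheet. The sketch below drops an initial article and trailing punctuation from a title, then writes records as CSV text that a spreadsheet program can open. The column names, function names, and sample record are assumptions; note that stripping trailing punctuation automatically would be wrong for titles where the punctuation is part of the content, so results should still be reviewed by a person.

```python
import csv
import io

# Initial articles listed in the guidelines above.
INITIAL_ARTICLES = {"the", "a", "an", "le", "la", "los", "el", "der", "die", "das"}


def clean_title(title):
    """Drop a leading article and trailing punctuation, then capitalize."""
    words = title.strip().split()
    if len(words) > 1 and words[0].lower() in INITIAL_ARTICLES:
        words = words[1:]
    cleaned = " ".join(words).rstrip(".,;:")
    return cleaned[:1].upper() + cleaned[1:]


def records_to_csv(records, columns):
    """Return the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    records = [{
        "identifier": "gr001.tif",
        "title": clean_title("The railroad depot at Macon."),
        "description": "Hypothetical caption",
    }]
    print(records_to_csv(records, ["identifier", "title", "description"]))
```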
What if I can't find all this data? Try to capture as much data as you can for every item. If working with a photographic collection (where not having information about each item may be common), try at the very minimum for a caption or description of each photograph if that is all that is available without extensive research. Also be sure to capture all of the information about the collection that would be needed for an archival MARC record, if one does not already exist.
Step 7. Quality Control
For large projects effective quality control is essential. Recommended practices include keeping a checklist or log of images scanned and records created, checking each image and record produced for quality, and thorough checking of filenames and links. Building quality control into the digitization process is very important, since even the most well trained and most conscientious person doing digitization will be prone to errors. Often having "another set of eyes" to look things over alleviates many problems created due to human error. Automated methods of quality control, usually custom programs or scripts, are also encouraged if possible.
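An automated check of the kind mentioned above might look like the following sketch. It takes a list of filenames and the set of identifiers that have metadata records, then reports names that break the naming scheme, files with no record, and batches whose filenames vary in length. The naming pattern and function name are assumptions and would need to be adapted to each project's own scheme.

```python
import re

# Assumed scheme: one or two lowercase letters, three or four digits,
# an optional page letter, and a three-character extension.
FILENAME_PATTERN = re.compile(r"^[a-z]{1,2}\d{3,4}[a-z]?\.(tif|jpg|gif)$")


def check_filenames(filenames, described_ids):
    """Return a list of human-readable problems found in a batch of files."""
    problems = []
    for name in filenames:
        if not FILENAME_PATTERN.match(name):
            problems.append(f"{name}: does not follow the naming scheme")
        elif name.rsplit(".", 1)[0] not in described_ids:
            problems.append(f"{name}: no metadata record")
    if len({len(name) for name in filenames}) > 1:
        problems.append("filenames vary in length")
    return problems


if __name__ == "__main__":
    batch = ["gr001.tif", "gr002.tif", "GR3.tif"]
    for problem in check_filenames(batch, {"gr001"}):
        print(problem)
```

A script like this complements, and does not replace, a second person looking over the images and records.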
Step 8. Delivery Methods
Delivery methods for online projects span a wide range. Basic HTML Web pages are the simplest way to put digitized materials online. More advanced methods include Web-accessible databases, and XML or SGML solutions. The setup chosen will depend on the technical expertise available and the scope and goals of the individual project. The Digital Library of Georgia provides support for delivery of digital collections under GALILEO, and contribution of digitized materials for inclusion in the DLG is recommended.
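For the simplest delivery method, a basic HTML page, a short script can generate the page from a list of images so that links and captions stay consistent. This is a minimal sketch with hypothetical file paths, function name, and page layout; it is not how the Digital Library of Georgia delivers its collections.

```python
from html import escape


def gallery_page(collection_title, items):
    """Return a basic HTML page linking each thumbnail to its access image.

    `items` is a list of (thumbnail_file, access_file, caption) tuples.
    """
    rows = []
    for thumb, access, caption in items:
        rows.append(
            f'<p><a href="{escape(access, quote=True)}">'
            f'<img src="{escape(thumb, quote=True)}" '
            f'alt="{escape(caption, quote=True)}"></a><br>'
            f"{escape(caption)}</p>"
        )
    body = "\n".join(rows)
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"<title>{escape(collection_title)}</title>\n"
        "</head>\n<body>\n"
        f"<h1>{escape(collection_title)}</h1>\n{body}\n"
        "</body>\n</html>\n"
    )


if __name__ == "__main__":
    print(gallery_page(
        "Georgia Railroad Photograph Collection",
        [("thumbs/gr001.jpg", "access/gr001.jpg", "Depot & platform")],
    ))
```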
Selected Resources
- Describing Archives: A Content Standard, Society of
American Archivists, 2004.
- Colorado Digitization Program Digitization Resources
- Digital Library Production Guide, Kentuckiana Digital Library
- Dublin Core Metadata Initiative
- Guides to Quality in Visual Resource Imaging, Research Libraries
- Handbook for Digital Projects: A Management Tool for
Preservation and Access, Maxine K. Sitts, Ed., Northeast Document
Conservation Center, 2000.
- The NINCH Guide to Good Practice in the Digital Representation and
Management of Cultural Heritage Material
- North Carolina ECHO (Exploring Cultural Heritage Online) Digitization Guidelines http://www.library.cornell.edu/preservation/tutorial/