Will you be able to access your important digital documents twenty years from now?
Data from NASA’s Viking missions to Mars in the 1970s was nearly lost to history. It was stored on magnetic tape that began to dry out and crack. Realizing the problem, NASA completed the painstaking task of transferring this data onto CDs in the ‘90s. Unfortunately, the software used to view the images was created especially for the mission and is no longer supported, so the carefully restored data and imagery on the CDs could not be readily accessed. Recovering just 3,000 of more than 56,000 images took two years.
As NASA’s extreme example illustrates, digital information is remarkably fragile and is susceptible to software and hardware obsolescence, file corruption, or storage media degradation.
Not everything needs to be—or should be—saved in perpetuity. A formal digital archive should be distinguished from files that are backed up on a server or an external hard drive. In this context, a digital archive contains only the files that specifically require ongoing preservation and access.
“Memory institutions” like archival repositories, libraries, and government agencies have been struggling with digital preservation issues for many years. As a result, a number of standards, tools, and procedures are being developed, and archivists at History Associates have been involved in some of these activities.
We recently conducted a pilot program with the Robert C. Byrd Center for Legislative Studies (Byrd CLS) at Shepherd University to develop an organized approach to assessing digital files and to make recommendations for preserving and organizing the material. Like many congressional papers repositories, the Byrd CLS received terabytes of digital material along with paper records. Much of this data resided on CDs and on hard drives from office computers, and it had not been organized or processed further.
In the pilot program, we used a number of available tools to assess a 255 GB sample of electronic records. We identified file formats, extracted metadata, and assigned a “fixity” value to each record. A fixity value is calculated from the file’s sequence of binary code and can be used to validate the integrity of a file over time. We provided the Byrd CLS with an Archival Information Package (AIP), which contained both the content files and the metadata generated throughout the assessment, along with our recommendations for processing the material.
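To make the idea of a fixity value concrete, here is a minimal sketch in Python. It assumes SHA-256 as the checksum algorithm, which is our choice for illustration and not necessarily what was used in the pilot. The function name and the sample bytes are invented for the example.

```python
import hashlib


def fixity_of_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte sequence (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


# Made-up sample content; the two versions differ by a single character.
original = b"Memo to staff, 12 March 1998"
altered = b"Memo to staff, 13 March 1998"

# Identical bytes always produce the same value.
print(fixity_of_bytes(original) == fixity_of_bytes(original))

# A one-character change produces a different value.
print(fixity_of_bytes(original) == fixity_of_bytes(altered))
```

Because the value depends only on the bytes, it stays the same when a file is renamed or copied, and it changes when the content itself changes.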
As an indicator of how rapidly technology has changed, the records we assessed were created between 1990 and 2010 and comprised 124 unique file formats. In addition, roughly 14,000 files were in indeterminate formats: we could not identify them by either embedded format signature or file extension (for more detailed information, please read our blog series on our Byrd CLS project).
Not all organizations will need to arrange and preserve the entire contents of a computer hard drive. Once the material to be archived has been identified, however, the basic preservation activities we recommended to the Byrd CLS can apply to any digital archive.
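The two identification methods mentioned above, embedded signature first and file extension second, can be sketched as follows. This is an illustration only and not the tooling used in the pilot. The function name and both lookup tables are hypothetical and tiny compared with the format registries that real identification tools rely on.

```python
from pathlib import Path

# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"\xffWPC": "WordPerfect",
}

# Fallback used only when no signature matches (illustrative subset).
EXTENSIONS = {
    ".txt": "Plain text",
    ".csv": "CSV",
    ".wpd": "WordPerfect",
}


def identify_format(path):
    """Return (format_name, method); method is None when nothing matches."""
    path = Path(path)
    with open(path, "rb") as handle:
        header = handle.read(16)  # long enough for every signature above
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name, "signature"
    name = EXTENSIONS.get(path.suffix.lower())
    if name:
        return name, "extension"
    return "indeterminate", None
```

A file lands in the "indeterminate" bucket only when both checks fail, which mirrors how the roughly 14,000 unidentified files are described above.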
We based our recommendations on the Levels of Digital Preservation guidelines developed over the past several years by the National Digital Stewardship Alliance (NDSA). These recommendations were designed to help organizations start or enhance their digital archives:
- Archive your files in a reliable storage system such as a server or cloud-based service that does not rely on removable media like CDs or USB flash drives. In the Byrd CLS case, much of the archival material was saved onto portable hard drives, which run the risk of becoming inaccessible over time. Multiple copies should also be created and stored in separate geographic locations to guard against total data loss from a natural disaster.
- Migrate files in “at-risk” formats into a more stable and open format. It is beneficial to constrain the number of file formats you’ll need to support. Some file types, like WordPerfect and RealAudio, are declining in popularity and may eventually become obsolete. Review the materials you need to archive and develop a policy for preferred file formats for each content type. A number of guiding examples exist, including the Library of Congress Sustainability of Digital Formats and the U.S. National Archives and Records Administration format guidance for the transfer of electronic records. In Senator Byrd’s example, we recommended file formats to use in order to reduce the current 124 file formats to a more manageable number.
- Assign a “fixity” value to the files early in the process. Determining the authenticity of an electronic record is difficult, but tools like ExactFile can calculate a “fixity” value, a checksum derived from the file’s sequence of binary code that is, for practical purposes, unique to that content. If the file is changed in any way, the calculated fixity value changes, so it provides a mechanism for detecting change caused by corruption, media degradation or errors, or malicious tampering. We provided the Byrd CLS with fixity information for each individual file so that they can use it to periodically review the files and confirm that they have not been altered.
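The periodic review described in the last recommendation can be sketched as a small script: record a checksum for every file once, then compare against that record later. This is a minimal illustration that assumes SHA-256 checksums; it is not how ExactFile works internally, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, manifest):
    """Compare the files under root with a previously recorded manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "added": sorted(set(current) - set(manifest)),
        "changed": sorted(
            name
            for name in manifest.keys() & current.keys()
            if manifest[name] != current[name]
        ),
    }
```

In practice the manifest would be saved (for example as JSON) somewhere outside the archived folder, ideally alongside each stored copy, and `audit` would be run on a schedule. Any entry under "changed" or "missing" signals a file to restore from another copy.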
Preserving digital content is an ongoing challenge that is not likely to be solved any time soon. However, these simple steps can help to prevent data loss through inevitable technology changes. Of course, if you need assistance assessing or processing your digital materials, our digital archivists are on hand to help.