Arranging and preserving digital materials has been an ongoing challenge for repositories that hold the collections of U.S. senators and representatives. Our digital archives team recently joined forces with the Byrd Center for Legislative Studies (Byrd CLS) to conduct a pilot project on a selection of Senator Byrd’s digital files. Our goal for the project is to develop a set of recommended best practices for the proper archiving of congressional electronic records, which we will share with the wider archival community.
Capturing the Content
The team met in late April to collect a selection of 2.9 million files residing on two portable hard drives and a number of CDs and DVDs. As a first step, we created a disk image of each storage medium so that we had an exact replica of the complete file systems to work with. A disk image includes not just the content we may be interested in, but also software files, operating system files, and unallocated space on the disk. It essentially captures the files within their entire “ecosystem” and represents them as a single file. Imaging the media ensures that important file system information, including file creation and last-modified time stamps, is preserved, and that the original files remain unchanged. Depending on the storage size, rotation speed, and data transfer speed of the disk, imaging can take a while: in our case, creating an image of a 1TB portable drive over a USB 2.0 connection took approximately 16 hours.
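The imaging step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the forensic imaging tooling we actually used; the source and destination paths are placeholders, and computing a checksum of the stream as it is copied lets the image be verified afterwards.

```python
import hashlib

def image_device(source_path, image_path, chunk_size=4 * 1024 * 1024):
    """Copy a raw device (or file) byte-for-byte into a single image file,
    computing an MD5 of the stream as we go so the copy can be verified."""
    md5 = hashlib.md5()
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            md5.update(chunk)
    return md5.hexdigest()
```

On Linux, the source would typically be a block device path (reading it requires elevated privileges), and the throughput of the connection, as noted above, dominates the total imaging time.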
Our pilot program includes an assessment of the materials, followed by an analysis to determine items of archival value and a recommended arrangement scheme. Throughout the process, we will document the tools and procedures used so our final report will detail what worked well – and what did not – for future reference. The assessment phase consists of a number of elements:
Extraction and Virus Scan
Back at History Associates, we mounted the collected disk images on a local workstation and began extracting the files. Since a disk image is a single file, we had to extract each individual file (all 2.9 million of them) from the image before we could continue our assessment; think of this step as unpacking the image file. Once the files had been extracted, we ran a virus scan across all of them, and thankfully no infections were found. Had any infected files been found, we would have quarantined them immediately and, where possible, applied the most appropriate method for removing the virus without damaging the digital content.
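The quarantine workflow can be sketched as follows. This is a hypothetical illustration: the `is_suspect` predicate stands in for the verdict of a real antivirus engine, and the directory layout is an assumption, not a description of our actual setup.

```python
import shutil
from pathlib import Path

def quarantine_suspect_files(root, quarantine_dir, is_suspect):
    """Walk an extracted file tree and move any file the scanner flags
    into a quarantine directory, returning the list of flagged paths.
    `is_suspect` is a stand-in for a real antivirus engine's verdict."""
    quarantine_dir = Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in Path(root).rglob("*"):
        if path.is_file() and is_suspect(path):
            shutil.move(str(path), quarantine_dir / path.name)
            moved.append(path)
    return moved
```

Isolating flagged files first, and only then attempting disinfection, keeps the rest of the collection safe while preserving the option of recovering content from the quarantined items.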
Asserting and Maintaining Integrity
What is an authentic electronic record? Demonstrating the authenticity of an electronic record is a challenging feat, but a fundamental component is to create and maintain a fixity value for each individual file as early in the archival process as possible. A fixity value is, in simple terms, a unique fingerprint for a file derived from the exact sequence of 1s and 0s it contains. In our case it enables us to verify that the files were not altered or corrupted in any way during our subsequent analysis, i.e., that none of the bits were changed by our opening, reading, and extracting information from the files. Fixity information also gives Byrd Center staff an ongoing mechanism for monitoring the integrity of their digital content and identifying any files that become altered or corrupted. A number of algorithms can be used to create fixity values; for this engagement we chose MD5 and SHA-1, both of which are widely used in the digital preservation domain.
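Generating both fixity values in one pass over each file is straightforward; a minimal sketch of the idea:

```python
import hashlib

def fixity(path, chunk_size=1024 * 1024):
    """Compute MD5 and SHA-1 digests for a file, reading in chunks so
    large files never have to fit in memory."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```

Recording these values in a manifest at accession time means that re-running the same computation later, and comparing the results, immediately reveals any file that has been altered or corrupted.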
Once we were confident that we had a clean and exact replica of the original content, and had calculated fixity values for every digital file, our detailed analysis could begin.
File Format Analysis
What do you have? For any digital preservation program it is important to understand what types of digital files are being managed. This means we have to be able to identify the specific file format of each file.
One approach is to examine a file’s extension for some indication of its format. For example, consider the file “MyDocument.pdf.” Its “.pdf” extension leads us to conclude that it is an Adobe PDF file. However, this is not foolproof: one can quite easily rename the file and change its extension to “.doc,” which would suggest it is now a Microsoft Word file, when in fact it is still an Adobe PDF file.
A more sophisticated approach is to look for a format signature inside the file itself: a short sequence of bytes, normally at the start of the file, that identifies its format. A format signature can identify not only the format but a particular version of it, e.g., Adobe PDF v1.6. However, not all files contain such signatures, so in some cases we have to rely on the extension as a tentative guide.
A number of tools exist that can crawl a file system and record the format of each file. Some look only at known file extensions and serve as a quick guide, while others search within the file for known format signatures. For the Senator Byrd collection we used a combination of both types of tools.
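The two approaches can be combined as sketched below. The signature table is a tiny illustrative sample; production tools such as DROID or Siegfried draw on the PRONOM registry, which holds signatures for hundreds of formats and versions.

```python
import os

# A handful of well-known magic numbers, for illustration only.
SIGNATURES = {
    b"%PDF-": "pdf",                 # Adobe PDF
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"II*\x00": "tiff",              # little-endian TIFF
    b"MM\x00*": "tiff",              # big-endian TIFF
}

def identify(path):
    """Try signature-based identification first; fall back to the
    file extension as a tentative guess."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, fmt in SIGNATURES.items():
        if header.startswith(magic):
            return fmt, "signature"
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return (ext or "unknown"), "extension"
```

Note that a renamed file is caught by the signature check: a PDF saved as “.doc” still begins with the `%PDF-` bytes and is identified correctly.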
The vast majority of the files we analyzed were TIFF and PDF representations of scanned microfilm: 2.6 million of the 2.9 million files. Of the remaining 295,000 files, we identified 112 different file formats covering a diverse set of content types, such as text, images, audio, video, email, databases, web content, spreadsheets, and software files. A total of 13,550 files could not be identified by either signature or extension, which represents a 95% identification success rate. We inspected a sample of the unidentified files with a hex editor, and many appeared to be textual in nature; indeed, they would open in Microsoft Word. Given their age, this leads us to assume they are WordPerfect files to which staffers had appended their initials as the extension.
Current best practice suggests that an archive adopt a policy of maintaining digital content in a small number of “open,” “well adopted,” and “sustainable” formats; an example would be using uncompressed TIFF for all images. In the case of the Byrd Center materials, we found a large number of format types used to represent similar content: there were, for example, 15 different file formats used to represent images. A process of file format migration or normalization could be used to reduce this to a more manageable number. Of course, there are cost-benefit decisions to consider on both the archival and technical sides when developing a format migration or normalization policy; we will save that discussion for a more in-depth post.
The threat of format obsolescence should be continuously considered by any digital preservation program. This can be mitigated somewhat by the adoption of the best practices described above. However, archives often accession digital materials created many years ago using both proprietary and somewhat extinct file formats. The Byrd Center materials certainly fall into this category. There are a number of files in formats that are highly proprietary and have fallen from popular use, including:
- Various versions of Word Perfect (~170,000)
- RealAudio and RealMedia
- Paradox Database
- Lotus Freelance
- Lotus 1-2-3
Although these represent a small percentage of the total materials, further consideration should be given to their preservation requirements over time.
In the next installment we will report on other aspects of the assessment, including file duplication, the extent of personally identifiable information (PII), and where this material fits within the Byrd Center’s existing holdings.