relate the table's entries to one another, have we affected content?

Suppose the CD in my attic contains a treasure map depicted by the visual patterns of word and line spacings in my original digital version of this article. Because these patterns are artifacts of the formatting algorithms of my software, they will be visible only when the digital version is viewed using my original program. If we need to view a complex document as its author viewed it, we have little choice but to run the software that generated it.

What chance will my grandchildren have of finding that software 50 years from now? If I include a copy of the program on the CD, they must still find the operating system software that allows the program to run on some computer. Storing a copy of the operating system on the CD may help, but the computer hardware required to run it will have long since become obsolete. What kind of digital Rosetta Stone can I leave to provide the key to understanding the contents of my disk?

[Figure: Shakespeare's first printed edition of sonnet 18 (1609). Credit: Folger Shakespeare Library.]

Migrating Bits

To prevent digital documents from being lost, we must first preserve their bit streams. That means copying the bits onto new forms of media to ensure their accessibility. The approach is analogous to preserving text, which must be transcribed periodically. Both activities require ongoing effort: future access depends on an unbroken chain of such migrations frequent enough to prevent media from becoming physically unreadable or obsolete before they are copied.
A single break in this chain renders digital information inaccessible, short of heroic effort. Given the current lack of permanence of media and the rate at which their forms evolve, migration may need to be as frequent as once every few years. Conservative estimates suggest that data on digital magnetic tape should be copied once a year to guarantee that none of [...] In the long run, we might be able to [...]

An ancient text can be preserved either by translating it into a modern language or by copying it in its original dialect. Translation is attractive because it avoids the need to retain knowledge of the text's original language, yet few scholars would praise their predecessors for taking this approach. Not only does translation lose information, it also makes it impossible to determine what information has been lost, because the original is discarded. (In extreme cases, translation can completely undermine content: imagine blindly translating both languages in a bilingual dictionary into a third language.) Conversely, copying text in its original language (saving the bit stream) guarantees that nothing will be lost. Of course, this approach assumes that knowledge of the original language is retained.

Archivists have identified two analogous strategies for preserving digital documents. The first is to translate them into standard forms that are independent of any computer system. The second approach is to extend the longevity of computer systems and their original software to keep documents readable. Unfortunately, both strategies have serious shortcomings.

On the surface, it appears preferable to translate digital documents into standard forms that would remain readable in the future, obviating the need to run obsolete software. Proponents of this approach offer the relational database (introduced in the 1970s by E. F. Codd, now at Codd & Date, Inc., in San Jose, Calif.) as a paradigmatic example. Such a database consists of tables representing relations among entities.
A database of employees might contain a table having columns for employee names and their departments. A second table in the database might have department names in its first column, department sizes in its second column and the name of the department head in a third. The relational model defines a set of formal operations that make it possible to combine the relations in these tables, for example, to find the name of an employee's department head.

Because all relational database systems implement this same underlying model, any such database can in principle be translated into a standard tabular form acceptable to any other system. Files represented this way could be copied to new media as necessary, and the standard would ensure readability forever.

[Figure caption: UNDERSTANDING A BIT STREAM demands knowledge of the format used to create the stream. If all the numbers in a monthly checking account statement were strung together, with nothing to distinguish check numbers, dates and dollar amounts, the resulting sequence of digits would be impossible to understand.]

SCIENTIFIC AMERICAN January 1995 45

Flaws of Translation

Regrettably, this approach is flawed. Relational databases are less standardized than they appear. Commercial relational database systems distinguish themselves from one another by offering features that extend the relational model in nonstandard ways. Moreover, the limitations of such databases are already leading to the adoption of new models. The tables in a relational database cannot transparently show structure. That is, the database could not immediately make it clear that a corporation consisted of one headquarters, five national offices, 25 divisions and 100 departments. Various object-oriented database models (which can represent structure directly) are evolving to satisfy this need. Such rapid evolution is neither accidental nor undesirable. It is the hallmark of information technology.
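As a rough illustration of the formal operations mentioned in the employee example, the sketch below rebuilds the two tables in an in-memory SQLite database and uses a join to look up an employee's department head. The table names, column names, sample rows and the helper `department_head` are all hypothetical, chosen only for illustration; nothing here is drawn from a particular commercial system.

```python
import sqlite3


def department_head(conn, employee_name):
    """Return the head of the given employee's department, or None if unknown."""
    row = conn.execute(
        "SELECT d.head FROM employees AS e "
        "JOIN departments AS d ON d.name = e.department "
        "WHERE e.name = ?",
        (employee_name,),
    ).fetchone()
    return row[0] if row else None


# Hypothetical sample data: one table of employees, one of departments.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
conn.execute("CREATE TABLE departments (name TEXT, size INTEGER, head TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Avery", "Archives"), ("Blake", "Systems")],
)
conn.executemany(
    "INSERT INTO departments VALUES (?, ?, ?)",
    [("Archives", 12, "Casey"), ("Systems", 30, "Drew")],
)

print(department_head(conn, "Avery"))
```

The join works only because both tables agree on what a department name means; that shared model is what makes a standard tabular export plausible for relational data and much harder for other kinds of documents.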
Furthermore, far from being a representative example, relational databases are practically unique. No other type of digital document has nearly so formal a basis for standardization. Word processors, graphics programs, spreadsheets and hypermedia programs each create far more varied documents. The incompatibility of word-processing files exemplifies this problem. It did not arise simply because companies were trying to distinguish their products in the marketplace. Rather, it is a direct outgrowth of the technology's tendency to adapt itself to the emerging needs of users.

[Figure labels: INTENDED 4-BIT KEY (VALUE OF 0111 = 7); INTENDED 7-BIT DATA BYTES; BIT STREAM: 01111100000000101010100000000100000111101110; UNINTENDED 5-BIT KEY (VALUE OF 01111 = 15); UNINTENDED 15-BIT DATA BYTE]

[Figure caption: CODE KEY may be used to indicate how a bit stream is organized. Here the first four bits stand for the integer 7, meaning that the remaining bytes are each seven bits long. Yet there is no way to tell the length of the code key from the bit stream itself. If we were to read the first five bits as the code key, we would erroneously conclude that the remaining bytes were 15 bits long.]

As yet, no common application is ready to be standardized. We do not have an accepted, formal understanding of the ways that humans manipulate information. It is therefore premature to attempt to enumerate the most important kinds of digital applications, let alone to circumscribe their capabilities through standards. Forcing users to accept the limitations imposed by such standards or restricting all digital documents to contain nothing but text as a lowest common denominator would be futile. The information revolution derives its momentum precisely from the attraction of new capabilities. Defining long-term standards for digital documents may become feasible when information science rests on a more formal foundation, but such standards do not yet offer a solution.
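The ambiguity described in the code-key figure can be reproduced in a few lines. The sketch below reads the figure's bit stream twice, once assuming a four-bit key and once assuming a five-bit key, and splits the remainder accordingly. The function name `parse_with_key` and the choice to report leftover bits separately are assumptions made for illustration; the bit string is copied as it is printed in the figure.

```python
def parse_with_key(bits, key_len):
    """Read the first key_len bits as the byte width, then split the rest.

    Returns (width, values, leftover), where leftover holds any trailing
    bits that do not fill a complete byte of the decoded width.
    """
    width = int(bits[:key_len], 2)
    if width == 0:
        raise ValueError("code key decodes to a zero byte width")
    payload = bits[key_len:]
    full = len(payload) - len(payload) % width
    values = [int(payload[i:i + width], 2) for i in range(0, full, width)]
    return width, values, payload[full:]


# Bit stream as printed in the figure.
STREAM = "01111100000000101010100000000100000111101110"

for key_len in (4, 5):
    width, values, leftover = parse_with_key(STREAM, key_len)
    print(f"key length {key_len}: width {width}, values {values}, leftover {leftover!r}")
```

Both readings succeed without any error, which is the point of the figure: nothing in the stream itself says which key length was intended, so that fact has to be recorded somewhere outside the stream.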
Translating a document into successive short-term standards offers false hope. Successive translation avoids the need for ultimate standards, but each translation introduces new losses. Would a modern version of Homer's Iliad have the same literary impact if it had been translated through a series of intermediate languages rather than from the earliest surviving texts in ancient Greek? In theory, translating a document through a sequence of standards should enable scholars to reconstruct the original document. Yet that requires each translation to be reversible without loss, which is rarely the case.

Finally, translation suffers from a fatal flaw. Unlike English and ancient Greek, whose expressive power and semantics are roughly equivalent, digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats. Old documents cannot always be translated into unprecedented forms in meaningful ways, and translating a current file back into a previous form is frequently impossible. For example, many older, hierarchical databases were completely redesigned to fit the relational model, just as relational databases are now being restructured to fit emerging object-oriented models. Shifts of this kind make it difficult or meaningless to translate old documents into new standard forms.

The alternative to translating a digital document is to view it by using the program that produced it. In theory, we might not actually have to run this software. If we could describe its behavior in a way that does not depend on any particular computer system, future generations could re-create the behavior of the software and thereby read the document. But information science cannot yet describe the behavior of software in sufficient depth for this approach to work, nor is it likely to be able to do so in the near future.
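The cumulative loss from successive translation can be illustrated with a deliberately small analogy in which character encodings stand in for document formats. This is only a sketch under that assumption: the sample text, the helper names `translate` and `migrate`, and the chosen encodings are hypothetical and are not taken from the article.

```python
def translate(text, codec):
    """Store text in one format (a codec) and read it back.

    Characters the format cannot represent are replaced with '?',
    which models an irreversible translation step.
    """
    return text.encode(codec, errors="replace").decode(codec)


def migrate(text, codecs):
    """Pass text through a sequence of formats, one after another."""
    for codec in codecs:
        text = translate(text, codec)
    return text


original = "Ἰλιάς: a naïve café reading"

# Each of these formats is narrower than the last, so losses accumulate.
lossy = migrate(original, ["latin-1", "ascii"])

# These formats can all represent the full text, so the chain is reversible.
lossless = migrate(original, ["utf-8", "utf-16"])

print(lossy)
print(lossless == original)
```

Once a character has been replaced, no later step can tell what it used to be, and the output alone does not reveal how much was lost. A chain of translations is therefore only as faithful as its least expressive link.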
To replicate the behavior of a program, there is currently little choice but to run it. For this reason, we must save the programs that generate our digital documents, as well as all the system software required to run those programs. Although this task is monumental, it is theoretically feasible. Authors often include an appropriate application program and operating system to help recipients read a digital document. Some applications and system software may remain ubiquitous, so that authors would need only to refer readers to those programs. Free, public-domain software is already widely available on the internet. Moreover, when proprietary programs become obsolete, their copyright restrictions may expire, making them available to future users.

How can we provide the hardware to run antiquated systems and application software? A number of specialized museums and "retro-computing" clubs are attempting to maintain computers in working condition after they become obsolete. Despite a certain undeniable charm born of its technological bravado, this method is ultimately futile. The cost of repairing or replacing worn out components (and retaining the expertise to do so) must inevitably outweigh the demand for any outmoded computer.

Fortunately, software engineers can write programs called emulators, which mimic the behavior of hardware. Assuming that computers will become far more powerful than they are today, they should be able to emulate obsolete systems on demand. The main drawback of emulation is that it requires detailed specifications for the outdated hardware. To be readable for posterity, these specifications must be saved in a digital form independent of any particular software, to prevent having to emulate one system to read the specifications needed to emulate another.

Saving Bits of History

If digital documents and their programs are to be saved,
their migration must not modify their bit streams, because programs and their files can be corrupted by the slightest change. If such changes are unavoidable, they must be reversible without loss. Moreover, one must record enough detail about each transformation to allow reconstruction of the original encoding of the bit stream. Although bit streams can be designed to be immune to any expected change, future migration may introduce unexpected alterations. For example, aggressive data compression may convert a bit stream into an approximation of itself, precluding a precise reconstruction of the original. Similarly, encryption makes it impossible to recover the original bit stream without the decryption key.

[Figure caption: INTERPRETING A BIT STREAM correctly is impossible without contextual information. This eight-bit sequence can be interpreted in at least six different ways.]

Ideally, bit streams should be sealed in virtual envelopes: the contents would be preserved verbatim, and contextual information associated with each envelope would describe those contents and their transformation history. This information must itself be stored digitally (to ensure its survival), but it must be encoded in a form that humans can read more simply than they can the bit stream itself, so that it can serve as a bootstrap. Therefore, we must adopt bootstrap standards for encoding contextual information; a simple, text-only standard should suffice. Whenever a bit stream is copied to new media, its associated context may be translated into an updated bootstrap standard. (Irreversible translation would be acceptable here, because only the semantic content of the original context need be retained.) These standards can also be used to encode the hardware specifications needed to construct emulators.

Where does this leave my grandchildren? If they are fortunate, their CD may still be readable by some existing disk drive, or they may be resourceful enough to construct one, using information in my letter.
If I include all the relevant software on the disk, along with complete, easily decoded specifications for the required hardware, they should be able to generate an emulator to run the original software that will display my document. I wish them luck.

FURTHER READING

TEXT AND TECHNOLOGY: READING AND WRITING IN THE ELECTRONIC AGE. Jay David Bolter in Library Resources and Technical Services, Vol. 31, No. 1, pages 12-23; January-March 1987.

TAKING A BYTE OUT OF HISTORY: THE ARCHIVAL PRESERVATION OF FEDERAL COMPUTER RECORDS. Report 101-978 of the U.S. House of Representatives Committee on Government Operations. November 6, 1990.

ARCHIVAL MANAGEMENT OF ELECTRONIC RECORDS. Edited by David Bearman. Archives and Museum Informatics, Pittsburgh, 1991.

UNDERSTANDING ELECTRONIC INCUNABULA: A FRAMEWORK FOR RESEARCH ON ELECTRONIC RECORDS. Margaret Hedstrom in American Archivist, Vol. 54, No. 3, pages 334-354; Summer 1991.

ARCHIVAL THEORY AND INFORMATION TECHNOLOGIES: THE IMPACT OF INFORMATION TECHNOLOGIES ON ARCHIVAL PRINCIPLES AND PRACTICES. Charles M. Dollar. Edited by Oddo Bucci. Information and Documentation Series No. 1. University of Macerata, Italy, 1992.

SCHOLARLY COMMUNICATION AND INFORMATION TECHNOLOGY: EXPLORING THE IMPACT OF CHANGES IN THE RESEARCH PROCESS ON ARCHIVES. Avra Michelson and Jeff Rothenberg in American Archivist, Vol. 55, No. 2, pages 236-315; Spring 1992.

Attachment 3

RESOLUTION ON THE U.S. CONGRESSIONAL SERIAL SET AND [...]

WHEREAS, The U.S. Congressional Serial Set and the bound Congressional Record together comprise a significant portion of the official historical record of Congress; and

WHEREAS, The U.S.
Congressional Serial Set has been produced since 1813 in a bound, numbered edition, and includes Senate and House documents, congressional committee reports, presidential and other executive publications, treaty materials, and selected reports of nongovernmental organizations; and

WHEREAS, The bound Congressional Record has been produced since 1873 as the official record of the proceedings and debates of Congress in a uniform, numbered edition, superseding its predecessors, the Annals of Congress (1789-1824), the Register of Debates (1824-1837), and the Congressional Globe (1833-1873); and

WHEREAS, The U.S. Congressional Serial Set and the bound Congressional Record are important historical materials for the legal and research communities, particularly for the compilation of legislative histories needed to determine legislative intent in interpreting federal statutes; and

WHEREAS, The U.S. Congressional Serial Set and the bound Congressional Record are available through the Federal Depository Library Program, providing ready no-fee access to the official version of these important titles in nearly every Congressional district; and

WHEREAS, The U.S. Congressional Serial Set and the print bound Congressional Record, as official, authoritative records of the deliberations of Congress, are produced on acid-free permanent paper to ensure their preservation for future research and scholarship; and

WHEREAS, The production and dissemination of these historically significant titles in microfiche, CD-ROM or other electronic formats do not at this time meet required standards to ensure permanent long-term access and preservation, nor are they the official, authoritative versions; now, therefore, be it

RESOLVED, That the American Association of Law Libraries urge Congress to continue to fund the production of the U.S.
Congressional Serial Set and the bound Congressional Record in the permanent, print versions required for long-term access and preservation; and be it further

RESOLVED, That the American Association of Law Libraries urge Congress to recognize the historical significance of these print titles as the official record of their deliberations, and to guarantee their continued availability to the American public through local depository libraries; and be it further

RESOLVED, That the American Association of Law Libraries transmit a copy of this resolution to Members of the House and Senate Legislative Appropriations Subcommittees, to other appropriate congressional committees, and to the Public Printer.

Endorsed by the AALL Executive Board, July 19, 1996