E-mail: Svein.Arne.Solbakk@nbr.no
0. Abstract
The Norwegian Legal Deposit Act of June 9th 1989 includes legal deposit of electronic material. In effect, this means that the Norwegian National Library is responsible for acquisition, registration, and long-term archiving and preservation of various documents in electronic form. In addition, this material must be available for research and documentation purposes in the same way as more traditional legal deposit material.
The National Library branch in Mo i Rana (NBR) is responsible for long-term storage of one copy of all the documents deposited under the Legal Deposit Act. To optimize storage conditions for all types of material, a huge storage magazine has been built inside a mountain in Mo i Rana. The first building in this magazine offers 42,000 meters of shelving, and a well-controlled climate at +8 °C and 35% humidity, as well as powerful ventilation. This is very suitable for most kinds of paper-based documents, and the question now is whether our new electronic documents with their respective physical storage media types will enjoy their time in our mountain hall of Norwegian cultural heritage.
In general, our ambition is to preserve all kinds of information for at least 1,000 years. Today this requirement is hard to meet for most kinds of digital storage media, optical as well as magnetic. And even if the digital storage media were able to hold the data for 1,000 years, one would have to face the problem of finding a computer able to understand the electrical interface of the storage device, the physical format of the storage media, and the semantic format of the data itself.
The electronic material can be grouped into these categories:
Electronic text
This constitutes electronic journals [Rustad94a], newsgroups and other open conference
systems, electronic book manuscripts, documents available through WWW or gopher, and
other electronic texts which are openly distributed on CD-ROM or via data networks.
Databases
On-line or off-line databases which are publicly available. The databases may be
reachable via various kinds of network connections, or sold on CD-ROM. The information
in the databases may be text, statistics, reference data, images, and in practice anything
which is retrievable from an on- or off-line database system.
Data programs
The Legal Deposit Act states that a program must be especially adapted for Norwegian
users to be covered by the Act. There are no limitations on system platform or program
category.
Audiovisual material
I regard this as outside the scope of this article, but mention it for information.
Our Legal Deposit Act also includes publicly available audiovisual material, i.e.
television and radio broadcasts, films, music recordings and more. A lot of this material is
deposited in some kind of electronic form, and therefore some of the problems connected
with the categories mentioned above also apply to this material. We are also discussing
the possibility of converting parts of, or all of, the audiovisual material to digital form to
improve accessibility and preservation. If this is done, the problems discussed in
this article are highly relevant for this material.
There exist many different physical storage media for digital information, and numerous principles for writing the information onto these media. In this article I focus on the ones that are regarded as most interesting for long-term preservation of digital information, and the list is therefore by no means complete.
When planning to store information for a very long time in digital form, the physical storage device has to meet some new requirements:
1. The hardware necessary to read the physical medium must be available for some years, i.e. it should be produced in large numbers, preferably by several independent companies, and it should be a standard or a de facto standard.
2. The hardware necessary to read the physical medium should have a standardized interface towards computer equipment, and the software drivers necessary to read information from the hardware device should be available for a variety of computers and operating system platforms.
3. Data written to the storage device by a computer and its operating system environment should be readable by a variety of computers and operating system platforms.
CD-ROM
Current CD-ROM technology offers approximately 660 MB of storage space per unit. As
the CD-ROM technology may be regarded as a very common consumer technology, one
may expect that the hardware necessary to read a CD-ROM will be easily available for
many years to come. The high production volumes of such hardware have also pushed the
price very low. [Arps93] states that there exist around 500 million CD audio
players, which are based on the same technology as the CD-ROM players for computers.
There exists a standardized ISO-format (ISO 9660) for writing data onto a CD-ROM, which makes it possible to read the data on a variety of hardware and operating system platforms. Whether a computer is able to understand the data it is able to read, however, depends on the format of the data itself. The format of the data itself is not "visible" to the CD-ROM reader. This aspect is discussed in chapter four of this article.
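What the standardized format gives a reader can be illustrated with a minimal Python sketch that locates the primary volume descriptor of an ISO 9660 disc image. The function names, the in-memory image and the volume name `NONEWS_1994` are assumptions made for illustration only; the field offsets follow the ISO 9660 layout (2048-byte sectors, volume descriptors starting at sector 16). Note that the sketch recovers only volume-level facts and says nothing about the format of the files stored on the disc.

```python
import struct

SECTOR_SIZE = 2048            # logical sector size used by ISO 9660
FIRST_DESCRIPTOR_SECTOR = 16  # the volume descriptor set starts at sector 16


def read_primary_volume_descriptor(image: bytes) -> dict:
    """Return a few fields from the primary volume descriptor of an image."""
    offset = FIRST_DESCRIPTOR_SECTOR * SECTOR_SIZE
    while offset + SECTOR_SIZE <= len(image):
        descriptor = image[offset:offset + SECTOR_SIZE]
        # Bytes 1-5 of every volume descriptor hold the identifier "CD001".
        if descriptor[1:6] != b"CD001":
            break
        kind = descriptor[0]
        if kind == 255:          # volume descriptor set terminator
            break
        if kind == 1:            # primary volume descriptor
            volume_id = descriptor[40:72].decode("ascii").rstrip()
            # Numeric fields are stored in both byte orders;
            # read the little-endian half.
            (block_count,) = struct.unpack_from("<I", descriptor, 80)
            (block_size,) = struct.unpack_from("<H", descriptor, 128)
            return {
                "volume_id": volume_id,
                "block_count": block_count,
                "block_size": block_size,
            }
        offset += SECTOR_SIZE
    raise ValueError("no ISO 9660 primary volume descriptor found")


def make_demo_image() -> bytes:
    """Build a tiny synthetic image (hypothetical data) for demonstration."""
    pvd = bytearray(SECTOR_SIZE)
    pvd[0] = 1                                   # descriptor type: primary
    pvd[1:6] = b"CD001"                          # standard identifier
    pvd[6] = 1                                   # descriptor version
    pvd[40:72] = b"NONEWS_1994".ljust(32)        # volume identifier, padded
    struct.pack_into("<I", pvd, 80, 20)          # volume size, little-endian
    struct.pack_into(">I", pvd, 84, 20)          # volume size, big-endian
    struct.pack_into("<H", pvd, 128, SECTOR_SIZE)
    struct.pack_into(">H", pvd, 130, SECTOR_SIZE)
    return bytes(FIRST_DESCRIPTOR_SECTOR * SECTOR_SIZE) + bytes(pvd)


if __name__ == "__main__":
    print(read_primary_volume_descriptor(make_demo_image()))
```

Any platform that implements the standard can recover these fields, whereas the contents of the files remain opaque until their own format is understood.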
There exist at least three physical categories of CD-ROMs.
1. The traditional one commonly used for mass production of music CDs and commercial CD-ROMs. One of the main problems with this category is that if a small scratch in the surface lets air get through to the aluminium layer, the aluminium will oxidize, with the unpleasant consequence that the CD-ROM falls apart. However, accelerated tests have indicated that these CD-ROMs have a 100 year lifetime at room temperature [Arps93]. With our excellent storage facilities the average CD-ROM may be expected to last for more than 100 years.
2. The gold-plated writable CD-ROM commonly used for small-scale production of CD-ROMs. With this category the oxidation problem is eliminated, as gold does not oxidize, and one may expect an average lifetime of more than a hundred years under optimized conditions. As there currently exists a lot of cheap hardware which enables you to write your own data onto these writable CD-ROMs, they may easily be used for preservation of electronic information.
3. Experiments have also been done with a special type of CD-ROM based on platinum. This would be very suitable for long-term storage. A powerful laser is used to burn the digital information into the platinum. The hardware necessary to write to such a CD-ROM will not be a consumer product, as one needs a particularly powerful laser. In the trials done so far, a non-standard format has been used on the CD-ROMs, which is not readable by common CD-ROM players. Another major disadvantage of this otherwise very interesting technology is an extremely high price for each CD-ROM unit.
Research is currently going on to make it possible to store information in several layers on the same physical CD-ROM as one uses today. With this approach one may store 660 MB of information on each layer. In its research lab, IBM has found that CD-ROMs with up to ten layers are theoretically viable, but they predict that a two or four layer CD will be the first to hit the market. Only small adjustments need to be made to current CD-ROM players to be able to read multilayer CD-ROMs.
Other optical disks (WORM)
Optical WORM disk systems exist that are capable of storing 6 GB of information on a
physical medium quite suitable for long-term storage. Unfortunately, this is based on
non-standard hardware and non-standard physical units, and the technology is not very
widespread. It is therefore not very tempting to use this technology for large-scale
long-term storage of electronic information.
Optical tape
Optical tape recorders were introduced to the commercial market five years ago. The
technology gives you 1 TB of data (i.e. approximately 1,000 GB) on one single tape unit.
The optical tape unit is designed for storage at room temperature, and it keeps its
information for up to 400 years when stored at approximately +18 °C. As with any other
tape, the optical tape has to be rewound periodically, and the current suggestion is that it
should be rewound once every 30 years.
The disadvantages of the technology are that it is very non-standard, that it is produced in small volumes, and that the price of the tape recorder and player is currently high. The cost per MB for the physical tape unit is low.
DAT
The DAT cassette has become a widely used short-term storage medium for digital
information. Currently, up to 16 GB of data may be stored on a DAT cassette, using hardware
compression (2:1), and the price per MB is thus very low. There are jukeboxes available
handling several DAT cassettes. The problem with DAT is that it, as with any other
magnetic tape, will have to be rewound periodically and recopied after some years to
prevent drop-outs.
DLT
Digital Linear Tape (DLT) was introduced by Digital Equipment Corporation Ltd. quite
recently. One tape unit takes 20 GB, and several tape units may be handled by a jukebox.
The tape unit is more robust and reliable than a DAT cassette, but it still needs to be
rewound periodically and recopied after a few years.
3. HARDWARE AND SYSTEM PLATFORM
Even for an unformatted electronic text, the variety of commonly used character sets makes transfer of the text between different hardware or operating system platforms cumbersome. There are always problems connected with the transfer of special characters, such as the Norwegian æ, ø and å, between different platforms. One should therefore standardize on a specific character set for unformatted electronic texts, e.g. ISO Latin 1. Alternatively, one may standardize on one or a few formats which easily handle special characters and also preserve formatting information. This is discussed in the next chapter.
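The character set problem can be made concrete with a short Python sketch. It encodes the three Norwegian letters with three character sets and then shows what a reader sees when it assumes the wrong one. The choice of platforms and the function names are assumptions for illustration; the codec names are those of Python's standard library.

```python
TEXT = "æøå"  # the Norwegian special characters discussed above

# Illustrative selection of platform character sets (Python codec names).
CODECS = {
    "ISO Latin 1": "latin-1",
    "DOS (Nordic code page 865)": "cp865",
    "Macintosh (Mac Roman)": "mac_roman",
}


def encode_everywhere(text: str) -> dict:
    """Encode the same text with each platform's character set."""
    return {name: text.encode(codec) for name, codec in CODECS.items()}


def misread(text: str, written_as: str, read_as: str) -> str:
    """Return what a reader sees when it assumes the character set read_as."""
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    # The same three letters become three different byte sequences.
    for name, raw in encode_everywhere(TEXT).items():
        print(f"{name:30} {raw.hex(' ')}")
    # ISO Latin 1 bytes interpreted by a DOS program using code page 865.
    print(misread(TEXT, "latin-1", "cp865"))
```

Storing every unformatted text in one declared character set means that only one such mapping has to be documented and kept alive.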
3.2 Databases
A database is typically closely tied to the computer hardware it resides on and to the
database management system (DBMS) used to organize the data. There exists a huge
variety of database management systems for a variety of hardware and system platforms.
This implies that one has to preserve a lot of technology in addition to the actual data, to be able to guarantee that a set of databases deposited under the Legal Deposit Act will be available for research in the long term.
Alternatively, one has to standardize on one or a few database management systems, and write a lot of data import filters to get all the deposited databases into the format of one of the DBMSes chosen as standards. This approach is discussed in the next chapter.
All data programs have specific hardware and operating system requirements to be able to run. To be able to do research on a specific data program in the future, one therefore also has to preserve the technology required to run the program. In a perspective of 1,000 years, this is a challenge of unknown dimensions!
SGML is a standardized markup language. The current disadvantage of SGML is that there exist only a few, very expensive, word processors capable of handling the format. The consequence of this is that SGML is not widely used, and that it is not easy to convert existing texts into the SGML format.
PostScript is a format which is very commonly used when printing electronic documents. Most computers and word processors are capable of producing a PostScript version of an electronic text, and most laser printers available today are capable of printing a PostScript document. Another advantage of PostScript is that it is based on ASCII characters. The text (including special characters), formatting information, graphics and images are converted into an ASCII-based description format. This makes PostScript suitable for transfer, e.g. via electronic mail. The main disadvantage of PostScript is that even a small document requires a lot of storage space when converted to the PostScript format. In addition, it is not straightforward to convert a PostScript document back into another format.
Adobe, the company which created the PostScript format, has recently introduced the Acrobat format. With the Acrobat tools one may convert any PostScript file into the Acrobat format. Acrobat compresses its data, and consequently an Acrobat file requires only a fraction of the storage space its corresponding PostScript file requires. An Acrobat file may be viewed on a computer screen or printed using an Acrobat reader. In the current version of Acrobat the readers are distributed free of charge, and they are available for PC, Macintosh and some UNIX variants. When using Acrobat for distribution of electronic documents, printing, viewing and copying of parts of a document may be password protected. This is an elegant mechanism to have whenever copyright issues have to be taken into account.
The main advantages of the Acrobat format are that any electronic text file may easily be converted to the Acrobat format via the PostScript format, that the Acrobat format minimizes the storage space requirements, and that Acrobat readers may be distributed free of charge. The main disadvantage is that the format is currently supported only by Adobe, and that it has not yet become a standard or de facto standard.
Neither SGML, PostScript nor Acrobat is capable of handling the new kinds of documents available through Gopher and World Wide Web (WWW) servers. The documents on these servers are publicly available, and thus in principle covered by our Legal Deposit Act. The question is: How on earth do you preserve a document where important parts of the intellectual contents are given by links to other documents which may reside on a variety of computers spread all over the world, which again have links to other documents which may reside on another variety of computers spread all over the world, which again... In a perspective of 1,000 years one may be absolutely certain that some of these links will be obsolete when the virtual dust is removed from the original document by our (grand)^40-daughter who wishes to do some research on it. In fact, we may be quite sure that most of these links, if not all, will be obsolete in a 10 year perspective.
In spite of this, it may be of interest to preserve information from a Gopher or WWW document, and the best way to do this may be to use the original format of the documents. There exist a lot of readers for these formats for a variety of computer hardware and system software platforms, and these readers are mostly free of charge.
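When a WWW document is preserved in its original format, it may be useful to record which of its links point outside the deposited material, since those are the ones likely to become obsolete. The following Python sketch lists such links for an HTML document. The class and function names and the sample document, including its addresses, are hypothetical and serve only as an illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the targets of all <a href="..."> anchors in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def external_links(html: str) -> list:
    """Return the links that name another host, i.e. lie outside the deposit."""
    collector = LinkCollector()
    collector.feed(html)
    # A link with a network location refers to a document on some server;
    # relative links stay inside the deposited document collection.
    return [link for link in collector.links if urlparse(link).netloc]


# Hypothetical sample document with one local and two external links.
SAMPLE = """
<html><body>
<a href="chapter2.html">Next chapter</a>
<a href="http://www.example.org/paper.html">A paper elsewhere</a>
<a href="gopher://gopher.example.net/1/archive">A gopher menu</a>
</body></html>
"""

if __name__ == "__main__":
    for link in external_links(SAMPLE):
        print(link)
```

Such a list does not solve the problem of obsolete links, but it documents how much of the intellectual content lay outside the preserved file at the time of deposit.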
Most database systems support some kind of export of their data, either in a self-defined report format or in a predefined exchange format. Unfortunately there exists no standard database data exchange format, and consequently one has to customize a data import filter for every combination of DBMSes one wants to exchange data between, even if every system has implemented exactly the same data model.
As mentioned in the previous chapter, one may standardize on one or a few database management systems, and write a lot of data import filters to get all the deposited databases into the format of one of the chosen DBMSes. For example, [Eaton93] plans to use SQL (Structured Query Language) to describe databases from various DBMSes, and convert all data into a relational structure. This is, however, a tremendous task to carry out. In addition, the chosen DBMS may suddenly cease to exist, or be released in a completely new version demanding a new export/import job to be carried out.
One never knows when, or even if, a deposited database will be the subject of research or documentation. A more reasonable approach may therefore be to export the data from a database to an exchange format which is suitable for import to another DBMS, and preserve this exchange format together with detailed information on the data model of the database and how this data model is mapped into the exchange format. If someone in the future wants to use the database for research or documentation, the cost of importing the data to a current DBMS may be taken at that point in time.
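A minimal sketch of this export approach is given below in Python. SQLite stands in for the deposited DBMS, comma-separated text stands in for the exchange format, and the table, the placeholder ISSN and the file names are hypothetical; none of these choices is prescribed by the projects described in this article. The point is only that the data and a description of the data model are written out side by side.

```python
import csv
import io
import sqlite3


def export_database(conn: sqlite3.Connection) -> dict:
    """Return {file name: text} with the data model and one CSV file per table."""
    files = {}
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    # The CREATE TABLE statements document the data model alongside the data.
    files["schema.sql"] = "\n".join(f"{sql};" for _, sql in tables) + "\n"
    for name, _ in tables:
        cursor = conn.execute(f'SELECT * FROM "{name}"')
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        # The header row maps each CSV column to a column of the data model.
        writer.writerow([column[0] for column in cursor.description])
        writer.writerows(cursor)
        files[f"{name}.csv"] = buffer.getvalue()
    return files


def demo_connection() -> sqlite3.Connection:
    """Create a small in-memory database with hypothetical content."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE journal (issn TEXT PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO journal VALUES ('0000-0000', 'Eksempel')")
    return conn


if __name__ == "__main__":
    for file_name, text in export_database(demo_connection()).items():
        print(f"--- {file_name}")
        print(text, end="")
```

The exported files are plain text, so they can be preserved under the same rules as unformatted electronic texts, and the import into a future DBMS is deferred until someone actually needs it.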
The size of this problem is somewhat connected to how one organizes the deposit of databases. Off-line databases are always deposited when distributed in more than 50 copies, and these may be treated as data programs. The on-line databases are more cumbersome to handle. Some of the options for on-line databases currently under discussion are:
1. An on-line database is only deposited when its host ceases to offer it to the public. As long as the database is available on-line, the National Library is given on-line access to the database, and it is then considered as available for research and documentation.
2. A database is usually evolving continuously. If a book is released in a new edition, it will always be deposited according to the Legal Deposit Act. The consequence of this in the database world would be that every change to a database in some way should be deposited. For most databases this does not make sense! However, to preserve some of the history of a database one may want to get a snapshot of the evolving database periodically, e.g. every month or once a year. This should be combined with the first option.
The first alternative will of course give the National Library far less data to handle than the second alternative, and consequently it will be easier to put more effort into each deposited database using the first alternative.
A data program may be available for various computers and operating system platforms. If the program behaves and looks exactly the same way on the various platforms, it is not necessary to preserve the program for all the platforms. If possible, one should try to focus on a few commonly used platforms. This will minimize, but unfortunately not eliminate, the problem of keeping available the technology necessary to run the programs in the future.
CURRENT ACTIVITIES AND PLANS
To gain experience with legal deposit of electronic material, the Norwegian National
Library has started a few projects which aim towards acquisition of, giving access to, and
archiving and preservation of some categories of electronic material.
NetNews
All messages to the approximately 60 different Norwegian newsgroups in the world-wide netnews
system are automatically being acquired and imported into a full text database at the Rana
branch of the Norwegian National Library. Through a special form supported by the
NCSA Mosaic software, anyone on the WWW network may perform free textual searches
in our "NoNews" database. The search result is presented as an HTML document with the
news messages fulfilling the search criteria.
As NetNews messages are posted using any Internet-compatible electronic mail system, all special characters, such as the Norwegian æ, ø and å, are presented in a variety of incorrect ways. In our project we do not try to correct special characters, but leave them exactly the way they were received in the NetNews system.
Our project started July 1st 1994, and consequently there are no breathtaking experiences to report on the preservation aspect yet. We plan to preserve each year of NoNews messages on gold-plated writable CD-ROMs. On the CD-ROMs the original unformatted unindexed text from the news messages will be stored.
Electronic Journals
We have recently started a project where all electronic journals given an official ISSN
number are asked to deposit their material with the National Library. Currently, there are
only four official electronic journals in Norway. This material will be stored in a full text
database like the news messages, but in addition an index will be developed to simplify
retrieval and to make it easier to keep track of whether all editions of each journal are
deposited.
The availability of this material depends on the policy of each publisher. All the university libraries will be given access to the database, and whenever the publisher agrees to this, the database will be open for searches from everyone able to access our WWW-server.
As for the news messages, we plan to store all electronic journals from one year on a gold-plated CD-ROM for preservation. We will also try to include the index for the journals from the year in question in an easily readable format.
As far as we know, the electronic journals in question are currently being distributed as unformatted text. Whenever this is the case, we will preserve the material as unformatted text with a given character set. If an electronic journal is published as formatted text, the most likely solution will currently be to convert the material to Acrobat format, and preserve it in Acrobat format together with Acrobat readers.
If someone makes an electronic journal on the WWW, it should be stored, made accessible, and preserved in HTML format. Consequently, when the first Norwegian WWW-based journal is published, we will make a WWW branch of our electronic journals database, using the same index as for the other material. Such a journal will also be preserved in HTML format together with the NCSA Mosaic WWW browser.
Compendiums
A Norwegian publisher, Pensumtjeneste AS, has specialized in making small series of
compendiums for university level students. When a compendium is produced in a small
number, the cost per unit of the paper-based compendium is quite high. As a trial, the
Norwegian National Library has therefore agreed to accept deposit of the electronic copy of
compendiums produced in small numbers. The electronic compendiums will be delivered
using ISDN (Integrated Services Digital Network) communication, and they will be
deposited in the Acrobat format.
We plan to establish an index for the electronic compendiums, with references to the Acrobat files. The university libraries will then, instead of getting a paper copy, be given access to the database, with an option of retrieving the Acrobat files for terminal reading. As the complete compendium will be priced lower than the cost of printing the document from the database, a complete printout of a compendium from the database will be of limited interest.
The trial period for this project is one year from July 1st 1994.
No databases have yet been deposited. Our current concern is to get a comprehensive list of the Norwegian databases subject to the Legal Deposit Act. When this list is complete, we will try to start trial projects in this field as well.
Since a lot of databases contain more data than a CD-ROM can hold, one must consider other physical storage media for databases.
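The scale of this limitation is easy to estimate. The following Python sketch computes how many CD-ROMs a database of a given size would occupy, using the figure of approximately 660 MB per disc quoted in the CD-ROM section. The function name and the example sizes are assumptions, and 1 GB is counted as 1,000 MB.

```python
import math

CD_ROM_CAPACITY_MB = 660  # approximate capacity per disc quoted in this article


def discs_needed(database_size_mb: float) -> int:
    """Return the number of CD-ROMs needed to hold a database of this size."""
    return math.ceil(database_size_mb / CD_ROM_CAPACITY_MB)


if __name__ == "__main__":
    # Hypothetical database sizes: 500 MB, 2 GB and 1 TB.
    for size_mb in (500, 2_000, 1_000_000):
        print(f"{size_mb:>9} MB -> {discs_needed(size_mb)} disc(s)")
```

A database of 1 TB, the capacity of a single optical tape unit, would by this estimate be spread over more than 1,500 discs, which illustrates why tape media remain of interest despite their drawbacks.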
In 1994 we selected 20 data programs to be deposited. These were chosen partly from the top ten list of the best-selling Norwegian software, partly on cultural evaluation criteria, and partly under the limitation that the programs should be made for DOS, Windows or Macintosh. In addition to this selection, a lot of data programs are deposited as parts of combined documents. As of July 1994, a total of 240 data programs had been deposited with the National Library [Rustad94b]. All the programs are registered in BIBSYS, the library system used by the National Library for books and paper-based periodicals. A detailed description of the hardware and operating system requirements is given for every data program, and a report giving a complete list of these system requirements will be used as a basis for preserving technology and software able to emulate old system software.
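How such a report might be derived from the registered descriptions can be sketched in a few lines of Python. The record layout, the function name and the sample titles are hypothetical and do not reflect the actual BIBSYS record structure; the sketch only shows the grouping of programs by the platform they require.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DepositedProgram:
    """Hypothetical record: one deposited program and what it needs to run."""
    title: str
    hardware: str
    operating_system: str


def requirements_report(programs) -> dict:
    """Group program titles by the (hardware, operating system) they require."""
    report = defaultdict(list)
    for program in programs:
        platform = (program.hardware, program.operating_system)
        report[platform].append(program.title)
    # Sort platforms and titles so that the report is stable and easy to read.
    return {platform: sorted(titles) for platform, titles in sorted(report.items())}


# Invented sample records for illustration only.
SAMPLE_PROGRAMS = [
    DepositedProgram("Regnskap (hypothetical)", "IBM PC compatible", "DOS"),
    DepositedProgram("Ordbok (hypothetical)", "Macintosh", "System 7"),
    DepositedProgram("Skolekart (hypothetical)", "IBM PC compatible", "DOS"),
]

if __name__ == "__main__":
    for platform, titles in requirements_report(SAMPLE_PROGRAMS).items():
        print(platform, titles)
```

Each key in the resulting report names a platform that must either be preserved or emulated, together with the programs that depend on it.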
Deposited data programs may not be freely distributed, for obvious reasons. We have therefore offered the producers a three-year clause, during which the software will be placed in a locker in our mountain magazine. After this three-year period, we want to be able to give access to the data programs for research and documentation. Even though most software will have lost its commercial interest after three years, it will still be a serious violation of copyright law to make the data programs publicly available after three years.
To preserve the software, we plan to make one or more CD-ROMs each year with all the data programs deposited that year. A report containing an index as well as a description of computer hardware and operating system requirements will also be included on the CD-ROMs as unformatted text with a given character set.
A small but not irrelevant problem is that some data programs have implemented software locks to prevent the user from copying the software more than a given number of times. As this usually is an integrated part of the software, it may not be easy for the producer of the software to remove this lock just for one copy meant for preservation.
5.4 Long term preservation plans
All the CD-ROMs produced for preservation purposes will be stored inside our mountain
magazine. This means that the data will keep its original form and format for at least 100
years.
Since there were few computers around in the year 1884, it is hard to predict what functionality and physical interfaces computers will have in the year 2094. However, an educated guess is that the current DOS 6.2, Windows 3.11, Finder 7.1 or UNIX System V release 4 will not be directly compatible with the future DOS 105.2, Windows 36.2, Finder 27.1.1 or UNIX System X release y, especially taking into account the true 2048-bit processor that is likely to be introduced in 2075. In the same way, it is not very likely that the current SCSI/SCSI-2 interface standards will be compatible with the future SCSI-35 interface.
The likely effect of these educated guesses, is that the preserved CD-ROMs at one point in time, e.g. 5 - 10 years from now, will have to be read into a shabby old computer and then converted into current formats and written to a current storage device. The important thing therefore is to standardize as much as possible on formats and preservation method, to make this inevitable conversion to new formats and storage devices as easy as possible.
As data programs are not easily converted to new computer hardware and operating system platforms, one must closely monitor the ability to run preserved computer programs. To be completely sure, every deposited computer program should be installed and tried upon deposit. In addition, every time one has to rely on new software that is supposed to emulate an old operating system platform, the deposited data programs dependent on the old operating system platform should be reinstalled to ensure compatibility. Whenever there is no software available to emulate an operating system being abandoned, one has to preserve technology able to run the old operating system. In this case the technology should be restarted periodically to ensure its operability.
Obviously, the activities mentioned in the last paragraph will take a lot of resources if they are taken seriously. However, the sad fact is that if we really want to preserve data programs for the future, we have to take these things seriously from the first day after the first data program was deposited.
The Norwegian National Library still does not have any real experience of long-term preservation of electronic material. However, since our Legal Deposit Act includes electronic material, we have discussed for a couple of years how to handle this. To gain experience with acquisition of, giving access to, and archiving and preservation of some categories of electronic material, a number of projects within this field have been initiated.
Our conclusions at this point in time may be summarized as:
Gold-plated writable CD-ROMs are used as storage medium for preservation of electronic texts and data programs. Whenever available, indexing information is stored as unformatted text together with the electronic material. Data is stored using the standardized ISO 9660 format.
Originally unformatted texts are preserved as unformatted with a given character set. Normally the ISO Latin 1 character set will be used.
Formatted texts are converted to the Adobe Acrobat format, and preserved in this format together with Acrobat readers for various operating systems platforms.
One must continuously monitor whether the computer hardware and operating system platforms available are able to interface to the necessary storage devices to read the preserved information. In the same way one must monitor whether the preserved data formats are readable by existing computer equipment, and one must ensure that there exists technology able to run the preserved data programs. Whenever a physical medium, a storage device type, an electrical interface or a data format is being abandoned by the computer and/or software producers, preserved information dependent on the abandoned element should be converted to a new standard and preserved again.
As far as data programs are concerned, such a conversion is not straightforward. For data programs one must therefore ensure that their system requirements can be met either by preserving hardware and necessary system software, or by using software emulating old system platforms.
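The monitoring rule stated above can be sketched as a small dependency register in Python. Each preserved item records the medium, interface and data format it depends on, and a query returns the items affected when any of these elements is abandoned. The record layout, the function name and the sample holdings are hypothetical illustrations, not a description of an existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreservedItem:
    """Hypothetical record of what a preserved item depends on."""
    identifier: str
    medium: str
    interface: str
    data_format: str

    def dependencies(self) -> set:
        return {self.medium, self.interface, self.data_format}


def items_to_convert(items, abandoned) -> list:
    """Return identifiers of items that depend on any abandoned element."""
    abandoned = set(abandoned)
    return [item.identifier for item in items if item.dependencies() & abandoned]


# Invented sample holdings for illustration only.
HOLDINGS = [
    PreservedItem("nonews-1994", "CD-ROM", "SCSI", "unformatted text, ISO Latin 1"),
    PreservedItem("journals-1994", "CD-ROM", "SCSI", "Acrobat"),
    PreservedItem("database-export-1", "DLT", "SCSI", "exchange file"),
]

if __name__ == "__main__":
    # Suppose the Acrobat format and the DLT tape were both being abandoned.
    print(items_to_convert(HOLDINGS, {"Acrobat", "DLT"}))
```

For data programs the same query would identify the items at risk, but the remedy would be preserved hardware or emulation rather than conversion.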
All the electronic material will be given bibliographic records in the same way as more conventional material, using the same library system as for paper-based books and periodicals.
There is obviously still a lot of work to be done within this area, and only the future can tell whether our preliminary conclusions were wise ones.
[Rustad94a] Kjersti Rustad, Vidar Ringstrøm, Electronic Journals, An introduction given at the 4th Nordic ISSN / Union Catalogue meeting, Helsinki, September 1994
[Rustad94b] Kjersti Rustad, Legal Deposit of Computer Documents - Status July 1994 (in Norwegian; original title: Avlevering av EDB-dokumenter - Status per juli 1994), Internal NBR note, Mo i Rana, August 5th, 1994