The project plan
Spring, 2000

Introduction
Aims
Background
Functional description
Service
Results
Conclusion
Notes


1. Introduction

While campaigns are being conducted to conserve the ‘paper heritage’ of newspapers and books, our ‘digital heritage’ is in danger of being lost. By ‘digital heritage’ we do not mean digital data files, whose preservation is now receiving more attention, but web sites: the building blocks of the World Wide Web (WWW). The WWW has been with us since the early 1990s but, as far as we know [Spring, 2000], nowhere in the world has a system been established for archiving web sites, whose form changes so rapidly and which often have a very short life. Much of our digital heritage has already been lost, and this process will continue into the foreseeable future. These lost sources will not be available for future academic research into the ‘virtual’ world of the World Wide Web and its relationship with the ‘real’ world.

In the context of ICT2005, a major funding project at the University of Groningen, the Documentation Centre for Dutch Political Parties (DNPP) and the Groningen University Library, supported by the Contemporary History section and the Journalism degree program of the Faculty of Arts, aim to archive the web sites of Dutch political parties and make the archive available on-line. The DNPP sees the archiving of these digital presentations as a logical extension of its original purpose: to collect, catalogue, and provide access to printed publications by and about political parties. The digital archive will be a valuable resource for journalists and for researchers from many disciplines, including history, sociology, political science and communication science. The project is designed as a pilot study. Its experience and results will contribute to the development of a general model that other institutions can use for archiving web sites.

 

2. Aims

The main aims of the project are to establish a digital archive of web sites of political parties in the Netherlands for scientific purposes (research and education), and to develop a model for digital archiving that can be used by other institutions within and outside the University of Groningen.

The secondary aims of the project are:

    • to develop an archiving standard;
    • to develop a technical method for the digital archiving of web sites;
    • to develop an infrastructure for storing the archived web sites, including version management at server level;
    • to develop a cataloguing structure for listing and providing access to the archived web sites;
    • to identify possible long-term problems relating to the storage and management of the archive and to develop migration strategies;
    • to identify legal problems relating to aspects such as copyright and data protection;
    • to construct a web site for the project.

 

3. Background

The World Wide Web is growing at a phenomenal pace. In March 1998 it was estimated that the Web comprised some 275 million pages, a number that was expected to increase by 20 million every month. On the basis of this estimate, the Web should have reached more than 500 million pages by the summer of 1999. Counted in web sites rather than pages, the current total is approximately four million, and it increases by between 100,000 and 150,000 every month.[1] This expansion is continuing despite the fact that many sites (or parts of them) quickly become unavailable.[2] At the same time, existing sites are constantly changing, in some cases within seconds of being visited.

Although the Web is increasingly becoming part of our daily lives, surprisingly little is being done in the area of archiving. Since the summer of 1996, the Internet Archive in the United States has been active in this field, archiving everything from newsgroups to home pages. The Internet Archive uses ‘web-crawling robots’, programs that open and download entire sites, making it possible to take ‘snapshots’ of the Internet.[3] In 1997, the Royal Library of Sweden launched the Kulturarw3 Project, which aims to archive as much as possible of the Swedish part of the Internet. Two snapshots of this part of the Internet were taken in 1997, and a total of almost 50,000 web sites have been archived. This digital library is not yet accessible to the general public.[4]

In the Netherlands, two institutions are focussing on archiving parts of the Internet. The Royal Library (KB) has set up the Depot of Dutch Electronic Publications (Depot Nederlandse Elektronische Publicaties; DNEP), which is not yet fully operational. The DNEP contains not only off-line digital publications such as CD-ROMs, but also on-line publications such as e-journals, books and articles. Certain web documents can also be stored in the DNEP.[5] In addition, the International Institute of Social History has begun archiving Internet newsgroups that are widely used by action groups and social movements.[6]

This overview is not exhaustive, but it is clear that there are currently no other large-scale initiatives outside the Netherlands. It can therefore be said that the digital archiving of web sites is still very much in its infancy. The projects referred to above are also at an early stage and, except for the Swedish project, none of them focuses specifically on archiving web sites. The disadvantage of the Swedish project is that its approach is rather coarse: it archives as many web sites as possible on only one or two occasions per year, which means that a great deal of information is lost. By contrast, the project of the DNPP and Groningen University Library will involve frequent archiving of a specific, limited category of web sites. A more complete digital archive is not only in line with the documentation task of the DNPP, but also offers more avenues of research.

4. Functional description
    • Archiving standard

    An archiving standard can be based on one of two methods: frequent integral archiving, or continuous archiving of modifications. The first approach involves downloading entire sites at specified times; the second involves recording every modification made to a previously downloaded site and writing it to a log file. It is also possible to use a method that lies between these two extremes.
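The difference between the two methods can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a downloaded site is represented as a mapping from page path to page content; the function names `content_digest` and `log_modifications` and the sample pages are hypothetical and are not part of the project design.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Return a SHA-256 digest used to detect whether a page changed."""
    return hashlib.sha256(content).hexdigest()


def log_modifications(previous: dict, current: dict) -> list:
    """Compare two downloads of one site and list what changed.

    Each argument maps a page path to its content (bytes). The result is
    a list of (action, path) tuples ordered by path: the entries that a
    modification log would receive instead of a complete new copy.
    """
    log = []
    for path in sorted(set(previous) | set(current)):
        if path not in previous:
            log.append(("added", path))
        elif path not in current:
            log.append(("removed", path))
        elif content_digest(previous[path]) != content_digest(current[path]):
            log.append(("modified", path))
    return log


# Integral archiving would keep both complete downloads (hypothetical data).
january = {"/index.html": b"<h1>Party</h1>", "/news.html": b"old news"}
february = {"/index.html": b"<h1>Party</h1>", "/news.html": b"new news",
            "/agenda.html": b"congress"}

# Archiving of modifications would keep the first download plus this log.
changes = log_modifications(january, february)
```

An intermediate method could, for example, store a complete download at longer intervals and only a log of this kind in between.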

      • Technical process

    After the appropriate method has been selected, the technical archiving process must be developed. This involves determining whether off-line web browsers for downloading sites are adequate for the task, or whether new systems are required.
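Whichever tool is chosen, downloading a whole site requires finding the pages that belong to it. The following Python sketch shows one way a purpose-built tool might do this, using only the standard library. It is an illustration under stated assumptions, not the project's method: the names `LinkCollector` and `same_site_links` and the `party.example` addresses are hypothetical, and a real downloader would also have to handle images, frames, scripts and server errors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url: str, html: str) -> list:
    """Return absolute URLs on the same host as page_url, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


# Hypothetical page with two internal links and one external link.
page = ('<a href="/program.html">Program</a> '
        '<a href="http://other.example/">Elsewhere</a> '
        '<a href="news/today.html">News</a>')
links = same_site_links("http://party.example/index.html", page)
```

Here `links` contains only the two addresses on `party.example`; the external link is left out, which keeps the download limited to the site being archived.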

      • Storage

    The web sites, or versions of them, stored in the archive will be made available via the Internet for reference and research purposes. A WWW server with sufficient disk capacity will have to be set up for this purpose. A hard disk will not be sufficient for the final archiving of the material. Instead, the data will be written to CD-ROM with a CD writer, as this medium is more durable. This means that every site available via the Internet will also have an off-line copy. This is not only a valuable back-up facility, but also allows the integrity of the archived on-line site to be monitored.
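One possible way to monitor integrity is to compare checksums of the off-line copy with those of the on-line copy. The Python sketch below assumes this approach; the plan itself does not prescribe a technique. The function names `build_manifest` and `find_discrepancies` and the sample files are hypothetical, and files are represented as in-memory byte strings to keep the example self-contained.

```python
import hashlib


def build_manifest(files: dict) -> dict:
    """Map each file path to the SHA-256 digest of its content."""
    return {path: hashlib.sha256(data).hexdigest()
            for path, data in files.items()}


def find_discrepancies(offline_manifest: dict, online_files: dict) -> dict:
    """Report on-line files that are missing, altered, or unexpected."""
    online_manifest = build_manifest(online_files)
    report = {}
    for path in sorted(set(offline_manifest) | set(online_manifest)):
        if path not in online_manifest:
            report[path] = "missing on-line"
        elif path not in offline_manifest:
            report[path] = "not in off-line copy"
        elif offline_manifest[path] != online_manifest[path]:
            report[path] = "content differs"
    return report


# Manifest recorded when the off-line copy was written (hypothetical data).
offline = build_manifest({"/index.html": b"<h1>Party</h1>",
                          "/news.html": b"news"})

# Later state of the on-line archive: one altered file, one unexpected file.
online = {"/index.html": b"<h1>Party</h1>",
          "/news.html": b"tampered",
          "/extra.html": b"?"}
report = find_discrepancies(offline, online)
```

An empty report would indicate that the on-line archive still matches the off-line copy file for file.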

      • Reference

    The archived sites must be described, catalogued and made available, and a standard is required for this purpose. The digital archive itself must be accessible via a transparent menu structure with the appropriate search facilities. The navigation system must allow for diachronic research (i.e. how a site has developed over time) as well as synchronic analysis (i.e. comparison of different sites during a given period). All archived site pages will have to be marked as such, so that they can be distinguished from the current site and confusion is prevented.
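The two kinds of navigation can be sketched as two queries over a simple catalogue. The Python example below is an illustration only: the `Snapshot` record, its fields, the storage locations and the party names ‘Party A’ and ‘Party B’ are all hypothetical, and the actual cataloguing standard remains to be developed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Snapshot:
    party: str          # whose site was archived
    archived_on: date   # when the copy was made
    location: str       # where the copy is stored (hypothetical path)


def diachronic(catalogue, party):
    """All snapshots of one party's site, oldest first."""
    return sorted((s for s in catalogue if s.party == party),
                  key=lambda s: s.archived_on)


def synchronic(catalogue, start, end):
    """Snapshots of all sites made from start to end inclusive,
    ordered by party and then by date."""
    hits = [s for s in catalogue if start <= s.archived_on <= end]
    return sorted(hits, key=lambda s: (s.party, s.archived_on))


catalogue = [
    Snapshot("Party A", date(2000, 3, 1), "archive/party-a/2000-03-01/"),
    Snapshot("Party B", date(2000, 3, 2), "archive/party-b/2000-03-02/"),
    Snapshot("Party A", date(2000, 2, 1), "archive/party-a/2000-02-01/"),
]

# How one site developed over time.
history = diachronic(catalogue, "Party A")

# Different sites compared within the same period.
march = synchronic(catalogue, date(2000, 3, 1), date(2000, 3, 31))
```

Recording the archiving date with every snapshot is what makes both views possible from the same catalogue.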

      • Migration strategies

    The short lifespan of software and hardware will cause problems in the future with regard to the storage, management and accessibility of archived sites, which must remain accessible even when the hardware and the storage formats for text, audio and animations have become obsolete. This means that the archives will have to be converted periodically to the next generation of software and hardware systems. During this process, the integrity of the digital documents must be protected as far as possible, and effective migration strategies will therefore have to be developed for media and/or formats.

      • Copyright and data protection

    By definition, archiving digital files means copying them, which automatically leads to copyright problems. When asked, a number of political parties indicated that they would be happy to co-operate in the project. Although their permission is essential, it is also important to check whether there are other copyright owners who should be consulted. A web site is, after all, a collection of text, audio and visual materials whose copyright may rest with several owners.

5. Service

    The archiving project will be completed over a period of 24 months and will then be incorporated in the core activities of the DNPP.

    The schedule is as follows:

      • January 2000: project launch
      • January – July 2000: develop archiving standard, technical archiving mechanism, storage and access facilities
      • March – April 2000: set up WWW server and web site
      • May 2000: construct test catalogue
      • throughout project period: select/collect web sites
      • throughout project period: make available/catalogue web sites

     

6. Results

    At the end of the project, an operational system for archiving and cataloguing web sites must be in place, together with a project web site providing access to the digital library. In addition, a policy document relating to copyright issues must be compiled. Reports on the project will be submitted to several professional journals.

7. Conclusion

    In May 1996, the report Preserving Digital Information, compiled by the Commission on Preservation and Access and the Research Libraries Group, was published in the United States. The report advocated the creation of a decentralised network of digital archives with the purpose of collecting digital objects (including web sites), preserving them and making them accessible. The report emphasized the need for a decentralised structure: ‘A distributed structure […] places archival responsibility with those who presumably care most about and have the greatest understanding of the value of particular digital information objects’. The Archipol project is completely in line with that vision. Given its aims and expertise, it is logical that the DNPP should be responsible for archiving the web sites of political parties. The DNPP’s contribution to archiving the web as a whole will be a modest one, but it will play a valuable part in preserving the ‘virtual’ political culture of the Netherlands.

     

8. Notes

1. This estimate is given in: M. Hofstede, ‘Special zoekmachines op Internet’ (Special search engines on the Internet), in: Informatie Professional, 2 (1998), no. 12, 32-35.
2. According to the American Internet archivist B. Kahle, the average on-line lifespan of a web page is 70 days. See M. de Waal, ‘Archiveren Internet bijna onmogelijk’ (Archiving the Internet is virtually impossible), in: De Volkskrant, 30 January 1999.
3. Scientific American, 1997, no. 3; M. Cunningham, ‘Brewster’s Millions’, in: The Irish Times (on-line), 27 January 1997.
4. See A. Arvidson and F. Lettenström, ‘The Kulturarw3 Project – the Swedish Royal Web Archive’, in: The Electronic Library, 16 (1998), 2 (April), 105-108.
5. See T. Noordermeer, ‘Depot van Nederlandse Electronische Publicaties’ (Depot of Dutch Electronic Publications), in: Informatie Professional, 1998, no. 2, 22-24.
6. J. Quast, ‘OCCASIO Digital Social History Archive’, in: Historia & Informatica, 1998, no. 2, 3.