The project plan
Spring, 2000
1. Introduction
While campaigns are being conducted to conserve the ‘paper heritage’
of newspapers and books, our ‘digital heritage’ is in danger of being
lost. By ‘digital heritage’ we do not mean digital data files, the
preservation of which is now receiving more attention, but web sites:
the building blocks of the World Wide Web (WWW). The WWW has been with us
since the early 1990s but, as far as we know [Spring 2000], nowhere
in the world has a system been established for archiving web sites, whose
form changes so rapidly and which often have a very short life. Much of
our digital heritage has already been lost, and this process will continue
into the foreseeable future. These lost sources will not be available for
future academic research into the ‘virtual’ world of the World Wide
Web and its relationship with the ‘real’ world.
In the context of ICT2005, a major funding project at the University of
Groningen, the Documentation Centre for Dutch Political Parties (DNPP) and
the Groningen University Library, supported by the Contemporary History
section and the Journalism degree program of the Faculty of Arts, aim to
archive the web sites of Dutch political parties and make the archive
available on-line. The DNPP sees the archiving of these digital
presentations as a logical extension of its original purpose: to collect,
catalogue, and provide access to printed publications by and about
political parties. The digital archive will be a valuable resource for
journalists and for researchers from many disciplines, including history,
sociology, political science and communication science. The project is
designed as a pilot study. The experience and results of the project will
contribute to the development of a general model that can be used by other
institutions for archiving web sites.
2. Aims
The main aims of the project are to establish a digital archive of web
sites of political parties in the Netherlands for academic purposes
(research and education), and to develop a model for digital archiving
that can be used by other institutions within and outside the University
of Groningen.
The secondary aims of the project are:
- to develop an archiving standard;
- to develop a technical method for the digital archiving of web
sites;
- to develop an infrastructure for storing the archived web sites,
including version management at server level;
- to develop a cataloguing structure for listing and providing
access to the archived web sites;
- to identify possible long-term problems relating to the storage
and management of the archive and to develop migration strategies;
- to identify legal problems relating to aspects such as copyright
and data protection;
- to construct a web site for the project.
3. Background
The World Wide Web is growing at a phenomenal pace. In March 1998 it
was estimated that the Web comprised some 275 million pages, a number
which was expected to increase by 20 million every month. On the basis of
this estimate, the Web should have reached more than 500 million pages by
the summer of 1999. Counted in web sites rather than pages, the current
total is approximately four million, and it increases by between 100,000
and 150,000 every month.[1] This expansion is continuing despite the fact
that many sites (or parts of them) quickly become unavailable.[2] At the
same time, existing sites are constantly changing, in some cases within
seconds of being visited.
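The page-count projection above can be checked with a few lines of
arithmetic. The sketch below is only an illustration: it assumes strictly
linear growth and reads ‘the summer of 1999’ as June 1999, neither of which
is stated in the estimate itself. The function names are hypothetical.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Number of whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def projected_pages(base_pages: int, monthly_growth: int,
                    start: date, end: date) -> int:
    """Linear projection: base count plus a fixed increase per month."""
    return base_pages + monthly_growth * months_between(start, end)


# Figures from the text: 275 million pages in March 1998, +20 million a month.
# Assumption: 'summer of 1999' is taken to be June 1999.
estimate = projected_pages(275_000_000, 20_000_000,
                           date(1998, 3, 1), date(1999, 6, 1))
print(f"{estimate:,}")  # 575,000,000
```

Under these assumptions the projection comes to 575 million pages, which is
consistent with the ‘more than 500 million’ mentioned above.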
Although the Web is increasingly becoming part of our daily lives,
surprisingly little is being done in the area of archiving. Since the
summer of 1996, the Internet Archive in the United States has been active
in this field, archiving everything from newsgroups to home pages. The
Internet Archive uses ‘web-crawling robots’: programs that open and
download entire sites, making it possible to take ‘snapshots’ of the
Internet.[3]
In 1997, the Royal Library of Sweden launched the Kulturarw3
Project, the aim of which is to archive as much as possible of the Swedish
part of the Internet. Two snapshots were taken of this section of the
Internet in 1997 and a total of almost 50,000 web sites have been archived.
This digital library is not yet accessible to the general public.[4]
In the Netherlands, two institutions are focusing on archiving parts
of the Internet. The Royal Library (KB) has set up the Depot of Dutch
Electronic Publications (Depot Nederlandse Elektronische Publicaties;
DNEP), which is not yet fully operational. The DNEP contains not only
off-line digital publications such as CD-ROMs, but also on-line
publications such as e-journals, books and articles. Certain web documents
can also be stored in the DNEP.[5] In addition, the International
Institute of Social History has begun archiving Internet newsgroups that
are widely used by action groups and social movements.[6]
This overview is not exhaustive, but it is clear that large-scale
initiatives are currently scarce, both in the Netherlands and abroad. It
can therefore be said that digital archiving of web sites is still very much
in its infancy. The projects referred to above are also at an early stage
and, except for the Swedish project, none of them focuses specifically on
archiving web sites. The disadvantage of the Swedish project is that its
approach is rather coarse: it archives as many web sites as possible only
once or twice a year, which means that a great deal of information is
lost. By contrast, the project of the DNPP and Groningen
University Library will involve frequent archiving of a specific, limited
category of web sites. A more complete digital archive is not only in line
with the documentation task of the DNPP, but also offers more avenues of
research.
4. Functional description
Archiving standards can be based on two methods: frequent
integral archiving and continued archiving of modifications. The
first approach involves downloading entire sites at specified
times and the second approach involves copying all the
modifications made to a downloaded site and writing them to a
log file. It is also possible to use a method that lies between
these two extremes.
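The difference between the two methods can be sketched in a few lines. The
example below is a simplified illustration, not the project's method: it
assumes a site can be modelled as a mapping from page path to page content,
and the names diff_sites and apply_changes are hypothetical. Integral
archiving stores each mapping in full; archiving of modifications stores one
full copy plus a log of changes from which later versions can be rebuilt.

```python
from typing import Dict, List, Optional, Tuple

Site = Dict[str, str]                # page path -> page content (assumed model)
Change = Tuple[str, Optional[str]]   # (path, new content); None means "deleted"


def diff_sites(old: Site, new: Site) -> List[Change]:
    """Build the modification log that turns the old version into the new one."""
    changes: List[Change] = []
    for path in sorted(new):
        if old.get(path) != new[path]:      # page added or altered
            changes.append((path, new[path]))
    for path in sorted(old):
        if path not in new:                 # page removed
            changes.append((path, None))
    return changes


def apply_changes(base: Site, changes: List[Change]) -> Site:
    """Rebuild a later version from a stored full copy and a modification log."""
    site = dict(base)
    for path, content in changes:
        if content is None:
            site.pop(path, None)
        else:
            site[path] = content
    return site


# Two hypothetical versions of the same site.
v1 = {"/index.html": "Welcome", "/news.html": "Old news"}
v2 = {"/index.html": "Welcome", "/news.html": "Fresh news",
      "/agenda.html": "Party congress"}

log = diff_sites(v1, v2)             # only the two changed pages are logged
assert apply_changes(v1, log) == v2  # the full version can be reconstructed
```

The log is smaller than a second full copy, but every later version depends
on the chain of earlier logs being intact, which is one reason a method
between the two extremes may be attractive.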
After the appropriate method has been selected, the technical
archiving process must be developed. This involves determining
whether off-line web browsers for downloading sites are adequate
for the task, or whether new systems are required.
The web sites or versions of them stored in the archive will
be made available via the Internet for reference and research
purposes. A WWW server with sufficient drive capacity will have
to be set up for this purpose. A hard disc will not be
sufficient for the final archiving of the material. Instead, the
data will be written to CD-ROM with a CD burner, as this medium is
more durable. As a result, an off-line copy will exist of every
archived site that is available via the Internet. This
is not only a valuable back-up facility, but also allows the
integrity of the archived on-line site to be monitored.
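One way such integrity monitoring could work is to compare checksums of the
off-line copy with those of the on-line copy. The sketch below is an
assumption, not a description of the project's system: the plan does not
name a method, and SHA-256, the directory layout and the function names are
chosen here purely for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def manifest(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 digest of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_discrepancies(reference: Dict[str, str],
                       online: Dict[str, str]) -> List[str]:
    """Return the paths that are missing, extra or altered in the on-line copy."""
    paths = set(reference) | set(online)
    return sorted(p for p in paths if reference.get(p) != online.get(p))
```

With the off-line copy mounted at one directory and the server copy at
another, find_discrepancies(manifest(offline_dir), manifest(server_dir))
returns an empty list as long as the two copies are identical.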
The archived sites must be described, catalogued and made
available, and a standard is required for this purpose. The
digital archive itself must be accessible via a transparent menu
structure with the appropriate search facilities. The navigation
system must allow for diachronic research (i.e. how a site has
developed over time) as well as synchronic analysis (i.e.
comparison of different sites during a given period). All
archived site pages will have to be marked as such, so that they
can be distinguished from the current site and in order to
prevent confusion.
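The two kinds of navigation described above (one site followed over time,
and several sites compared within one period) amount to two queries on the
same catalogue. The following sketch assumes a minimal catalogue record;
the Snapshot fields, the site identifiers and the function names are
hypothetical and do not represent the cataloguing standard still to be
developed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Snapshot:
    site: str        # hypothetical site identifier
    captured: date   # date on which the version was archived
    location: str    # where the archived copy is stored


def history(catalogue: Iterable[Snapshot], site: str) -> List[Snapshot]:
    """One site over time: all versions of a site, oldest first."""
    return sorted((s for s in catalogue if s.site == site),
                  key=lambda s: s.captured)


def cross_section(catalogue: Iterable[Snapshot],
                  start: date, end: date) -> List[Snapshot]:
    """Several sites in one period: all versions captured from start to end."""
    return sorted((s for s in catalogue if start <= s.captured <= end),
                  key=lambda s: (s.site, s.captured))


catalogue = [
    Snapshot("party-a", date(2000, 3, 1), "archive/party-a/2000-03-01/"),
    Snapshot("party-b", date(2000, 3, 2), "archive/party-b/2000-03-02/"),
    Snapshot("party-a", date(2000, 5, 1), "archive/party-a/2000-05-01/"),
]

print([s.captured.isoformat() for s in history(catalogue, "party-a")])
print([s.site for s in cross_section(catalogue, date(2000, 3, 1), date(2000, 3, 31))])
```

Keeping the capture date in every record is what makes both queries
possible; the same date could also be used to label archived pages so that
they are not mistaken for the current site.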
The short lifespan of software and hardware will cause
problems in the future with regard to the storage, management
and accessibility of archived sites, which must remain
accessible even when hardware and storage formats for text,
audio and animations have become obsolete. This means that the
archives will have to be periodically converted to the next
generation of software and hardware systems. During this process,
the integrity of the digital documents must be protected as far
as possible, and effective migration strategies will therefore
have to be developed for storage media and/or file formats.
- Copyright and data protection
By definition, archiving digital files means copying them,
which automatically leads to copyright problems. When asked, a
number of political parties indicated that they would be happy
to co-operate in the project. Although their permission is
essential, it is also important to check whether there are other
copyright owners who should be consulted. A web site is, after
all, a collection of text, audio and visual materials whose
copyright may rest with several owners.
5. Service
The archiving project will be completed over a period of 24 months
and will then be incorporated in the core activities of the DNPP.
The schedule is as follows:
- January 2000: project launch
- January – July 2000: develop archiving standard, technical
archiving mechanism, storage and access facilities
- March – April 2000: set up WWW server and web site
- May 2000: construct test catalogue
- throughout project period: select/collect web sites
- throughout project period: make available/catalogue web sites
6. Results
At the end of the project, an operational system for archiving and
cataloguing web sites must be in place, together with a project web site
providing access to the digital library. In addition, a policy document
relating to copyright issues must be compiled. Reports on the project
will be submitted to several professional journals.
7. Conclusion
In May 1996, the report ‘Preserving Digital Information’,
compiled by the Commission on Preservation and Access and the Research
Libraries Group, was published in the United States. The report
advocated the creation of a decentralised network of digital archives
with the purpose of collecting digital objects (including web sites),
preserving them and making them accessible. The report emphasized the
need for a decentralised structure: ‘A distributed structure […]
places archival responsibility with those who presumably care most about
and have the greatest understanding of the value of particular digital
information objects’. The Archipol project is completely in line with
that vision. Given its aims and expertise, it is logical that the DNPP
should be responsible for archiving the web sites of political parties.
The DNPP’s contribution to archiving the web as a whole will be a
modest one, but it will play a valuable part in preserving the ‘virtual’
political culture of the Netherlands.
8. Notes
- 1. This estimate is given in: M. Hofstede, ‘Special zoekmachines op
Internet’ (Special search engines on the Internet), in: Informatie
Professional, 2 (1998), no. 12, 32-35.
- 2. According to the American Internet archivist B. Kahle, the average
on-line lifespan of a web page is 70 days. See M. de Waal, ‘Archiveren
Internet bijna onmogelijk’ (Archiving the Internet is virtually
impossible), in: De Volkskrant, 30 January 1999.
- 3. Scientific American, 1997, no. 3; M. Cunningham, ‘Brewster’s
Millions’, in: The Irish Times (on-line), 27 January 1997.
- 4. See A. Arvidson and F. Lettenström, ‘The Kulturarw3 Project – the
Swedish Royal Web Archive’, in: The Electronic Library, 16 (1998), no. 2
(April), 105-108.
- 5. See T. Noordermeer, ‘Depot van Nederlandse Electronische
Publicaties’ (Depot of Dutch Electronic Publications), in: Informatie
Professional, 1998, no. 2, 22-24.
- 6. J. Quast, ‘OCCASIO Digital Social History Archive’, in: Historia &
Informatica, 1998, no. 2, 3.