
Dubious liaisons: archiving print and digital media

This was written for Parenthesis. By this time (since 2022) I was also its editor and had this article on my desk for several issues, waiting for a space to show it. When a cyber attack disabled the British Library in October 2023, I thought: now is the moment.

 

If you have ever had the good fortune to wander behind the scenes of almost any archive, you will have seen cardboard boxes discreetly stacked in a dark corner. On a recent visit to the magnificent British Film Institute’s National Archive I asked a conservationist, tactlessly I quickly realised, how long a certain pile of boxes had been there. This was clearly taken as a criticism because it was pointed out to me (unnecessarily I would like to add) that it was not simply a matter of tipping out the contents – someone would have to be allocated to untangle, clean, identify and then catalogue every sheet of paper, ‘and no’, (anticipating my next question) ‘I don’t know when someone will be made available’.

The preservation of books and other printed material, especially at current exponential rates of production, is an ongoing, indeed never-ending, task. For example, as a legal depository, the British Library, which, no doubt, has a constant stack of cardboard boxes or their equivalents waiting to be processed, also has the mammoth task of receiving and storing copies of all newly published books, magazines and newspapers produced in the United Kingdom and Ireland, as well as a significant proportion of overseas titles that are distributed in the UK. These add up to about three million items each year requiring an estimated eight miles (12.8 km) of additional shelving – totalling 488 miles by the end of 2024. 

Gutenberg printed between 160 and 185 copies of his two-volume Bible and, some 570 years later, 48 copies, or substantial portions at least, remain intact. However, the oldest dated printed book, the Diamond Sūtra, discovered in 1907 and held by the British Library, had been sealed in a cave in northwest China where it had lain for 1,150 years. On discovery it was opened and could be read immediately. 

Perversely, when digital technology was in its infancy, it was the permanence of paper that was cited as the main reason for library collections to invest in ‘digital storage and retrieval systems’. Printed paper, it was argued, was cumbersome, arriving in all shapes and sizes, bulky, taking up vast amounts of valuable space, and difficult to access without an impeccable cataloguing system. On top of which, it required careful monitoring of environmental conditions. In contrast, digital files were clean, efficient, instantly accessible, easily maintained and took up almost no space.

Many libraries with major collections of printed matter, be it books, journals, posters or ephemera, set about digitising their collections in a well-intentioned attempt to make their material accessible via the Internet. No one seriously considered that digital copies should replace the ink-on-paper originals, although the more fragile nature of newsprint meant that it was, briefly, considered to be an exception. However, as subsequent commercial imperatives made themselves apparent, many began to question the longer-term cost and viability of committing to a programme of digitising whole collections and, instead, began to concentrate their efforts on a select and manageable number of popular or exceptional items to be made available on their websites.

In the relatively brief time that digital technology has been in existence, its combined functional and physical status has been one requiring compulsory ‘upgrading’. Early on, this might have been motivated by genuine technical advancements, but the unequivocal necessity to ‘upgrade’ regularly has undoubtedly become an integral part of what is a commercially driven obsolescence strategy. Such tactics have long been an essential aspect in the design of so many manufactured products, but digital companies have taken this to new levels of proficiency by ensuring digital documents are rendered inaccessible when ‘external dependencies’ no longer support the software, and vice versa.

Amid this frustrating and costly strategy for users, the announcement in 2004 that Google intended to create a universal, cross-referenced, digital archive that contained every book in existence suddenly projected a renewed gloss and much-needed sense of wonder onto digital technology. The idea that every book ever printed, or certainly all of those still in existence, could be made universally accessible – to everyone who owns a computer or smartphone at least – was extraordinary, and would have been immediately dismissed as naive fantasy had it been proposed by any other company. But the ubiquity of Google, its global reach and almost unlimited funds, meant that the launch of Google Books was met with newly acquired optimism for digital technology as a power for good. 

Google began the project after reaching an initial agreement with 29 major research libraries, including Harvard University, Oxford University, the British Library and the New York Public Library, to begin digitising their book collections on machines provided by Google. Approximately 40 million books had been scanned by 2019 (Market Research Telecast estimates there were about 170 million books worldwide in 2021). However, as John P Wilkin, associate university librarian at the University of Michigan, explained, ‘our program is strong, and we have been able to digitise approximately 5,000 volumes a year; nevertheless, at this rate, it will take us more than a thousand years to digitise our entire collection.’

There was also the issue of what was suitable for scanning. Libraries would only allow books that, in the judgement of their conservationists, would not be adversely affected by the process. All of the libraries in the project have many rare and fragile books and it is only natural that they would choose books that are in a sufficiently robust condition to be scanned first. Many, the Newberry Library in Chicago among them, did not take part because, they explained, their conservation standards would not allow the kind of handling that the Google Books project would require. 

Some two decades later, it is clear that many thousands of the world’s rarer books will never be part of Google Books. This fact, together with major issues concerning copyright (resulting in the majority of more recently published books offering only ‘snippet views’) as well as complaints of occasional ‘missed’ pages or a blurred hand obscuring the text, seems to have dented Google’s ardour. One participating library archivist (who, interestingly, requested anonymity) explained that Google seems no longer to be placing the same priority on Google Books, ‘they have the funds, but their commitment appears to be ebbing’. 

If the idea of digital technology archiving every printed book seems overly ambitious, the idea of deposit libraries attempting to archive the Internet is not merely ambitious but surely futile. Nevertheless, since the 1990s, a significant number of deposit libraries have been acquiring ‘born digital’ material. As the British Library explained when the non-print legal deposit legislation was passed in April 2013: ‘Capturing the unruly, ever-changing Internet is like trying to pin down a raging river. But the British Library is going to try. For centuries the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting [6th April] it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s digital memory’.

However, the current official line is that the British Library is not interested in anything that would not be considered ‘a publication’. I am not sure anyone would be brave enough to try to define what is meant by ‘a publication’, especially in relation to digital technology. Nevertheless, UK legal-deposit laws mean that the British Library is free to choose what it thinks appropriate to archive and does not have to ask for permission.

In the last ten years the size of the BL digital collection has grown to more than six petabytes (six million gigabytes). In the meantime, there has been no corresponding drop in the amount of printed material being collected.

The Web was not designed to preserve its past; instead, it functions in a never-ending present, which inevitably means it must constantly change. This is its purpose: to remain up-to-date. But just as the information it carries is in a constant state of flux, so are the physical arbiters of digital software: our desktop computers, laptops and smartphones. The same applies to the software itself. Far from being the robust, secure, maintenance-free technology it was initially predicted to be, the life-span of both hard- and software is shockingly brief. Archiving the Internet is indeed problematic.

There are three significant issues to be considered when archiving digital material. Firstly, the degrading of digital material and the fragility of digital content mean that preservation actions need to be taken far earlier and, therefore, more frequently than with paper-based collections, advisedly every two years – ad infinitum. Secondly, there is the issue of securing the integrity of digital material. For instance, it is easier for someone to make unnoticed (even accidental) changes to digital files than to paper-based material. However, it is also necessary for archival staff to adapt commands or structural boundaries in order to manage and ensure access is maintained as hardware technology progresses. Someone using contemporary equipment to view and navigate a website as it was designed to be seen and used 30 years earlier will not have the same experience. However, maintaining the original hardware on which to view archived material requires an added layer of expertise, administration and cost. Lastly, the software by which digital material is stored is unstable and quickly deteriorates, sometimes catastrophically. This can be made worse by unsuitable storage conditions (just as it is for books), although today this is rarely the reason for lost material. ‘Bit rot’, ‘data rot’, ‘data decay’, ‘data degradation’ et cetera, happen regardless, with the added problem that such disintegration cannot be detected until a file fails to render correctly, or, more likely, not at all. Failure is inevitable and yet there is no prior warning of when this will occur. Thus, digital preservation is not just a technical challenge; it requires a financial commitment to ensure a never-ending set of procedures – the constant upgrading of both hard- and software – to maintain access.

In its Digital Preservation Strategy (March 2013) the British Library summed up the various problems as follows: ‘Action and intervention is required from before even the point of acquisition in order to properly manage the risks involved in maintaining digital content for the long term. Only through a comprehensive life-cycle approach can these risks be addressed in a consistent and controlled manner. Furthermore, the strategies we implement must be regularly re-assessed: technologies and technical infrastructures will continue to evolve, so preservation solutions may themselves become obsolete if not regularly re-validated in each new technological environment. Only in this way can we ensure that our digital collections remain reliably accessible and authentic for future users in the long term.’

All of this is a reminder that the word ‘immaterial’, commonly used when describing the nature of digital technology, is a misnomer. Computational technologies and processes are all embedded in, and dependent upon, solid manufactured objects: processors, servers, networks, silicon, input/output devices, all of which have unpredictable life spans.

Compare this to the archiving of paper-based material – itself not without difficulties. Different papers degrade at different speeds, and 500-year-old paper is generally in better condition than paper from 50 years ago. Nevertheless, most paper-based material can be handled and read hundreds of years after being printed. Importantly, papers degrade at a predictable rate, and so if damage or deterioration is discovered there is time to analyse the problem, make plans, raise funds, and arrange for repairs or restorative action. The deterioration of paper will always be an issue, but it is understood, calculable and remediable. Digital material just vanishes.

Archives are a testament not only to the fact that printed material undoubtedly occupies a large amount of space but also that it has previously occupied space elsewhere. Whether it is a single sheet that was folded to fit in a coat pocket, a stapled document to be stored in a filing cabinet, or finely bound books that are part of a private library, printed paper is invariably altered by those who have used and kept it. Evidence of past use enriches its cultural importance and provides invaluable information about the time and place of its making, clarifying its purpose by indicating the way it was used. When handling archived printed material, it is often the physical characteristics (size and weight, touch and smell), together with its accumulated signs of handling, that cause the researcher, having previously seen the document only reproduced in a book or on a screen, to re-assess earlier perceptions about its time and place of origin, purpose and function. And its owners.

A sign of wear is evidence of a useful book. Feathered page edges, creased corners and incidental marks provide a certain esteem. But a book’s natural ageing process – what Walter Benjamin described as ‘the historic witness that it bears’ – can only occur if the book was designed to have a long life. While the lightness, flexibility and low cost of paperback books encourage them to be shoved into pockets, rucksacks, and beach bags, the more substantially made fine press book, designed with superior materials for a longer life, will be handled with care, nurtured, kept safe and dry so that future generations might also enjoy it.

The significance of a book’s long-lived physical presence is demonstrated by the fact that book-sellers will, given the opportunity, provide the provenance of a book. This is possible because of the once common practice of owners pasting a personal bookplate onto an inside cover. Even today it is common for an owner to write their name inside a new book. In this and so many other ways, the ageing of a printed book, once one of an edition of many all the same, provides each with an identity all its own. 

Nevertheless, the ability to see and read the content of a rare book held by the British Library on a screen anywhere in the world is something to be grateful for and, encouraged by improved scanning equipment, many major libraries, including the Newberry Library, are now actively increasing the amount of digitised material available via their websites. It might be expected that such increased online facilities would reduce the number of visitors to the reading rooms but, instead, annual numbers at the BL have remained the same: just over 400,000. Additional tables, just large enough to hold a cup of coffee, a notebook and a laptop, continue to spill out from the café in all directions, and I can vouch that finding one unoccupied is all but impossible. Perhaps seeing a book on screen has persuaded many that they must now see the real thing.

Footnote: During the last week of October, shortly after I had finished this article, the British Library suffered a crippling cyber attack. On the 9th November the Legal and Contracts Services made a public announcement, headed ‘Technology Outage’:

The British Library is experiencing a major technology outage, as a result of what has now been identified as a cyber-attack. The outage is affecting our website, online systems and services, as well as some on-site services including our Reading Rooms and public wifi. We expect to restore many of our services soon, although some disruption is likely to continue for several weeks.

In response to the attack, we have taken protective measures to ensure the integrity of our systems, and we are undertaking a forensic investigation with the support of the National Cyber Security Centre (NCSC) and cyber security specialists. 

Two months later, the online BL catalogue is still offline.