There is a very interesting relationship between big data and digital preservation. Preservationists work with very similar data types – equally complex, variable, voluminous and sparse. However, preserving big data introduces its own challenges. How much of it do we actually need to hang on to? Some memory institutions, as libraries and museums are collectively known, have started using big data technologies to store and maintain their digital collections and their metadata, some of it in the cloud. This post is based on my presentation at the ARK Group’s recent event on Digital Preservation in the Government Sector.
Digital preservation
Digital preservation is way more involved than you may initially think. Through a series of managed activities you have to ensure continued access to digital materials for as long as necessary or for as long as possible, budgets permitting. It involves the planning, resource allocation, and application of preservation methods and technologies to ensure that digital information of continuing value remains accessible and usable. Accessible means much more than being merely readable – it must retain all the qualities of authenticity, accuracy and functionality deemed to be essential for the purposes the digital material was created and/or acquired for. To make this possible, there is usually a lot of metadata surrounding the digitally preserved data or artefacts. Metadata to describe its contents, its context, as well as how it should be accessed and analysed.
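To make that idea a little more concrete, here is a minimal sketch of what such a surrounding metadata record could look like, expressed as plain Python and JSON. The field names and groupings are purely illustrative and are not taken from any particular preservation standard.

```python
import json
from datetime import datetime, timezone

# Illustrative only: a metadata record accompanying a preserved object,
# grouping descriptive, technical, contextual and preservation information.
record = {
    "identifier": "archive-item-000123",   # hypothetical identifier scheme
    "descriptive": {
        "title": "Annual report 1998",
        "creator": "Department of Example Affairs",
        "language": "en",
    },
    "technical": {
        "format": "application/pdf",
        "size_bytes": 2489301,
        "checksum": "(computed at ingest)",   # fixity information would go here
    },
    "context": {
        "provenance": "Transferred from the departmental records system",
        "access_conditions": "Public",
        "rendering_notes": "Requires a PDF-capable viewer",
    },
    "preservation": {
        "ingest_date": datetime.now(timezone.utc).isoformat(),
        "intended_retention": "permanent",
    },
}

# Serialising it as self-describing JSON keeps the record readable without special tools.
print(json.dumps(record, indent=2))
```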
The goal of digital preservation is to render authenticated content accurately over time. The purpose is to ensure that future scholarship is supported, that government remains accountable, that business and industry operate effectively and efficiently, that civic rights and entitlements are protected, that personal needs for information and creative expression are met, and that a balanced record of society from the late twentieth century onwards endures. That’s quite a tall order!
So who uses digital preservation? Any organisation that needs to keep materials or data accessible for long periods of time: libraries, museums, scientific institutions, research organisations, government departments, custodians of legal and legislative documents, and conventional businesses that have to keep accurate records.
Digital preservationists face many challenges, the two most pressing being the decay of storage media and the constant change in access mechanisms. Take a very simple example – a document created in WordStar not too many years ago is not that easily readable anymore, that is, if you actually managed to read the 3.5 inch “floppy disk” it was stored on. Now imagine you archived a whole application database and the vendor no longer supports that release of the software.
So digital preservation is about a lot more than merely technology. It is about institutional commitment and policy. It is about ongoing management, monitoring and change management to make sure the digitally preserved content remains accessible and meaningful.
Big data and preservation
If we look at the types of data a memory organisation like a library or a museum has to preserve, we find unstructured text, documents in various formats, images, videos, large datasets, geospatial data, complex relationships between these and a lot of metadata to describe them all. Right there I could open one of my presentations on big data and match those data types almost one to one. Only with digital preservation, we have the added requirement that this data must still be accessible in 50 years’ time, and that the context of that data or content should be clear. Big data, by itself, is also introducing new challenges for digital preservationists – especially related to indexing, efficient search and future accessibility.
What is really interesting is that some memory organisations have already progressed quite far in investigating big data platforms as storage alternatives – and we’re not only talking about Hadoop here; there are some very interesting implementations on Cassandra, MongoDB and other NoSQL and graph-oriented data stores as well. If you think about it, apart from Hadoop for sheer size and scalability, a lot of a library’s collections can be managed very well in a NoSQL document database, while the metadata sets that have to be maintained suit a key-value NoSQL database because of their varying structures.
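As a rough illustration of that split, here is a minimal sketch using MongoDB via the pymongo driver, assuming a local instance; the database, collection and field names are hypothetical. The point is simply that items with very different shapes can sit side by side, and that their metadata can be fetched by item identifier, key-value style.

```python
from pymongo import MongoClient

# Assumes a MongoDB instance on localhost; database and collection names are hypothetical.
client = MongoClient("mongodb://localhost:27017/")
db = client["preservation"]

# Two collection items with quite different shapes stored in the same document collection:
# the flexible schema copes with a manuscript and a dataset needing different fields.
db.items.insert_one({
    "_id": "item-000123",
    "type": "manuscript",
    "title": "Letter, 1897",
    "files": [{"path": "item-000123/master.tiff", "format": "image/tiff"}],
})
db.items.insert_one({
    "_id": "item-000456",
    "type": "dataset",
    "title": "Census extract, 1901",
    "files": [{"path": "item-000456/data.csv", "format": "text/csv"}],
    "columns": ["district", "population"],   # extra fields only where they make sense
})

# Metadata retrieved key-value style, by item identifier.
doc = db.items.find_one({"_id": "item-000123"})
print(doc["title"])
```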
A very interesting case study is the Internet Archive’s Wayback Machine, where they attempt to capture every webpage, video, television show, MP3 file or DVD published anywhere in the world. On 6 May 2013 they had 10 petabytes of data stored on a Hadoop file system. Heritrix was developed by the Internet Archive for web crawling and adding content. Nutch, running on Hadoop’s file system, is used for Google-style full-text search. Even more interesting, during a recent government shutdown, all the public government website content remained accessible on the Wayback Machine.
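For anyone curious what that index looks like from the outside, the Internet Archive exposes a CDX query interface over its captures. Below is a small sketch using Python’s requests library; the endpoint and parameters are as I understand them at the time of writing, so check the Internet Archive’s documentation before relying on them.

```python
import requests

# Query the Internet Archive's CDX index for recent captures of a URL.
# Endpoint and parameters as understood at the time of writing; verify against current docs.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": "example.com", "output": "json", "limit": 5},
    timeout=30,
)
resp.raise_for_status()

rows = resp.json()
if rows:
    header, captures = rows[0], rows[1:]   # the first row lists the field names
    for capture in captures:
        record = dict(zip(header, capture))
        print(record["timestamp"], record["original"], record["mimetype"])
```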
BI and preservation
In pure BI implementations, we often advise clients that, once data has been duplicated to the data warehouse and BI environment, they can archive it off their operational systems in order to streamline the transactions running on those systems. However, my research into digital preservation has triggered two thoughts about BI information.
The first is that we generate new insights in the BI environment. For example, segmentation indicators, propensity-to-churn scores and other measures which do not exist in our transactional systems are created by applying analytical models. The same goes for the sentiment and influence scores we generate from big data analyses, often using off-site and public data sets. If we are making business decisions based on those newly generated insights, shouldn’t they be preserved as well?
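As a purely hypothetical sketch of what that could mean in practice, the snippet below preserves a derived score together with the context needed to interpret it later: which model produced it, which version, and what data it was based on. The identifiers, model name and field names are all made up for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical example: a derived BI insight preserved with enough provenance
# to explain, years later, what the score meant and how it was produced.
preserved_insight = {
    "customer_id": "C-10482",                  # illustrative identifier
    "churn_propensity": 0.83,                  # score produced by an analytical model
    "model": {
        "name": "churn_model",                 # hypothetical model name
        "version": "2.4",
        "trained_on": "2013-09-30",
    },
    "inputs": {
        "source_systems": ["crm", "billing"],  # where the underlying data came from
        "snapshot_date": "2013-10-31",
    },
    "generated_at": datetime.now(timezone.utc).isoformat(),
}

# Keeping the score and its context together in one self-describing record
# is one simple way to make the insight preservable.
with open("churn_score_C-10482.json", "w") as f:
    json.dump(preserved_insight, f, indent=2)
```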
Secondly, the many issues related to context and contextual metadata also have me thinking. Organisations make strategic and tactical decisions based on information analysed through data discovery tools or presented through mashups, visualisations and other information presentation paradigms. If they are making crucial decisions based on those presentations, shouldn’t those presentations be preserved as well?
Preservation and the cloud
Cloud-based storage is often touted as a scalable, cost-effective alternative to managing on-premise storage infrastructure. In fact, it caters well for unforeseen requirements. In addition, outsourced cloud storage (and processing) shows up as operational expense on the books instead of capex. However, for long-term preservation you really have to investigate the potential longevity, stability and future accessibility of whatever you have stored there, in addition to the security and privacy protection facilities. In short, you really have to study the fine print. But of course, no matter how good the contracts and SLAs are, they still cannot protect you from a provider going out of business.
I was surprised, however, how many digital preservation initiatives are already running on cloud-housed solutions, varying from infrastructure-as-a-service to even database-as-a-service implementations.
A very interesting case study is Europeana Cloud, the European Union’s flagship digital cultural heritage initiative, which is a cloud-based system for researchers to explore and analyse Europe’s digitised content. To date it contains metadata for over 23 million objects from over 1500 European universities, libraries, data centres and publishers, managed in a cloud-based Apache Cassandra distributed NoSQL database. It uses an OpenStack cloud for its storage and computation layers. They have a target of 30 million items by 2015.
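To give a feel for how such metadata might be modelled in Cassandra, here is a small sketch using the DataStax Python driver against a local node. The keyspace, table and columns are invented for illustration and are not the actual Europeana Cloud schema.

```python
from cassandra.cluster import Cluster

# Assumes a local Cassandra node and the DataStax Python driver (pip install cassandra-driver).
# Keyspace, table and columns are illustrative only, not the Europeana Cloud schema.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS heritage
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS heritage.object_metadata (
        provider     text,
        object_id    text,
        title        text,
        metadata_xml text,
        PRIMARY KEY (provider, object_id)
    )
""")

# Insert one object's metadata record, partitioned by providing institution.
session.execute(
    "INSERT INTO heritage.object_metadata (provider, object_id, title, metadata_xml) "
    "VALUES (%s, %s, %s, %s)",
    ("example-library", "obj-0001", "Digitised medieval psalter", "<record>...</record>"),
)

# All objects from one provider can then be listed with a single partition read.
for row in session.execute(
    "SELECT object_id, title FROM heritage.object_metadata WHERE provider = %s",
    ("example-library",),
):
    print(row.object_id, row.title)

cluster.shutdown()
```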
Concluding remarks
Coming from a strong data management, BI and big data background, when I did my research for this paper I quickly realised that the people in the digital preservation and records management world live and work in quite a similar but parallel universe. Strategy, technology, vast and varied business requirements, vast quantities of very different data, the importance of metadata, governance, change management, efficient search, easy self-service access to information, trends, analytics… it all sounds so familiar! Except in this world the timescales are way, way longer. You may think a 5-year trend line or keeping operational data for 10 to 15 years seems long? How about digitising and preserving documents written by quill on hand-made paper more than a century ago? And even more daunting, it must still be accessible and analysable in 50 years’ time. Big data analytics, eat your heart out!