A strategic view of document and digital asset management for the University of the Witwatersrand, Johannesburg

Proposal to Strategic ICT Board

Prof Derek Keats

Deputy Vice Chancellor
Knowledge and Information Management

'Digital documents last forever or five years, whichever comes first' – Jeff Rothenberg

Introduction

Documents comprise one form of asset that needs to be managed in an organisation through digital means. Other assets include videos (e.g. Wits TV's archive), artwork (e.g. Wits art gallery, rock art collection), other image assets (e.g. photographs, informational images), sound (e.g. research interviews, recorded lectures), presentations (e.g. conference slide presentations), etc. All of these asset or object types have the same sets of processes associated with them in terms of how they create value for the institution, and how they are preserved in the long term (Illustration 1). In addition, all are capable of being subjected to the social and semantic elements of 21st Century technology in ways that both add to their usefulness today and capture and carry forward elements of their use and the conversations that happen around them.

As a subset of digital asset management, document management often carries with it both immediate workflow requirements and long-term preservation requirements. This is particularly true when our processes increasingly produce documents that are born in digital form, and where we need to leave a legacy to the future for both institutional governance and historical reasons.

This document is divided into the following parts:

  • why preservation is essential;
  • the opportunities for synergy;
  • the physical layer – the Wits private cloud;
  • the application layer and its technologies;
  • the agile approach;
  • initial projects.

This document only reflects on the digital asset management component (Illustration 2); it does not include digitization or the creation of documents, although both of these will be developed within the proof of concept projects. Furthermore, digitization facilities exist in the library, and could be expanded to provide university-wide facilities supporting other projects, but that is to be considered as part of the larger archiving strategy.


Why preservation is essential

To understand why preservation is essential, we need to understand the nature of digital documents, and how they are essentially the same as any other digital asset. We also need to understand the threats to digital assets and how an underlying digital object repository will help us address the preservation aspects of digital asset management. Although these observations are fairly obvious, they are often overlooked when focusing on 'document management' per se.

What are documents

It may seem obvious what documents are, but in the digital world this is not quite as simple as it sounds. We are used to thinking of documents as one thing, based on paper documents, but with digital documents, there are at least four different 'views' of a document (Illustration 3). Taking the slide presentation made to the 'stakeholder group' as an example, the view that we most often think of is the operational view – the one most people see (Illustration 3). This would be the closest we come to having a paper document in digital form, but the only characteristic it bears in common with its printed counterpart is its visual perceptibility in that form. Digital documents and other assets are represented on the computer as binary patterns on a storage medium – the storage view (Illustration 3). That is, unlike a physical document, a digital document is separate from its storage medium: the medium is not part of the document itself. The other views, which are likewise separate from storage, are the manipulation view (the view in which the document is edited) and the structural view (the way in which the elements that make up the document are organized and themselves represented within the file format of the document).

The capability of all of these views being represented in perceptible form is dependent on the existence of compatible hardware and software (Illustration 4). For a variety of reasons, there are differential risks that we may not be able to access a particular view after some time. Furthermore, the collective risks increase over time. Digital documents and other digital assets differ in this respect from information represented on physical media such as stone, vellum, papyrus and paper. While we can still read assets from several thousand years ago, the digital information created today is in serious danger of being lost after only a decade, creating a digitally-induced 'Dark Age'. Such a 'Dark Age' is a serious threat to our institutional memory, as well as to our ability to leave a legacy for future historians to study.

Digital preservation

To ensure that the digital assets which we create and store today are available in the future, we will have to cater for risks that arise from a number of threats including physical deterioration of storage media (sometimes called bitrot), accidental damage when moving devices or files around among devices, loss of the metadata that enables us to understand and use a particular asset, and digital obsolescence of devices and file formats (Illustration 5).

Digital preservation is the management of digital assets over time in the face of these threats. The constant input of effort, time, and money required to keep pace with rapid technological and organizational change is considered the main stumbling block for preserving digital information. Digital preservation is facilitated if there is a digital object repository underlying the storage of digital assets, and the ability to apply rules to the assets we store in such a repository to prevent their deterioration and loss.
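One common rule of this kind is a fixity check: a checksum is recorded when an asset is deposited and recomputed periodically, so that silent corruption of the stored bits can be detected. The following is a minimal sketch only, assuming assets are plain files on disk and that a manifest of SHA-256 digests was recorded at deposit time. The function names and the manifest layout are hypothetical and are not taken from any particular repository product.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_report(manifest: Dict[str, str], root: Path) -> Dict[str, str]:
    """Compare digests recorded at deposit time with the files as they are now.

    `manifest` maps a file name (relative to `root`) to its recorded digest.
    The result maps each name to "ok", "changed" or "missing".
    """
    report = {}
    for relative_name, recorded_digest in manifest.items():
        asset = root / relative_name
        if not asset.is_file():
            report[relative_name] = "missing"
        elif sha256_of(asset) == recorded_digest:
            report[relative_name] = "ok"
        else:
            report[relative_name] = "changed"
    return report
```

A check like this only detects damage. Repair still depends on having a second, intact copy of the asset elsewhere.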

We will return to some of these concepts later in this concept paper.

Opportunities for synergy

Because all digital assets are represented the same way, and have similar processes associated with them (Illustration 1), there are many opportunities for synergy. Some of the needs for digital asset management arise from the following (examples only, there are more):

  • Institutional papers and documents
      ◦ Legal contracts
      ◦ Student records
      ◦ Committee documents
      ◦ Email correspondence that relates to decisions
  • Library collections
      ◦ Historical papers
      ◦ Africana
      ◦ Other collections
  • Various history projects
  • Rock art collections
  • Fossils and other 3-dimensional objects
  • Video and audio collections
      ◦ e.g. Wits TV
  • Donations of significant collections from industry
  • History of human evolution research
  • Research output and theses
      ◦ Institutional repository
      ◦ ETD system
      ◦ Conference presentations
  • Research data

Some of these areas need conversion of their processes to digital ones, and some have extensive born-analogue materials that need digitisation. By pooling the needs of these areas, and creating technical infrastructure that can support all of them, we have the opportunity to do more, and also to be innovative in the way in which we approach our digital assets. All of them require some or all of the processes outlined in Illustration 1. By taking advantage of projects that we are already doing, together with the skills that we have built and are continuing to build, we can also achieve synergy among the people involved in carrying them out.

The Physical layer – the Wits private cloud

During 2009 we had a strong engagement with the Preservation and Archiving Special Interest Group (PASIG), chaired by Michael Keller, University Librarian & Director of Academic Information Resources at Stanford University. Through PASIG, we realised that we had a unique opportunity at Wits to create a technical infrastructure based on a private cloud that would serve many of our operational needs while at the same time creating the infrastructure that we need for digital asset management. Through PASIG we were able to collaborate with Sun Microsystems (now Oracle) engineers to design and implement the private cloud technology by the end of 2009[1].

The private cloud (Illustration 6) consists of two overlapping components, the compute cloud and the storage cloud. These separate components are transparent to applications and users because applications run within a virtual machine layer and access storage and computing capacity independently of the individual components. The three-tiered, or hierarchical, storage (when completed) optimizes storage costs based on frequency of file access, with complete transparency to the user accessing a file. Frequently accessed files reside in the most expensive storage, flash memory; less frequently accessed files reside on spinning disks (hard drives similar to what you have in your desktop computer); and the least frequently accessed files are stored on tape and automatically loaded to one of the other tiers when required by an application or user.
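The tiering rule described above can be illustrated with a small sketch. It is a toy model only: the thresholds, class and function names are hypothetical (this proposal gives no figures), and a real hierarchical storage manager makes these decisions inside the storage layer rather than in application code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, in accesses per month; not taken from the proposal.
FLASH_THRESHOLD = 100
DISK_THRESHOLD = 1


@dataclass
class StoredFile:
    name: str
    accesses_per_month: int
    tier: str = "disk"


def choose_tier(accesses_per_month: int) -> str:
    """Pick the cheapest tier that still suits the access frequency."""
    if accesses_per_month >= FLASH_THRESHOLD:
        return "flash"
    if accesses_per_month >= DISK_THRESHOLD:
        return "disk"
    return "tape"


def rebalance(files: List[StoredFile]) -> None:
    """Move each file to the tier its access frequency calls for."""
    for item in files:
        item.tier = choose_tier(item.accesses_per_month)


def open_file(item: StoredFile) -> str:
    """Simulate a transparent read: a file on tape is first staged to disk."""
    if item.tier == "tape":
        item.tier = "disk"
    item.accesses_per_month += 1
    return item.name
```

The caller of open_file never names a tier, which is the sense in which the tiering is transparent to the user.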

The ability to provision virtual machines provides for servers that are not physical servers, but are actually software 'processes' running within the larger cloud infrastructure. This means a lower maintenance overhead, a shorter time to deploy, and better use of expensive computing resources than one can achieve with standalone servers.

This is important because smaller document management projects often purchase small servers with limited CPU power, limited storage, and no optimization of compute or storage capability at all. This increases overall costs, reduces efficiency, and increases maintenance overhead. It also makes effective digital preservation difficult, costly or even impossible.

Hence, the Wits private cloud is the basis for the physical layer for the management of digital assets.

The application layer and its technologies

The application layer consists of a number of components, some – but seldom all – of which are included in advanced document management systems. These can in turn be thought of as layers or components in the overall system. Here we are referring primarily to the back-end systems that provide the functionality needed for digital asset management, as well as limited front-end applications. The application layer as described here does not cover ingest tools for born-analogue materials, as those tools are highly specific to the asset type and associated workflow. In addition, for some materials, outsourcing the digitisation is cheaper than doing it in-house.

Component: Digital object repository

Application: Fedora Commons

Overview: Fedora Commons provides an open source digital asset management core, upon which many types of digital library, institutional repository, and digital archive systems are built. Fedora provides the underlying software system for a digital repository, and is not a complete management, indexing, discovery, and delivery application in and of itself. Other layers or components are necessary to integrate with it for those purposes. Being highly specialised for its intended purpose, Fedora is suited for acting as a repository back end for any type of content.

Fedora Commons provides a general-purpose management layer for digital objects based on content models that represent data objects (units of content) or collections of data objects. The objects contain linked datastreams (content files, representations of content files, information about content files), as well as behaviours that are themselves code objects that provide bindings or links to disseminators (software processes that can be used with the datastreams). Content models provide XML (eXtensible Markup Language) representations that can be thought of as containers that give a useful shape to information placed into them; if the information fits the container, it can immediately be used in predefined ways. This is a very powerful way to provide for the deposit of digital objects and the independent definition of behaviours associated with them. It makes Fedora Commons a suitable back end for all of our digital asset management requirements and provides the basis for preservation activities, although it is not itself a preservation system. Fedora Commons runs in a Java servlet container, such as Apache Tomcat.
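The 'container' idea can be sketched in a few lines. This is an illustration of the concept only and is not the Fedora Commons API: the class names, the "SLIDES" datastream identifier and the describe_slides behaviour are all hypothetical, and in Fedora itself content models are expressed in XML rather than in application code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Datastream:
    mime_type: str
    content: bytes


@dataclass
class DigitalObject:
    pid: str  # persistent identifier of the object
    datastreams: Dict[str, Datastream] = field(default_factory=dict)


@dataclass
class ContentModel:
    name: str
    required: Dict[str, str]  # datastream ID -> expected MIME type
    behaviours: Dict[str, Callable[[DigitalObject], str]] = field(default_factory=dict)

    def fits(self, obj: DigitalObject) -> bool:
        """True if the object has every required datastream with the right type."""
        return all(
            ds_id in obj.datastreams and obj.datastreams[ds_id].mime_type == mime
            for ds_id, mime in self.required.items()
        )

    def disseminate(self, obj: DigitalObject, behaviour: str) -> str:
        """Run a predefined behaviour, but only on objects that fit the model."""
        if not self.fits(obj):
            raise ValueError(f"{obj.pid} does not fit content model {self.name}")
        return self.behaviours[behaviour](obj)


def describe_slides(obj: DigitalObject) -> str:
    """Hypothetical behaviour: summarise the slide datastream."""
    return f"{obj.pid}: {len(obj.datastreams['SLIDES'].content)} bytes of slides"


presentation_model = ContentModel(
    name="presentation",
    required={"DC": "text/xml", "SLIDES": "application/pdf"},
    behaviours={"describe": describe_slides},
)
```

An object that carries both datastreams can be used with the behaviour immediately, while one that lacks the slides is rejected. This is the sense in which information that fits the container can be used in predefined ways.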

The Fedora Project is currently supported by grants from the Andrew W. Mellon Foundation and the Gordon and Betty Moore Foundation, and is directed by Sandy Payette from Cornell University and Thornton Staples from the University of Virginia. A number of companies around the world provide technical support for Fedora Commons, and there are a number of commercial products built on top of it by companies such as ExLibris. Over 100 universities and other large organisations make use of Fedora Commons for digital repositories. It is also being used as the repository for the Wits CMS project.

Component: Ingest of documents to the repository

Application: SWORD

Overview: SWORD (Simple Web-service Offering Repository Deposit) is a lightweight protocol for depositing content from one location to another using the Atom Publishing Protocol (known as APP or ATOMPUB). The SWORD vision is 'lowering the barriers to deposit', principally for depositing content (any content!) into repositories, but potentially for depositing into any system which wants to receive content from remote sources. SWORD has been funded by the Joint Information Systems Committee, UK (JISC) because a large number of UK universities use Fedora Commons and other compatible repositories on which SWORD operates.
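Because SWORD is a profile of the Atom Publishing Protocol, a deposit is in essence an HTTP POST of a file to a collection URL. The sketch below builds such a request without sending it. It is hedged in several ways: the header names (Content-Disposition with a bare filename, X-Packaging, X-No-Op) follow our reading of the SWORD 1.3 profile and should be checked against the specification, the function names are hypothetical, and authentication is omitted.

```python
import urllib.request
from typing import Dict, Optional


def sword_deposit_headers(
    filename: str,
    content_type: str,
    packaging: Optional[str] = None,
    dry_run: bool = False,
) -> Dict[str, str]:
    """Assemble HTTP headers for a deposit (names assumed from SWORD 1.3)."""
    headers = {
        "Content-Type": content_type,
        "Content-Disposition": "filename=" + filename,
    }
    if packaging is not None:
        # Identifier of the packaging format the repository should expect.
        headers["X-Packaging"] = packaging
    if dry_run:
        # Ask the server to validate the deposit without storing it.
        headers["X-No-Op"] = "true"
    return headers


def build_deposit_request(
    collection_url: str, payload: bytes, headers: Dict[str, str]
) -> urllib.request.Request:
    """Create (but do not send) the POST request for a deposit."""
    return urllib.request.Request(
        collection_url, data=payload, headers=headers, method="POST"
    )
```

Passing the resulting request to urllib.request.urlopen would perform the deposit against a real collection URL. A SWORD server is expected to answer with an Atom entry describing what was deposited.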

In the short term (weeks), we may use EasyDeposit, a SWORD deposit tool creator from the University of Auckland Library, to create deposit interfaces for particular asset types. EasyDeposit is currently funded and supported by JISC, and allows deposit of resources into repositories powered by platforms such as DSpace, Eprints, Fedora Commons, IntraLibrary, and Zentity.

In the longer term (months), given the nature of SWORD, integrating and extending the protocol while leveraging the power of WEWE, HAL and Chisimba would better fit the current architecture and design. This approach would meet the exact needs of the University, and further the product base of South African initiatives.

Component: Desktop ingest

Application: Microsoft Word

Overview: Microsoft Research have developed a plug-in for Word that allows direct deposit to a Fedora Commons repository from within the word processor using the SWORD API. Specifically, this add-in for Word enables authors and editors to save Word files in the National Library of Medicine's NLM DTD (article or book) format, which is used for publishing and archiving. In addition, the add-in enables more metadata to be captured and stored at the authoring stage and enables semantic information to be preserved through the publishing process, which is essential for enabling search and semantic analysis once the articles are archived within information repositories. The plug-in is available from Microsoft's open source Codeplex site. Plug-ins for OpenOffice are under development at the University of Southampton, and will probably be available for use in the next few weeks.