PREMIS Workshop

November 7-8, 2006

Boston

  1. Background and Context

Preservation Metadata Implementation Strategies (PREMIS) developed out of a body of work on preservation:

-NASA’s Open Archival Information System (OAIS)

-OCLC/RLG’s Preservation Metadata Framework

The PREMIS Working Group, sponsored by OCLC and RLG, conducted a survey of institutions and prepared a report in September 2004:

-Implementing Preservation Repositories For Digital Materials: Current Practice and Emerging Trends In The Cultural Heritage Community

which informed the development of PREMIS.

PREMIS is not an off-the-shelf solution, but it moves closer to implementation in addressing preservation requirements than OAIS or the Preservation Metadata Framework. It is intended to be technically neutral so that it can be implemented in a range of systems.

The PREMIS Working Group decided not to include metadata for encoding business rules (i.e., a repository’s preservation policies). The two exceptions are preservationLevel, which was thought to be too important to omit, and significantProperties, which is intended to be flexible and may require further definition.

preservationLevel is a semantic unit that encodes the preservation functions to be applied to an object (i.e., the level of effort that will go into preserving the object). It would most likely apply to a class of objects, whether defined by collection, format, or some other criterion, and could simply take a range of integers as its value.

significantProperties is a semantic unit that encodes the characteristics of an object that a repository plans to preserve. It would most likely be used to encode properties that apply to a particular object or collection, as opposed to properties of some larger class to which that object or collection belongs. To automate processing, a repository might need to define a more granular structure for significantProperties, either by constraining its permissible values or by making it a container with semantic components.
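As a rough illustration of the two options for significantProperties mentioned above (constrained values, or a container with components), a repository might model them as follows. This is a minimal sketch, not something PREMIS prescribes: the class names, the component names property_type and property_value, the vocabulary, and the meaning of the integer levels are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical local vocabulary; PREMIS does not define these values.
ALLOWED_PROPERTY_TYPES = {"colorSpace", "pageCount", "hyperlinks"}


@dataclass(frozen=True)
class SignificantProperty:
    """One significant property, modeled as a container with two components."""
    property_type: str   # assumed component: which characteristic is meant
    property_value: str  # assumed component: the value to be preserved

    def __post_init__(self):
        # Constrain permissible values so that processing can be automated.
        if self.property_type not in ALLOWED_PROPERTY_TYPES:
            raise ValueError(f"unknown property type: {self.property_type}")


@dataclass
class ObjectRecord:
    identifier: str
    preservation_level: int  # integer scale; its meaning is a local convention
    significant_properties: list = field(default_factory=list)


def objects_at_level(records, minimum_level):
    """Select the objects whose preservation level is at least minimum_level."""
    return [r.identifier for r in records if r.preservation_level >= minimum_level]


records = [
    ObjectRecord("obj-1", 1),
    ObjectRecord("obj-2", 3, [SignificantProperty("pageCount", "12")]),
]
print(objects_at_level(records, 2))  # ['obj-2']
```

With a structure like this, preservationLevel can drive class-wide processing while significantProperties stays attached to individual objects.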

PREMIS consultancies:

-Rights issues for digital preservation (Karen Coyle)

-PREMIS implementation guidelines and recommendations (Deborah Woodyard-Robinson)

Reports forthcoming.

  2. PREMIS Data Model

The PREMIS data model is a conceptual tool. It is not a formal entity-relationship model.

Refer to the PREMIS Data Dictionary for a diagram of the model. The arrows in the diagram indicate the direction of reference and the entity on which the reference is recorded.

The relationship between Intellectual Entities and Objects is addressed in the definition of Representation: a set of files, possibly including structural metadata, that, taken together, constitute a complete rendering of an Intellectual Entity.

Object Entity:

A repository need not control objects at all levels (Representation, File, Bitstream). Inheritance is treated as an implementation issue (e.g., inheritance of values for environment from Representation to File to Bitstream).
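Because inheritance is left to the implementation, one possible approach is to record a value at the highest applicable level and resolve it by walking up the hierarchy. The sketch below assumes a simple parent link from Bitstream to File to Representation; the class, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectNode:
    """A Representation, File, or Bitstream; parent is the enclosing object."""
    identifier: str
    category: str                      # "representation", "file", or "bitstream"
    environment: Optional[str] = None  # None means "not recorded at this level"
    parent: Optional["ObjectNode"] = None


def effective_environment(node):
    """Walk up the hierarchy until some level records an environment."""
    current = node
    while current is not None:
        if current.environment is not None:
            return current.environment
        current = current.parent
    return None


rep = ObjectNode("rep-1", "representation", environment="PDF viewer on desktop OS")
pdf = ObjectNode("file-1", "file", parent=rep)
img = ObjectNode("bits-1", "bitstream", environment="JPEG decoder", parent=pdf)

print(effective_environment(pdf))  # PDF viewer on desktop OS (inherited)
print(effective_environment(img))  # JPEG decoder (recorded locally)
```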

Event Entity:

Ingest may be recorded as a sequence of events.

Agent Entity:

Agents may be further defined by external standards.

Rights Entity:

Rights statements are intended to be limited in scope to preservation: Agent X grants permission Y to repository C in regard to Object Z. The copyright status of an object and preservation laws are likely to be the major sources of rights.

  3. PREMIS Semantic Units

Semantic units are not metadata elements. Semantic units may not be in one-to-one correspondence with metadata elements. They may be recorded explicitly or known implicitly. A semantic unit must be recoverable from a digital archiving system, which includes a repository’s business rules, in addition to hardware and software.

objectIdentifiers

-How should their values be built?

-Should there be a naming authority?

-objectIdentifierType may be implicit in a repository but needs to be made explicit for exchange.

significantProperties

-Tied to business rules.

-Possibly format specific.

Relationship of compositionLevel to relationship

-Use compositionLevel for encryption and compression.

-Use relationship for tar files since the component files are standalone.

environment

-May apply at the collection level.

-May be object-specific.

-The National Library of the Netherlands provides what they call view paths – a particular combination of hardware and software – to avoid creating an exhaustive list of all possible combinations of hardware and software. How many view paths should be provided?

relationship

-Should be recorded in one direction.

-Bidirectional relationships create integrity problems in database design.
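One way to follow this advice is to store each relationship once and compute the inverse when it is needed, so the two directions cannot fall out of step. The following is a minimal sketch; the relationship names and the inverse table are a hypothetical local vocabulary, not taken from the Data Dictionary.

```python
from collections import defaultdict

# Hypothetical inverse names for a local vocabulary.
INVERSE = {"isPartOf": "hasPart", "isDerivedFrom": "isSourceOf"}

# Each relationship is stored exactly once as (subject, type, object).
stored = [
    ("file-1", "isPartOf", "rep-1"),
    ("file-2", "isPartOf", "rep-1"),
    ("file-2", "isDerivedFrom", "file-1"),
]


def relationships_for(identifier, triples):
    """Return stored relationships plus inverses computed on demand."""
    result = defaultdict(list)
    for subject, rel_type, obj in triples:
        if subject == identifier:
            result[rel_type].append(obj)
        if obj == identifier:
            result[INVERSE[rel_type]].append(subject)
    return dict(result)


print(relationships_for("rep-1", stored))   # {'hasPart': ['file-1', 'file-2']}
print(relationships_for("file-1", stored))  # {'isPartOf': ['rep-1'], 'isSourceOf': ['file-2']}
```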

  4. Next Steps

Registry development

-Format and environment (PRONOM is a start.)

-Controlled vocabularies

Additional metadata development

-Non-core metadata (e.g., for installation requirements)

-Metadata for business rules

-Integration with other metadata standards (MIX, METS)

Metadata storage issues

-Should metadata be stored with the object or persisted (i.e., extracted)?

-NARA’s approach is to persist only the metadata required for identifying classes (based upon format), which can be used to determine preservation functions.

-Performance vs. storage costs

Business rules

-May pertain to pre-ingest activities.

-Different institutions may follow different practices before ingest, e.g., with respect to normalization.

-How does the work of InterPARES on the evidentiary value of digital objects inform work on PREMIS?

Revision of the Data Dictionary

-Undertaken by the PREMIS Editorial Committee:

-Propose and discuss changes on the PREMIS Implementors’ Group (PIG) discussion list:

-Proposals for changes:

  5. PREMIS XML Schemas

The schemas were created by Jerry McDonough at NYU and are available on the Web:

There is a separate schema for each entity (Object, Event, Agent, Rights) and a container schema. Schemas may be used independently.

Elements that are mandatory in the schema for the Object Entity are semantic units that are mandatory across object categories (Representation, File, Bitstream).

PREMIS and METS issues:

  1. Which METS sections should be used and how many?
  2. Should information be recorded redundantly, e.g., checksums in fileSec (METS) and/or fixity (PREMIS)?
  3. Where should elements that pertain to particular formats be recorded, i.e., in format-specific metadata (e.g., MIX) or in Object Entity (PREMIS)?
  4. Where should structural relationships be recorded, i.e., structMap (METS) or in relationship (PREMIS)?
  5. Controlled vocabularies
  6. Should the PREMIS container be used?

Alternatives for recording PREMIS in METS:

  1. Object -> techMD
     Event -> digiprovMD
     Rights -> rightsMD
     Agent -> with corresponding Event or Rights entity (see below for exception)
  2. All PREMIS metadata -> digiprovMD
  3. All PREMIS metadata -> techMD
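The first alternative can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration under stated assumptions: the PREMIS namespace is a placeholder rather than the real one, the wrapped entities are empty stand-ins, the section IDs are invented, and MDTYPE="OTHER" with OTHERMDTYPE="PREMIS" is one possible way to label the wrapper.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "urn:example:premis"  # placeholder, NOT the real PREMIS namespace

# First alternative from the notes: one METS section type per PREMIS entity.
SECTION_FOR_ENTITY = {"object": "techMD", "rights": "rightsMD", "event": "digiprovMD"}


def wrap(amd_sec, entity, section_id):
    """Add a METS section holding one empty, illustrative PREMIS entity."""
    section = ET.SubElement(
        amd_sec, f"{{{METS_NS}}}{SECTION_FOR_ENTITY[entity]}", ID=section_id
    )
    md_wrap = ET.SubElement(
        section, f"{{{METS_NS}}}mdWrap", MDTYPE="OTHER", OTHERMDTYPE="PREMIS"
    )
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
    ET.SubElement(xml_data, f"{{{PREMIS_NS}}}{entity}")
    return section


amd_sec = ET.Element(f"{{{METS_NS}}}amdSec")
# The METS schema expects techMD, then rightsMD, then digiprovMD inside amdSec.
wrap(amd_sec, "object", "TECH1")
wrap(amd_sec, "rights", "RIGHTS1")
wrap(amd_sec, "event", "PROV1")

print([child.tag.split("}")[1] for child in amd_sec])
# ['techMD', 'rightsMD', 'digiprovMD']
```

The second and third alternatives would use the same wrapper but map every entity to a single section type.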

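METS declares its embedded XML content with lax processing (processContents="lax"), so a METS validator does not necessarily check it. Metadata sections need to be validated separately.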

Redundancy, synchronization, and storage:

If metadata is not persisted but is generated on demand, then synchronization may not be a problem. Metadata may be stored with the object or in a database. METS files may be generated for exchange. NARA is storing metadata with the object. MIT is using 3store, an RDF triple store, to store metadata. A report on the scalability of triple store applications is available on the Web:

Object Entity vs. other technical metadata:

One approach is to record in the Object Entity everything that it can accommodate and to record any remaining metadata in technical metadata for that format, e.g., MIX.

Linking:

METS uses IDs and IDRefs for linking. PREMIS allows multiple objectIdentifiers to be recorded; the semantic unit is repeatable. If only one type of identifier is needed, that would be a reason for using IDs and IDRefs.

Complex linking:

Complex linking may be required if an agent is involved in more than one event. Each Agent Entity and Event Entity should be recorded in a separate digiprovMD section. An Agent Entity can then be linked to multiple Event Entities through the ADMID attributes of files in the fileSec.
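The linking pattern described here can be approximated with plain data structures. In this sketch, which is an assumption about how the pieces fit together rather than a METS implementation, each entity sits in its own section keyed by an ID, and each file carries a space-separated ADMID value that lists the sections that apply to it. All IDs and values are invented.

```python
# Each PREMIS entity sits in its own digiprovMD section, keyed by the section ID.
sections = {
    "AGENT1": {"entity": "agent", "name": "ingest service"},
    "EVENT1": {"entity": "event", "type": "ingest"},
    "EVENT2": {"entity": "event", "type": "fixity check"},
}

# ADMID values as they might appear on file elements in the fileSec.
files = {
    "FILE1": "EVENT1 AGENT1",
    "FILE2": "EVENT2 AGENT1",
}


def events_for_agent(agent_id):
    """Collect events that share a file's ADMID list with the given agent."""
    events = []
    for _file_id, admid in sorted(files.items()):
        ids = admid.split()
        if agent_id not in ids:
            continue
        for section_id in ids:
            is_event = sections[section_id]["entity"] == "event"
            if is_event and section_id not in events:
                events.append(section_id)
    return events


print(events_for_agent("AGENT1"))  # ['EVENT1', 'EVENT2']
```

Note that this kind of link is coarse: if one file lists several agents and several events, co-occurrence in ADMID alone does not say which agent took part in which event.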

Controlled vocabularies:

Starter lists are provided in the Data Dictionary. Institutions have also developed local controlled vocabularies. These would most likely be recorded in METS as OTHERTYPE, etc., but they would need to be shared if PREMIS records are exchanged. There is a need for a central registry.

  6. Implementers’ Panel

6.1 University of Illinois, Urbana-Champaign (Matt Cordial)

Matt’s presentation focused on technical solutions to creating and managing PREMIS metadata across distributed systems. The presentation slides are available on the Web:

Echo Dep is part of UIUC’s National Digital Information Infrastructure and Preservation Program (NDIIPP) project. UIUC’s investigation of PREMIS began as part of an evaluation of repository software for preservation.

The project is a proof of concept implementation. The initial phase of the project has focused upon promoting interoperability among repositories (DSpace, Fedora, Eprints).

Deliverables include a METS profile, APIs, and scripts for creating Submission Information Packages (SIPs) and Dissemination Information Packages (DIPs). The architecture is based upon a hub and spoke model. DIPs are used to transfer metadata from the repository to the hub. SIPs are used to transfer metadata from the hub to the repository. Preservation metadata is collected or generated centrally in the hub but is transformed to be compatible with and disseminated to different repositories. The next phase of the project will focus on the development of Archival Information Packages (AIPs) for persistent storage of preservation metadata.

6.2 Ex Libris (Yaniv Levi)

DigiTool includes PREMIS metadata. Relationships are encoded in the tool, not in the metadata. Metadata and objects are linked.

Priorities for development include automating the creation of PREMIS metadata from JHOVE, increasing support for Agents, and resolving redundancy across metadata schemes (PREMIS, METS, MIX, DC). Redundancy is the biggest problem.

6.3 New York University (Eric Stedfeld)

NYU also has an NDIIPP project. These are the three phases:

  1. Design of architecture and proof of concept implementation (completed August 2006)
  2. Pilot collections including a range of format types
  3. Final implementation

Collections include Afghanistan Digital Library (digital television), text files, etc.

Repository is DSpace with DSpace modules and modules created at NYU. Joseph Pawletko gave a separate presentation on the repository design at the forum:

The modules created at NYU are designed to work with any type of repository software, not just DSpace.

Workflow for the Afghanistan Digital Library:

  1. Digital conversion specialist specifies metadata to be recorded.
  2. Metadata specialist decides where it will go and delivers it to the programmer. This involved conducting a gap analysis by comparing existing metadata to PREMIS and other schemes.

6.4 Yale University (David Gewirtz)

David’s and Gretchen’s slides from IPRES are available on the Web:

Issues:

-Collections repository vs. preservation repository

-Consumers vs. clients

-Possibility of separating interfaces from data store

-Metadata harvesting

-Overlap between features of repository design and object metadata (i.e., overlap between Fedora and METS)

Fedora objects have a functional view and a representation view, which would make it possible to separate the interface from the data store.

To reduce storage and processing costs, assign fixity checks to a larger object that includes the digital object and its associated metadata.
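A single fixity value over the object together with its metadata might be computed along the following lines. This is a sketch using Python's standard hashlib module; the package layout, the file names, and the choice of SHA-256 are assumptions for the example, not details from the presentation.

```python
import hashlib


def package_digest(parts):
    """Compute one SHA-256 digest over an object and its metadata.

    parts maps a name to bytes; names are sorted so the digest is stable.
    """
    digest = hashlib.sha256()
    for name in sorted(parts):
        data = parts[name]
        encoded_name = name.encode("utf-8")
        # Length-prefix each name and payload so boundaries are unambiguous.
        digest.update(len(encoded_name).to_bytes(8, "big"))
        digest.update(encoded_name)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


package = {"image.tif": b"...image bytes...", "premis.xml": b"<object/>"}
stored = package_digest(package)

# A later fixity check recomputes the digest and compares it to the stored one.
print(package_digest(package) == stored)   # True

tampered = {**package, "premis.xml": b"<object>changed</object>"}
print(package_digest(tampered) == stored)  # False
```

The trade-off is that one digest detects a change anywhere in the package but cannot say whether the object or its metadata changed.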

(Matthew also presented separately, but I’m not including a summary.)
