RDA Repository Platforms for Research Data Interest Group

Use Case: SURFsara Trusted Digital Repository
Author(s): Hans van Piggelen, SURFsara

This is a use case description of the “Repository Platforms for Research Data” IG. While points 1, 2 and 3 aim at a general description/overview of the use case, point 4 is meant to list the requirements.

  1. Scientific Motivation and Outcomes

The Trusted Digital Repository (working title; currently in development) of SURFsara aims to deliver a generic, large-scale data set publication and sharing platform to any researcher, collaboration group or research community. The repository supports the full data cycle in terms of data storage, distribution and publication by allowing annotation, private sharing, citation, curation, and long-term archival and preservation. The focus is on the later stages of the data cycle. Future versions will also support data processing and online visualisation. In principle, the service’s processes are automated and do not require human assistance.
Our main use case for the development of the service is the CosmoGrid[1] data set. The most important characteristics are:

-  Over 400,000 files with a combined size of over 71 TB

-  Multi-resolution data set in over 550 separate snapshots

-  Stored in a tape archive; data are not online by default

Other use cases are:

-  Massive automated direct ingest of existing data sets and metadata by file reference

-  Automated file and metadata validation

-  Embargo and private sharing of data sets

-  Full control by community administrators over data sets and metadata that belong to their community

-  Self-defined and stackable metadata schemas for any data set (a sketch of schema stacking follows this list)

-  Relationships between data sets and collections using PIDs
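
To illustrate how stackable schemas could compose, the following is a minimal sketch in Python. It assumes schemas can be reduced to flat sets of required field names; `SchemaLayer`, `validate_stack` and the example field names are hypothetical assumptions, not the actual SURFsara implementation.

```python
# Minimal sketch of "stackable" metadata schemas: the complete metadata of a
# data set is validated against several schema layers at once, e.g. a
# repository base schema plus a community-defined schema. All names here
# (SchemaLayer, validate_stack, the field names) are illustrative assumptions.

from dataclasses import dataclass, field


@dataclass
class SchemaLayer:
    """One metadata schema in the stack, reduced to a set of required fields."""
    name: str
    required_fields: set = field(default_factory=set)

    def validate(self, metadata: dict) -> list:
        """Return the validation errors this layer finds in a metadata record."""
        missing = self.required_fields - metadata.keys()
        return [f"{self.name}: missing field '{f}'" for f in sorted(missing)]


def validate_stack(layers: list, metadata: dict) -> list:
    """Validate one metadata record against every layer in the stack."""
    errors = []
    for layer in layers:
        errors.extend(layer.validate(metadata))
    return errors


# Repository and community each contribute one layer of the stack.
stack = [
    SchemaLayer("repository-base", {"title", "creator", "pid"}),
    SchemaLayer("community-cosmogrid", {"snapshot", "resolution"}),
]
record = {"title": "CosmoGrid snapshot 042", "creator": "...", "pid": "..."}
print(validate_stack(stack, record))
# -> ["community-cosmogrid: missing field 'resolution'",
#     "community-cosmogrid: missing field 'snapshot'"]
```

Stacking this way lets the repository, the community and the depositor each own one layer of a record's metadata without interfering with the others.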

  2. Functional Description

If possible, give one diagram/picture that indicates the overall structure/architecture you have/envision. Describe in simple words the functioning of the use case/system.

A simplified overview of the service is given in Figure 1.

Figure 1. Schematic overview of the service.

Functionality:

-  Communities and individual researchers can ingest data sets of any size. Data set annotation is mandatory. Quality control is automated as far as possible. PIDs are automatically assigned to metadata, data sets, collections and files.

-  Community administrators can manage their community’s data sets, metadata and metadata schemas.

-  Data sets that have not been requested for a specific period will be put offline by transferring the files to tape (if not already present there); a sketch of this rule follows this list.

-  Data sets currently stored in the long-term archive will be disclosed by ingesting them through an automated process with specific metadata schemas, in collaboration with the corresponding researchers.
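
The offline rule above can be illustrated with a minimal sketch. The 90-day retention period, the `DataSet` record layout and the `stage_to_tape` call are assumptions for illustration only; the real policy and archiver interface may differ.

```python
# Sketch of the cache-flushing rule: data sets not requested within RETENTION
# lose their online copy; sets not yet on tape are staged there first.
# RETENTION, DataSet and stage_to_tape() are assumptions for illustration.

from dataclasses import dataclass
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # the "specific period" is policy-dependent


@dataclass
class DataSet:
    pid: str
    last_requested: datetime
    on_tape: bool
    online: bool


def stage_to_tape(ds: DataSet) -> None:
    """Stand-in for the real archiver call that replicates a data set to tape."""
    print(f"staging {ds.pid} to the tape archive")


def flush_cache(datasets: list, now: datetime = None) -> None:
    """Move stale data sets offline, staging them to tape where needed."""
    now = now or datetime.now()
    for ds in datasets:
        if ds.online and now - ds.last_requested > RETENTION:
            if not ds.on_tape:
                stage_to_tape(ds)
                ds.on_tape = True
            ds.online = False  # drop the online cache copy; tape copy remains


ds = DataSet("21.T12345/example", datetime(2015, 1, 1), on_tape=False, online=True)
flush_cache([ds])
print(ds.online, ds.on_tape)  # -> False True
```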

  3. Achieved Results

Describe results (if applicable) that have been achieved compared to the original motivation. What requirements could not be fulfilled, and how did this influence the outcomes?

Because the service is still in development, several functions have not been implemented yet. Currently, all digital objects are assigned a PID upon creation or ingest. Massive automated data ingest scripts have been developed, which allow simultaneous creation of data sets and metadata in the repository system. Specific metadata schemas can be defined and selected for data sets upon deposition.
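
As an illustration of ingest by file reference, the sketch below registers archive paths and metadata as a repository record without moving any data. `mint_pid`, the Handle-style prefix and the record layout are hypothetical; the actual SURFsara scripts are not shown here.

```python
# Sketch of massive ingest by file reference: each record registers the
# archive locations of its files and gets PIDs minted, without copying data.
# mint_pid(), the Handle-style prefix and the record layout are assumptions.

import uuid


def mint_pid() -> str:
    """Stand-in for a real PID service (e.g. a Handle/ePIC registration)."""
    return f"21.T12345/{uuid.uuid4()}"  # hypothetical prefix


def ingest_by_reference(title: str, metadata: dict, file_refs: list) -> dict:
    """Create one data set record that points at files already in the archive."""
    return {
        "pid": mint_pid(),
        "title": title,
        "metadata": metadata,
        # One PID per file, referencing the existing archive path.
        "files": [{"pid": mint_pid(), "archive_path": p} for p in file_refs],
    }


record = ingest_by_reference(
    "CosmoGrid snapshot 042",
    {"snapshot": 42, "resolution": "high"},
    ["/archive/cosmogrid/snap042/part0000.dat"],
)
# A real ingest script would now POST this record to the repository API and
# validate the file checksums against the archive copies.
print(record["pid"], len(record["files"]))
```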

  4. Requirements

Describe the requirements, their motivation from your use case and how you rate their importance. The descriptions don't have to be comprehensive.

Each requirement is rated for importance from 1 (very important) to 5 (not at all important).

Requirement: Data annotation with specific metadata schemas
Description: All data sets are annotated with metadata defined in (community-defined or project-specific) metadata schemas.
Motivation: To allow community-specific metadata schemas and to improve discoverability.
Importance: 1

Requirement: Stackable metadata schemas
Description: The complete metadata set of a data set can be formed by multiple metadata schemas.
Motivation: To let the repository, the community and the depositor each control part of a data set's metadata at the same time.
Importance: 1

Requirement: Assignment of PIDs
Description: Every digital object is assigned a unique PID.
Motivation: To make all objects referable and to establish relationships between them.
Importance: 1

Requirement: (Dynamic) PID collections
Description: Define PID collections in complex hierarchies and allow dynamic collections by querying databases (see the sketch after this table).
Motivation: To improve file and data set organisation and presentation, and to allow subset downloading.
Importance: 1

Requirement: Massive automated data set ingest and registration
Description: Data sets currently stored in the archive, consisting of very large numbers of files and large total volume, need automated processes for registering file locations and metadata.
Motivation: To efficiently ingest large amounts of data and files, including their annotation.
Importance: 1

Requirement: Community-driven administration and management
Description: Community administrators can control the data sets, set metadata schemas and define (access) policies for each data set ingested under their name.
Motivation: To improve quality control of data sets and publications by communities.
Importance: 2

Requirement: Periodic online cache flushing
Description: The system cleans up the online data cache to accommodate more large-scale data sets, by transferring newly uploaded data sets to tape and removing not-in-use data sets already replicated to tape.
Motivation: To allow multiple large-scale data sets to be available for download.
Importance: 2
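
As a sketch of the dynamic-collection idea, membership can be recomputed on demand from a query over record metadata, so newly ingested matching records join the collection automatically. All names below are illustrative assumptions.

```python
# Sketch of a dynamic PID collection: membership is a query over record
# metadata, recomputed on demand, so new matching records join automatically.
# The records and the predicate below are illustrative.

def dynamic_collection(records: list, predicate) -> list:
    """Return the PIDs of all records currently matching the predicate."""
    return [r["pid"] for r in records if predicate(r["metadata"])]


records = [
    {"pid": "21.T12345/aa", "metadata": {"snapshot": 10, "resolution": "low"}},
    {"pid": "21.T12345/bb", "metadata": {"snapshot": 42, "resolution": "high"}},
]

# A sub-collection of all high-resolution snapshots, e.g. to offer the
# subset downloading motivated above.
high_res = dynamic_collection(records, lambda m: m["resolution"] == "high")
print(high_res)  # -> ['21.T12345/bb']
```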


[1] DOI: 10.1088/0004-637X/767/2/146