Title / Virtual Collection Registry 1.0 work plan
Version / 1
Author(s) / Twan Goosen (CLARIN ERIC)
Date / 2014-06-11
Status / Draft
Distribution / Public
ID / CE-2014-0354

1  Planning

Period: June to August 2014 (weeks 23 to 35). Planned activities and milestones:
Development plan
Alpha (1.0-alpha) @ RZG
Video conference[1]
Public beta (1.0-beta) @ IDS
Stable release (1.0)

2  Work

The Virtual Collection Registry (VCR) consists of a single web application with a number of facets:

·  A database-driven store of virtual collections with programmatic access to these collections, including assignment of persistent identifiers (PIDs) to published items

·  A web-based user interface that allows for the browsing of public virtual collections and for the creation, maintenance and publication of private virtual collections

·  A REST interface exposing the registry for reading and writing (the latter requires authentication)

·  An OAI-PMH endpoint that provides the virtual collection metadata
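As an illustration of how a harvester might address the OAI-PMH endpoint listed above, the sketch below builds standard OAI-PMH request URLs. The base URL is a placeholder, since this plan does not state where the endpoint is deployed; `oai_dc` is used because the OAI-PMH specification requires every repository to support it, and any additional metadata formats offered by the VCR are not assumed here.

```python
from urllib.parse import urlencode

# Placeholder base URL (assumption): the actual VCR endpoint location
# is not given in this work plan.
BASE_URL = "https://example.org/vcr/oai"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL for the given verb and arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


# Ask the repository to describe itself.
identify_url = oai_request_url(BASE_URL, "Identify")

# Harvest all records in the mandatory Dublin Core format.
list_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(list_url)
```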

As of May 2014 a number of versions of the Virtual Collection Registry have been made available. All previous work on the VCR has been done by Oliver Schonefeld at the Institute of German Language (IDS), most of it between January 2010 and October 2011.

The current development cycle aims to deliver a stable version of the VCR that can be used by the community and integrated into the broader CLARIN infrastructure. The following sections describe the aspects of the VCR that will be addressed within its scope.

2.1  Authorisation of user interface and REST service

The custom authentication/authorisation implementations will be replaced by container-managed security (for local testing) and by the SHHAA filter[2] (for ‘Shibbolised’ server environments). The REST service will be Shibboleth-enabled as well, so that external applications can trigger authorised operations such as the creation of a new collection (see section 2.2). The server-side configuration required to enable this functionality will be investigated.

2.2  Infrastructure integration

A call will be added to the REST service that accepts HTML form posts (content type application/x-www-form-urlencoded), providing an easy way for external applications, such as the Virtual Language Observatory (VLO), to trigger the creation of a new collection. Using this facility, other applications can prepare a form specifying a set of resources (or, alternatively, a query definition for ‘intensional’ collections), which the client can then submit to the VCR REST service in an authenticated session (still to be initiated at that point).
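A minimal sketch of what such a form post could look like from the side of an external application is given below. The endpoint URL and the field names (`name`, `resourceUri`) are hypothetical, since the call has not been specified yet; authentication is left out entirely. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint (assumption): the actual path of the new REST call
# is still to be defined.
CREATE_URL = "https://example.org/vcr/service/submit"


def build_create_request(name, resource_uris):
    """Build (but do not send) a form-encoded POST for a new collection."""
    # Hypothetical field names; one 'resourceUri' field per resource.
    fields = [("name", name)]
    fields += [("resourceUri", uri) for uri in resource_uris]
    body = urlencode(fields).encode("ascii")
    return Request(
        CREATE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_create_request(
    "Sign language corpora",
    ["https://example.org/resource/1", "https://example.org/resource/2"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Repeating the same field name for each resource keeps the form expressible as a plain HTML form, which is what allows a browser to submit it on behalf of the user.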

2.3  PID assignment via EPIC2

The current implementation of the VCR supports version 1 of the EPIC API as currently available through GWDG[3], but this service should be considered deprecated. As a replacement, the VCR will be made to support EPIC API version 2[4] and will be configured to use a handle prefix specific to CLARIN ERIC (11372).
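To illustrate the role of the prefix, the sketch below composes a handle from the CLARIN ERIC prefix and a suffix, and derives the URL under which the global Handle System proxy would resolve it. The suffix scheme (`VC-` followed by a UUID) is purely an assumption for the example; how the VCR or the EPIC service will actually mint suffixes is not stated in this plan, and no EPIC API call is made here.

```python
import uuid

CLARIN_PREFIX = "11372"  # handle prefix named in this plan
HANDLE_RESOLVER = "https://hdl.handle.net/"  # global Handle System proxy


def make_handle(suffix, prefix=CLARIN_PREFIX):
    """Combine prefix and suffix into a handle of the form prefix/suffix."""
    return f"{prefix}/{suffix}"


def resolver_url(handle):
    """Return the URL at which the handle proxy resolves the handle."""
    return HANDLE_RESOLVER + handle


# Hypothetical suffix scheme (assumption): 'VC-' plus a random UUID.
handle = make_handle("VC-" + str(uuid.uuid4()).upper())
print(handle)
print(resolver_url(handle))
```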

2.4  Release of CMDI metadata

There exists a profile ‘cmdi-virtual-collection’[5] in the CMDI Component Registry. However, the CMDI produced by the VCR is not valid with respect to this profile. This will be fixed. In addition, the profile will be assessed and an adapted version will be created if necessary. Compatibility with the DataCite schema[6] (on the level of convertibility to the minimal set) is considered desirable and therefore a factor in this assessment.

2.5  User guidance and documentation

A number of aspects of the virtual collection model are non-trivial and call for some guidance for the casual (and especially the novice) user. This could be achieved relatively easily with in-application explanatory notes and field descriptions. An ‘offline’ user guide would also be very useful, but is not an initial necessity.

A number of high-quality and diverse sample collections would help users who prefer to learn by example. Candidates to consider for such sample collections:

·  CLARIN related papers at LREC 2014

·  Sign language corpora

·  Trobriand resources as an alternative to the TLA archive sub-tree[7]

2.6  Code base: maintenance and quality

The code base is of sufficient quality but may benefit from some maintenance, primarily updating the libraries used and simplifying and modularizing parts of the implementation. Moreover, long-term stability and maintainability would benefit from increased unit test coverage. However, functional, usability and integration aspects should take higher priority at this point.

3  Future work

We identified a number of desirable potential additions to the VCR that are, however, beyond the scope of the development phase described in this document. For future reference, a partial list of such extensions is provided below:

·  Optional inclusion of an ORCID for collection creators

·  Integration with DataCite in the form of (optional) publication to the DataCite Metadata Store and assignment of DOIs to collections

·  Extended user management features:

o  Transfer of ownership

o  User groups allowing for collaboration on virtual collections

[1] To discuss process as well as functional and technical aspects in reference to the alpha release

[2] https://aai2.rzg.mpg.de/secure/shhaa/site/

[3] http://handle.gwdg.de:8080/pidservice/

[4] http://www.pidconsortium.eu/, http://epic.gwdg.de/wiki/index.php/EPIC:API

[5] http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:p_1271859438175

[6] http://schema.datacite.org/

[7] http://www.mpi.nl/trobriand