Task Force: Service for Reference Data and Controlled Vocabularies

Controlled Vocabularies and Reference Data / Concept Registry

Matej Durco, 2013-03-25, Draft – currently being circulated within CLARIN SCCTC

changes 2013-03-25: Updated diagrams and descriptions, updated info about OpenSKOS,
added references to TextGrid, eSciDoc/CoNE, LT-World

The urgent need for reliable community-shared registry services for concepts, controlled vocabularies and reference data[1] for the Digital Humanities community has been discussed on many occasions in various contexts.

As there is a substantial overlap in the vocabularies of the various DH communities and even more so a high potential for reusability on the technical level, there is a strong case for tight cooperation between different initiatives.

This short memo summarizes the latest developments on this issue within the two infrastructure projects CLARIN and DARIAH.

DARIAH taskforce: Service for Reference Data and Controlled Vocabularies

This taskforce was introduced at the 2nd VCC Meeting in Vienna in November 2012. It is conceived as a collaborative endeavor between VCC1/Task 5: Data federation and interoperability and VCC3/Task 3: Reference Data Registries (and external partners). The main goal is to establish a service providing controlled vocabularies and reference data for the DARIAH (and CLARIN) community.

The service is primarily meant to serve other applications rather than to be used directly by end users, but a basic user interface will still be necessary for administration etc. Applications and tasks requiring or profiting from this kind of service include Data Enrichment / Annotation, Metadata Generation, Curation, Data Analysis, etc. By using global semantic identifiers instead of strings, such a service enables the harmonization of metadata descriptions and annotations and is an indispensable step towards semantic data and LOD.

Besides providing vocabularies, the service should also hold and expose equivalences (and other relationships) between concepts from different vocabularies (concept schemes). These relationships come primarily from existing mappings, but can (and hopefully will) subsequently be created manually for specific subsets, on demand, in a community process. An example of equivalences from Wikipedia (page for Johann Wolfgang Goethe):

GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065 | Wikipedia-Personensuche
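A minimal sketch of how such an equivalence set could be held and queried, using the identifiers from the example above. The class name and the (vocabulary, id) key shape are illustrative assumptions, not part of any existing service.

```python
# Hypothetical in-memory store for equivalences between authority identifiers.
from typing import Dict, Set, Tuple

Identifier = Tuple[str, str]  # (vocabulary, local id), e.g. ("GND", "118540238")


class EquivalenceStore:
    """Groups identifiers from different vocabularies that denote the same entity."""

    def __init__(self) -> None:
        self._group_of: Dict[Identifier, Set[Identifier]] = {}

    def add_equivalents(self, *identifiers: Identifier) -> None:
        # Merge the new identifiers with any groups they already belong to.
        merged: Set[Identifier] = set(identifiers)
        for ident in identifiers:
            merged |= self._group_of.get(ident, set())
        for ident in merged:
            self._group_of[ident] = merged

    def equivalents(self, identifier: Identifier) -> Set[Identifier]:
        # Return all other identifiers known to denote the same entity.
        return self._group_of.get(identifier, set()) - {identifier}


store = EquivalenceStore()
# Identifiers for Johann Wolfgang Goethe, taken from the example above.
store.add_equivalents(
    ("GND", "118540238"),
    ("LCCN", "n79003362"),
    ("NDL", "00441109"),
    ("VIAF", "24602065"),
)
print(sorted(store.equivalents(("GND", "118540238"))))
```

In a real service the same grouping would more likely be expressed as mapping relations between concepts (for instance SKOS mapping properties) than as an in-memory dictionary.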

Regarding the responsibilities of the DARIAH working groups:

  • VCC3/Task 3 identifies and recommends vocabularies relevant for the community.
  • VCC1/Task 5 provides basic/generic services relevant for the whole community. In particular, the Schema Registry, which allows mappings between different schemas to be expressed, seems to be one starting point.
  • In accordance with the VCC1 strategy, the work should concentrate on pooling existing resources and implement only the "glue" necessary to put the pieces together (data conversion, service wrappers, ...).

CLARIN: ISOcat and CLAVAS

For data categories, the ISO-standardized Data Category Registry ISOcat is maintained by the MPI for Psycholinguistics, Nijmegen. It is dedicated mainly to linguistic concepts, but could be used for other communities as well (perhaps in a separate instance). Recently, the so-called RELcat has been added, which allows mappings between data categories to be expressed.

While ISOcat has been in productive use for some time, it is, by design, not usable for all kinds of reference data. In general, it is well suited to defining concepts/data categories (with closed or open concept domains), but its complex data model and standardization workflow do not lend themselves well to maintaining “semi-closed” concept domains, i.e. controlled vocabularies such as lists of entities (e.g. organizations or authors). In such cases, the concept domain is not closed (new entities need to be added), but it is also not open (not every string is a valid entity). Moreover, the domain may be very large (millions of entities) and has to be presumed to change continually (especially through new entities being added).

This shortcoming creates a need for an additional registry/repository service for this kind of data (controlled vocabularies). One activity tackling this issue is the CLARIN-NL project (or taskforce) CLAVAS: Vocabulary Alignment Service for CLARIN, where the plan is to reuse and enhance for CLARIN's needs the SKOS-based[2] vocabulary repository and editor OpenSKOS[3], developed within the CATCHplus project (by a commercial company, but released as open source).

Currently, the Meertens Institute[4] of the Royal Netherlands Academy of Arts and Sciences (KNAW) and the Netherlands Institute for Sound and Vision[5] are each running an instance of OpenSKOS. Since the work on this vocabulary repository started in the context of a cultural heritage program, it originally served vocabularies not directly relevant for the LRT community, such as GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven or AAT - Art & Architecture Thesaurus. As part of the adaptation to the needs of CLARIN and the LRT community, data categories from ISOcat have been converted into SKOS format and ingested into the system. In the coming weeks, the Meertens Institute also plans to publish the list of language codes, and experiments are being conducted with organization names for the domain of language resources. The CLARIN Centre Vienna is also running an instance of the OpenSKOS system with ISOcat data.[6]

One important feature of the OpenSKOS system is its distributed nature. It allows individual instances to synchronize the vocabularies they maintain with each other via the OAI-PMH protocol. This makes for a reliable, redundant system, as multiple instances would provide identical synchronized data, while the primary responsibility for individual vocabularies could lie with different instances/organizations according to their specialization and field of expertise.
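To illustrate the synchronization mechanism, the following sketch builds the request URLs a harvesting instance would send to a peer. The verb and argument names (ListRecords, metadataPrefix, from, resumptionToken) come from the OAI-PMH specification; the base URL is a placeholder, and the choice of oai_dc as default prefix is an assumption, since this memo does not state which metadata formats an OpenSKOS instance exposes.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for incremental harvesting."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it continues
        # a previous, incomplete list and must not be combined with others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Only records created or changed since this date are returned.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Placeholder endpoint; a real peer instance would publish its own base URL.
PEER = "https://vocab.example.org/oai-pmh"
print(list_records_url(PEER, from_date="2013-03-01"))
```

A harvester would repeat the request with each returned resumption token until the peer sends none, and remember the date of its last successful run for the next from argument.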

Most recently, the Standing Committee on CLARIN Technical Centers (SCCTC) started discussing the issue of Controlled Vocabularies and Concept Registries as one of the infrastructural (A-center) services to be dealt with.

Candidate vocabularies, taxonomies

Based on popular demand, the work on reference data should cover at least the following dimensions (with tentative denominations of corresponding existing vocabularies):

  • Data Categories / Concepts - ISOcat
  • Languages - ISO-639
  • Countries - country codes
  • Persons - GND, VIAF, dbpedia?
  • Organizations - GND, VIAF, dbpedia?
  • Schlagwörter/Subjects - GND, LCSH
  • Resource Typology -
  • [VCC3 - please add more or link to appropriate findings in your Task]

AAT - Art & Architecture Thesaurus
GND - Gemeinsame Normdatei (Integrated Authority File)
GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
VIAF - Virtual International Authority File

Other related relevant activities and initiatives

One obvious candidate for the LRT community is LT-World[7]; however, this is already a full-blown ontology with People, Projects, Organisations, Events and Language Resources, so the integration would have to happen at another level (RDF/LOD).

At MPDL, within the eSciDoc[8] publication platform, there seems to be (work on) a service for controlled vocabularies (since 2009!):

Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities – developed at the New Zealand Electronic Text Centre (NZETC).

There has been work on controlled vocabularies within TextGrid[9][10], but the current status is unclear. [TODO: check status]

A broader collection of related initiatives can be found at the German National Library website:

FRBR - Functional Requirements for Bibliographic Records
RDA - Resource Description and Access
- Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)

The following diagrams depict the relationships between the individual software components and pieces of data involved in the issue (basically in the context of the CLARIN infrastructure).

Figure 1 Relationship between modules of CMDI[11]

Figure 2 Expressing data categories in SKOS

The semantic proximity of a /data category/ to a /concept/ may invite a naïve approach to mapping DCR to SKOS, namely mapping every data category (from one profile) to a concept, all of them belonging to the ISOcat-profile:ConceptScheme. However, this is neither practical nor useful: ISOcat as a whole is too disparate, and so would be the resulting vocabulary.

A more sensible approach is to export only closed DCs as separate ConceptSchemes and their respective simple DCs as Concepts within that scheme. The rationale is that, if we see a vocabulary as a set of possible values for a field/element/attribute, complex DCs in ISOcat are the users of such vocabularies and simple DCs are the DCR equivalent of the values in such a vocabulary.

Figure 3 Linking between Schemas, data categories and vocabularies

With an extension (tentatively: <clavas:vocab>), the conceptualDomain of data categories can be suggested within ISOcat. However, the actual binding restriction is defined only in the schema. There, a given element can be concept-linked to a data category, but the allowed values can be restricted by another vocabulary, by different means (e.g. a regular expression), or not at all. When a vocabulary is closed (stable), it can be integrated into a schema as an enumeration (cf. the ISO-639 component in CMD). Open vocabularies, however, cannot be strictly integrated into the schema; they can only be referenced as an annotation that client applications would have to learn to recognize and process (i.e. be made vocabulary-aware).
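The three ways of restricting an element's values mentioned above (a closed vocabulary as enumeration, a regular expression, or no restriction) can be pictured as interchangeable validators. This is a conceptual sketch only; the function names and sample values are assumptions and do not reflect how CMDI schemas are actually processed.

```python
import re
from typing import Callable, FrozenSet

Validator = Callable[[str], bool]


def enumeration(values: FrozenSet[str]) -> Validator:
    # Closed vocabulary: the value must be one of a fixed set of entries.
    return lambda value: value in values


def pattern(regex: str) -> Validator:
    # Restriction by other means: the whole value must match the expression.
    compiled = re.compile(regex)
    return lambda value: compiled.fullmatch(value) is not None


def unrestricted() -> Validator:
    # No restriction: any string is accepted.
    return lambda value: True


# Illustrative bindings of schema elements to restrictions.
language_code = enumeration(frozenset({"deu", "eng", "nld"}))
date_value = pattern(r"\d{4}-\d{2}-\d{2}")
free_text = unrestricted()

print(language_code("deu"), date_value("2013-03-25"), free_text("anything"))
```

An open vocabulary fits none of these cases exactly: its entries change over time, so a vocabulary-aware client would have to consult the vocabulary service at authoring time instead of relying on a fixed set compiled into the schema.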

Client applications can use information about data categories from ISOcat to explain fields to users, and the Vocabulary Service to provide users with suggest/autocomplete functionality when authoring structured data such as metadata or annotations.

In generic terms, the main function of the Vocabulary Service is find(), returning matching concepts for search terms. In the actual implementation, the API of OpenSKOS[12] provides two more specialized functions: /find-concepts and /autocomplete.
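The generic find() function, and the difference between a concept search and an autocomplete lookup, can be illustrated with a small in-memory sketch. It is not the OpenSKOS implementation and does not call its API; the Concept class, the matching rules and the example.org URIs are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Concept:
    uri: str
    pref_label: str
    alt_labels: Tuple[str, ...] = ()


def find(concepts: List[Concept], term: str) -> List[Concept]:
    """Return concepts with any label containing the term (case-insensitive)."""
    needle = term.casefold()
    return [c for c in concepts
            if any(needle in label.casefold()
                   for label in (c.pref_label, *c.alt_labels))]


def autocomplete(concepts: List[Concept], prefix: str) -> List[str]:
    """Return the sorted preferred labels that start with the prefix."""
    needle = prefix.casefold()
    return sorted(c.pref_label for c in concepts
                  if c.pref_label.casefold().startswith(needle))


# Hypothetical sample vocabulary of languages.
CONCEPTS = [
    Concept("http://example.org/lang/deu", "German", ("Deutsch",)),
    Concept("http://example.org/lang/nld", "Dutch", ("Nederlands",)),
    Concept("http://example.org/lang/eng", "English"),
]

print([c.uri for c in find(CONCEPTS, "deutsch")])
print(autocomplete(CONCEPTS, "d"))
```

The search returns whole concepts, identified by their URI, so that a client can store the identifier rather than the matched string; the autocomplete lookup returns only labels for display while the user is typing.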

Involved persons

Primarily involved:

  • Hennie Brugman, Meertens Institute (KNAW), Amsterdam
  • Christoph Plutte, BBAW, Berlin
  • Daniel Kurzawe, GWDG, Göttingen
  • Matej Durco, ICLTT, Vienna
  • Marc Kemps-Snijders
  • ??

Informed:

  • Charly Mörth, ICLTT, Vienna
  • Daan Broeder, MPI, Nijmegen
  • Menzo Windhouwer, MPI, Nijmegen
  • Dieter van Uytvanck, MPI, Nijmegen
  • Stefan Schmunk, SUB, Göttingen (DARIAH, AP1 Head)
  • ??

Questions

- It is not clear (to me) how the DARIAH software components Collection Registry and Schema Registry fit into the picture.

- What is the current status of using controlled vocabularies within TextGrid?

- What is the status of CoNE?

[1] There is a need to clarify the terms: reference data, controlled vocabularies, concepts, data categories, …

[2] SKOS is a widely used, simple W3C standard for expressing “knowledge systems” (vocabularies, taxonomies, etc.). It provides means for handling multilinguality and expressing relations between concepts, as well as a globally unique identifier and a recommended label for each concept/entity, which is crucial to prevent the proliferation of name variants (a particularly acute problem for organization names). Thus, this standard should lend itself well to expressing the planned controlled vocabularies as well as concept data, although it will probably need to be accompanied by another description framework for additional information (e.g. data about persons).

[3]

[4]

[5]

[6] older prototype version! update to the current version planned in May 2013

[7]

[8]

[9]

[10] Presentation by Wolfgang Plempe at the workshop Text Mining, Controlled Vocabularies and Linked Data, Goettingen, 2011-07-13

[11] From SMC4LRT, presented at the LREC workshop Metadata 2012, Istanbul (paper, presentation)

[12]