Terminology Services and Technology

JISC state of the art review

Douglas Tudhope University of Glamorgan

Traugott Koch UKOLN, University of Bath

Rachel Heery UKOLN, University of Bath

Document details

Date: 15-09-2006
Version: Final draft for approval
Notes: Circulation to JISC Development Team

Acknowledgement to funders

This work was funded as part of the JISC Information Environment.

UKOLN is funded by the MLA: The Museums, Libraries and Archives Council, the Joint Information Systems Committee (JISC) of the Higher and Further Education Funding Councils, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath where it is based.


Contents page

Executive Summary

Purpose

Overview of report contents

Key points

Recommendations

1. Introduction

1.1 Purpose of this review

1.2 Terminology services overview

1.2.1 Controlled vocabularies

1.2.2 Folksonomies

1.2.3 Combination of terminology tools and techniques

1.3 Cost benefit issues

1.3.1 Benefits

1.3.2 Return on investment

2 Use cases - scenarios

2.1 Retrieval performance

2.2 Name Authorities

2.3 Mapping and other TS

2.4 Repositories

3 Types of vocabularies

3.1 Vocabularies by structure

3.1.1 Term Lists

3.1.2 Taxonomies

3.1.3 Subject Headings

3.1.4 Relationship-based KOS

3.2 Vocabularies by purpose

3.2.1 Retrieval purposes

3.2.2 Linguistic purposes

3.2.3 AI purposes - modeling the entities in a domain

3.2.4 eLearning purposes

3.2.5 eScience purposes

3.3 Named entity authority and disambiguation services

3.3.1 Name Authority databases

3.3.2 Other named entity authorities

3.3.3 Named entity recognition, text mining, name disambiguation

3.3.4 Tools, Web services

3.4 Social tagging and folksonomies

3.4.1 Terminology

3.4.2 Context

3.4.3 Categorization of tagging systems

3.4.4 Disadvantages and problems

3.4.5 Advantages and benefits

3.4.6 Proposed developments

3.4.7 Research

3.5 Best practice guidelines for constructing and using vocabularies

3.6 Network access to vocabularies

3.7 Terminology Registries

4 Activities with TS

4.1 Studies and models of information seeking behaviour

4.2 Information lifecycle with regard to TS

4.3 Types of Terminology Web Services

4.3.1 Definition of Terminology Web Services

4.3.2 Groups (and layers) of abstract terminology services

4.3.3 Illustration of TS assisted search process

4.3.4 Terminology Web Services review

4.4 Mapping

4.5 Automatic classification and indexing

4.6 Text mining and information extraction

4.7 General sources for work in TS

5 Review of current terminology service activity

5.1 JISC related activity

5.1.1 Archaeology Data Service (ADS)

5.1.2 Co-ODE: Collaborative Open Ontology Development Environment

5.1.3 geoXwalk Gazetteer Service

5.1.4 High Level Thesaurus (HILT)

5.1.5 Learning and Teaching Portal (Portals Programme)

5.1.6 Mersey Libraries, Archives Hub and Cheshire

5.1.7 Resource Discovery Network (RDN)

5.2 Other UK activity

5.2.1 COHSE Conceptual Open Hypermedia Project

5.2.2 FACET

5.2.3 FATKS

5.2.4 FISH Interoperability Toolkit

5.2.5 NHM Nature Navigator and other Scientific Taxonomic Projects

5.2.6 OpenGALEN

5.2.7 SKOS (Simple Knowledge Organisation System)

5.2.8 STAR (Semantic Technologies for Archaeological Resources)

5.3 International activity

5.3.1 Alexandria Digital Library

5.3.2 E-Biosci : EC platform e-publishing and info integration in Life

5.3.3 Renardus

5.3.4 Simile Piggy Bank

5.3.5 SPIRIT

5.3.6 OCLC and OCLC Research

5.4 Projects in relation to vocabulary lifecycle framework

5.5 Repositories

5.6 Augmenting existing programmes and projects

6 Standards

6.1 Design

6.2 Representations

6.3 Identification of concepts, terms and vocabularies

6.3.1 URIs

6.3.2 Practical experience

6.3.3 Further issues

6.4 Protocols, profiles and APIs

6.4.1 Protocols to access a vocabulary

6.4.2 Protocols to support query

6.5 Related standards

7 Conclusions

8 References (by main sections of the review)

Executive Summary

Purpose

Over the next two years, as part of its Capital Funding Programme, the Joint Information Systems Committee (JISC) is supporting further work to realize a rich information environment within the learning and research communities. This review is intended to inform JISC’s planning for future work related to Terminology Services and Technology, as well as to provide useful background information for participants in future calls, whether specifically featuring terminology or where terminology can be used to underpin other services.

Overview of report contents

This report reviews vocabularies of different types, best practice guidelines, research on terminology services and related projects. It discusses possibilities for terminology services within the JISC Information Environment and eFramework.

Terminology Services (TS) are a set of services that present and apply vocabularies, both controlled and uncontrolled, including their member terms, concepts and relationships. This is done for purposes of searching, browsing, discovery, translation, mapping, semantic reasoning, subject indexing and classification, harvesting, alerting etc. Indicative use cases are discussed.

One type of TS attempts to increase consistency and improve access to digital collections and Web navigation systems via vocabulary control. Vocabulary control aims to reduce the ambiguity of natural language when describing and retrieving items for purposes of information searching. Another type of TS is not concerned with consistency but with making it easier for end-users to describe information items and to have access to other users’ descriptions. This results in vocabularies (folksonomies) that may not be controlled, at least initially. The report reviews different kinds of vocabularies, according to their structure and their intended purpose. Potential benefits and return on investment are discussed. Named entity authority and social tagging services are discussed in some detail. Pointers are given on best practice guidelines and networked access to vocabularies, including key issues for future terminology registries.

The wider context of TS is considered. Relevant literature on user studies is reviewed. TS are located within an information lifecycle and within the JISC IE. Suggestions are made towards a more specific definition of Terminology Web Services within the JISC IE. Current work on Terminology Web Services is reviewed, along with work on mapping, automatic classification/indexing and repositories. Current projects that involve TS activity (JISC, UK, and international) are briefly reviewed.

Relevant standards are discussed, particularly for vocabulary representation; identification of concepts, terms and vocabularies; protocols and APIs.

Key points

TS can be machine-to-machine (m2m) services or interactive, user-facing services, and can be applied at all stages of the search process. Services include resolving search terms to controlled vocabulary, disambiguation services, offering browsing access, offering mapping between vocabularies, query expansion, query reformulation, and combined search and browsing. These can be applied as immediate elements of the end-user interface or can underpin services behind the scenes, according to context. The appropriate balance between interactive and automatic service components requires careful attention.
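As a purely illustrative sketch (not drawn from any service reviewed here), two of the services listed above, resolving a search term to controlled vocabulary and query expansion, might look as follows. The vocabulary, its terms and the function names `resolve` and `expand` are all invented for the example; a real TS would draw on a maintained vocabulary accessed over a network protocol.

```python
# Toy thesaurus: every term and relationship here is invented for illustration.
VOCAB = {
    "dwellings": {"alt": ["houses", "homes"], "narrower": ["cottages", "farmhouses"]},
    "cottages": {"alt": [], "narrower": []},
    "farmhouses": {"alt": ["farm houses"], "narrower": []},
}


def resolve(term):
    """Map a user's search term to a preferred term, or None if it is unknown."""
    needle = term.strip().lower()
    for preferred, entry in VOCAB.items():
        if needle == preferred or needle in entry["alt"]:
            return preferred
    return None


def expand(term):
    """Return the preferred term plus synonyms and all narrower terms."""
    preferred = resolve(term)
    if preferred is None:
        return [term]  # unknown term: fall back to free text searching
    expanded = []
    queue = [preferred]
    while queue:
        current = queue.pop(0)
        if current in expanded:
            continue
        expanded.append(current)
        expanded.extend(a for a in VOCAB[current]["alt"] if a not in expanded)
        queue.extend(VOCAB[current]["narrower"])
    return expanded
```

Here `expand("homes")` first resolves the entry term to "dwellings" and then gathers its synonyms and narrower terms, while an unrecognised term is passed through unchanged. Whether such expansion is applied automatically or offered to the user as a choice is the balance between automatic and interactive components noted above.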

Return on investment should be considered in any service provision. There are various types of vocabularies serving different purposes, with different degrees of vocabulary control, richness of semantic relationships, formality and editorial control. There is a range of TS options, both interactive and automatic. There is potential for piloting TS to augment existing JISC programmes and projects.

TS are sometimes contrasted with free text searching, assisted by statistical Information Retrieval techniques in automatic indexing and ranking. These are not, however, exclusive options and there are opportunities in exploring different combinations of the two approaches. It should be noted that Web search engines have introduced elements of TS, by offering synonym and lexical expansion options. Thus TS should not be seen as antithetical to free text searching and can augment it.

There are many existing vocabularies. Different arrangements regarding ownership, maintenance and licensing of vocabularies can be found. The issue of who will maintain a vocabulary and the basis on which it can be described or made available in a registry needs investigation since this underpins systematic use of vocabularies in the JISC IE. This involves establishing business models for access to and maintenance of vocabularies.

Mapping is a key requirement for semantic interoperability in heterogeneous environments. Although schemas, frameworks and tools can help, detailed mapping work at the concept level is necessary, requiring a combination of intellectual work and automated assistance. The impact on retrieval is a key consideration.
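The combination of automated assistance and intellectual work can be illustrated with a minimal sketch, assuming two small vocabularies held as dictionaries of concept identifiers and labels. The identifiers, labels and the function `suggest_mappings` are hypothetical. The code only proposes candidate mappings where labels coincide; confirming them, and mapping the concepts left unmatched, remains intellectual work.

```python
def normalise(label):
    """Lower-case a label and collapse whitespace so that labels can be compared."""
    return " ".join(label.lower().split())


def suggest_mappings(source, target):
    """Propose candidate concept mappings for human review.

    source and target map a concept identifier to a list of labels.
    Returns a list of (source_id, target_id, shared_label) tuples.
    """
    # Index the target vocabulary by normalised label.
    index = {}
    for tgt_id, labels in target.items():
        for label in labels:
            index.setdefault(normalise(label), set()).add(tgt_id)

    candidates = []
    for src_id, labels in source.items():
        for label in labels:
            key = normalise(label)
            for tgt_id in sorted(index.get(key, ())):
                candidates.append((src_id, tgt_id, key))
    return candidates


# Invented example vocabularies.
source_vocab = {"A1": ["Pottery", "Ceramics"], "A2": ["Coins"]}
target_vocab = {"B7": ["ceramics"], "B9": ["Numismatics"]}
candidates = suggest_mappings(source_vocab, target_vocab)
```

In this example the label match suggests A1 to B7, but nothing links "Coins" to "Numismatics": simple string matching misses exactly the kind of concept-level correspondence that requires a subject specialist.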

Automatic classification and indexing tools are important for addressing the potential resource overheads in applying TS to indexed collections and repositories. Some tools are emerging that should be investigated for JISC purposes. Many argue for a combination of intellectual and automatic methods.

It is important to consider how people search for information when designing and evaluating TS, in order to reduce the scope for design errors and increase the possibility that services will actually be used. User studies should be conducted where feasible in ongoing project work.

TS should not be seen as an isolated, free-standing component. TS need to be considered within the wider context of the JISC IE, and need to be integrated with other components of the eFramework. They should be seen as forming a set of services that can be combined with a wide range of other services. There is a need for specifications of TS and their workflow, as part of the JISC IE.

Interoperability requires commonly agreed standards and protocols. Standards exist at different levels and types of interoperability. The prospect is emerging for a broad set of standards across different aspects of terminology services - persistent identifiers, representation of vocabularies, protocols for programmatic access, vocabulary-level metadata in repositories. Such standards are an infrastructure upon which future TS will rest but it is not feasible to wait for international agreements; international consensus will be influenced by operational experience. Pilot TS projects should orient to existing potential standards (in persistent identifiers, representations, protocols for programmatic access) and help to evaluate and evolve them.

Recommendations

The review was asked to include: “recommendations for further activities needed in this field, and the extent to which JISC should be involved in the work (both short and longer term), including collaboration with other organizations as a possible form of involvement”. The following recommendations are listed according to the relevant section of the review, where further context may be found.

1. Introduction

1.1 Purpose of this review

  • Terminology services can support various stages of the information lifecycle
  • JISC should highlight subject access and terminology services in all relevant JISC programmes, whether as extensions to existing projects or as new projects

1.2 Terminology Services overview

  • Demonstrate integration of Terminology Services with other components of the JISC Information Environment. (See also Recommendation 4.3)

1.2.3 Combination of terminology tools and techniques

  • Encourage inter-disciplinary collaboration in the development of terminology services and co-operation with memory institutions and archives
  • Investigate different combinations of TS and uncontrolled (non-TS) search

1.3.2 Return on investment

  • Investigate methods to make vocabularies available to the education sector through a Registry, initially for experimentation purposes but ultimately in a sustainable, maintained, licensed manner. (See also Recommendation 3.7)

2 Use cases - scenarios

  • Use cases should be developed and refined on an ongoing basis, along with case studies of TS in practice, user session logging, observation, etc.

3 Types of vocabularies

  • Provide access to a range of different vocabularies according to context
  • It is important to consider the broader context and return on investment

3.1 Vocabularies by structure

  • Consider faceted approaches when developing vocabularies and TS

3.2 Vocabularies by purpose

  • Descriptions of intended purposes of a vocabulary would be a useful element of a vocabulary registry (see also Recommendation 3.7).

3.2.4 eLearning purposes

  • Increased cross-fertilisation between eLearning and Digital Library fields
  • User studies of behaviour by indexers (cataloguers), students, teachers. Investigate how to support effective practice with a variety of indexing and retrieval tools
  • Investigate conversion between VDEX and SKOS Core representations for compatible vocabularies (see also Recommendation 6.2).

3.2.5 eScience purposes

  • Studies of user practice with vocabularies describing research data

3.3 Named entity authority and disambiguation services

  • Investigate lists of institutional names and academic affiliations (IESR Agents etc.)
  • Study the coverage of available name authorities in OPACs and academic web publishing (LEAF, CiteSeer and similar)
  • Engage in international cooperation (e.g. LEAF, OCLC, SURF DARE)
  • Prototype a demonstrator UK Name Authority File, possibly involving BL and universities (authentication, staff, institution databases) and evaluate its use in a limited application
  • Address the treatment of place and geographical names in UK services and activities, and the development of standards and authorities, in cooperation with related projects and terminology efforts.
  • Support active participation of UK institutions in international naming standardisation efforts in scientific disciplines and, via project support, assist their implementation in UK
  • Apply methods of name extraction and investigate their benefits compared to and in combination with traditional authority systems. Build and evaluate different name disambiguation demonstrators
  • Experiment with a Name Authority Web Service, e.g. to be built into metadata creation tools
  • Develop or support metadata enhancement services for correction and enrichment: vocabularies, schemes, mapping, names

3.4 Social tagging and folksonomies

  • Experiment with combination of KOS-based controlled indexing with an established vocabulary and free (social) tagging for research purposes in a specific discipline, optimised for discovery and retrieval
  • Experiment with potential for automatic linking of tags to facets, controlled vocabularies and authorities
  • Integrate tagging with existing services such as repositories, OPACs, (RDN/Intute) subject gateways, Digital Libraries, KOS creation and management systems, museum exhibitions and catalogues, metadata enhancement services etc.
  • Comparison study between different types of user participation: annotation, recommendation, personalization, restructuring of information, categorization, concept space, concept maps, topic map tools. This could inform a prototype integrating different types of user participation with social tagging

3.7 Terminology Registries

Demonstrate the use of a terminology registry within the JISC IE testbed, to include:

  • Investigating inclusion of terminologies into IESR, potentially describing vocabularies as collections
  • Developing marketing proposition for a UK terminology registry (include use scenarios, IPR issues, business models, cost benefit)
  • Evaluating use of the draft metadata description profile proposed by NKOS
  • Maintain collaboration between various UK initiatives (with eScience e.g. GRIMOIRES and learning communities e.g. Becta Vocabulary Tool) and internationally (e.g. NSDL)

4 Activities with TS

4.1 Studies and models of information seeking behaviour

  • User studies of TS in context of JISC IE, illuminating the search process (for work flow of services) and the appropriate balance between interactive and automatic TS

4.3 Types of Terminology Web Services

  • Develop more precise definitions of TS, as part of the JISC IE and eFramework
  • Define search process workflow of TS within JISC IE eFramework
  • Within the context of eFramework, develop a hierarchical layered set of protocols for TS and standard bindings to (various) APIs
  • Develop open source, reference terminology web service implementations

4.3.4 Terminology Web Services review

  • Collaborate with international efforts in terminology web services
  • Develop a range of TS-based search and browsing tools

4.4 Mapping

  • Investigate/compare different mapping approaches and granularities in pilot projects
  • Develop a range of TS-based tools to assist in creating mappings
  • Investigate the potential for standard mapping relationships and a mapping protocol
  • Collaborate with international efforts in mapping services

4.5 Automatic classification and indexing

  • Investigate semi-automatic solutions to indexing and classification in pilot projects
  • Investigate currently available tools for automatic indexing and classification

4.6 Text mining and information extraction

  • Investigate relationship between KOS and text mining:
      - Demonstrate how KOS can support text mining
      - Demonstrate how text mining can be used to update and enhance KOS

5 Review of current terminology service activity

  • JISC should negotiate Dewey licenses for JISC services and projects

5.5 Repositories

  • Pilot different approaches to subject based access to repository content via different types of vocabulary and TS, taking cost benefit issues into account and various levels of aggregation of content:

-use of subject classification and

-use of specialised KOS vocabularies

-use of author assigned keywords

-full text indexing

  • Consider use of mainstream classification (such as DDC) in combination with assigning specialised vocabulary terms (as in use within RDN)

5.6 Augmenting existing programmes and projects

  • JISC should support a range of pilot demonstrators with end-users and evaluation
  • Investigate different TS approaches to (e.g.) indexing, mapping, search/browsing, query expansion, disambiguation
  • Consider subject access and terminology service adjuncts to appropriate JISC programmes and projects, including TS support for Intute; connection of TS (and subject access) to collection level metadata (e.g. topical composition, correlation); TS support for repositories; project-specific examples
  • Harvesting:
      - Investigate possibilities for extending harvesting tools with more subject metadata
      - Investigate relationship of TS and OAI etc.
      - Evaluate benefits of vocabulary-oriented metadata normalising and enhancement service, e.g. aggregator harvesting relevant metadata, enhancing it and then offering harvesting of the improved metadata
  • Develop vocabulary visualisation tools supported by TS
  • Flexible display and tailoring of segments from vocabularies
  • Flexible display and tailoring of results
  • Combined search/browsing

6 Standards

  • JISC should encourage participation in international standardisation activities

6.1 Design

  • Relevant standards should be included in JISC Standards Catalogue. All new initiatives should take account of relevant design standards

6.2 Representations

  • Strongly recommended to use XML-based representations
  • Recommended that vocabulary providers consider using SKOS Core if appropriate and contribute to further extensions and customising of SKOS Core

6.3 Identification of concepts, terms and vocabularies

  • A global identifier mechanism for referring to vocabularies and their components underpins interoperable TS
  • Recommended to consider building upon existing work with the http URI approach for concept identifiers
  • Investigate the addition of identifiers to a widely used freely available vocabulary in a pilot study
  • Educational work with vocabulary providers on need to supply identifiers and discussions on practical issues should be undertaken

6.4 Protocols, profiles and APIs

  • Need for standard m2m protocols for networked access to vocabularies (and their constituent concepts, relations and terms) with common bindings (APIs) building on web services and other low-level standards
  • Recommended to consider using SKOS or ZThes API for TS (with a view to contributing to further development). Investigate possibilities of unifying SKOS and ZThes APIs
  • Investigate possible standard m2m protocols for mapping access to vocabularies, perhaps by expanding SKOS or ZThes APIs
  • Investigate the combination/integration of TS with existing query APIs (SRU/SRW, CQL) or possibly develop new TS-based query APIs
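
One way to picture the combination of TS with an existing query API is the sketch below, which turns a list of vocabulary terms (for instance the output of a query expansion service) into a CQL query and an SRU searchRetrieve URL. It is an assumption-laden illustration only: the base URL is a placeholder, the choice of the `dc.subject` index is an assumption about the target service, and the helper names `cql_any_of` and `sru_url` are invented.

```python
from urllib.parse import urlencode


def cql_any_of(index, terms):
    """Build a CQL query matching any of the given terms in one index."""
    clauses = []
    for term in terms:
        escaped = term.replace('"', '\\"')  # escape embedded double quotes
        clauses.append('{} = "{}"'.format(index, escaped))
    return " or ".join(clauses)


def sru_url(base_url, cql, max_records=10):
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql,
        "maximumRecords": max_records,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint and invented terms, for illustration only.
query = cql_any_of("dc.subject", ["cottages", "farm houses"])
url = sru_url("http://sru.example.org/catalogue", query)
```

The design choice shown here keeps the terminology service and the search service separate: the TS supplies terms, and the existing query protocol is used unchanged. A new TS-based query API, the alternative raised above, would instead let the search service call the vocabulary itself.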

1. Introduction

1.1 Purpose of this review

Over the next two years, as part of its Capital Funding Programme, the Joint Information Systems Committee (JISC) is supporting further work to realize a rich information environment within the learning and research communities. This review is intended to inform JISC’s planning for future work related to Terminology Services and Technology, as well as to provide useful background information for participants in future calls, whether specifically featuring terminology or where terminology can be used to underpin other services. The review is intended to identify useful areas of activity and highlight current initiatives of interest rather than be comprehensive or prescriptive. The review will recommend a number of areas with potential, either for further investigation, or for the development of tools or demonstrator services.