BODC Controlled Vocabularies

Document History

Initial versionRKL2006March28

MHUG comments incorporatedRKL2006March29

Maintenance procedures updatedRKL2006April04

Updated format of web service outputMHUG2006Jun13

Added new web service use casesMHUG2006Jul25

Added specification for whatListsCat and MHUG 2006Jul25

searchTerm web service methods for

Vocabulary Server version 2.0

Introduction

The BODC controlled vocabularies are a set of lists, each containing:

  • A key – an alphanumeric string, sometimes referred to as a code that used to represent the list entry to software agents. Keys are therefore stored in data files and Oracle table fields. Ideally, users should never be exposed to keys and therefore they should be random byte strings devoid of semantic content. However, mnemonics are sometimes used for historical reasons or to assist database maintenance.
  • A term – a string used to represent the list entry to humans in applications where space is not a problem.
  • An abbreviated term - a string used to represent the list entry to humans in applications where space is a problem.
  • A definition – a clear and unambiguous (and therefore often quite longwinded) statement of the semantics associated with the list entry.

The controlled vocabularies have grown from what used to be known as the ‘BODC Code Tables’. The change in branding has been designed to make a break with the past when great importance was put on the key, even resulting in multiple lists for the same entity group to deliver alternative key formats. These have subsequently diverged in content causing all kinds of problems and are definitely a bad idea. The transition from ‘code tables’ to ‘controlled vocabularies’ has aimed to make as clean a break with the past as possible. However, there is some historical baggage (documented in a later section) that is so deeply embedded in our infrastructure that it could not be excised.

The BODC code table system consisted of ‘tables’ that were sometimes subdivided into ‘sections’, including a ‘special’ section (section 0) that was a normalised extension of all other sections in that table. This has now been turned upside-down and each ‘section’ in the old system is now a ‘list’ in the new.

Sections are aggregated in two ways. First, the controlled vocabularies are designed to constrain the content of specified fields in our database schema.In certain circumstances, we may want to constrain a field against more than one list. For example, we might want to constrain a ‘record created by’ field against both the ‘Current BODC staff’ list and the ‘Past BODC staff’ lists. This aggregation is a strict one-to-many relationship.

Secondly, lists are aggregated through a many-to-many relationship into subject groups (themselves defined by a controlled vocabulary list) designed to bring together the lists constraining related information.

List Nomenclature and Namespace

Each list is uniquely identified by a 4 or 5-byte key (associated with a short name, long name and definition) that is built from the concatenation of the ‘constraint reference’ (formerly known as the code table number) and the ‘constraint component reference’ (formerly known as the code table section).

A namespace under NERC DataGrid branding (vocab.ndg.nerc.ac.uk) has been set up for exposing the controlled vocabularies to the Web. Thus each list is uniquely addressable through a URI such as vocab.ndg.nerc.ac.uk:C751, as is each term in a list such as vocab.ndg.nerc.ac.uk:C751:POL.

Exposure of the Parameter Dictionary Tables

In the past, the BODC Code Tables and BODC Parameter Dictionary have been considered as totally separate entities. However, they have an awful lot in common – both systems assign meaningful text to keys - and therefore it makes a lot of sense to expose them through a common interface. This could be done by loading a subset of the parameter table fields into the controlled vocabulary master list, but this has maintenance implications because the original parameter dictionary structures need to be maintained because of the mapping information they hold. Information duplication in database schemas always ends up in tears.

Alternatively, a series of views can be created that map parameter dictionary fields to fields in the controlled vocabulary master list. These views are then catalogued as lists in the same manner as physically stored lists. The following views have been set up:

P011BODC Parameter Usage Vocabulary

P021BODC Parameter Discovery Vocabulary

P031SEASEARCH Agreed Parameter Groups

P041GCMD Parameter Valids

P051ISO Topic Categories

P061BODC data storage units

The mappings of these views are as follows:

P011 / P021
TN / ‘P01’ / ‘P02’
CODVAL / PARAMETER.CODE / PARAMETER_GROUP.CODE
CODNAM / PARAMETER.FULL_TITLE / PARAMETER_GROUP.FULL_TITLE
CODALT / PARAMETER.SHORT_TITLE / PARAMETER_GROUP.SHORT_TITLE
ISECT / 1 / 1
GF3COD / Null / Null
DESCRIPTION / ‘Unavailable’ / PARAMETER_GROUP.DEFINITION
CREATED / PARAMETER.CREATED / PARAMETER_GROUP.CREATED
MODIFIED / PARAMETER.MODIFIED / PARAMETER_GROUP.MODIFIED
P031 / P041
TN / ‘P03’ / ‘P04’
CODVAL / SEASEARCH_CATEGORY.CODE / GCMD_CATEGORY.CODE
CODNAM / SEASEARCH_CATEGORY.TITLE / GCMD_CATEGORY.TITLE
CODALT / SEASEARCH_CATEGORY.TITLE / ‘Unavailable’
ISECT / 1 / 1
GF3COD / Null / Null
DESCRIPTION / DOCTEXT(SEASEARCH_CATEGORY.INDREF) / ‘Unavailable’
CREATED / SEASEARCH_CATEGORY.CREATED / PARAMETER_GROUP.CREATED
MODIFIED / SEASEARCH_CATEGORY.MODIFIED / PARAMETER_GROUP.MODIFIED
P051 / P061
TN / ‘P05’ / ‘P06’
CODVAL / ISO_CATEGORY.CODE / UNITS.CODE
CODNAM / ISO_CATEGORY.TITLE / UNITS.FULL_TITLE
CODALT / ISO_CATEGORY.TITLE / UNITS.SHORT_TITLE
ISECT / 1 / 1
GF3COD / Null / Null
DESCRIPTION / DOCTEXT(ISO_CATEGORY.INDREF) / ‘Unavailable’
CREATED / ISO_CATEGORY.CREATED / UNITS.CREATED
MODIFIED / ISO_CATEGORY.MODIFIED / UNITS.MODIFIED

Notes:

  • The ‘MODIFIED’ field in the Parameter Dictionary tables may be null but it is mandatory for controlled vocabulary lists. This problem may be circumvented by using NVL (MODIFIED, CREATED) in the views.
  • The DOCTEXT stored function outputs the content of an NODB document as a single chunk of text without any of its formatting markup.

Organisation Metadata Extensions

One of the list aggregations (C75) defines what BODC terms ‘data originators’. These are organisational in nature, although not necessarily ‘organisations’ as they are frequently component parts of an organisation (university departments etc.), but never individual people. The code table is simply a lookup between a key used within BODC and a single text field used to hold a plaintext name for the originating entity. There is no standardisation of the descriptions.

There are two problems with C75. First, the information content is insufficient to populate NERC DataGrid metadata records. Secondly, there has been an inconsistent maintenance policy in the past. The C75 entity identifies the source of an accession and consequently a new C75 record should have been generated every time there was a change in the attributes of that entity, such as name or its contact details. With the exception of one or two haphazard changes this has not been done in the past and so we have another piece of historical baggage.

The first problem has been addressed by setting up an extension to the C75 list aggregation (table ORIG) containing the necessary additional fields.

The second problem may be expressed in the terms that BODC has marked up datasets with a single key whose definition (C75 aggregation terms and C75 extension fields) has changed over time. The solution adopted is to make the relationship between C75 and the new ORIG records potentially many to manyrather than one to one, but constrained to be one to one for any given period of time. This is achieved by implementing a map table between C75 and the ORIG table. This mapping table also forms the basis for the resolution of the ‘multiple originator list’ (EDMED, EDIOS, etc.) problem. Initially an EDMED mapping has been implemented.

Oracle Schema

Tables

The controlled vocabulary schema comprises the following tables:

  • BODCCDTX – the controlled vocabulary list index carrying their identifiers and definitions. There is one record per list. This table is supported by tables BODCCDTX_DELETED and BODCCDTX_MODIFIED that hold any deleted records and pre-modification versions of changed records.
  • BODCCODES – the controlled vocabulary term repository. There is one record per list entry.This table is supported by tables BODCCODES_DELETED and BODCCODES_MODIFIED that hold any deleted records and pre-modification versions of changed records.
  • BODCCDHD – the field constraint index. There is one record per constraint. Each list must map to at least one constraint, but constraints may map to more than one list.
  • BODCCDTX_C980_LINK – the link establishing a one-to-many relationship between controlled vocabulary list terms and list subject terms in the C980 controlled vocabulary.
  • ORIG, ORIG_MODIFIED, ORIG_DELETED and C75_ORIG_LINK – the extensions to the C75 code list aggregation (data originating entities). C75_ORIG_LINK contains one record per record in the ORIG table
Tables BODCCDTX, BODCCDTX_DELETED and BODCCDTX_MODIFIED

TNThe controlled vocabulary field constraint reference

ISECTThe controlled vocabularyconstraint component reference

LONG_NAMEThe full name by which the list is known

SHORT_NAMEAn abbreviated name by which the list is known

DEFINITIONA full description of the scope and contents of the list

USAGE_NOTESNotes designed to guide BODC personnel populating the controlled vocabulary entries. This is very much information for BODC consumption only that will not be publicly exposed.

VERSIONA monotonically incrementing number, starting from zero, identifying the current version of the list.

VRS_DATEA timestamp indicating when the current version of the list was created.

VISIBLEA flag set to ‘Y’ for public lists and ‘N’ for BODC-only lists.

CREATEDA timestamp indicating when the list index record was created.

MODIFIEDA timestamp indicating when the list index record was last modified.

  • The primary key is formed from fields TN and ISECT.
  • The controlled vocabulary list identifier is formed from the concatenation of TN (must always be 3 bytes long or applications will be unable to parse list identifiers into keys) and ISECT.
  • Each controlled vocabulary list must be part of a field constraint aggregation. In other words, there has to be a BODCCDHD record for each unique value of TN in BODCCDTX (historical baggage).
  • The version number is incremented whenever one or more terms in the associated list has been changed. This is trigger-driven for deletions, but implemented by a cron script for inserts and modifications to prevent large jumps in version number when records are added. Note that multiple deletions will cause such a jump, but this should only happen should a list be withdrawn.
  • The version timestamp is set to either the time of the deletion or to the most recent record modification timestamp in the list.
  • All fields are mandatory except usage notes
  • Details of field typing are not given in his document – refer to Oracle for current settings.
  • Tables BODCCDTX_DELETED and BODCCDTX_MODIFIED have the same structure as BODCCDTX with additional fields for deletion/modification reasons and timestamps.
Table BODCCODES, BODCCODES_DELETED and BODCCODES_MODIFIED

TN The controlled vocabulary field constraint reference

CODVAL The key for the controlled vocabulary list entry

CODNAM The full version of the controlled vocabulary list entry term

CODALT The abbreviated version of the controlled vocabulary list entry term

ISECT The controlled vocabularyconstraint component reference

GF3COD An alternative key for the controlled vocabulary list entry (historical baggage)

DESCRIPTION The definition of the controlled vocabulary list entry term

CREATEDA timestamp indicating when the term record was created.

MODIFIEDA timestamp indicating when the term record was last modified.

  • The primary key is formed from TN and CODVAL. Note that ISECT is not included in the primary key to keep keys unique across multiple list constraints.
  • All fields are mandatory except the abbreviated name, the alternative code and the definition. A worthwhile objective would be to populate the abbreviated terms and definitions and then make those fields mandatory as well.
  • Details of field typing are not given in his document – refer to Oracle for current settings.
  • Tables BODCCODES_DELETED and BODCCODES_MODIFIED have the same structure as BODCCODES with additional fields for deletion/modification reasons and timestamps
Table BODCCDHD

TN The controlled vocabulary field constraint reference

TITLE The name for the field constraint

GF3TTLAn alternative name for the field constraint

IVERSI Redundant field – version numbers moved to BODCCDTX

DATMOD Timestamp when the record was last modified

LENVAL Maximum length of the key for the field constraint (effectively the size of the field that is being protected by the constraint)

LENGF3 Maximum length of the alternative key for the field constraint

LENABB Maximum length of the abbreviated controlled vocabulary term (redundant field)

  • The primary key is TN.
  • All fields are mandatory except GF3TTL
  • Details of field typing are not given in his document – refer to Oracle for current settings.
Table BODCCDTX_C980_LINK

C980CODThe key to the subject entry in the C980 list

TNThe controlled vocabulary field constraint reference

ISECTThe controlled vocabularyconstraint component reference

  • All three fields form the primary key.
  • TN+ISECT is constrained against BODCCDTX. C980COD is constrained against C98TAB.CODVAL.
  • Details of field typing are not given in his document – refer to Oracle for current settings.
TablesORIG, ORIG_MODIFIED and ORIG_DELETED

ORIGREFUnique reference for the data originating entity

ORIGNAMEName of the data originating entity.

ABBRVAbbreviated name for the data originating entity

EMAILOriginating entity e-mail address

FAXOriginating entity fax number

PHONEOriginating entity phone number

ADDRESS1Originating entity address line 1

ADDRESS2Originating entity address line 2

COUNTYOriginating entity county or administrative region

CITYOriginating entity post town

COUNTRYOriginating entity post town

POSTCODE Originating entity postcode

URIOriginating entity web address

CREATEDRecord creation date

MODIFIEDRecord modification date

  • The primary key is ORIGREF
  • EMAIL, FAX, PHONE, ADDRESS2 and POSTCODE may be null. All other fields are mandatory.
  • FieldsCITY, COUNTY and COUNTRY are constrained by C61TAB.CODVAL, C62TAB.CODVAL and C18TAB.CODVAL respectively.
  • Envisaged address format for output in reports or mail merge is:
  • ORIGNAME
  • ADDRESS1
  • ADDRESS2
  • CITY
  • COUNTY
  • COUNTRY
  • POSTCODE
  • Tables ORIG_DELETED and ORIG_MODIFIED have the same structure as ORIG with additional fields for deletion/modification reasons and timestamps.
  • Details of field typing are not given in his document – refer to Oracle for current settings.
Table C75_ORIG_MAP

CODVALThe key for the entry in the C75 list aggregation

ORIGREFThe key for the entry in the ORIG table

CENTRE_IDThe key for the entry in the EDMED address book

TBEGINSDate from which mapping is valid

TENDSDate to which mapping is valid.Null value indicates mapping is current.

CREATEDRecord creation date

MODIFIEDRecord modification date

  • The primary key for this table is formed from CODVAL and TBEGINS.
  • CODVAL is constrained by a foreign key constraint referencing C75TAB.CODVAL and ORIGREF by ORIG.ORIGREF. CENTRE_ID is constrained by a foreign key constraint referencing EDMED_ORG.CENTRE_ID. Note that the CENTRE_ID field is, initially at least, allowed to be null to allow population of this field to be done retrospectively.
  • Details of field typing are not given in his document – refer to Oracle for current settings.

Views and Referential Integrity Tables

The controlled vocabularies were concatenated into a single table with the aim of limiting the table count under bodcdb. This worked initially, but the advantage was lost when referential integrity constraints came along, requiring the creation and maintenance of a physical table for each constraining list. However, there is sufficient infrastructure requiring the concatenated structure to require that it be retained (historical baggage).

There are three types of view exposing bodccodes:

  • Views accessing the individual controlled vocabulary lists by their reference key (e.g. C621, C622 etc.)
  • Views accessing the field constraint list aggregations by their reference key (C01, C62 etc.)
  • Views accessing individual controlled vocabulary lists by mnemonic, which is usually the name of the NODB field constrained by the list.

The views named using keys expose all columns of bodccodes, but the mnemonic views only expose the key (headed by the view name) and the term (headed by a descriptive name).

The following mnemonically named views are currently implemented:

View Name / Output Title / Comments
cfscat / Fixed Station Category (C01)
cpjcat / Project Category (C02)
cshcat / Series Category (C03)
cdocat / Documentation Category (C04)
cfldpq / Sea Floor Depth Qualifier C06
csdepq / Series Depth Qualifier (C06)
csposu / Series Positional Unc'y (C07)
ciunit / Units of Nominal Int'val (C08)
copcat / PrimaryOcean. Data Cat. (C09)
coscat / 2'ndryOcean. Data Cat. (C09)
cmount / Instrument Mounting Cat. (C10)
cplcat / Platform Category (C10)
cincat / Instrument Category (C11)
conf / Acc'n Confidentiality (C12)
csposd / Position Fields Def'n (C14)
flags / explanation / Additional column outputting the alternative key
ihb_codes / Sea Areas / Additional column outputting the alternative key
sea_areas / Sea Areas / Additional column outputting the alternative key
cplcod / Platform Code (C17)
cshcty / IOC Country Code (C18)
country / IOC Country Code (C18)
dbcode / BODC Database Code (C20)
qc / NODB Quality Control Fld (C21)
checks / Series Quality Class (C21) / Restricted to values 'X' and 'A'
csdepd / Series Depth Field Def'n (C24)
bnst / Accession Banked Status (C25)
ctmprc / Precision of Time Spec. (C28)
csampl / MTSIA Sampling method (C49)
cdatyp / MTSIA Data Category (C50)
cparms / MTSIA Parameters (C52)
cparmi / MTSIB Parameters (C52)
mooring_params / Parameter Measured
csouri / Inventory Info. Source (C54)
cstorf / Data Storage Format (C69)
source_formats / Source (C74)
cshorg / Organisation (C75)
crsorg / AuthorisationOrganisation C75
labs / Laboratory (C75)
labcode / Laboratory (C75)
nerc_labs / Laboratory (C75) / Restricted to section 1
gov_labs / Laboratory (C75) / Restricted to section 2
universities / Laboratory (C75) / Restricted to section 3
comm_labs / Laboratory (C75) / Restricted to section 4
foreign_labs / Laboratory (C75) / Restricted to section 5
data_type / Data Type (CNDX) (C76)
roscop_data_type / ROSCOP Data Type (C77)
csbjct / Documentation Subject (C78)
userid / user C99

The referential integrity field constraints are implemented as one single-column table (the key) per constraining list (i.e. record in bodccdhd) and are named using the corresponding view name suffixed by ‘TAB’ (e.g. C01TAB, C62TAB, etc.). The table population is maintained by three triggers fired by insertion, deletion and update on bodccodes.