ESSnet on Data Warehousing Report 22(21)

Version 1.12

WP1 2013-09-06

Title: Framework of metadata requirements and roles in the S-DWH
WP: 1 / Deliverable: 1.1
Version: 1.12 / Date: 6-9-2013
Author: Lars-Göran Lundell / NSI: Sweden

ESSnet

On micro data linking and data warehousing
in production of business statistics

Metadata Framework for Statistical Data Warehousing

Contents

1 Metadata – general considerations

1.1 Metadata definitions and terminology

1.1.1 Metadata and data

1.1.2 Metadata categories

1.1.3 Metadata subsets

1.1.4 Metadata structures

1.2 Metadata collection and usage

1.3 Metadata standards

1.3.1 GSBPM

1.3.2 GSIM

1.3.3 MDR (ISO/IEC 11179)

1.3.4 CWM

1.3.5 DDI

1.3.6 SDMX

1.3.7 MCV

2 Metadata in the statistical data warehouse

2.1 The SDWH metadata requirements

2.1.1 Minimum metadata requirements for the SDWH

2.2 Metadata and the layered SDWH

2.2.1 Source layer metadata

2.2.2 Integration layer metadata

2.2.3 Interpretation and data analysis layer metadata

2.2.4 Data access layer metadata

2.2.5 Summary of SDWH layers and metadata categories

2.3 Organising SDWH metadata

2.4 SDWH metadata governance

2.5 The SDWH and metadata standards

Metadata plays a very active and important part in the data warehouse environment. [...] Metadata for the data warehouse environment is one of the most important aspects.[1]

Metadata is the DNA of the data warehouse, defining its elements and how they work together. [...] Metadata plays such a critical role in the architecture that it makes sense to describe the architecture as being metadata driven.[2]

The quotations above originate from “the fathers of the data warehouse”, Bill Inmon and Ralph Kimball. Even if they do not always agree on how a data warehouse should be built and maintained, they obviously share the view that much effort should be devoted to designing the metadata system when establishing a data warehouse. To everyone working in an organisation that produces statistics, like a national statistical institute (NSI), the need for good metadata is already well known, regardless of the production environment. Thus it is obvious that a statistical data warehouse (SDWH) is dependent on its metadata for statistical as well as data warehousing purposes.

According to the framework partnership agreement (FPA) for the ESSnet project on micro data linking and data warehousing in statistical production, this project is to “define a functional model of the SDWH, so that the issues raised by the ESSnet can be assessed in a generic and standardized way”.

This paper attempts to define the roles and purposes of metadata in the SDWH in generic terms, and to distinguish them from the metadata that are used in statistics production regardless of the environment. The result is a general metadata framework for statistical data warehousing.

1 Metadata – general considerations

1.1 Metadata definitions and terminology

The first step in any kind of standardisation work usually concerns making sure that all involved parties understand and agree on a set of basic definitions and use a common terminology. This chapter covers a number of important basic definitions.

1.1.1 Metadata and data

General definitions of metadata can be found in many books and many sites on the Internet. Most of them are very short and simple. The most commonly used generic definition states that:

[Def 1.1] Metadata are data about data[3]

There are some variations on the theme, e.g. claiming that metadata should (or must) be structured or formalised. Perhaps somewhat unexpectedly, the sources that have a relation to statistics give definitions that are even shorter and vaguer than some of the general purpose sources. The definition of statistical metadata given by OECD and UNECE, for example, simply states that:

[Def 1.2] Statistical metadata are data about statistical data[4]

This definition will obviously cover all kinds of documentation with some reference to any type of statistical data and is applicable to metadata that refer to data stored in an SDWH as well as any other type of data store.

Since the definition of metadata shows that they are just a special case of data, we need a reasonable definition of data as well. A derivative from a number of slightly varying definitions would be:

[Def 1.3] Data are qualitative and/or quantitative information collected through observation[5]

Just as for statistical metadata, several definitions of statistical data can be found, for example:

[Def 1.4] Statistical data are data derived from either statistical or non-statistical sources, which are used in the process of producing statistical products[6]

These basic definitions are very generic and state nothing about requirements on the contents or organisation of the data or metadata.

1.1.2 Metadata categories

Metadata may describe many different aspects of data. Hence metadata can be categorised in several ways, where the categories form a multi-dimensional structure. Consequently, each metadata item normally belongs to several categories.

1.1.2.1 Active – passive

Traditionally, metadata have been seen as a documentation of an existing object or a process, such as a statistical production process that is running or has already finished – i.e. the result of a task most often carried out as the last, even optional step of the production process. This indicates a passive, recording role, which is useful for documenting, e.g., the variables, objects and methods used to plan and carry out a survey or the quality achieved for the results.

Passive metadata will become more active if they are used as input for planning, e.g., a new survey round or a new similar statistics product. The term active metadata should, however, be reserved for metadata that are operational. Active metadata may be regarded as an intermediate layer between the user and the data, which can be used by humans or computer programmes to search, link, retrieve or perform other operations on data. Thus active metadata may be expressed as parameters, and may contain rules or code (algorithmic metadata). Some authors reserve the term active for algorithmic metadata only, i.e. those that can be interpreted or executed at runtime to support metadata driven processes, and call all other non-passive metadata semi-active.

Passive metadata are used as documentation in all statistics production regardless of storage environment. In the SDWH active metadata must be available in what is often called the metadata layer (cf. definition 4.1).

[Def 2.1] Active metadata are metadata stored and organised in a way that enables operational use, manual or automated, for one or more processes
Examples: Instruction; user manual; parameter; script (SQL, XML)

[Def 2.2] Passive metadata are all metadata that are not active
Examples: Quality report for a survey, a census or register; documentation of methods that were used during a survey; most log lists; definitions of variables
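The operational role of active metadata can be illustrated with a minimal sketch. It assumes a hypothetical set of edit rules stored as parameters; the variable names, limits and the function check_record are invented for illustration and are not taken from any actual SDWH. The point is only that the rules are read and applied at runtime, so the process changes when the metadata change, without any change to the programme itself.

```python
# Hypothetical active metadata: edit rules kept as parameters, not as program code.
EDIT_RULES = [
    {"variable": "turnover", "min": 0, "max": 1_000_000_000},
    {"variable": "employees", "min": 0, "max": 500_000},
]


def check_record(record, rules):
    """Return the names of the variables that violate a rule.

    A variable that is missing from the record is treated as a violation.
    """
    failed = []
    for rule in rules:
        value = record.get(rule["variable"])
        if value is None or not (rule["min"] <= value <= rule["max"]):
            failed.append(rule["variable"])
    return failed


# The same function serves any survey; only the rule list differs.
print(check_record({"turnover": 5000, "employees": -3}, EDIT_RULES))
```

The same rule list, printed in a methodological report, would be passive metadata; it becomes active only because a process reads and applies it.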

1.1.2.2 Formalised – free-form

According to some sources all metadata must be structured, or formalised. In the opposite case, all metadata would be created and stored in completely free form – unstructured and non-formalised. In practice all metadata probably follow some kind of structure, which may be more or less strict. At one end, we have completely and strictly formalised metadata, meaning that only pre-determined codes or numerical information from a pre-determined domain may be used. At the other end, we find a loose structure, e.g. a set of chapters, subdivisions, headings, etc., that may be mandatory or optional and whose contents may adhere to some rules or may be entered in a completely free form (text, diagrams, etc.).

Strictly formalised metadata are obviously well suited for use in an active role, but there is no simple, unambiguous mapping between active and formalised, and passive and free-form, respectively. Still, formalised metadata are more easily used actively, and since active metadata are vital to building an efficient SDWH it follows that its metadata should also be formalised, whenever possible.

[Def 2.3] Formalised metadata are metadata stored and organised according to standardised codes, lists and hierarchies
Examples: Classification codes; parameter lists; most log lists

[Def 2.4] Free-form metadata are metadata that contain descriptive information using formats ranging from completely free-form to partly formalised (semi-structured)
Examples: Quality report for a survey, a census or a register; methodological description; process documentation; background information
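As a small illustration of why strictly formalised metadata lend themselves to active use, the sketch below checks values against a pre-determined code list. The code list SIZE_CLASS and both helper functions are hypothetical and serve only as an example of a "pre-determined domain".

```python
# Hypothetical code list: the only values allowed for a size class variable.
SIZE_CLASS = {
    "1": "0-9 employees",
    "2": "10-49 employees",
    "3": "50-249 employees",
    "4": "250 or more employees",
}


def is_valid_code(code, code_list):
    """A value is valid only if it belongs to the pre-determined domain."""
    return code in code_list


def label_for(code, code_list):
    """Translate a code into its label, flagging codes outside the list."""
    return code_list.get(code, "unknown code")


print(is_valid_code("3", SIZE_CLASS), label_for("3", SIZE_CLASS))
print(is_valid_code("9", SIZE_CLASS), label_for("9", SIZE_CLASS))
```

A free-form description of the same size classes would carry the same information for a human reader, but could not be used for automatic validation in this way.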

1.1.2.3 Reference – structural

Most sources define two main categories of metadata, often called business and technical metadata. The “statistical sources” rarely use those terms. Several different synonyms can be found for business metadata, e.g. conceptual or logical, but the most commonly used term is reference metadata. Instead of technical metadata, the “statistical sources” most often use the term structural metadata to refer to the same thing.

The distinction between the two categories varies between sources, but generally reference metadata help the user understand, interpret and evaluate the contents, the subject matter, the quality, etc., of the corresponding data, whilst structural metadata help the user, who in this case may be human or machine, find, access and utilise the data operationally.

Particularly in data warehousing structural metadata can be defined as any metadata that can be used actively or operationally in a metadata driven system. The user may in this case be a human or a machine (a programme, a process, a system). This includes metadata that describe the physical locations of the corresponding data, such as names or other identities of servers, databases, tables, columns, files, positions, etc.

Structural metadata are normally represented as formalised and active, whilst reference metadata are typically passive and stored in a free format, so that more effort is required to make them active by storing them in a structured way.

[Def 2.5] Reference metadata are metadata that describe the contents and quality of the data in order to help the user understand and evaluate them (conceptually) [7]
Examples: Quality information on survey, register and variable levels; variable definitions; reference dates; confidentiality information; contact information; relations between metadata items

[Def 2.6] Structural metadata are metadata that help the user find, identify, access and utilise the data (physically)
Examples: Classification codes; parameter lists
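How structural metadata can help a machine user find and access data may be sketched as follows. The catalogue below maps a logical variable name to a physical location (database, table, column); all names are hypothetical, and build_select is an illustrative helper rather than part of any existing system.

```python
# Hypothetical structural metadata: logical variable name -> physical location.
STRUCTURAL_METADATA = {
    "turnover": {"database": "sdwh", "table": "business_facts", "column": "turn_amt"},
    "employees": {"database": "sdwh", "table": "business_facts", "column": "emp_cnt"},
}


def build_select(variable, catalogue):
    """Build a query for a logical variable from its physical description."""
    loc = catalogue[variable]  # raises KeyError for undocumented variables
    return f'SELECT {loc["column"]} AS {variable} FROM {loc["database"]}.{loc["table"]}'


print(build_select("turnover", STRUCTURAL_METADATA))
```

The user asks for a variable by its logical name and does not need to know where it is stored; if the data are moved, only the catalogue entry has to change.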

All categories described above are valid for all metadata, i.e. every metadata item can be categorised on the three scales: active–passive, formalised–free-form, and reference–structural, as illustrated in figure 1.

Figure 1 Categorisation of a metadata item
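The three-way categorisation can also be written down as a simple data structure. The sketch below is a simplification that treats each scale as a choice between its two end points, whereas the text allows intermediate positions (e.g. semi-active or semi-structured metadata). The class and attribute names are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class Form(Enum):
    FORMALISED = "formalised"
    FREE_FORM = "free-form"


class Purpose(Enum):
    REFERENCE = "reference"
    STRUCTURAL = "structural"


@dataclass(frozen=True)
class MetadataItem:
    """A metadata item with one position on each of the three scales."""
    name: str
    role: Role
    form: Form
    purpose: Purpose

    def categories(self):
        return (self.role.value, self.form.value, self.purpose.value)


item = MetadataItem(
    "Quality declaration for a survey",
    Role.PASSIVE, Form.FREE_FORM, Purpose.REFERENCE,
)
print(item.categories())
```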

1.1.3 Metadata subsets

In addition to the categories described above, a metadata item may (but does not necessarily have to) also belong to a specific type, or subset, of metadata. The subsets that are generally best known or considered most important are described below. Several more types may be identified to serve special purposes, but are not further described here.

1.1.3.1 Statistical metadata

According to definition 1.2 statistical metadata are “data about statistical data”. This definition is very generic and needs to be more precise in order to be useful. From a more operational point of view statistical metadata can be seen as those metadata that directly refer to central concepts in the statistics, e.g., those that define and describe statistical unit types used in a survey, a census or a register, their characteristics, the variables and the statistical activities[8]. This still means that the statistical metadata subset may – at least partly – overlap some other subsets, but will exclude some more administrative and technical ones.

Statistical metadata may belong to any of the metadata categories described above.

[Def 3.1] Statistical metadata are data about statistical data.
Examples: Variable definition; register description; code list

1.1.3.2 Process metadata

Information on an operation, such as when it started and ended, the resulting status, the number of records processed, which resources were used, etc., is a specific type of metainformation. This kind of metadata is known under several names, such as process metadata, process data, process metrics, or paradata. These data may contain either expected values or actual outcomes. In both cases, they are primarily intended for planning – in the latter case by evaluating finished processes in order to improve recurring or similar ones. If process metadata are formalised, this will obviously facilitate computer-aided evaluation.

Process metadata are less likely to be categorised as free-form, but may be active or passive, and reference or structural.

[Def 3.2] Process metadata are metadata that describe the expected or actual outcome of one or more processes using evaluable and operational metrics
Examples: Operator’s manual (active, formalised, reference); parameter list (active, formalised, reference); log file (passive, formalised, reference/structural)
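A formalised process metadata record might look like the sketch below, which stores the items mentioned above (start and end time, status, number of records processed) together with an expected value, and derives two simple metrics from them. The class ProcessRun, its field names and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProcessRun:
    """Process metadata for one execution of a process (illustrative only)."""
    process: str
    started: datetime
    ended: datetime
    status: str
    records_processed: int   # actual outcome
    expected_records: int    # expected value, must be greater than zero

    def duration_seconds(self):
        return (self.ended - self.started).total_seconds()

    def completeness(self):
        """Share of the expected records that were actually processed."""
        return self.records_processed / self.expected_records


run = ProcessRun(
    process="load_source_layer",
    started=datetime(2013, 9, 6, 10, 0, 0),
    ended=datetime(2013, 9, 6, 10, 5, 0),
    status="finished",
    records_processed=750,
    expected_records=1000,
)
print(run.duration_seconds(), run.completeness())
```

Because both the expected and the actual values are stored in a formalised way, a comparison between them can be computed automatically for each run and followed over time.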

1.1.3.3 Quality metadata

The term quality metadata may be read either as metadata on the quality of the data, or as metadata of (high) quality. Both interpretations are relevant to statistics production and data warehousing.

Keeping track of, maintaining and perhaps raising the quality of the data in the SDWH is an important governance task that requires support from metadata. Quality information should be available in different forms and serve several purposes: to describe the quality achieved (e.g. how a survey was carried out, or what the outcome was), or to measure the outcome (a contribution to the process metadata). The main objective of the former is to serve the end users of the data, while the latter primarily supports governance and future improvements.

Most quality metadata can be categorised as passive, free-form and reference metadata.

[Def 3.3] Quality metadata are any kind of metadata that contribute to the description or interpretation of the quality of data.
Examples: Quality declarations for a survey, a census or a register (passive, free-form, reference); documentation of methods that were used during a survey (passive, free-form, reference); most log lists (passive, formalised, reference/structural)

Metadata quality is obviously a very important issue, and it should be high, within the restrictions of a reasonable cost-benefit analysis. Inferior metadata quality may lead to unnecessary misinterpretations of the data contents or even result in completely useless data. A detailed discussion and recommendations on metadata quality can be found in Recommendations on the Impact of Metadata Quality in the Statistical Data Warehouse[9] (Deliverable 1.2).