SDMX Data Modeling Tutorial

Handbook for statisticians on how to use SDMX

Contents

1. Introduction

2. Microdata and macrodata, the statistical characteristic

2.1 Microdata and Macrodata

2.2 Statistical characteristic and object characteristic

3. The SDMX Information Model – principles

3.1 Reference metadata and structural metadata in SDMX

4. Comparison between SDMX and Statistical concepts

4.1 Data Structure Definition and related artefacts

4.2 Statistical concepts

4.3 Statistics and SDMX

5. Modelling data

5.1 From a statistical multi-dimensional table to a DSD

5.2 From a statistical survey to a Data Structure Definition

5.2.1 Definition of concepts

5.2.2 Measure

5.2.3 Dimensions

5.2.4 Attributes

5.2.5 Code lists

5.2.6 Definition of DSD and mapping of data

6. SDMX in practice

7. Conclusion

8. Appendix

Appendix A – Bibliography

1. Introduction

SDMX supports the statistical process not only in the dissemination phase, through the definition of standard data and metadata messages, but also in the data modelling phase, through guidelines for the definition of the metadata that define data (structural metadata) [1]. The main aim of this handbook is to introduce statisticians to the modelling of aggregated statistical data using SDMX.

This handbook is aimed at people working in production services, but it is also intended for metadata experts who are responsible for the variable and classification definitions used for data dissemination. It is also addressed to readers who already have some knowledge of the SDMX standard and want a hands-on approach to data modelling. Any in-depth investigation of SDMX is left to the official documentation.

The reader will learn how to model data through a practical survey example: the Italian Labour Force Survey.

Chapter 2 contains two extracts about “The statistical business process”.

Chapter 3 describes some artefacts from the SDMX Information Model.

Chapter 4 represents the “bridge” between a statistical cube and SDMX.

Chapter 5 shows the modelling of data using the Istat Labour Force Survey.

Chapter 6 shows some tools that can be useful in this context.

2. Microdata and macrodata, the statistical characteristic

This chapter quotes extracts from two documents about data modelling and the statistical business architecture. It should be read as an introduction to the statistical characteristics of aggregated data.

2.1 Microdata and Macrodata

“[…] Microdata are data about individual objects (persons, companies, events, transactions, etc). Objects have properties which are often expressed as values of variables of the objects. For example, a ”person” object may have values of variables such as ”name”, ”address”, ”age”, ”income”. Microdata represent observed or derived values of certain variables for certain objects.

Macrodata, "statistics", are estimated values of statistical characteristics concerning sets of objects, "populations". A statistical characteristic is a measure that summarizes the values of a certain variable of the objects in a population. ”The average age of persons living in OECD countries” is an example of a statistical characteristic. Some statistical characteristics, e.g. correlations, summarise the values of more than one variable. Macrodata represent estimated values of statistical characteristics. Estimated values deviate from true values because of different imperfections (errors and uncertainties) in the underlying observation (measurement) and derivation processes. The difference between ”estimated” and ”true” values is an issue not only on the macro level, but also on the micro level, since the observed (measured) values deviate from the true values because of measurement errors.

Statistical metadata are data describing different quality aspects of statistical data, e.g.

  • contents aspects, describing definitions of objects, populations, variables, etc;
  • accuracy aspects, describing different kinds of deviations between observed/estimated and true values of variables and statistical characteristics;
  • availability aspects, describing which statistical data are available, where they are located, and how they can be accessed.” (United Nations – 1999)

2.2 Statistical characteristic and object characteristic

“[…]A statistical characteristic is defined by a triple <O, V, f>

where

  • O is a set of objects (or object vectors), called a population;
  • V is a variable (or a vector of variables) having values for the objects in the population;
  • f is an operator, called a statistical measure, producing a value f(O, V) for the population from the values of the variables for the objects in the population.

Typical examples of statistical measures are frequency count, sum, average, and variance.

The population is often structured into subpopulations, for which estimates are produced as well.

Time usually plays an important role in the definition of a statistical characteristic. The population is often defined as the set of objects of a certain type, having a certain property (or combination of properties) in common at a certain point of time. Alternatively, the population can be defined as the set of objects of a certain type that have been born, lived, or died during a certain time period, e.g. the events of a certain type that have occurred during a certain year, or the processes of a certain kind that have started, been on-going, or stopped during a certain month.

The variable V must usually be qualified by a time parameter, too, in order to ensure that every object in the population is associated with a unique value (or set of values, in the case of multivalued variables). If V is a set of variables, all the variables may be separately qualified by (possibly different) time parameters.

Some examples of statistical characteristics:

  • the number of people living in Canada at the end of 1996;
  • the average income of people living in France at the end of 1996;
  • the total value in current US dollars of the production of commodities in the United States during the first quarter of 1996;
  • the number of road accidents that have occurred in Germany during 1996;
  • the average length of hospital treatments that was on-going in Holland during (at least) some part of 1992;
  • the average percentage increase/decrease between 1995 and 1996 in the annual income of people living in Sweden during the whole of the two-year period 1995-1996.

The (true) value of a statistical characteristic is derived (by means of an aggregation process) from the (true) values of one or more sets of object characteristics. The estimated value of a statistical characteristic is derived (by means of another aggregation process, called the estimation procedure) from the observed values of (possibly the same) sets of object characteristics, the so-called observation characteristics.

An object characteristic is defined by an ordered pair <O, V>

where

  • O is a set of objects (or object vectors), called a population;
  • V is a variable (or an object relation), having values for the objects in the population.

Time plays a similar role in the definition of an object characteristic as it does for the definition of a statistical characteristic.

Each object (or object vector) in the population is associated with one instance of the object characteristic. At any particular time t, each object (or object vector) in the population is associated with a unique value of V (or with a unique set of values of V in the case of multivalued variables).” (United Nations – 1995)
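The two definitions quoted above can be illustrated with a minimal sketch, which is not part of the quoted text. It reads the object characteristic <O, V> as the values of a variable over a population, and the statistical characteristic <O, V, f> as a measure applied to those values. The records, field names and helper functions are hypothetical and serve only as an illustration.

```python
from statistics import mean

# Hypothetical microdata: one record per object (person), invented for illustration.
persons = [
    {"id": 1, "country": "FR", "income": 20000},
    {"id": 2, "country": "FR", "income": 30000},
    {"id": 3, "country": "SE", "income": 40000},
]


def object_characteristic(population, variable):
    """<O, V>: the value of variable V for every object in population O."""
    return [obj[variable] for obj in population]


def statistical_characteristic(population, variable, measure):
    """<O, V, f>: apply the statistical measure f to the values of V over O."""
    return measure(object_characteristic(population, variable))


# O = persons living in France (a subpopulation), V = income, f = average
france = [p for p in persons if p["country"] == "FR"]
average_income_fr = statistical_characteristic(france, "income", mean)

# O = all persons, V = id, f = frequency count
number_of_persons = statistical_characteristic(persons, "id", len)
```

In this reading, the list `persons` plays the role of microdata, while `average_income_fr` and `number_of_persons` are macrodata. The time qualification of O and V discussed in the quotation is left out for brevity.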

3. The SDMX Information Model – principles

The SDMX Information Model (SDMX-IM) provides a broad set of formal objects to model statistical data and metadata. Figure 1 shows the high level SDMX artefacts detailed in the SDMX-IM.

Figure 1- SDMX-IM

  • A Data (metadata) Provider is a statistical organization that provides data and metadata to other organizations that act as Data (metadata) Collectors;
  • The exchange is often based on an agreement (Provision Agreement) between the Provider and the Collector. A Provision Agreement specifies which data (metadata) set has to be exchanged between the Data Provider and the Data Collector, and when;
  • The Data (Metadata) Structure Definition (DSD/MSD) specifies a set of concepts which describe and identify a set of data (metadata);
  • The Data and Metadata flows represent the containers defined by the DSD or MSD for the sets of data and metadata (Data Set and Metadata Set);
  • The Category and Category Scheme represent a way of grouping Data Sets in a common subject theme.

This tutorial focuses on how to build a Data Structure Definition according to the SDMX description of aggregated statistical data. The Metadata Structure Definition is therefore left to other manuals and tutorials, which can be found on the SDMX.org website.

3.1 Reference metadata and structural metadata in SDMX

In SDMX, we can find structural metadata and reference metadata. Figure 2 shows the relationship between them and the Data or Metadata Set.

Figure 2- Relationship between DSD & MSD

A DSD describes the information structure within a specific statistical domain, thus allowing a complete description of a set of data if all values are given. A limited number of specific concepts are needed for DSDs to function properly.

An MSD describes how a metadata set is organized. In particular, it defines which reference metadata are being compiled, how the concepts are related to each other, how they are represented (either as free text or coded values) and with which object types (agencies, data flows, data providers, etc.) they are associated.

Structural metadata act as identifiers and descriptors of data sets and reference metadata sets. Therefore, DSDs and MSDs consist of structural metadata.

Reference (or explanatory) metadata describe the contents and the quality of the statistical data: conceptual metadata, describing the concepts used and their practical implementation; methodological metadata, describing the methods used to generate the data (e.g. sampling, collection methods, editing processes); and quality metadata, describing the different quality dimensions of the resulting statistics (e.g. timeliness, accuracy).

4. Comparison between SDMX and Statistical concepts

4.1 Data Structure Definition and related artefacts

The Data Structure Definition is “a set of concepts which describe and identify a set of data. It tells which concepts are dimensions and which are attributes, and it gives the attachment level for each of these concepts, based on the packaging structure (Data Set, Group, Series, Observation) as well as their status (mandatory versus conditional). It also specifies which code lists provide possible values for the dimensions, as well as the possible values for the attributes, either as code lists or free text fields […]” (SDMX Initiative – 2005).

Figure 3 shows the relationship between DSD, Concepts, Codelists, Dimensions, Attributes and Measures.

Figure 3- DSD artefacts

Concepts are the SDMX artefacts necessary to interpret the data. They can be divided into:

  • Dimensions, which represent the concepts that identify and describe the data;
  • Attributes, which represent statistical concepts providing qualitative information about a specific statistical object such as a data set, observation, data provider, or dataflow. Concepts such as units, currency of denomination, observation status, titles and methodological comments can be used as attributes in the context of an agreed data exchange (from MCV[2]);
  • Measures, which represent the measure of the phenomenon or phenomena.

Code lists represent a collection of items that provide the possible values for concepts. A code list is characterized by a set of codes and their descriptions. In a hierarchical code list, a code can also be associated with another code of the same code list (its parent code) to establish a simple hierarchy.
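As a minimal sketch of the description above, a code list can be held as a set of codes with descriptions and an optional parent code. The class names, the code identifiers (`IT`, `ITN`, ...) and the assumption that the territorial codes form a hierarchy are all hypothetical; this is not the SDMX-IM class model itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Code:
    id: str
    description: str
    parent: Optional[str] = None  # id of the parent code in the same code list


@dataclass
class CodeList:
    id: str
    codes: Dict[str, Code] = field(default_factory=dict)

    def add(self, code_id: str, description: str, parent: Optional[str] = None) -> None:
        # A parent must already belong to the same code list.
        if parent is not None and parent not in self.codes:
            raise ValueError(f"unknown parent code: {parent}")
        self.codes[code_id] = Code(code_id, description, parent)

    def ancestors(self, code_id: str) -> List[str]:
        """Return the chain of parent ids, from the direct parent up to the root."""
        chain = []
        parent = self.codes[code_id].parent
        while parent is not None:
            chain.append(parent)
            parent = self.codes[parent].parent
        return chain


# Hypothetical territorial code list; the ids are invented for this example.
cl_geo = CodeList("CL_GEO")
cl_geo.add("IT", "Italy")
cl_geo.add("ITN", "North", parent="IT")
cl_geo.add("ITC", "Centre", parent="IT")
cl_geo.add("ITS", "South", parent="IT")
```

Calling `cl_geo.ancestors("ITN")` walks up the hierarchy from a code to the root, which is the kind of navigation a simple parent-code hierarchy allows.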

4.2 Statistical concepts

How does SDMX fit with statistics? The following list is only a rule of thumb, but it can help statisticians to better organize all the elements that define a table.

Among the concepts, it is necessary to include these sets of elements, which have already been introduced in par. 2.2:

  • statistical variables: they include those variables to which the operator is applied (named V in par. 2.2);
  • statistical measures/operator (named f in par. 2.2): this set of concepts should include the statistical operators (average, total, count, index number, …) and their possible characteristics (base year for the index number, adjustment for those data that can be seasonally adjusted, …). Furthermore, this set could also include those concepts which are used to represent (number of decimals, unit multiplier, …), measure (unit of measure, …) and disseminate (definitive data or not, estimated data or not, forecasted data or not, …) data;
  • statistical population (named O in par. 2.2): this set describes the reference group of elements over which the statistical variables are observed and the statistical operator computes the corresponding data.

4.3 Statistics and SDMX

How can SDMX be interpreted in a statistical sense for aggregated data? Where can we position the notions of “population”, “statistical variable”, “statistical measure/operator” in a DSD?

  • the variables (V) should always be used as dimensions or, when combined with an operator, as elements of a measure dimension. In statistics, they are the dimensions of a multivariate distribution;
  • the statistical operator (f) always describes data measures. Therefore, it is useful to describe it (together with the variable to which it is applied) as a measure dimension (if not as a primary measure); sometimes the statistical operator can be defined as either a dimension or an attribute;
  • the concepts that are used to better define the statistical operator can be either dimensions or attributes, depending on their role in the definition of data;
  • the population of interest (O) is often a neglected element in a DSD. However, it can find a place among the attributes.

The comparison between statistical characteristics and SDMX artefacts is summarized in the table below:

Statistical concepts / id / description / SDMX concepts
variables / V / phenomena investigated on the population of interest / Dimensions
statistical operator / f / statistical operators such as average, total, index number, etc. / Dimensions or Attributes
statistical operator applied to the variable and to the population of interest / f(O, V) / average of hours worked, total number of employees, index number of industrial production, etc. / Measure Dimension
characteristics of statistical operator / - / characteristics of the statistical operator such as base year, adjustment, etc. / Dimensions or Attributes
population of interest / O / reference group of elements over which the statistical variables are observed and the statistical operator is computed / can be represented by an attribute, but is not always defined explicitly in a DSD

Table 1 – Comparison between statistical characteristics and SDMX artefacts

5. Modelling data

How can data be modelled in SDMX? Where can the SDMX artefacts be positioned with respect to the statistical notions of “population”, “statistical variable” and “statistical measure/operator”?

We will answer these questions using the Istat Labour Force Survey. From this survey, Istat derives its official estimates of the number of employed persons and job-seekers, as well as information about the main labour supply aggregates, such as occupation, economic activity area, hours worked, contract types and duration, and training.

This survey was chosen because it has been in line with European Union regulations since the beginning of 2004. It can therefore be considered a survey with characteristics common to every ESS country.

Before going deeper into data modelling, there is another important question to address: when should data be modelled?

The most suitable phase in which to model data for dissemination is the microdata aggregation phase. With reference to the GSBPM (Generic Statistical Business Process Model), this corresponds to phase 6 “Analyze” (see Appendix). However, in some cases dissemination starts before the SDMX modelling. When this happens, the SDMX modelling is applied to disseminated static tables (e.g. Excel tables) or to data in a data warehouse. This can be a heavier process that hampers both vertical (inside the same survey) and horizontal (among different surveys) metadata harmonization.

In this chapter, two different approaches are shown. The former deals with the modelling of a multi-dimensional table, taken from the Istat dissemination web site (after the dissemination process). The latter directly models the statistical variables as they result from the aggregation phase of the survey.

5.1 From a statistical multi-dimensional table to a DSD

The table below, taken from the Italian corporate data warehouse (Istat), can be used as an example. It shows the number of employed persons aged 15 years and over, classified by territory, gender, age class and Nace 2007.

Figure 4 – Multi-dimensional table

This table can be described by pointing out that:

  • the population (O) under investigation in the table consists of the employed persons, in June 2012;
  • the variables (V) we are interested in are: territory, gender, age class, Nace 2007;
  • the statistical measure/operator (f) applied to the population of interest is given by the “Data type” variable (which, in this case, corresponds to the count “number of total employed person”);
  • the characteristics of the statistical operator (frequency, unit of measure, data adjustment) are not explicitly declared as specific dimensions/attributes in this table, but are included both in the dataset title (seasonally adjusted data) and in the description of the “Data type” variable (as the unit multiplier, in this case “thousand”);
  • There are two other dimensions that have to be taken into consideration: Time and Frequency of data release. Both these dimensions will allow us to represent the SDMX model of this hypercube as a time series.

As described in chapter 3, SDMX uses different artefacts in order to describe statistical data. The main ones are:

  • concepts: i.e. all those elements that are used in order to define, interpret and measure the statistical data in a table;
  • code lists: i.e. the way in which a concept can be associated to every single datum in order to describe it;
  • measure: for those concepts that describe what the figures in the table are measuring;
  • dimension: for those concepts that identify each cell in the table;
  • attribute: for those concepts that allow the correct interpretation of a figure in a cell; a change in the value of an attribute does not modify the definition of that cell.

These artefacts constitute the Data Structure Definition.
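The distinction between dimensions and attributes can be made concrete with a small sketch. All concept identifiers, codes and the observation value below are hypothetical placeholders, not values taken from the Istat table; the only point illustrated is that a cell is identified by its dimension values, while attributes qualify the figure without changing which cell it is.

```python
# Hypothetical concept ids and codes; the observation value is a placeholder.
DIMENSIONS = ("FREQ", "GEO", "SEX", "AGE", "NACE", "TIME_PERIOD")


def cell_key(observation):
    """Identify a cell by the values of its dimensions only."""
    return tuple(observation["dimensions"][d] for d in DIMENSIONS)


obs = {
    "dimensions": {
        "FREQ": "Q",
        "GEO": "IT",
        "SEX": "T",
        "AGE": "Y_GE15",
        "NACE": "TOTAL",
        "TIME_PERIOD": "2012-Q2",
    },
    "attributes": {"OBS_STATUS": "provisional"},
    "measure": 100.0,
}

# Changing an attribute changes how the figure is read, not which cell it is.
revised = {**obs, "attributes": {"OBS_STATUS": "final"}}
same_cell = cell_key(obs) == cell_key(revised)

# Changing a dimension value identifies a different cell.
other = {**obs, "dimensions": {**obs["dimensions"], "SEX": "F"}}
different_cell = cell_key(obs) != cell_key(other)
```

A practical test when classifying a concept is therefore to ask whether two figures that differ only in that concept should occupy two distinct cells (dimension) or the same cell (attribute).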

As far as Figure 4 is concerned, SDMX would consider (in a very simplified setting) the following model:

The concepts: territory, gender, age class, Nace 2007, frequency, time dimension, measure, adjustment, unit multiplier and “Data type”.

The corresponding code lists:

  • CL_NACE_REV2_CODE=total; agriculture, forestry and fishing; total industry excluding construction (b to e); construction; total services (g to u).
  • CL_AGE=15 years and over.
  • CL_SEX=male, female, total.
  • CL_GEO=Italy, north, centre, south.
  • CL_OPERATOR=total number.
  • CL_FREQUENCY=quarterly.
  • CL_UNIT=absolute values.
  • CL_UNIT_MULTIPLIER=thousand.

The DSD is the set of previously defined concepts, each associated with its corresponding code list (for coded concepts) and its respective role.
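A rough sketch of this idea follows: each component of the structure pairs a concept with a role and, for coded concepts, a code list. Several things here are assumptions made for illustration only: the concept identifiers, the role given to each concept, the use of the code list labels themselves as codes, the treatment of attributes as optional, and the sample record (including its time period and value). Only a subset of the concepts listed above is included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    concept: str
    role: str                       # "dimension", "attribute" or "measure"
    codelist: Optional[str] = None  # None for uncoded concepts


# Code lists from the text; the labels are used directly as codes (assumption).
CODELISTS = {
    "CL_GEO": {"Italy", "north", "centre", "south"},
    "CL_SEX": {"male", "female", "total"},
    "CL_AGE": {"15 years and over"},
    "CL_NACE_REV2_CODE": {
        "total",
        "agriculture, forestry and fishing",
        "total industry excluding construction (b to e)",
        "construction",
        "total services (g to u)",
    },
    "CL_OPERATOR": {"total number"},
    "CL_FREQUENCY": {"quarterly"},
    "CL_UNIT_MULTIPLIER": {"thousand"},
}

# Hypothetical role assignment for the table of Figure 4.
DSD = [
    Component("TERRITORY", "dimension", "CL_GEO"),
    Component("GENDER", "dimension", "CL_SEX"),
    Component("AGE_CLASS", "dimension", "CL_AGE"),
    Component("NACE_2007", "dimension", "CL_NACE_REV2_CODE"),
    Component("FREQUENCY", "dimension", "CL_FREQUENCY"),
    Component("TIME", "dimension"),
    Component("DATA_TYPE", "dimension", "CL_OPERATOR"),
    Component("UNIT_MULTIPLIER", "attribute", "CL_UNIT_MULTIPLIER"),
    Component("OBS_VALUE", "measure"),
]


def validate(record):
    """Return the problems found when mapping a record onto the DSD."""
    problems = []
    for c in DSD:
        if c.concept not in record:
            # Attributes are treated as optional in this sketch.
            if c.role != "attribute":
                problems.append(f"missing {c.role}: {c.concept}")
            continue
        value = record[c.concept]
        if c.codelist is not None and value not in CODELISTS[c.codelist]:
            problems.append(f"{c.concept}: value {value!r} not in {c.codelist}")
    return problems


# Placeholder record: the time period and the value are invented.
record = {
    "TERRITORY": "Italy", "GENDER": "total", "AGE_CLASS": "15 years and over",
    "NACE_2007": "total", "FREQUENCY": "quarterly", "TIME": "2012-Q2",
    "DATA_TYPE": "total number", "UNIT_MULTIPLIER": "thousand", "OBS_VALUE": 1.0,
}
```

Mapping data onto the DSD then amounts to checking that every cell of the table provides a value for each dimension and that coded values belong to the declared code list; `validate(record)` returns an empty list when both conditions hold.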