Title: / Guidelines (including options) on how the BR interacts with the SDWH
WP: / 2 / Deliverable: / 2.2.1
Version: / 1 / Date: / 17-9-2012
Author: / Pieter Vlag / NSI: / Netherlands

ESS - NET

On micro data linking and data warehousing
in production of business statistics

Revision History

Date / Version / Description / Author
17 September 2012 / 1.0p1 / Input document for ESSnet DWH WP 2 deliverable to receive feedback from other WP 2 members, Statistics Finland and the ESSnet EuroGroup Registers / Pieter Vlag

Distribution

Date / Version / Distributed
17 September 2012 / 1.0p1 / Jurga Ruksenaite, Sami Saarikivi, Harrie van der Ven

1. Purpose of the document

1.1 Purpose of the document

The main goal of the ESSnet on “micro data linking and data warehousing” is to prepare recommendations about better use of data that already exist in the statistical system. Its ultimate aim is:

‘To create fully integrated data sets for enterprise and trade statistics at micro level: a 'data warehouse' approach to statistics.’

The broad definition of a data warehouse to be used in this ESSnet is therefore:

‘A common conceptual model for managing all available data of interest, enabling the NSI to (re)use this data to create new data/new outputs, to produce the necessary information and perform reporting and analysis, regardless of the data’s source.’

Within this ESSnet one work package (WP 2) covers all essential methodological elements for designing, building and implementing the statistical data warehouse. One of its major issues is to provide guidelines about the role of the Business Register (BR) in a Statistical Data Warehouse, because it is generally recognized that in a Data Warehouse system the Business Register has a crucial role in

a) linking data from several sources

b) defining the (enterprise) population for all statistical output.

Within this context the purpose of this document is:

  • To present basic principles about the role of the Business Register in a Statistical Data Warehouse.
  • To present guidelines about the methodological challenges that are faced when incorporating information from the SBR in a Statistical Data Warehouse (SDWH).

The basic principles are described in chapter 2 of this document. The methodological challenges are described in chapter 3.

1.2 Scope

The production of the BR, and its input data, may differ per NSI. Related to this, its position as a co-ordinating frame for producing consistent statistical output may currently differ per NSI, too. Furthermore, the role of the Business Register in producing uniform output is also discussed in the European EGR and CORE projects. Therefore, the aim of this draft document is to find an overall strategy for the role of the Business Register in data linking and in producing coherent statistical output, despite the current differences between NSIs.

1.3 Definition of the Business Register

Member States of the European Union maintain business registers for statistical purposes as a tool for the preparation and coordination of surveys, as a source of information for the statistical analysis of the business population and its demography, for the use of administrative data, and for the identification and construction of statistical units. The Regulation (EC) No 177/2008 of the European Parliament and the Council (EC) sets out a common framework for the harmonisation of the national business registers for statistical purposes and Article 7 of the Regulation asks for the publication of a business register recommendation manual. The manual aims to explain the reasoning behind the provisions of the Regulation. It aims to provide the extra information required for the correct and consistent interpretation of the Regulation in all countries. This second edition of the manual is derived from the first one published in 2003 and replaces it. The manual has been updated in close cooperation with the Member States.

Taking into account the existence of regulations and manuals for producing the business register, it is assumed that the BR contains at least:

  • a statistical unit;
  • a name and address of the statistical unit;
  • an activity code (NACE);
  • a starting and a stopping date, both of the enterprise and of the NACE code for its activity.

The technique for calculating these fields (plus updates) may differ per country, but it does not affect the remainder of the document.
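
The minimum content assumed above can be pictured as a small record type. The following is only a sketch: the class name, the field names and the `missing_fields` helper are hypothetical and do not come from the Regulation or from the recommendations manual.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class BRRecord:
    """Minimum BR content assumed in this document (hypothetical field names)."""
    unit_id: str                            # identifier of the statistical unit
    name: str
    address: str
    nace_code: str                          # activity code (NACE)
    start_date: date                        # starting date of the enterprise
    stop_date: Optional[date] = None        # None while the enterprise has not stopped
    nace_valid_from: Optional[date] = None  # starting date of the NACE code
    nace_valid_to: Optional[date] = None    # stopping date of the NACE code


def missing_fields(record: BRRecord) -> List[str]:
    """Return the names of the text fields that are empty."""
    mandatory = ("unit_id", "name", "address", "nace_code")
    return [field for field in mandatory if not getattr(record, field)]
```

A check such as `missing_fields` could be run when information is extracted from the BR, before it is used for linking.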

Enterprise units in data sources other than surveys (such as admin data) may differ from statistical units. The same holds for the output, because some output is produced for statistical units, some for enterprise groups and some for local units. Because a main characteristic of a SDWH is that it uses different input sources and produces flexible output, it is assumed that the relationship between the different (input and output) unit types is known and is stored in a kind of unit base.

1.4 Short description of the Statistical Data Warehouse process

A schematic sketch of the SDWH is provided in figure 1. Basically, this sketch shows that the production of a Statistical Data Warehouse consists of two parts:

Part 1: Linking the different input sources (admin data, surveys etc.) to one population and processing these data into one consistent dataset (by imputing or weighting for missing data and correcting for inconsistencies between the sources). This is the so-called processing base.

Part 2: The actual Data Warehouse, from which uniform and consistent output can be derived.

Fig 1. Schematic sketch of the Statistical Data Warehouse (Vlag et al., 2011).

Antonio Lauretti Palma (2012) has, on behalf of the ESSnet DWH work packages, worked out this model into a SDWH Business Architecture. This architecture consists of 4 phases and 4 architectural layers. The main implication for the interpretation of figure 1 is that the processing part consists of 2 phases (each consisting of several sub-processes). As a consequence, the processing part may contain two data-storage points.

The position of the BR in figure 1 is sketched as follows.

In the processing base:

  • Information from the BR is, at a first stage, used as a frame to link all information from the different input sources. This step requires knowledge (and a database) about possibly different enterprise units in the different sources.
  • Information from the BR is, in a next step, a co-ordination tool for editing and weighting the variables of the different input sources. This step may reveal incorrect information in the BR, which then needs to be changed. A requirement for weighting is that the enterprise information of the BR is corrected for inactive enterprises.

In the “actual data warehouse”:

  • Information from the BR is one of the conditions for producing coherent output, because different output populations should be related. This requires a correction of the BR data for inactive enterprises and a link between the different output units (enterprise group, enterprise and local unit).

1.5 Literature

Id. / Document / Version / Date / Author
01 / S-DWH Business Architecture (ESSnet DWH) / 1.1 / 13.07.2012 / Antonio Lauretti Palma
02 / Current best practices and overview of Data Warehouse opportunities / Deliverable 2.1 ESSnet DWH-SGA2011 / 06.10.2011 / Pieter Vlag
03 / Principle document for EGR steering group / 1.0p2 / July 2012 / Harrie van der Ven

2. Basic principles

In the remainder of this document, the following basic principles are used.

  • The production of the SBR is a separate process, independent of the statistical Data Warehouse (SDWH). The SDWH only uses some specific information of the SBR, as a backbone to link various input data and produce coherent output.
  • The standard rule is that the information of the Business Register (e.g. NACE-codes) is leading in the SDWH for data-linking and the definition of the population frame.
  • One may deviate from the standard rule mentioned above only if one (or more) sources provide different and more plausible information. If SBR information is changed in the Statistical Data Warehouse (due to conflicting information), the changes should also be incorporated in the original SBR.
  • To keep the scope concise, this document focuses on annual statistics only.
  • The Business Register has, in practice, some over-coverage, because some enterprises may, due to time lags, still be in the base registers although they are no longer active. The extent of this over-coverage may differ per country and activity, but it needs to be corrected for when weighting and producing coherent output.
  • Annual statistics are published with an annual population. This means a population which includes all enterprises which are active during the year.
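
The last principle can be illustrated with a short sketch. It assumes that "active during the year" means that the period between the starting and the stopping date overlaps the calendar year for at least one day; this interpretation, the function name and the data layout are assumptions made for illustration only.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

# unit_id -> (starting date, stopping date); the stopping date is None
# while the enterprise has not stopped.
Lifespans = Dict[str, Tuple[date, Optional[date]]]


def annual_population(lifespans: Lifespans, year: int) -> List[str]:
    """Select the units whose lifespan overlaps calendar year `year`."""
    first_day, last_day = date(year, 1, 1), date(year, 12, 31)
    return sorted(
        unit_id
        for unit_id, (start, stop) in lifespans.items()
        if start <= last_day and (stop is None or stop >= first_day)
    )
```

Under this assumption an enterprise that stops in February of the reference year still belongs to the annual population, whereas an enterprise that stopped on 31 December of the previous year does not.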

3. Methodological challenges

3.1 Introduction

Based on the basic principles described in chapter 2, the following methodological challenges are discussed consecutively:

  • Use of a backbone of the BR in a Statistical Data Warehouse.
  • The construction of an annual population frame.
  • Dealing with various units in different input sources.
  • Active population and admin data covering the entire population (for most activities).
  • Rules for adjusting information of the backbone of the Statistical Data Warehouse.

3.2 Information from the Business Register

In a statistical data warehouse, information from the statistical business register is considered the authentic backbone for the business population elements and their commonly used descriptive characteristics. These characteristics can only be effective if the statistical business register is kept up to date and can register changes in the data on the business population and its accompanying characteristics.

For this purpose not all information from the SBR is needed. The information needed from the Business Register can be limited to 11 variables (fig. 2). These 11 variables are ‘extracted’ from the BR and used in the SDWH. In the remainder of this document, this ‘extracted’ information is called the population frame, because it is the basis for linking the several input sources, weighting the surveys and publishing the results.

Fig. 2. List of variables from the BR needed for a SDWH. This list agrees with the proposal of the EGR project (van der Ven, 2012).

3.2 Construction of an annual population

The following release scheme is proposed for the construction of the annual population for year T in the statistical Data Warehouse.

  • A provisional population frame for year T is constructed in November of year T-1. This population frame is used to design surveys. It is also the starting point for the SDWH. This provisional frame is called release 1; formally it does not cover the entire population of year T, as it does not yet contain the enterprises starting in year T.
  • During the year the population frame is regularly updated with new information from the BR (especially new enterprises). The frequency of these updates depends on the updates of the BR. At the end of year T (or at the beginning of year T+1), a regular population frame for year T can be constructed. This regular population frame consists of all enterprises in the year and is called release 2.
  • At the beginning of year T+1 (or later) additional admin data and survey results for year T become available. Therefore, it cannot be excluded that errors are detected in the ‘release 2’ population. For this reason, a special procedure for additional frame error corrections should be developed, and a final population frame is foreseen for July of year T+1.

This updating scheme is schematically presented in figure 3.

Fig. 3. Proposal for the construction of an annual population. Figure copied from van der Ven (2012). This figure is an example for FATS but can be generalised for the entire Data Warehouse. Please note that release 2 in this figure is skipped in the SDWH procedure.

An important characteristic of the annual population frame is that it is based on (different releases of) the BR information only and therefore only based on data and not on estimations.
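
The proposed release scheme can be summarised as a small calendar for a given reference year T. This is a sketch only: the function name is hypothetical, and labelling the final frame as "release 3" is an assumption based on the way release 3 is mentioned at the end of this document.

```python
from typing import List, Tuple


def release_calendar(year: int) -> List[Tuple[str, str, str]]:
    """Return (release, timing, content) for the population frame of `year`."""
    return [
        ("release 1", f"November {year - 1}",
         "provisional frame, used to design surveys; starting enterprises are missing"),
        ("release 2", f"end of {year} or beginning of {year + 1}",
         "regular frame with all enterprises in the year"),
        ("release 3", f"July {year + 1}",
         "final frame, after additional frame error corrections"),
    ]
```

For reference year 2012, for example, the provisional frame would be built in November 2011 and the final frame in July 2013.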

3.3 Various units

The population frame provides information about the enterprises as represented by statistical units. The exact derivation of the statistical unit may differ per country. Furthermore, in some countries statistical units correspond (in most cases) with fiscal units, but in other countries they do not. Therefore, the general strategy is that the relation between the statistical units and

a) the (enterprise) units of the different input sources

b) the units required by output obligations (enterprise group, enterprise and local unit)

should be known and stored in a separate unit base. If statistical units are equal to fiscal units, even for the biggest enterprises, 1-to-1 relationships can be stored in the unit base. This separate unit base is considered a satellite of the population frame. It covers all units in the population frame and is linked to the population frame by the statistical unit (fig. 4).

The main reasons for storing the unit base in a satellite instead of in the population frame itself are:

  • It is quite specific information. To derive this information other input sources than those used for the production of the BR are needed.
  • If microdata are only available for larger units than the units used for output obligations, a ‘division module’ is needed. This is for example the case for employment data that are available at enterprise level (= statistical unit), while publications by region require the smaller local (enterprise) unit level. This ‘division module’ needs some statistical estimation and should therefore be separated from the enterprise population information.

Preferably the unit base is updated with a similar periodicity as the releases of the population frame.
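
To make the unit base and the ‘division module’ more concrete, a minimal sketch follows. The data layout and the function names are hypothetical, and the proportional split is only one possible estimation method, chosen here for illustration; the document itself does not prescribe a method. The allocation key (for example a size indicator per local unit) is also an assumption.

```python
from typing import Dict, List

# Hypothetical unit base: statistical unit -> related units per unit type.
UnitBase = Dict[str, Dict[str, List[str]]]


def related_units(unit_base: UnitBase, statistical_unit: str, unit_type: str) -> List[str]:
    """Look up the units of one type that belong to a statistical unit."""
    return unit_base.get(statistical_unit, {}).get(unit_type, [])


def divide_over_local_units(total: float, key: Dict[str, float]) -> Dict[str, float]:
    """Split an enterprise-level total over local units, proportionally to `key`."""
    denominator = sum(key.values())
    if denominator <= 0:
        raise ValueError("the allocation key must sum to a positive number")
    return {unit: total * value / denominator for unit, value in key.items()}
```

With an enterprise-level employment of 120 and a key of 3 to 1 for two local units, the split gives 90 and 30. Because the result is an estimate, it is kept outside the population frame, in line with the reasons given above.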

3.4 Admin data covering the entire population and determination whether enterprises are active

Theoretically, annual VAT and social security (employment) data cover all enterprises for most activities. Therefore, these two administrative data sources are considered as satellites and linked to the population frame. The reason is two-fold:

  • The quasi-complete VAT and social security registers may provide good estimates of key variables like employment or turnover. As the information is quasi-complete, these totals can be published at a detailed level and can, by using specific weighting techniques, be used as ‘cornerstones’ for other variables derived from small surveys.
  • The quasi-complete VAT and social security registers reveal whether an enterprise is still active or not. For example, if an enterprise is still included in the (information derived from the) BR but no longer reports VAT or employment, one may assume that the enterprise is not active anymore. Certain rules about this have been developed by the ESSnet AdminData.

The latter point implies that the combination of a population frame derived from information of the BR and the use of VAT and ‘social security’ admin data to check the activity status leads to an active population frame. The latter should be used to publish coherent statistical output.
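
The derivation of an active population frame can be sketched as follows. The rule used here, that a unit is active when it reports in at least one of the two registers, is a simplifying assumption for illustration; it does not reproduce the rules developed by the ESSnet AdminData. The function and parameter names are hypothetical.

```python
from typing import Iterable, List, Set


def active_population(
    frame_units: Iterable[str],
    vat_reporters: Set[str],
    employment_reporters: Set[str],
) -> List[str]:
    """Keep the frame units that report VAT or employment in the reference period."""
    reporting = vat_reporters | employment_reporters
    return [unit for unit in frame_units if unit in reporting]
```

Units that report in an admin source but are absent from the population frame are ignored here, because the frame derived from the BR remains leading for the definition of the population.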

Preferably the updates of the active population frame are carried out with the same periodicity as the updates of the population frame. Therefore, it is proposed to upload new data from the VAT and ‘social security’ registrations with the same frequency as the releases of the BR information. Note that for determining the active population frame around release 1 ‘historical’ admin data have to be used, and for release 2 incomplete annual data. Hence, the estimate of the active population is subject to error. This is the main reason to separate the determination of the active population from the pure BR information and to use the satellite information from the admin data for it.

3.5 Adjusting information of the backbone of the Statistical Data Warehouse

By using information from the BR as a backbone for the population in a SDWH and combining this with satellites containing quasi-complete information about enterprise units, turnover and employment, the basic condition for producing coherent output is fulfilled. Other input data covering only part of the enterprises (surveys, admin data for specific enterprises) can then be used, because they can be linked to an active population with known key variables (which makes it possible to detect non-representativity).
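
As an illustration of how a quasi-complete key variable can serve as a ‘cornerstone’ for a variable observed only in a survey, a simple ratio estimator is sketched below. This estimator is a stand-in chosen for illustration and is not the specific weighting technique referred to in this document; all names are hypothetical.

```python
from typing import Dict


def ratio_estimate(survey_values: Dict[str, float], turnover: Dict[str, float]) -> float:
    """Estimate a population total from survey values, scaled by register turnover.

    `turnover` holds the register turnover of every unit in the active
    population; `survey_values` holds the target variable for the surveyed units.
    """
    sample_turnover = sum(turnover[unit] for unit in survey_values)
    if sample_turnover <= 0:
        raise ValueError("the surveyed units must have positive total turnover")
    population_turnover = sum(turnover.values())
    return sum(survey_values.values()) * population_turnover / sample_turnover
```

If the surveyed units together hold 400 of a total register turnover of 1000 and report a survey total of 40, the estimate for the whole population is 100.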

From the methodological point of view, the coherent weighting techniques needed to link the estimates from the other data sources with the backbone and satellites are not always straightforward, but this can be resolved. A more fundamental problem arises when the information from the input sources is conflicting due to errors in one of the data sources, especially when this leads to incoherent outcomes. In that case some data sources need to be changed, based on fact-finding at micro level or on decisions about the generally considered reliability of one specific source. Conflicting information between the sources may affect the backbone on two specific points:

  • NACE codes, which appear to be incorrect when confronted with other data sources.
  • Size classes, which appear to be incorrect when confronted with other data sources.

If necessary, this may lead to changes in NACE code and size class in the population frame (= backbone). Such changes lead to adapted weighting procedures for survey data covering only a part of the population and require feedback to the original BR from which the population frame is derived.

As this is quite complex, it is recommended that such changes are kept to a minimum and are queued until a single moment (release 3).
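
The queuing of backbone changes can be sketched as follows. The sketch assumes that only NACE code and size class may be adjusted and that every change actually applied is also returned as feedback for the original BR; the class, the function and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Frame = Dict[str, Dict[str, str]]  # unit_id -> {"nace_code": ..., "size_class": ...}
ADJUSTABLE_FIELDS = ("nace_code", "size_class")


@dataclass(frozen=True)
class FrameChange:
    unit_id: str
    field: str      # one of ADJUSTABLE_FIELDS
    new_value: str
    reason: str     # e.g. the source that provided the more plausible value


def apply_queued_changes(frame: Frame, queue: List[FrameChange]) -> Tuple[Frame, List[FrameChange]]:
    """Apply all queued changes at once; return the new frame and the BR feedback."""
    corrected = {unit: dict(attrs) for unit, attrs in frame.items()}  # leave input untouched
    feedback: List[FrameChange] = []
    for change in queue:
        if change.field not in ADJUSTABLE_FIELDS:
            raise ValueError(f"field not adjustable: {change.field}")
        if corrected[change.unit_id][change.field] != change.new_value:
            corrected[change.unit_id][change.field] = change.new_value
            feedback.append(change)
    return corrected, feedback
```

Collecting the changes in a queue and applying them in one step keeps the earlier releases of the population frame unchanged and yields a single list of corrections to report back to the BR.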