Centraal Bureau voor de Statistiek

Detecting and correcting data linkage errors

Arnout van Delden and Jeffrey Hoogland

Summary: In subtask 1.3 of WP 2 we concentrate on methodological developments to improve the accuracy of a data set obtained by record linkage. Record linkage is in general not a flawless process; we have to reckon with the possibilities of incorrect or missed matches. When variables are added to a unit by linking it to another source, the effect of incorrect linkage is that the additional variables are from another unit than they are considered to be and as a consequence the additional variables will, in general, have incorrect values. In the editing process of data from combined sources it is therefore important to devise edit rules that can detect linkage errors. We show examples of linkage errors and formulate methods to detect these errors.

Keywords: Linkage error, edit rules.

1. Background

1.1 Introduction

A variety of economic data is available, collected either by statistical agencies or by other public agencies. Combining those data at micro level is attractive, as it offers the possibility to look at relations and correlations between variables and to publish outcomes of variables classified according to small strata. National statistical institutes (NSI’s) are interested in increasing the use of administrative data and reducing the use of primary data, because population parameters can be estimated from nearly integral data and because primary data collection is expensive.

The economic data sources collected by different agencies are usually based on different unit types. These different unit types complicate the combination of sources to produce economic statistics. Two papers, the current paper and Van Delden and van Bemmel (2011), deal with methodology that is related to those different unit types. Both papers deal with a Dutch case study in which we estimate quarterly and yearly turnover, where we use VAT data for the simpler companies[1] and survey data for the more complicated ones.

Handling different unit types starts with the construction of a general business register (GBR) that contains an enumeration of the different unit types and the relations between the unit types. From this GBR the population of units that is active during a certain period is derived, denoted by the volume GBR. Next, different sources may be linked to the volume GBR. The current paper by Van Delden and Hoogland (2011) focuses on methodology to detect and correct errors in the relations between unit types in the volume GBR and errors in the linkage between the source-units and the GBR units.

In the Dutch case study, after linkage, we handle differences in definitions of variables and completion of the data. After both steps population parameters are computed. Both steps are treated by Van Delden and van Bemmel (2011) and resemble micro integration steps as described by Bakker (2011). After computation of population parameters, an additional step of detecting and correcting frame and linkage errors is done, as treated in the current paper.

In a next step, the yearly turnover data are combined at micro level (enterprise) with numerous survey variables collected for the Structural Business Statistics. The paper by Pannekoek (2011) describes how to achieve numerical consistency at micro level between some core variables obtained from register data and variables collected by survey. Examples of such core variables in economic statistics are turnover and wages. There are also other European countries that estimate such a core variable, e.g. turnover, from a combination of primary and secondary data. Total turnover and wage sums are central to the estimation of the gross domestic product, from the production and the income side respectively.

Because the current paper and Van Delden and van Bemmel (2011) share the same background, the current section 1.1 and sections 1.2 and 2 are nearly the same in both papers.

1.2 Problem of unit types in economic statistics

The different unit types in different source types in economic statistics complicate their linkage and subsequent micro-integration. When a company starts, it registers at the chamber of commerce (COC). This results in a so-called “legal unit”. Next, the government raises different types of taxes (value added tax, corporate tax, income tax) from these “companies”. Depending on the tax legislation of the country, the corresponding tax units may differ from the legal units of the COC, and they may also differ for each type of tax. Finally, Eurostat (EC, 1993) has defined different statistical unit types (local kind of activity unit, enterprise, enterprise group) which are combinations of legal units according to certain rules. For the more complicated “companies” the statistical units often differ from the tax units.

In the end, the set of unit types of “companies” may be somewhat different for each country. Generally speaking, in all countries the more complicated “companies” have a 1:n relationship between the enterprise and the legal unit. In some countries, like France, there is a 1:1 relationship between legal units and tax units, so tax units can be uniquely related to statistical units. In the Netherlands, however, units that declare tax may be groupings of legal units that belong to different enterprises (Vaasen and Beuken, 2009). Likewise, in Germany, tax units may declare turnover for a set of enterprises (Wagner, 2004). As a consequence, at least in the Netherlands and Germany, tax units of the more complex companies may be related to more than one enterprise.
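The consequence described above can be illustrated with a minimal Python sketch that flags tax units related, via their legal units, to more than one enterprise. The identifiers and the two mapping tables are hypothetical and serve only as an illustration; they do not reflect the structure of any actual register.

```python
from collections import defaultdict


def enterprises_per_tax_unit(tax_to_legal, legal_to_enterprise):
    """Map each tax unit to the set of enterprises reachable via its legal units."""
    result = defaultdict(set)
    for tax_unit, legal_units in tax_to_legal.items():
        for legal_unit in legal_units:
            enterprise = legal_to_enterprise.get(legal_unit)
            if enterprise is not None:
                result[tax_unit].add(enterprise)
    return dict(result)


def ambiguous_tax_units(tax_to_legal, legal_to_enterprise):
    """Return the tax units that are related to more than one enterprise."""
    linked = enterprises_per_tax_unit(tax_to_legal, legal_to_enterprise)
    return sorted(t for t, ents in linked.items() if len(ents) > 1)


# Hypothetical example: T2 groups two legal units of different enterprises.
tax_to_legal = {"T1": ["L1"], "T2": ["L2", "L3"]}
legal_to_enterprise = {"L1": "E1", "L2": "E2", "L3": "E3"}
print(ambiguous_tax_units(tax_to_legal, legal_to_enterprise))
```

For tax units flagged in this way, a declared amount cannot be assigned to a single enterprise without further information.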

1.3 General Business Register

NSI’s use a GBR that contains an enumeration of all companies in terms of the statistical units, derived from legal units. The GBR contains the starting and ending dates of companies. Additional variables are size class and economic activity (NACE code). In 2008, Eurostat renewed its regulation on a business register (EC, 2008) in order to harmonise outcomes over different European countries. NSI’s also use a GBR to harmonise outcomes over different economic statistics within an NSI. According to EC (2008) a GBR should contain statistical units (enterprises and their local units, enterprise groups) and the legal units of which those enterprises consist. In addition, in the Netherlands (and at other NSI’s) the relations with the administrative units of the tax office have also been added, in order to be able to use tax office data for statistical purposes.

1.4 Problem description

Within the GBR, errors may occur in the relations between the unit types in the frame, which is explained in more detail in section 3. Also, linkage errors may occur between units for which observations are available and the volume GBR, for example due to time delays. The focus of the current paper is to describe methodology to detect and correct errors in the relations between unit types (within the GBR) and linkage errors between observations and the GBR.

1.5 Outline of the paper

The remainder of the paper is organised as follows. Section 2 describes the Dutch case study. Section 3 gives a classification of the errors that are considered in the current paper. In section 4 we describe the strategy for detecting and correcting the errors. Section 5 gives an example of a preliminary test on the effectiveness of a score function that we use. Finally, section 6 concludes and gives topics for future research.

2. Description of the case study

2.1 Background: statistical output

In the current paper we deal with the estimation of quarterly and yearly turnover levels and growth rates, based on VAT declarations and survey data. The work is part of the project called “Direct estimation of Totals”. Turnover is estimated for the target population, which consists of the statistical unit type enterprise. For each enterprise, we know the economic activity, from which we derive the NACE code, and the number of working persons, from which we derive the size class code (SC code). Turnover is thus classified by NACE and SC code. An overview of all processing steps from input to output data can be found in Van Delden (2010).
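As an illustration of classifying turnover by NACE and SC code, the following minimal Python sketch sums enterprise turnover per stratum. The record layout, the field names and the example values are hypothetical; the sketch only shows the grouping step and is not the estimation method of the case study.

```python
from collections import defaultdict


def turnover_by_stratum(enterprises):
    """Sum turnover per (NACE code, SC code) stratum."""
    totals = defaultdict(float)
    for ent in enterprises:
        totals[(ent["nace"], ent["sc"])] += ent["turnover"]
    return dict(totals)


# Hypothetical enterprise records (codes and amounts are illustrative only).
enterprises = [
    {"id": "E1", "nace": "4711", "sc": 3, "turnover": 100.0},
    {"id": "E2", "nace": "4711", "sc": 3, "turnover": 50.0},
    {"id": "E3", "nace": "4711", "sc": 5, "turnover": 400.0},
]
for stratum, total in sorted(turnover_by_stratum(enterprises).items()):
    print(stratum, total)
```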

The estimated quarterly figures are used directly for the short term statistics (STS). The quarterly and yearly turnover levels and growth rates are also input to the supply and use tables of the National Accounts, where macro integration is used to obtain estimates that are consistent with other parameters. In addition, results are used as input for other statistics such as the production index (micro data) and the consumption index (the estimates). Finally, the yearly turnover results are integrated at micro level with survey data of the Structural Business Statistics (SBS). Next, the combined data are used to detect and correct errors in both the turnover data and the other SBS variables. The yearly turnover results per NACE code × SC code stratum are used as a weighting variable for the SBS sample.

In fact we deal with four coherent turnover estimates:

- net total turnover: total invoice concerning market sales of goods and services supplied to third parties, excluding VAT

- gross total turnover: total invoice concerning market sales of goods and services supplied to third parties, including VAT

- net domestic turnover: net turnover for the domestic market, according to the first destination of the product

- net non-domestic turnover: net turnover for the non-domestic market, according to the first destination of the product

More information on the turnover definition can be found in EC (2006). In the remainder of the paper we limit ourselves to net total turnover, further referred to as turnover.
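The four definitions above suggest simple consistency checks, sketched below in Python. The sketch assumes that net total turnover equals the sum of net domestic and net non-domestic turnover, and that gross total turnover is not smaller than net total turnover (that is, the VAT amount is non-negative). Both checks are assumptions read from the definitions, not rules stated in the paper, and the function name and tolerance are hypothetical.

```python
import math


def check_turnover_coherence(net_total, gross_total, net_domestic,
                             net_non_domestic, tol=1e-6):
    """Return a list of messages for violated (assumed) coherence rules."""
    issues = []
    # Assumed rule 1: net total = net domestic + net non-domestic.
    if not math.isclose(net_total, net_domestic + net_non_domestic,
                        abs_tol=tol):
        issues.append("net total differs from domestic + non-domestic")
    # Assumed rule 2: gross (including VAT) is at least net (excluding VAT).
    if gross_total < net_total - tol:
        issues.append("gross total is smaller than net total")
    return issues


print(check_turnover_coherence(100.0, 121.0, 70.0, 30.0))
print(check_turnover_coherence(100.0, 90.0, 70.0, 20.0))
```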

The quarterly and yearly figures are published in different releases, as shown in Table 2.1. The quarterly releases vary from a very early estimate, delivered 30–35 days after the end of the corresponding quarter, to a final estimate for SBS publication, delivered in April of year y+2, where y stands for the year in which the target period falls.

Table 2.1. Overview of the turnover releases of the case study

Release / Period of estimation / Moment / Explanation
Flash estimate / Quarter / 30–35 days after end of target period / Provisional estimate delivered for Quarterly Accounts, STS branches with early estimates
Regular estimate / Quarter / 60–70 days after end of target period / Revised provisional estimate for Quarterly Accounts, and for STS
Final STS estimate / 4 quarters simultaneously with year / April y+1, one year after year of target period / Quarterly estimate consistent with yearly turnover, final estimate for STS
Final SBS estimate / 4 quarters simultaneously with year / April y+2, two years after year of target period / Quarterly estimate consistent with yearly turnover based on STS and SBS turnover data, final estimate for SBS

2.2 Target population and population frame

The GBR consists of both administrative unit types, e.g. VAT units and legal units, and statistical unit types, together with their relations. The GBR is updated daily. The GBR is constructed from different input sources, e.g. data from the Chamber of Commerce, data from a National Basic Business Register (BBR), and customer data from the tax office. During the production of the GBR, the relations between enterprises and their underlying legal units are derived from ownership relations in COC and tax office data, using automated business rules. Likewise, relations between legal units and tax units are derived from the BBR and the tax office data using automated business rules. Within the BBR, address information of tax and legal units is linked, using postal code and house number.

The more complex units are visited by profilers, who use their findings to adjust the relations between the unit types manually, if needed.

Every month the GBR produces a validated list of all units, and their relations, that are active on the first day of the month. The statistical target population of a period consists of all enterprises that are active during that period. We represent this target population for quarterly estimates by combining all units that are active on 1 January, 1 February, 1 March or 1 April (taking the first quarter as an example). The result is referred to as a quarterly GBR. The VAT-units in the VAT data sources are linked to corresponding unit types in the quarterly population frame.
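The construction of the quarterly GBR from monthly lists can be sketched as a set union, as in the following minimal Python example. It assumes that each monthly list can be represented as a set of unit identifiers; the identifiers are hypothetical, and relations between units are left out for brevity.

```python
def quarterly_frame(monthly_snapshots):
    """Combine monthly lists of active units into one quarterly frame."""
    frame = set()
    for snapshot in monthly_snapshots:
        frame |= set(snapshot)
    return frame


# Hypothetical monthly lists of units active on the first day of the month.
snapshots = [
    {"E1", "E2"},        # 1 January
    {"E1", "E2", "E3"},  # 1 February: E3 has started
    {"E1", "E3"},        # 1 March: E2 has ended
    {"E1", "E3", "E4"},  # 1 April
]
print(sorted(quarterly_frame(snapshots)))
```

A unit that is active on only one of the reference dates is still part of the quarterly frame, which is what makes the frame cover the whole period.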

As explained in Vaasen and Beuken (2009), VAT-units are nearly always related to just one enterprise group, but they may be related to more than one enterprise, with an enterprise as defined by EC (1993). At Statistics Netherlands (SN) we use the simplification that for the smaller companies, referred to as non topX units, the enterprise group consists of just one enterprise. For the more complicated companies, referred to as topX units, an enterprise group consists of more than one enterprise.

For the present paper we are interested in the statistical target population of enterprises, in the population of legal units that are related to the target population of enterprises and, likewise, in the population of VAT units that are related to the target population of enterprises.

Each enterprise and enterprise group in the GBR has an actual and a coordinated value for the SC and NACE code. The coordinated value is updated only once a year, on the first of January, and is used to obtain consistent figures across economic statistics. In the remainder of the paper we always refer to the coordinated values of SC and NACE code unless stated otherwise.

2.3 Data

In the case study we use two data source types. For the topX enterprises we use primary data and for the non topX enterprises we use VAT data. This approach is quite common, also at other NSI’s in Europe (e.g. Fisher and Oertel, 2009; Koskinen, 2007; Norberg, 2005; Orjala, 2008; Seljak, 2007). For the non topX units, we only use the VAT-observations of VAT units that are related to the target population of enterprises.

Concerning the VAT, a unit declares the value of sales of goods and services, divided into different types of sales. The different sales types are added up to the total sales value, which we refer to as turnover according to the VAT declaration.

We make turnover estimates by using VAT data for the non-topX enterprises and we use primary data collection for the topX enterprises. The latter is done because VAT data cannot be uniquely related to enterprises, as explained before.

In the current paper we use VAT data and primary data for Q4 2009 to Q4 2010 and focus on quarterly estimates. Data are stratified according to the NACE 2008 classification.

3. Classification of linkage errors

3.1 Overview

In the case study we consider both the relations between unit types within the GBR and the linkage between observations and the GBR, as shown in Figure 3.1. More precisely, within the (quarterly) GBR we have:

  1. the relation(s) between the legal unit and the statistical unit (enterprise)
  2. the relation(s) between the VAT unit and the legal unit

Linkage between the data set and the GBR is split into:

  3. linkage of a VAT declaration to a VAT unit in the (quarterly) GBR
  4. linkage of a primary response to an enterprise in the (quarterly) GBR

Figure 3.1. Units in data sets (register and survey) related to economic statistics

As described in section 2.2, relations 1 and 2 are derived during the production process of the GBR. Errors may occur in relations 1 and 2, for example due to errors in the data sources used in the GBR and due to time delays in those sources.

The tax data and the survey data are linked to the quarterly GBR using exact matching based on the identification numbers, which is a straightforward process. Errors in the linkage of the data set to the GBR (relations 3 and 4 in Figure 3.1) may occur in the exceptional case that an error occurs in the identification number.
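Exact matching on identification numbers can be sketched in a few lines of Python. The record layout and the field name unit_id are hypothetical; the sketch only separates observations whose identification number occurs in the frame from those for which it does not.

```python
def link_by_id(observations, gbr_unit_ids):
    """Split observations into linked and unlinked by exact ID match."""
    linked, unlinked = [], []
    for obs in observations:
        if obs["unit_id"] in gbr_unit_ids:
            linked.append(obs)
        else:
            unlinked.append(obs)
    return linked, unlinked


# Hypothetical frame identifiers and VAT declarations.
gbr_unit_ids = {"V001", "V002", "V003"}
declarations = [
    {"unit_id": "V001", "turnover": 120.0},
    {"unit_id": "V009", "turnover": 75.0},  # ID does not occur in the frame
]
linked, unlinked = link_by_id(declarations, gbr_unit_ids)
print(len(linked), len(unlinked))
```

Note that exact matching only reveals an erroneous identification number when the number does not occur in the frame. An erroneous number that coincides with the number of another existing unit yields a link that looks valid, so other checks are needed to detect it.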

In the next two sections we give a classification of the errors that are considered in the present paper. We distinguish (a) errors in the relations between different unit types within the GBR (section 3.2) from (b) linkage errors of observations to the GBR (section 3.3). The linkage errors within the GBR are further subdivided. Note that the errors in the relations between unit types within the GBR are part of the frame errors as mentioned by Bakker (2011).

3.2 Linkage errors within the GBR

We divide the errors in the relations between unit types within the GBR into incorrect positive links (mismatches) and incorrect missing links (missed matches). We further subdivide those two error types into those that result in coverage errors compared to the target population and those that do not. A non-domestic VAT unit whose declaration should not be added to the Dutch turnover may be related (via a legal unit) to an enterprise of the target population. We consider this an incorrect positive link resulting in over coverage compared to the (domestic) target population. Likewise, an example of under coverage is a domestic VAT unit that is incorrectly not related (via the legal unit) to an enterprise of the target population.
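The classification above can be illustrated with a minimal Python sketch that compares the links recorded in the frame with a reference set of correct links. The reference set is hypothetical (for instance the outcome of manual verification), since in practice the correct links are not known in advance. The rule used to mark coverage errors is also a simplifying assumption: a mismatch counts as over coverage when the VAT unit should not be linked to any enterprise of the target population, and a missed match counts as under coverage when the VAT unit has no recorded link at all.

```python
def classify_link_errors(recorded, reference):
    """Classify link errors; links are (vat_unit, enterprise) pairs."""
    incorrect_positive = recorded - reference  # mismatches
    missing = reference - recorded             # missed matches
    units_recorded = {vat for vat, _ in recorded}
    units_reference = {vat for vat, _ in reference}
    return {
        "incorrect_positive": incorrect_positive,
        "missing": missing,
        # Assumed rule: the unit should not be in the population at all.
        "over_coverage": {(v, e) for v, e in incorrect_positive
                          if v not in units_reference},
        # Assumed rule: the unit is absent from the recorded links.
        "under_coverage": {(v, e) for v, e in missing
                           if v not in units_recorded},
    }


# Hypothetical links: VX is a non-domestic unit, V4 is missing from the frame.
recorded = {("V1", "E1"), ("V2", "E2"), ("VX", "E3")}
reference = {("V1", "E1"), ("V2", "E9"), ("V4", "E4")}
errors = classify_link_errors(recorded, reference)
print(sorted(errors["over_coverage"]), sorted(errors["under_coverage"]))
```

In this example the link of V2 is wrong in both directions (linked to E2 instead of E9) without affecting coverage, whereas VX and V4 lead to over coverage and under coverage respectively.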