Executive summary
1. Introduction
1.1 Aim of this report
1.2 MDL 2014-2015
1.3 Structure of the report
2. Phase I: Production of MDL-database
2.1 Background of the sources
2.2 Availability of variables
2.3 Experiences building MDL-database
2.4 Conclusions and recommendations
3. Phase II: Validation
3.1 Introduction
3.2 Experiences validation
3.3 Conclusions and recommendations
4. Phase III: Output
4.1 Introduction
4.2 Data analysis
4.3 Regular tabular output
4.4 Limitations MDL and national sources
Annex I: Background information sources
Annex II: Availability variables
Annex III: Overview issues
Annex IV: Country report Austria
Annex V: Country report Denmark
Annex VI: Country report Finland
Annex VII: Country report Germany
Annex VIII: Country report Latvia
Annex IX: Country report the Netherlands
Annex X: Country report Norway
Annex XI: Country report Portugal
Annex XII: Country report Sweden
Executive summary
This methodological report summarizes the main conclusions and recommendations from the project ‘Micro data linking of structural business statistics and other business statistics’ (in short MDL 2014-2015).
The MDL 2014-2015 project produces national databases containing the most central structural business statistics – with information available on the enterprise and enterprise group level – in order to conduct micro-level economic analyses. Beyond this, the project provides the basis for further analyses in the future based on the national databases established in the project. The new statistical knowledge is produced without carrying out new surveys, i.e. without increasing the respondent burden on enterprises.
An important goal of the project is to test the feasibility of the micro data linking approach to produce (regular) statistical output to be considered a supplement to the current annual deliverables of tables as part of SBS.
The micro data linking project produces several deliverables, including three output deliverables. These output deliverables consist of tabular data for each of the three topics – intended for publication on the Eurostat website – and a descriptive analysis in the form of a Statistics Explained article for each topic.
Presented below are the main conclusions and recommendations from the MDL 2014-2015 project:
Conclusions:
1) Overall there was positive feedback on the quality and timeliness of the circulated guidelines and syntaxes.
2) The used approach was suitable for all participating countries in the project.
3) Most countries reported that the production of the national databases (phase 1 of the project) and their validation (phase 2) were time consuming.
4) It was an educational experience for all countries. The participants learned more about their own (micro) data.
5) It is important to define the output specifications as early as possible; knowing that the projects are research oriented and as such will require an iterative process where the final output cannot be determined from the outset of the project.
6) The validation phase was overall very valuable, even though the validation produced mixed results and experiences.
7) The interpretations of the different national results of validation were mixed. Some reasons for the validation output were mentioned by many NSI’s, while quite a few others were mentioned by only one country (see also Annex III).
Recommendations:
1) Take into consideration explicit technical requirements in the project to avoid unnecessary problems and possible delays (both hardware and software).
2) Provide more meta description in SAS-syntax, examples of datasets and instructions on importing files in SAS. This is particularly helpful for countries with limited SAS-knowledge.
3) Define beforehand which consistencies of the output tables need to be guaranteed; amongst themselves and in relation to official data from the NSI’s.
4) Make explicit references to framework regulations regarding project specific output (especially the BR-variables).
5) Focus the validation on the largest enterprises. Macro validation techniques should also be included.
6) Include additional BR-variables, like location and address, for validation purposes.
7) Evaluate after each phase of the project instead of only at the end of the project.
1. Introduction
1.1 Aim of this report
This report summarizes the experiences and results of the project ‘Micro data linking of structural business statistics and other business statistics’ (in short MDL 2014-2015) from the participating countries: Austria, Denmark, Finland, Germany, Latvia, the Netherlands, Norway, Portugal and Sweden. The main conclusions are presented in this report and recommendations for future exercises with micro data linking are made.
This report serves as deliverable ‘D4.4 Methodological report’ of the MDL 2014-2015 project (theme: 06.1.23-Development of structural business statistics).
1.2 MDL 2014-2015
“The picture of economic globalization provided by current official statistics is incomplete, the causal links to economic welfare indicators such as employment and wages tend to be weak and unconvincing, allowing a set of highly charged, politically motivated, and unproductive debates over the basic facts”, economic geographer Timothy Sturgeon[1] reported to Eurostat. He concluded: “[T]he most pressing need is to make full use of existing data sources, for a system that ties data from business surveys to the wealth of information from administrative sources”.
The MDL 2014-2015 project is a big step in this direction. The goal of the project is to produce national databases containing the most central structural business statistics, with information available on the enterprise and enterprise group level, in order to conduct micro-level economic analyses. Beyond this, the project provides the basis for further analyses in the future based on the national databases established in the project. The new statistical knowledge is produced without carrying out new surveys, i.e. without increasing the respondent burden on enterprises.
To the extent possible, the new MDL database is structured using input data for the reference period of 2008-2012 from Structural Business Statistics, International Trade in Goods Statistics, International Trade in Services Statistics, Community Innovation Survey, ICT usage and e-Commerce in enterprises Survey, Foreign Affiliate Statistics (Inward and Outward), Business Demography Statistics, International Organization and Sourcing Survey and the national Business Register. Information from the National Business Register is central for the establishment of the database as the issue of identity over time is essential for longitudinal micro-level analysis.
The micro data linking project is divided into three phases:
- Matching and adjustment of data and structuring of the database;
- Validation controls and calculation of weights for the control group population(s) where necessary;
- Production of standardized output.
The micro data linking project and its phases are illustrated in figure 1.
Figure 1: The phases of the MDL 2014-2015 project
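As an illustration of the first phase, the following minimal Python sketch links two sources to a business register by enterprise ID. It is not the SAS syntax circulated in the project; the function name `link_sources`, the variable names, the enterprise IDs and all values are hypothetical and serve only to show the principle of matching on a common identifier.

```python
# Hypothetical miniature extracts; all identifiers and values are invented.
business_register = {
    "E001": {"nace": "C25", "group_id": "G10"},
    "E002": {"nace": "G46", "group_id": None},
    "E003": {"nace": "H50", "group_id": "G10"},
}
sbs = {  # Structural Business Statistics extract
    "E001": {"turnover": 1200, "persons_employed": 15},
    "E002": {"turnover": 800, "persons_employed": 6},
}
itgs = {  # International Trade in Goods Statistics extract
    "E001": {"exports": 300},
    "E003": {"exports": 50},
}


def link_sources(register, **sources):
    """Left-join each source onto the register by enterprise ID."""
    linked = {}
    for ent_id, br_record in register.items():
        row = dict(br_record)
        for name, source in sources.items():
            # Flag whether the enterprise was found in this source.
            row[f"in_{name}"] = ent_id in source
            row.update(source.get(ent_id, {}))
        linked[ent_id] = row
    return linked


mdl = link_sources(business_register, sbs=sbs, itgs=itgs)
for ent_id, row in mdl.items():
    print(ent_id, row)
```

The register is used as the backbone of the join because, as noted above, it carries the identity of the enterprise over time. Enterprises missing from a source keep their register record and are flagged rather than dropped, so that they can be examined in the validation phase.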
1.3 Structure of the report
This report is divided into three parts, in line with the phases of the MDL 2014-2015 project. The first part addresses the data sources: it discusses the quality of the sources, the availability of variables and the member states’ experiences regarding the construction of the national databases (chapter 2). The central topic of the second part is the validation of the constructed databases. The results and experiences of the participating countries are presented in this chapter. In order to further improve the consistency and quality of the data, special attention was given to the following issues: instability over time, no-match, unit representation and demographic change (chapter 3). The third, and final, part gives additional information on the choices made in the data analysis for the three Statistics Explained articles (chapter 4). The most relevant knowledge gathered during the micro data linking project can be found in the annexes at the end of the report.
2. Phase I: Production of MDL-database
This chapter presents the results and experiences from the first phase of the project: the matching and adjustment of the data and the structuring of the MDL-database. As it is crucial to have a clear understanding of quality and limitations of the available variables and sources used for the database, this part of the methodological report also focusses on the relevant background information of the national sources.
2.1 Background of the sources
At the start of the MDL 2014-2015 project all nine participating NSI’s were asked to give additional background information on the national sources used in this project. Issues addressed include, amongst others: the type of statistical units used; how the coverage of the source compares to that of the Structural Business Statistics (SBS); and whether cut-off limits, other supplementary/complementary data, a sampling strategy, estimation methods and/or imputation methods are used during the construction of the national source.
The answers to these, and other, questions from the project participants are summarized in Annex I. The original completed questionnaires are in the possession of Eurostat.
2.2 Availability of variables
Besides giving additional background information on the national sources, all participants were also asked to indicate which variables are available in their national sources. As the overall table in Annex II shows, the majority of the listed variables are available for most countries. Not surprisingly, variables from the national Business Register, Structural Business Statistics, International Trade in Goods Statistics, Inward Foreign Affiliate Statistics and Business Demography Statistics are widely available. Variables from other sources, however, are less available, for example due to non-participation in specific surveys (like the International Organization and Sourcing Survey).
Eurostat obtains the country specific overviews of available variables from the participating NSI’s.
2.3 Experiences building MDL-database
The detailed experiences from the NSI’s were gathered in questionnaires that were sent out to the participants at the end of the project. These country experiences can be found at the end of this methodological report, in Annexes IV to XII. A full overview of the mentioned issues/problems and proposals is displayed in Annex III.
This paragraph lists the main experiences and proposals of the participating countries regarding the construction of the MDL-database, as presented and discussed in the fourth, and last, task force meeting of the MDL-project.
Experiences
- The overall feedback in this phase of the project is positive. Moreover, all participants agree (5) or strongly agree (4) that the used approach to build the MDL-database was suitable.
- The circulated guidelines were positively evaluated. The nine participants rated them as ‘good’ (2), ‘very good’ (3) and ‘excellent’ (4). The guidelines had a “clear structure and visualizations” and were “detailed and comprehensive”.
- The circulated syntax was positively evaluated. The nine participants rated it as ‘good’ (4), ‘very good’ (4) and ‘excellent’ (1). Overall, the syntax was “easy to adapt and apply” and “efficient despite some minor mistakes”. For NSI’s with limited knowledge of the SAS-software more description would have been helpful.
- Most countries agreed that this phase was more time consuming than anticipated.
- Two NSI’s had changes in staff during the project.
- Almost all countries had issues with different SAS-versions. As a result, the standard syntax had to be adjusted, sometimes by the NSI’s themselves but more often by the project leaders, which took extra time.
- It turned out that one country could not use SAS-software to produce the MDL-database. They had to build the database with different statistical software. There were legal and software/hardware reasons for this decision.
- One country had problems with importing the data from the Excel format into the database.
- A proposal for aggregated data sets for confidentiality checks was rejected by the project leaders; unaware of consequences for confidentiality.
- Some countries indicated that they could not include all variables that were required for the MDL-database.
- Not all variables were clear to everyone. For example, it was unclear which event marks the start of a new enterprise group.
- Some countries observed that a lot of time was spent on building specific variables and datasets that were not used.
Proposals
- Define earlier which output to make, avoiding time spent on unused variables and datasets.
- More time for syntax testing.
- More meta descriptions in SAS-syntax, examples of datasets, instructions on importing files in SAS.
- Take into consideration the technical requirements before or at the beginning of the project: both hardware and software.
- Clearer variable definitions, with reference to framework regulations.
- Include extra BR variables identifying the enterprise, like name and address (for validation purposes).
- Limit dataset to most relevant variables.
2.4 Conclusions and recommendations
The previous paragraph gave an overview of the experiences of phase 1 from the MDL 2014-2015 participants, including the proposals for further improvement. This served as input for the final discussion, held at the fourth task force meeting in Copenhagen. The agreed conclusions and recommendations are presented below:
Conclusions:
1) Overall there was positive feedback on the quality and timeliness of the circulated guidelines and syntaxes.
2) The used approach was suitable for all participating countries in the project.
3) Most countries reported that the production of the national MDL-databases was time consuming.
4) It is important to define the output specifications as early as possible; knowing that the projects are research oriented and as such will require an iterative process where the final output cannot be determined from the outset of the project.
Recommendations:
1) Take into consideration explicit technical requirements in the project to avoid unnecessary problems and possible delays (both hardware and software).
2) Provide more meta description in SAS-syntax, examples of datasets and instructions on importing files in SAS. This is particularly helpful for countries with limited SAS-knowledge.
3) Provide clearer variable definitions and references to framework regulations regarding project specific output (specifically the BR-variables).
4) Include additional BR-variables, like location and address, for validation purposes.
3. Phase II: Validation
3.1 Introduction
The second phase of the MDL-project consists of various steps of different scope aimed at validating the MDL-database. The main aim of the validation process in the MDL-project is to achieve improved consistency and stability in the data, taking into account the analyses to be carried out. This chapter focusses on the NSI’s experiences with the validation phase.
The key consistency issues addressed in the validation work in this project are:
1) Instability over time: Data for the empirical unit is not registered on the same statistical unit IDs across data sets (same data source, but different reference periods).
2) No-match: Data for the empirical unit is not registered on the same statistical unit IDs across data sets (different data sources).
3) Unit representation: Data is registered on the same statistical unit ID across data sets (different sources and/or different reference periods), but the data does not refer to the same empirical unit.
4) Demographic change: Data is registered on the same statistical unit ID across data sets (different sources and/or different reference periods), but the empirical unit has changed substantially at one (or more) point(s). This may be the case when an enterprise takes over another enterprise.
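The four consistency issues listed above can be made concrete with a small Python sketch that flags candidate cases in toy data. This is not the validation syntax used in the project. The record layout, the enterprise IDs and all values are invented, and two of the rules are assumptions made for illustration only: exports exceeding turnover as a signal for unit representation, and a growth factor of more than 3 (`max_ratio`) as a signal for demographic change. In practice a flagged case is only a candidate that still needs to be checked manually.

```python
# Toy records keyed by (source, reference year, enterprise ID).
# All identifiers and values are invented for illustration.
records = {
    ("SBS", 2011, "E001"): {"turnover": 1000},
    ("SBS", 2012, "E001"): {"turnover": 5200},
    ("SBS", 2011, "E002"): {"turnover": 400},
    ("SBS", 2012, "E003"): {"turnover": 420},
    ("SBS", 2012, "E004"): {"turnover": 100},
    ("ITGS", 2012, "E001"): {"exports": 900},
    ("ITGS", 2012, "E004"): {"exports": 750},
    ("ITGS", 2012, "E005"): {"exports": 60},
}


def ids(source, year):
    """Enterprise IDs present in one source for one reference year."""
    return {e for (s, y, e) in records if s == source and y == year}


def instability_candidates(source, year_a, year_b):
    """IDs present in only one of two reference periods of the same source."""
    return ids(source, year_a) ^ ids(source, year_b)


def no_match_candidates(source_a, source_b, year):
    """IDs present in source_a but absent from source_b in the same year."""
    return ids(source_a, year) - ids(source_b, year)


def unit_representation_candidates(year):
    """Matched IDs whose exports exceed turnover (assumed plausibility rule)."""
    matched = ids("SBS", year) & ids("ITGS", year)
    return {
        e for e in matched
        if records[("ITGS", year, e)]["exports"]
        > records[("SBS", year, e)]["turnover"]
    }


def demographic_change_candidates(year_a, year_b, max_ratio=3.0):
    """IDs whose turnover grew by more than max_ratio (assumed threshold)."""
    matched = ids("SBS", year_a) & ids("SBS", year_b)
    return {
        e for e in matched
        if records[("SBS", year_b, e)]["turnover"]
        > max_ratio * records[("SBS", year_a, e)]["turnover"]
    }


print(sorted(instability_candidates("SBS", 2011, 2012)))
print(sorted(no_match_candidates("ITGS", "SBS", 2012)))
print(sorted(unit_representation_candidates(2012)))
print(sorted(demographic_change_candidates(2011, 2012)))
```

The first two checks rely only on the presence or absence of IDs, whereas the last two compare values registered on the same ID. This mirrors the distinction made in the list above between data registered on different IDs and data registered on the same ID that does not describe the same empirical unit.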
This chapter summarizes the experiences of the participating countries of the MDL 2014-2015 project and presents the main conclusions and recommendations for further micro data linking projects.
3.2 Experiences validation
The detailed experiences from the NSI’s were gathered in questionnaires that were sent out to the participants at the end of the project. These country experiences can be found at the end of this methodological report, in Annexes IV to XII. A full overview of the mentioned issues/problems and proposals is displayed in Annex III.
This paragraph lists the main validation results and experiences from the participating countries. Also, the mentioned proposals are presented in this paragraph.
Validation results:
- Five countries used an additional methodology to validate the MDL-database regarding instability over time: use of another project with more reference years; matching by company name and address; consulting BR-experts; top-down validation using pattern analysis; auxiliary information from Business Demography Statistics and the national Business Register.
- The main reasons for the output of the validation check for instability over time were:
- SBS is sample / sampling design;
- Inconsistencies amongst SBS surveys between NACE branches;
- Mismatch ITGS and SBS due to ITGS being a monthly and SBS being a yearly statistic;
- Enterprises not in SBS scope anymore (either because of different NACE or size class);
- Restructured enterprises outsourcing their business;
- Inactivity of enterprises, but still reporting trade (mainly micro enterprises);
- Economic circumstances;
- Sample coverage amongst ICTeC and SBS;
- Too late response to integrate in final SBS database.
- Five countries corrected data in the MDL-database as a result of the instability over time validation. Five NSI’s added or replaced ENT_ID’s, two NSI’s added or replaced other variables, and five corrected all variables across the dataset.
- Three countries used an additional methodology to validate the MDL-database regarding no-match for ITGS: matching ITGS ‘no-matches’ with the same SBS-year via BR using an administrative ID; checking company characteristics such as EGR, company name, location, persons employed and turnover; checking BD for deaths. No countries used an additional ‘no-match’ validation method for ITS, although one country added some enterprises in ITS connected to services related to sea and coastal transport.
- ITS is not available in two countries, while one country does not produce ITS but receives the data from the national bank.
- The main reasons for the output of the validation check for no-match were:
- (Foreign/inactive) enterprises in ITGS that are not in SBS;
- Mismatch ITGS and SBS due to ITGS being a monthly and SBS being a yearly statistic;
- Different NACE-scope;
- SBS is sample / sampling design;
- Effect of ‘unit-representation validation’ on the ‘no-match validation’;
- Demographic events;
- Added enterprises result in more mismatches;
- ITS includes ‘third party trade’.
- Six countries corrected data in the MDL-database as a result of the no-match validation for ITGS and three countries for ITS. For ITGS these corrections were as follows: adding or replacing ENT-ID’s (6), adding or replacing other variables (4), correcting all variables across the dataset (3), any other correction of the data (1). For ITS this was: adding or replacing ENT-ID’s (3), adding or replacing other variables (1), correcting all variables across the dataset (3), any other correction of the data (0).
- Only one NSI used an additional method to validate the MDL-database regarding unit representation for ITGS: using additional information like EFR and demographic events in BD. For ITS no NSI used a complementary method.
- The main reasons for the output of the validation check for unit representation were:
- Restructuring of enterprises;
- Indirect exports;
- Different use of enterprise groups;
- Controlling companies of tax groups clustered in M and L (ITGS more than one SBS unit);
- ITGS export includes the total value of goods, SBS turnover does not include this;
- Transport related enterprises include value of goods;
- The reporting unit at enterprise level does not necessarily represent the same ‘true picture’;
- Added enterprises result in more mismatches.
- Four countries corrected data in the MDL-database as a result of the unit representation validation for ITGS and two countries for ITS. For ITGS these corrections were as follows: adding or replacing ENT-ID’s (4), adding or replacing other variables (2), correcting all variables across the dataset (3), any other correction of the data (2). For ITS this was: adding or replacing ENT-ID’s (2), adding or replacing other variables (0), correcting all variables across the dataset (2), any other correction of the data (0).
- One country used an additional methodology to validate the MDL-database regarding outliers with demorelations in BR: additional checks were made based on STS information. Two participating NSI’s used an additional method to validate the MDL-database regarding outliers without demorelations in BR. They used auxiliary variables from BR or BD, and consulted BD and BR experts.
- The main reasons for the output of the validation check for demographic change were either due to demographic events (like takeovers, mergers and deaths) or due to other reasons for fast growth.
- One NSI corrected data in the MDL-database as a result of the demographic change validation regarding outliers with demorelations in BR and two countries corrected data due to the validation of outliers without demorelations. For the validation with demorelations these corrections were as follows: adding or replacing ENT-ID’s (1), adding or replacing other variables (1), correcting all variables across the dataset (1), any other correction of the data (0). For the validation without demorelations this was: adding or replacing ENT-ID’s (2), adding or replacing other variables (2), correcting all variables across the dataset (1), any other correction of the data (0).
Experiences: