EUROPEAN COMMISSION
EUROSTAT
Directorate F: Social statistics
Unit F-2: Population

Doc. ESTAT/F2/CENS (2013)03

Luxembourg, May 2013

Working Group on Population and Housing Censuses

Luxembourg, 4-5 June 2013

BECH building, room Ampere

STATISTICAL DISCLOSURE CONTROL

Item 3 of the Agenda

1. Introduction

This document is intended to support discussion on the statistical disclosure control (SDC) methods used by countries to protect confidentiality in the dissemination of the 2011 census results. It focuses particularly on the methods used to protect the hypercube data produced for dissemination via the Census Hub.

Several points need to be considered:

  • The choice at national level of one or more SDC methods to protect census data may depend on a variety of factors. These factors include the characteristics of the national census data and production methods, national policies and legislation relating to statistical confidentiality in general and to the census specifically, and established national practices that are understood and accepted both by statistics producers and users.
  • Ensuring appropriate protection of the extremely detailed multidimensional hypercube data required for the Census Hub will present considerable disclosure control challenges. These data will often be more difficult to protect than more conventional aggregate outputs, which have only a limited number of dimensions and cross-tabulations and often far less detailed disaggregations. For this reason, it may be necessary for countries to adopt different SDC methods for the Eurostat census hypercubes than the methods used to protect national outputs. This is further complicated by the fact that the Census Hub is an innovative approach to dissemination, so there is less experience and there are fewer examples of good practice to draw upon.
  • Decisions regarding the selection and implementation of SDC methods for the census data supplied to Eurostat rest with the NSIs. Given the different national situations and approaches to the production of census data, it is not possible for Eurostat to impose recommendations as to the SDC methods to be applied.

2. Questionnaires to NSIs

2.1 June 2012

Eurostat launched a Preliminary Information Questionnaire (PIQ) intended to provide a general overview of the methods used to protect confidentiality, as well as their likely impact on data availability, comparability, loss of information and general quality. As most countries were not at that stage in a position to provide a detailed analysis of the ways in which SDC methods would be selected, tested and implemented, as well as their impact on the data, Eurostat sent a second more detailed questionnaire for updated information on the issue.

2.2 April 2013

The second questionnaire, sent in April 2013, focused specifically on the SDC methods used to protect the census hypercube data produced for the EU census programme. It did not address other SDC methods that may have been applied to national tabular or microdata releases. As shown below, this second questionnaire included general questions about the countries' priorities in the choice of SDC method(s), as well as questions addressing the detail of the implementation of the method(s) selected:

1. Introduction

2. General and method-specific questions

2.1. General questions

2.2. Method-specific questions

  • Swapping
  • Perturbation (including SAFE)
  • Cell suppression
  • Recoding
  • Rounding
  • Other SDC methods (pre and/or post-tabular)
  • Hypercube 06 test

Explanatory notes, definitions and glossary

  • Defining the qualities of an effective approach of SDC
  • Classification of the methods
  • Parameters
  • Choice and definition of the SDC methods: qualitative and quantitative criteria
  • Definition of the key SDC methods
  • Documentation

3. Reviewing the results of the second questionnaire

Summary

The second questionnaire was sent to NSIs in the 31 EU and EFTA countries, with a reply requested by 13 May 2013. At the date of writing (17 May 2013), answers had been received from 20 countries.

Of the 20 replies, 19 included a completed questionnaire. More details on the questionnaire results are given in the Annex.

The SDC methods to be applied to the census hypercubes had been decided in 12 countries, with the test on hypercube 06 using real census data having been undertaken in 3 countries.

General questions

Priority characteristics

As might be expected, countries reported that the priority characteristics in selecting disclosure control methods were principally the minimisation of disclosure risk and of information loss, as well as ease of implementation.

The characteristics that were in fact ensured by the chosen methods were reported as being the minimisation of disclosure risk, consistency between the tables, and availability and consistency of the data at the higher aggregate levels.

Information loss

Replies to the questionnaire suggest that countries will evaluate information loss through:

- Deviations between protected and unprotected data

- Measuring the number of suppressed cells

- Analysis of the distribution of sensitive variables

- Comparing deviations on primary marginal tables
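The first two measures listed above can be illustrated with a minimal sketch. It assumes that a table is held as a flat list of cell counts and that a suppressed cell is represented by None; both conventions, and the function name information_loss, are illustrative assumptions rather than anything reported by the countries.

```python
from typing import Optional, Sequence


def information_loss(original: Sequence[int],
                     protected: Sequence[Optional[int]]) -> dict:
    """Compare the unprotected and protected cell counts of one table.

    A protected value of None marks a suppressed cell (assumed convention).
    """
    if len(original) != len(protected):
        raise ValueError("tables must have the same number of cells")

    suppressed = sum(1 for p in protected if p is None)
    # Deviations are only defined for cells that are still published.
    published = [(o, p) for o, p in zip(original, protected) if p is not None]
    deviations = [abs(o - p) for o, p in published]
    published_total = sum(o for o, _ in published)

    return {
        "suppressed_cells": suppressed,
        "share_suppressed": suppressed / len(original) if original else 0.0,
        "total_abs_deviation": sum(deviations),
        "max_abs_deviation": max(deviations, default=0),
        "relative_deviation": (sum(deviations) / published_total
                               if published_total else 0.0),
    }


# Hypothetical five-cell table: two cells suppressed, two cells perturbed.
original = [12, 2, 40, 7, 1]
protected = [12, None, 39, 9, None]
result = information_loss(original, protected)
```

In this hypothetical example two of the five cells are suppressed and the published cells deviate by a total of 3 persons. Which of these figures a country would report as its "information loss" is not specified in the replies.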

Countries frequently reported a loss of precision and detail in the data, as well as non-additivity and a lack of consistency between the tables.

Implementation

Of the 13 countries that answered this question, 6 stated that the chosen methods are easy to implement, 5 that they are not, and 2 gave a neutral answer.

The use of specialised SDC software such as TAU-ARGUS was mentioned by several countries, with some others using more general statistical software such as SAS. One country reported using custom-made SDC software.

Countries reported that key factors in the choice of SDC methods were often the ease of implementation and the fact that the methods could readily be understood by users. However, other priority characteristics were also selected as reasons for the choice of methods by some countries.

Based on the short descriptions given by several countries, it appears that the detailed implementation of the basic methods varies considerably. For example, the records selected for record swapping may be chosen according to specific rules of varying degrees of complexity, or may instead be chosen largely at random. It is unclear, though, whether this apparent variation merely reflects the different levels of detail supplied by countries in response to the questionnaire.
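To make the random variant of record swapping concrete, the following is a minimal sketch only. It assumes that each record is a dictionary with a geographic attribute called "area" and that a fixed share of records (swap_rate) is paired at random; the attribute name, the rate and the function swap_geography are hypothetical and do not describe any country's actual procedure. A rule-based variant would replace the random sample with a selection of records judged to be at risk.

```python
import random


def swap_geography(records, swap_rate, seed=0):
    """Return a copy of records in which randomly paired records exchange 'area'."""
    rng = random.Random(seed)           # fixed seed so the sketch is reproducible
    swapped = [dict(r) for r in records]  # copy, so the input is left untouched

    # Number of pairs needed so that roughly swap_rate of all records are swapped.
    n_pairs = int(len(swapped) * swap_rate / 2)
    chosen = rng.sample(range(len(swapped)), 2 * n_pairs)

    # Pair consecutive selected records and exchange their geographic code.
    for i, j in zip(chosen[0::2], chosen[1::2]):
        swapped[i]["area"], swapped[j]["area"] = swapped[j]["area"], swapped[i]["area"]
    return swapped


# Hypothetical microdata: ten records in two areas.
records = [{"id": k, "area": "A" if k < 5 else "B", "age": 20 + k}
           for k in range(10)]
swapped = swap_geography(records, swap_rate=0.4, seed=1)
```

Because records only exchange their area code, the number of records in each area is unchanged, while cross-tabulations of area by other variables may change. Two paired records may happen to come from the same area, in which case the swap has no effect.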

Method-specific questions

Only 12 of the 19 countries that completed the questionnaire said that a choice had been made regarding the SDC method(s) to be used.

The questionnaire listed five broad SDC methods that might commonly be used for SDC of census data:

  • Swapping is used by four countries (BE, ES, AT, UK).
  • Perturbation (including the SAFE method) was mentioned by two countries (DE, EE).
  • Cell suppression is used by four countries (IE, CY, NL, RO).
  • Rounding is used by two countries (EE, NO).
  • Recoding is used as an additional method by the UK.
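As an illustration of one of the listed methods, unbiased random rounding can be sketched as follows. The rounding base of 5 and the function name random_round are assumptions made for the example; the replies do not state which base or algorithm is used.

```python
import random


def random_round(count, base=5, rng=random):
    """Round count to a multiple of base, up or down at random.

    The probability of rounding up is remainder/base, so the expected
    value of the rounded count equals the original count.
    """
    remainder = count % base
    if remainder == 0:
        return count                    # already a multiple of the base
    lower = count - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower


# Hypothetical cell counts, rounded independently of each other.
rng = random.Random(42)
cells = [0, 1, 4, 7, 10, 23]
rounded = [random_round(c, base=5, rng=rng) for c in cells]
```

Since each cell is rounded independently in this sketch, rounded cells need not sum to the rounded totals. Controlled rounding, by contrast, chooses the rounding directions jointly so that additivity is preserved.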

Hypercube 06 test

Only three countries had implemented the test on real data; other countries had implemented the test using dummy data. Countries reported an information loss of between 0 and 43%.

4. Conclusions and issues for possible discussion by the Working Group

It is surprising that, at this relatively advanced stage in the programme for the production of the census hypercube data, a significant number of countries (7 out of the 19 that completed the questionnaire) report that decisions as to the SDC method(s) to be used have not yet been taken. It is similarly surprising that only 3 countries report having undertaken the tests on Hypercube 06 using real data.

Clearly the selection of SDC method(s) to protect this innovative and detailed type of census output will take time. It is important though that this work is progressed to ensure adequate time for final implementation and operational testing before the hypercubes are due to be made available.

Members of the Working Group are invited to comment on the results of the questionnaire and on other issues relating to the SDC for the protection of the census hypercube data.

ANNEX

Results of the Eurostat questionnaire on SDC methods (as at 17/5/2013)

Questionnaire sent: 31 countries

Answers received: 20 countries (17 EU and 2 EFTA)

Questionnaire filled: 19

SDC methods already chosen: 12

Hypercube 06 test on real data: 3

1. General questions

a) Priority characteristics

Amongst the priority characteristics proposed in the questionnaire, the most reported were (in descending order):

- Minimization of the disclosure risk (15)

- Minimization of information loss (13)

- Ease of implementation (13)

- Understanding of the SDC methods by the users (8)

- Consistency between the tables (7)

- Availability and consistency of data at the higher aggregate levels (6)

b) Characteristics actually ensured by specific SDC methods

- Minimization of the disclosure risk (8)

- Consistency between the tables (6)

- Availability and consistency of the data at the higher aggregate levels (5)

Seven countries out of 19 answering had not yet decided the SDC method(s) to be used.

c) Information loss (open question): how do you evaluate the information loss?

Out of 19 answers, 13 responses are available for this question.

Methods used to evaluate information loss are:

- Deviations between protected and unprotected data

- Measuring the number of suppressed cells

- Analysis of the distribution of sensitive variables

- Comparing deviations on primary marginal tables

d) SDC methods will result in:

“Less precise data” (7) is the most common answer followed by “less detailed information” (4) and “non additivity” (4).

“Less data available” (2) and “non-consistency” (2) were selected less often.

12 countries provided an answer to one or more items.

e) Ease of implementation

Out of 13 answers, 5 were negative, 6 were positive and 2 neither positive nor negative.

f) Use of software to protect census hypercubes

Received answers: 14

Both SAS and Tau-Argus software were mentioned, although many replies did not mention software. One country developed in-house software.

g) Discussion with the users

Out of 15 answers, most were negative: few countries reported discussions with users, although the understanding of the methods by users was selected as a priority characteristic.

Nevertheless, internal discussions with experts or information initiatives had been implemented or were planned.

h) Data treated as non-confidential

Out of 14 answers, three countries mention data or characteristics which might be treated as non-confidential. However, in the majority of countries, no data are specifically treated as non-confidential.

i) Obligation to publish data at certain aggregated level

Out of 14 answers, 8 answered positively that there was a national obligation to publish data at specific aggregate levels. Other replies referred only to the EU requirements.

j) The reason why the SDC method was chosen

Of 13 answers, ease of implementation and understanding by users were reported as the main reasons for selecting particular SDC methods. Some countries reported making a technical selection among various acceptable possibilities.

2. Method-specific questions

Method (one or more) / Number of countries / Comments
A. SWAPPING / 4 (BE, ES, AT, UK) / Countries use different methods to define the 'risky' records, usually by implementing different rules but in some cases by random selection. The swapping rate is generally low or is not released.
B. PERTURBATION INCLUDING SAFE / 2 (DE, EE) / The proportion of data perturbed depends on the topic concerned. Generally, the tables are considered to be consistent or to have only small discrepancies due to the method applied.
C. CELL SUPPRESSION / 4 (IE, CY, NL, RO) / Primary cell suppression is generally based on a minimum cell count (often 3). Rules for secondary confidentiality vary but are generally intended to minimise information loss.
D. RECODING / 1 (UK) / Focus on 'visible' variables that may be used to identify persons and disclose other non-visible information.
E. ROUNDING / 2 (EE, NO) / Both controlled and random rounding are reported.
F. Other methods? / - / -
G. Hypercube 06 / 3 / Information losses of 0% to 43% are reported.
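
The minimum cell count rule reported for primary cell suppression can be sketched as below. The threshold of 3 is taken from the comment in the table; the treatment of zero cells as publishable, the use of None for a suppressed cell and the function name primary_suppress are assumptions made for the example only.

```python
from typing import List, Optional


def primary_suppress(counts: List[int],
                     min_count: int = 3) -> List[Optional[int]]:
    """Suppress every non-zero cell whose count is below min_count.

    Suppressed cells are returned as None; zero cells are left as they
    are (an assumption, since practice on zero cells may differ).
    """
    return [None if 0 < c < min_count else c for c in counts]


# Hypothetical table row: counts of 1 and 2 fall below the threshold.
table = [0, 1, 2, 3, 15]
published = primary_suppress(table)
```

Primary suppression alone is not sufficient when row or column totals are also published, because a single suppressed cell can be recovered by subtraction. This is the purpose of the secondary confidentiality rules mentioned above, which are not covered by this sketch.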