Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0

Guidance Document

Prepared by: PCORnet Data Privacy Task Force

Submitted to the PMO / March 31, 2015
Approved by the PMO / April 2, 2015
Submitted to PCORI / <Date Submitted to PCORI>
Approved by PCORI / <Date Approved by PCORI>

Table of Contents

Executive Summary

1.0 Minimum Threshold

2.0 Perturbation of Query Results

3.0 Obfuscation of Identifiers for Record Linkage

4.0 De-identification of Record-Level Data

A. CAPriCORN approaches

B. NephCure PPRN’s approaches to de-identification

C. PEDSnet approaches to de-identification

Tables and Figures

References

Executive Summary

PCORnet is a federated network, with PCORnet network partners retaining discretion and responsibility with respect to the collection, access, use, and disclosure of patient information; network partners also make determinations about when they will participate in any particular PCORnet query.

The Data Privacy Task Force is working collectively with the CDRNs and PPRNs to develop a set of privacy policies to govern data sharing by PCORnet.

This guidance is intended to augment the PCORnet policies to provide examples of methods to reduce the risk of re-identification with respect to the generation, collection, maintenance, or return of Network Data. Terms used in this guidance are defined in the PCORnet policies.

This guidance is intended to be modified over time as the PCORnet Distributed Research Network gains experience. The guidance covers the following privacy protective techniques:

(Threshold) Minimum count thresholds for Aggregate Data;

(Perturb) Perturbation of PCORnet Data;

(Obfuscate) Obfuscation of identifiers for record linkage; and

(De-identify) De-identification of record-level research participant information.

1.0 Minimum Threshold

One way personal information can be exploited for re-identification is by triangulating on small groups of individuals. To mitigate such attacks, PCORnet Policy currently states that Network Data Affiliates cannot release Network Data with cell counts of five or fewer unless authorized by the research protocol and the IRB(s) approving the query (see PCORnet Policy 6.2.2). PCORnet policies permit network partners to apply their local rules for masking cell counts, or to reject queries where the returned results would not meet their thresholds for releasing Aggregate Data. Such local policies must be consistent with commitments made to patients/data subjects with respect to use of their information.

Other examples of thresholds are shown in Table 1.
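As an illustration only, the minimum threshold rule can be sketched as a small masking function. The function name, the use of None to mark a suppressed cell, and the example age bands are assumptions made for this sketch; the default of 5 reflects the "five or fewer" wording above, and a network partner would substitute its own local threshold.

```python
from typing import Dict, Optional

# Assumed default, taken from the "five or fewer" policy wording above.
# A network partner with a stricter local rule would pass its own value.
MIN_CELL_COUNT = 5


def mask_small_cells(
    counts: Dict[str, int], threshold: int = MIN_CELL_COUNT
) -> Dict[str, Optional[int]]:
    """Return a copy of `counts` with every cell at or below `threshold` suppressed.

    Suppressed cells are reported as None. Note that a count of zero is also
    at or below the threshold, so it is suppressed as well in this sketch.
    """
    return {cell: (n if n > threshold else None) for cell, n in counts.items()}
```

For example, mask_small_cells({"age 0-17": 3, "age 18-44": 27, "age 45-64": 5, "age 65+": 112}) returns the counts 27 and 112 unchanged and replaces the two small cells with None.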

2.0 Perturbation of Query Results

Another way personal information can be exploited for re-identification is by issuing overlapping queries, removing their intersection, and thereby disclosing the remaining individuals. Consider an example of how this might be achieved. First, an Authorized User asks how many juvenile diabetics were on drug A or drug B with an adverse outcome, and the answer is X; for this case, let us assume X is 31. The User then issues a subsequent query asking how many juvenile diabetics were on drug A with an adverse outcome, and the answer is now 30. At this point, the User learns that exactly 1 juvenile diabetic with the adverse outcome was on drug B but not on drug A.

There are a number of ways in which this type of attack could be prevented. In practice, systems tend to apply either 1) rounding (or coarsening) or 2) injection of a certain degree of noise to the query result. As noted in PCORnet policies, the PCORnet query should specify the approach to be used to de-identify data or reduce re-identification risks (see PCORnet Policy 5.2.1.1).

If a rounding (or coarsening) approach is used, the result X could be rounded to the nearest multiple of 10. For instance, in the above scenario, the answers to both queries would be 30. However, it should be noted that the utility of the query answers is tied directly to the rounding value: the coarser the rounding, the less precise the answer. An initial rounding value of 10 is recommended.
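A minimal sketch of the rounding approach follows. The function name and the choice to round halves upward are assumptions for illustration; the guidance itself only recommends an initial rounding value of 10.

```python
def round_to_base(count: int, base: int = 10) -> int:
    """Round a non-negative count to the nearest multiple of `base`.

    Ties (for example 35 with base 10) round upward. This tie rule is an
    assumption of the sketch, not something the guidance specifies.
    """
    if base <= 0:
        raise ValueError("base must be positive")
    if count < 0:
        raise ValueError("count must be non-negative")
    return ((count + base // 2) // base) * base
```

Applied to the two query results in the example above, round_to_base(31) and round_to_base(30) both return 30, so subtracting one answer from the other no longer isolates a single individual.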

An alternative to rounding is the injection of a certain amount of noise into the results. This is the strategy that query-response tools such as i2b2 [Murphy 2009] (specifically in SHRINE [Lowe 2009]) apply. In this scheme, the result would be reported as X + e, where X is the true count (30 in the example above) and e is a random value selected from a known distribution. This distribution could be uniform, Gaussian, Laplacian, or something else. It should be noted that i2b2 applies a Gaussian distribution. If random noise is to be added, the approach needs to specify the standard deviation of the distribution from which the value is selected.
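The noise-injection approach can be sketched as follows, assuming zero-mean Gaussian noise. The function name, the clamping of negative results to zero, and the standard deviation used in the usage note are hypothetical choices for illustration and are not taken from i2b2 or SHRINE.

```python
import random


def add_gaussian_noise(count: int, sigma: float, rng: random.Random) -> int:
    """Return `count` plus zero-mean Gaussian noise, rounded to an integer.

    `sigma` is the standard deviation that the query approach would need to
    specify. Negative results are clamped to zero (an assumption of this
    sketch) so that a reported count is never below zero.
    """
    noisy = count + rng.gauss(0.0, sigma)
    return max(0, round(noisy))
```

A caller would create one generator, for example rng = random.SystemRandom(), and report add_gaussian_noise(30, 3.0, rng) in place of the true count of 30. The value 3.0 is a placeholder, not a recommended setting.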

3.0 Obfuscation of Identifiers for Record Linkage

To mitigate bias in investigations, it is important to resolve when a patient’s data resides in multiple resources. This process, called record linkage, is non-trivial because a patient’s record often contains typographical and semantic errors. Sophisticated record linkage strategies have been proposed to resolve these problems, but they rely on patient identifiers, such as personal name and Social Security Number. To overcome this barrier, a growing list of techniques has been proposed to support private record linkage (PRL).

From a high level, the PRL process has a lifecycle that entails (but is not necessarily limited to) the following steps [Toth 2014]:

1.  Generation and storage of keys for cryptosystems, or salt values for hash functions, invoked in a PRL protocol;

2.  Communication of keys and salt to the entities encoding the records upon request;

3.  Transformation of identifiers into their protected form as specified by the protocol;

4.  Separation of salt hosting and de-duplication trusted entities for enhanced security;

5.  Execution of the record linkage framework (e.g., feature weighting, blocking, and comparison of record pairs to predict which correspond to the same individual); and

6.  Transfer of records and parameters related to the linkage protocol (i.e., all communication between parties).

Under no circumstances can the keys or salt values be disclosed to any entity beyond PCORnet network partners.
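Steps 1 and 3 of the lifecycle above can be illustrated with a keyed hash. This is a sketch under stated assumptions: HMAC with SHA-256 is chosen here for illustration and is not prescribed by this guidance or by any particular PRL protocol, and the function names and the normalization rule are hypothetical. An exact-match keyed hash of this kind does not tolerate typographical errors, which is one reason PRL protocols often use more elaborate encodings.

```python
import hashlib
import hmac
import secrets


def generate_salt(num_bytes: int = 32) -> bytes:
    """Step 1: generate a secret salt (key) with a cryptographic RNG."""
    return secrets.token_bytes(num_bytes)


def normalize(value: str) -> str:
    """Reduce trivial formatting differences before hashing.

    Assumed rule for this sketch: trim, collapse whitespace, upper-case.
    """
    return " ".join(value.strip().upper().split())


def protect_identifier(value: str, salt: bytes) -> str:
    """Step 3: transform an identifier into its protected form.

    The same identifier and salt always yield the same token, so two sites
    holding the same salt produce comparable tokens without exchanging the
    identifier itself.
    """
    message = normalize(value).encode("utf-8")
    return hmac.new(salt, message, hashlib.sha256).hexdigest()
```

In this sketch the salt must be distributed only to the entities encoding the records (step 2) and kept from the party performing the linkage, consistent with the separation described in step 4.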

A number of network partners are exploring different approaches to private record linkage. Some network partners report using NIH’s Global Unique Identifier (GUID) Tool (https://fitbir.nih.gov/jsp/contribute/guid-overview.jsp). The CAPriCORN Clinical Data Research Network has developed private record de-duplication software [insert link to JAMIA paper when it is available]. The Secure Open Enterprise Master Patient Index (SOEMPI), developed by researchers at Vanderbilt University and the University of Texas at Dallas, is another approach. Private companies also offer de-duplication software options. Although it is too early to require that all PCORnet participants adopt a specific approach, converging on the same approach would be beneficial, as it would allow centralized de-duplication to occur rather than having network participants individually engage in these efforts.

To apply such an approach, PCORnet would need to agree on:

1.  Who is the third party (trusted party A) who generates the keys/salt values of the functions?

2.  Who is the third party (trusted party B) who gets to perform the linkage?

3.  Who gets to see the linkage results? In other words, do the member sites get to know when their constituents went to other sites?

4.  What is the similarity threshold by which we could claim that two records correspond to the same individual?
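To make question 4 concrete, the sketch below scores two values with the Dice coefficient over character bigrams and compares the score with a threshold. Everything here is an illustrative assumption: the choice of the Dice coefficient, the padding character, and the threshold values are hypothetical, and the comparison is shown on clear text only for readability. In a PRL setting the comparison would be carried out on protected encodings of the identifiers.

```python
from typing import Set


def bigrams(text: str) -> Set[str]:
    """Return the set of character bigrams of `text`, padded with '_' at both ends."""
    padded = f"_{text.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient: twice the overlap divided by the total size of both sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def is_match(x: str, y: str, threshold: float) -> bool:
    """Declare a match when the similarity reaches the agreed threshold."""
    return dice(bigrams(x), bigrams(y)) >= threshold
```

For "smith" and "smyth" the two bigram sets share 4 of 6 elements each, giving a score of 8/12, or about 0.67. The pair is therefore declared a match at a threshold of 0.6 but not at 0.8, which shows how the agreed threshold trades missed links against false links.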

There are no standards and no standard software available at this time. SOEMPI is one option, but it will require either PCORnet or some other organization to adopt the source code and support its operations. An alternative solution would be to piggyback on the software developed by the Chicago CDRN; the paper describing this system is under review at JAMIA and is provided separately. There are benefits and drawbacks to both systems in their design and linkage algorithms.

4.0 De-identification of Record-Level Data

A predominant model for research using the PCORnet Distributed Research Network is one where the individual, record-level or patient-level data remains under the control of the network partner (or Network Data Affiliate); the research query is run on the Network Data, and only Aggregate Data is returned in response. This privacy-preserving architecture reduces the need to adopt de-identification strategies for data shared in response to a query. [Mini Sentinel 2012]

However, PCORnet policies recognize that at times, responses to queries may require the sharing of record- or patient-level de-identified data. In addition, network partners (particularly those consisting of disparate organizations) may choose as a matter of local policy to create de-identified datasets for research purposes. There are a number of ways by which de-identification can be achieved. Follow this link for the latest guidance from the HHS Office for Civil Rights on HIPAA de-identification: http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html

In circumstances where the query requires the return of de-identified data, PCORnet policies require the query to specify the “definition and approach or procedures required to de-identify data.” In addition, some network partners may be required to abide by NIH’s recently released Genomic Data Sharing Policy, which includes specifications on the de-identification approach to be used. http://gds.nih.gov/03policy2.html.

For initial queries requiring the return of de-identified data, the PCORnet Coordinating Center (CC), with input from network partners participating in the queries, may need to set the approach to be used; however, over time, PCORnet should develop a robust set of policies and best practices that may reduce or eliminate the need for CC control.

These approaches focus on reducing risk of re-identification using demographic identifiers; future iterations of the guidance may need to deal with risk of re-identification from exposure of clinical data.

PCORnet network partners are invited to share their approaches to de-identification of record level data, in order to share resources and begin to develop a library of best practices. The following record-level de-identification approaches have been shared and are also available on the PCORnet Central Desktop:

A.  CAPriCORN approaches

CAPriCORN proposes initially to validate and use limited data sets with randomly seeded, time-shifted temporal references and geographical references restricted to the first three digits of zip codes. Expert statistical determination will be sought for the method of time-stamping events to confirm that it also meets the Safe Harbor de-identification criteria of the HIPAA Privacy Rule. Until such determination has been achieved, the data sets will be considered limited, rather than de-identified, datasets. In the event that this proves infeasible, CAPriCORN will adhere to Safe Harbor until the situation has evolved and use of date shifting is accepted.

A separate important piece of information useful for epidemiologic investigations is geographic location. We may need to incorporate these data through IRB approval of limited data sets rather than addresses that can be geocoded. ZIP code level data will need to be considered when applying our minimum threshold and perturbation of query rules.
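The two transformations described above, time-shifting of dates and restriction of ZIP codes to their first three digits, can be sketched as follows. This is not CAPriCORN's implementation. The per-patient offset derived from a secret key, the maximum shift of 182 days, and all function names are assumptions made for illustration.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical bound on the shift; the source does not state a window.
MAX_SHIFT_DAYS = 182


def patient_offset_days(patient_id: str, secret: bytes,
                        max_shift: int = MAX_SHIFT_DAYS) -> int:
    """Derive a fixed pseudo-random offset in [-max_shift, +max_shift] per patient.

    Deriving the offset from a secret key means every date for one patient
    moves by the same amount, so intervals between events are preserved.
    """
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    span = 2 * max_shift + 1
    return int.from_bytes(digest[:8], "big") % span - max_shift


def shift_date(d: date, patient_id: str, secret: bytes) -> date:
    """Shift one event date by the patient's fixed offset."""
    return d + timedelta(days=patient_offset_days(patient_id, secret))


def zip3(zip_code: str) -> str:
    """Keep only the first three digits of a ZIP code.

    HIPAA Safe Harbor additionally requires replacing the prefix with 000
    for areas of 20,000 or fewer people; that lookup is omitted here.
    """
    return zip_code.strip()[:3]
```

Because the offset is constant for a given patient, an admission on January 1 and a discharge on January 31 remain 30 days apart after shifting, while a different patient receives a different offset.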

B.  NephCure PPRN’s approaches to de-identification

  1. Encrypted hash (SHA1) on a sequential ID number assigned as the surveys come in.
  2. Randomizing birth dates within six months, with a new random birth date generated for each query.
  3. The Common Data Model has been constructed as views in a separate schema, so no queries can get to the underlying data.
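Items 1 and 2 above can be sketched as follows. This is an interpretation, not NephCure's code: "encrypted hash" is read here as a keyed hash (HMAC with SHA-1), six months is approximated as 183 days, and the function names are hypothetical. A key is used because an unkeyed hash of a sequential number could be reversed simply by hashing every candidate number.

```python
import hashlib
import hmac
import random
from datetime import date, timedelta

# Assumed approximation of "within six months".
MAX_JITTER_DAYS = 183


def hashed_id(sequential_id: int, key: bytes) -> str:
    """Item 1: keyed SHA-1 hash of the sequential survey ID (40 hex characters)."""
    message = str(sequential_id).encode("ascii")
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def jitter_birth_date(birth: date, rng: random.Random,
                      max_days: int = MAX_JITTER_DAYS) -> date:
    """Item 2: return a birth date moved by a random number of days.

    Called once per query, so each query sees a freshly randomized date.
    """
    return birth + timedelta(days=rng.randint(-max_days, max_days))
```

A query handler in this sketch would call jitter_birth_date for each record at query time rather than storing the randomized value, which matches the description of a new random birth date being generated for each query.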

C.  PEDSnet approaches to de-identification

  1. Institution replaces PHI with a site encrypted identifier, and maintains link between the two.
  2. DCC replaces “site encrypted identifier” with a PEDSnet encrypted identifier (PEI) to ensure uniqueness across sites.
  3. All datasets stored or sent out of the DCC use the PEI.

What this means in the study context is that the investigator gets a set of PEIs in response to a case-finding query. If they want to re-identify patients, they tell the DCC, who translates that back to a site and site encrypted identifier, and sends that back to the site of origin. That site is then able to link to PHI and re-contact the patient or provide additional data (e.g., chart review).

We’re planning to cycle a test of this process in December, if the DUAs get sorted by then.
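The two-level identifier flow described for PEDSnet can be sketched with two lookup tables, one held by each site and one held by the DCC. This is a simplified stand-in, not PEDSnet's implementation: random tokens take the place of the encrypted identifiers, and the class and method names are hypothetical.

```python
import secrets
from typing import Dict, Tuple


class Site:
    """A contributing institution; the only party that can map back to PHI."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._sei_by_patient: Dict[str, str] = {}
        self._patient_by_sei: Dict[str, str] = {}

    def encode(self, patient_key: str) -> str:
        """Step 1: replace a local patient key with a site encrypted identifier."""
        if patient_key not in self._sei_by_patient:
            sei = secrets.token_hex(8)
            self._sei_by_patient[patient_key] = sei
            self._patient_by_sei[sei] = patient_key
        return self._sei_by_patient[patient_key]

    def resolve(self, sei: str) -> str:
        """Re-identification at the site of origin."""
        return self._patient_by_sei[sei]


class DCC:
    """Data coordinating center; sees site identifiers but never PHI."""

    def __init__(self) -> None:
        self._pei_by_source: Dict[Tuple[str, str], str] = {}
        self._source_by_pei: Dict[str, Tuple[str, str]] = {}

    def assign_pei(self, site_name: str, sei: str) -> str:
        """Step 2: map (site, site identifier) to a network-wide PEI."""
        source = (site_name, sei)
        if source not in self._pei_by_source:
            pei = secrets.token_hex(8)
            self._pei_by_source[source] = pei
            self._source_by_pei[pei] = source
        return self._pei_by_source[source]

    def lookup(self, pei: str) -> Tuple[str, str]:
        """Translate a PEI back to its site and site identifier."""
        return self._source_by_pei[pei]
```

In this sketch an investigator holds only PEIs. To re-contact a patient, the investigator passes a PEI to the DCC, dcc.lookup(pei) returns the site name and site identifier, and only that site can call resolve to reach the local patient key.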

Tables and Figures

Table 1. Examples of thresholds applied in the minimum threshold rule

AGENCY / MINIMUM THRESHOLD
Washington State Department of Health [WA 2012] / 10
Centers for Disease Control Healthy People 2010 [Klein 2002] / 5 - 10
Arkansas HIV/AIDS Data Release Policy [AR 2010] / 5
Colorado State Department of Public Health and Environment [CO 2010] / 5
National Center for Health Statistics [NCHS 2004] / 5
UK Department of Enterprise, Trade, and Investment [DETI 2010] / 5
Utah State Department of Health [UT 2005] / 5
Iowa Department of Public Health [IA 2005] / 4
NASA [SEDAC 2005] / 3

References

[AR 2010] Arkansas HIV/AIDS Surveillance Section. Arkansas HIV/AIDS Data Release Policy. Available online: http://www.healthy.arkansas.gov/programsServices/healthStatistics/Documents/STDSurveillance/Datadeissemination.pdf. First published: May 2010. Last Accessed: April 29, 2014.

[CO 2010] Colorado State Department of Public Health and Environment. Guidelines for working with small numbers. Available online: http://www.cdphe.state.co.us/cohid/smnumguidelines.html. Last Accessed: April 29, 2014.

[DETI 2010] U.K. Department of Enterprise, Trade, and Investment. DETI Data Confidentiality Statement. Available online: http://www.detini.gov.uk/deti-stats-index/stats-national-statistics/data-security.htm. Last Accessed: April 29, 2014.

[Klein 2002] R. Klein, S. Proctor, M. Boudreault, K. Turczyn. Healthy People 2010 criteria for data suppression. Centers for Disease Control Statistical Notes Number 24. 2002.

[Mini Sentinel 2012] J. Rassen, et al. Mini-Sentinel Methods: Evaluating Strategies for Data Sharing and Analyses in Distributed Data Settings. November 2012. Available online: http://www.mini-sentinel.org/work_products/Statistical_Methods/Mini-Sentinel_Methods_Evaluating-Strategies-for-Data-Sharing-and-Analyses.pdf.