[Federal Register: August 14, 2002 (Volume 67, Number 157)]
[Rules and Regulations]
[Page 53181-53273]
From the Federal Register Online via GPO Access [wais.access.gpo.gov]
[DOCID:fr14au02-32]
G. Section 164.514--Other Requirements Relating to Uses and Disclosures
of Protected Health Information
1. De-Identification of Protected Health Information
December 2000 Privacy Rule. At Sec. 164.514(a)-(c), the Privacy
Rule permits a covered entity to de-identify protected health
information so that such information may be used and disclosed freely,
without being subject to the Privacy Rule's protections. Health
information is de-identified, or not individually identifiable, under
the Privacy Rule, if it does not identify an individual and if the
covered entity has no reasonable basis to believe that the information
can be used to identify an individual. In order to meet this standard,
the Privacy Rule provides two alternative methods for covered entities
to de-identify protected health information.
First, a covered entity may demonstrate that it has met the
standard if a person with appropriate knowledge and experience applying
generally acceptable statistical and scientific principles and methods
for rendering information not individually identifiable makes and
documents a determination that there is a very small risk that the
information could be used by others to identify a subject of the
information. The preamble to the Privacy Rule refers to two government
reports that provide guidance for applying these principles and
methods, including describing types of techniques intended to reduce
the risk of disclosure that should be considered by a professional when
de-identifying health information. These techniques include removing
all direct identifiers, reducing the number of variables on which a
match might be made, and limiting the distribution of records through a
``data use agreement'' or ``restricted access agreement'' in which the
recipient agrees to limits on who can use or receive the data.
Alternatively, covered entities may choose to use the Privacy
Rule's safe harbor method for de-identification. Under the safe harbor
method, covered entities must remove all of a list of 18 enumerated
identifiers and have no actual knowledge that the information remaining
could be used, alone or in combination, to identify a subject of the
information. The identifiers that must be removed include direct
identifiers, such as name, street address, and social security number, as
well as other identifiers, such as birth date, admission and discharge
dates, and five-digit zip code. The safe harbor requires removal of
geographic subdivisions smaller than a State, except for the initial
three digits of a zip code if the geographic unit formed by combining
all zip codes with the same initial three digits contains more than
20,000 people. In addition, age (if less than 90), gender, ethnicity,
and other demographic information not listed may remain in the
information. The safe harbor is intended to provide covered entities
with a simple, definitive method that does not require much judgment by
the covered entity to determine if the information is adequately de-
identified.
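The field-removal step of the safe harbor can be illustrated with a short
Python sketch. It is only an illustration under stated assumptions: the
field names are hypothetical, the set below covers only the identifiers
named in the paragraph above (not the full list of 18), the zip code is
dropped entirely rather than truncated to three digits, and an age of 90
or over is simply removed. The "no actual knowledge" condition is a matter
of judgment that code cannot check.

```python
# Hypothetical field names; only the identifiers named in the text above,
# NOT the complete list of 18 enumerated identifiers.
REMOVED_FIELDS = {
    "name",
    "street_address",
    "ssn",
    "birth_date",
    "admission_date",
    "discharge_date",
    "zip5",
}


def scrub(record: dict) -> dict:
    """Return a copy of the record without the fields listed above."""
    out = {k: v for k, v in record.items() if k not in REMOVED_FIELDS}
    # Age may remain only if it is less than 90; this sketch drops it otherwise.
    if "age" in out and out["age"] >= 90:
        del out["age"]
    return out


example = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "zip5": "12345",
    "age": 45,
    "gender": "F",
    "diagnosis": "J45",
}
scrubbed = scrub(example)
```

Here the scrubbed record keeps only age, gender, and diagnosis.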
The Privacy Rule also allows for the covered entity to assign a
code or other means of record identification to allow de-identified
information to be re-identified by the covered entity, if the code is
not derived from, or related to, information about the subject of the
information. For example, the code cannot be a derivation of the
individual's social security number, nor can it be otherwise capable of
being translated so as to identify the individual. The covered entity
also may not use or disclose the code for any other purpose, and may
not disclose the mechanism (e.g., algorithm or other tool) for re-
identification.
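One way to satisfy the requirement that the code not be derived from
information about the subject is to draw it at random. The Python sketch
below is a minimal illustration, not a prescribed method: the function
name, the code length, and the sample record numbers are assumptions. The
crosswalk from internal record number to code is the re-identification
mechanism and would stay inside the covered entity.

```python
import secrets


def assign_reid_codes(record_ids):
    """Map each internal record ID to a random code.

    The codes come from a random source, so they carry no information
    about the individual or the record ID they are paired with.
    """
    crosswalk = {}
    used = set()
    for rid in record_ids:
        code = secrets.token_hex(16)  # 32 hex characters of randomness
        while code in used:           # guard against an unlikely repeat
            code = secrets.token_hex(16)
        used.add(code)
        crosswalk[rid] = code
    return crosswalk


# Hypothetical internal record numbers, for illustration only.
crosswalk = assign_reid_codes(["MRN-001", "MRN-002", "MRN-003"])
```

Only the random codes would accompany the de-identified records; the
crosswalk itself is never disclosed.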
The Department is cognizant of the increasing capabilities and
sophistication of electronic data matching used to link data elements
from various sources and from which, therefore, individuals may be
identified. Given this increasing risk to individuals' privacy, the
Department included in the Privacy Rule the above stringent standards
for determining when information may flow unprotected. The Department
also wanted the standards to be flexible enough so the Privacy Rule
would not be a disincentive for covered entities to use or disclose de-
identified information wherever possible. The Privacy Rule, therefore,
strives to balance the need to protect individuals' identities with the
need to allow de-identified databases to be useful.
March 2002 NPRM. The Department heard a number of concerns
regarding the de-identification standard in the Privacy Rule. These
concerns generally were raised in the context of using and disclosing
information for research, public health purposes, or for certain health
care operations. In particular, concerns were expressed that the safe
harbor method for de-identifying protected health information was so
stringent that it required removal of many of the data elements that
were essential to analyses for research and these other purposes. The
comments, however, demonstrated little consensus as to which data
elements were needed for such analyses and were largely silent
regarding the feasibility of using the Privacy Rule's alternative
statistical method to de-identify information.
Based on the comments received, the Department was not convinced of
the need to modify the safe harbor standard for de-identified
information. However, the Department was aware that a number of
entities were confused by potentially conflicting provisions within the
de-identification standard. These entities argued that, on the one
hand, the Privacy Rule treats information as de-identified if all
listed identifiers on the information are stripped, including
any unique, identifying number, characteristic, or code. Yet, the
Privacy Rule permits a covered entity to assign a code or other record
identification to the information so that it may be re-identified by
the covered entity at some later date.
The Department did not intend such a re-identification code to be
considered one of the unique, identifying numbers or codes that
prevented the information from being de-identified. Therefore, the
Department proposed a technical modification to the safe harbor
provisions explicitly to except the re-identification code or other
means of record identification permitted by Sec. 164.514(c) from the
listed identifiers (Sec. 164.514(b)(2)(i)(R)).
Overview of Public Comments. The following provides an overview of
the public comment received on this proposal. Additional comments
received on this issue are discussed below in the section entitled,
``Response to Other Public Comments.''
All commenters who addressed the proposal to clarify that the safe
harbor re-identification code is not an enumerated identifier supported
the proposed regulatory clarification.
Final Modifications. Based on the Department's intent that the re-
identification code not be considered one of the enumerated identifiers
that must be excluded under the safe harbor for de-identification, and
the public comment supporting this clarification, the Department adopts
the provision as proposed. The re-identification code or other means of
record identification permitted by Sec. 164.514(c) is expressly
excepted from the listed safe harbor identifiers at
Sec. 164.514(b)(2)(i)(R).
Response to Other Public Comments
Comment: One commenter asked if data can be linked inside the
covered entity and a dummy identifier substituted for the actual
identifier when the data is disclosed to the external researcher, with
control of the dummy identifier remaining with the covered entity.
Response: The Privacy Rule does not restrict linkage of protected
health information inside a covered entity. The model that the
commenter describes for the dummy identifier is consistent with the re-
identification code allowed under the Rule's safe harbor so long as the
covered entity does not generate the dummy identifier using any
individually identifiable information. For example, the dummy
identifier cannot be derived from the individual's social security
number, birth date, or hospital record number.
Comment: Several commenters who supported the creation of de-
identified data for research based on removal of facial identifiers
asked if a keyed-hash message authentication code (HMAC) can be used as
a re-identification code even though it is derived from patient
information, because it is not intended to re-identify the patient and
it is not possible to identify the patient from the code. The
commenters stated that use of the keyed-hash message authentication
code would be valuable for research, public health and bio-terrorism
detection purposes where there is a need to link clinical events on the
same person occurring in different health care settings (e.g., to avoid
double counting of cases or to observe long-term outcomes).
These commenters referenced Federal Information Processing Standard
(FIPS) 198: ``The Keyed-Hash Message Authentication Code.'' This
standard describes a keyed-hash message authentication code (HMAC) as a
mechanism for message authentication using cryptographic hash
functions. The HMAC can be used with any iterative approved
cryptographic hash function, in combination with a shared secret key. A
hash function is an approved mathematical function that maps a string
of arbitrary length (up to a pre-determined maximum size) to a fixed
length string. It may be used to produce a checksum, called a hash
value or message digest, for a potentially long string or message.
According to the commenters, the HMAC can only be breached when the
key and the identifier from which the HMAC is derived and the de-
identified information attached to this code are known to the public.
It is common practice that the key is limited in time and scope (e.g.,
only for the purpose of a single research query) and that data not be
accumulated with such codes (with the code needed for joining records
being discarded after the de-identified data has been joined).
Response: The HMAC does not meet the conditions for use as a re-
identification code for de-identified information. It is derived from
individually identified information and it appears the key is shared
with or provided by the recipient of the data in order for that
recipient to be able to link information about the individual from
multiple entities or over time. Since the HMAC allows identification of
individuals by the recipient, disclosure of the HMAC violates the Rule.
It is not solely the public's access to the key that matters for these
purposes; the covered entity may not share the key to the re-
identification code with anyone, including the recipient of the data,
regardless of whether the intent is to facilitate re-identification or
not.
The HMAC methodology, however, may be used in the context of the
limited data set, discussed below. The limited data set contains
individually identifiable health information and is not a de-identified
data set. Creation of a limited data set for research with a data use
agreement, as specified in Sec. 164.514(e), would not preclude
inclusion of the keyed-hash message authentication code in the limited
data set. The Department encourages inclusion of the additional
safeguards mentioned by the commenters as part of the data use
agreement whenever the HMAC is used.
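For readers unfamiliar with the mechanism, the Python sketch below shows
how a keyed hash yields the same linking code for the same person under
the same key. It is an illustration only: the key, the sample identifiers,
the function name, and the choice of SHA-256 as the hash function are
assumptions. Because the code is derived from an individual's identifier,
it belongs to the limited data set context described above, not to the
safe harbor re-identification code.

```python
import hashlib
import hmac


def linking_code(secret_key: bytes, identifier: str) -> str:
    """Compute an HMAC of the identifier under the secret key."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key, scoped to a single research query.
key = b"example-key-for-one-research-query"

code_a = linking_code(key, "000-00-0001")   # first clinical event
code_b = linking_code(key, "000-00-0001")   # same person, second setting
code_c = linking_code(key, "000-00-0002")   # different person
code_d = linking_code(b"another-key", "000-00-0001")  # same person, other key
```

Records for the same person carry the same code under one key and can be
joined, while a different key produces unrelated codes. Limiting the key
in time and scope therefore limits how widely records can be linked.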
Comment: One commenter requested that HHS update the safe harbor
de-identification standard with prohibited 3-digit zip codes based on
2000 Census data.
Response: The Department stated in the preamble to the December
2000 Privacy Rule that it would monitor such data and the associated
re-identification risks and adjust the safe harbor as necessary.
Accordingly, the Department provides such updated information in
response to the above comment. The Department notes that these three-
digit zip codes are based on the five-digit Zip Code Tabulation Areas
created by the Census Bureau for the 2000 Census. This new methodology
also is briefly described below, as it will likely be of interest to
all users of data tabulated by zip code.
The Census Bureau will not be producing data files containing U.S.
Postal Service zip codes either as part of the Census 2000 product
series or as a post Census 2000 product. However, due to the public's
interest in having statistics tabulated by zip code, the Census Bureau
has created a new statistical area called the Zip Code Tabulation Area
(ZCTA) for Census 2000. The ZCTAs were designed to overcome the
operational difficulties of creating a well-defined zip code area by
using Census blocks (and the addresses found in them) as the basis for
the ZCTAs. In the past, there has been no correlation between zip codes
and Census Bureau geography. Zip codes can cross State, place, county,
census tract, block group and census block boundaries. The geographic
entities the Census Bureau uses to tabulate data are relatively stable
over time. For instance, census tracts are only defined every ten
years. In contrast, zip codes can change more frequently. Because of
the ill-defined nature of zip code boundaries, the Census Bureau has no
file (crosswalk) showing the relationship
between US Census Bureau geography and US Postal Service zip codes.
ZCTAs are generalized area representations of U.S. Postal Service
(USPS) zip code service areas. Simply put, each one is built by
aggregating the Census 2000 blocks, whose addresses use a given zip
code, into a ZCTA which gets that zip code assigned as its ZCTA code.
They represent the majority USPS five-digit zip code found in a given
area. For those areas where it is difficult to determine the prevailing
five-digit zip code, the higher-level three-digit zip code is used for
the ZCTA code. For further information, go to:
geo/www/gazetteer/places2k.html.
Utilizing 2000 Census data, the following three-digit ZCTAs have a
population of 20,000 or fewer persons. To produce a de-identified data
set utilizing the safe harbor method, all records with three-digit zip
codes corresponding to these three-digit ZCTAs must have the zip code
changed to 000. The 17 restricted zip codes are: 036, 059, 063, 102,
203, 556, 692, 790, 821, 823, 830, 831, 878, 879, 884, 890, and 893.
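The zip code rule above can be sketched in Python as follows. The set
holds the 17 restricted three-digit codes listed in this section; the
function name and the input handling (zip codes supplied as strings, so
that leading zeros are preserved) are assumptions made for illustration.

```python
# The 17 restricted three-digit zip codes listed above (2000 Census data).
RESTRICTED_ZIP3 = frozenset({
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
})


def safe_harbor_zip3(zip_code: str) -> str:
    """Return the three-digit zip code to keep, or "000" if restricted."""
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 5:
        raise ValueError(f"not a five-digit zip code: {zip_code!r}")
    prefix = digits[:3]
    return "000" if prefix in RESTRICTED_ZIP3 else prefix
```

For example, safe_harbor_zip3("03601") returns "000", while
safe_harbor_zip3("10001") returns "100".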