Big Data Deidentification, Reidentification and Anonymization

Author: Mohammed Khan, CISA, CRISC, CDPSE, CIPM, Six Sigma Certified Green Belt
Date Published: 1 January 2018

Big data can seem indeterminate because of its constant use across data science fields and science, technology and humanities enterprises. There is a growing need to understand what big data can do for society at large. Not only can it improve human life by bringing medical innovations to market faster, but it can also harness computing power to analyze large data sets and improve the efficiency of current technologies.

The use of big data is possible only with the proper dissemination and anonymization of publicly accessible data. To facilitate and administer the implementation of controls around big data, one must truly understand the concepts of deidentification, reidentification and anonymization. One famous study demonstrated that 87 percent of the American population can be uniquely identified by their gender, ZIP code and date of birth.1 This illustrates that anonymization, while practical, requires further study and due diligence. It is important that personal data be anonymized correctly before being used as part of a publicly available big data set. Auditing professionals who work with big data, deal with global privacy implications and handle sensitive research data require the knowledge and technical aptitude to audit the big data space to stay relevant. Almost all enterprises are now taking on big data projects, and the pressure to comply with growing regulatory requirements is leading internal compliance, risk and audit functions at these enterprises to demand auditors with these necessary skill sets.
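
The uniqueness finding behind that statistic can be checked empirically on any data set. The following Python sketch (a minimal illustration; the sample records are hypothetical, not drawn from the study) measures what share of records is uniquely identified by the quasi-identifier combination of gender, ZIP code and date of birth:

from collections import Counter

# Hypothetical records: each tuple is (gender, ZIP code, date of birth).
records = [
    ("F", "60601", "1984-03-12"),
    ("F", "60601", "1984-03-12"),  # shares all three quasi-identifiers with the row above
    ("M", "60601", "1984-03-12"),
    ("F", "60614", "1991-07-04"),
]

# Count how often each quasi-identifier combination occurs.
counts = Counter(records)

# A record is uniquely identifiable when its combination occurs exactly once.
unique = sum(1 for r in records if counts[r] == 1)
print(f"{unique / len(records):.0%} of records are uniquely identifiable")  # 50%

A count of this kind, applied to census-scale data, is essentially what produced the 87 percent figure cited above.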

Deidentification, Reidentification and Anonymization

It is critical to recall that the Data Protection Directive’s (DPD) definition of personal data is personal information relating to an identified or identifiable natural person.2 It is possible for the controller or a third party to identify the data subject, directly or indirectly, by reference to the data subject’s identification number or to one or more factors specific to the physical, mental, economic, cultural or social identity of the data subject. Therefore, it is important to consider the deidentification, reidentification and anonymization of data in big data sets when considering data use for enterprise projects and external-facing studies.

Deidentification is the altering of personal data so that it is next to impossible to identify the data subject from which the data were derived. Figure 1 is an example of deidentification in which the column “Student Name” is removed.
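
As a minimal sketch of that step (the column names follow figure 1 and the Mark Smith example below; the records themselves are hypothetical), the direct identifier can simply be dropped before the data are released:

# Hypothetical student records mirroring figure 1; "student_name" is the direct identifier.
records = [
    {"student_name": "Mark Smith", "graduating_year": 1996, "grade_average": "B-", "classes_failed": 2},
    {"student_name": "Jane Doe", "graduating_year": 1998, "grade_average": "A", "classes_failed": 0},
]

# Deidentification: remove the direct identifier from every record before release.
deidentified = [{k: v for k, v in r.items() if k != "student_name"} for r in records]
print(deidentified)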

Reidentification is the method of reversing deidentification by reestablishing the identity of the data subject. For example (building on the previous example), one could use LinkedIn to determine that Mark Smith graduated from high school in 1996. This allows for the reidentification of Mark Smith’s record (it is the only one showing a graduating year of 1996), thereby revealing his grade average and number of classes failed.
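
Continuing the sketch above (same hypothetical records), the linkage attack needs nothing more than the publicly learned graduation year:

# Deidentified records from the previous sketch (direct identifier already removed).
deidentified = [
    {"graduating_year": 1996, "grade_average": "B-", "classes_failed": 2},
    {"graduating_year": 1998, "grade_average": "A", "classes_failed": 0},
]

# Reidentification: link an external fact back to the deidentified release.
known_year = 1996  # learned from Mark Smith's public LinkedIn profile

matches = [r for r in deidentified if r["graduating_year"] == known_year]
if len(matches) == 1:
    # Only one record shows a 1996 graduating year, so it must be Mark Smith's.
    print("Reidentified Mark Smith:", matches[0])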

Anonymization is the altering of data by the data controller in such a way that it is impossible for anyone to establish the identity of the data subject.

Figure 1 could be anonymized as shown in figure 2 (using the techniques of generalization, noise addition and permutation, which will be explained).

European and US Laws Related to Data Anonymization Concepts

As mentioned earlier, the DPD definition of personal data is information relating to an identified or identifiable person. Specifically, Article 2(a) of the DPD states:

“Personal data” shall mean any information relating to an identified or identifiable natural person (‘data subject’); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.3

Directive 95/46/EC refers to anonymization in Recital 26 to exclude anonymized data from its scope. Recital 26 signifies that to anonymize any data, the data must be stripped of sufficient elements such that the data subject can no longer be identified. The e-Privacy Directive (Directive 2002/58/EC) also refers to “anonymization” and “anonymous data” in much the same regard.4

The US Department of Health and Human Services (HHS) enforces the US Health Insurance Portability and Accountability Act (HIPAA), which establishes specific and strict standards for the deidentification of covered health data, or protected health information (PHI).5 The deidentification standard requires that either all 18 specified patient identifiers be removed from the PHI (the Safe Harbor method)6 or that statistical or scientific principles be applied to validate that the risk of reidentifying the deidentified data is very small, before the data are used for big data purposes.
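
A hedged sketch of the Safe Harbor approach follows; it removes only a handful of the 18 identifier classes, and the field names are illustrative rather than drawn from the HIPAA rule text:

# A few of the 18 Safe Harbor identifier classes (names, geographic subdivisions,
# dates, phone numbers, Social Security numbers, ...); illustrative, not exhaustive.
SAFE_HARBOR_FIELDS = {"name", "street_address", "phone", "ssn", "email", "birth_date"}

def safe_harbor_scrub(record: dict) -> dict:
    """Drop any field that matches the (partial) Safe Harbor identifier list."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}

patient = {"name": "J. Patel", "ssn": "000-00-0000", "birth_date": "1970-01-01", "diagnosis": "J45.909"}
print(safe_harbor_scrub(patient))  # {'diagnosis': 'J45.909'}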

Methods of Pseudonymizing and Anonymizing Data

Pseudonymization is the process of deidentifying data sets by replacing the identifying attributes in a record, particularly those that in combination can be unique (e.g., race, gender), with other values. However, the data controller (the owner of the original data set) can still identify the data directly, allowing for reidentification. For example, if one were to eliminate all identifying data elements but leave an internal numerical identifier, reidentification would be very difficult for a third party, but very easy for the data controller. Thus, such identifiers, and therefore all pseudonymized data, are still personal data.

Pseudonymized data are normally not supposed to be used as test data; test data must be anonymized, and one can instead rely on randomly generated data from sites that specialize in such use.7 Pseudonymization reduces the linkability of data sets to the original identity of the data subject, which helps avoid legal issues when deidentifying and anonymizing personal data prior to releasing it into the big data space. Implementing pseudonymization so that the data are not identifiable at the data-subject level requires basic guidelines, including:

  • Eliminating the ability to link the data set to other data sets, which could otherwise make individual records in the pseudonymized data uniquely identifiable
  • Storing the encryption key securely and separately from the encrypted data
  • Protecting the data using administrative, physical and technical security measures

Figure 3 demonstrates how pseudonymization works.
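
One common way to implement these guidelines is a keyed hash: the controller, who holds the key (stored separately from the data, per the guideline above), can recompute a pseudonym, but a third party cannot reverse it. This is a minimal sketch under those assumptions, not a rendering of figure 3, and the key handling is simplified for illustration:

import hashlib
import hmac

# In practice, the key lives in a key vault, stored separately from the pseudonymized data.
SECRET_KEY = b"stored-separately-in-a-key-vault"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed, non-reversible pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"student_name": "Mark Smith", "graduating_year": 1996}
record["student_name"] = pseudonymize(record["student_name"])
print(record)  # the controller can recompute the pseudonym; a third party cannot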

Anonymization is achieved when the data can no longer be used to identify a natural person by using “all the means likely reasonably to be used by the controller or by any other person.”8 Unlike pseudonymization, anonymization of data is irreversible: once the links between the data subject and the subject’s records are broken, it is virtually impossible to reestablish them. Anonymization is, in essence, the destruction of identifiable data.

For example, every day, John attends a class at the same yoga studio and, on his way, buys a donut from the store next to the studio. John always uses the same method of payment and, once a week, he uses the pay phone next to the donut store to call his wife and let her know he is going to bring a donut home for her. Even if the data owner has “anonymized” John’s personally identifiable data (e.g., name, address, phone number), the behavior he displays can possibly be used to identify him directly. Hence, it is important to anonymize his data by stating facts through grouping, for instance, “10 people went to the yoga studio and purchased donuts every day from the store next to the studio” and “20 people called from the pay phone one day out of the week.” The data are now anonymized, since one can no longer identify John’s predictable pattern of behavior. Anonymizing data in this way truly prevents the owner of the data, and the enterprises using the data, from identifying individual data subjects.

Randomization changes the accuracy of the data by removing the unique link between the data and the individual. There are two methods to perform this technique (both are sketched, together with the grouping approach, in the code after this list):

  • Noise addition—Alters an attribute by adding or subtracting a different random value for each record (e.g., adjusting each data subject’s grade by a random amount, so an A+ might be reported as anything down to a C-)
  • Permutation—Consists of swapping the values of an attribute from one data subject to another (e.g., exchanging the grades of data subject A with those of data subject B)
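
Both randomization methods, along with the grouping approach from the John example, can be sketched in a few lines. The records and the numeric grade scale below are hypothetical stand-ins for the article’s figures:

import random
from collections import Counter

random.seed(7)  # reproducible illustration

# Grouping (the John example): release only aggregate counts, never individual rows.
visits = ["yoga studio", "yoga studio", "donut store", "pay phone", "yoga studio"]
print(Counter(visits))  # Counter({'yoga studio': 3, 'donut store': 1, 'pay phone': 1})

# Hypothetical records with a numeric grade average (4.0 scale) as the sensitive attribute.
records = [{"id": 1, "grade_average": 3.2},
           {"id": 2, "grade_average": 2.1},
           {"id": 3, "grade_average": 3.8}]

# Noise addition: add or subtract a different random value for each record.
noisy = [{**r, "grade_average": round(r["grade_average"] + random.uniform(-0.5, 0.5), 2)}
         for r in records]

# Permutation: swap the attribute's values among the data subjects.
grades = [r["grade_average"] for r in records]
random.shuffle(grades)
permuted = [{**r, "grade_average": g} for r, g in zip(records, grades)]

print(noisy)
print(permuted)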

Conclusion

Big data will grow exponentially and, as studies show, “A full 90 percent of all the data in the world has been generated over the last two years.”9 The use of big data to capitalize on this wealth of information is already happening, as can be seen in the daily use of technology platforms such as Google Maps or the predictive search suggestions on a website. It is important for auditors to understand the basic concepts of big data so that personally identifiable data can be properly addressed through anonymization or deidentification. Growing regulation of data usage, including specific changes to the regulatory and privacy landscape in both Europe and the United States, will require careful technical and legal frameworks. As data keep increasing exponentially and new regulations emerge that require data owners to properly protect the identity of their data subjects, it is more important than ever to tread carefully through these topics so as to better the technology and innovations that will come with the use of big data.

Endnotes

1 Sweeney, L.; “Simple Demographics Often Identify People Uniquely,” Data Privacy Working Paper 3, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, 2000, https://dataprivacylab.org/projects/identifiability/paper1.pdf
2 Office of the Data Protection Commissioner, “EU Directive 95/46/EC—The Data Protection Directive,” European Union, https://www.dataprotection.ie/docs/EU-Directive-95-46-EC-Chapter-1/92.htm
3 Ibid.
4 Data Protection Working Party, “Opinion 05/2014 on Anonymisation Techniques,” Article 29 Data Protection Working Party, European Union, 10 April 2014, http://ec.europa.eu/justice/data-protection/article-29/documentation/opinionrecommendation/files/2014/wp216_en.pdf
5 Department of Health and Human Services, “45 CFR Subtitle A (10–1–10 Edition),” USA, https://www.gpo.gov/fdsys/pkg/CFR-2010-title45-vol1/pdf/CFR-2010-title45-vol1-sec164-502.pdf
6 Department of Health and Human Services, “Guidance Regarding Methods for Deidentification of Protected Health Information in Accordance With the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule,” USA, https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
7 Mockaroo, Realistic Data Generator, www.mockaroo.com
8 Office of the Data Protection Commissioner, “Anonymisation and Pseudonymisation,” European Union, https://www.dataprotection.ie/docs/Anonymisation-and-pseudonymisation/1594.html
9 Dragland, A.; “Big Data—For Better or Worse,” SINTEF, https://www.sintef.no/en/latest-news/big-data-for-better-or-worse/

Mohammed J. Khan, CISA, CRISC, CIPM
Is a global audit manager at Baxter, a global medical device and health care company. He works with C-suites across audit, security, medical device engineering (cyber) and privacy offices. He has spearheaded multinational global audits and assessments in several areas, including enterprise resource planning systems, global data centers, cloud platforms (AWS, SFDC, etc.), third-party manufacturing and outsourcing reviews, process re-engineering and improvement, global privacy assessments (EUDD, HIPAA, GDPR), and medical device cyber security initiatives in several markets over the past five years. Most recently, he has taken on further expertise in the area of medical device cyber security. Khan has previously worked as a senior consultant for Ernst & Young and Deloitte and as a technology expert for global ERP/supply chain systems at Motorola. He frequently speaks at national and international conferences in the space of data privacy, cyber security and risk advisory.