Reidentifying the Anonymized: Ethical Hacking Challenges in AI Data Training

Author: Isla Sibanda
Date Published: 16 September 2024
Read Time: 2 minutes

In an era where data is likened to the new oil, the intersection of big data, artificial intelligence (AI), and privacy raises profound ethical questions.

A startling revelation from the Georgetown Law Technology Review shows that 63% of Americans can be identified using their gender, birth date, and ZIP code.1 This vulnerability is exacerbated by findings demonstrating that 99.98% of Americans could be reidentified from a dataset using 15 basic attributes.2 These statistics highlight the fragility of personal information in the digital age and underscore the critical challenges that come with safeguarding user data and privacy.

The stakes are even higher when it comes to genetic data, where the potential for identification carries dire implications for individual privacy and societal discrimination. These implications are why the public reacted so strongly to news that 23andMe would share data with GlaxoSmithKline to develop medical treatments.3


The emergence of sophisticated AI methodologies capable of identifying individuals through intricate behavioral patterns amplifies these concerns and underscores the pressing need for robust privacy mechanisms and ethical frameworks in AI data training.4 As AI continues to evolve, the imperative to balance innovation with individual privacy rights becomes increasingly important, shaping the discourse on ethical hacking and the responsible use of AI in data training.

The Fragility of Anonymized Data

The aforementioned research from Georgetown Law unveils a concerning reality: Simple pieces of information, when combined, can compromise anonymity. This is not merely a hypothetical risk; it has real-world implications for everyone who uses the Internet today.

The study’s indication that more than half of the US population could be reidentified from minimal data points such as gender, date of birth, and ZIP code illustrates the fragility of anonymized datasets, despite the good intentions behind them.5

Adding to this complexity, it has been demonstrated that a comprehensive set of demographic attributes could potentially ensure the reidentification of virtually any individual in America.6

These revelations challenge the core assumption that anonymized data is inherently safe and non-identifiable.7 In many datasets used today, the breadth of information collected can include hundreds of attributes per person, creating a dense web of data that, while ostensibly anonymized, can still reveal individual identities when analyzed with advanced statistical methods.

Advanced Reidentification Techniques in AI

AI algorithms8 have demonstrated remarkable efficiency in discerning patterns and correlations within data that are not immediately apparent to human analysts. This includes triplet-loss learning—a technique that trains a model on three reference samples at a time (an anchor, a matching positive, and a nonmatching negative) and is widely used in facial recognition and image retrieval.
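
To ground the concept, the following is a minimal NumPy sketch of the triplet-loss computation; the toy two-dimensional embeddings and the margin of 0.2 are illustrative assumptions, whereas production systems learn high-dimensional embeddings from large training sets.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: the anchor should sit closer to the
    positive (same identity) than to the negative (different identity)
    by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to same identity
    d_neg = np.linalg.norm(anchor - negative)  # distance to different identity
    return max(d_pos - d_neg + margin, 0.0)

# Toy 2-D embeddings; real systems learn high-dimensional ones
anchor = np.array([0.10, 0.92])
positive = np.array([0.12, 0.90])
negative = np.array([0.85, 0.15])
print(triplet_loss(anchor, positive, negative))  # 0.0: already well separated
```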

This capability extends to behavioral data, wherein AI can identify unique patterns of behavior that serve as indirect identifiers even when direct identifiers (e.g., names and social security numbers) are absent.

Likewise, there has been a recent influx of penetration testing tools that claim to harness the power of AI to uncover vulnerabilities more efficiently.9 However, few users stop to question the security of the application programming interface (API) these tools rely on, or the exploits that may be waiting to happen.

These AI methodologies do not operate in isolation; they are often augmented by the vast amounts of data available on the Internet, enhancing their ability to profile and reidentify individuals. This new frontier of reidentification presents a formidable challenge: how to harness the benefits of AI and big data for societal progress while safeguarding individual privacy.10

The Implications of Privacy Vulnerability

The global cost of cybercrime, encompassing a broad spectrum of illegal activities including those that exploit privacy weaknesses,11 is expected to surge from US$3 trillion in 2015 to a staggering US$10.5 trillion annually by 2025.12 This highlights the growing importance and vulnerability of data, particularly genetic data, which is especially sensitive and could be a prime target for misuse.

The ethical concerns surrounding genetic data are twofold: the risk of individual harm through privacy breaches, which could be financially motivated given the soaring costs of cybercrime, and the broader societal risk of discrimination or stigmatization based on genetic traits.

These concerns highlight the urgency of developing robust methods to proactively protect the privacy of individuals' genetic data. It is not merely about safeguarding personal information, but also about preventing the potentially devastating economic impacts associated with data breaches.13

The challenge lies in protecting the privacy of individuals' genetic data without hindering the scientific research and progress that can benefit from such data.14 This task is becoming increasingly complex in the face of rising cybercrime costs, necessitating a nuanced approach that recognizes the unique nature of genetic data and the multifaceted considerations it demands.15

Amplifying Reidentification Risk

The ability to integrate various datasets poses a significant reidentification risk,16 particularly when disparate sources of anonymized data can be combined and cross-referenced. In our digital world, the availability of extensive datasets online provides a rich resource for those looking to reidentify individuals.

For instance, anonymized health records could be combined with voter registration lists, social media activity, or other publicly accessible data to identify individuals despite the initial anonymization effort.
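
A brief pandas sketch can illustrate how such a linkage attack works in practice; every record, name, and attribute value below is hypothetical.

```python
import pandas as pd

# Hypothetical "anonymized" health records: names removed, but
# quasi-identifiers (birth date, sex, ZIP code) left intact
health = pd.DataFrame({
    "birth_date": ["1984-03-02", "1991-11-17"],
    "sex": ["F", "M"],
    "zip": ["20007", "30301"],
    "diagnosis": ["asthma", "diabetes"],
})

# Hypothetical public voter roll sharing the same quasi-identifiers
voters = pd.DataFrame({
    "name": ["J. Smith", "A. Jones"],
    "birth_date": ["1984-03-02", "1991-11-17"],
    "sex": ["F", "M"],
    "zip": ["20007", "30301"],
})

# A simple join on the shared attributes reattaches names to diagnoses
reidentified = health.merge(voters, on=["birth_date", "sex", "zip"])
print(reidentified[["name", "diagnosis"]])
```

Generalizing or perturbing the shared attributes before release is precisely what breaks this kind of join, which is the motivation for the deidentification techniques discussed below.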

This risk is amplified by the increasing sophistication of AI tools that can analyze and cross-reference vast datasets far more efficiently than ever before. Powerful AI analytics combined with the wealth of available digital data creates a formidable challenge in protecting individual privacy.17

Similarly, it is crucial for organizations to meticulously evaluate their selected hosting solutions and the agreed terms for data storage.18 If the current partners lack the necessary infrastructure or expertise to work in tandem with the organization on optimal security measures, the ensuing vendor lock-in could immobilize the organization’s security initiatives for the duration of the contract.19

Deidentification Techniques: Pros and Cons

In response to these privacy concerns, various deidentification techniques have been employed to obscure personally identifiable information. Three well-established methods to mitigate reidentification risk are generalization, perturbation, and aggregation (each is illustrated in the brief sketch following this list):20

  • Generalization replaces specific data values with broader categories. For example, exact ages are replaced with age ranges, or specific locations are replaced with larger geographical areas. The advantages include preserving data utility by retaining significant patterns and trends, as well as being easy to implement and understand. However, the lost precision makes the data less granular, and categories that remain too narrow can still enable reidentification when combined with additional data.
  • Perturbation adds noise to the data to mask the original values by randomly altering them within a certain range. This technique enhances privacy by making it difficult to trace data back to individuals and maintains the overall statistical properties and distribution of the data. However, it introduces inaccuracies that can affect data quality and requires careful calibration to balance privacy and utility.
  • Aggregation combines data from multiple individuals into summary statistics, such as averages or totals. This method offers a high level of privacy protection by focusing on group-level data and is useful for identifying trends without exposing individual data points. However, there is a trade-off between robust protection and accuracy, as this method results in a significant loss of individual data details and may obscure important variations and outliers.
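
The following minimal sketch, using pandas and NumPy, applies all three techniques to a hypothetical record set; the column names, bin boundaries, and noise scale are illustrative assumptions rather than recommended settings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical records with one exact age and salary per person
df = pd.DataFrame({"age": [23, 37, 41, 58],
                   "salary": [48_000, 61_000, 75_000, 90_000]})

# Generalization: replace exact ages with broad age bands
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                        labels=["<30", "30-49", "50+"])

# Perturbation: add calibrated random noise to mask original salaries
df["salary_noisy"] = df["salary"] + rng.normal(0, 2_000, size=len(df))

# Aggregation: publish only group-level summaries, never individual rows
print(df.groupby("age_band", observed=True)["salary"].mean())
```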

These techniques are crucial in AI data training to ensure privacy while maintaining the usefulness of the data. Each method has its advantages and limitations, and the choice of technique depends on the specific requirements and context of the data usage.

Ethical Imperatives and Future Directions

The ongoing developments in data reidentification call for a reevaluation of ethical standards and privacy protections21 in the age of AI and big data.

Ensuring the privacy and security of individual data is not solely a technical challenge, but a fundamental ethical imperative that must guide the development and application of AI technologies going forward.

Future directions in this field must prioritize the development of more robust privacy-preserving technologies22 alongside transparent ethical guidelines to govern the use of AI in data analysis. Policymakers, technologists, and ethicists must collaborate to create frameworks that balance the benefits of AI and big data with the imperative to protect individual privacy.

To mitigate risk, several pertinent regulations and tools have been established. The General Data Protection Regulation (GDPR)23 in the European Union mandates stringent data protection measures, requiring explicit consent for data collection and granting individuals rights over their data. The California Consumer Privacy Act (CCPA)24 provides California residents the right to know what personal data is being collected, its purpose, and the ability to request deletion of their data. In the USA, the Health Insurance Portability and Accountability Act (HIPAA)25 protects sensitive patient health information from unauthorized disclosure.

Several tools have been developed to enhance privacy preservation. Differential privacy adds random noise to data to ensure individual data points cannot be distinguished, while homomorphic encryption allows computations on encrypted data without decrypting it, keeping data secure throughout the analysis process.
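
As a rough sketch of the differential privacy idea mentioned above, the following applies the classic Laplace mechanism to a simple count query; the epsilon value, sensitivity, and the query itself are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(true_count, epsilon=1.0, sensitivity=1.0):
    """Laplace mechanism: add noise scaled to sensitivity/epsilon so that
    any single individual's presence in the data is statistically deniable."""
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical query: how many patients in a cohort carry a given gene variant?
print(dp_count(true_count=128, epsilon=0.5))  # lower epsilon = more noise, more privacy
```

Running the query twice returns different values; the calibrated randomness, not the secrecy of the algorithm, is what provides the privacy guarantee.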

As big tech enterprises intensify their reliance on AI, the effects will be felt throughout IT teams: More organizations will look for alternatives, particularly open-source tools that are not tainted by excessive AI input.26

Conclusion

In addressing the reidentification risk associated with anonymized data, we confront a complex environment where ethical, privacy, and economic dimensions intersect.

The sensitivity of genetic data demands robust privacy safeguards, especially in an era where cybercrime's financial impact is escalating dramatically.

Balancing the utility of data analytics with the imperative to protect individual privacy is critical, requiring a nuanced approach that accounts for the unique vulnerabilities associated with genetic information.27 The future demands a collaborative effort from policymakers, enterprise leaders, technologists, and ethicists to establish and enforce privacy standards that coexist with the drive for innovation.28

As the capabilities of AI and data science grow,29 so does the responsibility to use these tools ethically, ensuring that advancements benefit society while respecting individual rights and privacy.

Endnotes

1 Lubarsky, B.; "Re-Identification of 'Anonymized' Data," Georgetown Law Technology Review, April 2017
2 Rocher, L.; Hendrickx, J.; et al.; “Estimating the Success of Re-identifications in Incomplete Datasets Using Generative Models,” Nature, 23 July 2019
3 Molteni, M.; “23andMe's Pharma Deals Have Been the Plan All Along,” Wired, 3 August 2018
4 Sarker, I.; “AI-Based Modeling: Techniques, Applications and Research Issues Towards Automation, Intelligent and Smart Systems,” SN Computer Science, vol. 3, 2022
5 Sharma, N.; “Categorizing and Handling Sensitive Data,” ISACA Now Blog, 24 January 2023
6 Rocher; "Estimating the Success of Re-identifications in Incomplete Datasets Using Generative Models"
7 Khan, M.; "Big Data Deidentification, Reidentification and Anonymization," ISACA® Journal, vol. 1, 2018
8 Shah, D.; “Triplet Loss: Intro, Implementation, Use Cases,” V7 Labs, 14 April 2023
9 Basu, S.; “17 Most Popular Penetration Testing Tools for Companies and Pentesters,” Astra, 21 February 2024
10 Walch, K.; “How Do Big Data and AI Work Together?,” Enterprise AI, 21 December 2023
11 Agarwal, A.; “The Top 5 Cybersecurity Threats and How to Defend Against Them,” ISACA®, 27 February 2024
12 Stratton, B.; "Exploring the Explosive Growth of the Cybersecurity Market," BlueTree, 2 December 2023
13 Carmichael, M.; “The Difference Between Data Privacy and Data Security,” ISACA, 28 February 2023
14 Kazi, S.; “Legally Permissible Does Not Mean Ethical,” ISACA, 30 October 2023
15 Brady, M.; “Trustworthy Tactics for Unlocking the Value of Genetic Data,” ISACA Journal, vol. 5, 2022 
16 Google Cloud, Sensitive Data Protection—Re-identification Risk Analysis
17 Rende, J.; “Track These 7 Trends for Proactive Cybersecurity in 2024,” ISACA, 26 December 2023
18 Kamprianis, M.; “Third-Party Risk Management: The Security Blind Spot No One Wants to Discuss,” ISACA, 28 November 2023
19 cast.ai, “What Is Cloud Vendor Lock-In (And How To Break Free)?,” 6 January 2023
20 Johns Hopkins Sheridan Libraries, Protecting Human Subject Identifiers—Steps for De-identifying Data
21 Koerner, K.; “Privacy and Responsible AI,” IAPP, 11 January 2022
22 Hindi, R.; “Privacy-Preserving Technologies are About to Have Their Hockey-Stick Moment,” Tech.eu, 27 January 2024
23 Intersoft Consulting, General Data Protection Regulation (GDPR)
24 State of California Department of Justice, California Consumer Privacy Act
25 U.S. Department of Health and Human Services, "Summary of the HIPAA Privacy Rule"
26 Platform.sh, "Platform.sh Delivers More With Azure"
27 Anant, V.; Donchak, L.; et al.; “The Consumer-Data Opportunity and the Privacy Imperative,” McKinsey and Company, 27 April 2020
28 Parsons, K.; "AI Policymaking Must Include Business Leaders," Financial Times, 30 October 2023
29 Stewart-Rattray, J.; “Taking Action to Create a Future Where AI Benefits Us All,” ISACA Now Blog, 2 February 2024  

Isla Sibanda

Is an ethical hacker and cybersecurity specialist based in Pretoria, South Africa. For over twelve years, Sibanda has worked as a cybersecurity analyst and penetration testing specialist for several reputable companies, including Standard Bank Group, CipherWave, and Axxess.
