Data Privacy and Big Data—Compliance Issues and Considerations

Author: William Emmanuel Yu, Ph.D., CRISC, CISM, CISSP, CSSLP
Date Published: 1 May 2014

The big data1 craze has taken the industry by storm. With the advent of cost-effective technologies and solutions for longer-term storage of vast amounts of transaction data, more and more companies are investing in keeping more and more data for longer and longer periods. It is important to look into the intersection of data privacy and this emerging big data trend, reviewing existing data privacy regulations and considering big data in the context of an accepted data privacy framework.

In the past, due to limited space availability in expensive data warehouses, considerable effort was put into choosing and organizing data to ensure that only valuable data were kept for extended periods of time. Now, this view has changed. With many new technologies and tools, companies are beginning to store everything in horizontally scaled, commercial (commodity) off-the-shelf hardware. The value of the big data ecosystem is to collect and make sense of this large volume of raw data and convert it into useful information.

At the other end of the spectrum, regulators and society as a whole are increasingly concerned about how data are being handled by business. The area of data privacy is becoming a greater concern in the post-Snowden2 era. These giant pools of data represent tempting targets for surveillance by various security agencies, not to mention repurposing by commercial entities. As a result, there is an ongoing upwelling of outrage and calls for improved data privacy protection.

The biggest source of data is end-user personal and transaction data. A mobile network operator (MNO) could potentially keep all cellular location update (LU) information3 as opposed to just keeping information on calls or text transactions.4 An Internet service provider (ISP) could decide to keep a log of all sites visited by users for a much longer period of time (e.g., years). Most ISPs just keep a small fraction (e.g., a few hours) of this information for troubleshooting or caching purposes.5 Much can be inferred from transaction data that end users would like to keep private rather than have made available indiscriminately.

Most users have been unaware of the volume of personal data retained by entities for various purposes. This is beginning to change as awareness of the data privacy debate is increasing. The two trends—increasing popularity of big data and increasing awareness of data privacy—are beginning to come to a head, and companies that intend to capitalize on this era of big data need to be conscious of and address these basic ethical concerns.

There are two fundamental areas where one can look for guidance when it comes to enforcing data protection: existing regulation (written rules) and privacy frameworks (implied rules). Existing regulations form the base of compliance requirements. Many of these rules (e.g., consent requirements, declaration of purpose) still hold in the big data world. However, there are also implied rules. These are implicit expectations between data handlers and data owners (e.g., regarding purpose). For example, if one downloads a single-person arcade game from a mobile application store, one has a reasonable expectation that his/her address book or email information will not be retrieved and stored by the data handler.

Fortunately, there are privacy frameworks to help put some structure around these implied rules.

Written Rules

Even before the era of big data, there had been substantial work done on the issue of data protection and privacy. These are the written rules with which data-handling organizations must comply. Since increasing amounts of personal data started being stored during the advent of computers in the 1970s and 1980s, there has been growing awareness of the need to protect the individual’s right to privacy.

As electronic commerce becomes more pervasive, concerns have grown about the compatibility of various data privacy and protection regulations in the context of cross-border trade among entities under differing data privacy and protection regimes. Thus, there have been moves to make the various regulatory frameworks more compatible and consistent. For example, the European Union (EU) Data Protection Directive (Directive 95/46/EC)6 was released in October 1995 to provide a basic framework for proper handling of personal information, and now work has begun on a draft of the General Data Protection Regulation to supersede this directive. This would allow all EU states to subscribe to a common set of principles and coordinate on enforcement.

Various countries such as Malaysia,7 Singapore8 and the Philippines9 have, in turn, explicit legislation on data protection and privacy. A number of these countries have passed regulation in response to the EU Data Protection Directive and have aligned their legislation with the Asia Pacific Economic Cooperation (APEC) Privacy Framework10 or the Organisation for Economic Co-operation and Development (OECD) Privacy Principles.11 The EU Data Protection Directive contains adequacy requirements that prevent the transfer of personal data to entities outside the EU that do not comply with EU standards for privacy protection. The APEC and OECD frameworks were created to ensure that states would create compatible regulation to ensure smooth interstate commerce and other forms of interaction.

Other countries, such as the US, have taken a sectoral approach to data protection legislation. The US has specific data privacy regulations for particular sectors, such as the Health Insurance Portability and Accountability Act (HIPAA)12 for the health care sector and the US-EU Safe Harbor Framework13 for the export sector that needs to exchange data with EU entities.

The aforementioned are examples of the written rules with which various entities must comply. These explicit rules and regulations continue to apply in the era of big data and have gained greater importance as more and more data are collected, retained and exchanged.

Implied Rules

The presence of ever-increasing amounts of data that are being retained for longer periods of time has caused more concern among data-privacy advocates. Regulations may not have been updated or kept pace in the era of big data. Thus, it would be good at this point to take a step back and survey the implied rules for data protection and privacy. A good place to begin is the interaction of the APEC and OECD principles (figure 1).

Both APEC and OECD have similar data protection and privacy principles. There are some areas where aspects are moved from one principle to another, but in general they are compatible. Together these principles serve as a good framework that embodies the implied expectations of privacy that any individual may reasonably possess.

The following is a review of these principles in the context of big data:

  • Collection limitation—This is the first principle in both the OECD and APEC frameworks, and it is also the principle that big data can potentially violate the most. Basically, it requires that only the minimum amount of data required for a specific purpose be collected and then retained only for the minimum amount of time required. One of the key selling points of big data and the advent of cheap storage is to collect everything and throw away nothing, with the further manipulation and analysis of data occurring later. It is important that organizations moving toward big data harvesting of information review and update the purposes of their applications to ensure that they remain within the spirit of this principle. An additional approach that is taken by some is to anonymize data. This process is sometimes called de-identification, where identifying ties to an individual are removed prior to the storage of large volumes of transaction data. However, care must be taken here. The simple removal of primary customer indexes might not suffice, as customer-specific information might be extrapolated from seemingly anonymous transaction data. This form of reidentification is a growing risk. Thus, some organizations additionally aggregate the data to further obscure traces of individual behavior. This anonymize-and-aggregate process requires preprocessing and results in a coarser resolution of data, which may be less useful but more protective of privacy. Anonymization is applied in the context of retrieval and long-term storage of data for which users have already provided their explicit consent. As a general rule, organizations wishing to comply with these principles should aim to collect only the data necessary and properly destroy unnecessary data as soon as possible.
  • Purpose specification—This principle requires that the purpose for the collection of data be clearly and exclusively stated. As more data are being retained with big data, the stated purposes for collecting and retaining data must be periodically and carefully reviewed to ensure continued compliance. Original purpose specifications might be too limiting and might not cover the newer use cases offered by big data. It is tempting for organizations to collect data now and find alternative uses for it much later. There have been a number of high-profile cases14 involving applications collecting address book information and using this information for nondisclosed purposes. This is a typical scenario, as address book information is still a manageable volume that does not require big-data-level scale. However, there are now cases of applications collecting usage and location15 information without proper disclosure of purpose. Historically, information such as this would likely be discarded due to its volume. With big data tools available today, this information can be kept longer. Organizations should clearly state and abide by their data collection purpose to avoid potential regulatory pitfalls.
  • Use limitation—This principle generally covers disclosure rules, particularly where data must not be shared with other parties or otherwise repurposed without consent. An important action with respect to this principle is onward transfer, which means care must be taken when sharing data with third parties. The big data era has also popularized the concept of selling or monetizing data. In particular, transaction data might be anonymized but, taken together with data from other sources, may still be used to identify individual customers. It is crucial to consider that there are many readily accessible tools, algorithms, application programming interfaces (APIs) and data sets that can be used for reidentification (e.g., combining Twitter postings and Netflix usage to determine customers based on what they are watching).
  • Data quality—In the traditional data warehousing analytics space, it was required that data be structured upfront and preprocessed into appropriate data models. This provided some initial effort to validate the integrity of the data. In the new big data era, some approaches involve just storing the data as collected without preprocessing. Thus, errors may potentially remain within the stored data set that will be discovered only when the data are used. In some cases, applications are not adjusted to consider the potential “dirtiness” of the data because they were originally written for traditional data warehouses. These applications and services must be reviewed in the context of moving toward modelless data storage and larger amounts of dirtier data.
  • Security safeguards—This principle requires that organizations that handle personal data provide the necessary safeguards and mechanisms to ensure that personal information does not fall into the wrong hands. As organizations put more data into low-cost commodity storage (e.g., cloud) solutions, it is crucial to review the data access controls on these external systems. A good number of these solutions do not provide the same levels of access control as more mature data-warehousing products. In some solutions, controls are enforced only at the interface level, but not at the lower levels (e.g., Hadoop clusters generally have no fine-grained Hadoop distributed file system [HDFS] access controls or security for metadata). It is important that organizations implement their own controls to plug these potential compliance gaps.
  • Openness—This principle requires information, developments and updates to be communicated to stakeholders in the most expedient manner. The implementation of this principle should be as transparent and timely as it is today in more mature, enterprise-class data warehouses. Organizations are encouraged to properly and promptly inform users of policy changes and developments. They are also encouraged to remind users of the consent they have already provided for the existing data sets.
  • Individual participation—This principle emphasizes the role of the individual in the management of his/her data. The customer has the right to request personal data collected through reasonable procedures and receive a timely response. The customer also has the right to erase, rectify, complete and otherwise amend personal data. In the big data era, a good amount of data is not preprocessed in the same fashion as in traditional data warehouses. This creates a number of potential compliance problems such as difficulty erasing, retrieving or correcting data. A typical big data system is not built for interactivity, but for batch processing. This also makes the application of changes on a (presumably) static data set difficult. Organizations may find this particular requirement challenging to implement because of the potentially complex consent mechanisms required for the various pieces of collected information and their use. However, if they do find this challenging, they might want to reconsider handling the data in the first place, because compliance overall is likely to be harder still.
  • Accountability—This principle requires that organizations that collect and store personal data be held accountable for enforcement of the other principles in this policy. This includes actions such as breach notification. The implementation of this principle should match what is implemented today in more mature, enterprise-class data warehouses.
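
The anonymize-and-aggregate approach described under collection limitation can be sketched roughly as follows. This is a minimal illustration, not a complete de-identification method: the record fields (customer_id, cell_id, hour), the function name and the minimum group size are all hypothetical, and suppressing small groups is one common safeguard assumed here rather than something the frameworks prescribe.

```python
from collections import Counter

# Hypothetical raw transaction records; the field names are illustrative only.
RAW_RECORDS = [
    {"customer_id": "C001", "cell_id": "A", "hour": 9},
    {"customer_id": "C002", "cell_id": "A", "hour": 9},
    {"customer_id": "C003", "cell_id": "A", "hour": 9},
    {"customer_id": "C001", "cell_id": "A", "hour": 9},
    {"customer_id": "C004", "cell_id": "B", "hour": 22},
]


def anonymize_and_aggregate(records, min_group_size=3):
    """Count events per (cell_id, hour) bucket without keeping customer IDs.

    Buckets that cover fewer than min_group_size distinct customers are
    suppressed, because very small groups are easier to reidentify.
    The threshold is an assumption chosen for this example.
    """
    customers_per_bucket = {}
    events_per_bucket = Counter()
    for rec in records:
        bucket = (rec["cell_id"], rec["hour"])
        customers_per_bucket.setdefault(bucket, set()).add(rec["customer_id"])
        events_per_bucket[bucket] += 1
    # Only the coarse counts leave this function; the identifiers do not.
    return {
        bucket: count
        for bucket, count in events_per_bucket.items()
        if len(customers_per_bucket[bucket]) >= min_group_size
    }


AGGREGATED = anonymize_and_aggregate(RAW_RECORDS)
print(AGGREGATED)
```

In this sketch the single late-night record in cell B is dropped entirely, which shows the trade-off noted above: the stored data are coarser and less useful, but a lone individual's behavior is no longer visible in them.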

Additional principles that are gaining acceptance and being introduced into regulation include:

  • A priori consent and explicit opt-in—This requires that organizations ask for up-front consent and requires explicit opt-in by the individual. Organizations are encouraged to have configuration interfaces that allow their users to manage their privacy consent settings. Big data implementations normally collect data from mediation platforms or raw and unprocessed logging services, which makes it difficult to remove customers who have not opted in. This may entail a substantial amount of preprocessing.
  • Data sovereignty—Some states have created regulation that affirms that data considered personal should not leave the territory of that state. This creates problems when implementing applications that are essentially global, but whose users may be citizens of such a state.
  • Extrapersonal protection—In some jurisdictions, there may be additional, distinct classes of personal information that require additional protections or controls. This class of information is normally called sensitive personal information (e.g., medical records, political views, race, religion).
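
The preprocessing implied by explicit opt-in and by sensitive personal information might look something like the sketch below. It assumes a hypothetical consent registry keyed by user and purpose, and an illustrative list of sensitive field names; neither comes from any particular regulation or product.

```python
# Hypothetical consent registry: user ID -> purposes the user has opted in to.
CONSENT = {
    "u1": {"billing", "analytics"},
    "u2": {"billing"},
}

# Illustrative names for fields treated as sensitive personal information.
SENSITIVE_FIELDS = {"religion", "medical_notes"}

# Hypothetical raw log records as they might arrive from a logging service.
RAW_LOGS = [
    {"user_id": "u1", "url": "a.example", "religion": "x"},
    {"user_id": "u2", "url": "b.example"},
    {"user_id": "u3", "url": "c.example"},
]


def filter_for_purpose(records, consent, purpose):
    """Keep only records whose user opted in to the given purpose.

    Users missing from the registry are treated as not opted in, and
    sensitive fields are stripped from every record that is kept.
    """
    kept = []
    for rec in records:
        if purpose not in consent.get(rec["user_id"], set()):
            continue
        kept.append({k: v for k, v in rec.items() if k not in SENSITIVE_FIELDS})
    return kept


ANALYTICS_LOGS = filter_for_purpose(RAW_LOGS, CONSENT, "analytics")
print(ANALYTICS_LOGS)
```

Treating an unknown user as not opted in is a deliberate choice in this sketch, since explicit opt-in means the absence of a consent record cannot be read as consent.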

Conclusion

Big data provides numerous opportunities for organizations to maximize the potential of the data they already have. A new era of store-everything-and-determine-its-use-(and perhaps monetize it)-later has begun. As organizations mobilize to take advantage of these developments, care must be taken to ensure compliance with both the current written rules (statutory and regulatory requirements) and the implicit rules when handling data covered under various data privacy and protection frameworks.

Data privacy and protection rules and regulations still need to be updated for the era of big data. As a matter of fact, many existing regulations have not even been reviewed in the context of data warehousing (e.g., privacy laws covering only wiretapping). This creates more impetus for review in the context of big data. Extra care must be taken by organizations, and constant vigilance is required in this ever-changing regulatory landscape. However, it would be good to be grounded in key principles aligned with those advocated by OECD and APEC. With these core principles updated in the context of big data, organizations can reap the benefits of big data without potential compliance pitfalls, now or in the future.

Endnotes

1 Meer, David; “What Is ‘Big Data’ Anyway?,” Forbes, 5 November 2013, www.forbes.com/sites/boozandcompany/2013/11/05/what-is-big-data-anyway/
2 Gallegos, Raul; “Edward Snowden’s Sad and Lonely Future,” Bloomberg Publishing, 5 November 2013, www.bloomberg.com/news/2013-11-05/edward-snowden-s-sad-and-lonely-future.html
3 LU information is generated every time a mobile moves from cell to cell.
4 Every step made with a mobile phone is known to the MNO. This is the only way an MNO can route a call.
5 The Guardian, “Harvard Bomb Scare to Ditch Exam,” 18 December 2013, www.theguardian.com/education/2013/dec/18/harvard-bomb-threat-student-eldo-kim
6 The European Parliament and the Council of the European Union, EU Data Protection Directive, “Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data,” 1995
7 Parliament of Malaysia, Personal Data Protection Act 2010 (Act 709), “An Act to regulate the processing of personal data in commercial transactions and to provide for matters connected therewith and incidental thereto,” 2010
8 Republic of Singapore, Personal Data Protection Act 2012 (No. 26 of 2012), “An Act to govern the collection, use and disclosure of personal data by organisations, and to establish the Personal Data Protection Commission and Do Not Call Register and to provide for their administration, and for matters connected therewith, and to make related and consequential amendments to various other Acts,” 2012
9 Congress of the Philippines, Data Privacy Act of 2012, RA 10173, “An act protecting individual personal information in information and communications systems in the government and the private sector, creating for the purpose a national privacy commission, and for other purposes,” 2012
10 APEC Secretariat, APEC#205-SO-01.2, Asia Pacific Economic Cooperation (APEC) Privacy Framework, 2005
11 The Organisation for Economic Co-operation and Development (OECD), OECD Privacy Principles, OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, 1980
12 US Congress, Health Insurance Portability and Accountability Act of 1996 (HIPAA), Pub.L. 104–191 110 Stat. 1936, 1996
13 US Department of Commerce, US-EU Safe Harbor Framework, Safe Harbor Principles and Related Annexes, 2000
14 Schnell, Joshua; “Path Fined by FTC for Illegally Collecting Information From Children,” MacGASM, 1 February 2013, www.macgasm.net/2013/02/01/path-fined-ftc-for-illegally-collecting-information-from-children/
15 Smith, Chris; “FTC Finds Popular Flashlight App for Android Illegally Sharing Data With Advertisers,” BGR, 6 December 2013, http://bgr.com/2013/12/06/flashlight-app-sharing-data-illegally-ftc/

William Emmanuel Yu, Ph.D., CISM, CRISC, CISSP, CSSLP, is technology senior vice president at Novare Technologies. Yu is working on next-generation telecommunications services, value-added systems integration, and consulting projects focusing on fixed mobile convergence and enterprise mobility applications with mobile network operators and technology providers. He is actively involved in Internet engineering, mobile platforms and information security research. Yu is also a faculty member at the Ateneo de Manila University, Philippines, and the Asian Institute of Management (Manila, Philippines).