Three More Vs of Big Data

Author: Leighton Johnson, CISSP, CISM, CTO at ISFMT, Inc.
Date Published: 11 December 2019

In my 12 June 2019 @ISACA column, I identified the 4 Vs of big data as areas of focus and understanding to properly review, assess and audit big data systems and organizational efforts. There are at least 3 other Vs to consider when reviewing big data systems. They are:

  1. Variability—Variability refers to changes in a data set (e.g., in data flow rate, format/structure, semantics and/or quality) that impact the analytics application. This characteristic is different from variety, which was identified in my last column, and it applies to the wide range of changes present in the various data set inputs. Variations in the format of unstructured data pose challenges to data scientists and custodians as the data are received and processed. Poor quality in individual data elements can cause the input data to be rejected, which, in turn, raises issues with the veracity of the data.
  2. Volatility—Volatility refers to data management over time. How data management changes over time directly affects provenance. Big data is transformational in part because systems may produce indefinitely persisting data, meaning data that outlive the instruments on which they were collected; the architects who designed the software that acquired, processed, aggregated and stored them; and the sponsors who originally identified the project’s data consumers. Some of the criteria surrounding these data changes include:
    • Roles
    • Security
    • Privacy
    • Governance
    Roles are time-dependent in nature. Security and privacy requirements can shift accordingly. Governance can shift as responsible organizations merge or even disappear.
    While research has been conducted into how to manage temporal data (e.g., in e-science for satellite instrument data), there are few standards beyond simplistic time stamps and even fewer common practices available as guidance. To manage security and privacy for long-lived big data, data temporality should be taken into consideration.
  3. Validity—Validity refers to the accuracy and correctness of data. Traditionally, this is referred to as data quality; in the security world, it is called integrity. In the big data security scenario, validity refers to a host of assumptions about the data against which analytics are applied. For example, continuous and discrete measurements have different properties: the field “gender” can be coded as 1=Male, 2=Female, but 1.5 does not mean halfway between male and female (see the sketch after this list). In the absence of such constraints, an analytical tool can draw inappropriate conclusions. There are many types of validity whose constraints are far more complex. By definition, big data allows for aggregation and collection across disparate data sets in ways not envisioned by system designers. This affects the validity of the data in several areas, including:
    • Data quality
    • Aggregation issues
    • Disparate data sets that cause interpretation issues, corrupted data and/or translation issues
    These examples show that what passes for valid big data can be innocently lost in translation or interpretation, or intentionally corrupted with malicious intent.
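
To illustrate the coding constraint described under validity, the following is a minimal sketch of a pre-analytics check that treats discrete coded fields and continuous measurements differently. The field names (“gender,” “age”), the allowed codes and the sample records are assumptions made for illustration only; they are not drawn from any particular system discussed in this column.

```python
# Minimal sketch of a pre-analytics validity check.
# Illustrative assumptions: records are dicts with a discrete "gender"
# field coded 1=Male, 2=Female and a continuous "age" field in years.

ALLOWED_GENDER_CODES = {1: "Male", 2: "Female"}


def validate_record(record: dict) -> list[str]:
    """Return a list of validity problems found in a single record."""
    problems = []

    gender = record.get("gender")
    # A discrete coded field must match an allowed code exactly; numeric
    # closeness is meaningless, so 1.5 is rejected rather than interpreted
    # as "halfway between" the codes 1 and 2.
    if gender not in ALLOWED_GENDER_CODES:
        problems.append(f"gender={gender!r} is not an allowed code {sorted(ALLOWED_GENDER_CODES)}")

    age = record.get("age")
    # A continuous measurement, by contrast, is checked against a plausible range.
    if not isinstance(age, (int, float)) or not 0 <= age <= 130:
        problems.append(f"age={age!r} is outside the plausible range 0-130")

    return problems


if __name__ == "__main__":
    sample_records = [
        {"gender": 1, "age": 34},    # valid
        {"gender": 1.5, "age": 29},  # invalid discrete code
        {"gender": 2, "age": -4},    # invalid continuous value
    ]
    for rec in sample_records:
        issues = validate_record(rec)
        print(rec, "->", "OK" if not issues else "; ".join(issues))
```

The point of the sketch is that a record with gender=1.5 is flagged as an invalid code rather than interpreted numerically; a reviewer or auditor can ask whether the system under review enforces comparable constraints before analytics draw conclusions from the data.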

The wide range of data types, constructs, formats and sizes used in today’s big data systems requires reviewers and auditors to adjust their approaches to validating and verifying the accuracy, completeness and security of these systems.

Leighton Johnson, CISA, CISM, CIFI, CISSP, is a senior security consultant for the Information Security and Forensics Management Team of Bath, South Carolina, USA.