Data audit: goals, legal frameworks, and database credibility

August 9, 2021

Martin Anderson

Independent AI expert

For all its benefits, the emergence of high-performance machine learning systems over the last ten years has fostered a 'plug-and-play' analytics culture, where high volumes of opaque data are thrown arbitrarily at an algorithm until it yields useful business intelligence. A great deal of collateral material is generated along the way, and, in a results-driven culture, not all of it is well-curated.

Thanks to the black box nature of a typical machine learning workflow, it can be difficult to understand or account for the extent of the 'dark' data that survives these processes; or the extent to which the unacknowledged provenance or unexplored scope of the data sources could legally expose a downstream application later on.

This raises a number of questions:

  • Did the data pass through jurisdictions which encumber the enterprise with legal obligations in regard to its storage?
  • Have later regulations affected the extent to which historical data can continue to be used?
  • Are the data's evolving schema and origins sufficiently well-understood to placate the concerns of potential partners, or to satisfy the 'due diligence' phase of a buy-out?
  • Is the data ready for a coming wave of regulatory frameworks around machine learning and algorithmic data analytics systems?
  • …or is its opacity a potentially fatal liability in the face of coming regulatory standards that did not exist when the data was first introduced?

Also, from a more self-serving standpoint, are there hidden opportunities in the data that are being ignored because the machine systems are targeting specific criteria and discarding the rest?

Here we'll look at possible answers to some of these questions, note some key points to consider when carrying out a data audit (either in-house or through data science consulting), and review some approaches to building a long-term data quality management pipeline.

Who initiates a data audit?

In most jurisdictions a data audit is not currently an 'official' and prescribed event, like an IRS audit or a heuristics-based antivirus scan. Rather, it's a process that may involve varying standards of transparency and disclosure, and may come about for a number of reasons, including:

  • A company's need to improve a process or utilize untapped data resources to develop new processes.
  • The wish to develop filters that will automatically discard useless data not needed for compliance purposes as part of the big data governance strategy.
  • The wish to reduce a company's storage burden by identifying non-actionable and legally irrelevant data (usually a one-time event, rather than a modification to an ongoing data-gathering pipeline, as in the previous point).
  • The need to ensure compliance standards in reasonable anticipation of possible third-party audits at a later date.
  • When a government agency (such as police authorities or a standards body) requires an audit, either as a systematic or ad-hoc inspection, or in response to complaints or other prompts.
  • When new regulations change the legal or actionable status of data that may already reside, unexamined, in a company's storage systems. If a company holds a great deal of dark, unstructured data that it has not yet processed, indexed or understood, it may be compelled to conduct an audit.

Goals of a data audit

Though the objectives for a data audit may vary depending on whether the audit is being conducted for compliance (external demands) or performance (internal, commercial review of processes), either type of audit is a worthwhile opportunity to tune your data-gathering and governance procedures and policies, and to take both sets of needs into consideration.

Therefore, some of the 'grouped' objectives of a data audit may include:

  • The identification of non-indexed material, with a view to developing a forward plan for it (such as deletion, governance requirement evaluation, or general indexing).
  • Identification of material where provenance is uncertain (i.e. metadata is insufficient or has been stripped by unsuitable preprocessing routines), potentially exposing the company to later legal liabilities.
  • Identifying and removing malicious data (a scenario likely to increase as adversarial machine learning model attacks become more common over the next 5-10 years) and securing the channels and protocols that allowed it in.
  • Comparing retained data with the terms of privacy policies that may have been updated since the data was included in the system, and determining appropriate action.
  • Identifying material from public datasets that has been used against the terms of a license that was applicable at the time of collation, irrespective of whether state guidelines permitted 'fair use' at that time.
  • The establishment of workflows for automatically handling data anomalies in future audits (such as a schema where non-compliant or inadequately tagged data triggers a manual alert, or at least becomes explicitly logged as an issue and quarantined from use until review).
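As a sketch of the last point, a quarantine workflow might look like the following. The required fields and record format are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

# Hypothetical provenance schema: the field names are illustrative only.
REQUIRED_FIELDS = {"record_id", "source", "license", "collected_at"}

@dataclass
class AuditResult:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)
    issues: list = field(default_factory=list)

def audit_records(records):
    """Quarantine records that lack required provenance metadata."""
    result = AuditResult()
    for rec in records:
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            # Log the anomaly explicitly and keep the record out of use
            # until a human reviews it.
            result.issues.append(
                f"record {rec.get('record_id', '<unknown>')}: missing {sorted(missing)}"
            )
            result.quarantined.append(rec)
        else:
            result.accepted.append(rec)
    return result
```

A record with full metadata passes through; one with stripped metadata is logged as an issue and quarantined rather than silently used.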

Regulatory data audits

Even in these early years of the formation of national data governance strategies and legislation, proprietary company data is subject to inspection by standards bodies depending on the jurisdiction, especially in the European Union (see below).

Data does not need to be exposed via a 'public' face (such as a mobile app or web service) or an API in order to be auditable, since inspections operate on the assumption that a company's data will affect its relationship with its consumers and their rights to privacy, among many other considerations.

Two growing trends are set to contribute to an increased demand for data inspection over the next ten years: an increase in regulation and the emerging ability of new techniques to identify data that was used to train a machine learning system whether through a supervised or unsupervised approach.

Since machine learning and data science regulation is in a nascent phase, with the emergence of coherent and globally-enforced standards a distant prospect, it's not possible here to cover all the current state regulatory processes to which your data may be liable, and for which it may one day be audited — either under existing regulations or new ones that are set to emerge over the next decade.

Nonetheless, we can consider some of the existing statutes and review others that are coming into focus in this period.

GDPR as a template for coming governance frameworks

The EU General Data Protection Regulation (GDPR) levies fines of up to 4% of global annual turnover on companies that infringe its data protection rules, while the Draft AI Regulations proposed by the European Commission in April 2021 promise up to a 6% bite of global revenue for companies that contravene the laws eventually derived from them.

Eventually, companies whose data impinges on European borders (even indirectly, such as through geo-oblivious cloud-hosted services) will be subject to both of these frameworks, each of which specifically deals with data provenance and governance (but not in a way that is necessarily consistent).

The GDPR is being considered around the world as a template for data privacy frameworks. In the US, the GDPR-style California Consumer Privacy Act of 2018 (CCPA) led the way for formal data oversight frameworks in the States three years ago, with frequent calls since for the US to match Europe's lead. Therefore, adhering to GDPR guidelines now may be the best preparation for future regulatory inspections, since even the EU's draft AI regulations cover a lot of the same territory.

The European Data Protection Supervisor (EDPS) provides various insights into the rationale and requirements for an on-the-spot or scheduled data audit, including a helpful overview, an inspection policy framework, and a set of general guidelines to follow.

The GDPR guidelines for a data audit are divided into four sections: lawful basis and transparency; data security; privacy rights; and accountability/governance.

Here we'll look at that part of the European Union's advice on the GDPR as relates directly to data auditing.

Accountability: The GDPR requires companies of more than 250 employees (or companies of any size, if they handle sensitive data) to maintain an updated list of processing activities for inspection. A data protection impact assessment (for which an official template is available) is the best way to gauge a company's obligations in this respect.

Justification: Additionally, a company must have legal justification for the data it records, retains or processes, as outlined in Article 6 of the GDPR.

Disclosure: Article 12 requires robust transparency mechanisms to inform people that their data is being collected, and to define who may access the data and how the data is being secured.

Security: Stored data must adhere to the data protection principles set out in Article 5 of the GDPR.

Encryption: The GDPR requires appropriate technical measures such as encryption or pseudonymization for stored personal data, and an organization may need to provide evidence that these have been implemented.
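A minimal pseudonymization sketch, assuming a secret key held separately from the data (the key value and function name here are illustrative):

```python
import hashlib
import hmac

# The secret key would live in a key management service, apart from the
# data, so that pseudonyms cannot be reversed without it.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed, repeatable pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the pseudonym is deterministic, records can still be joined across tables; without the key, the original identifier cannot be recovered from the pseudonym alone.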

Internal access: Besides the data itself, operational security will be included in any data audit, to establish the existence of a strong internal security policy.

Breach disclosure: Undisclosed breaches revealed by a data audit will invoke some of the strongest penalties, depending on the extent to which it can be established that the company was aware of the breach.

Data Officer: The GDPR requires that someone in the organization be accountable for compliance, and authorized to make necessary changes in policy. In certain circumstances, a Data Protection Officer must be employed and dedicated to these matters.

Sharing agreements: Where company data is disclosed to third parties, agreements must be in place, and the EU provides a draft template for this purpose.

User control of existing data: Where company data holds material on individuals, a data audit will need to demonstrate that the end-user can access and correct data held on them. Processes must also be in place to stop sharing or to delete the data, if requested. 
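Such processes might be sketched, in highly simplified form, as a handler for data-subject requests. The in-memory storage and function names are assumptions for illustration; a real implementation would verify the requester's identity and use audited, persistent storage:

```python
from datetime import datetime, timezone

# Illustrative in-memory store standing in for a real database.
USER_DATA = {}
REQUEST_LOG = []

def handle_subject_request(user_id, action, correction=None):
    """Handle access, rectification and erasure requests on stored data."""
    REQUEST_LOG.append((datetime.now(timezone.utc).isoformat(), user_id, action))
    if action == "access":
        # Return a copy of everything held on the individual.
        return dict(USER_DATA.get(user_id, {}))
    if action == "rectify":
        USER_DATA.setdefault(user_id, {}).update(correction or {})
        return USER_DATA[user_id]
    if action == "erase":
        # Remove the subject's data entirely; the request itself stays logged.
        return USER_DATA.pop(user_id, None)
    raise ValueError(f"unknown action: {action}")
```

Note that the request log survives erasure, so the audit trail can demonstrate that the deletion actually took place.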

Oversight of automated processes: This section of the GDPR crosses most into the coming European AI regulations, mandating that any automated decision processes that have a legal or 'similarly significant' effect on an individual's rights must have human-led processes in place in the event of a challenge from end users.

GDPR data auditing requirements

  • Accountability: keep an updated list of processing activities ready for inspection.
  • Justification: have a legal justification for processing customer data.
  • Disclosure: be transparent about collecting, using and protecting data.
  • Security: apply GDPR-compliant data protection mechanisms.
  • Encryption: encrypt and/or pseudonymize data, and retain evidence thereof.
  • Internal access: have a strong internal security policy in place.
  • Breach disclosure: report breach events promptly to avoid the heaviest penalties.
  • Data Officer: appoint a professional responsible for data processing compliance.
  • Sharing agreements: have formal agreements in place to support data disclosure to third parties.
  • User control of existing data: be ready to prove that users can access, manage and delete the data you store on them.
  • Oversight of automated processes: provide human oversight of automated decision-making.

Data regulation in the UK and US

In the UK, the GDPR was copy-pasted into national law at the time of Brexit, with no obligation to retain the European standards in the future. Nonetheless, the Joint Information Systems Committee (JISC) offers the Data Audit Framework Development (DAFD) guideline document as a policy guide and preparatory checklist for companies researching data audit liabilities.

It's uncertain when specific machine learning-related regulation will come to the US. Currently a company's data liability is still largely subject there to older statutes, such as the data protection component of the Health Insurance Portability and Accountability Act (HIPAA); the Gramm-Leach-Bliley Act (GLBA, for financial services); the US Privacy Act of 1974; the Children’s Online Privacy Protection Act (COPPA, which has at least specifically addressed issues around data retention in recent years); and, in the most general terms possible, section 5 of the 1914 Federal Trade Commission Act.

Revealing source data

By nature, machine learning algorithms absorb and obscure their data sources (datasets), extracting desired features from a dataset and generalizing those features in the 'latent space' of the training process. The resulting models are therefore 'representative' and abstract, and are generally considered incapable of explicitly exposing their contributing source data.

However, reliance on this 'automatic obscurity' is increasingly coming under challenge from recent methods to expose source data from algorithmic output — a technique known as 'model inversion'. We'll look at that in a moment, but first let's examine possible relevant scenarios.

FOSS datasets

If your analytics system has used a free or open-source (FOSS) dataset, its license may oblige you to disclose this fact and you'll need to assess the long-term viability of the license and the data. Perhaps the biggest risk in this respect is the use of a FOSS dataset whose provenance and IP-integrity is later challenged by third parties that lay claim to the data.

In the event of a new and restrictive change in license for an open-source dataset, it's possible, depending on the terms of the license, for any software (including machine learning algorithms) unwittingly developed with IP-locked data to become subject to restrictions as well.

Many cloud providers currently offer FOSS datasets as starter templates; in development, the derived dataset may ultimately overwrite the original material completely, yet still be targeted by patent trolls laying claim to the legacy business value of the original data (see 'Check your weight' below).

Synthetic datasets

One increasingly popular approach to data generation is the use of synthetic data, such as artificially-produced text or CGI-generated imagery. It's worth being aware too of the provenance of the information in a synthetic dataset that you did not create yourself.

Are all its contributing data sources publicly disclosed and available for inspection? Can you follow its entire chain of creation back to first source, including all ancillary material used in its creation, and be satisfied of the validity and perpetuity of the license terms?

Proprietary datasets

The safest possible way to develop unassailable source data is to generate it yourself. For instance, in response to the growing concern around IP on popular natural language processing datasets, a number of companies developing GPT-3-style language models are beginning to hire freelance writers to populate their natural language generation datasets rather than risk exposure through web-scraping.

However, since this is also the most expensive and time-consuming solution, there's a temptation to cut corners by relying on current lax regulations around data-scraping and exploiting online material that a domain might prohibit for such use — simply because so many governments currently offer a 'free pass' in this respect.

If an official dataset license becomes more restrictive later, prior uses of the material remain covered by the earlier and more permissive license; but if governments later remove or reduce protections around 'fair use' as the machine learning sector commercializes, no such 'hereditary protections' are likely to apply. At best, rules for disputes arising from such cases will eventually be defined in the legal arena, with no guarantees that new precedents will favor companies over IP holders.

Therefore, it's only prudent to anticipate this when designing long-term data extraction, custody and governance policies.

Decoding source data

Model inversion techniques are proving capable of disclosing confidential information that was assumed to be protected by the way that machine learning models 'abstract' source data. The term covers a variety of techniques that make it possible to 'poll' an AI system and piece together a picture of the contributing data from its responses to different queries.
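The 'polling' idea can be illustrated with a toy membership-inference sketch (a close relative of model inversion). The model here is a stand-in that leaks confidence, not a real system; actual attacks query a deployed model's prediction API:

```python
# Toy stand-in for a trained model's private training set.
TRAINING_SET = {"alice@example.com", "bob@example.com"}

def model_confidence(record: str) -> float:
    # Overfitted models tend to return higher confidence on records they
    # were trained on -- the signal such attacks exploit.
    return 0.99 if record in TRAINING_SET else 0.55

def likely_member(record: str, threshold: float = 0.9) -> bool:
    """Infer training-set membership from the model's confidence on a candidate."""
    return model_confidence(record) >= threshold
```

By polling many candidate records this way, an attacker gradually reconstructs a picture of what the model was trained on, even though the data itself was never published.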

In this period, the model inversion sector is fueled by a growing crusade around privacy and AI security; but the history of patent trolling over the last 30 years suggests that researchers' 'free ride' on public data will come to the attention of copyright enforcers over the next ten years as national AI policies mature, and that growing data transparency requirements will coincide with the capabilities of model inversion to expose data sources.

Check your weight

The 'weights' of a model often represent the intrinsic value of a machine learning framework, as is the case, for instance, with GPT-3. If those weights were generated from material that later becomes IP-locked, and their use of copyrighted data can be exposed by model inversion, it will not matter that the current dataset is impeccable from a governance standpoint.