August 9, 2021
Data audit: goals, legal frameworks, and database credibility
Independent AI expert
For all their benefits, high-performance machine learning systems have, over the last ten years, fostered a growing 'plug-and-play' analytics culture, in which high volumes of opaque data are thrown arbitrarily at an algorithm until it yields useful business intelligence. A great deal of collateral material is generated along the way, and, in a results-driven culture, not all of it is well curated.
Because of the black-box nature of a typical machine learning workflow, it can be difficult to understand or account for the extent of the 'dark' data that survives these processes, or the extent to which the unacknowledged provenance or unexplored scope of the data sources could legally expose a downstream application later on.
This raises a number of questions about compliance, provenance and legal exposure. Also, from a more self-serving standpoint, are there hidden opportunities in the data that are being ignored because the machine systems are targeting specific criteria and discarding the rest?
Here we'll look at possible answers to some of these questions (note some key points to consider in relation to carrying out a data audit either in-house or requesting data science consulting) and review some approaches to building a long-term data quality management pipeline.
In most jurisdictions a data audit is not currently an 'official' and prescribed event, like an IRS audit or a heuristics-based antivirus scan. Rather, it's a process that may involve varying standards of transparency and disclosure, and may come about for a number of reasons, from external regulatory demands to internal performance reviews.
Though the objectives for a data audit may vary depending on whether the audit is being conducted for compliance (external demands) or performance (internal, commercial review of processes), either type of audit is a worthwhile opportunity to tune your data-gathering and governance procedures and policies, and to take both sets of needs into consideration.
Therefore, the objectives of a data audit can usefully be grouped to serve both compliance and performance goals.
Even in these early years of the formation of national data governance strategies and legislation, proprietary company data is subject to inspection by standards bodies depending on the jurisdiction, especially in the European Union (see below).
Data does not need expression via a 'public' face (such as a mobile app or web service) or an API in order to be auditable, since inspections operate on the assumption that a company's data will affect its relationship with its consumers and their rights to privacy, among many other considerations.
Two growing trends are set to contribute to an increased demand for data inspection over the next ten years: an increase in regulation, and the emerging ability of new techniques to identify the data that was used to train a machine learning system, whether through a supervised or unsupervised approach.
Since machine learning and data science regulation is in a nascent phase, with the emergence of coherent and globally-enforced standards a distant prospect, it's not possible here to cover all the current state regulatory processes to which your data may be liable, and for which it may one day be audited — either under existing regulations or new ones that are set to emerge over the next decade.
Nonetheless, we can consider some of the existing statutes and review others that are coming into focus in this period.
The EU General Data Protection Regulation (GDPR) levies fines of up to 4% of global annual revenue on companies that infringe its rules on data protection, while the Draft AI Regulations proposed by the European Commission in April of 2021 promise a bite of up to 6% of global revenue for companies that contravene subsequent laws derived from it.
Eventually, companies whose data impinges on European borders (even indirectly, such as through geo-oblivious cloud-hosted services) will be subject to both of these frameworks, each of which specifically deals with data provenance and governance (but not in a way that is necessarily consistent).
The GDPR is being considered around the world as a template for data privacy frameworks. In the US, the GDPR-style California Consumer Privacy Act of 2018 (CCPA) led the way for formal data oversight frameworks in the States, with frequent calls since for the US to match Europe's lead. Therefore, adhering to GDPR guidelines now may be the best preparation for future regulatory inspections, since even the EU's draft AI regulations cover a lot of the same territory.
The European Data Protection Supervisor (EDPS) provides various insights into the rationale and requirements for an on-the-spot or scheduled data audit, including a helpful overview, an inspection policy framework, and a set of general guidelines to follow.
The GDPR guidelines for a data audit are divided into four sections: lawful basis and transparency; data security; privacy rights; and accountability/governance.
Here we'll look at that part of the European Union's advice on the GDPR as relates directly to data auditing.
Accountability: The GDPR requires companies of more than 250 employees (or companies of any size, if the company handles sensitive data) to maintain an updated list of processing activities for inspection. A data impact assessment (for which an official template is available) is the best way to gauge a company's obligations in this respect.
Justification: Additionally, a company must have legal justification for the data it records, retains or processes, as outlined in Article 6 of the GDPR.
Disclosure: Article 12 requires robust transparency mechanisms to inform people that their data is being collected, and to define who may access the data and how the data is being secured.
Security: Stored data must adhere to the data protection principles laid out in Article 5 of the GDPR.
Encryption: The GDPR requires the use of encryption or pseudonymization for stored data, and an organization may need to provide evidence that this has been implemented.
Internal access: Besides the data itself, operational security will be included in any data audit, to establish the existence of a strong internal security policy.
Breach disclosure: Undisclosed breaches revealed by a data audit will invoke some of the strongest penalties, depending on the extent to which it can be established that the company was aware of them.
Data Officer: The GDPR requires that someone in the organization be accountable for compliance, and authorized to make necessary changes in policy. In certain circumstances, a Data Protection Officer must be employed and dedicated to these matters.
Sharing agreements: Where company data is disclosed to third parties, agreements must be in place, and the EU provides a draft template for this purpose.
User control of existing data: Where company data holds material on individuals, a data audit will need to demonstrate that the end-user can access and correct data held on them. Processes must also be in place to stop sharing or to delete the data, if requested.
Oversight of automated processes: This section of the GDPR crosses most into the coming European AI regulations, mandating that any automated decision processes that have a legal or 'similarly significant' effect on an individual's rights must have human-led processes in place in the event of a challenge from end users.
| Requirement | Summary |
|---|---|
| Accountability | Have an updated list of processing activities ready for inspection |
| Justification | Have a legal justification for processing customer data |
| Disclosure | Be transparent about collecting, using and protecting data |
| Security | Apply GDPR-compliant data protection mechanisms |
| Encryption | Encrypt and/or pseudonymize data and have evidence thereof |
| Internal access | Have a strong internal security policy in place |
| Breach disclosure | Report breach events to avoid penalty |
| Data Officer | Employ a professional responsible for data processing compliance |
| Sharing agreements | Have formal agreements to support data disclosure to third parties |
| User control of existing data | Be ready to prove users are allowed to manage and delete data you store on them |
| Oversight of automated processes | Ensure automated data processing algorithms are human-led |
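As an illustration of the encryption/pseudonymization requirement summarized above, here is a minimal Python sketch of keyed pseudonymization using only the standard library. The key name and record fields are hypothetical; a production system would draw the key from a managed secret store and keep it segregated from the pseudonymized data.

```python
import hashlib
import hmac

# Secret key kept separately from the data store (e.g. in a vault).
# Without it, pseudonyms cannot be linked back to the original values,
# which is what distinguishes pseudonymization from plain hashing.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed, repeatable pseudonym."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Example: pseudonymize the direct identifier, keep the analytic payload.
record = {"email": "jane@example.com", "purchase_total": 42.50}
record["email"] = pseudonymize(record["email"])
```

Because the pseudonym is repeatable, analytics (joins, deduplication) still work on the transformed column, while the mapping back to the individual requires the separately held key.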
In the UK, the GDPR was transposed largely unchanged into national law at the time of Brexit, with no obligation to retain the European standards in the future. Nonetheless, the Joint Information Systems Committee (JISC) offers the Data Audit Framework Development (DAFD) guideline document as a policy guide and preparatory checklist for companies researching data audit liabilities.
It's uncertain when specific machine learning-related regulation will come to the US. Currently a company's data liability is still largely subject there to older statutes, such as the data protection component of the Health Insurance Portability and Accountability Act (HIPAA); the Gramm-Leach-Bliley Act (GLBA, for financial services); the US Privacy Act of 1974; the Children’s Online Privacy Protection Act (COPPA, which has at least specifically addressed issues around data retention in recent years); and, in the most general terms possible, section 5 of the 1914 Federal Trade Commission Act.
By nature, machine learning algorithms absorb and obscure their data sources, defining desired features to be extracted from a dataset and generalizing those features in the 'latent space' of the training process. The resulting models are representative and abstract, and are generally considered incapable of explicitly exposing their contributing source data.
However, reliance on this 'automatic obscurity' is increasingly coming under challenge from recent methods to expose source data from algorithmic output — a technique known as 'model inversion'. We'll look at that in a moment, but first let's examine possible relevant scenarios.
If your analytics system has used a free or open-source (FOSS) dataset, its license may oblige you to disclose this fact and you'll need to assess the long-term viability of the license and the data. Perhaps the biggest risk in this respect is the use of a FOSS dataset whose provenance and IP-integrity is later challenged by third parties that lay claim to the data.
In the event of a new and restrictive change in license for an open-source dataset, it's possible, depending on the terms of the license, for any software (including machine learning algorithms) unwittingly developed with IP-locked data to become subject to restrictions as well.
Many cloud providers currently offer FOSS datasets as starter templates. In development, these may ultimately overwrite the original material completely, yet still be targeted later by patent trolls laying claim to the legacy business value of the original data (see 'Check your weight' below).
One increasingly popular approach to data generation is the use of synthetic data, such as artificially-produced text or CGI-generated imagery. It's worth being aware too of the provenance of the information in a synthetic dataset that you did not create yourself.
Are all its contributing data sources publicly disclosed and available for inspection? Can you follow its entire chain of creation back to first source, including all ancillary material used in its creation, and be satisfied of the validity and perpetuity of the license terms?
The safest possible way to develop unassailable source data is to generate it yourself. For instance, in response to the growing concern around IP on popular natural language processing datasets, a number of companies developing GPT-3-style language models are beginning to hire freelance writers to populate their natural language generation datasets rather than risk exposure through web-scraping.
However, since this is also the most expensive and time-consuming solution, there's a temptation to cut corners by relying on current lax regulations around data-scraping and exploiting online material that a domain might prohibit for such use — simply because so many governments currently offer a 'free pass' in this respect.
If an official dataset license becomes more restrictive later, prior uses of the material remain covered by the earlier and more permissive license; but if governments later remove or reduce protections around 'fair use' as the machine learning sector commercializes, no such 'hereditary protections' are likely to apply. At best, rules for disputes arising from such cases will eventually be defined in the legal arena, with no guarantees that new precedents will favor companies over IP holders.
Therefore, it's only prudent to anticipate this when designing long-term data extraction, custody and governance policies.
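One practical way to build such anticipation into a governance policy is to record dataset provenance in a machine-readable form at acquisition time, so that license terms and source chains can be produced on demand during an audit. The sketch below is illustrative only; the schema and field names are assumptions, not any standard format.

```python
from dataclasses import dataclass, field, asdict
import json

# A minimal provenance manifest to keep alongside each dataset.
# Capturing the license terms *as they stood at acquisition* matters,
# because later license changes may be disputed in court.

@dataclass
class DatasetProvenance:
    name: str
    license: str                 # license identifier at time of acquisition
    license_date: str            # when these terms were captured
    sources: list = field(default_factory=list)  # upstream datasets or URLs
    synthetic: bool = False      # generated rather than collected
    notes: str = ""

manifest = DatasetProvenance(
    name="support-tickets-2021",
    license="CC-BY-4.0",
    license_date="2021-08-09",
    sources=["internal-crm-export"],
)

# Serialize for storage next to the dataset itself.
print(json.dumps(asdict(manifest), indent=2))
```

Keeping such manifests under version control alongside the data makes it possible to trace the entire chain of creation back to first source, as the audit questions above demand.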
Model inversion techniques are proving capable of disclosing confidential information that was intended to be protected by the way that machine learning models 'abstract' source data. It covers a variety of techniques that make it possible to 'poll' an AI system and piece together a picture of the contributing data from its various responses to different queries.
In this period, the model inversion sector is fueled by a growing crusade around privacy and AI security; but the history of patent trolling over the last 30 years suggests that researchers' 'free ride' on public data will come to the attention of copyright enforcers over the next ten years as national AI policies mature, and that growing data transparency requirements will coincide with the capabilities of model inversion to expose data sources.
The 'weights' of a model often represent the intrinsic value of a machine learning framework, as is the case, for instance, with GPT-3. If the weights were generated from material that later becomes IP-locked, and their use of that data can be exposed by model inversion, it will not matter that the current dataset is impeccable from a governance standpoint.
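As a simplified illustration of the 'polling' approach described above, the sketch below implements a naive confidence-threshold membership-inference check, one of the simplest attacks in the model inversion family. The `model_predict` interface and the 0.95 threshold are assumptions for illustration; real attacks calibrate against shadow models rather than using a fixed cutoff.

```python
import numpy as np

def confidence(model_predict, record):
    """Highest class probability the model assigns to a candidate record."""
    return float(np.max(model_predict(record)))

def likely_in_training_set(model_predict, record, threshold=0.95):
    # Overfitted models tend to respond with unusually high confidence
    # to records they were trained on; polling many candidates this way
    # pieces together a picture of the contributing dataset.
    return confidence(model_predict, record) >= threshold
```

Note that the attacker needs only query access to the model, not its weights, which is why abstraction alone is a weak shield for confidential source data.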