Reuben Binns, our Research Fellow in Artificial Intelligence (AI), and Valeria Gallo, Technology Policy Adviser, discuss some of the techniques organisations can use to comply with data minimisation requirements when adopting AI systems.
This post is part of our ongoing Call for Input on developing the ICO framework for auditing AI. We encourage you to share your views by leaving a comment below or by emailing us at AIAuditingFramework@ico.org.uk.
AI systems generally require large amounts of data. However, organisations must comply with the minimisation principle under data protection law if using personal data. This means ensuring that any personal data is adequate, relevant and limited to what is necessary for the purposes for which it is processed.
What is adequate, relevant and necessary in relation to AI systems will be use-case specific. However, there are a number of techniques that organisations can adopt in order to develop AI systems which process as little personal data as possible, while still remaining functional. In this blog, we explore some of the most relevant techniques for supervised Machine Learning (ML) systems, which are currently the most common type of AI in use.
Within organisations, the individuals accountable for the risk management and compliance of AI systems need to be aware that such techniques exist and be able to discuss different approaches with their technical staff. The default approach of data scientists in designing and building AI systems will not necessarily take into account any data minimisation constraints. Organisations must therefore have in place risk management practices to ensure that data minimisation requirements, and all relevant minimisation techniques, are fully considered from the design phase, or, if AI systems are bought or operated by third parties, as part of the procurement process due diligence.
However, data minimisation techniques do not completely eliminate risk. Also, while some techniques will not require any compromise to deliver data minimisation benefits, some will require organisations to balance data minimisation with other compliance or utility objectives, eg making more accurate and non-discriminatory ML models. Our previous trade-offs blog discusses our current thinking about how organisations could approach this balancing act.
The first step organisations should take towards compliance with data minimisation is to understand and map out all the ML processes in which personal data might be used.
How personal data is used in ML models
Supervised ML algorithms can be trained to identify patterns and create models from data sets (‘training data’) which include past examples of the type of instances the model will be asked to classify or predict. Specifically, the training data contains both the ‘target’ variable, i.e. the thing that the model is aiming to predict or classify, and several ‘predictor’ variables, i.e. the input used to make the prediction. For instance, in the training data for a bank’s credit risk ML model, the predictor variables might include the age, income, occupation, and location of previous customers, while the target variable will be whether or not the customers repaid their loan.
Once trained, ML systems can then classify and make predictions based on new data containing examples that the system has never seen before. A query is sent to the ML model, containing the predictor variables for a new instance (eg a new customer’s age, income, occupation, etc.). The model responds with its best guess as to the target variable for this new instance (eg whether or not the new customer will default on a loan).
Supervised ML approaches therefore use data in two main phases:
- Training phase, when training data is used to develop models based on past examples; and
- Inference phase, when the model is used to make a prediction or classification about new instances.
If the model is used to make predictions or classifications about individual people, then it is very likely that personal data will be used at both the training and inference phases.
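To make the two phases concrete, here is a minimal sketch using the credit risk example. The data is invented, and the nearest-centroid "model" is a deliberately simple stand-in for a real learning algorithm; the function names train and predict are illustrative only.

```python
from statistics import mean

# Hypothetical training data: (age, income in thousands) -> did the customer repay?
TRAINING_DATA = [
    ((25, 18.0), False),
    ((31, 22.0), False),
    ((42, 48.0), True),
    ((55, 61.0), True),
]


def train(examples):
    """Training phase: learn one average predictor vector (centroid) per target value."""
    centroids = {}
    for label in {target for _, target in examples}:
        rows = [predictors for predictors, target in examples if target == label]
        centroids[label] = tuple(mean(column) for column in zip(*rows))
    return centroids


def predict(model, predictors):
    """Inference phase: return the target value whose centroid is closest to the query."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, predictors))
    return min(model, key=lambda label: squared_distance(model[label]))


model = train(TRAINING_DATA)             # personal data used in the training phase
prediction = predict(model, (47, 52.0))  # personal data used in the inference phase
```

Personal data appears in both calls: the past customers' records go into train, and the new customer's details go into predict.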
Techniques to minimise personal data
When designing and building ML applications, data scientists will generally assume that all data used in training, testing and operating the system will be aggregated in a centralised way, and held in its full and original form by a single entity throughout the AI system’s lifecycle.
However, there are in fact different approaches and a number of techniques which can be adopted instead to minimise the amount of data an organisation needs to collect and process, or the extent to which data is identifiable with a particular individual.
Data minimisation in the training phase
As we have explained, the training phase involves applying a learning algorithm to a dataset containing a set of features for each individual which are used to generate the prediction or classification.
However, not all features included in a dataset will necessarily be relevant to the task. For example, not all financial and demographic features will be useful to predict credit risk.
There are a variety of standard feature selection methods used by data scientists to select features which will be useful for inclusion in a model. These methods are good practice in data science, but they also go some way towards meeting the data minimisation principle.
Also, as discussed in the ICO’s previous report on AI and Big Data, the fact that some data might later in the process be found to be useful for making predictions is not enough to establish its necessity for the purpose in question, nor does it retroactively justify its collection, use, or retention.
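As an illustration of a simple feature selection step, the sketch below keeps only those features whose correlation with the target exceeds a threshold. The column names, values and the 0.5 threshold are all assumptions made for the example. A filter like this is only one input into the assessment: as noted above, statistical usefulness alone does not establish necessity.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.5):
    """Keep the names of features whose absolute correlation meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


# Hypothetical dataset: one list per feature, one entry per past customer.
COLUMNS = {
    "income_k": [18, 22, 48, 61, 35, 27],
    "missed_payments": [3, 2, 0, 0, 1, 2],
    "shoe_size": [9, 7, 8, 9, 7, 8],
}
REPAID = [0, 0, 1, 1, 1, 0]

selected = select_features(COLUMNS, REPAID)
```

In this invented data, shoe_size has no relationship with repayment, so it is dropped and would not need to be collected for this purpose.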
There are also a range of techniques for preserving privacy which can be used to minimise data processing at the training phase.
Some of these techniques involve modifying the training data to reduce the extent to which it can be traced back to specific individuals, while retaining its utility for the purposes of training well-performing models. This could involve changing the values of data points belonging to individuals at random – known as ‘perturbing’ or adding ‘noise’ to the data – in a way that preserves some of the statistical properties of those features (see eg the RAPPOR algorithm).
These types of privacy-preserving techniques can be applied to the training data after it has already been collected. Where possible, however, they should be applied before the collection of any personal data, to avoid the creation of large personal datasets altogether.
For instance, smartphone predictive text systems are based on the words that users have previously typed. Rather than always collecting a user’s actual keystrokes, the system could be designed to create ‘noisy’ (i.e. false) words at random. This would mean that an organisation could not be sure about which words were ‘noise’ and which words were actually typed by a specific user. Although data would be less accurate at individual level, provided the system has enough users, patterns could still be observed, and used to train the ML model, at an aggregate level.
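The sketch below illustrates the idea with randomised response, the basic building block that RAPPOR extends; it is not the RAPPOR algorithm itself. Each simulated user reports whether they typed a given word, but with some probability the report is replaced by a coin flip. The 30% true rate, the 0.5 truth probability and the function names are assumptions chosen for the example.

```python
import random


def randomised_response(truth, rng, p_truth=0.5):
    """Report the true value with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth=0.5):
    """Invert the noise, using: observed = p_truth * true + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


rng = random.Random(0)  # fixed seed so the simulation is repeatable
# Simulated ground truth: roughly 30% of users typed the word.
true_bits = [rng.random() < 0.3 for _ in range(100_000)]
# What the organisation actually receives: noisy reports only.
reports = [randomised_response(bit, rng) for bit in true_bits]
estimate = estimate_true_rate(reports)
```

No single report can be trusted, yet the estimate recovers the overall rate closely because the noise averages out across many users.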
The effectiveness of these privacy-preserving techniques in balancing the privacy of individuals and the utility of an ML system can be measured mathematically using methods such as differential privacy. Differential privacy is a way to measure whether a model created by an ML algorithm significantly depends on the data of any particular individual used to train it.
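Differential privacy is usually expressed through a parameter, commonly called epsilon, where smaller values mean stronger privacy and more noise. The sketch below shows this trade-off in the simplest setting, a released count rather than a trained model, using the standard Laplace mechanism. The records, the epsilon values and the function names are assumptions for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a count with Laplace noise calibrated to epsilon.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1 / epsilon, rng)


def is_over_40(age):
    return age > 40


rng = random.Random(42)
ages = [23, 35, 41, 29, 52, 67, 38, 45]  # hypothetical records; the true count is 4

# Repeat the release many times to compare how noisy each setting is.
strong_privacy = [private_count(ages, is_over_40, 0.1, rng) for _ in range(2000)]
weak_privacy = [private_count(ages, is_over_40, 5.0, rng) for _ in range(2000)]
```

With epsilon at 0.1 the released counts vary widely around the true value, while at 5.0 they stay close to it, which is the privacy and utility balance described above.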
A related privacy-preserving technique is federated learning. This allows multiple different parties to train models on their own data (‘local’ models), and then combine some of the patterns that those models have identified (known as ‘gradients’) into a single, more accurate ‘global’ model, without having to share any training data with each other. Federated learning is relatively new, but has several large scale applications. These include autocorrection and predictive text models across smartphones, as well as medical research involving analysis across multiple patient databases.
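A minimal sketch of the idea follows, assuming two parties and a one-parameter linear model. Each party computes a gradient on its own data, and only those gradients are averaged by a coordinating server. The datasets, learning rate and function names are invented for the example; real systems add further steps, such as secure aggregation, that are not shown here.

```python
def local_gradient(weight, data):
    """Gradient of mean squared error for y = weight * x on one party's own data."""
    return sum(2 * (weight * x - y) * x for x, y in data) / len(data)


def federated_round(weight, parties, learning_rate=0.01):
    """One round: each party contributes only a gradient; the server averages them."""
    total = sum(len(data) for data in parties)
    averaged = sum(local_gradient(weight, data) * len(data) for data in parties) / total
    return weight - learning_rate * averaged


# Hypothetical local datasets, roughly following y = 3x. They stay with each party.
PARTY_A = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2)]
PARTY_B = [(4.0, 12.1), (5.0, 14.8)]

weight = 0.0  # the shared 'global' model parameter
for _ in range(200):
    weight = federated_round(weight, [PARTY_A, PARTY_B])
```

The global weight ends up close to 3, the same result that pooling all five records would give, even though neither party sees the other's data.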
While sharing the gradient derived from a locally trained model presents a lower privacy risk than sharing the training data itself, a gradient can still reveal some personal information relating to the data subjects it was derived from, especially if the model is complex with a lot of fine-grained variables. Data controllers will therefore still need to assess the risk of re-identification. In the case of federated learning, participating organisations are likely to be considered joint controllers even though they don’t have access to each other’s data.
Minimising personal data at the inference stage
To make a prediction or classification about an individual, ML models usually require the full set of predictor variables for that person to be included in the query. As in the training phase, there are a number of techniques which can be used to minimise data at the inference stage. Here we cover several of the most promising approaches.
Converting personal data into less “human readable” formats
In many cases the process of converting data into a format that allows it to be classified by a model can go some way towards minimising it. Raw personal data will usually first have to be converted into a more abstract format for the purposes of prediction. For instance, human-readable words would normally be translated into a series of numbers (called a ‘feature vector’). This means that the organisation deploying an AI model may not need to process the human-interpretable version of the personal data contained in the query, for example, if the conversion happens on the user’s device.
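As a rough illustration, the sketch below turns a typed message into a fixed-length vector of counts on the device, so that only the numbers would need to leave it. The hashing scheme, the vector size of 16 and the function name are assumptions for the example. A vector like this is harder for a person to read, but that alone does not stop it being personal data.

```python
import hashlib


def to_feature_vector(text, size=16):
    """Map each word to one of `size` buckets and count the words per bucket."""
    vector = [0] * size
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % size
        vector[bucket] += 1
    return vector


# Conversion happens on the user's device; only `query` would be sent to the model.
query = to_feature_vector("see you at the station at noon")
```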
However, the fact that it is no longer easily human-interpretable does not imply that the converted data is no longer personal. Consider Facial Recognition Technology (FRT), for example. In order for a facial recognition model to work, digital images of the faces being classified have to be converted into ‘faceprints’. These are mathematical representations of the geometric properties of the underlying faces – eg the distance between a person’s nose and upper lip. Rather than sending facial images themselves to an organisation’s server, photos could be converted to faceprints directly on the device which captures them before sending them to the model for querying. These faceprints would be less easily identifiable to any humans than face photos. However, faceprints are still personal (indeed, biometric) data and therefore very much identifiable within the context of the facial recognition models that use them.
Making inferences locally
Another way to avoid the risks involved in sharing predictor variables is to host the model on the device from which the query is generated and which already collects and stores the data subject’s personal data. For example, an ML model could be installed on the user’s own device and make inferences ‘locally’, rather than being hosted on a cloud server.
For instance, models for predicting what adverts a user might be interested in could be run locally on their smartphone (see PrivAd and MobiAd for proof of concept examples). When an advertising opportunity arises, a range of different adverts could be sent from an advertising network, and the local model would select the most relevant one to show to the user, without having to reveal the user’s actual personal habits or profile information to the advertisers.
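The sketch below shows the general shape of this approach; it is not how PrivAd or MobiAd are actually implemented. The interest scores, advert structure and function name are all hypothetical. The key point is that the interest profile is only ever read on the device.

```python
def select_advert_locally(candidate_adverts, local_interest_scores):
    """Pick the advert whose topics best match the interest profile held on the device."""
    def relevance(advert):
        return sum(local_interest_scores.get(topic, 0.0) for topic in advert["topics"])
    return max(candidate_adverts, key=relevance)


# Hypothetical profile that is stored on, and never leaves, the user's smartphone.
LOCAL_INTEREST_SCORES = {"cycling": 0.9, "cooking": 0.4}

# Hypothetical batch of adverts sent to every device by the advertising network.
CANDIDATES = [
    {"id": "ad-1", "topics": ["cars"]},
    {"id": "ad-2", "topics": ["cycling", "travel"]},
    {"id": "ad-3", "topics": ["cooking"]},
]

chosen = select_advert_locally(CANDIDATES, LOCAL_INTEREST_SCORES)
```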
The constraint is that ML models need to be sufficiently small and computationally efficient to run on the user’s own hardware. However, recent advances in purpose-built hardware for smartphones and embedded devices mean that this is an increasingly viable option.
It is important to flag that local processing is not necessarily out of scope of data protection law. Even if the personal data involved in training is being processed on the user’s device, the organisation which creates and distributes the model would still be a data controller in so far as it determines the means and purposes of processing.
Privacy-preserving query approaches
If it is not feasible to deploy the model locally, other privacy-preserving techniques exist to minimise the data that is revealed in a query sent to an ML model (see eg TAPAS). These allow one party to retrieve a prediction or classification without revealing all of the information in the query to the party running the model; in simple terms, they allow you to get an answer without having to fully reveal the question.
There are conceptual and technical similarities between data minimisation and anonymisation. In some cases, application of privacy-preserving techniques means that certain data used in ML systems is rendered pseudonymous or anonymous. Our Anonymisation Code of Practice can provide organisations with information on these concepts. The ICO is also currently developing updated guidance on anonymisation to take account of recent developments and techniques in this field.
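One way to picture "getting an answer without fully revealing the question" is additive secret sharing, sketched below. This is a different and much simpler technique than TAPAS, and it rests on strong assumptions: the model is a linear score with integer weights, the features are non-negative integers, and the two servers do not collude. All names and values are hypothetical.

```python
import secrets

PRIME = 2**61 - 1  # all share arithmetic is done modulo this prime


def split_into_shares(features):
    """Split each feature into two random-looking shares that sum to it modulo PRIME."""
    share_a = [secrets.randbelow(PRIME) for _ in features]
    share_b = [(value - a) % PRIME for value, a in zip(features, share_a)]
    return share_a, share_b


def server_partial_score(weights, share):
    """Run by each server on the single share it receives."""
    return sum(w * s for w, s in zip(weights, share)) % PRIME


def combine(partial_a, partial_b):
    """Run by the client: adding the partial scores recovers the real score."""
    return (partial_a + partial_b) % PRIME


WEIGHTS = [3, 1, 5]     # hypothetical integer model weights, known to both servers
features = [42, 30, 2]  # the client's private predictor variables

share_a, share_b = split_into_shares(features)  # one share goes to each server
score = combine(server_partial_score(WEIGHTS, share_a),
                server_partial_score(WEIGHTS, share_b))
```

Each server sees only a list of random-looking numbers, while the client still obtains the same score it would have received by sending the features in the clear.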
We would like to hear your views on this topic and genuinely welcome any feedback on our current thinking. Please share your views by leaving a comment below or by emailing us at AIAuditingFramework@ico.org.uk.
For an overview, see chapter 11 of ‘The Algorithmic Foundations of Differential Privacy’.
Dr Reuben Binns, a researcher working on AI and data protection, joined the ICO on a fixed term fellowship in December 2018. During his two-year term, Reuben will research and investigate a framework for auditing algorithms and conduct further in-depth research activities in AI and machine learning.