Accuracy of AI system outputs and performance measures

Continuing our AI auditing framework blog series, Reuben Binns, our Research Fellow in Artificial Intelligence (AI), and Valeria Gallo, Technology Policy Adviser, explore how the data protection principle of accuracy applies to AI systems, and propose some steps organisations should take to ensure compliance.

This
blog forms part of our ongoing consultation on developing the ICO framework for
auditing AI. We are keen to hear your views in the comments below or via 
email.

Accuracy is
one of the key principles of data protection. It requires organisations to take
all reasonable steps to make sure the personal data they process is not “incorrect
or misleading as to any matter of fact” and, where necessary, is corrected or
deleted without undue delay.
Accuracy is especially
important when organisations use AI to process personal data and profile
individuals. If AI systems use or generate inaccurate personal data, this may
lead to the incorrect or unjust treatment of a data subject.
Discussions about
accuracy in an AI system often focus on the accuracy of input data, ie the
personal data of a specific individual used by an AI system to make decisions
or predictions about them.
However, it is important to
understand that accuracy requirements also apply to AI outputs, both in terms of the accuracy of decisions or predictions about
a specific person and across a wider population.
In this blog, we take a
closer look at what accuracy of AI outputs means in practice and why selecting
the appropriate accuracy performance measures is critical, in order to ensure
compliance and to protect data subjects.


Accuracy of AI outputs


If the output of an AI
system is personal data, any inaccuracies as to “any matter of fact” can be
challenged by data subjects. For example, if a marketing AI application predicted
that a particular individual was a parent when in fact they have no children, its
output would be inaccurate as to a matter of fact. The individual concerned
would have the right
to ask the controller to rectify the AI output, under article 16 of the General
Data Protection Regulation (GDPR).
Often, however, AI systems may generate personal data as outputs where there
is currently no matter of fact. For example, an AI system could predict that someone
is likely to become a parent in the next three years. This kind of prediction cannot
be accurate or inaccurate in relation to ‘any matter of fact’. However, the AI
system may be more or less accurate as a matter of statistics, measured in
terms of how many of its predictions turn out to be correct for the population they
are applied to over time.
The European Data Protection Board’s
(EDPB) guidance
says that in these cases individuals still have the right to challenge the accuracy of
predictions made about them, on the basis of the input data and/or the model(s)
used. The GDPR also gives data subjects the right to complement such personal
data with additional information.
In addition, accuracy
requirements are more stringent for solely automated AI systems whose outputs
have a legal or similarly significant effect on data subjects (article 22 of
the GDPR). In such cases, recital 71 of the GDPR states that organisations
should put in place “appropriate mathematical and statistical procedures”
for the profiling of data subjects as part of their technical measures. They should
ensure any factor that may result in inaccuracies in personal data is corrected
and the risk of errors is minimised.

While it is not the role
of the ICO to determine the way AI systems should be built,
it is our role to understand how accurate they are and their impact on data subjects.
Organisations should
therefore understand and adopt appropriate accuracy measures when building and
deploying AI systems, as these measures will have important data protection implications.


Accuracy as a performance measure: the impact on data protection compliance


Statistical accuracy is
about how closely an AI system’s predictions match the truth. For example, if
an AI system is used to
classify
emails as spam, a simple measure of accuracy would be the number of emails that
were correctly classified (as spam or not spam) as a proportion of all the
emails that were analysed.

However, such a measure could
be misleading. For instance, if 90% of all emails are spam, then you could create
a 90% accurate classifier simply by labelling everything as spam. For this
reason, alternative measures are usually used to assess how good a system is,
which reflect the balance between two different kinds of errors:

  • A false positive or ‘type I’ error: cases that the AI system incorrectly
    labels as positive (eg emails classified as spam when they are actually genuine).
  • A false negative or ‘type II’ error: cases that the AI system incorrectly
    labels as negative when they are actually positive (eg emails classified as
    genuine when they are actually spam).

The balance between these two types of errors can be captured through various measures, including:

  • Precision: the percentage of cases identified as positive that are in fact positive
    (also called ‘positive predictive value’). For instance, if 9 out of 10 emails that
    are classified as spam are actually spam, the precision of the AI system is 90%.
  • Recall (or sensitivity): the percentage of all cases that are in fact positive that
    are identified as such. For instance, if 10 out of 100 emails are actually spam, but
    the AI system only identifies seven of them, then its recall is 70%.
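
As a purely illustrative aside, the short Python sketch below (using invented email labels rather than the figures from the examples above) shows how accuracy, precision and recall are computed from a classifier’s outputs:

```python
# Minimal illustrative sketch (not from the original post) for a toy spam classifier.
# Labels: 1 = spam (the positive class), 0 = genuine.

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # actual labels for ten emails
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]  # the classifier's decisions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # spam correctly caught
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # genuine flagged as spam (type I)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # spam missed (type II)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # genuine passed through

accuracy = (tp + tn) / len(y_true)   # correct decisions over all emails
precision = tp / (tp + fp)           # of the emails flagged as spam, how many really are
recall = tp / (tp + fn)              # of the emails that really are spam, how many were flagged

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
# accuracy=0.70, precision=0.75, recall=0.60
```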

There are trade-offs between
precision and recall. If you place more importance on finding as many of the
positive cases as possible (maximising recall), this may come at the cost of
some false positives (lowering precision).
In addition, there may be
important differences between the consequences of false positives and false
negatives on data subjects. For example, if a CV filtering system selecting qualified
candidates for interview produces a false positive, an unqualified
candidate will be invited to interview, wasting both the employer’s and the
applicant’s time. If it produces a false negative, a qualified candidate will
miss an employment opportunity and the organisation will miss a good candidate.
Organisations may therefore wish to prioritise avoiding certain kinds of error
based on the severity and nature of the risks.
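
In practice, this prioritisation is often expressed through the classification threshold: many AI systems output a score, and where the threshold is set determines the balance between false positives and false negatives. The sketch below uses hypothetical scores and labels for a CV-filtering model to illustrate how lowering the threshold trades precision for recall:

```python
# Illustrative sketch (the scores, labels and thresholds are invented):
# how the decision threshold shifts the balance between error types.

def classify(scores, threshold):
    """Turn model scores into positive/negative decisions at a given threshold."""
    return [1 if s >= threshold else 0 for s in scores]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Hypothetical model scores and true labels (1 = qualified candidate).
scores = [0.95, 0.80, 0.75, 0.60, 0.55, 0.40, 0.30, 0.20]
y_true = [1,    1,    0,    1,    0,    1,    0,    0]

for threshold in (0.7, 0.5, 0.3):
    p, r = precision_recall(y_true, classify(scores, threshold))
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
# Lower thresholds catch more qualified candidates (higher recall)
# at the cost of inviting more unqualified ones (lower precision).
```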
In general, accuracy as a
measure depends on it being possible to compare the performance of a system’s
outputs to some “ground truth”, i.e. checking the results of the AI system against
the real world. For instance, a medical diagnostic tool designed to detect
malignant tumours could be evaluated against high quality test data, containing
known patient outcomes. In some other areas, a ground truth may be unattainable.
This could be because no high quality test data exists, or because what you are
trying to predict or classify is subjective (e.g. offensiveness) or socially constructed
(e.g. gender).
Similarly, in many cases
AI outputs will be more like an opinion than a matter of fact, so accuracy may
not be the right way to assess the acceptability of an AI decision.
In
addition, since accuracy is only relative to test data, if the latter isn’t
representative of the population you will be using your system on, then not
only may the outputs be inaccurate, but they may also lead to bias and discrimination.
These will be the subject of future blogs, where we will explore how, in such
cases, organisations may need to consider other principles like fairness and the
impact on fundamental rights, instead of (or as well as) accuracy.


Finally, accuracy is not a
static measure: while it is usually measured on static test data, in real
life systems will be applied to new and changing populations. A system that is
accurate with respect to an existing population (e.g. customers in the last
year) may not continue to perform well if the characteristics of the future
population change. People’s behaviours may change, either of their own accord
or because they are adapting in response to the system, and therefore the AI
system may become less accurate over time. This phenomenon is referred to in
machine learning as ‘concept drift’, and various methods exist for detecting
it. For instance, you can measure the estimated distance between
classification errors over time.
Further explanation of
concept drift can be found on the Cornell University website.
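
As an illustration, one simple monitoring approach (a sketch with assumed parameters, not necessarily the specific method referenced above) is to compare a deployed model’s recent error rate against the error rate observed at validation time, and to flag possible drift when the gap exceeds a chosen tolerance:

```python
# Illustrative sketch: a basic error-rate drift monitor.
# The window size and tolerance below are hypothetical choices.
from collections import deque

class ErrorRateDriftMonitor:
    def __init__(self, reference_error_rate, window_size=500, tolerance=0.05):
        self.reference = reference_error_rate      # error rate at validation/deployment time
        self.window = deque(maxlen=window_size)    # rolling record of recent errors
        self.tolerance = tolerance                 # acceptable degradation before alerting

    def record(self, prediction, actual):
        """Record whether a single prediction turned out to be wrong once the outcome is known."""
        self.window.append(int(prediction != actual))

    def drift_suspected(self):
        """True if the recent error rate has degraded beyond the tolerance."""
        if len(self.window) < self.window.maxlen:
            return False                           # not enough recent outcomes yet
        recent_error_rate = sum(self.window) / len(self.window)
        return recent_error_rate - self.reference > self.tolerance

# Usage: feed outcomes back as they become known, and review the model
# (and its training data) when drift_suspected() returns True.
monitor = ErrorRateDriftMonitor(reference_error_rate=0.10)
```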

What should organisations do?

Organisations
should always think carefully from the start about whether it is appropriate to
automate any prediction or decision-making process. This should include assessing
whether acceptable levels of accuracy can be achieved.
If an AI system is intended to complement, or replace, human decision-making, then any
assessment should compare human and algorithmic accuracy to understand the
relative advantages, if any, that various AI systems might bring. Any potential accuracy
risk should be considered and addressed as part of any Data
Protection Impact Assessment.


While
accuracy is just one of multiple considerations when determining whether and
how to adopt AI, it should be a key element of the decision-making process.
This is particularly true if the subject matter is, for example, subjective or
socially contestable. Organisations also need to consider whether high quality test
data can be obtained on an ongoing basis to establish a “ground truth”. Senior
leaders should be aware that, left to their own devices, data scientists may not
distinguish between data labels that are objective and those that are subjective,
but this may be an important distinction in relation to the data protection accuracy
principle.


If
organisations decide to adopt an AI system, then they should:

  • ensure
    that all functions and individuals responsible for its development, testing, validation,
    deployment, and monitoring are adequately trained to understand the associated accuracy
    requirements and measures; and
  • adopt
    an official common terminology that staff can use to discuss accuracy
    performance measures, including their limitations and any adverse impact on data
    subjects.
Accuracy
and the appropriate measures to evaluate it should be considered from the
design phase, and should also be tested throughout the AI lifecycle. After
deployment, monitoring should take place at a frequency proportionate to the
impact an incorrect output may have on data subjects: the higher the impact,
the more frequent the monitoring. Accuracy measures should also be regularly
reviewed to mitigate the risk of concept drift, and change management policies
and procedures should take this into account from the outset.


Accuracy
is also an important consideration if organisations outsource the development
of an AI system to a third party (either fully or partially) or purchase an AI
solution from an external vendor. In these cases, any accuracy claim made by third
parties needs to be examined and tested as part of the procurement process. Similarly,
it may be necessary to agree regular updates and reviews of accuracy to guard
against changing population data and concept drift.


Finally,
the vast quantity of personal data organisations will need to hold and process
as part of their AI systems is likely to put pressure on
any pre-AI processes to identify and, if
necessary, rectify or delete inaccurate personal data, whether it is used as input
data or as training/test data. Organisations will therefore need to review their data
governance practices and systems to ensure they remain fit for purpose.


Your feedback

We are keen to hear your thoughts on this topic and welcome any
feedback on our current thinking. In particular, we would appreciate your views
on the following two questions:
1)   Are there any additional
compliance challenges in relation to accuracy of AI systems outputs and
performance measures we have not considered?

2)   What other technical and
organisational controls or best practice do you think organisations should
adopt to comply with accuracy requirements? 
