Learning lessons from the approach to developing models for awarding grades in the UK in 2020

Executive Summary

Purpose of this report

In March 2020 the ministers with responsibility for education in England, Scotland, Wales and Northern Ireland announced the closure of schools as part of the UK’s response to the coronavirus outbreak. Further government announcements then confirmed that public examinations in summer 2020 would not take place.

The four UK qualification regulators – Ofqual (England), Scottish Qualifications Authority (Scotland), Qualifications Wales (Wales) and the Council for the Curriculum, Examinations & Assessment (Northern Ireland) – were directed by their respective governments to oversee the development of an approach to awarding grades in the absence of exams. While the approaches adopted differed, all approaches involved statistical algorithms.

When grades were released in August 2020, there was widespread public dissatisfaction centred on how the grades had been calculated and the impact on students’ lives. The grades in all four countries were subsequently re-issued, based on the grades that schools and colleges had originally submitted as part of the calculation process.

The public acceptability of algorithms and statistical models had not been such a prominent issue for so many people before, despite the rise in their use. As the regulator of official statistics in the UK, it is our role to uphold public confidence in statistics.

Statistical models and algorithms used by government and other public bodies are an increasingly prevalent part of contemporary life. As technology and the availability of data increase, there are significant benefits from using these types of models in the public sector.

We are concerned that public bodies will be less willing to use statistical models to support decisions in the future for fear of a public acceptability backlash, potentially hindering innovation and development of statistics and reducing the public good they can deliver. This is illustrated by the emphasis placed on not using algorithms during discussions of how grades would be awarded in 2021, following the cancellation of exams this year. For example, the Secretary of State for Education, when outlining the approach to awarding grades in January 2021, stated that “This year, we will put our trust in teachers rather than algorithms.”[1]

It is important therefore that lessons are learned for government and other public bodies who may wish to use statistical models to support decisions. This review identifies lessons for model development to support public confidence in statistical models and algorithms in the future.

The broader context: Models and algorithms

Throughout this report we have used the terms statistical model and algorithm when describing the various aspects of the models used to deliver grades. It should be noted, however, that terms such as statistical model, statistical algorithm, data-driven algorithms, machine learning, predictive analytics, automated decision making and artificial intelligence (AI), are frequently used interchangeably, often with different terms being used to describe the same process.

We consider that the findings of this review apply to all these data-driven approaches to supporting decisions in the public sector whatever the context.

Our approach: Lessons on building public confidence

This review centres on the importance of public confidence in the use of statistical models and algorithms and looks in detail at what it takes to achieve public confidence. The primary audiences for this review are public sector organisations with an interest in the use of models to support the delivery of public policy, both in the field of education and more broadly. This includes statisticians and analysts; regulators; and policy makers who commission statistical models to support decisions.

In conducting our review, we have adopted the following principles.

  • Our purpose is not to pass definitive judgments on whether any of the qualification regulators performed well or badly. Instead, we use the experiences in the four countries to explore the broader issues around public confidence in models.
  • The examples outlined in this report are included for the purposes of identifying the wider lessons for other public bodies looking to develop or work with statistical models and algorithms. These examples are therefore not an exhaustive description of all that was done in each country.
  • In considering these case studies, we have drawn on the principles of the Code of Practice for Statistics. While not written explicitly to govern the use of statistical algorithms, the Code principles have underpinned how we gathered and evaluated evidence, namely:

Trustworthiness: the organisational context in which the model development took place, especially looking at transparency and openness

Quality: appropriate data and methods, and comprehensive quality assurance

Value: the extent to which the models served the public good.

  • We considered the end-to-end processes, from receiving the direction from Ministers through to the awarding of grades and the planned appeals processes, rather than just the technical development of the algorithms themselves.
  • We have drawn on evidence from several sources. This included meeting with the qualification regulators and desk research of publicly available documents.
  • We have undertaken this review using our regulatory framework, the Code of Practice for Statistics. It is outside our remit to form judgments on compliance or otherwise with other legal frameworks.

We have also reviewed the guidance and support that is available to organisations developing statistical models and algorithms to identify whether it is sufficient, relevant and accessible and whether the available guidance and policies are coherent. Independent reviews of the grade awarding process have been commissioned by the Scottish Government, Welsh Government and Department of Education in Northern Ireland. Whilst there are some overlaps in scope with our review, there are also key differences – most notably, those reviews examined the approach to awarding grades in order to make recommendations for the approach to exams in 2021. Our review goes wider: it seeks to draw lessons from the approaches in all four countries to ensure that statistical models, whatever they are designed to calculate, command public confidence in the future.

Findings

The approaches to awarding grades were regulated by four bodies:

  • In England, Office of Qualifications and Examinations Regulation (Ofqual)
  • In Scotland, Scottish Qualifications Authority (SQA)
  • In Wales, Qualifications Wales
  • In Northern Ireland, Council for the Curriculum, Examinations & Assessment (CCEA).

Although the specific approaches differed in the four countries, the overall concepts were similar, in that they involved the awarding of grades based on a mix of teacher predicted grades, rankings of students within a subject, and the prior attainment of the 2020 students and/or previous cohorts at the same centre (i.e. school or college).
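
To make this description concrete, the sketch below shows, in Python, one highly simplified version of this type of approach: teachers rank the students in a centre, and grades are allocated to that ranking in proportion to the centre’s historical grade distribution. This is an illustrative toy under assumptions of our own (the three-grade scale, the function names and the example data are invented); it is not the method used by any of the four regulators.

```python
# Illustrative toy only: allocates grades to a teacher-ranked cohort so that
# the share of each grade matches the centre's historical grade distribution.
# It is NOT the method used by any of the 2020 qualification regulators.

def allocate_grades(ranked_students, historical_distribution):
    """Assign grades to students ranked best-first, in proportion to the
    historical share of each grade (grades listed best-first)."""
    n = len(ranked_students)
    results = {}
    position = 0
    cumulative_share = 0.0
    for grade, share in historical_distribution:
        cumulative_share += share
        # Number of students who should receive this grade or better
        cutoff = round(cumulative_share * n)
        for student in ranked_students[position:cutoff]:
            results[student] = grade
        position = cutoff
    # Any students left over through rounding receive the lowest grade
    lowest_grade = historical_distribution[-1][0]
    for student in ranked_students[position:]:
        results[student] = lowest_grade
    return results


if __name__ == "__main__":
    # Hypothetical centre: ten students ranked by their teachers, best first
    ranking = [f"student_{i}" for i in range(1, 11)]
    # Hypothetical historical distribution: 20% A, 50% B, 30% C
    history = [("A", 0.2), ("B", 0.5), ("C", 0.3)]
    for student, grade in allocate_grades(ranking, history).items():
        print(student, grade)
```

Even this toy makes some of the issues discussed later in this report visible: with a small cohort the rounding is coarse, and the grade a student receives depends heavily on the centre’s past results rather than on their own attainment.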

It was always going to be extremely difficult for a model-based approach to grades to command public confidence

The task of awarding grades in the absence of examinations was very difficult. There were numerous challenges that the qualification regulators and awarding organisations had to overcome. These included, but were not limited to:

  • The novelty of the approach, which meant that it was not possible to learn over multiple iterations and that best practice did not already exist.
  • The constraints placed on the models by the need to maintain standards and not disadvantage any groups.
  • The variability in exams results in a normal year due to a range of factors other than student ability as measured by prior attainment.
  • Tight timescales for the development and deployment of the model.
  • Decisions about young people’s lives being made on the day the grades were released.
  • Limited data on which to develop and test the model.
  • The challenges of developing the models while all parts of the UK were in a lockdown.
  • Significant variation between teacher estimated grades and historic attainment for some schools or colleges.

These challenges meant that it was always going to be difficult for a statistical algorithm to command public confidence.

Whilst we understand the unique and challenging context in which the models were developed, we also recognise that the grade awarding process in summer 2020 had a fundamental impact on young people’s lives.

Public confidence was influenced by a number of factors

Against the background of an inherently challenging task, the way the statistical models were designed and communicated was crucial. This demonstrates that the implementation of models is not simply a question of technical design. It is also about the overall organisational approach, including factors like equality, public communication and quality assurance.

Many of the decisions made supported public confidence, while in some areas different choices could have been made. In our view, the key factors that influenced public confidence were:

  • Integrity – the teams in all of the qualification regulators and awarding organisations acted with honesty and integrity. All were trying to develop models that would provide students with the most accurate grade and enable them to progress through the education system. This is a vital foundation for public confidence.

  • Confidence in statistical models in this context – whilst we recognise the unique time and resource constraints in this case, a high level of confidence was placed in the ability of statistical models to predict a single grade for each individual on each course whilst also maintaining national standards and not disadvantaging any groups. In our view the limitations of statistical models, and the uncertainty in their results, were not fully communicated. More public discussion of these limitations and the mechanisms being used to overcome them, such as the appeals process, may have helped to support public confidence in the results.
  • Transparency of the model and its limitations – whilst the qualification regulators undertook activities to communicate information about the models to those affected by them and published technical documentation on results day, full details around the methodology to be used were not published in advance. This was due to a variety of reasons, including short timescales for model development, a desire not to cause anxiety amongst students and concerns about the impact on the centre assessed grades had the information been released sooner. The need to communicate about the model, whilst also developing it, inevitably made transparency difficult.
  • Use of external technical challenge in decisions about the models – the qualification regulators drew on expertise within the qualifications and education context and extensive analysis was carried out in order to make decisions about the key concepts in the models. Despite this, there was, in our view, limited professional statistical consensus on the proposed method. The methods were not exposed to the widest possible audience of analytical and subject matter experts, though we acknowledge that time constraints were a limiting factor in this case. A greater range of technical challenge may have supported greater consensus around the models.
  • Understanding the impact of historical patterns of performance in the underlying data on results – in all four countries the previous history of grades at the centre was a major input to calculating the grades that the students of 2020 received for at least some of their qualifications. The previous history of grades would have included patterns of attainment that are known to differ between groups. There was limited public discussion ahead of the release of results about the likely historical patterns in the underlying data and how they might impact on the results from the model. All the regulators carried out a variety of equality impact analyses on the calculated grades for potentially disadvantaged categories of students at an aggregate level. These analyses were based on the premise that attainment gaps should not widen, and showed that the gaps did not in fact widen (a minimal sketch of this kind of aggregate check follows this list). Despite this analytical assurance, there was a perception when results were released that students in lower socio-economic groups were disadvantaged by the way grades were awarded. In our view, this perception was a key cause of the public dissatisfaction.
  • Quality assurance – in this case, there were clear examples of good quality assurance of both input and output data. For input data, centres were provided with detailed guidance on the data they should supply. For output data, the regulators undertook a wide range of analysis, largely at an aggregate level. There was limited human review of outputs of the models at an individual level prior to results day. Instead, the appeals process was expected to address any issues. There was media focus on cases where a student’s grade was significantly different from the teacher prediction. In our view, these concerns were predictable and, whilst we recognise the constraints in this scenario, such cases should be explored as part of quality assurance.
  • Public engagement – all the qualification regulators undertook a wide range of public engagement activities, particularly at the outset. They deployed their experience in communicating with the public about exams and used a range of communication tools including formal consultations and video explainers, and the volume of public engagement activity was significant. Where acceptability testing was carried out, however, the focus was primarily on testing the process of calculating grades, and not on the impact on individuals. This, and the limited testing in some countries, may have led to the regulators not fully appreciating the risk that there would be public concern about the awarding of calculated grades.
  • Broader understanding of the exams system – in a normal year, individuals may not get the results they expect. For example, they may perform less well in an exam than anticipated. Statistical evidence and expert judgments support the setting of grade boundaries in a normal year. These may not be well understood in general but, as well-established processes, they are able to command public confidence. As a result, when the unfamiliar 2020 approach was presented publicly, people may have assumed that an entirely new, machine-led approach was being introduced, and this may have raised their concerns. This issue of broader understanding would have been very hard for the regulators to address in the time available.
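
The aggregate equality analyses described in the bullet on historical patterns of performance can also be illustrated with a minimal sketch. It compares the attainment gap between two groups in a baseline year with the gap in the calculated grades and flags whether the gap has widened. The group labels, grade threshold and figures are hypothetical, and the regulators’ real analyses were far more extensive.

```python
# Illustrative toy only: an aggregate check that the attainment gap between
# two groups has not widened relative to a baseline year. The group labels,
# threshold grade and figures are hypothetical.

def top_grade_rate(grades, threshold_grades=("A",)):
    """Share of a group achieving at least the threshold grade(s)."""
    return sum(1 for g in grades if g in threshold_grades) / len(grades)


def gap_has_widened(baseline, calculated, group_a, group_b):
    """Compare the group_a minus group_b attainment gap in a baseline year
    with the gap in the calculated grades."""
    baseline_gap = top_grade_rate(baseline[group_a]) - top_grade_rate(baseline[group_b])
    calculated_gap = top_grade_rate(calculated[group_a]) - top_grade_rate(calculated[group_b])
    return calculated_gap > baseline_gap


if __name__ == "__main__":
    # Hypothetical 2019 results and 2020 calculated grades, split by group
    results_2019 = {"group_a": ["A", "A", "B", "C"], "group_b": ["A", "B", "C", "C"]}
    results_2020 = {"group_a": ["A", "B", "B", "C"], "group_b": ["A", "B", "B", "C"]}
    print("Gap widened:", gap_has_widened(results_2019, results_2020, "group_a", "group_b"))
```

A check of this kind operates only at the aggregate level: it cannot say whether any individual student was treated fairly, which helps explain why the perception of unfairness persisted despite the analytical assurance.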

Overall, what is striking is that, while the approaches and models in the four countries had similarities and differences, all four failed to command public confidence. This demonstrates that there are key lessons to be learned for government and public bodies looking to develop statistical models to support decisions. These lessons apply to those that develop statistical models, policy makers who commission statistical models and the centre of government.

Lessons for those developing statistical models

Our review found that achieving public confidence is not just about delivering the key technical aspects of a model or the quality of the communication strategy. Rather, it arises through considering public confidence as part of an end-to-end process, from deciding to use a statistical model through to deploying it.

We have identified that public confidence in statistical models is supported by the following three principles:

  • Be open and trustworthy – ensuring transparency about the aims of the model and the model itself (including limitations), being open to and acting on feedback and ensuring the use of the model is ethical and legal.
  • Be rigorous and ensure quality throughout – establishing clear governance and accountability, involving the full range of subject matter and technical experts when developing the model and ensuring the data and outputs of the model are fully quality assured.
  • Meet the need and provide public value – engaging with commissioners of the model throughout, fully considering whether a model is the right approach, testing acceptability of the model with all affected groups and being clear on the timing and grounds for appeal against decisions supported by the model.

Specific learning points, which are of relevance to all those using data-driven approaches to support decisions in the public sector, underpin each principle. These are detailed in Part 3 of this report.

Lessons for policy makers who commission statistical models

We have identified lessons for commissioners of statistical models on ensuring public confidence, framed from the perspective of supporting those developing the models.

  • A statistical model might not always be the best approach to meet your need. Commissioners of statistical models and algorithms should be clear about what the model aims to achieve and whether the final model meets the intended use, including whether the results, even if they are “right”, are publicly acceptable. They should ensure that they understand the likely strengths and limitations of the approach, take on board expert advice and be open to alternative approaches to meeting the need.
  • Statistical models used to support decisions are more than just automated processes. They are built on a set of assumptions and the data that are available to test them. Commissioners of models should ensure that they understand these assumptions and provide advice on acceptability of the assumptions and key decisions made in model development.
  • The development of a statistical model should be regarded as more than just a technical exercise. Commissioners of statistical models and algorithms should work with those developing the model throughout the end-to-end process to ensure that the process is open, rigorous and meets the intended need. This should include building in regular review points to assess whether the model will meet the policy objective.

Lessons for the centre of Government

For statistical models used to support decisions in the public sector to command public confidence, the public bodies developing them need guidance and support to be available, accessible and coherent.

The deployment of models to support decisions on services is a multi-disciplinary endeavour. It cuts across several functions of Government, including the Analysis Function (headed by the National Statistician) and the Digital and Data Function, led by the new Central Digital and Data Office, as well as others including operational delivery and finance. As a result, there is a need for central leadership to ensure consistency of approach.

The Analysis Function aims to improve the analytical capability of the Civil Service and enable policy makers to easily access advice, analysis, research and evidence, using consistent, professional standards. In an environment of increasing use of models, there is an opportunity for the function to demonstrate the role that analysis standards and professional expertise can play in ensuring these models are developed and used appropriately.

Our review has found that there is a fast-emerging community that can provide support and guidance on statistical models, algorithms, AI and machine learning. However, it is not always clear what is relevant and where public bodies can turn for support – the landscape is confusing, particularly for those new to model development and implementation. Although there is an emerging body of practice, there is only limited guidance and few practical case studies on public acceptability and transparency of models. More needs to be done to ensure that public bodies have access to available, accessible and coherent guidance on developing statistical models.

Professional oversight and support should be available to public bodies developing statistical models. This should include a clear place to go for technical and ethics expertise.

Our recommendations

These recommendations focus on the actions that organisations in the centre of Government should take. Those taking forward these recommendations should do so in collaboration with the administrations in Scotland, Wales and Northern Ireland, which have their own centres of expertise in analysis, digital and data activities.

Recommendation 1: The Heads of the Analysis Function and the Digital Function should come together and ensure that they provide consistent, joined-up leadership on the use of models.

Recommendation 2: The cross-government Analysis and Digital functions, supported by the Centre for Data Ethics and Innovation, should work together, and in collaboration with others, to create a comprehensive directory of guidance for Government bodies that are deploying these tools.

Recommendation 3: The Analysis Function, Digital Functions and the Centre for Data Ethics and Innovation should develop guidance, in collaboration with others, that supports public bodies that wish to test the public acceptability of their use of models.

Recommendation 4: In line with the Analysis Function’s Aqua Book, in any situation where a model is used, accountability should be clear. In particular, the roles of commissioner (typically a Minister) and model developer (typically a multi-disciplinary team of officials) should be clear, and communications between them should also be clear.

Recommendation 5: Any Government body that is developing advanced statistical models with high public value should consult the National Statistician for advice and guidance. Within the Office for National Statistics there are technical and ethical experts who can support public bodies developing statistical models. This includes the Data Science Campus, the Methodology Advisory Service, the National Statistician’s Data Ethics Committee and the Centre for Applied Data Ethics.

We will produce our own guidance in 2021 which sets out in more detail how statistical models should meet the Code of Practice for Statistics. In addition, we will clarify our regulatory role when statistical models and algorithms are used by public bodies.

Conclusion

The grade awarding process in 2020 was a high-profile example of public bodies using statistical models to make decisions.

In our view, the teams within the qualification regulators and awarding organisations worked with integrity to try to develop the best method in the time available to them. In each country there were aspects of the model development that were done well, and aspects where a different choice may have led to a different outcome. However, none of the models were able to command public confidence and there was widespread public dissatisfaction with how the grades had been calculated and with the impact on students’ lives.

Our review has identified lessons to ensure that statistical models, whatever they are designed to calculate, can command public confidence in the future. The findings of this review apply to all public bodies using data-driven approaches to support decisions, whatever the context.

Our main conclusion is that achieving public confidence in statistical models is not just about the technical design of the model – taking the right decisions and actions with regard to transparency, communication and understanding public acceptability throughout the end-to-end process is just as important.

We also conclude that guidance and support for public bodies developing models should be improved. Government has a central role to play in ensuring that models developed by public bodies command public confidence. This includes directing the development of guidance and support, ensuring that the rights of individuals are fully recognised and that accountabilities are clear.

[1] The Secretary of State for Education, Covid-19: Educational Settings, Hansard, Volume 686, debated on Wednesday 6 January 2021.