Microsoft responsible machine learning capabilities build trust in AI systems, developers say

Anyone who runs a business knows that one of the hardest things to do is accuse a customer of malfeasance. That’s why, before members of Scandinavian Airlines’ (SAS) fraud detection unit accuse a customer of attempting to scam the carrier’s loyalty points program, the detectives need confidence that their case is solid.

“It would hurt us even more if we accidentally managed to say that something is fraud, but it isn’t,” said Daniel Engberg, head of data analytics and artificial intelligence for SAS, which is headquartered in Stockholm, Sweden.

The airline is currently flying a reduced schedule with limited in-flight services to help slow the spread of COVID-19, the disease caused by the novel coronavirus. Before the restrictions, SAS handled more than 800 departures per day and 30 million passengers per year. Maintaining the integrity of the EuroBonus loyalty program is paramount as the airline waits for regular operations to resume, noted Engberg.

EuroBonus scammers, he explained, try to gain as many points as quickly as possible to either book reward travel for themselves or to sell. When fraud occurs, legitimate customers lose an opportunity to claim seats reserved for the loyalty program and SAS loses out on important business revenue.

Today, a large portion of leads on EuroBonus fraud come from an AI system that Engberg and his team built with Microsoft Azure Machine Learning, a service for building, training and deploying machine learning models that are easy to understand, protect and control.

The SAS AI system processes streams of real-time flight, transaction, award claims and other data through a machine learning model with thousands of parameters to find patterns of suspicious behavior.

To understand the model predictions, and thus chase leads and build their cases, the fraud detection unit relies on an Azure Machine Learning capability called interpretability, powered by the InterpretML toolkit. This capability explains what parameters were most important in any given case. For example, it could point to parameters that suggest a scam of pooling points from ghost accounts to book flights.
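In code, that workflow looks roughly like the sketch below, which uses the open-source InterpretML package that powers the Azure Machine Learning interpretability capability. The feature names and synthetic data are hypothetical stand-ins for illustration, not SAS's actual model or inputs.

```python
import numpy as np
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier

# Hypothetical loyalty-program features (not SAS's real inputs):
# points earned per day, number of linked accounts, share of awards resold.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 2 * X[:, 1] > 1).astype(int)  # synthetic "fraud" label

ebm = ExplainableBoostingClassifier(
    feature_names=["points_per_day", "linked_accounts", "resold_share"]
)
ebm.fit(X, y)

# Global explanation: which features matter most across all predictions.
show(ebm.explain_global())

# Local explanation: why one particular account was flagged.
show(ebm.explain_local(X[:1], y[:1]))
```

An investigator can then read off which parameters pushed a particular account toward a fraud prediction before deciding whether to open a case.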

Model interpretability helps take the mystery out of machine learning, which in turn can build confidence and trust in model predictions, noted Engberg.

“If we build the trust in these models, people start using them and then we can actually start reaping the benefits that the machine learning promised us,” he said. “It’s not about explainability for explainability’s sake. It’s being able to provide both our customers and our own employees with insights into what these models are doing and how they are taking positions for us.”

A graphic for Azure Machine Learning, three sections for Understand, Control, and Protect

Graphic courtesy of Microsoft.

Understand, protect and control your machine learning solution

Over the past several years, machine learning has moved out of research labs and into the mainstream, and has transformed from a niche discipline for data scientists with Ph.D.s to one where all developers are expected to be able to participate, noted Eric Boyd, corporate vice president of Microsoft Azure AI in Redmond, Washington.

Microsoft built Azure Machine Learning to enable developers across the spectrum of data science expertise to build and deploy AI systems. Today, noted Boyd, all developers are increasingly asked to build AI systems that are easy to explain and that comply with non-discrimination and privacy regulations.

“It is very challenging to have a good sense of, ‘Hey, have I really assessed whether my model is behaving fairly?’ or ‘Do I really understand why this particular model is predicting the way it is?’” he said.

To navigate these hurdles, Microsoft today announced innovations in responsible machine learning that can help developers understand, protect and control their models throughout the machine learning lifecycle. These capabilities can be accessed through Azure Machine Learning and are also available in open source on GitHub.

The ability to understand model behavior includes the interpretability capabilities powered by the InterpretML toolkit that SAS uses to detect fraud in the EuroBonus loyalty program.

In addition, Microsoft said the Fairlearn toolkit, which includes capabilities to assess and improve the fairness of AI systems, will be integrated with Azure Machine Learning in June.

Microsoft also announced that a toolkit for differential privacy is now available for developers to experiment with in open source on GitHub; it can also be accessed through Azure Machine Learning. The differential privacy capabilities were developed in collaboration with researchers at the Harvard Institute for Quantitative Social Science and School of Engineering.

Differential privacy techniques make it possible to derive insights from private data while providing statistical assurances that private information such as names or dates of birth can be protected.

For example, differential privacy could enable a group of hospitals to collaborate on a better predictive model of the efficacy of cancer treatments, while helping them adhere to legal requirements to protect the privacy of hospital information and helping to ensure that no individual patient's data leaks out of the model.
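The core mechanism behind such guarantees can be sketched without reference to any particular toolkit: add carefully calibrated noise so that any single person's record has a provably small effect on the released statistic. The hypothetical count below uses the classic Laplace mechanism; it illustrates the concept rather than the announced toolkit's API.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy.

    Adding or removing one patient changes a count by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/epsilon bounds
    how much any individual can influence the published number.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# e.g. how many patients at one hospital responded to a treatment
print(laplace_count(true_count=412, epsilon=0.5))
```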

Azure Machine Learning also has built-in controls that enable developers to track and automate their entire process of building, training and deploying a model. This capability, known as machine learning operations, or MLOps, provides an audit trail to help organizations meet regulatory and compliance requirements.

“MLOps is really thinking around the operational, repeatable side of machine learning,” said Boyd. “How do I keep track of all the different experiments that I have run, the parameters that were set with them, the datasets that were used in creating them. And then I can use that to recreate those same things.”
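At its simplest, that tracking can look like the sketch below, written against the Azure Machine Learning Python SDK; the workspace configuration, experiment name and logged values here are hypothetical.

```python
from azureml.core import Workspace, Experiment

# Assumes a config.json downloaded from an existing Azure ML workspace.
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="eurobonus-fraud")  # hypothetical name

run = experiment.start_logging()
run.tag("training_data", "loyalty_transactions_2020_04")  # which dataset was used
run.log("n_estimators", 200)   # parameters set for this run
run.log("auc", 0.92)           # metric produced by this run
run.complete()                 # the run and its metadata join the experiment's audit trail
```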

Sarah Bird standing in front of a city skyline with a river in the background

Sarah Bird, Microsoft’s responsible AI lead for Azure AI based in New York City, helps create tools that make responsible machine learning accessible to all developers. Photo courtesy of Sarah Bird.

Contextual bandits and responsibility

In the mid-2010s, Sarah Bird and her colleagues at Microsoft’s research lab in New York were working on a machine learning technology called contextual bandits, which learns through exploration experiments how to perform specific tasks better and better over time.

For example, if a visitor to a news website clicks on a story about cats, the contextual bandit learns to present the visitor with more stories about cats. To keep learning, the bandit performs experiments such as showing the visitor stories about the Jacksonville Jaguars, a sports team, or the hit musical “Cats.” Whichever story the visitor clicks becomes another learning data point that leads to greater personalization.
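A toy version of the idea, using a simple epsilon-greedy strategy over hypothetical reader segments and story topics, is sketched below. It illustrates the general approach of balancing exploration and exploitation, not Microsoft's implementation.

```python
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    """Minimal contextual bandit: track average reward per (context, action),
    explore a random action with probability epsilon, otherwise exploit."""

    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon
        self.totals = defaultdict(float)  # (context, action) -> summed reward
        self.counts = defaultdict(int)    # (context, action) -> times shown

    def choose(self, context):
        if random.random() < self.epsilon:      # explore: try something new
            return random.choice(self.actions)
        # exploit: best observed click rate for this context so far
        return max(self.actions,
                   key=lambda a: self.totals[(context, a)] / max(self.counts[(context, a)], 1))

    def update(self, context, action, reward):
        self.totals[(context, action)] += reward
        self.counts[(context, action)] += 1

# Hypothetical news-site usage: context = reader segment, action = story topic
bandit = EpsilonGreedyBandit(actions=["cats", "jaguars_nfl", "cats_musical"])
story = bandit.choose(context="likes_cats")
bandit.update(context="likes_cats", action=story, reward=1)  # reward 1 = click
```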

“When it works, it is amazing, you get personalization lifts that you’ve never seen before,” said Bird, who now leads responsible AI efforts for Azure AI. “We started talking to customers and working with our sales team to see who wants to pilot this novel research tech.”

The sales leads gave Bird pause. As potential customers floated ideas about using contextual bandits to optimize the job interview process and insurance claim adjudications, she realized that many people lacked understanding of how contextual bandits work.

“I started saying, ‘Is it even ethical to do experimentation in those scenarios?’” Bird recalled.

The question led to discussions with colleagues in the Fairness, Accountability, Transparency and Ethics in AI research group, or FATE, and a research collaboration on the history of experimental ethics and the implications for reinforcement learning, the type of machine learning behind contextual bandits.

“The technology is good enough that we are using it for real use cases, and if we are using it for real use cases that affect people’s lives, then we better make sure that it is fair and we better make sure that it is safe,” said Bird, who now focuses full time on the creation of tools that make responsible machine learning accessible to all developers.

Huskies, wolves and scammers

Within a few years, ethical AI research had exploded around the world. Model fairness and interpretability were hot topics at major industry gatherings and responsible machine learning tools were being described in the academic literature.

In 2016, for example, Marco Tulio Ribeiro, now a senior researcher at Microsoft’s research lab in Redmond, presented a technique in an academic conference paper to explain the prediction of any classifier, such as computer vision models trained to classify objects in photos.

To demonstrate the technique, he deliberately trained a classifier to predict “wolf” if a photo had a snowy background and “husky” if there was no snow. He then ran the model on photos of wolves mostly in snowy backgrounds and huskies mostly without snow and showed the results to machine learning experts with two questions: Do you trust the model? How is it making predictions?

A collage of images of wolves and huskies that a machine learning model tried to decipher

Microsoft senior researcher Marco Tulio Ribeiro found that many machine learning experts trusted this model that predicts whether an image is of a wolf or husky. Then he gave them the model explanation, which shows the predictions are based on whether there is snow in the background. “Even experts are likely to be fooled by a bad model,” he said. Graphic courtesy of Microsoft. Photos via Getty.

Many of the machine learning experts said they trusted the model and offered theories on why it was predicting wolves or huskies, such as wolves having pointier teeth, noted Ribeiro. Less than half mentioned the background as a potential factor, and almost no one zeroed in on the snow.

“Then I showed them the explanations, and after seeing the explanations, of course everyone basically got it and said, ‘Oh, it is just looking at the background,’” he said. “This is a proof-of-concept; even experts are likely to be fooled by a bad model.”
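The intuition behind Ribeiro's technique can be sketched in a few lines: perturb a single input, ask the black-box model to score the perturbations, and fit a simple surrogate model whose coefficients reveal what the model is locally paying attention to. The sketch below is a simplified illustration with a toy model, not the published LIME implementation, which adds proximity weighting, smarter sampling and feature selection.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_like_explanation(predict_fn, x, num_samples=500, scale=0.5, seed=0):
    """Rough sketch of the idea: perturb one input, query the black-box model,
    and fit a linear surrogate whose coefficients serve as a local explanation."""
    rng = np.random.default_rng(seed)
    perturbed = x + rng.normal(scale=scale, size=(num_samples, x.shape[0]))
    scores = predict_fn(perturbed)           # e.g. probability of class "wolf"
    surrogate = Ridge().fit(perturbed, scores)
    return surrogate.coef_                   # per-feature importance near x

def toy_model(X):
    # A "model" that mostly keys on feature 2 (think: snow in the background).
    return 1 / (1 + np.exp(-3 * X[:, 2]))

print(lime_like_explanation(toy_model, x=np.zeros(4)))
```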

A refined version of Ribeiro’s explanation technique is one of several capabilities available to all developers through interpretability in Azure Machine Learning, the same capability that SAS’s fraud detection unit uses to build cases against scammers in the EuroBonus loyalty program.

Other AI solutions that SAS is creating with Azure Machine Learning include one for ticket sales forecasting and a system that optimizes fresh food stocking for in-flight purchases. The fresh food solution reduced food waste by more than 60% before fresh food sales were halted as part of global efforts to slow the spread of COVID-19.

Engberg and his data analytics and artificial intelligence team continue to build, train and test machine learning models, including further experimentation with the Azure Machine Learning capabilities for interpretability and fairness.

“The more we go into things affecting our customers or us as individuals, I think these concepts of fairness, explainable AI, responsible AI, will be even more important,” said Engberg.

Assessing and mitigating unfairness

Bird’s colleagues in FATE pioneered many of the capabilities in the Fairlearn toolkit. The capabilities allow developers to examine model performance across groups of people such as those based on gender, skin tone, age and other characteristics.

“It could be you have a great idea of what fairness means in an application and because these models are so complex, you might not even notice that it doesn’t work as well for one group of people as another group,” explained Bird. “Fairlearn is allowing you to find those issues.”
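With the open-source Fairlearn package, that check takes only a few lines of code. The sketch below uses made-up labels, predictions and groups to show how a metric can be disaggregated by a sensitive feature.

```python
import numpy as np
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

# Hypothetical predictions from a binary classifier, plus an illustrative
# sensitive feature ("A" / "B") for each example.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array(["A", "A", "A", "B", "B", "B", "B", "A"])

mf = MetricFrame(metrics=accuracy_score,
                 y_true=y_true, y_pred=y_pred,
                 sensitive_features=group)
print(mf.by_group)      # accuracy computed separately for each group
print(mf.difference())  # largest gap in accuracy between groups
```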

Eric Boyd stands with arms folded in a white background

Eric Boyd, Microsoft corporate vice president of Azure AI in Redmond, Wash., said innovations in responsible machine learning can help developers build AI systems that are easy to explain and comply with non-discrimination and privacy regulations. Photo courtesy of Microsoft.

EY, a global leader in assurance, tax, transaction and advisory services, piloted fairness capabilities in the Fairlearn toolkit on a machine learning model the firm built for automated lending decisions.

The model was trained on mortgage adjudication data from banks that includes transaction and payment history and credit bureau information. This type of data is generally used to assess a client’s capability and willingness to pay back a loan. But it also raises concerns about regulatory and legal issues, as well as potential unfairness toward applicants from specific demographic groups.

EY used Fairlearn to evaluate the fairness of model outputs with respect to biological sex. The toolkit, which surfaces results on a visual and interactive dashboard, revealed a 15.3 percentage point difference in positive loan decisions between males and females.

The Fairlearn toolkit allowed the modeling team at EY to quickly develop and train multiple remediated models and visualize the common trade-off between fairness and model accuracy. The team ultimately landed on a final model that preserved overall accuracy while reducing the difference between males and females to 0.43 percentage points.
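A minimal sketch of that kind of sweep, using Fairlearn's GridSearch reduction on synthetic lending-style data, is shown below. The features, labels and sensitive column are invented for illustration and are not EY's data or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import GridSearch, DemographicParity

# Hypothetical lending data: applicant features, a repayment label and a
# sensitive column, with a deliberate correlation to create a fairness gap.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
sex = rng.choice(["female", "male"], size=500)
y = (X[:, 0] + 0.3 * (sex == "male") + rng.normal(scale=0.5, size=500) > 0).astype(int)

# GridSearch trains a family of remediated models under a demographic-parity
# constraint; each one sits at a different point on the fairness/accuracy curve.
sweep = GridSearch(LogisticRegression(max_iter=1000),
                   constraints=DemographicParity(),
                   grid_size=10)
sweep.fit(X, y, sensitive_features=sex)

candidates = sweep.predictors_  # one fitted model per trade-off point
print(f"Trained {len(candidates)} remediated candidate models")
```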

The ability for any developer to assess and mitigate unfairness in their models is becoming essential across the financial industry, noted Boyd.

“Increasingly we’re seeing regulators looking closely at these models,” he said. “Being able to document and demonstrate that they followed the leading practices and have worked very hard to improve the fairness of the datasets are essential to being able to continue to operate.”

Responsible machine learning

Bird believes machine learning is changing the world for the better, but she said all developers need the tools and resources to build models in ways that put responsibility front and center.

Consider, for example, a research collaboration within the medical community to compile COVID-19 patient datasets to build a machine learning model that predicts who is at high risk of serious complications from the novel coronavirus.

Before such a model is deployed, she said, the developers need to make sure they understand how it makes decisions in order to explain the process to doctors and patients. The developers will also want to assess fairness, ensuring the model captures the known elevated risks to males, for example.

“I don’t want a model that never predicts that men are high risk, that would be terrible,” said Bird. “Then, obviously, I want to make sure that the model is not revealing the data of the people it was trained on, so you need to use differential privacy for that.”

Top image: An SAS AI-powered fraud detection tool processes streams of real-time flight information along with transaction, award claims and other data through a machine learning model to find patterns of suspicious behavior. An Azure Machine Learning capability called interpretability explains what model parameters were most important in any given case of suspected fraud. Photo courtesy of SAS.

Editor’s note: A previous version of this story referred to the differential privacy toolkit as WhiteNoise.


John Roach writes about Microsoft research and innovation. Follow him on Twitter.