Featured

Issues with AI software

All software comes with issues and errors. No software can be defect-free, and that is one of the guiding principles behind testing any product. It is impossible, even in theory, to identify every defect in a piece of software. Hence a tester prioritizes finding the most important defects early.

When it comes to AI, most QA professionals struggle to extract product quality information from it. This is because AI applications are built very differently from traditional products.

Developers and data scientists would say: “We don’t need to test quality; we already have accuracy metrics in place to verify outcomes.”

Is testing of AI software restricted to model accuracy metrics alone? Definitely not. There are loads of ways AI software can fail, and in this post we are going to list some of the most common ones.

Understand this: a machine learning algorithm can never work in isolation. In fact, it is a small part of a whole system, like the brain in a human body. We can use various heuristics and quality parameters to test all the body parts, because each one does some very specific tasks. Now, given the assignment, how would you go about testing the brain of a human body (or, for that matter, a machine learning feature that learns and gets better)?

You see that small black box over there? That’s the actual AI part of the whole system. All the other blocks are body parts supporting it.

I won’t be covering issues related to the other blocks for now. All issues in the small ML box affect the functioning of the system as a whole, so it is worth looking at how the ML part fails, what limitations it has, and where it is susceptible to failure.

10. Interpretation of machine learning results is difficult to impossible

Machine learning systems, and especially their ML blocks, have testability limitations. Most machine learning models are difficult to interpret, meaning we do not know why the application is giving a specific result. This is especially true for popular deep learning algorithms. There are some tools like this that help interpretability; however, we still don’t have a robust solution for model explainability. In fact, this is one of the active research areas in artificial intelligence. A few algorithms, such as decision trees, are easier to interpret, but they aren’t stable either.

This limitation makes machine learning systems a weak contender for enterprise applications where stability and robustness are major concerns.

9. One model doesn’t work for other types of predictions

We build models based on the data we have, and such a model gives good results only on a “similar” dataset. Change the data and boom! The model and the application will start throwing errors.

All models learn from a specific dataset, and they aren’t flexible enough to work on different datasets. In fact, this is a research problem, and we are still far from solving it.

8. ML model fairness

Model fairness is my favorite subject. In a famous example, an AI chatbot built by Microsoft learned to tweet based on what users were tweeting at it. The results were horrifying.

Access it here – https://twitter.com/geraldmellor/status/712880710328139776

There is another example from Amazon’s recruiting tool. This is similar to the Garbage In, Garbage Out principle. So project managers will definitely need to keep a watch on the nature of the data we are training on.

7. Results are sometimes no better than tossing a coin

In one of the projects I worked on, where I needed to predict whether an employee would leave the organisation or not (attrition), I got 40% accuracy. This was after months of research into the data, feature engineering, and state-of-the-art machine learning algorithms.

If I asked a monkey (no offense to monkeys) to predict whether someone will “go” or “stay” in the organisation, it would get about 50% accuracy on average. Or I could just toss a coin: heads, you stay; tails, you go.
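A tester can make this comparison explicit by checking the model against trivial baselines before trusting its accuracy number. The snippet below is only a minimal sketch: the labels and predictions are made-up illustration data, and the helper names `accuracy` and `majority_baseline` are my own, not part of any library.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy of always predicting the most common label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Hypothetical attrition labels: 1 = leaves, 0 = stays (made-up data).
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline(y_true)
coin_toss_acc = 0.5  # expected accuracy of a fair coin on a two-class problem

print(f"model={model_acc:.2f} majority={baseline_acc:.2f} coin={coin_toss_acc:.2f}")
if model_acc <= max(baseline_acc, coin_toss_acc):
    print("Model does not beat the trivial baselines")
```

If the model cannot beat "always predict the most common class" or a coin toss, its accuracy figure alone says very little about product quality.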

6. A 99.99% accurate model

A data scientist knows that if a model shows 100% accuracy, either a mistake was made during model training or, if not, there is no real need for machine learning in that kind of problem. What if the accuracy is 99.99%?

Most of the time, this is also a case of overfitting. The model has learnt so much from the existing dataset that it simply won’t work correctly on new data it hasn’t seen yet. The best models I have seen fall in the 70%-80% accuracy range, though that’s not a rule. In the deep learning applications on image/video/text datasets that I have seen, accuracy falls in the 50%-60% range.

By the way, in the past I had clients who expected 100% accuracy from models. We had to say no to them, explaining that it would not be possible even in theory.

Even if your model has one of the best accuracy percentages, we need to connect the dots to the real-world problem. That is where things get serious.

5. High CPU and GPU usage

Machine learning systems tend to take a much larger share of your overall resources, both in development and in production. As models use “statistics” to find patterns within data and do complex vector operations on large matrices, they require a large GPU and lots of memory. If you have continuous training enabled, they take even more memory.

This is called infrastructure debt. You can read more about the various debts that machine learning systems have.

4. Suddenly dropping results

Your production ML model may suddenly start producing bad results after some time. Finding out the reasons behind it won’t be straightforward. Maybe your model has become stale, or maybe the input data from the real world is quite different from what was anticipated and we haven’t put unit tests in place to detect that.

Dissecting the model can take a long time, depending on whether you have models built on structured datasets or deep learning-based models. Often you won’t have that much time to identify the root cause, and you will prefer to build a new model instead.
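One way to catch "real-world input is different from what we trained on" before the results drop is a drift check on each input feature. The sketch below uses the population stability index (PSI), a technique I am bringing in for illustration; it is not something the project above used. The function name `psi`, the sample data, and the 0.25 alert threshold (a commonly quoted rule of thumb) are all assumptions.

```python
import math


def psi(expected, actual, bins=5, eps=1e-6):
    """Population stability index between two numeric samples.

    Bin edges are taken from the range of the expected (training) sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so out-of-range production values fall in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from causing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up feature values: training data vs. shifted production data.
training_amounts = list(range(100))
production_amounts = [v + 40 for v in training_amounts]

drift_score = psi(training_amounts, production_amounts)
DRIFT_THRESHOLD = 0.25  # assumed rule-of-thumb threshold
print(f"PSI={drift_score:.2f} drifted={drift_score > DRIFT_THRESHOLD}")
```

A check like this can run as a unit test in the data pipeline, so that a shift in the inputs raises an alert instead of silently degrading predictions.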

3. Different results for same inputs

For the same data inputs, your model should ideally give the same results every time the data passes through it. In reality, this is not always the case.

Not all ML models are deterministic; many are stochastic in nature.

A deterministic system doesn’t possess any randomness. All parameters are fixed, the logic that produces results is fixed, and everything is pre-defined. ML models don’t have this luxury. ML system behavior depends strongly on data and on models that can’t be specified a priori.
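One common source of this randomness is random sampling or random initialization during training. The toy example below is only a sketch of the idea: `train_toy_model` is a hypothetical stand-in for a real training routine, and it simply estimates a mean from a random half of the data. Fixing the seed makes a run repeatable, which is what a tester needs in order to compare results.

```python
import random


def train_toy_model(data, seed=None):
    """Toy 'training': estimate the mean from a random 50% sample of the data."""
    rng = random.Random(seed)  # separate generator, so the seed is explicit
    sample = rng.sample(data, k=len(data) // 2)
    return sum(sample) / len(sample)


data = list(range(1, 101))

run_a = train_toy_model(data, seed=42)
run_b = train_toy_model(data, seed=42)  # same seed: same result as run_a
run_c = train_toy_model(data, seed=7)   # different seed: usually a different result

print(run_a, run_b, run_c)
```

Real ML frameworks have their own seeding mechanisms, and some sources of randomness (for example, certain GPU operations) may not be controlled by a seed at all, so treat this as an illustration rather than a guarantee.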

2. High maintenance & technical debt

Beyond infrastructure debt, machine learning systems are also a high-interest credit card of technical debt. Data dependency costs, data pipeline jungles, and configuration debt are some of the important debts. For the full list of technical debts introduced in ML systems, see here.

1. You can fool the model easily

As the major logic is hidden in “data”, machine learning systems can be fooled quite easily just by manipulating data. If the system accepts input data from production as is, it is susceptible to failure or hacking. The earlier example of the Twitter bot built by Microsoft shows a model learning something it shouldn’t. My previous blog post about toasters describes how adding noisy data can fool a neural network model easily. Adversarial attacks on ML systems pose a big threat.

The fast gradient sign method is one such approach: it adds or subtracts a small error to each pixel of an input to introduce perturbations, resulting in misclassification. Systems that depend on such inputs, such as face recognition and personal identification systems, can be attacked easily.
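To make the fast gradient sign method concrete, here is a minimal sketch on a tiny logistic regression model instead of a deep network, so that the gradient can be written by hand. The weights, the four-"pixel" input, and `epsilon` are made-up values for illustration. The perturbed input is the original input plus `epsilon` times the sign of the gradient of the loss with respect to the input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of class 1 for a logistic regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(weights, bias, x, y_true, epsilon):
    """Fast gradient sign method for a logistic regression model.

    For the cross-entropy loss, the gradient with respect to the input is
    (p - y_true) * weights, so each feature moves by +/- epsilon.
    """
    p = predict_proba(weights, bias, x)
    grad = [(p - y_true) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input (made-up numbers, four "pixels").
weights, bias = [2.0, -1.5, 1.0, -0.5], 0.0
x = [0.6, 0.4, 0.5, 0.5]  # true class is 1
x_adv = fgsm(weights, bias, x, y_true=1, epsilon=0.2)

p_clean = predict_proba(weights, bias, x)
p_adv = predict_proba(weights, bias, x_adv)
print(f"clean: {p_clean:.2f}  adversarial: {p_adv:.2f}")
```

In this toy setup, a change of only 0.2 per feature is enough to push the predicted probability of the true class from above 0.5 to below it, so the predicted label flips.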

I am ending my blog post here. There could be a lot more issues in ML systems that I have missed. Please comment below if you know of some.

In the next blog post, we will look at the prerequisites for testing any AI system.

AI model staleness – A tester’s perspective

AI model staleness (meaning the model has lost its freshness) is a common issue in all machine learning applications. In this post, we are going to learn what it means, how it can be identified, and ways to solve it.

Model freshness and quality curve (Credits: shorturl.at/BJQZ1)

What is model staleness?

Suppose you have built (okay, not you, the data scientists have) a financial fraud detection model. The model was built on 3 years of data: 80% of the data (around 29 months) was used to train the model, and the remaining 7 months to validate that it works correctly.

Now this model is in production, and has been for the last year.

At this stage, if there is any variation in the data that appeared in the last year (new kinds of fraud, customer types, locations where the bank opened new branches, new deposit/loan schemes) and the model has not been trained on it, the model has lost its freshness. It has become stale.

Another example.

Imagine a recommendation model for an e-commerce site like Amazon. This model does one thing very efficiently: it works out which products you are most likely to want, and shows them to you. It does so by grouping together similar types of customers (this is called clustering, or collaborative filtering in recommendation systems).

Now the data scientists have built this model on data from US customers, based on orders, transaction history, and customer data. The model has learnt a lot about US customers and what products they might like. Now Amazon goes live with this model in India. Guess what would happen.

Recommendations would go wrong for many reasons: the behaviour of Indian customers, the products they like, culture, buying patterns, and pricing structures all differ. Hence this recommendation model would be stale for the new geography.

You get the idea.

How to identify such staleness in your project?

Time parameter – Check how old the model is, that is, when the data scientist trained it. In particular, look at the date up to which data was used for training. Remember that the model learns only from training data, not from the test or validation dataset. This is especially important for forecasting models, where we predict future growth, revenue, or weather based on time-dependent events.
Accuracy – Compare the accuracy of the training phase and the testing phase. If accuracy in the training phase is very high but accuracy in the testing/validation phase is low, raise an issue for staleness. Remember that this may also be a case of overfitting; we will talk about that later.
Data engineering pipeline – Even if the model has been trained on new data, it may not be taking in all new categories of data points (for example, a new product category suddenly added to the sales dataset). If the model is retrained but not all parameters are explicitly listed for tuning, and the unit tests are not updated for the new data points, we will not know why the model is showing signs of staleness.
Deployment options – With multiple deployment options, various versions of the model are served side by side to understand the impact of each one. Later, the most effective model, chosen by comparing against business metrics, is deployed everywhere. Verify that this is completely automated and that we are looking at the newest version of the model.
New feature addition – See where in the data pipeline new features are added, preferably in an automated fashion. If there is no such automation, you need to check that all features are represented in the model. (How do you view a model? I will cover that in later posts in this series.)
Exception handling – If a newly trained model throws an error or fails to compile, check what happens. Most of the time, the older working model is kept instead of the new model in this scenario. Check that logs are available for model training. (Logs are useful for debugging and improve application testability.)
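Two of the checks above, the time parameter and new categories in the data pipeline, are easy to automate. The snippet below is a minimal sketch under my own assumptions: the dates, category names, helper names, and the 180-day limit are hypothetical and would come from your project and business team.

```python
from datetime import date


def training_data_age_days(training_cutoff, today):
    """Days since the most recent record used for training."""
    return (today - training_cutoff).days


def unseen_categories(training_values, production_values):
    """Categories that show up in production but were never seen in training."""
    return sorted(set(production_values) - set(training_values))


# Hypothetical values for illustration.
age = training_data_age_days(date(2019, 6, 30), today=date(2020, 6, 30))
new_cats = unseen_categories(
    ["savings", "loan", "deposit"],
    ["loan", "crypto", "savings", "wallet"],
)

MAX_AGE_DAYS = 180  # assumed business threshold
is_stale = age > MAX_AGE_DAYS or bool(new_cats)
print(f"age={age} days, unseen={new_cats}, stale={is_stale}")
```

Either signal on its own (training data that is too old, or categories the model has never seen) is a reason to raise a staleness issue.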

Ways to reduce model staleness:

A/B comparison tests – For models of different ages, draw an age-quality curve. You get a line, and you can use a (business-given) accuracy threshold to determine the minimum accuracy the model must provide on the validation dataset (not on the training dataset), and thereby the maximum bearable age of the model.
Add a feature pipeline and make sure all new features are added – Alert when new features are not available, and write unit tests to verify that all required features are collected from the dataset for training.
Use cross-validation while training – Cross-validation iterates through the data multiple times, training on a different subset of data points each time. This rules out the possibility of the model never being trained on certain parts of the data.
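The age-quality curve idea can be sketched in a few lines. The numbers below are made up, the function name `max_bearable_age` is my own, and I assume accuracy between two measured points changes linearly, which is only an approximation.

```python
def max_bearable_age(curve, min_accuracy):
    """Age at which validation accuracy crosses below the threshold.

    `curve` is a list of (age_in_months, validation_accuracy) pairs.
    Returns None if even the freshest model is below the threshold, and the
    last measured age if accuracy never drops below it.
    """
    points = sorted(curve)
    if points[0][1] < min_accuracy:
        return None
    for (age0, acc0), (age1, acc1) in zip(points, points[1:]):
        if acc1 < min_accuracy:
            # Linear interpolation to where the line crosses the threshold.
            return age0 + (acc0 - min_accuracy) / (acc0 - acc1) * (age1 - age0)
    return points[-1][0]


# Hypothetical measurements: validation accuracy of models of different ages.
age_quality_curve = [(0, 0.90), (3, 0.86), (6, 0.80), (9, 0.72), (12, 0.65)]
BUSINESS_THRESHOLD = 0.75  # assumed minimum acceptable validation accuracy

limit = max_bearable_age(age_quality_curve, BUSINESS_THRESHOLD)
print(f"Retrain before the model is {limit:.1f} months old")
```

With these made-up numbers, the curve crosses the 0.75 threshold between month 6 and month 9, so the model should be retrained before it gets that old.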

Model staleness is a common problem and needs to be detected early in the project. I hope you have learnt a few things about how to identify and solve this issue. In case of any queries, let me know in the comments or by email at riyajshaikh at gmail dot com.

Toasters for AI

Some Background

I was a tester for a brief period in 2011. I was also part of an organization where the quality of products was rigorously tested.

Later, around 2014, I chose Artificial Intelligence (AI) as the field I wanted to excel in, and I am still working in it in 2020.

In the last decade, I have built AI teams, delivered some awesome projects for various clients, and built products: predictive machine learning, recommendation engines, chatbots, deep learning applications, big data applications. I have tried my hand wherever I could.

The need for testing: bringing quality to AI products

The experience I gained taught me a lot about the value of testing these products. While I was building products and delivering projects, I simply couldn’t find a tester who could test my AI products. This was around 2018. I asked a few thought leaders in testing and started working on how we can bring effective quality procedures to AI applications.

I also found that there isn’t much information available on the internet that could help me learn testing for AI. That made my task harder and more challenging. Fortunately, one good piece I found is CD4ML on Martin Fowler’s site, which covers extensively how to build a continuous delivery application for machine learning systems. Please read it; it’s a great article.

Toasters – to be toasted by testers!

The title of this post, Toasters for AI, is an analogy for a scenario where AI can fail. In fact, we can fool an AI into recognising things or behaving in a certain way without explicitly telling it to.

Imagine you are building an image classification or video object detection application using deep neural networks. You have a test dataset on which you can evaluate the accuracy of the model by observing its predictions.

Now you tweak a few images so that, to the model, they look like a toaster, although they aren’t actually toasters. If you pass such an image through the model, it will probably be recognised as a toaster when there is none.

“Toaster” is just an analogy, and a frequent word in AI teams, for saying that a trained machine learning model is not robust yet and can fail on noisy signals or data.

In upcoming blog posts, I will introduce a good number of ways you can toast your AI applications by bringing in (or adding) toasters. I will be writing about AI and the specific things testers need to learn, so that they are well equipped with knowledge before starting a testing project for AI.

I hope you have enjoyed it so far. If you are a tester, do subscribe for future blog posts.

Until then, stay safe and healthy!
