A Guide to Automated Testing of Machine Learning Models — Chapter 1

7 min read · Apr 15, 2023

Machine learning models are becoming increasingly prevalent in many applications, from image and speech recognition to fraud detection and medical diagnosis. As these models become more complex and are applied to more critical tasks, it is important to ensure their correctness and robustness. One way to achieve this is through automated testing.

Automated testing of machine learning models involves testing the model’s behavior under a wide range of scenarios and inputs, to ensure that it is working as intended and that its output is correct and consistent.

This mini-series is divided into multiple parts. The idea is to keep each part short and simple, so that a reader with no prior experience of machine learning can pick it up easily.

In this chapter we focus on testing DecisionTreeClassifier, a classification model based on the decision tree algorithm. It works by recursively splitting the data into subsets based on the values of a feature, until each subset is pure (contains only one class label) or a maximum depth is reached. To make a prediction, it traverses the tree from the root to a leaf node based on the feature values of the input, and the class label associated with that leaf is the predicted class. Before that, let's start with a short introduction to machine learning.

Introduction to Machine Learning and its Types

Machine Learning (ML) is a subfield of artificial intelligence that deals with the design, development, and implementation of algorithms that enable machines to learn and improve from data. Machine learning algorithms are designed to recognize patterns in data, make predictions based on those patterns, and improve their accuracy over time through a process of trial and error.

Types of Machine Learning

Machine Learning can be broadly categorized into three types based on the learning process involved:

Supervised Learning
Unsupervised Learning
Reinforcement Learning

Supervised Learning

It is a type of machine learning in which the algorithm learns to map input data to output labels by using a set of labeled training data. In supervised learning, the algorithm is provided with input data and the corresponding output labels. The algorithm then learns to recognize patterns in the input data and predict the output labels for new, unseen data.

Supervised Learning can be further divided into two types:

a. Classification: In classification, the algorithm learns to classify input data into different categories or classes. For example, a classification algorithm may be trained to identify whether an email is spam or not spam.

b. Regression: In regression, the algorithm learns to predict a continuous output variable based on input data. For example, a regression algorithm may be trained to predict the price of a house based on its size, number of bedrooms, and location.
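To make the difference between the two concrete, here is a minimal sketch using scikit-learn's decision tree estimators. The tiny datasets, feature choices, and variable names are invented for illustration only; the point is that the classifier predicts a discrete label while the regressor predicts a number.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the target is a discrete label (0 = not spam, 1 = spam).
# Hypothetical features: [number of links, number of exclamation marks].
X_emails = [[0, 0], [1, 0], [7, 5], [9, 8]]
y_emails = [0, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X_emails, y_emails)
spam_prediction = clf.predict([[8, 6]])[0]

# Regression: the target is a continuous value (a price).
# Hypothetical features: [size in square metres, number of bedrooms].
X_houses = [[50, 1], [80, 2], [120, 3], [200, 4]]
y_prices = [100_000.0, 160_000.0, 240_000.0, 400_000.0]
reg = DecisionTreeRegressor(random_state=0).fit(X_houses, y_prices)
price_prediction = reg.predict([[85, 2]])[0]

print("spam label:", spam_prediction)
print("predicted price:", price_prediction)
```

Both estimators share the same fit and predict interface; only the kind of target differs.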

Unsupervised Learning

Unsupervised Learning is a type of machine learning in which the algorithm learns to find patterns in the input data without any labeled output data. In unsupervised learning, the algorithm is provided with input data and is asked to find patterns or structure in the data.

Unsupervised Learning can be further divided into two types:

a. Clustering: In clustering, the algorithm learns to group similar data points together based on their similarities. For example, a clustering algorithm may be used to group customers into different segments based on their purchase history.

b. Association: In association, the algorithm learns to discover rules that describe associations between different items in the input data. For example, an association algorithm may be used to discover that customers who buy milk and bread are likely to buy eggs as well.
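The clustering case can be sketched with scikit-learn's KMeans. The customer numbers below are made up for illustration, and the choice of two clusters is an assumption; note that no labels are passed to the algorithm.

```python
from sklearn.cluster import KMeans

# Hypothetical customers: [orders per year, average basket value].
customers = [
    [2, 15], [3, 18], [1, 12],        # occasional, small baskets
    [40, 210], [38, 190], [45, 220],  # frequent, large baskets
]

# Ask for two segments; n_init and random_state keep the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

print(segments)  # one cluster id per customer
```

The cluster ids themselves are arbitrary; what matters is which customers end up sharing an id.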

Reinforcement Learning

Reinforcement Learning is a type of machine learning in which the algorithm learns to make decisions based on rewards and punishments. In reinforcement learning, the algorithm is provided with an environment in which it can take actions and receive rewards or punishments based on its actions. The algorithm then learns to take actions that maximize its rewards over time.
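As a rough illustration of learning from rewards, the sketch below uses a very simple epsilon-greedy strategy in plain Python. The environment, the action names, and the fixed reward values are all hypothetical; real reinforcement learning problems usually involve states and noisy rewards, which are left out here.

```python
import random

# Hypothetical environment: each action always pays the same reward.
REWARDS = {"left": 0.2, "stay": 0.5, "right": 1.0}

def run_agent(steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    actions = list(REWARDS)
    value = {a: 0.0 for a in actions}  # running average reward per action
    count = {a: 0 for a in actions}    # how often each action was taken
    for step in range(steps):
        if step < len(actions):
            action = actions[step]                 # try every action once
        elif rng.random() < epsilon:
            action = rng.choice(actions)           # explore at random
        else:
            action = max(actions, key=value.get)   # exploit the best estimate
        reward = REWARDS[action]
        count[action] += 1
        # Incremental update of the average reward seen for this action.
        value[action] += (reward - value[action]) / count[action]
    return value, count

value, count = run_agent()
best_action = max(value, key=value.get)
print(best_action, count)
```

After trying each action, the agent mostly repeats the one with the highest estimated reward and only occasionally explores the others.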

Action

Enough theory, let's get into action. There are many types of classification models in machine learning, such as Decision Trees (e.g. DecisionTreeClassifier), Random Forests, Support Vector Machines (SVM), Naive Bayes, K-Nearest Neighbors (KNN), Logistic Regression, Gradient Boosting Machines (GBM), and Neural Networks (e.g. Multi-Layer Perceptron).

DecisionTreeClassifier

DecisionTreeClassifier is a class in the scikit-learn (sklearn) Python library that implements the decision tree algorithm for classification problems. It constructs a tree-like model of decisions and their possible consequences, which can be used for making predictions on new data. Decision trees are a type of supervised learning algorithm used for both classification and regression problems.

The decision tree algorithm works by recursively splitting the data based on the feature that maximally reduces the impurity of the target variable. The impurity is typically measured using the Gini impurity or entropy measures. The algorithm continues splitting the data until a stopping criterion is reached, such as reaching a maximum depth or minimum number of samples in a leaf node. The resulting tree can be used to make predictions on new data by following the decision path from the root node to the appropriate leaf node.
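One way to see the resulting splits is to print a fitted tree as text. The sketch below uses scikit-learn's export_text on the iris dataset; limiting the tree to max_depth=2 is an arbitrary choice made here to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree: at most two levels of splits below the root.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a split on one feature; leaves show the predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows the same root-to-leaf path the model takes when it predicts a new sample.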

The DecisionTreeClassifier class in sklearn allows the user to specify various hyperparameters that control the behavior of the algorithm, such as the maximum depth of the tree, the minimum number of samples required to split an internal node, and the splitting criterion (Gini impurity or entropy). The class also includes methods for fitting the model to training data (fit), making predictions on new data (predict), and computing mean accuracy (score). Other metrics such as precision, recall, and F1 score are available as functions in the sklearn.metrics module.
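A short sketch of how these hyperparameters and metrics might be used together is shown below. The specific values (entropy criterion, max_depth=3, min_samples_split=4) are arbitrary choices for illustration, not recommendations; precision, recall, and F1 are taken from sklearn.metrics via classification_report.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters are passed to the constructor (values chosen for illustration).
clf = DecisionTreeClassifier(
    criterion="entropy",   # splitting criterion
    max_depth=3,           # maximum depth of the tree
    min_samples_split=4,   # minimum samples needed to split a node
    random_state=42,
)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # mean accuracy on the test set
report = classification_report(y_test, clf.predict(X_test))
print(accuracy)
print(report)  # per-class precision, recall, and F1 score
```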

Automated testing of a DecisionTreeClassifier machine learning model typically involves the following steps:

Split the data into training and testing sets: This step involves randomly dividing the available data into two sets — a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate the model’s performance.
Train the model: In this step, the model is trained on the training data using the fit method of the DecisionTreeClassifier class.
Make predictions on the test data: Once the model is trained, it can be used to make predictions on the test data using the predict method.
Evaluate the model’s performance: The performance of the model is evaluated by comparing its predictions on the test data to the actual labels of the test data. The most common metric used to evaluate the performance of a classification model is accuracy.
Automate the process: To automate this process, you can use a testing framework such as unittest or pytest to write test cases that automatically split the data into training and testing sets, train the model, make predictions, and evaluate the model's performance.

Code

import pytest
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

@pytest.fixture(scope='module')
def model():
    # Load iris and hold out 20% of the 150 samples as a test set.
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
    # Train the classifier once; scope='module' shares it across all tests.
    dtc = DecisionTreeClassifier()
    dtc.fit(X_train, y_train)
    return dtc, X_test, y_test

def test_accuracy(model):
    # The model should classify more than 90% of the test samples correctly.
    dtc, X_test, y_test = model
    y_pred = dtc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    assert accuracy > 0.9

def test_shape(model):
    # 4 features per sample; 20% of 150 samples gives 30 test rows.
    dtc, X_test, y_test = model
    assert X_test.shape[1] == 4
    assert y_test.shape[0] == 30

if __name__ == '__main__':
    # Pass this file explicitly so pytest collects it whatever its name is.
    pytest.main([__file__])

Output

model = (DecisionTreeClassifier(), array([[6.1, 2.8, 4.7, 1.2],
       [5.7, 3.8, 1.7, 0.3],
       [7.7, 2.6, 6.9, 2.3],
    ... 1.6, 0.2]]), array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0]))

    def test_shape(model):
        dtc, X_test, y_test = model
>       assert X_test.shape[1] == 3
E       assert 4 == 3

ML_Automated _Testing_DecisionTreeClassifier.py:24: AssertionError
=========================================================================================== short test summary info =========================================================================================== 
FAILED ML_Automated _Testing_DecisionTreeClassifier.py::test_shape - assert 4 == 3
======================================================================================== 1 failed, 1 passed in 24.42s =========================================================================================

In this example, we use pytest to write two test functions: test_accuracy and test_shape. The model fixture loads the iris dataset, splits it into training and testing sets, creates an instance of the DecisionTreeClassifier class, and fits the classifier to the training data.

The test_accuracy function uses the trained classifier to predict the target variable for the test data, calculates the accuracy of the model's predictions using the accuracy_score function, and asserts that the accuracy is greater than 0.9.

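The test_shape function checks the shape of the test split: 4 features per sample and 30 samples, which is 20% of the 150 rows in the iris dataset. The failing run shown in the output comes from a version of the test that asserted X_test.shape[1] == 3, which is why pytest reports assert 4 == 3. With the value 4 used in the listing, the check passes.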
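In the same spirit, further checks could be added. The sketch below is not part of the original example: the helper train and both test names are hypothetical. It checks that predictions only contain known iris labels and that training with a fixed random_state gives the same predictions twice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def train(random_state=42):
    # Hypothetical helper: same split as the example, with a seeded classifier.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    clf = DecisionTreeClassifier(random_state=random_state)
    clf.fit(X_train, y_train)
    return clf, X_test, y_test

def test_predictions_are_known_labels():
    # Every prediction must be one of the three iris classes.
    clf, X_test, _ = train()
    assert set(clf.predict(X_test)) <= {0, 1, 2}

def test_training_is_reproducible():
    # Two models trained with the same seed should agree on every sample.
    clf_a, X_test, _ = train()
    clf_b, _, _ = train()
    assert np.array_equal(clf_a.predict(X_test), clf_b.predict(X_test))
```

Because the function names start with test_, pytest collects them like the two tests in the example.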

The script calls pytest.main() when it is executed directly. The tests can also be run by passing the file to the pytest command in a terminal.

The pytest example is available on GitHub to experiment with at https://github.com/toniramchandani1/MLTestingPython

Gini impurity

Gini impurity is a measure of the impurity or randomness of a set of labels or classes in a classification problem. It is commonly used in decision trees and other machine learning algorithms to evaluate the quality of a split in the data.

The Gini impurity is defined mathematically as follows:

Gini impurity = 1 - ∑ (p_i)²

where p_i is the proportion of samples in the i-th class in a particular node or subset of the data. The Gini impurity is 0 for perfect purity (all samples in the node belong to the same class) and reaches its maximum of 1 - 1/k when the samples are evenly distributed across k classes. That maximum is 0.5 for two classes and approaches 1 only as the number of classes grows.
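The formula is easy to check by hand with a few lines of plain Python. The function name gini_impurity is chosen here for illustration; it simply applies 1 - ∑ (p_i)² to a list of labels.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i ** 2) over the classes present in labels."""
    total = len(labels)
    if total == 0:
        return 0.0  # treat an empty node as pure
    class_counts = Counter(labels).values()
    return 1.0 - sum((n / total) ** 2 for n in class_counts)

pure = gini_impurity(["a", "a", "a", "a"])                 # one class only
two_even = gini_impurity(["a", "a", "b", "b"])             # 50/50 split
three_even = gini_impurity(["a", "b", "c", "a", "b", "c"]) # three equal classes

print(pure, two_even, three_even)
```

A pure node gives 0, an even two-class node gives 0.5, and an even three-class node gives 1 - 1/3, which is about 0.667.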

When building a decision tree, the goal is to find the splits in the data that minimize the Gini impurity of the resulting subsets, with each subset weighted by its share of the samples. A split that results in subsets with lower Gini impurity is considered to be more informative and useful for making accurate predictions. The reduction in Gini impurity due to a split is commonly used as a criterion for selecting the best feature to split on at each node of the tree.
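To illustrate how a split is scored, the sketch below compares two candidate splits of the same six labels. The helper names and the toy labels are assumptions made for this example; the reduction is computed as the parent's impurity minus the size-weighted impurity of the two subsets.

```python
from collections import Counter

def gini(labels):
    # 1 - sum(p_i ** 2) over the classes in this subset.
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_reduction(parent, left, right):
    # Parent impurity minus the weighted impurity of the two children.
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]

# Candidate A separates the classes perfectly.
good_gain = gini_reduction(parent, [0, 0, 0], [1, 1, 1])

# Candidate B leaves both subsets mixed.
poor_gain = gini_reduction(parent, [0, 0, 1], [0, 1, 1])

print(good_gain, poor_gain)
```

The perfect split removes all of the parent's impurity (a reduction of 0.5), while the mixed split removes only a small part of it, so a tree builder using this criterion would prefer candidate A.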

This is a deliberately basic example to open the series. As we progress, the examples will become more complex and more practical. Stay tuned!