Using Machine Learning to Accelerate the Creation of Test Data for Effective Software Testing

Poorvi Ladha · Published in DBS Tech Blog · 7 min read · Aug 14, 2023


How DBS leveraged machine learning in the data generation process to improve the efficiency and effectiveness of software testing, reducing time spent on test data creation by more than 4x

By Poorvi Ladha & Ankita Das

The effectiveness of software testing rests on three crucial factors:

Firstly, the test strategy needs to include a clearly defined scope, process, and execution plan, along with metrics and exit criteria. This ensures that the software meets the required quality standards.

The second factor is the test scenario, which should encompass aspects such as positive, negative, boundary values, and end-to-end test scenarios.

Lastly, accurate and complete test data are essential to ascertain the effectiveness of the overall testing process.

Collectively, these three components determine the efficacy of software testing.

While the first two factors are easier to define, the last one can be slightly trickier to obtain. Additionally, the implications of inaccurate and incomplete test data are many, including:

· Incorrect validation results

· Critical defects not detected early on

· Toil spent investigating defects

· Descoping of test scenarios, which increases the risk of defects leaking into higher test environments or, worse, the production/live environment.

Taking these into account, it is essential to prioritise test data management early in the project lifecycle to ensure accurate and complete test data, be it production or synthetic.

To Use Production Data Copy, or Synthetic Test Data?

As production data can’t be used to simulate negative and boundary value scenarios, a workaround is to create synthetic business-like test data via machine learning to deliver high test accuracy and completeness.

Pros and cons of using production data copy versus synthetic test data

Of course, production data copies aren’t without their merits. That said, while production data is useful for running business-like simulations, it is not effective at surfacing failures caused by bad data (negative values, edge cases) sent from upstream applications.

Benefits of Using Machine Learning for Quicker Test Data Creation

In the case of synthetic data, machine learning models can be trained to learn patterns and characteristics from existing data and generate business-like test data that is representative of the production environment.

Separately, the machine learning model can also be trained to identify patterns and relationships in data. These patterns can then be used to generate test cases that cover a wide range of scenarios while uncovering negative and edge cases.

These models are scalable and have the ability to handle test data requirements as applications expand. They can be retrained to adapt to changing requirements and evolving systems, to create test data that aligns with the new features or functionalities.

By providing a continuous stream of test data, the testing process is also accelerated, enabling faster iterations, shorter development cycles, and quicker time-to-market.

These are just some of the numerous benefits of creating machine-learning-driven test data, including the eight below.

DBS’ Use Case: Our Machine Learning Model That Reduced Time Spent on Test Data Creation by More than 4x

The quality assurance team of the bank’s Middle Office Technology Legal, Compliance Secretariat (MOT-LCS) platform developed a machine learning model that creates test data mimicking real-life data, allowing it to validate systems with complex business rules.

Manual test data creation was not only time-consuming, taking an average of eight hours for each regression cycle, but it also significantly reduced test coverage, which could result in defect slippage to production.

Due to the lack of business-like and diverse test data, it was not possible to run production-like simulations in the test environment to assess the behaviours of new features or their impact on existing features.

Through machine learning, the toil of manual test data creation was eliminated; each regression cycle now takes less than two hours. The new model can not only generate new test data, but also label existing data in the test environment, which can then be used for validation.

The model uses a classifier algorithm to predict business rules, based on a set of input variables or features and rule thresholds used for training the model. Different models were evaluated on metrics such as accuracy, precision, recall, and F1 score.

The following models were trained and evaluated using k-fold cross-validation:

1. Logistic Regression

2. Linear Discriminant Analysis

3. K-Nearest Neighbours

4. Naïve Bayes

5. Support Vector Machine

6. Gradient Boosting Classifier

#Spot-Check Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('GB', GradientBoostingClassifier()))
Fig 1. Performance of different models based on mean, standard deviation, and accuracy using box plot
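
The spot-check list above is typically driven by a cross-validation loop that produces the scores summarised in Fig 1. Below is a minimal sketch, assuming 10-fold cross-validation and that the training features and labels are already available as X_train and Y_train; the fold count and random seed are illustrative choices, not taken from the article.

#Evaluate each model with k-fold cross-validation (illustrative sketch)
from sklearn.model_selection import StratifiedKFold, cross_val_score
import matplotlib.pyplot as plt

results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    cv_scores = cross_val_score(model, X_train.astype(float), Y_train,
                                cv=kfold, scoring='accuracy')
    results.append(cv_scores)
    names.append(name)
    # Mean reflects accuracy; standard deviation reflects consistency across folds
    print('%s: mean=%.3f std=%.3f' % (name, cv_scores.mean(), cv_scores.std()))

plt.boxplot(results, labels=names)   # box plot comparable to Fig 1
plt.show()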

All models were evaluated using a dataset of 13,000 unique input records. Each record was labelled with the type of business rule expected to be breached, based on the input variables and rule thresholds.

The dataset was split using a validation size of 0.2; 80% of the data was used for training the model, and the remaining 20% was used for validation. Feature engineering techniques such as encoding reference codes and converting rule thresholds to categorical values were applied to transform raw data into a format suitable for use by the machine learning algorithm.
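
As a rough sketch of those preparation steps, assuming the labelled records sit in a CSV file and using hypothetical column names (reference_code, rule_threshold, business_rule) rather than the actual MOT-LCS schema:

#Feature engineering and train/validation split (illustrative sketch)
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv('input_records.csv')                                        # hypothetical extract of the labelled records
df['reference_code'] = LabelEncoder().fit_transform(df['reference_code'])    # encode reference codes
df['threshold_band'] = pd.cut(df['rule_threshold'], bins=3, labels=False)    # convert rule thresholds to categorical bands

X = df.drop(columns=['business_rule', 'rule_threshold'])                     # input variables / features
Y = df['business_rule']                                                      # label: business rule expected to be breached
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, random_state=7)                                     # 80% training, 20% validation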

The mean and standard deviation of the cross-validation scores across different folds of the dataset were measured to determine the accuracy and consistency of each model. A higher mean score indicated better performance, while a lower standard deviation indicated that the model’s performance was consistent across different folds. The Gradient Boosting classifier was selected since it demonstrated the best performance on all metrics evaluated in the k-fold cross-validation.

#Make predictions on the validation dataset
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

gb = GradientBoostingClassifier()
gb.fit(X_train.astype(float), Y_train)
predictions = gb.predict(X_validation.astype(float))

print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
Fig 2. Performance of Gradient Boosting based on precision, F1-score, accuracy, and recall

After training, the model was run on existing and new input datasets to predict which records would participate in rule detection. For this use case, based on the rule predicted by the model, the output was generated as SQL insert queries, which were used to directly patch records in the database.

Output formats can be changed based on the input mechanism used by the application. The model can also generate synthetic test data for rules/labels that are not found in the test data used to predict the output.
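
As a rough illustration of that output step, the sketch below turns model predictions into SQL insert statements; the table and column names are hypothetical, not the actual database schema:

#Generate SQL insert queries from model predictions (illustrative sketch)
predicted_rules = gb.predict(X_new.astype(float))        # X_new: a hypothetical, pre-processed new input dataset

insert_queries = []
for record, rule in zip(X_new.to_dict(orient='records'), predicted_rules):
    insert_queries.append(
        "INSERT INTO test_rule_data (reference_code, threshold_band, predicted_rule) "
        "VALUES (%s, %s, '%s');" % (record['reference_code'], record['threshold_band'], rule)
    )

with open('patch_records.sql', 'w') as f:                # queries used to patch records in the test database
    f.write('\n'.join(insert_queries))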

For more business-like test data, the model can be trained using production data copies selected from diverse segments of the data population if the input features are unmasked. Similarly, it can be trained to generate synthetic test data used for simulating negative and boundary value scenarios.
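
A minimal sketch of how boundary and negative records could be synthesised around a known rule threshold; the threshold value and field names below are illustrative, not actual business rules:

#Synthesise boundary and negative test records around a rule threshold (illustrative sketch)
RULE_THRESHOLD = 10_000                                                     # hypothetical threshold for a single business rule

boundary_records = [
    {'txn_amount': RULE_THRESHOLD - 1, 'expected_outcome': 'no breach'},   # just below the boundary
    {'txn_amount': RULE_THRESHOLD,     'expected_outcome': 'breach'},      # on the boundary
    {'txn_amount': RULE_THRESHOLD + 1, 'expected_outcome': 'breach'},      # just above the boundary
]

negative_records = [
    {'txn_amount': -1,   'expected_outcome': 'rejected'},                  # invalid negative amount
    {'txn_amount': None, 'expected_outcome': 'rejected'},                  # missing value from an upstream feed
]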

When to Retrain the Model

The model should be retrained as and when new features are added, and when the application evolves. In the case of applications that are guided by regulatory requirements, the model should be retrained based on jurisdictions and changes in regulations, in addition to the implementation of new features, or the removal of obsolete ones.

Below are some examples of when the model should be retrained for a rule-based system:

1. Dataset changes: If the dataset used for training the model changes significantly, such as adding new reference codes or changing rule thresholds, the model may need to be retrained to reflect these changes.

2. Rule changes: If the rules change – be it the addition of a new rule, or changes to an existing rule – the model may need to be retrained to comply with the new rules and to detect new patterns.

3. Performance issues: If the model is not performing well – such as generating too many false positives or false negatives – based on the current dataset, updates would be required to improve its accuracy.
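
One way to operationalise the third trigger is to periodically score the model against freshly labelled records and flag it for retraining once accuracy falls below an agreed floor; the sketch below assumes such a floor, and the 0.9 value is purely illustrative:

#Flag the model for retraining when accuracy degrades (illustrative sketch)
from sklearn.metrics import accuracy_score

RETRAIN_THRESHOLD = 0.9                                   # illustrative accuracy floor, to be agreed with the QA team

def needs_retraining(model, X_recent, Y_recent):
    """Return True if accuracy on recently labelled records falls below the floor."""
    recent_accuracy = accuracy_score(Y_recent, model.predict(X_recent.astype(float)))
    return recent_accuracy < RETRAIN_THRESHOLD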

Conclusion

Test data generated through such machine learning models can simplify functional and regression testing within an application’s boundaries, and can be adapted to run integrated end-to-end tests with other downstream applications.

Doing so not only results in the early detection of defects by shifting left but also reduces the dependency on long cycles to download and regularly refresh production data copies for integration tests.

Poorvi is a senior vice president at DBS and spends most of her time helping teams improve the quality of their software products.

Ankita Das is a senior associate at DBS working as an automated quality assurance specialist. She is also an avid explorer of the potential of machine learning in enhancing QA practices.

#QualityAssurance #machinelearning #testdata #QA #softwaretesting #testautomation #MOTtheplacetobe
