Data-Centric AI Case Study: Improving Model Accuracy by Only Cleaning Data

Active Learning Sep 06, 2022

When we first start in machine learning, we're excited at the prospect of creating models and algorithms. We're drawn in by the seemingly magical features these models can unlock in our applications.

Then, relatively soon after we begin, we hear or read something along the lines of:

Data science is 90% collecting, cleaning up, and labeling data and 10% creating algorithms and training models.

Wait, what? That can't be right. Data science is supposed to be fun and glamorous! It's the new(ish) hotness! What's the point of machine learning, if we have to do so much manual work?

Paying close attention to the quality of your data is the main principle behind data-centric AI. We keep hearing more and more stories about inaccuracies in open datasets and how correcting them improves the quality of the models trained on them.

This blog post is part of a series on data-centric AI and active learning. In our previous article, we gave an overview of active learning techniques and their advantages.

In this post, we’re going to show how important data quality is. We'll be training and comparing four squirrel detector models. Each successive model will incorporate another round of data clean-up. We're going to see for ourselves what a difference data-centric AI really makes!

Why squirrels? Because who can resist a cute, cuddly squirrel?

Image of a cute squirrel standing in grass — Love me!

Model 0 - Lazy Data

For our first (zero-th?) model, we need a baseline. We're going to scrape DuckDuckGo for images of squirrels using a simple script. Each time this script is run, it will download up to 1,000 images for the query we give it.

We're going to try three queries, meaning we'll have at most 3,000 images, with which to train a model. The three queries will be:

squirrels - 557 images
backyard squirrels - 732 images
field squirrels - 473 images

Three images of squirrels returned when querying "squirrels" on DuckDuckGo — Examples returned with the **squirrels** query

Three images of squirrels returned when querying "backyard squirrels" on DuckDuckGo — Examples returned with the **backyard squirrels** query

Three images of squirrels returned when querying "field squirrels" on DuckDuckGo — Examples returned with the **field squirrels** query

Why don't we have 1,000 images for each? When scraping images from DuckDuckGo's image search, many images may have broken links, temporarily down, or animated gifs. The latter we just remove for simplicity. So it’s not 3,000 images, but who cares? That's enough to start!

Now we’ve got data, but no labels. If we were training a classifier, that would be easy… but we want to train an object detector. We need bounding boxes. Unfortunately, that can be a lot of boring, mundane work.

However, since we’re smart, resourceful software developers, what if we just run our images through a model pre-trained on the COCO dataset?

I know what you’re thinking. COCO doesn’t have a squirrel label. If it did, we wouldn’t have to train our own squirrel detector. Look, we all know that… but does a YOLOv5 model, pre-trained on COCO, know that it’s not supposed to know what a squirrel is?

Interestingly enough, if you run images of squirrels through a model trained on COCO, it kinda gets it right. It doesn’t know what a squirrel is, per se, but it does recognize their “animal-ness”.

Six images of squirrels overlayed with bounding boxes and labels. The labels read either "sheep", "bear", "bird", or "dog", but never "squirrel" — Obviously a squirrel is nothing more than two sheep in a trench coat. Or a couple of bears.

All we have to do for our initial data labeling:

Run inference on all our scraped images
Map the animal categories (bird, bear, sheep, etc…) to a single class

To run inference, the YOLOv5 repository has a handy detect.py, we can use to get our initial set of labels. We can run this script with the following command:

> python detect.py --weights yolov5s.pt --source /path/to/images --save-txt

These parameters are:

weights - starting with a pretrained small version of YOLOv5
source - path to a single image or a directory of images
save-txt - save the inference results in YOLO-style text files (one text file per image)

After running, the results are under yolov5/runs/detect/exp/labels/.

Check out convert_labels.py for an example of how to map the classes in these YOLO-style label files.

After following the two simple steps, we have our initial dataset. We can now train our baseline model!

If you’re interested in training your own model, you can find this dataset under the squirrel0 branch in the repository.

We'll use the following command to train our model:

> python train.py --weights yolov5s.pt --seed 42 --batch-size 32 --imgsz 640 --data squirrels.yaml --epochs 300

The parameters from the above example are:

weights - starting with a pretrained small version of YOLOv5
seed - give all runs the same seed to make model comparisons more valid
batch-size - 32 images per batch
imgsz - use 640x640 pixels as the input image size
data - yaml file describing the location of our dataset
epochs - run for a maximum of 300 epochs

These models were trained on an NVIDIA GeForce RTX 2080 Ti, which has 11GB of VRAM. However, it can also easily be run on a CPU, but will take a few minutes per epoch to run.

Now we wait for training to finish.

Animated cartoon gif of Calvin staring at a clock in school

Looking at the results, from our training and validation sets we see:

Dataset	Precision	Recall	mAP@0.5	mAP@0.5:0.95
train	0.776	0.739	0.797	0.514
val	0.784	0.737	0.797	0.504

Here we are looking at 4 metrics:

Precision - measures how many of the predictions are relevant objects
Recall - measures how many of the relevant objects are correctly predicted
mAP@0.5 - mean average precision measured with an IoU (intersection over union) of 0.5
mAP@0.5:0.95 - mean average precision averaged for IoUs ranging from 0.5 to 0.95 in 0.05 increments (i.e. calculate mAP@0.5, mAP@0.55, mAP@0.6, … mAP@0.95 and average the results)

We have our start… It’s not bad at all, but there’s room for improvement!

Model 1 - Remove Obviously Bad Images

Ok... when we take a cursory glance through our downloaded squirrel images, we immediately spot a problem.

Eight example images that are either not squirrels or cartoon squirrels — Yeah... those aren't (real) squirrels.

The problem with actual images of birds, dogs, and other objects incorrectly labelled as such is that our script converts those labels to squirrel labels. That’s going to certainly confuse our model when it comes time to train!

Cartoon images of squirrels are also not super helpful when trying to detect real-life squirrels.

An easy clean up step would be to go through and remove images that are low resolution, obviously not squirrels, or cartoon images of squirrels.

Even though going through 1,762 images sounds like a lot of work, it goes by pretty quickly when you're just looking for these really bad ones. It shouldn’t take too long.

Animated gif of Mr. Bean standing on the side of a road next to a field impatiently checking his watch

After we've removed those images, we have a dataset with 427 images. We just deleted about 76% of our dataset! That was a ton of bad data! Maybe we shouldn't implicitly trust search engines? 🤔

This dataset can be found in the squirrel1 branch of the repository.

It's time to train our 2nd ML model, which should go much more quickly now that the dataset is smaller.

Animated gif of an animated sloth from Zootopia slowly turning her head. Text at the bottom reads "any day now..."

Let’s look at how these changes affected our model’s accuracy.

Model	Dataset	Precision	Recall	mAP@0.5	mAP@0.5:0.95
0	train	0.776	0.739	0.797	0.514
0	val	0.784	0.737	0.797	0.504
1	train	0.819	0.754	0.804	0.469
1	val	0.853	0.725	0.810	0.468

You can see that we improved precision by a good amount, but had a small regression in our recall metric. Since the validation sets are the same, this means we had less true positives but also less false positives.

One potential explanation may lie in how YOLOv5 initially labeled some squirrels. If you look at the image above with examples of how a COCO-trained YOLOv5 model interprets squirrels, occasionally, it would add a bounding box for the body and a second for the tail. These would count as two relevant objects.

If the data removed helped our model focus on just the squirrel as a whole, then it may only return a single bounding box instead of two. When compared to a ground truth of two bounding boxes, it would decrease the recall and the mean average precision at higher percentage values (mAP@0.75 for instance). This in turn, would decrease our mAP@0.5:0.95 metric.

Model 2 - Remove Duplicate Images

That wasn't too bad, was it? But there's more work to be done! You may have noticed, our dataset has a bunch of duplicated images in it. Duplicated data can be bad for a couple of reasons:

Repeated images causes the model training to focus unevenly on those samples compared to other unique samples.
Duplicate images can end up in your validation or test set, causing the model's accuracy to be artificially inflated.

Now, being software developer looking to automate as much as humanly (computerly?) possible, we don’t want to manually find the duplicates.

We could try using diff, but that will only work if two files are exactly alike. Images can be duplicates without having the exact same bits in their files. For instance, an image may be recompressed or scaled down.

Luckily, we can turn to machine learning to help us! 🤖

We can use a pretrained ML model to extract the feature embeddings for each image and then compare those embeddings using a cosine similarity function. Cosine similarity tells us the cosine of the angle between two vectors (or feature embeddings). The closer the internal angle is to zero, the more similar they are, since \(\boldsymbol{cos(0) = 1}\).

Using a script like dedup.py we can identify duplicates and then remove them after verifying. This script uses EfficientNet V2 (small) to extract feature embeddings for the images and a function from scikit-learn to calculate a matrix of cosine similarities between all embeddings.

After running this script and deleting the duplicates, we now have 309 images left in our dataset.

This dataset can be found in the squirrel2 branch of the repository.

Time to train our 3rd model!

Animated gif of a cat batting the hands of a clock

And after that’s done, our results look like:

Model	Dataset	Precision	Recall	mAP@0.5	mAP@0.5:0.95
0	train	0.776	0.739	0.797	0.514
0	val	0.784	0.737	0.797	0.504
1	train	0.819	0.754	0.804	0.469
1	val	0.853	0.725	0.810	0.468
2	train	0.965	0.710	0.856	0.542
2	val	0.874	0.768	0.844	0.526

This time all metrics improved across the board! That’s what we like to see!

But can we do better?!?

Model 3 - Fixing and Fine-tuning Boxes

Ok. We’ve taken care of the easiest forms of data clean up. The only thing left is… manual labor 😱

This will be, by far, the most time consuming part of the data cleanup process… but it will also be worth it. Luckily, during the first couple of stages, we removed 1,453 images, so we have significantly less data to manually clean up! That sounds like a big win.

To help us clean up the data, we’re going to use Label Studio, an open source data labeling tool. It has a nice web-based interface that allows us to quickly add, delete, and tweak bounding boxes. Additionally, when we use DagsHub’s integrated version, we:

don’t need to move the data to a 3rd-party platform
can easily export these labels into a format compatible with YOLO (or others!)

Importing the YOLO-style labels into LabelStudio is a bit of a manual process right now. But if you look at yolo2labelstudio.py you can get a feel for how to do it fairly easily.

We’ll run that script on both the training and validation datasets and then push the changes to our DagsHub repository. Then once we’ve created a Label Studio workspace, we can go through and graphically view and change our annotations.

DagsHub's Label Studio integration running in a web browser, showing an image of a squirrel walking on a litter bin, annotated with a bounding box

After fixing incorrect bounding boxes, merging overlapping ones, and adding missing labels, we can export our YOLO-style labels.

Image showing a squirrel in a tree with an incorrect bounding box extending further than it should

Same image showing a squirrel in a tree with the correct bounding box — Before/after example of a corrected bounding box

This dataset can be found in the squirrel3 branch of the repository.

Now we're ready to train our 4th and final ML model!

Animated gif in black and white of a young boy drumming his fingers with text reading "still waiting"

From here we can see some much improved results:

Model	Dataset	Precision	Recall	mAP@0.5	mAP@0.5:0.95
0	train	0.776	0.739	0.797	0.514
0	val	0.784	0.737	0.797	0.504
1	train	0.819	0.754	0.804	0.469
1	val	0.853	0.725	0.810	0.468
2	train	0.965	0.710	0.856	0.542
2	val	0.874	0.768	0.844	0.526
3	train	0.958	0.826	0.920	0.709
3	val	0.964	0.812	0.917	0.693

Once again, all the metrics for our validation set got better. These are the kinds of model improvements we dream about! The large improvement to mAP@0.5:0.95 is especially welcome. It hints to the fact that our bounding boxes became significantly more accurate over the previous runs.

Data-centric AI FTW!

What We Have Learned

Despite often being tedious, mundane, and grueling, data clean up is extremely important. The quality of our models greatly depends on it. This is the main takeaway from data-centric AI.

We've also seen that we can write some simple scripts to help us in our herculean clean-up efforts. So although it wasn’t all manual, it is important to be smart about when and how you automate data clean up. Our last step, which was also the most manual one by far, provided the most improvement.

In a future blog post, we're going to delve further into data-centric AI by improving our squirrel model using Active Learning. We'll even use our 4th model from this blog post to begin the active learning training cycle.

Stay tuned!

If you have any comments or questions, feel free to reach out!

Recommended for you

Active Learning

Active Learning Your Way to Better Models

2 years ago • 10 min read

Active Learning

Tutorial: Build an Active Learning Pipeline using Data Engine

9 months ago • 10 min read

Active Learning

Data-Driven AI in Manufacturing: Casting Defect Identification using Active Learning

a year ago • 12 min read

Common Pitfalls To Avoid When Using Vector Databases

How to choose MLOps tools (MLOps from first principles)

🍪 Machine Learning in the cookie-less era with Uri Goren

Top Computer Vision Generative Models in 2024