How to create a Naive Bayes product classification model

Picture by Daniel Romero, Unsplash.


Assigning products to the right categories is crucial to helping customers find what they’re looking for, so online marketplaces commonly use product classification models to make sure that products listed by third parties end up in the correct category.

Product classifiers are also really useful for competitor analysis projects, since they allow you to compare the products sold per category across retailers by mapping them all to a single information architecture or category tree.

Product classification models typically take the name of the product, which differs across retailers, and predict which product category it should be assigned to based on a set of labelled training data. In this project, we’ll use a Multinomial Naive Bayes model and apply Natural Language Processing (NLP) techniques to predict product categories from product names.

Load the packages

For this project we’ll need to load Pandas, NumPy and a range of Scikit-Learn modules, including CountVectorizer for turning our text data into a numeric form, plus the MultinomialNB model and some functions for assessing model performance.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

Load the data

The dataset I’m using is a PriceRunner dataset, which is ideal for product classification problems. It includes 35,311 product titles from various online retailers, which map to 12,849 standardised product names (the cluster_label column) assigned to 10 different product categories.

The names the vendors have used for the products are all slightly different. Our aim is to predict which category_label a product will have from its product name.

df = pd.read_csv('pricerunner_aggregate.csv', 
                names=['product_id','product_title','vendor_id','cluster_id',
                       'cluster_label','category_id','category_label'])
df.head()

	product_id	product_title	vendor_id	cluster_id	cluster_label	category_id	category_label
0	1	apple iphone 8 plus 64gb silver	1	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones
1	2	apple iphone 8 plus 64 gb spacegrau	2	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones
2	3	apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...	3	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones
3	4	apple iphone 8 plus 64gb space grey	4	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones
4	5	apple iphone 8 plus gold 5.5 64gb 4g unlocked ...	5	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones

Printing the value_counts() of the category_label column we’re trying to predict shows that we have 10 different classes present, each of which has between 2,212 and 5,501 product titles in the dataset. This is good because every class has plenty of examples to learn from, although the classes are not perfectly balanced.

df.category_label.value_counts().to_frame()

	category_label
Fridge Freezers	5501
Mobile Phones	4081
Washing Machines	4044
CPUs	3862
Fridges	3584
TVs	3564
Dishwashers	3424
Digital Cameras	2697
Microwaves	2342
Freezers	2212

Preprocess the data

Before the model can classify text we need to transform it into a numeric form. There are various steps that can be used here, such as the removal of stopwords, lemmatization, and Porter stemming, as well as alternative weighting schemes, such as Term Frequency-Inverse Document Frequency (TF-IDF).

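However, the basic CountVectorizer approach gives good results out of the box. It splits each product title into tokens, builds a vocabulary of every distinct token in the dataset, and counts how often each token appears in each title. This produces the bag of words matrix the model needs, with one row per title and one column per token. The result is a sparse matrix, which the code below converts to a dense array. MultinomialNB also accepts sparse input directly, so the conversion is optional and uses much more memory on larger datasets.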

count_vec = CountVectorizer()
bow = count_vec.fit_transform(df['product_title'])
bow = bow.toarray()
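
To make the bag of words idea concrete, here is a minimal sketch on three made-up product titles (the titles and the toy_ variable names are illustrative, not taken from the PriceRunner data). It shows the vocabulary that CountVectorizer builds and the per-title counts. Note that with the default settings, single-character tokens such as "8" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up product titles, purely for illustration
toy_titles = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus 64gb gold",
    "bosch fridge freezer silver",
]

toy_vec = CountVectorizer()
toy_bow = toy_vec.fit_transform(toy_titles)  # sparse matrix, one row per title

# vocabulary_ maps each token to its column index; sort tokens by that index
toy_vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(toy_vocab)

# Dense view for display only: each cell is the count of a token in a title
print(toy_bow.toarray())
```

Each row of the printed matrix lines up with one title and each column with one token in the vocabulary, which is the representation the classifier is trained on.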

Create test and train data

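To create the test and train data we’ll assign the bag of words to X as our feature set, and then pass the category_label class column to y. Passing these into the train_test_split() function will create our test and train datasets. We’re assigning 30% of the data to the test group and using the stratify argument so that each category appears in the same proportion in both the training and test sets. No random_state is set, so the split, and therefore the scores below, will vary slightly from run to run.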

X = bow
y = df['category_label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
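
As a rough illustration of what stratify does, the sketch below splits a hypothetical set of 100 labels (80 of class "a", 20 of class "b") and checks the class share in each split. The toy arrays and the random_state value are assumptions added for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class "a" and 20 of class "b"
y_toy = np.array(["a"] * 80 + ["b"] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# With stratify, both splits keep the original 80/20 class ratio
train_share_b = (y_tr == "b").mean()
test_share_b = (y_te == "b").mean()
print(train_share_b, test_share_b)
```

Setting random_state, as done here, is also one way to make a split repeatable between runs.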

Fit the model

Now that the data have been prepared, we’ll fit the Multinomial Naive Bayes model.

There are a few hyperparameters that can be passed to this model, such as the alpha smoothing parameter, but we’ll just fit it with the defaults for now.

model = MultinomialNB().fit(X_train, y_train)
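
A fitted model can also classify titles it has never seen, as long as they are transformed with the same fitted vectorizer. The sketch below shows this on a handful of made-up titles. The data and the vec, clf and new_titles names are hypothetical, and alpha=1.0 is simply the default written out explicitly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, purely for illustration
train_titles = [
    "apple iphone 8 plus 64gb silver",
    "samsung galaxy s9 64gb black",
    "bosch integrated fridge freezer white",
    "beko frost free fridge freezer silver",
]
train_labels = ["Mobile Phones", "Mobile Phones",
                "Fridge Freezers", "Fridge Freezers"]

vec = CountVectorizer()
X_small = vec.fit_transform(train_titles)  # sparse input is accepted as-is

# alpha is the additive smoothing parameter; 1.0 is the default
clf = MultinomialNB(alpha=1.0).fit(X_small, train_labels)

# New titles must go through transform(), not fit_transform(),
# so they are mapped onto the vocabulary learned during training
new_titles = ["apple iphone 8 gold 64gb", "samsung fridge freezer black"]
X_new = vec.transform(new_titles)
new_predictions = clf.predict(X_new)
print(new_predictions)
```

Tokens that were not in the training vocabulary, such as "gold" here, are ignored at prediction time.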

Assess performance

Once the model has been fitted to the data, we can make our predictions on the X_test dataset and then calculate the accuracy score and F1 score. This gives us a decent score on the test data, with 94.96% accuracy and a macro-averaged F1 score of 0.945.

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.9495941098735133

f1_score(y_test, y_pred, average="macro")

0.9450395183071821
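
The average="macro" argument gives every class equal weight regardless of its size, which is why the F1 score can sit below the accuracy when a smaller class performs poorly. A small sketch with made-up labels shows how macro and weighted averaging differ:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: class "a" has four examples, class "b" has two
y_true_toy = ["a", "a", "a", "a", "b", "b"]
y_pred_toy = ["a", "a", "a", "a", "a", "b"]  # one "b" is misclassified as "a"

acc = accuracy_score(y_true_toy, y_pred_toy)

# Macro: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")

# Weighted: per-class F1 scores weighted by the number of true examples
weighted_f1 = f1_score(y_true_toy, y_pred_toy, average="weighted")

print(acc, macro_f1, weighted_f1)
```

Here the per-class F1 scores are 8/9 for "a" and 2/3 for "b", so the macro average (7/9) is pulled down further by the weaker class than the weighted average (22/27).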

Examine the predictions

To check how well the model did in a bit more detail we can examine the precision, recall, and F1 score for each of the classes using a classification report.

print(classification_report(y_test, y_pred))

                  precision    recall  f1-score   support

            CPUs       1.00      1.00      1.00      1159
 Digital Cameras       0.99      0.99      0.99       809
     Dishwashers       0.95      0.98      0.96      1027
        Freezers       0.97      0.64      0.77       664
 Fridge Freezers       0.85      0.96      0.90      1651
         Fridges       0.91      0.89      0.90      1075
      Microwaves       0.98      0.98      0.98       703
   Mobile Phones       1.00      0.99      0.99      1224
             TVs       0.98      0.99      0.98      1069
Washing Machines       0.97      0.97      0.97      1213

        accuracy                           0.95     10594
       macro avg       0.96      0.94      0.95     10594
    weighted avg       0.95      0.95      0.95     10594
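
A confusion matrix is another way to see which classes get mistaken for which. The sketch below uses a few made-up labels rather than the real predictions. With the model above, the equivalent call would be along the lines of confusion_matrix(y_test, y_pred, labels=model.classes_).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels, purely for illustration
cm_labels = ["Freezers", "Fridge Freezers", "Fridges"]
actual = ["Freezers", "Freezers", "Freezers",
          "Fridge Freezers", "Fridge Freezers",
          "Fridges", "Fridges"]
predicted = ["Freezers", "Fridge Freezers", "Fridges",
             "Fridge Freezers", "Fridge Freezers",
             "Fridges", "Fridge Freezers"]

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(actual, predicted, labels=cm_labels)
cm_df = pd.DataFrame(cm, index=cm_labels, columns=cm_labels)
print(cm_df)
```

Off-diagonal cells count the errors, so a low-recall class shows up as a row whose counts are spread across several columns.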

To examine the predictions themselves, we can join the y_pred predictions to the y_test data and put the results in a dataframe. By using a bit of Numpy, we can also add a binary flag of 1 or 0 to identify whether the prediction was correct or incorrect. I’ve sorted them with the errors at the top, so we can see where the model failed.

The results look pretty good for a first attempt. The model performs very well on most product types, particularly CPUs, but it gets a bit confused over some Freezers, Fridge Freezers, and Fridges. Recall for Freezers is only 0.64, for example. That’s to be expected I guess, as there’s far more overlap in the words appearing in these similar product categories than there is in others.

Of course, this is just a very basic example to show the overall approach. Additional work on the data, model selection, cross-validation, and tuning the model’s hyperparameters should improve this further.

results = pd.DataFrame(data={'predicted': y_pred, 'actual': y_test})
results['result'] = np.where(results['predicted']==results['actual'], 1, 0)
results.sort_values(by='result').head(20)

	predicted	actual
33025	Fridge Freezers	Fridges
25846	Fridges	Freezers
25118	Fridge Freezers	Freezers
25048	Fridge Freezers	Freezers
28979	Fridges	Fridge Freezers
32306	Fridge Freezers	Fridges
25550	Fridges	Freezers
31539	Fridges	Fridge Freezers
30107	Fridges	Fridge Freezers
32466	Fridge Freezers	Fridges
24509	Fridge Freezers	Freezers
26096	Fridge Freezers	Freezers
31563	Fridges	Fridge Freezers
28696	Dishwashers	Fridge Freezers
24830	Fridge Freezers	Freezers
23761	Fridge Freezers	Washing Machines
30515	Fridges	Fridge Freezers
24213	Fridge Freezers	Freezers
34920	Washing Machines	Fridges
26148	Fridges	Freezers
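
The cross-validation and hyperparameter tuning mentioned above could be approached by wrapping the vectorizer and model in a Pipeline and searching over a small grid. This is only a sketch on made-up titles: the data, the grid values, and the choice of cv=2 are assumptions for illustration, not settings tested on the PriceRunner data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up dataset, purely for illustration
cv_titles = [
    "apple iphone 8 plus 64gb silver",
    "samsung galaxy s9 64gb black",
    "huawei p20 pro 128gb blue",
    "sony xperia xz2 64gb black",
    "bosch integrated fridge freezer white",
    "beko frost free fridge freezer silver",
    "samsung american fridge freezer black",
    "hotpoint fridge freezer 50/50 white",
]
cv_labels = ["Mobile Phones"] * 4 + ["Fridge Freezers"] * 4

# Keeping the vectorizer in the pipeline means it is refitted on each
# training fold, so no vocabulary leaks in from the validation fold
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Illustrative grid: unigrams vs unigrams plus bigrams, and three alpha values
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(cv_titles, cv_labels)

print(search.best_params_, search.best_score_)
print(search.predict(["lg fridge freezer silver"]))
```

On the full dataset a larger cv value, such as 5, would be a more usual choice, and the raw product_title column would be passed in place of the pre-computed bag of words.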

Matt Clarke, Saturday, March 13, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.