Assigning products to the right categories is crucial to allowing customers to find what they’re looking for, so product classification models are commonly used by online marketplaces to ensure that products are assigned to the right product categories when listed by third parties.
Product classifiers are also useful for competitor analysis projects: by mapping every retailer’s products to a single information architecture, or category tree, they let you compare the products sold per category across retailers.
Product classification models typically take the name of the product, which differs across retailers, and predict which product category it belongs to, based on a set of labelled training data. In this project, we’ll use a Multinomial Naive Bayes model and apply Natural Language Processing (NLP) techniques to predict product categories from product names.
For this project we’ll need to load Pandas, NumPy and a range of scikit-learn classes and functions, including CountVectorizer for turning our text data into a numeric form, plus the MultinomialNB model and some functions for assessing model performance.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
The dataset I’m using is a PriceRunner dataset which is ideal for product classification problems. It includes 35,311 product titles from various online retailers, which map to 12,849 standardised product names assigned to 10 different product categories.
The names the vendors have used for the products are all slightly different. Our aim is to predict which category_label a product will have from its product name.
df = pd.read_csv('pricerunner_aggregate.csv',
                 names=['product_id', 'product_title', 'vendor_id', 'cluster_id',
                        'cluster_label', 'category_id', 'category_label'])
df.head()
| | product_id | product_title | vendor_id | cluster_id | cluster_label | category_id | category_label |
---|---|---|---|---|---|---|---|
0 | 1 | apple iphone 8 plus 64gb silver | 1 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
3 | 4 | apple iphone 8 plus 64gb space grey | 4 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
Printing the value_counts() of the category_label column we’re trying to predict shows that we have 10 different classes present, each of which has 2,212 to 5,501 product titles in the dataset. This is good because we have plenty of data from which to make our predictions.
df.category_label.value_counts().to_frame()
| | category_label |
---|---|
Fridge Freezers | 5501 |
Mobile Phones | 4081 |
Washing Machines | 4044 |
CPUs | 3862 |
Fridges | 3584 |
TVs | 3564 |
Dishwashers | 3424 |
Digital Cameras | 2697 |
Microwaves | 2342 |
Freezers | 2212 |
Before the model can classify text we need to transform it into a numeric form. Various preprocessing steps can be applied here, such as removing stopwords, lemmatization, or Porter stemming, and the raw counts can be swapped for a different weighting scheme, such as Term Frequency-Inverse Document Frequency (TF-IDF).
However, the basic CountVectorizer approach gives good results out of the box. It builds a vocabulary from all of the words in the data, assigns each word a column index, and counts how often each word appears in each product title, creating the bag of words matrix required by the model. The final step converts the sparse matrix to a dense array. MultinomialNB also accepts sparse input, so this conversion is optional and uses more memory.
count_vec = CountVectorizer()
bow = count_vec.fit_transform(df['product_title'])
bow = bow.toarray()
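To make the bag of words idea concrete, here is a minimal sketch on three invented product titles (they are not taken from the PriceRunner file). It shows the vocabulary CountVectorizer learns and the per-title counts. Note that, with the default settings, single-character tokens such as "8" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical product titles, invented for illustration only
titles = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus 64gb gold",
    "bosch fridge freezer silver",
]

vectorizer = CountVectorizer()
bow_sparse = vectorizer.fit_transform(titles)  # SciPy sparse matrix

# vocabulary_ maps each token to its column index in the matrix
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)

# One row per title, one column per token, values are token counts
print(bow_sparse.toarray())
```

Each row sums to the number of kept tokens in that title, which is why longer titles produce larger row totals.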
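To create the test and train data we’ll assign the bag of words to X as our feature set, and then assign the category_label class column to y. Passing these into the train_test_split() function will create our test and train datasets. We’re assigning 30% of the data to the test group and using the stratify argument so that each category appears in the same proportion in both the train and test sets. No random_state is set, so the exact split, and therefore the scores below, will vary slightly between runs.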
X = bow
y = df['category_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
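As a quick illustration of what stratify does, the sketch below splits a made-up, imbalanced label array and checks the class shares on each side. The data and the random_state value are assumptions for the demo, not part of the original project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class "a", 20 of class "b"
y_demo = np.array(["a"] * 80 + ["b"] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# With stratify, class "b" keeps its 20% share in both splits
train_share_b = (y_tr == "b").mean()
test_share_b = (y_te == "b").mean()
print(train_share_b, test_share_b)
```

Stratifying does not rebalance the classes. It only preserves whatever proportions the full dataset already has.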
Now the data have been prepared, we’ll fit the Multinomial Naive Bayes model. There are a few hyperparameters that can be passed to this model, but we’ll just fit it with the default settings for now.
model = MultinomialNB().fit(X_train, y_train)
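For reference, here is a small sketch of the same model on invented titles. It spells out the default smoothing hyperparameter (alpha=1.0) and fits directly on the sparse matrix, which MultinomialNB accepts. The titles and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data, invented for illustration only
titles = [
    "apple iphone 64gb",
    "samsung galaxy 128gb",
    "bosch fridge freezer",
    "beko tall fridge",
]
labels = ["Mobile Phones", "Mobile Phones", "Fridges", "Fridges"]

vec = CountVectorizer()
X_sparse = vec.fit_transform(titles)  # left sparse on purpose

# alpha is the additive (Laplace) smoothing term; 1.0 is the default
nb = MultinomialNB(alpha=1.0).fit(X_sparse, labels)

# Unseen words ("phone") are ignored by transform(); known ones drive the prediction
pred = nb.predict(vec.transform(["samsung 64gb phone"]))
print(pred[0])
```

Smoothing matters here because a word that never appears in a class would otherwise give that class a probability of zero.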
Once the model has been fitted to the data, we can make our predictions on the X_test dataset and then calculate the accuracy score and F1 score. This gives us a decent score on the test data, with 94.96% accuracy and a macro-averaged F1 score of 0.945.
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9495941098735133
f1_score(y_test, y_pred, average="macro")
0.9450395183071821
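The macro average weights every class equally, so it can sit well below accuracy when a small class is predicted badly. The toy labels below are made up purely to show that gap.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: the classifier never predicts the minority class "b"
y_true = ["a"] * 8 + ["b"] * 2
y_hat = ["a"] * 10

acc = accuracy_score(y_true, y_hat)

# Macro F1: unweighted mean of the per-class F1 scores
macro = f1_score(y_true, y_hat, average="macro", zero_division=0)

# Weighted F1: per-class F1 scores weighted by class support
weighted = f1_score(y_true, y_hat, average="weighted", zero_division=0)

print(acc, macro, weighted)
```

In the project above the two numbers are close, which suggests no single category is dragging the average down too far.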
To check how well the model did in a bit more detail we can examine the precision, recall, and F1 score for each of the classes using a classification report.
print(classification_report(y_test, y_pred))
| | precision | recall | f1-score | support |
---|---|---|---|---|
CPUs | 1.00 | 1.00 | 1.00 | 1159 |
Digital Cameras | 0.99 | 0.99 | 0.99 | 809 |
Dishwashers | 0.95 | 0.98 | 0.96 | 1027 |
Freezers | 0.97 | 0.64 | 0.77 | 664 |
Fridge Freezers | 0.85 | 0.96 | 0.90 | 1651 |
Fridges | 0.91 | 0.89 | 0.90 | 1075 |
Microwaves | 0.98 | 0.98 | 0.98 | 703 |
Mobile Phones | 1.00 | 0.99 | 0.99 | 1224 |
TVs | 0.98 | 0.99 | 0.98 | 1069 |
Washing Machines | 0.97 | 0.97 | 0.97 | 1213 |
accuracy | | | 0.95 | 10594 |
macro avg | 0.96 | 0.94 | 0.95 | 10594 |
weighted avg | 0.95 | 0.95 | 0.95 | 10594 |
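A confusion matrix is another way to see which categories get mixed up with which. This is a sketch on a handful of invented predictions, not the output of the model above. On the real data you would pass y_test and y_pred instead.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels, invented for illustration only
actual = ["Freezers", "Freezers", "Freezers",
          "Fridge Freezers", "Fridge Freezers", "Fridges"]
predicted = ["Freezers", "Fridge Freezers", "Fridges",
             "Fridge Freezers", "Fridge Freezers", "Fridge Freezers"]

class_names = ["Freezers", "Fridge Freezers", "Fridges"]

# Rows are the actual class, columns are the predicted class
cm = confusion_matrix(actual, predicted, labels=class_names)
cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)
print(cm_df)
```

Off-diagonal cells are the errors, so a large value in the Freezers row under the Fridge Freezers column would match the low Freezers recall in the report.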
To examine the predictions themselves, we can join the y_pred predictions to the y_test data and put the results in a dataframe. By using a bit of NumPy, we can also add a binary flag of 1 or 0 to identify whether the prediction was correct or incorrect. I’ve sorted them with the errors at the top, so we can see where the model failed.
The results look pretty good for a first attempt. The model performs very well on most product types, particularly CPUs, but it gets a bit confused over some Freezers, Fridge Freezers, and Fridges. Recall for Freezers is only 0.64, for example. That’s to be expected, as there’s far more overlap in the words appearing in these similar product categories than there is in the others.
Of course, this is just a very basic example to show the overall approach. Additional work on the data, model selection, cross validation, and tuning the model’s hyperparameters should improve this further.
results = pd.DataFrame(data={'predicted': y_pred, 'actual': y_test})
results['result'] = np.where(results['predicted']==results['actual'], 1, 0)
results.sort_values(by='result').head(20)
| | predicted | actual | result |
---|---|---|---|
33025 | Fridge Freezers | Fridges | 0 |
25846 | Fridges | Freezers | 0 |
25118 | Fridge Freezers | Freezers | 0 |
25048 | Fridge Freezers | Freezers | 0 |
28979 | Fridges | Fridge Freezers | 0 |
32306 | Fridge Freezers | Fridges | 0 |
25550 | Fridges | Freezers | 0 |
31539 | Fridges | Fridge Freezers | 0 |
30107 | Fridges | Fridge Freezers | 0 |
32466 | Fridge Freezers | Fridges | 0 |
24509 | Fridge Freezers | Freezers | 0 |
26096 | Fridge Freezers | Freezers | 0 |
31563 | Fridges | Fridge Freezers | 0 |
28696 | Dishwashers | Fridge Freezers | 0 |
24830 | Fridge Freezers | Freezers | 0 |
23761 | Fridge Freezers | Washing Machines | 0 |
30515 | Fridges | Fridge Freezers | 0 |
24213 | Fridge Freezers | Freezers | 0 |
34920 | Washing Machines | Fridges | 0 |
26148 | Fridges | Freezers | 0 |
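As a starting point for the cross validation and tuning mentioned earlier, here is one possible sketch. It wraps a TfidfVectorizer and MultinomialNB in a Pipeline and searches a few alpha values with GridSearchCV. The titles, the alpha grid, and the choice of TF-IDF are all assumptions for illustration, and there is no guarantee this beats the plain CountVectorizer setup on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical product titles, invented for illustration only
titles = [
    "apple iphone 8 64gb", "samsung galaxy s9 64gb", "huawei p20 128gb",
    "nokia 7 plus 64gb", "sony xperia 32gb", "apple iphone x 256gb",
    "bosch fridge freezer white", "beko fridge freezer silver",
    "smeg fridge freezer red", "hotpoint fridge freezer",
    "samsung fridge freezer 70cm", "lg fridge freezer steel",
]
labels = ["Mobile Phones"] * 6 + ["Fridge Freezers"] * 6

# Keeping the vectorizer inside the pipeline means it is refitted on
# each training fold, so held-out titles never shape the vocabulary
pipeline = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Candidate smoothing values (an arbitrary, illustrative grid)
param_grid = {"clf__alpha": [0.1, 0.5, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(titles, labels)

print(search.best_params_, search.best_score_)
```

On the full dataset you would pass df['product_title'] and df['category_label'] instead of the toy lists, and probably widen the grid.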
Matt Clarke, Saturday, March 13, 2021