This post is part 2 of the series on building a product classification API. The API is available for demo here. Part 1 available here; Part 3 available here. (Github repository)
Update: The API has been discontinued to save on cloud costs.
In part 1, we focused on data acquisition and formatting the categories. Here, we'll focus on preparing the product titles (and short description, if you want) before training our model.
We'll have products within our data that are categorized incorrectly. How do we exclude these mis-categorized products from our training set?
Here's one approach: if two products have the same title but different categories, we assume that at least one of the products is mis-categorized (and the data is dirty).
Extending on the above, as we take steps to prepare our data, we'll be measuring data "purity" at each step. In this instance, purity is defined as:
Products with the same title and same category / Total number of products
This measures the proportion of products that have the same title and same category in our data. The higher the purity, the cleaner we can assume our data to be.
At the end of the data preparation, we'll be able to identify which products are "impure". Given that we're unable to distinguish between correctly and incorrectly categorized products, we'll exclude them from the training of the model.
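To make the purity measure concrete, here's a minimal sketch of how it might be computed with pandas, assuming the data sits in a DataFrame with 'title' and 'category' columns (the column names and function are assumptions, not the exact code used):

import pandas as pd

def compute_purity(df):
    """Proportion of products whose title maps to exactly one category."""
    # Number of distinct categories observed for each title
    n_categories = df.groupby('title')['category'].transform('nunique')
    return float((n_categories == 1).sum()) / len(df)

# Example usage
df = pd.DataFrame({'title': ['iphone case', 'iphone case', 'blender'],
                   'category': ['Mobile Accessories', 'Kitchen', 'Kitchen']})
print(compute_purity(df))  # 0.33... -> the two 'iphone case' rows are impure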
The titles need a bit of cleaning and preparation before we can train our model on them. In the next steps, we'll go through some sample data cleaning and preparation procedures.
It's not uncommon to find non-ascii characters in data, sometimes due to sellers trying to add a touch of class to their product (e.g., Crème brûlée), or due to errors in scraping the data (e.g., &quot;, &amp;, &nbsp;).
Thus, before doing any further processing, we'll ensure titles are properly encoded so that Crème brûlée -> Creme brulee, àöûëï -> aouei, and HTML entities such as &quot; and &amp; -> " and & (while &nbsp; is dropped).
Here's the approach I took:
# Function to encode string as ascii
import unicodedata
from HTMLParser import HTMLParser  # Python 2

HTML_PARSER = HTMLParser()

def encode_string(title, parser=HTML_PARSER):
    """ (str) -> str

    Returns a string that is encoded as ascii

    :param title:
    :return:

    >>> encode_string('Crème brûlée')
    'Creme brulee'
    >>> encode_string('àöûëï')
    'aouei'
    >>> encode_string('Crème brûlée &quot; &amp; &nbsp;')
    'Creme brulee " & '
    """
    try:
        encoded_title = unicodedata.normalize('NFKD', unicode(title, 'utf-8', 'ignore')).encode('ascii', 'ignore')
        encoded_title = parser.unescape(encoded_title).encode('ascii', 'ignore')
    except TypeError:  # if title is missing and a float
        encoded_title = 'NA'
    return encoded_title
There's quite a bit going on in the code above, so let's examine it piece by piece:
x = 'Cr\xc3\xa8me &amp; br\xc3\xbbl\xc3\xa9e'; print x
# Convert titles into unicode
x = unicode(x, 'utf-8', 'ignore'); print x
>>> Crème &amp; brûlée
# Normalize unicode (errors may crop up if this is not done)
x = unicodedata.normalize('NFKD', x); print x
>>> Crème &amp; brûlée
# Encode unicode into ascii
x = x.encode('ascii', 'ignore'); print x
>>> Creme &amp; brulee
# Parse html
x = HTML_PARSER.unescape(x).encode('ascii', 'ignore'); print x
>>> Creme & brulee
Lowercasing titles is a fairly standard step in text processing. We'll lowercase all title characters before proceeding.
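As a trivial illustration (the sample title is made up):

title = 'Apple iPhone 7 32GB Gold (Export Set)'
print(title.lower())  # apple iphone 7 32gb gold (export set)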
One common way to tokenize text is via nltk.tokenize. I tried it and found it to be significantly slower than a plain regex. In addition, writing our own regex tokeniser gives us the flexibility to keep certain characters from being treated as split characters.
For example, we want to keep hyphenated words (e.g., hyphen-word), decimals (e.g., 0.9), percentages (e.g., 20%), and slash-joined terms (e.g., green/blue) intact, rather than splitting on the -, ., %, and / characters. Intuitively, these punctuation characters provide essential information; empirically, keeping them led to greater accuracy during model training and validation.
Here's how we write our own tokeniser:
# Tokenize strings
import re

def tokenize_title_string(title, excluded='-/.%'):
    """ (str) -> list(str)

    Returns a list of string tokens given a string.
    It will exclude the following characters from the tokenization: - / . %

    :param title:
    :return:

    >>> tokenize_title_string('hello world', '-.')
    ['hello', 'world']
    >>> tokenize_title_string('test hyphen-word 0.9 20% green/blue', '')
    ['test', 'hyphen', 'word', '0', '9', '20', 'green', 'blue']
    >>> tokenize_title_string('test hyphen-word 0.9 20% green/blue', '.-')
    ['test', 'hyphen-word', '0.9', '20', 'green', 'blue']
    >>> tokenize_title_string('test hyphen-word 0.9 20% green/blue', '-./%')
    ['test', 'hyphen-word', '0.9', '20%', 'green/blue']
    """
    return re.split("[^" + excluded + "\w]+", title)
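The speed claim is easy to check on your own data. Here's a rough comparison using timeit; it assumes nltk is installed along with its punkt tokenizer models, and the exact timings will vary by machine:

import timeit

setup = '''
import re
from nltk.tokenize import word_tokenize
title = 'test hyphen-word 0.9 20% green/blue ' * 10
'''

# Our regex tokeniser vs nltk's word_tokenize; lower times are faster
print(timeit.timeit(r"re.split('[^-/.%\w]+', title)", setup=setup, number=10000))
print(timeit.timeit("word_tokenize(title)", setup=setup, number=10000))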
After tokenising our titles, we can proceed to remove stop words. The trick is in choosing which stop words to remove. For the product classification API, I found a combination of regular English stop words, spammy seller words, and colour names to work well.
At this point, the tokens are stored in a list. We can remove stop words easily and cleanly via list comprehension, like so:
# Remove stopwords from string
def remove_words_list(title, words_to_remove):
    """ (list(str), set) -> list(str)

    Returns a list of tokens where the stopwords/spam words/colours have been removed

    :param title:
    :param words_to_remove:
    :return:

    >>> remove_words_list(['python', 'is', 'the', 'best'], STOP_WORDS)
    ['python', 'best']
    >>> remove_words_list(['grapes', 'come', 'in', 'purple', 'and', 'green'], STOP_WORDS)
    ['grapes', 'come']
    >>> remove_words_list(['spammy', 'title', 'intl', 'buyincoins', 'export'], STOP_WORDS)
    ['spammy', 'title']
    """
    return [token for token in title if token not in words_to_remove]
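The doctests above assume a STOP_WORDS set that combines regular English stop words, spammy seller words, and colours. The exact lists aren't shown in this post, but here's a sketch of how such a set might be assembled (the spam and colour words below are purely illustrative, and nltk's stopwords corpus is assumed to be downloaded):

from nltk.corpus import stopwords

SPAM_WORDS = {'intl', 'export', 'buyincoins'}   # illustrative seller spam words
COLOURS = {'red', 'green', 'blue', 'purple'}    # illustrative colour names
STOP_WORDS = set(stopwords.words('english')) | SPAM_WORDS | COLOURS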
We'll also remove words that are solely numeric. Intuitively, an iPhone 7, iPhone 8, or iPhone 21 should all be categorized as a mobile phone, and the numeric suffix does not add any useful information for categorizing it better. Can you think of a product where removing the numerics would put it in a different category?
Similar to above, removing numerics can be accomplished easily via list comprehension:
# Remove words that are fully numeric
def remove_numeric_list(title):
    """ (list(str)) -> list(str)

    Remove words which are fully numeric

    :param title:
    :return:

    >>> remove_numeric_list(['A', 'B2', '1', '123', 'C'])
    ['A', 'B2', 'C']
    >>> remove_numeric_list(['1', '2', '3', '123'])
    []
    """
    return [token for token in title if not token.isdigit()]
We also remove words whose character length falls at or below a certain threshold. E.g., with word_len=1, single-character words are removed; with word_len=2 (the default below), words of one or two characters are removed.
To an untrained eye (like mine), double-character words like "TX", "AB", or "GT" don't add much informational value to the title, though there are exceptions like "3M". Via cross-validation, I found that removing these words led to increased accuracy.
Here's how we remove these double-character words; you can change the word length threshold to suit your needs:
# Remove words with character count below threshold from string
def remove_chars(title, word_len=2):
    """ (list(str), int) -> list(str)

    Returns a list of str (tokenized titles) where tokens of character length <= word_len are removed.

    :param title:
    :param word_len:
    :return:

    >>> remove_chars(['what', 'remains', 'of', 'a', 'word', '!', ''], 1)
    ['what', 'remains', 'of', 'word']
    >>> remove_chars(['what', 'remains', 'of', 'a', 'word', '!', '', 'if', 'word_len', 'is', '2'], 2)
    ['what', 'remains', 'word', 'word_len']
    """
    return [token for token in title if len(token) > word_len]
Next, we exclude duplicated words in titles. Sometimes, titles contain duplicate words because sellers attempt search engine optimisation (SEO) on their products to make them more findable. However, these duplicate words do not provide any additional information for categorizing products.
We can remove duplicate tokens by converting the token list to a token set. Yes, this removes any sequential information in the title; however, we're only doing this step to identify impure products that should not be used in training our model. During the actual data preparation, we will exclude this step.
Converting a list to a set shouldn't be too difficult, right? I'll leave that to the reader.
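For completeness, here's a minimal sketch of that step (the function name is mine):

# Remove duplicate tokens; order is not preserved, which is acceptable since
# this version of the title is only used to flag impure products
def remove_duplicates_list(title):
    return list(set(title))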
Lastly, after performing all the cleaning and preparation above, some titles may have no text left. (This means those titles contained only stop words, numerics, or words with fewer than three characters.) We'll exclude these products as well.
After doing the above, we're left with titles in their most informationally rich and dense form. At this point, we're confident that products with identical titles and categories are correctly categorized, while products with identical titles but different categories have at least one error among them (i.e., they are impure).
Since we have no ground truth on which of the impure products are correctly or incorrectly categorized, we'll discard all of them and not use them to train our model.
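Putting it together, here's a sketch of how the functions above might be chained and how the impure products could be excluded, again assuming a DataFrame with 'title' and 'category' columns (the column names and the clean_title helper are assumptions, not the exact code used):

def clean_title(title):
    tokens = tokenize_title_string(encode_string(title).lower(), excluded='-/.%')
    tokens = remove_words_list(tokens, STOP_WORDS)
    tokens = remove_numeric_list(tokens)
    tokens = remove_chars(tokens, word_len=2)
    return ' '.join(sorted(set(tokens)))  # dedupe tokens for the purity check

df['title_clean'] = df['title'].apply(clean_title)
df = df[df['title_clean'] != '']  # drop products with nothing left in the title
n_categories = df.groupby('title_clean')['category'].transform('nunique')
train_df = df[n_categories == 1]  # keep only pure products for training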
Whew! That's a lot of work just to clean titles! Nonetheless, we're largely done with the data preparation steps.
Next, we'll cover the framework for making this product classifier available online via a simple web UI.
If you found this useful, please cite this write-up as:
Yan, Ziyou. (Dec 2016). Product Classification API Part 2: Data Preparation. eugeneyan.com. https://eugeneyan.com/writing/product-categorization-api-part-2-data-preparation/.
or
@article{yan2016preparation,
title = {Product Classification API Part 2: Data Preparation},
author = {Yan, Ziyou},
journal = {eugeneyan.com},
year = {2016},
month = {Dec},
url = {https://eugeneyan.com/writing/product-categorization-api-part-2-data-preparation/}
}