Building a Recommendation System for the Amazon Electronics Category
The aim of this project is to partially rebuild the Amazon product recommendation system for the Electronics product category.
It's November and Black Friday is here! What type of shopper are you? Do you save all the products you want to buy for the day, or would you rather open the website and browse the live offers with their big discounts?
While online stores have been incredibly successful over the past decade, showing enormous potential and growth, one of the key differences between a physical store and an online store is customers' impulse purchases.
When customers are presented with a variety of products, they are much more likely to buy an item they originally did not intend to buy. The phenomenon of spontaneous purchases is incredibly limited by the setup of an online store. The same doesn't happen to their physical counterparts: the largest physical retail chains make their customers follow a precise path to ensure they visit every aisle before exiting the store.
A way online stores like Amazon found to recreate the impulse purchase phenomenon is through recommendation systems. Recommender systems identify the products most similar or complementary to the one the customer has just purchased or viewed. The intent is to maximize the random-purchase phenomenon that online stores usually lack.
Shopping on Amazon made me very interested in the mechanics, and I wanted to recreate (even partially) the results of its recommendation system.
According to the blog Recostream, Amazon's product recommendation system has three types of dependencies, one of them being product-to-product recommendations [1]. When a user has virtually no search history, the algorithm clusters products together and suggests them to that user based on the items' metadata.
The Data
The first stage of the project is collecting the data. Fortunately, researchers at the University of California, San Diego maintain a repository that lets students and people outside the organization use the data for research and projects. The data can be accessed through this link, together with many other interesting datasets related to recommender systems [2][3]. The product metadata was last updated in 2014; many of the products may not be available today.
The Electronics category metadata contains 498,196 records and has 8 columns in total:
- asin — the unique ID assigned to each product
- imUrl — the URL of the image associated with each product
- description — the description of the product
- categories — a Python list of all the categories each product falls into
- title — the title of the product
- price — the price of the product
- salesRank — the ranking of each product within a specific category
- related — products viewed and bought by customers that are related to each product
- brand — the brand of the product
You'll notice that the file is in a "loose" JSON format, where each line is a JSON object containing all the columns mentioned above as fields. See the code implementation section for how to handle this.
EDA
Let's start with a quick exploratory data analysis. After removing all the records containing at least one NaN value in one of the columns, I created the visualizations for the Electronics category.
The first chart is a box plot showing the maximum, minimum, 25th percentile, 75th percentile, and median of each product's price. For example, we know the maximum value of a product is $1,000, while the minimum is around $1. The band above the $160 mark is made of dots, and each of these dots identifies an outlier. An outlier represents a record that occurs only once in the entire dataset. Therefore, we know there is only 1 product priced at around $1,000.
The median price appears to be around $25. It's important to note that the matplotlib library can automatically exclude outliers with the option showfliers=False. To make our box plot cleaner, we can set this parameter to False.
The result is a much cleaner box plot without the outliers. The chart also suggests that the vast majority of Electronics products sit in the $1-$160 range.
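For reference, here is a minimal sketch of how such a box plot could be produced with matplotlib. The variable names are illustrative, not from the original notebook; it assumes the cleaned metadata lives in a DataFrame called full_clean with a numeric price column:

import matplotlib.pyplot as plt

#Assumed: 'full_clean' is the metadata DataFrame after dropping NaN rows
prices = full_clean['price'].dropna()

fig, ax = plt.subplots(figsize=(8, 2))
#showfliers=False hides the outlier dots for a cleaner plot
ax.boxplot(prices, vert=False, showfliers=False)
ax.set_xlabel('Price ($)')
ax.set_title('Price distribution of Electronics products')
plt.show()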
The next chart shows the top 10 brands by number of products listed for sale on Amazon in the Electronics category. Among them are HP, Sony, Dell, and Samsung.
Finally, we can see the price distribution for each of the top 10 sellers. Sony and Samsung definitely offer a wide range of products, from a few dollars up to $500-$600, and as a result their average price is higher than that of most of the top competitors. Curiously, SIB and SIB CORP offer more products, but at a much lower average price.
The chart also tells us that Sony offers products representing approximately 60% of the most expensive products in the dataset.
Cosine Similarity
A possible solution for grouping products by their characteristics is cosine similarity. We need to understand this concept well in order to build our recommendation system.
Cosine similarity measures how "close" two sequences of numbers are. And how does it apply to our case? Amazingly enough, sentences can be converted into numbers, or better, into vectors.
Cosine similarity can take values between -1 and 1, where 1 indicates that two vectors are formally the same, while -1 indicates that they are as different as they can possibly be.
Mathematically, cosine similarity is the dot product of two multidimensional vectors divided by the product of their magnitudes [4]:

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

I understand there is a lot of jargon in that sentence, so let's try to break it down with a practical example.
Let's suppose we are analyzing Document A and Document B. Document A's three most common terms are "today", "good", and "sunshine", which appear 2, 2, and 3 times respectively. The same three terms in Document B appear 3, 2, and 2 times. We can therefore write them as:
A = (2, 2, 3); B = (3, 2, 2)
The formula for the dot product of the two vectors can be written as:

$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$

The vectors' dot product is then nothing more than 2×3 + 2×2 + 3×2 = 16.
The magnitude of a single vector, on the other hand, is computed as:

$$\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2}$$

Applying the formula, I obtain

||A|| = 4.12; ||B|| = 4.12
Their cosine similarity is therefore

16 / (4.12 × 4.12) = 16 / 17 = 0.94, which corresponds to an angle of about 19.74°.
The two vectors are very similar.
So far, we have only computed the score between two vectors with three dimensions. A vector of words can potentially have an unlimited number of dimensions (depending on how many words it contains), but the logic behind the process is mathematically the same. In the next section, we'll see how to apply all these concepts in practice.
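Before moving on, here is a quick sketch that verifies the worked example with NumPy. It is not part of the original notebook, just a sanity check of the arithmetic above:

import numpy as np

#Reproducing the worked example above
A = np.array([2, 2, 3])
B = np.array([3, 2, 2])

dot = np.dot(A, B)                      #2*3 + 2*2 + 3*2 = 16
norm_a = np.linalg.norm(A)              #sqrt(2^2 + 2^2 + 3^2) ≈ 4.12
norm_b = np.linalg.norm(B)              #≈ 4.12

cos_sim = dot / (norm_a * norm_b)       #16/17 ≈ 0.94
theta = np.degrees(np.arccos(cos_sim))  #≈ 19.74°

print(round(cos_sim, 2), round(theta, 2))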
Let's now move on to the code implementation phase to build our recommendation system on the dataset.
Importing Libraries
The first cell of every data science notebook should import the libraries; the ones we need for this project are:
#Importing libraries for data management
import gzip
import json
import pandas as pd
from tqdm import tqdm_notebook as tqdm

#Importing libraries for feature engineering
import nltk
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
- gzip unzips the data files
- json decodes them
- pandas converts the JSON data into a more manageable dataframe format
- tqdm creates progress bars
- nltk processes the text strings
- re provides support for regular expressions
- Finally, sklearn is needed for text pre-processing
Reading the Data
As already mentioned, the data was uploaded in a loose JSON format. The solution to this issue is to first convert the file into readable JSON lines with the command json.dumps. We can then turn the file into a Python list of JSON lines by setting \n as the line separator. Finally, we can append each line to the empty data list while reading it as JSON with the command json.loads.

With the command pd.DataFrame, the data list is read as a dataframe that we can now use to build our recommender.
#Creating an empty list
data = []

#Decoding the gzip file
def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        yield json.dumps(eval(l))

#Defining f as the file that will contain the json data
f = open("output_strict.json", 'w')

#Setting '\n' as the newline and writing it at the end of each line
for l in parse("meta_Electronics.json.gz"):
    f.write(l + '\n')
f.close()

#Appending each JSON element to the empty 'data' list
with open('output_strict.json', 'r') as f:
    for l in tqdm(f):
        data.append(json.loads(l))

#Reading 'data' as a pandas dataframe
full = pd.DataFrame(data)
To get an idea of what each line of the data list looks like, we can run a simple command, print(data[0]), i.e., print the line at index 0.

print(data[0])

Output:
{
'asin': '0132793040',
'imUrl': 'http://ecx.images-amazon.com/images/I/31JIPhp%2BGIL.jpg',
'description': 'The Kelby Training DVD Mastering Blend Modes in Adobe Photoshop CS5 with Corey Barker is a useful tool for... and the confidence you need.',
'categories': [['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Display Accessories']],
'title': 'Kelby Training DVD: Mastering Blend Modes in Adobe Photoshop CS5 by Corey Barker'
}
As you can see, the output is a JSON object with {} opening and closing the string, and each column name followed by : and the corresponding string. You can notice that this first product is missing the price, salesRank, related, and brand information. Those columns are automatically filled with NaN values.
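To gauge how widespread the missing values are before dropping anything, a simple check helps; this is an illustrative one-liner, not from the original walkthrough:

#Counting the NaN values in each column of the dataframe
full.isna().sum()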
After reading the whole list as a dataframe, the Electronics products present the following 8 features:
| asin | imUrl | description | categories | price | salesRank | related | brand |
|------|-------|-------------|------------|-------|-----------|---------|-------|
Feature Engineering
The feature engineering phase is responsible for cleaning the data and creating the column on which we'll compute the cosine similarity score. Due to RAM limitations, I didn't want the column to be particularly long, as a review or a product description could be. Instead, I decided to create a "data soup" with the categories, title, and brand columns. Before doing so, though, we need to remove every row containing a NaN value in any of these three columns.
The selected columns contain valuable and essential information in text form that we need for our recommender. The description column could also be a potential candidate, but the string is often too long and it's not standardized across the dataset. It's not a reliable enough piece of information for what we're trying to accomplish.
#Dropping each row containing a NaN value within the selected columns
df = full.dropna(subset=['categories', 'title', 'brand'])

#Resetting the index count
df = df.reset_index()
After running this first portion of code, the rows shrink drastically from 498,196 to roughly 142,000, quite a significant change. Only at this point can we create the so-called data soup:
#Creating the data soup out of the selected columns
df['ensemble'] = df['title'] + ' ' + df['categories'].astype(str) + ' ' + df['brand']

#Printing the record at index 0
df['ensemble'].iloc[0]
Output:

"Barnes & Noble NOOK Power Kit in Carbon BNADPN31
[['Electronics', 'eBook Readers & Accessories', 'Power Adapters']]
Barnes & Noble"
The brand name needs to be included since the title doesn't always contain it.
I can now move on to the cleaning part. The function text_cleaning is responsible for removing every "amp" string from the ensemble column. On top of that, the string [^A-Za-z0-9] filters out every special character. Finally, the last row of the function removes every stopword the string contains.
#Defining the text cleaning function
def text_cleaning(text):
    forbidden_words = set(stopwords.words('english'))
    text = re.sub(r'amp', '', text)
    text = re.sub(r'\s+', ' ', re.sub('[^A-Za-z0-9]', ' ',
           text.strip().lower())).strip()
    text = [word for word in text.split() if word not in forbidden_words]
    return ' '.join(text)
Through a lambda function, we can apply text_cleaning to the whole ensemble column; we can then randomly select the data soup of a random product by calling iloc and indicating the index of the random record.
#Applying the text cleaning function to each row
df['ensemble'] = df['ensemble'].apply(lambda text: text_cleaning(text))

#Printing the row at index 10000
df['ensemble'].iloc[10000]
Output:

'vcool vga cooler electronics computers accessories computer components fans cooling case fans antec'
The record at the 10,001st position (indexing starts at 0) is the Antec VCool VGA Cooler. This is a scenario in which the brand name is not included in the title.
Cosine Computation and Recommender Function
The computation of the cosine similarity starts with building a matrix containing all the words that appear in the ensemble column. The method we are going to use is called "Count Vectorization", or more commonly, "Bag of Words". If you'd like to read more about count vectorization, you can read one of my previous articles at this link.
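To make the idea concrete, here is a toy illustration of count vectorization on two short sentences. It is separate from the main pipeline; the documents are made up for the example:

#Toy example of count vectorization ("bag of words")
from sklearn.feature_extraction.text import CountVectorizer

docs = ['today good sunshine sunshine', 'good sunshine today today today']
vect = CountVectorizer()
bow = vect.fit_transform(docs)

print(vect.get_feature_names_out())   #['good' 'sunshine' 'today']
print(bow.toarray())                  #[[1 2 1], [1 1 3]]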
Due to RAM limitations, the cosine similarity score is computed only on the first 35,000 records out of the 142,000 available after the pre-processing phase. This most likely affects the final performance of the recommender.
#Selecting the first 35000 rows
df = df.head(35000)

#Creating the count_vect object
count_vect = CountVectorizer()

#Creating the count matrix
count_matrix = count_vect.fit_transform(df['ensemble'])

#Computing the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
The command cosine_similarity, as the name suggests, computes the cosine similarity for each row of the count_matrix. Each row of the count_matrix is nothing but a vector containing the word count of every word that appears in the ensemble column.
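If you want to peek at what those vectors look like, the following illustrative commands (using the variables defined above) print the vocabulary size and the sparse word counts of the first product:

#Number of distinct words found across all data soups
print(len(count_vect.vocabulary_))

#Sparse word-count vector of the product at row 0
print(count_matrix[0])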
#Creating a pandas Series from df's index
indices = pd.Series(df.index, index=df['title']).drop_duplicates()
Before running the actual recommender, we need to make sure an index is created, and that the index contains no duplicates.
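As a quick illustration of what this Series does, looking a title up returns its row index; the title below is assumed (hypothetically) to be among the 35,000 retained rows:

#Hypothetical lookup: from product title to row index
indices['Vcool VGA Cooler']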
Only at this point can we define the content_recommender function. It takes 4 arguments: title, cosine_sim, df, and indices. The title is the only element to input when calling the function.
content_recommender works as follows:

- it finds the product index associated with the title the user provides
- it looks the product index up within the cosine similarity matrix and gathers the scores of all the products
- it sorts all the scores from the most similar product (closest to 1) to the least similar (closest to 0)
- it selects only the top 30 most similar products
- it adds an index and returns a pandas Series with the result
# Function that takes in a product title as input and outputs recommendations
def content_recommender(title, cosine_sim=cosine_sim, df=df,
                        indices=indices):

    # Obtain the index of the product that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all products with that product
    # And convert it into a list of tuples as described above
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the products based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 30 most similar products,
    # skipping the first entry (the product itself)
    sim_scores = sim_scores[1:31]

    # Get the product indices
    product_indices = [i[0] for i in sim_scores]

    # Return the top 30 most similar products
    return df['title'].iloc[product_indices]
Let's now try it on the "Vcool VGA Cooler". We want 30 products similar to it that customers would be interested in buying. By running the command content_recommender(product_title), the function returns a list of 30 recommendations.
#Defining the product we want recommendations for
product_title = 'Vcool VGA Cooler'

#Launching the content_recommender function
recommendations = content_recommender(product_title)

#Associating titles to recommendations
asin_recommendations = df[df['title'].isin(recommendations)]

#Merging the datasets
recommendations = pd.merge(recommendations,
                           asin_recommendations,
                           on='title',
                           how='left')

#Showing the top 5 recommended products
recommendations['title'].head()
Among the 5 most similar products, we can find other Antec products such as the Tricool computer case fan, the expansion slot cooling fan, and so on.
1    Antec Big Boy 200 - 200mm Tricool Computer Case Fan
2    Antec Cyclone Blower, Expansion Slot Cooling Fan
3    StarTech.com 90x25mm High Airflow Dual Bearing Computer Case Fan with TX3 FAN9X25TX3H (Black)
4    Antec 120MM BLUE LED FAN (Clear)
5    Antec PRO 80MM Pro 80mm Fan with 3 & 4 Pin Connector (Discontinued by Manufacturer)
The related column in the original dataset contains a list of products that consumers also bought, bought together, and bought after viewing the VGA Cooler.
#Selecting the 'related' column of the product we computed recommendations for
related = pd.DataFrame.from_dict(df['related'].iloc[10000], orient='index').transpose()

#Printing the first 10 records of the dataset
related.head(10)
By printing the head of the Python dictionary contained in this column, the console returns the following dataset.
|    | also_bought | bought_together | buy_after_viewing |
|---:|:------------|:----------------|:------------------|
|  0 | B000051299  | B000233ZMU      | B000051299        |
|  1 | B000233ZMU  | B000051299      | B00552Q7SC        |
|  2 | B000I5KSNQ  |                 | B000233ZMU        |
|  3 | B00552Q7SC  |                 | B004X90SE2        |
|  4 | B000HVHCKS  |                 |                   |
|  5 | B0026ZPFCK  |                 |                   |
|  6 | B009SJR3GS  |                 |                   |
|  7 | B004X90SE2  |                 |                   |
|  8 | B001NPEBEC  |                 |                   |
|  9 | B002DUKPN2  |                 |                   |
| 10 | B00066FH1U  |                 |                   |
Let's test whether our recommender did a good job. Let's see if any of the asin ids in the also_bought list are included in the recommendations.
#Checking whether the products in the 'also_bought' column appear among
#the recommendations, for a final evaluation of the recommender
related['also_bought'].isin(recommendations['asin'])
Our recommender correctly suggested 5 out of 44 products.
[ True False  True False False False False False False False  True False
 False False False False False False  True False False False False False
 False False False False  True False False False False False False False
 False False False False False False False False]
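For reference, the same evaluation can be summarized with a one-line count; this is an illustrative addition using the variables defined above:

#Summarizing the evaluation: number of hits among the recommendations
hits = related['also_bought'].isin(recommendations['asin'])
print(f'{hits.sum()} out of {hits.size} products')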
I agree it's not an ideal result, but considering we only used 35,000 of the 498,196 rows available in the full dataset, it's acceptable. It certainly has a lot of room for improvement. If the NaN values were less frequent or even non-existent for the target columns, the recommendations could be more accurate and closer to Amazon's actual ones. Secondly, access to more RAM, or even distributed computing, could allow the practitioner to compute even bigger matrices.
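As a sketch of that direction, one workaround that requires neither more RAM nor a cluster is to compute the similarities in row blocks, so the full 142,000 × 142,000 matrix never sits in memory at once. The snippet below is a minimal illustration of the idea, not code from the original project; the function name and block size are arbitrary:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

#Illustrative sketch: top-k similar products computed block by block
def top_k_similar(count_matrix, k=30, block_size=1000):
    top_idx = []
    for start in range(0, count_matrix.shape[0], block_size):
        block = count_matrix[start:start + block_size]
        #Similarities of this block against every product (block_size x n)
        sims = cosine_similarity(block, count_matrix)
        #Keep k+1 indices so the product itself can be discarded later
        top_idx.append(np.argsort(-sims, axis=1)[:, :k + 1])
    return np.vstack(top_idx)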
I hope you enjoyed the project and that it'll be useful for future applications.
As mentioned in the article, the final result can be further improved by including all the rows of the dataset in the cosine similarity matrix. On top of that, we could add each product's average review score by merging the metadata dataset with the other datasets available in the repository. We could also include the price in the computation of the cosine similarity. Another potential improvement could be building a recommendation system entirely based on the products' descriptive pictures.
The main avenues for future improvement have been listed. Most of them are even worth pursuing from the perspective of a future deployment to production.
Finally, I would like to end this article by thanking Medium for implementing such a useful content sharing feature for developers on the platform.
print('Thank you Medium!')
As a final note, if you enjoyed the content, please consider leaving a follow to be notified when new articles are published. If you have any observations about the article, write them in the comments! I'd love to read them :) Thank you for reading!
PS: If you like what I write, it would mean the world to me if you subscribed to a Medium membership through this link. With the membership, you get the amazing value that Medium articles provide, and it's an indirect way of supporting my content!
[1] Amazon's Product Recommendation System In 2021: How Does the Algorithm of the e-Commerce Giant Work? — Recostream. (2021). Retrieved November 1, 2022, from the Recostream.com website: https://recostream.com/blog/amazon-recommendation-system

[2] He, R., & McAuley, J. (2016, April). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web (pp. 507-517).

[3] McAuley, J., Targett, C., Shi, Q., & Van Den Hengel, A. (2015, August). Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 43-52).

[4] Rahutomo, F., Kitasuka, T., & Aritsugi, M. (2012, October). Semantic cosine similarity. In The 7th International Student Conference on Advanced Science and Technology ICAST (Vol. 4, No. 1, p. 1).

[5] Banik, R. (2018). Hands-On Recommendation Systems with Python: Start building powerful and personalized recommendation engines with Python. Packt Publishing.
If you'd like to have a look at some of my previous articles: