Building a recommendation system for Amazon products with Python (2023)

Created a recommendation system for the Amazon electronics category


The aim of the project is to partially rebuild the Amazon product recommendation system for the Electronics product category.

It's November and Black Friday is here! What type of shopper are you? Do you save all the products you want to buy for the day, or do you prefer to open the website and browse the live offers with their big discounts?

While online stores have been incredibly successful over the past decade and have shown enormous potential and growth, one of the key differences between a physical store and an online store is the consumer's impulse purchases.

When customers are presented with a variety of products, they are much more likely to buy an item they did not originally plan to purchase. The phenomenon of impulse buying is incredibly limited by the configuration of an online store. The same does not happen for their physical counterparts. The largest physical retail chains make their customers follow a precise path to ensure they visit every aisle before leaving the store.

One way online stores like Amazon thought they could recreate the impulse buying phenomenon is through recommender systems. Recommender systems identify the products most similar or complementary to the ones the customer has just purchased or viewed. The intention is to maximize the random purchases phenomenon that online stores normally lack.

Shopping on Amazon made me quite interested in the mechanics, and I wanted to recreate (even partially) the results of their recommender system.

According to the blog "Recostream", the Amazon product recommendation system has three types of dependencies, one of which is product-to-product recommendations [1]. When a user has virtually no search history, the algorithm groups products together and suggests them to that user based on the items' metadata.

The data

The first stage of the project is to collect the data. Fortunately, researchers at the University of California, San Diego maintain a repository that lets students and people outside the organization use the data for research and projects. The data is available from that repository together with many more interesting datasets related to recommender systems [2][3]. The product metadata was last updated in 2014; many of the products may not be available today.

The metadata of the Electronics category contains 498,196 records and has 9 columns in total:

  • asin: the unique ID assigned to each product
  • imUrl: the URL of the image associated with each product
  • description: the product description
  • categories: a Python list of all the categories each product falls into
  • title: the title of the product
  • price: the price of the product
  • salesRank: the ranking of each product within a specific category
  • related: products viewed and bought by customers that are related to each product
  • brand: the brand of the product

You will notice that the file is in a "loose" JSON format, where each line is a JSON object containing all the columns mentioned above as fields. See the code implementation section for how to handle this.


Let's start with a quick exploratory data analysis. After cleaning all the records that contained at least one NaN value in one of the columns, I created the visualizations for the Electronics category.


The first chart is a box plot that shows the maximum, minimum, 25th percentile, 75th percentile, and median price of the products. For example, we know that the maximum value of a product is $1000, whereas the minimum is around $1. The line above the $160 mark is made of dots, and each of these dots identifies an outlier. An outlier is a value that lies far from the bulk of the data; matplotlib draws every point beyond the whiskers (by default, more than 1.5 times the interquartile range past the quartiles) as an individual dot. Since a single dot sits at around $1000, we know that there is only 1 product priced at that level.

The median price seems to be around $25. It is important to note that matplotlib can hide the outliers with the option showfliers=False. To make our box plot look cleaner, we can set that parameter to False.


The result is a much cleaner box plot without the outliers. The chart also suggests that the vast majority of electronics are in the $1-$160 range.
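To make the outlier rule concrete, here is a minimal sketch of how box plot whiskers are conventionally derived. It assumes the common rule that whiskers reach at most 1.5 times the interquartile range beyond the quartiles; the `prices` list is made-up illustrative data, not taken from the dataset, and the quartile method may differ slightly from the one the plotting library uses.

```python
import statistics

# Hypothetical prices in dollars (illustrative only, not from the dataset)
prices = [1, 5, 9, 12, 18, 25, 25, 30, 42, 60, 95, 160, 1000]

# Quartiles: the 25th percentile, the median, and the 75th percentile
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Conventional fences: points beyond them are drawn as individual outlier dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(f"median={median}, IQR={iqr}, outliers={outliers}")
```

Hiding the outliers in the chart only changes what is drawn; the quartiles and the median stay the same.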


The chart shows the top 10 brands by number of products listed for sale on Amazon within the Electronics category. Among them are HP, Sony, Dell, and Samsung.


Finally, we can see the price distribution for each of the top 10 sellers. Sony and Samsung definitely offer a wide range of products, from a few dollars all the way to $500-$600, and as a result their average price is higher than that of most of the top competitors. Interestingly, SIB and SIB-CORP offer more products, but at a much lower price on average.

The chart also tells us that Sony offers products priced at roughly 60% of the most expensive product in the dataset.

Cosine similarity

A possible solution for grouping products by their characteristics is cosine similarity. We need to understand this concept thoroughly before building our recommender system.

Cosine similarity measures how "close" two sequences of numbers are. How does this apply to our case? Amazingly enough, sentences can be transformed into numbers or, better still, into vectors.

Cosine similarity can take values between -1 and 1, where 1 indicates that two vectors point in exactly the same direction, whereas -1 indicates that they are as different as they can get.

Mathematically, cosine similarity is the dot product of two multidimensional vectors divided by the product of their magnitudes [4]. I understand there is a lot of jargon here, but let's try to break it down with a practical example.

Suppose we are analyzing Document A and Document B. Document A has the three most common terms "today", "good", and "sun", which appear 2, 2, and 3 times, respectively. The same three terms in Document B occur 3, 2, and 2 times. Therefore, we can write them as follows:

A = (2, 2, 3); B = (3, 2, 2)

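The formula for the dot product of two vectors can be written as:

$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$

For our two vectors, the dot product is nothing more than

$$A \cdot B = 2 \times 3 + 2 \times 2 + 3 \times 2 = 16$$

The magnitude of a single vector, on the other hand, is calculated as follows:

$$\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2}$$

When I apply the formula I get

$$\|A\| = \sqrt{2^2 + 2^2 + 3^2} = \sqrt{17} \approx 4.12, \qquad \|B\| = \sqrt{3^2 + 2^2 + 2^2} = \sqrt{17} \approx 4.12$$

Their cosine similarity is therefore

$$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{16}{17} \approx 0.94, \qquad \theta \approx 19.75^\circ$$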

The two vectors are very similar.
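The same arithmetic can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the helper name `cosine_similarity_score` is chosen here for illustration and is not part of any library.

```python
import math

def cosine_similarity_score(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Term counts of "today", "good" and "sun" in the two documents
A = (2, 2, 3)
B = (3, 2, 2)

score = cosine_similarity_score(A, B)
angle = math.degrees(math.acos(score))  # the angle between the two vectors
print(round(score, 2), round(angle, 2))
```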

So far, we have only calculated the score between two vectors with three dimensions. A word vector can have virtually an infinite number of dimensions (depending on how many words it contains), but the logic behind the process is mathematically the same. In the next section we will see how to apply all these concepts in practice.

Let's move on to the code implementation phase to build our recommender system on the dataset.

Importing Libraries

The first cell of every data science notebook should import the libraries. The ones we need for the project are:

# Import libraries for data management
import gzip
import json
import pandas as pd
from tqdm import tqdm_notebook as tqdm

# Import libraries for feature engineering
import nltk
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

  • gzip unzips the data files
  • json decodes them
  • pandas converts the JSON data into a more manageable dataframe format
  • tqdm creates progress bars
  • nltk processes text strings
  • re provides support for regular expressions
  • Finally, sklearn is required for text pre-processing

Reading the data

As already mentioned, the data was uploaded in a loose JSON format. The solution to this problem is to first convert the file into a JSON-readable, line-delimited format with the command json.dumps. Then we can write this file as JSON lines by setting \n as the line break. Finally, we can append each line to the empty data list while reading it as JSON with the command json.loads.

With the command pd.DataFrame, the data list is read as a dataframe, which we can now use to build our recommender.

# Create an empty list
data = []

# Decode the gzip file line by line
def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        # eval parses the Python-style dict on each line; only use it on trusted files
        yield json.dumps(eval(l))

# Define f as the file that will contain the JSON data
f = open("output_strict.json", 'w')

# Set the line break to '\n' and write it at the end of each line
for l in parse("meta_Electronics.json.gz"):
    f.write(l + '\n')

# Close the file so every line is flushed to disk before reading it back
f.close()

# Append each JSON element to the empty 'data' list
with open('output_strict.json', 'r') as f:
    for l in tqdm(f):
        data.append(json.loads(l))

# Read 'data' as a pandas dataframe
full = pd.DataFrame(data)
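As a side note, the same conversion can be sketched without `eval`. The example below is an assumption-laden alternative, not the article's method: it uses `ast.literal_eval` from the standard library, which only accepts Python literals, and it runs on a made-up two-line sample compressed in memory rather than on the real metadata file. The function name `parse_loose_json` is hypothetical.

```python
import ast
import gzip
import io
import json

def parse_loose_json(fileobj):
    """Yield one dict per line of a file whose lines are Python-style dict literals."""
    for raw in fileobj:
        line = raw.decode("utf-8").strip()
        if line:
            # literal_eval only accepts literals, so it cannot execute arbitrary code
            yield ast.literal_eval(line)

# Hypothetical two-line sample in the same single-quoted style as the metadata file
sample = b"{'asin': 'X1', 'title': 'Item one'}\n{'asin': 'X2', 'price': 9.99}\n"
compressed = io.BytesIO(gzip.compress(sample))

with gzip.open(compressed, "rb") as g:
    data = list(parse_loose_json(g))

# Each record now serializes to strict JSON
strict_lines = [json.dumps(record) for record in data]
print(strict_lines[0])
```

Because the records are already dictionaries after parsing, they could be passed to a dataframe constructor directly, without writing an intermediate file.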

To give you an idea of what each line of the data list looks like, we can run the simple command print(data[0]); the console prints the line at index 0.


{
'asin': '0132793040',
'imUrl': '',
'description': 'The Kelby Training DVD Mastering Blend Modes in Adobe Photoshop CS5 with Corey Barker is a useful tool for... and the confidence you need.',
'categories': [['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Display Accessories']],
'title': 'Kelby Training DVD: Mastering Blend Modes in Adobe Photoshop CS5 by Corey Barker'
}

As you can see, the output is a JSON object: it has { } to open and close the string, and each column name is followed by : and the corresponding value. You may notice that this first product is missing the price, salesRank, related, and brand information. Those columns are automatically filled with NaN values.

After reading the entire list as a dataframe, the electronics products show the following 9 features:

| asin | imUrl | description | categories | title | price | salesRank | related | brand |
|------|-------|-------------|------------|-------|-------|-----------|---------|-------|

Feature engineering

Feature engineering is responsible for cleaning the data and creating the column on which we will calculate the cosine similarity score. Due to RAM limitations, I didn't want the column to be particularly long, as a product review or description could be. Instead, I decided to create a "data soup" from the categories, title, and brand columns. Before doing so, however, we must remove every row that contains a NaN value in any of these three columns.

The selected columns contain valuable and essential information in text form that we need for our recommender. The description column could also be a potential candidate, but the string is usually too long and is not standardized across the dataset. It is not reliable enough information for what we are trying to achieve.

# Drop each row that contains a NaN value in the selected columns
df = full.dropna(subset=['categories', 'title', 'brand'])

# Reset the index counter
df = df.reset_index()

After running this first piece of code, the number of rows drops dramatically from 498,196 to roughly 142,000, a big change. Only at this point can we create the so-called data soup:

# Create the data soup from the selected columns
df['ensemble'] = df['title'] + ' ' + df['categories'].astype(str) + ' ' + df['brand']

# Print the record at index 0
df['ensemble'].iloc[0]

"Barnes & Noble Carbon NOOK Power Kit BNADPN31
[['Electronics', 'Ebook Readers and Accessories', 'Power Supplies']]
Barnes & Noble"

The brand name must be included, as the title does not always contain it.

Now I can move on to the cleaning part. The function text_cleaning is responsible for removing every amp string from the ensemble column. On top of that, the pattern [^A-Za-z0-9] filters out every special character. Finally, the last line of the function removes every stopword the string contains.

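# Define the text cleaning function
def text_cleaning(text):
    forbidden_words = set(stopwords.words('english'))
    # Remove the 'amp' left over from '&amp;' (note: this also strips 'amp' inside longer words)
    text = re.sub(r'amp', '', text)
    # Replace special characters with spaces, lowercase, and collapse repeated whitespace
    text = re.sub(r'\s+', ' ', re.sub('[^A-Za-z0-9]', ' ', text.strip().lower())).strip()
    # Drop the stopwords
    text = [word for word in text.split() if word not in forbidden_words]
    return ' '.join(text)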

With a lambda function, we can apply text_cleaning to the entire ensemble column. We can then select the data soup of an arbitrary product by calling iloc with the index of that record.

# Apply the text cleaning function to each row
df['ensemble'] = df['ensemble'].apply(lambda text: text_cleaning(text))

# Print the row at index 10000
df['ensemble'].iloc[10000]

vcool vga cooler electronics computer accessories
computer component fans antec cooling box fans
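To experiment with the cleaning step without downloading the NLTK corpus, the idea can be sketched with a tiny hand-written stopword set. Everything below is illustrative: `STOPWORDS` is a stand-in for NLTK's English list, `clean_text` is a hypothetical simplified helper (it skips the amp removal), and the sample string is made up.

```python
import re

# Tiny stand-in stopword list (an assumption; the article uses NLTK's English list)
STOPWORDS = {"the", "and", "for", "with", "of", "a", "in"}

def clean_text(text, stopwords=STOPWORDS):
    # Replace every non-alphanumeric character with a space, then lowercase
    text = re.sub(r"[^A-Za-z0-9]", " ", text).lower()
    # split() also collapses runs of whitespace
    words = [word for word in text.split() if word not in stopwords]
    return " ".join(words)

soup = "Vcool VGA Cooler [['Electronics', 'Computers & Accessories']] Antec"
print(clean_text(soup))
```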

The record at the 10001st row (indexing starts at 0) is the vcool VGA Cooler from Antec. This is a scenario where the brand name was not included in the title.

Cosine calculation and recommendation function

Calculating the cosine similarity starts with building a matrix containing all the words that appear in the ensemble column. The method we will use is called "count vectorization" or, more commonly, "bag of words". If you want to read more about count vectorization, you can read one of my previous articles.
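To show what a bag of words looks like, here is a minimal sketch in plain Python on two made-up strings. It only illustrates the idea; scikit-learn's CountVectorizer applies its own tokenization rules and stores the result as a sparse matrix, and the helper name `to_count_vector` is hypothetical.

```python
from collections import Counter

# Two made-up, already-cleaned data soups (illustrative only)
docs = [
    "vga cooler computer fans antec",
    "computer case fans antec antec",
]

# The vocabulary is every distinct word, sorted so each word gets a fixed column
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc, vocabulary):
    # Count how often each vocabulary word occurs in the document
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

count_matrix = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row is one product and each column is one word, so two products that share many words end up with vectors pointing in similar directions.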

Due to RAM limitations, the cosine similarity score is only calculated for the first 35,000 records of the 142,000 available after the pre-processing phase. This most likely affects the final performance of the recommender.
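A rough back-of-the-envelope calculation shows why the row limit matters. The sketch below assumes the similarity matrix is stored densely with 8-byte floating point values, which is an assumption about the setup rather than something measured in the article.

```python
def dense_matrix_gib(n_rows, bytes_per_value=8):
    """Memory needed for an n_rows x n_rows dense matrix, in GiB."""
    return n_rows * n_rows * bytes_per_value / 1024**3

for n in (35_000, 142_000):
    print(f"{n:>7,} rows -> {dense_matrix_gib(n):,.1f} GiB")
```

Because the matrix grows with the square of the number of rows, going from 35,000 to 142,000 products multiplies the memory requirement by about sixteen under these assumptions.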

# Select the first 35000 rows
df = df.head(35000)

# Create the count_vect object
count_vect = CountVectorizer()

# Create the count matrix
count_matrix = count_vect.fit_transform(df['ensemble'])

# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)

The command cosine_similarity, as the name suggests, calculates the cosine similarity between every pair of rows in count_matrix. Each row in count_matrix is nothing more than a vector containing the count of every word that appears in the ensemble column.
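On a corpus small enough to check by hand, the two steps look like this. The three strings are made up for illustration; the first two share three words, while the third shares none with the others.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up, already-cleaned data soups (illustrative only)
soups = [
    "vga cooler computer fans antec",
    "computer case fans antec",
    "hdmi cable television accessories",
]

# One row per product, one column per distinct word
count_matrix = CountVectorizer().fit_transform(soups)

# Square matrix: entry [i, j] is the similarity between product i and product j
cosine_sim = cosine_similarity(count_matrix, count_matrix)

print(cosine_sim.round(2))
```

The diagonal is always 1 because every product is identical to itself, which is why the recommender later has to skip the query product.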

# Create a pandas Series that maps each product title to its row index in df
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

Before running the actual recommender system, we must make sure that an index is created and that this index does not contain duplicates.
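One detail is worth checking here: Series.drop_duplicates() compares the values of the Series (the row numbers in this case), not the titles stored in its index, so repeated titles can survive it. A possible way to guarantee one row per title is sketched below on a hypothetical toy frame; the names `df_toy` and `indices_toy` are made up for the example.

```python
import pandas as pd

# Hypothetical toy frame with a repeated title
df_toy = pd.DataFrame({"title": ["Fan A", "Cooler B", "Fan A"]})

# Map each title to its row position
indices_toy = pd.Series(df_toy.index, index=df_toy["title"])

# Keep only the first row for each title so a lookup returns a single integer
indices_toy = indices_toy[~indices_toy.index.duplicated(keep="first")]

print(indices_toy["Fan A"])
```

With a unique index, looking up a title always yields a single position instead of several.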

Only at this point can we define the content_recommender function. It has 4 arguments: title, cosine_sim, df, and indices. The title is the only element that must be entered when calling the function.

content_recommender works as follows:

  • It finds the index of the product associated with the title the user provides
  • It looks up that index in the cosine similarity matrix and collects the scores of all products
  • It sorts all the scores from the most similar product (closest to 1) to the least similar (closest to 0)
  • It keeps only the top 30 most similar products
  • It adds an index and returns a pandas Series as a result

# Function that takes the product title as input and gives recommendations
def content_recommender(title, cosine_sim=cosine_sim, df=df, indices=indices):

    # Get the index of the product that matches the title
    idx = indices[title]

    # Get the similarity scores of all products with this product
    # and convert them into a list of (index, score) tuples
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the products based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 30 most similar products. Ignore the first product,
    # which is the queried product itself (positions 1 to 30 inclusive).
    sim_scores = sim_scores[1:31]

    # Get the product indices
    product_indices = [i[0] for i in sim_scores]

    # Return the 30 most similar products
    return df['title'].iloc[product_indices]
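The steps above can be tried on toy data without loading the dataset. The sketch below uses made-up titles and a hand-written similarity matrix, and the function name `top_similar` is hypothetical; it illustrates the ranking logic, not the article's exact function.

```python
# Hypothetical titles and a hand-written, symmetric similarity matrix
titles = ["VGA cooler", "Case fan", "Slot fan", "HDMI cable"]
similarity = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.7, 0.2],
    [0.6, 0.7, 1.0, 0.0],
    [0.1, 0.2, 0.0, 1.0],
]

def top_similar(title, titles, similarity, top_n=2):
    idx = titles.index(title)
    # Pair every product position with its score against the query product
    scores = list(enumerate(similarity[idx]))
    # Drop the query product itself, then sort from most to least similar
    scores = [pair for pair in scores if pair[0] != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in scores[:top_n]]

print(top_similar("VGA cooler", titles, similarity))
```

Excluding the query product by its position, rather than slicing off the first sorted element, avoids accidentally dropping a different product that happens to score 1.0 as well.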

Now let's try it on "Vcool VGA Cooler". We want 30 similar products that customers would be interested in buying. When executing the command content_recommender(product_title), the function returns a list of 30 recommendations.

# Define the product for which we want to recommend other items
product_title = 'Vcool VGA Cooler'

# Launch the content_recommender function
recommendations = content_recommender(product_title)

# Associate the titles with the recommendations
asin_recommendations = df[df['title'].isin(recommendations)]

# Merge the datasets
recommendations = pd.merge(recommendations,
                           asin_recommendations,
                           on='title',
                           how='left')

# Show the top 5 recommended products
recommendations['title'].head()

Among the 5 most similar products, we can find other Antec products such as the Tricool computer case fan, the expansion slot fan, and so on.

1 Antec Big Boy 200 - 200 mm Tricool Computer Case Fan
2 Antec Cyclone Blower, expansion slot fan
3 90 x 25mm High Airflow Dual Bearing Computer Chassis Fan with TX3 FAN9X25TX3H Fan (Black)
4 Antec 120MM BLUE LED FAN (transparent)
5 Antec PRO 80MM Pro 80mm fan with 3 and 4-pin connector (discontinued by the manufacturer)

The related column in the original dataset contains a list of products that consumers also bought, bought together, and bought after viewing the VGA cooler.

# Select the 'related' column of the product for which we computed recommendations
related = pd.DataFrame.from_dict(df['related'].iloc[10000], orient='index').transpose()

# Print the first records of the dataset
related.head(11)

When printing the head of the Python dictionary in this column, the console returns the following dataset.

|    | also_bought | bought_together | buy_after_viewing |
|---:|:------------|:----------------|:------------------|
| 0 | B000051299 | B000233ZMU | B000051299 |
| 1 | B000233ZMU | B000051299 | B00552Q7SC |
| 2 | B000I5KSNQ | | B000233ZMU |
| 3 | B00552Q7SC | | B004X90SE2 |
| 4 | B000HVHCKS | | |
| 5 | B0026ZPFCK | | |
| 6 | B009SJR3GS | | |
| 7 | B004X90SE2 | | |
| 8 | B001NPEBEC | | |
| 9 | B002DUKPN2 | | |
| 10 | B00066FH1U | | |

Let's test whether our recommender got it right. Let's see if any of the asin ids in the also_bought list are included in the recommendations.

# Check whether the products in the 'also_bought' column were recommended,
# for the final evaluation of the recommender


Our recommender correctly suggested 5 out of 44 products.

[True False True False False False False False False False True False False False False False False False True False False False False False False False False False True False False False False False False False False False False False False False False]
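A check of this kind can be sketched with plain Python membership tests. The first list reuses ASINs from the first column of the related table above, while `recommended_asins` is entirely made up for illustration, so the resulting counts say nothing about the real recommender.

```python
# ASINs taken from the first column of the related table above
also_bought = ["B000051299", "B000233ZMU", "B000I5KSNQ", "B00552Q7SC"]

# Hypothetical ASINs returned by a recommender (made up for illustration)
recommended_asins = {"B000233ZMU", "B00552Q7SC", "B0000000XX"}

# One boolean per also_bought item: was it recommended?
hits = [asin in recommended_asins for asin in also_bought]
hit_count = sum(hits)

print(hits)
print(f"{hit_count} of {len(also_bought)} also_bought products were recommended")
```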

I agree it is not an ideal result, but considering that we only used 35,000 of the 498,196 rows available in the full dataset, it is acceptable. It certainly has a lot of room for improvement. If NaN values in the target columns were less common or non-existent, the recommendations could be more accurate and closer to Amazon's actual ones. Second, access to more RAM, or even distributed computing, could allow the practitioner to compute even larger matrices.

I hope you liked the project and that it will be useful for future uses.

As mentioned in the article, the end result can be further improved by including all the rows of the dataset in the cosine similarity matrix. In addition, we could add each product's average review score by merging the metadata dataset with the other datasets available in the repository. We could also include the price when calculating the cosine similarity. Another potential improvement could be building a recommender system based entirely on each product's description images.

The main solutions for further improvement have been listed. Most of them are worth pursuing from the perspective of a future implementation in production.

Finally, I would like to end this article by thanking Medium for implementing such a useful content sharing feature for developers on the platform.

print('Thanks Medium!')

As a final note, if you enjoyed the content, please consider leaving a follow to be notified when new articles are published. If you have any comments about the article, write them in the comments! I would love to read them :) Thanks for reading!

PS If you like what I write, it would mean a lot to me if you could sign up for a Medium membership. Membership gives you the incredible value that Medium articles provide and is an indirect way to support my content!

[1] Amazon's Product Recommendation System in 2021: How Does the Algorithm of the eCommerce Giant Work? Recostream. (2021). Retrieved November 1, 2022, from the Recostream website.

[2] He, R., & McAuley, J. (2016, April). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web (pp. 507-517).

[3] McAuley, J., Targett, C., Shi, Q., & Van Den Hengel, A. (2015, August). Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 43-52).

[4] Rahutomo, F., Kitasuka, T., & Aritsugi, M. (2012, October). Semantic cosine similarity. In The 7th International Student Conference on Advanced Science and Technology ICAST (Vol. 4, No. 1, p. 1).

[5] Banik, R. (2018). Hands-On Recommendation Systems with Python: Start building powerful and personalized, recommendation engines with Python. Packt Publishing.


