Market Basket Analysis

The problem of finding the association between customers’ basket items is a common one in the field of market basket analysis. This type of analysis can help businesses understand the purchasing behavior of their customers, and the relationship between different products. By identifying these associations, businesses can make informed decisions on product placement, promotions, and inventory management. Solving this problem can have significant benefits for a business, including increased sales, improved customer satisfaction, and a more efficient supply chain.

In this article, we will introduce the Apriori algorithm, a widely used method for finding association rules in transaction data, and provide a step-by-step explanation of how it works. We will also include a code implementation in Python to solve the problem of finding associations between customers’ basket items.

Apriori Algorithm

Apriori is an association rule mining algorithm that is widely used for finding associations in transaction data. It is based on the concept of frequent item sets, which are items that appear frequently together in the dataset.

The algorithm uses the metrics of support, confidence, and lift to measure the strength of the association between items.

Support refers to the percentage of transactions in the dataset that contain a particular item set. For example, if an item set appears in 100 transactions out of 1,000, its support is 10%. Support measures the popularity of an item set and is compared against the minimum support threshold: the minimum frequency an item set must reach to be considered frequent.

Confidence measures the likelihood that an item set will be purchased if another item set is purchased. It is expressed as a percentage and is calculated by dividing the number of transactions containing both item sets by the number of transactions containing the antecedent item set. For example, if a customer purchases item A and item B together 100 times out of 1,000 transactions, and item A is purchased in 200 transactions, the confidence of the association between A and B would be 50%.

Lift measures the ratio of the observed support to the expected support if the items were independent. It is used to determine the strength of the association between two items. Lift is calculated by dividing the confidence of the association between two items by the support of the consequent item set. For example, if the lift between items A and B is 2, it means that the likelihood of purchasing item B when item A is purchased is twice as high as expected if the items were independent.
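To make these three metrics concrete, here is a minimal sketch that computes them from a toy list of baskets. The baskets and item names are invented for illustration; they are not part of the dataset used later in the article.

```python
# Toy transactions (hypothetical baskets, for illustration only)
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's own support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"A", "B"}, transactions))       # 3/5 = 0.6
print(confidence({"A"}, {"B"}, transactions))  # 0.6 / 0.8 = 0.75
print(lift({"A"}, {"B"}, transactions))        # 0.75 / 0.8 = 0.9375
```

Note that the lift here is below 1, meaning A and B co-occur slightly less often than they would if they were independent.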

Apriori works by first generating a list of candidate item sets and then pruning the list to retain only the frequent ones. The algorithm starts with a list of individual items and generates candidate item sets by combining items in the list. It then counts the support for each candidate and removes those that do not meet the minimum support threshold. This process is repeated for larger item sets until no more frequent item sets can be generated.
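The level-wise search described above can be sketched in a few lines of plain Python. This is a simplified illustration on toy data, not the optimized implementation used later in the article (the classic algorithm also prunes candidates whose subsets are infrequent; this sketch checks support directly).

```python
def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise search: grow candidate item sets one item at a time,
    keeping only those whose support meets the threshold."""
    n = len(transactions)
    # Level 1: frequent individual items
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-item sets into k-item sets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Pruning: drop candidates below the minimum support threshold
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori_frequent_itemsets(baskets, min_support=0.6))
```

With a 0.6 threshold on these toy baskets, all three single items and all three pairs are frequent, but the triple {A, B, C} (support 0.4) is pruned.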

Once the frequent item sets have been identified, the Apriori algorithm generates association rules from them. For each frequent item set, the algorithm enumerates all possible rules that split the set into an antecedent and a consequent, and calculates their support, confidence, and lift. The rules with the highest confidence or lift are then selected as the final rules.
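The rule-generation step can be sketched as follows: for each frequent item set, every non-empty proper subset is tried as an antecedent. The toy baskets and the `supp` helper are illustrative assumptions, not part of the article's dataset.

```python
from itertools import combinations

def generate_rules(frequent_itemsets, transactions, min_confidence):
    """For each frequent item set, try every non-empty proper subset as the
    antecedent and keep rules whose confidence meets the threshold."""
    n = len(transactions)

    def supp(s):
        return sum(s <= t for t in transactions) / n

    rules = []
    for itemset in frequent_itemsets:
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = supp(itemset) / supp(antecedent)
                if conf >= min_confidence:
                    lift = conf / supp(consequent)
                    rules.append((antecedent, consequent, conf, lift))
    return rules

baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(generate_rules([frozenset({"A", "B"})], baskets, min_confidence=0.5))
```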

The Apriori algorithm is an efficient method for finding associations in large transaction datasets. However, it is sensitive to the choice of the minimum support threshold and may produce a large number of rules. To reduce the number of rules, a minimum confidence or lift threshold can be applied when generating the rules from the frequent item sets.

Algorithm Implementation

In this section, we will implement the algorithm on a transaction dataset. You can find the dataset here.

Let’s import the necessary libraries, then read and explore the data.

import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

data = pd.read_excel("./Data.xlsx")

print("Shape: ", data.shape)
print("Number of different items in the dataset: ", data['Itemname'].nunique())
data.info()

Shape:  (522064, 7)
Number of different items in the dataset:  4185
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522064 entries, 0 to 522063
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   BillNo      522064 non-null  object
 1   Itemname    520609 non-null  object
 2   Quantity    522064 non-null  int64
 3   Date        522064 non-null  datetime64[ns]
 4   Price       522064 non-null  float64
 5   CustomerID  388023 non-null  float64
 6   Country     522064 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)
memory usage: 27.9+ MB

The dataset has more than 500k rows, and around 4200 different items, which makes applying Apriori on it very computationally expensive. Let’s explore the data more:

print("The number of different countries present in the dataset:", data['Country'].nunique())
print("Top 10 countries saleswise:")
print(data['Country'].value_counts().head(10))

The number of different countries present in the dataset: 30
Top 10 countries saleswise:
United Kingdom    487622
Germany             9042
France              8408
Spain               2485
Netherlands         2363
Belgium             2031
Switzerland         1967
Portugal            1501
Australia           1185
Norway              1072
Name: Country, dtype: int64

Let's assume the dataset represents the sales of a supermarket located in the United Kingdom. To reduce the dataset size, we will analyze only the international sales and look for association rules in that part of the dataset.

data = data[data['Country'] != 'United Kingdom']
print("Shape: ", data.shape)
print("Number of different items in the dataset:", data['Itemname'].nunique())

Shape:  (34442, 7)
Number of different items in the dataset: 2613

We have reduced the size of the data to less than 35k rows, with around 2500 different items to work with.

Now it is time to reshape the dataframe into an (n × m) matrix, where n is the number of distinct values in the 'BillNo' column and m is the number of distinct items in the 'Itemname' column. After executing this code, the value in cell (i, j) will be 1 (True) if bill i includes item j, and 0 (False) otherwise. Finally, we will drop the "POSTAGE" column, as it doesn't make sense to include it in our analysis.

reshaped_data = data.groupby(['BillNo', 'Itemname'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index("BillNo")

def encode(x):
    if x <= 0:
        return 0
    return 1

reshaped_data = reshaped_data.applymap(encode)
reshaped_data.drop("POSTAGE", inplace=True, axis=1)
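As a side note, the element-wise encode() step can also be done with a vectorized comparison, which is faster on large frames. The small dataframe below is a hypothetical stand-in for reshaped_data, just to illustrate the idea.

```python
import pandas as pd

# A small stand-in for `reshaped_data`: quantity sums per (bill, item)
quantities = pd.DataFrame(
    {"TEA CUP": [2, 0, 1], "SAUCER": [0, 3, 0]},
    index=["B1", "B2", "B3"],
)

# Vectorized equivalent of the element-wise encode() step:
# any positive quantity becomes 1, everything else 0
encoded = (quantities > 0).astype(int)
print(encoded)
```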

After completing the data cleaning and reshaping, we will apply Apriori algorithm to find associations between different items in the dataset:

model = apriori(reshaped_data, min_support=0.05, use_colnames=True)
rules = association_rules(model, metric="lift", min_threshold=0.5)
rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
print(rules)
    antecedents                         consequents                          support   confidence  lift
0   SPACEBOY LUNCH BOX                  DOLLY GIRL LUNCH BOX                 0.064244  0.540984    6.363784
1   DOLLY GIRL LUNCH BOX                SPACEBOY LUNCH BOX                   0.064244  0.755725    6.363784
2   PLASTERS IN TIN CIRCUS PARADE       PLASTERS IN TIN SPACEBOY             0.059701  0.502732    4.530470
3   PLASTERS IN TIN SPACEBOY            PLASTERS IN TIN CIRCUS PARADE        0.059701  0.538012    4.530470
4   PLASTERS IN TIN WOODLAND ANIMALS    PLASTERS IN TIN CIRCUS PARADE        0.072680  0.543689    4.578280
5   PLASTERS IN TIN CIRCUS PARADE       PLASTERS IN TIN WOODLAND ANIMALS     0.072680  0.612022    4.578280
6   PLASTERS IN TIN WOODLAND ANIMALS    PLASTERS IN TIN SPACEBOY             0.074627  0.558252    5.030801
7   PLASTERS IN TIN SPACEBOY            PLASTERS IN TIN WOODLAND ANIMALS     0.074627  0.672515    5.030801
8   PLASTERS IN TIN WOODLAND ANIMALS    ROUND SNACK BOXES SET OF4 WOODLAND   0.059701  0.446602    2.332927
9   ROUND SNACK BOXES SET OF4 WOODLAND  PLASTERS IN TIN WOODLAND ANIMALS     0.059701  0.311864    2.332927
10  ROUND SNACK BOXES SET OF4 WOODLAND  ROUND SNACK BOXES SET OF 4 FRUITS    0.089552  0.467797    4.004859
11  ROUND SNACK BOXES SET OF 4 FRUITS   ROUND SNACK BOXES SET OF4 WOODLAND   0.089552  0.766667    4.004859
12  SPACEBOY LUNCH BOX                  ROUND SNACK BOXES SET OF4 WOODLAND   0.064244  0.540984    2.825952
13  ROUND SNACK BOXES SET OF4 WOODLAND  SPACEBOY LUNCH BOX                   0.064244  0.335593    2.825952
14  SET/6 RED SPOTTY PAPER CUPS         SET/6 RED SPOTTY PAPER PLATES        0.057106  0.880000    13.697778
15  SET/6 RED SPOTTY PAPER PLATES       SET/6 RED SPOTTY PAPER CUPS          0.057106  0.888889    13.697778

We apply the apriori function to 'reshaped_data' with min_support set to 0.05; in other words, we only make recommendations on items that appear in more than 5% of the bills. use_colnames=True means the results will contain item names rather than column indices.

Then we generate the association rules, the next step in applying the Apriori algorithm. We chose the "lift" metric here because we want to prioritize association rules with a stronger relationship between the items, and we set the threshold to 0.5, meaning we keep only rules with a lift value of 0.5 or more. You can try other metrics (support or confidence) and different thresholds to control your output: choose the 'support' metric to prioritize rules by their frequency in the dataset, and the 'confidence' metric to prioritize rules by the reliability of the relationship between the antecedent and consequent items. Adjusting the thresholds is an effective way to control the number of rules in your output.
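Thresholds can also be tightened after the fact by filtering the rules dataframe directly. The sketch below uses a mock table with the same columns as the real output (the item names and values are made up), so it runs on its own.

```python
import pandas as pd

# A mock rules table with the same columns as the real output (made-up values)
rules = pd.DataFrame({
    "antecedents": ["ITEM A", "ITEM B", "ITEM C"],
    "consequents": ["ITEM B", "ITEM A", "ITEM D"],
    "support": [0.064, 0.064, 0.090],
    "confidence": [0.54, 0.76, 0.47],
    "lift": [6.36, 6.36, 4.00],
})

# Tighten the thresholds after the fact: keep reliable rules and rank by lift
strong = (rules[(rules["confidence"] >= 0.5) & (rules["lift"] >= 2)]
          .sort_values("lift", ascending=False))
print(strong)
```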

Final Words

In conclusion, this article aimed to explain the problem of finding associations between customers' basket items and how the Apriori algorithm can be used to solve it. Apriori is an association rule mining algorithm that finds frequent item sets in a transaction database and generates association rules from them. The article explained the metrics used by the algorithm, such as support, confidence, and lift, and walked through the steps of its working process. Finally, the article provided Python code that implements the Apriori algorithm and can be used to find association rules between customers' basket items. By using the Apriori algorithm, organizations can gain insight into the buying patterns of their customers and make informed decisions that improve their marketing and sales strategies.
