The problem of finding the association between customers’ basket items is a common one in the field of market basket analysis. This type of analysis can help businesses understand the purchasing behavior of their customers, and the relationship between different products. By identifying these associations, businesses can make informed decisions on product placement, promotions, and inventory management. Solving this problem can have significant benefits for a business, including increased sales, improved customer satisfaction, and a more efficient supply chain.
In this article, we will introduce the Apriori algorithm, a widely used method for finding association rules in transaction data, and provide a step-by-step explanation of how it works. We will also include a code implementation in Python to solve the problem of finding associations between customers’ basket items.
Apriori Algorithm
Apriori is an association rule mining algorithm that is widely used for finding associations in transaction data. It is based on the concept of frequent item sets, which are items that appear frequently together in the dataset.
The algorithm uses the metrics of support, confidence, and lift to measure the strength of the association between items.
Support refers to the percentage of transactions in the dataset that contain a particular item set. For example, if an item set appears in 100 transactions out of 1,000, its support is 10%. Support measures the popularity of an item set and underpins the minimum support threshold: the minimum frequency an item set must reach to be considered frequent.
Confidence measures the likelihood that the consequent item set is purchased given that the antecedent item set is purchased. It is expressed as a percentage and is calculated by dividing the number of transactions containing both item sets by the number of transactions containing the antecedent item set. For example, if a customer purchases item A and item B together 100 times out of 1,000 transactions, and item A is purchased in 200 transactions, the confidence of the rule A → B is 100 / 200 = 50%.
Lift measures the ratio of the observed support to the expected support if the items were independent. It is used to determine the strength of the association between two items. Lift is calculated by dividing the confidence of the association between two items by the support of the consequent item set. For example, if the lift between items A and B is 2, it means that the likelihood of purchasing item B when item A is purchased is twice as high as expected if the items were independent.
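Putting the three metrics together, here is a minimal sketch that computes them by hand on a tiny, made-up set of baskets (the items and numbers are illustrative only, not from any real dataset):

```python
# Toy basket data: each set is one customer's transaction.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B"},
    {"A"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

# Metrics for the rule A -> B:
sup_ab = support({"A", "B"})          # {A, B} appears in 2 of 5 baskets -> 0.4
conf_a_b = sup_ab / support({"A"})    # confidence = support(A, B) / support(A) = 0.5
lift_a_b = conf_a_b / support({"B"})  # lift = confidence / support(B) ~ 0.83

print(sup_ab, conf_a_b, lift_a_b)
```

Here the lift is below 1, meaning A and B co-occur slightly less often than they would if purchases were independent.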
Apriori works by first generating a list of candidate itemsets and then pruning the list to retain only the frequent item sets. The algorithm starts with a list of individual items and generates candidate itemsets by combining items in the list. The algorithm then counts the support for each candidate item set and removes those that do not meet the minimum support threshold. This process is repeated for larger item sets until no more frequent item sets can be generated.
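The level-wise generate-and-prune loop described above can be sketched in plain Python as follows. This is a simplified illustration of the idea, not the optimized implementation used by libraries such as mlxtend:

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    # Level 1: candidate itemsets are the individual items.
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while current:
        # Count support for each candidate and keep only the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Generate (k+1)-item candidates by joining frequent k-itemsets,
        # pruning any candidate with an infrequent k-subset (the Apriori property).
        k += 1
        candidates = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in level for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        current = candidates
    return frequent

transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}, {"A"}]
freq = apriori_frequent(transactions, min_support=0.4)
print(freq)
```

On this toy data, {A, B} and {A, C} survive as frequent pairs, while {B, C} is pruned because it appears in only one of the five baskets.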
Once the frequent item sets have been identified, the Apriori algorithm generates association rules by combining frequent items in the dataset. For each frequent item set, the algorithm generates all possible association rules and calculates their support, confidence, and lift. The rules with the highest confidence or lift are then selected as the final rules.
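Rule generation from a single frequent item set can be sketched like this. The support table below is hypothetical and exists only to make the snippet self-contained:

```python
from itertools import combinations

# Hypothetical supports for illustration (fractions of all baskets).
supports = {
    frozenset({"A"}): 0.8,
    frozenset({"B"}): 0.6,
    frozenset({"A", "B"}): 0.4,
}

def rules_from(itemset, supports, min_confidence=0.4):
    """Split a frequent itemset into antecedent -> consequent rules."""
    out = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            confidence = supports[itemset] / supports[antecedent]
            lift = confidence / supports[consequent]
            if confidence >= min_confidence:
                out.append((set(antecedent), set(consequent), confidence, lift))
    return out

generated = rules_from(frozenset({"A", "B"}), supports)
print(generated)
```

For the pair {A, B} this yields both directions, A → B and B → A, each with its own confidence but the same lift, which mirrors the symmetric lift values in the results table later in the article.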
The Apriori algorithm is an efficient method for finding associations in large transaction datasets. However, it is sensitive to the choice of the minimum support threshold and may produce a large number of rules. To reduce the number of rules, a minimum confidence (or lift) threshold can be applied on top of the minimum support threshold.
Algorithm Implementation
In this section, we will implement the algorithm on a transaction dataset. You can find the dataset here.
Let’s import the necessary libraries, then read and explore the data.
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

data = pd.read_excel("./Data.xlsx")
print("Shape: ", data.shape)
print("Number of different items in the dataset: ", data['Itemname'].nunique())
data.info()
Shape: (522064, 7)
Number of different items in the dataset: 4185
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522064 entries, 0 to 522063
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
0 BillNo 522064 non-null object
1 Itemname 520609 non-null object
2 Quantity 522064 non-null int64
3 Date 522064 non-null datetime64[ns]
4 Price 522064 non-null float64
5 CustomerID 388023 non-null float64
6 Country 522064 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)
memory usage: 27.9+ MB
The dataset has more than 500k rows, and around 4200 different items, which makes applying Apriori on it very computationally expensive. Let’s explore the data more:
print("The number of different countries present in the dataset:", data['Country'].nunique())
print("Top 10 countries saleswise:")
print(data['Country'].value_counts().head(10))
The number of different countries present in the dataset: 30
Top 10 countries saleswise:
United Kingdom 487622
Germany 9042
France 8408
Spain 2485
Netherlands 2363
Belgium 2031
Switzerland 1967
Portugal 1501
Australia 1185
Norway 1072
Name: Country, dtype: int64
Let’s assume that the dataset represents the sales of a supermarket located in the United Kingdom. To reduce the dataset size, we will analyze only the international sales and look for association rules in that subset.
data = data[data['Country'] != 'United Kingdom']
print("Shape: ", data.shape)
print("Number of different items in the dataset:", data['Itemname'].nunique())
Shape: (34442, 7)
Number of different items in the dataset: 2613
We have reduced the size of the data to less than 35k rows, with around 2,600 different items to work with.
Now it is time to reshape the dataframe into an (n × m) matrix, where n is the number of distinct values in the ‘BillNo’ column and m is the number of distinct items in the ‘Itemname’ column. After executing the code below, the value in cell (i, j) will be 1 (True) if bill i includes item j, and 0 (False) otherwise. Finally, we will drop the “POSTAGE” column, since postage is a shipping charge rather than a basket item and doesn’t make sense to include in our algorithm.
reshaped_data = data.groupby(['BillNo', 'Itemname'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index("BillNo")

def encode(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

reshaped_data = reshaped_data.applymap(encode)
reshaped_data.drop("POSTAGE", inplace=True, axis=1)
After completing the data cleaning and reshaping, we will apply Apriori algorithm to find associations between different items in the dataset:
model = apriori(reshaped_data, min_support=0.05, use_colnames=True)
rules = association_rules(model, metric="lift", min_threshold=0.5)
rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
print(rules)
Idx | antecedents | consequents | support | confidence | lift |
0 | SPACEBOY LUNCH BOX | DOLLY GIRL LUNCH BOX | 0.064244 | 0.540984 | 6.363784 |
1 | DOLLY GIRL LUNCH BOX | SPACEBOY LUNCH BOX | 0.064244 | 0.755725 | 6.363784 |
2 | PLASTERS IN TIN CIRCUS PARADE | PLASTERS IN TIN SPACEBOY | 0.059701 | 0.502732 | 4.530470 |
3 | PLASTERS IN TIN SPACEBOY | PLASTERS IN TIN CIRCUS PARADE | 0.059701 | 0.538012 | 4.530470 |
4 | PLASTERS IN TIN WOODLAND ANIMALS | PLASTERS IN TIN CIRCUS PARADE | 0.072680 | 0.543689 | 4.578280 |
5 | PLASTERS IN TIN CIRCUS PARADE | PLASTERS IN TIN WOODLAND ANIMALS | 0.072680 | 0.612022 | 4.578280 |
6 | PLASTERS IN TIN WOODLAND ANIMALS | PLASTERS IN TIN SPACEBOY | 0.074627 | 0.558252 | 5.030801 |
7 | PLASTERS IN TIN SPACEBOY | PLASTERS IN TIN WOODLAND ANIMALS | 0.074627 | 0.672515 | 5.030801 |
8 | PLASTERS IN TIN WOODLAND ANIMALS | ROUND SNACK BOXES SET OF4 WOODLAND | 0.059701 | 0.446602 | 2.332927 |
9 | ROUND SNACK BOXES SET OF4 WOODLAND | PLASTERS IN TIN WOODLAND ANIMALS | 0.059701 | 0.311864 | 2.332927 |
10 | ROUND SNACK BOXES SET OF4 WOODLAND | ROUND SNACK BOXES SET OF 4 FRUITS | 0.089552 | 0.467797 | 4.004859 |
11 | ROUND SNACK BOXES SET OF 4 FRUITS | ROUND SNACK BOXES SET OF4 WOODLAND | 0.089552 | 0.766667 | 4.004859 |
12 | SPACEBOY LUNCH BOX | ROUND SNACK BOXES SET OF4 WOODLAND | 0.064244 | 0.540984 | 2.825952 |
13 | ROUND SNACK BOXES SET OF4 WOODLAND | SPACEBOY LUNCH BOX | 0.064244 | 0.335593 | 2.825952 |
14 | SET/6 RED SPOTTY PAPER CUPS | SET/6 RED SPOTTY PAPER PLATES | 0.057106 | 0.880000 | 13.697778 |
15 | SET/6 RED SPOTTY PAPER PLATES | SET/6 RED SPOTTY PAPER CUPS | 0.057106 | 0.888889 | 13.697778 |
We apply the apriori algorithm on ‘reshaped_data’ with ‘min_support’ set to 0.05; in other words, we make recommendations only for item sets that appear in more than 5% of the bills. Setting ‘use_colnames=True’ returns the results as item names rather than column indices.
Then, we generate the association rules, the next step in applying the Apriori algorithm. We chose the “lift” metric here because we want to prioritize rules with a stronger relationship between the items, and we set the threshold to 0.5, meaning we keep only rules whose lift is equal to or greater than 0.5 (note that a lift above 1 indicates a positive association). You can try other metrics (support or confidence) and different thresholds to control the output: choose the ‘support’ metric to prioritize rules by their frequency in the dataset, and the ‘confidence’ metric to prioritize rules by the reliability of the relationship between the antecedent and consequent items. Adjusting the thresholds is an effective way to control the number of rules in the output.
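As an illustration of threshold tuning, the following sketch filters a rules table by a minimum confidence and sorts the survivors by lift. The DataFrame here is a hand-made stand-in with made-up values, not the real output of association_rules:

```python
import pandas as pd

# Toy rules table mirroring the columns we selected from association_rules;
# the values are illustrative placeholders, not results from the dataset.
rules = pd.DataFrame({
    "antecedents": ["A", "B", "C"],
    "consequents": ["B", "A", "D"],
    "support":    [0.06, 0.06, 0.09],
    "confidence": [0.54, 0.76, 0.47],
    "lift":       [6.36, 6.36, 4.00],
})

# Tighten the output: keep only high-confidence rules, strongest lift first.
strong = (rules[rules["confidence"] >= 0.5]
          .sort_values("lift", ascending=False)
          .reset_index(drop=True))
print(strong)
```

Raising the confidence cutoff shrinks the rule set to the most reliable associations, which is usually preferable to lowering min_support and re-mining.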
Final Words
In conclusion, this article explained the problem of finding associations between customers’ basket items and how the Apriori algorithm can be used to solve it. Apriori is an association rule mining algorithm that finds frequent item sets in a transaction database and generates association rules from them. We covered the metrics the algorithm relies on (support, confidence, and lift), walked through the steps of its working process, and provided Python code that applies it to find association rules between customers’ basket items. By using the Apriori algorithm, organizations can gain insight into their customers’ buying patterns and make informed decisions that improve their marketing and sales strategies.