Market Basket Analysis

The problem of finding the association between customers’ basket items is a common one in the field of market basket analysis. This type of analysis can help businesses understand the purchasing behavior of their customers, and the relationship between different products. By identifying these associations, businesses can make informed decisions on product placement, promotions, and inventory management. Solving this problem can have significant benefits for a business, including increased sales, improved customer satisfaction, and a more efficient supply chain.

In this article, we will introduce the Apriori algorithm, a widely used method for finding association rules in transaction data, and provide a step-by-step explanation of how it works. We will also include a code implementation in Python to solve the problem of finding associations between customers’ basket items.

Apriori Algorithm

Apriori is an association rule mining algorithm that is widely used for finding associations in transaction data. It is based on the concept of frequent item sets, which are items that appear frequently together in the dataset.

The algorithm uses the metrics of support, confidence, and lift to measure the strength of the association between items.

Support refers to the percentage of transactions in the dataset that contain a particular item set. For example, if an item set appears in 100 transactions out of 1,000, its support is 10%. Support measures the popularity of an item set and is compared against the minimum support threshold: the minimum frequency an item set must reach to be considered frequent.

Confidence measures the likelihood that an item set will be purchased if another item set is purchased. It is expressed as a percentage and is calculated by dividing the number of transactions containing both item sets by the number of transactions containing the antecedent item set. For example, if a customer purchases item A and item B together 100 times out of 1,000 transactions, and item A is purchased in 200 transactions, the confidence of the association between A and B would be 50%.

Lift measures the ratio of the observed support to the expected support if the items were independent. It is used to determine the strength of the association between two items. Lift is calculated by dividing the confidence of the association between two items by the support of the consequent item set. For example, if the lift between items A and B is 2, it means that the likelihood of purchasing item B when item A is purchased is twice as high as expected if the items were independent.
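To make these three metrics concrete, here is a minimal sketch that computes them from a toy list of baskets. The baskets and item names are invented for illustration; they are not part of the dataset used later in the article.

```python
# Toy transactions (hypothetical baskets, for illustration only)
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's own support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"A", "B"}, transactions))       # 3/5 = 0.6
print(confidence({"A"}, {"B"}, transactions))  # 0.6 / 0.8 = 0.75
print(lift({"A"}, {"B"}, transactions))        # 0.75 / 0.8 = 0.9375
```

Note that the lift here is below 1, meaning A and B co-occur slightly less often than they would if they were independent.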

Apriori works by first generating a list of candidate item sets and then pruning the list to retain only the frequent ones. The algorithm starts with a list of individual items and generates candidate item sets by combining items in the list. It then counts the support for each candidate and removes those that do not meet the minimum support threshold. This process is repeated for larger item sets until no more frequent item sets can be generated.
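The level-wise search described above can be sketched in a few lines of plain Python. This is a simplified illustration on toy data, not the optimized implementation used later in the article (the classic algorithm also prunes candidates whose subsets are infrequent; this sketch checks support directly).

```python
def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise search: grow candidate item sets one item at a time,
    keeping only those whose support meets the threshold."""
    n = len(transactions)
    # Level 1: frequent individual items
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-item sets into k-item sets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Pruning: drop candidates below the minimum support threshold
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori_frequent_itemsets(baskets, min_support=0.6))
```

With a 0.6 threshold on these toy baskets, all three single items and all three pairs are frequent, but the triple {A, B, C} (support 0.4) is pruned.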

Once the frequent item sets have been identified, the Apriori algorithm generates association rules from them. For each frequent item set, the algorithm enumerates all possible rules that split the set into an antecedent and a consequent, and calculates their support, confidence, and lift. The rules with the highest confidence or lift are then selected as the final rules.
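The rule-generation step can be sketched as follows: for each frequent item set, every non-empty proper subset is tried as an antecedent. The toy baskets and the `supp` helper are illustrative assumptions, not part of the article's dataset.

```python
from itertools import combinations

def generate_rules(frequent_itemsets, transactions, min_confidence):
    """For each frequent item set, try every non-empty proper subset as the
    antecedent and keep rules whose confidence meets the threshold."""
    n = len(transactions)

    def supp(s):
        return sum(s <= t for t in transactions) / n

    rules = []
    for itemset in frequent_itemsets:
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = supp(itemset) / supp(antecedent)
                if conf >= min_confidence:
                    lift = conf / supp(consequent)
                    rules.append((antecedent, consequent, conf, lift))
    return rules

baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(generate_rules([frozenset({"A", "B"})], baskets, min_confidence=0.5))
```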

The Apriori algorithm is an efficient method for finding associations in large transaction datasets. However, it is sensitive to the choice of the minimum support threshold and may produce a large number of rules. To reduce the number of rules, a minimum confidence or lift threshold can be applied when generating the rules from the frequent item sets.

Algorithm Implementation

In this section, we will implement the algorithm on a transaction dataset. You can find the dataset here.

Let’s import the necessary libraries, then read and explore the data.

import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

data = pd.read_excel("./Data.xlsx")

print("Shape: ", data.shape)
print("Number of different items in the dataset: ", data['Itemname'].nunique())
data.info()

Shape:  (522064, 7)
Number of different items in the dataset:  4185
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522064 entries, 0 to 522063
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   BillNo      522064 non-null  object
 1   Itemname    520609 non-null  object
 2   Quantity    522064 non-null  int64
 3   Date        522064 non-null  datetime64[ns]
 4   Price       522064 non-null  float64
 5   CustomerID  388023 non-null  float64
 6   Country     522064 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)
memory usage: 27.9+ MB

The dataset has more than 500k rows, and around 4200 different items, which makes applying Apriori on it very computationally expensive. Let’s explore the data more:

print("The number of different countries present in the dataset:", data['Country'].nunique())
print("Top 10 countries saleswise:")
print(data['Country'].value_counts().head(10))

The number of different countries present in the dataset: 30
Top 10 countries saleswise:
United Kingdom    487622
Germany             9042
France              8408
Spain               2485
Netherlands         2363
Belgium             2031
Switzerland         1967
Portugal            1501
Australia           1185
Norway              1072
Name: Country, dtype: int64

Let's assume the dataset represents the sales of a supermarket located in the United Kingdom. To reduce the dataset size, we will analyze only the international sales and look for association rules in that part of the dataset.

data = data[data['Country'] != 'United Kingdom']
print("Shape: ", data.shape)
print("Number of different items in the dataset:", data['Itemname'].nunique())

Shape:  (34442, 7)
Number of different items in the dataset: 2613

We have reduced the size of the data to less than 35k rows, with around 2500 different items to work with.

Now it is time to reshape the dataframe into an (n × m) matrix, where n is the number of distinct values in the 'BillNo' column and m is the number of distinct items in the 'Itemname' column. After executing this code, the value in cell (i, j) will be 1 (True) if bill i includes item j, and 0 (False) otherwise. Finally, we will drop the "POSTAGE" column, as it doesn't make sense to include it in our analysis.

reshaped_data = data.groupby(['BillNo', 'Itemname'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index("BillNo")

def encode(x):
    if x <= 0:
        return 0
    return 1

reshaped_data = reshaped_data.applymap(encode)
reshaped_data.drop("POSTAGE", inplace=True, axis=1)
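As a side note, the element-wise encode() step can also be done with a vectorized comparison, which is faster on large frames. The small dataframe below is a hypothetical stand-in for reshaped_data, just to illustrate the idea.

```python
import pandas as pd

# A small stand-in for `reshaped_data`: quantity sums per (bill, item)
quantities = pd.DataFrame(
    {"TEA CUP": [2, 0, 1], "SAUCER": [0, 3, 0]},
    index=["B1", "B2", "B3"],
)

# Vectorized equivalent of the element-wise encode() step:
# any positive quantity becomes 1, everything else 0
encoded = (quantities > 0).astype(int)
print(encoded)
```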

After completing the data cleaning and reshaping, we will apply Apriori algorithm to find associations between different items in the dataset:

model = apriori(reshaped_data, min_support=0.05, use_colnames=True)
rules = association_rules(model, metric="lift", min_threshold=0.5)
rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
print(rules)
    antecedents                         consequents                          support   confidence  lift
0   SPACEBOY LUNCH BOX                  DOLLY GIRL LUNCH BOX                 0.064244  0.540984    6.363784
1   DOLLY GIRL LUNCH BOX                SPACEBOY LUNCH BOX                   0.064244  0.755725    6.363784
2   PLASTERS IN TIN CIRCUS PARADE       PLASTERS IN TIN SPACEBOY             0.059701  0.502732    4.530470
3   PLASTERS IN TIN SPACEBOY            PLASTERS IN TIN CIRCUS PARADE        0.059701  0.538012    4.530470
4   PLASTERS IN TIN WOODLAND ANIMALS    PLASTERS IN TIN CIRCUS PARADE        0.072680  0.543689    4.578280
5   PLASTERS IN TIN CIRCUS PARADE       PLASTERS IN TIN WOODLAND ANIMALS     0.072680  0.612022    4.578280
6   PLASTERS IN TIN WOODLAND ANIMALS    PLASTERS IN TIN SPACEBOY             0.074627  0.558252    5.030801
7   PLASTERS IN TIN SPACEBOY            PLASTERS IN TIN WOODLAND ANIMALS     0.074627  0.672515    5.030801
8   PLASTERS IN TIN WOODLAND ANIMALS    ROUND SNACK BOXES SET OF4 WOODLAND   0.059701  0.446602    2.332927
9   ROUND SNACK BOXES SET OF4 WOODLAND  PLASTERS IN TIN WOODLAND ANIMALS     0.059701  0.311864    2.332927
10  ROUND SNACK BOXES SET OF4 WOODLAND  ROUND SNACK BOXES SET OF 4 FRUITS    0.089552  0.467797    4.004859
11  ROUND SNACK BOXES SET OF 4 FRUITS   ROUND SNACK BOXES SET OF4 WOODLAND   0.089552  0.766667    4.004859
12  SPACEBOY LUNCH BOX                  ROUND SNACK BOXES SET OF4 WOODLAND   0.064244  0.540984    2.825952
13  ROUND SNACK BOXES SET OF4 WOODLAND  SPACEBOY LUNCH BOX                   0.064244  0.335593    2.825952
14  SET/6 RED SPOTTY PAPER CUPS         SET/6 RED SPOTTY PAPER PLATES        0.057106  0.880000    13.697778
15  SET/6 RED SPOTTY PAPER PLATES       SET/6 RED SPOTTY PAPER CUPS          0.057106  0.888889    13.697778

We apply the apriori function to 'reshaped_data' with min_support set to 0.05; in other words, we only make recommendations on items that appear in more than 5% of the bills. use_colnames=True means the results will contain item names rather than column indices.

Then we generate the association rules, the next step in applying the Apriori algorithm. We chose the "lift" metric here because we want to prioritize association rules with a stronger relationship between the items, and we set the threshold to 0.5, meaning we keep only rules with a lift value of 0.5 or more. You can try other metrics (support or confidence) and different thresholds to control your output: choose the 'support' metric to prioritize rules by their frequency in the dataset, and the 'confidence' metric to prioritize rules by the reliability of the relationship between the antecedent and consequent items. Adjusting the thresholds is an effective way to control the number of rules in your output.
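Thresholds can also be tightened after the fact by filtering the rules dataframe directly. The sketch below uses a mock table with the same columns as the real output (the item names and values are made up), so it runs on its own.

```python
import pandas as pd

# A mock rules table with the same columns as the real output (made-up values)
rules = pd.DataFrame({
    "antecedents": ["ITEM A", "ITEM B", "ITEM C"],
    "consequents": ["ITEM B", "ITEM A", "ITEM D"],
    "support": [0.064, 0.064, 0.090],
    "confidence": [0.54, 0.76, 0.47],
    "lift": [6.36, 6.36, 4.00],
})

# Tighten the thresholds after the fact: keep reliable rules and rank by lift
strong = (rules[(rules["confidence"] >= 0.5) & (rules["lift"] >= 2)]
          .sort_values("lift", ascending=False))
print(strong)
```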

Final Words

In conclusion, this article aimed to explain the problem of finding associations between customers' basket items and how the Apriori algorithm can be used to solve it. Apriori is an association rule mining algorithm that finds frequent item sets in a transaction database and generates association rules from them. The article explained the metrics used by the algorithm, such as support, confidence, and lift, and walked through the steps of its working process. Finally, the article provided Python code that implements the Apriori algorithm and can be used to find association rules between customers' basket items. By using the Apriori algorithm, organizations can gain insight into the buying patterns of their customers and make informed decisions that improve their marketing and sales strategies.
