Data analysis is becoming increasingly important for companies, as it allows them to gain valuable insights and make informed business decisions. By finding patterns and trends in their data, companies can identify growth opportunities, improve operations, and stay competitive in the market. In today’s article, we will dive deeper into Sales Analytics: we will apply different kinds of analysis to a sales dataset and try to extract helpful insights that the company can use to its benefit.
Choosing a dataset
For this article, we will use a sales dataset for a car company. The dataset contains information about the products, prices, shipments, and more, covering sales made between 2003 and 2005. You can find the dataset here.
Importing the Libraries and the Dataset
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read in the data
df = pd.read_csv("sales_data_sample.csv")
Data Cleaning and Fixing
After storing the dataset in a data frame, we now start the cleaning process.
First, we can get the info of the dataframe:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Order Number       2823 non-null   int64
 1   Quantity           2823 non-null   int64
 2   Price Each         2823 non-null   float64
 3   Order Line Number  2823 non-null   int64
 4   Sales              2823 non-null   float64
 5   Order Date         2823 non-null   object
 6   Status             2823 non-null   object
 7   Quarter ID         2823 non-null   int64
 8   Month ID           2823 non-null   int64
 9   Year ID            2823 non-null   int64
 10  Product Line       2823 non-null   object
 11  MSRP               2823 non-null   int64
 12  Product Code       2823 non-null   object
 13  Customer Name      2823 non-null   object
 14  City               2823 non-null   object
 15  Country            2823 non-null   object
 16  Deal Size          2823 non-null   object
dtypes: float64(2), int64(7), object(8)
memory usage: 375.1+ KB
We see that the dataframe has 2823 rows, and every column has 2823 non-null values, so there are no null values to deal with. Next, we check whether the data frame contains any duplicated rows.
df.drop_duplicates(inplace=True)
len(df)
2823
We still have the same number of rows, so there were no duplicated rows.
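If you prefer to confirm this explicitly rather than infer it from the row count, a minimal check (just a sketch, reusing the df we loaded above) counts missing values and duplicated rows directly:

# Count missing values per column and fully duplicated rows
print(df.isnull().sum())
print("Duplicated rows:", df.duplicated().sum())

Both counts should come out as zero, matching what df.info() and drop_duplicates() already told us.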
The final fixing step is to convert the ‘Order Date’ column to a datetime type using the to_datetime() method from pandas, which allows further time-based analysis with this feature if needed.
df['Order Date'] = pd.to_datetime(df['Order Date'])
Now we can say we have cleaned and fixed the data. Cleaning and fixing data is an important step in the data analysis process: it ensures the data is accurate and consistent, which makes the subsequent analysis more reliable.
Data Visualization and Data Analysis
We now start the second phase: visualizing different features of the data frame and computing some statistics. This step varies from one analyst to another, as there is an enormous number of ways to visualize the data we have. You can copy the code, play around with the features, and plot your own figures for your analysis.
We will start by plotting the company's different product lines and checking how each one performs in sales.
# Understand the popularity of different Product Lines
plt.figure(figsize=(10, 5), dpi=300)
sns.barplot(x='Product Line', y='Sales', data=df)
plt.title("Product Line popularity")
plt.xlabel("Product Line")
plt.ylabel("Sales")
plt.show()
Next, we look at the countries the company does business in and its performance in each of them.
# Understand the distribution of sales across different countries
plt.figure(figsize=(20, 10), dpi=600)
sns.barplot(x='Country', y='Sales', data=df)
plt.title("Country-wise Sales distribution")
plt.xlabel("Country")
plt.ylabel("Sales")
plt.show()
So far, nothing surprising. From the plots above we can read off simple information, such as the product line with the best sales performance (Classic Cars), the product line with the worst sales performance (Trains), and the countries the company works with, along with its performance in each of them.
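One caveat worth keeping in mind: by default, seaborn's barplot shows the mean of ‘Sales’ per category (with an error bar), not the total. If you want to rank product lines and countries by total revenue instead, here is a minimal sketch using the same dataframe:

# Total revenue per product line and per country, highest first
print(df.groupby('Product Line')['Sales'].sum().sort_values(ascending=False))
print(df.groupby('Country')['Sales'].sum().sort_values(ascending=False))

Either view is valid; just be clear about whether a chart shows average order value or total revenue.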
Let's compute some statistics for the period covered by the dataset (2003 to 2005). First, the total revenue of the company:
total_revenue = df['Sales'].sum()
total_revenue
10032628.85
The top 5 customers by average purchase value:
# Average sales per order line for each customer
avg_sales_per_customer = df.groupby('Customer Name')['Sales'].mean()
avg_sales_per_customer = avg_sales_per_customer.sort_values(ascending=False)
print(avg_sales_per_customer.head())
Customer Name
Super Scale Inc. 4674.827647
Mini Caravy 4233.604211
La Corne D’abondance, Co. 4226.246957
Royale Belge 4180.012500
Muscle Machine Inc 4119.519583
Name: Sales, dtype: float64
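Average order value can hide volume, so it is often worth complementing this with the top customers by total spend; a small variant of the same groupby (a sketch, not part of the original analysis):

# Top 5 customers by total purchases
total_sales_per_customer = df.groupby('Customer Name')['Sales'].sum().sort_values(ascending=False)
print(total_sales_per_customer.head())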
The status of the orders:
# Total quantity and sales for each order status
status_data = df.groupby('Status')[['Quantity', 'Sales']].sum()
print(status_data)
            Quantity       Sales
Status
Cancelled       2038   194487.48
Disputed         597    72212.86
In Process      1490   144729.96
On Hold         1879   178979.19
Resolved        1660   150718.28
Shipped        91403  9291501.08
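If you want a single headline number out of this table, for example the share of revenue sitting in each status, it can be derived directly from the status_data frame above; a small sketch:

# Share of total revenue per order status, in percent
status_share = status_data['Sales'] / status_data['Sales'].sum() * 100
print(status_share.round(1))

Unsurprisingly, the vast majority of the revenue belongs to shipped orders.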
The top 5 products by sales:
# Total quantity and sales for each product, sorted by sales
popular_products = df.groupby('Product Code')[['Quantity', 'Sales']].sum()
popular_products = popular_products.sort_values(by='Sales', ascending=False)
print(popular_products.head())
              Quantity      Sales
Product Code
S18_3232          1774  288245.42
S10_1949           961  191073.03
S10_4698           921  170401.07
S12_1108           973  168585.32
S18_2238           966  154623.95
And that's it for this section. We have gained some useful insights from the plots and statistics generated from our data frame. As mentioned before, you can visualize different features or compute other statistics; it depends on your experience with data analysis, the company's plans, and so on.
More Complicated Analysis
Now we will dive deeper into the analysis and produce some more complex plots.
# Group by year and month to calculate the total sales for each year-month
sales_by_year_month = df.groupby(['Year ID', 'Month ID'])['Sales'].sum()

# Create a line chart to visualize the total sales by year and month
plt.figure(figsize=(10, 5), dpi=300)
sns.lineplot(data=sales_by_year_month.reset_index(), x='Month ID', y='Sales', hue='Year ID')
plt.title("Total Sales by Month and Year")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()
Here we check the performance of sales across the years. We can tell that sales go up starting in the 9th month (September) and come down again in the 12th month (December). Two years are not enough to treat the September rise as true "seasonality" in our dataset; however, the performance in 2003 and 2004 is almost identical over that period (September to December), which is a good sign that the data has a seasonal pattern.
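To compare the monthly pattern across years more directly, you can pivot the monthly totals so that each year becomes its own column; a minimal sketch reusing sales_by_year_month from the plot above:

# Rows are months, columns are years; months with no data for a year show up as NaN
monthly_by_year = sales_by_year_month.unstack('Year ID')
print(monthly_by_year)

Putting the years side by side makes it easier to judge how closely the 2003 and 2004 curves track each other.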
Now we will display the top 15 cities by purchases from the company:
# Group by city to calculate the total sales for each city
city_sales = df.groupby('City')['Sales'].sum().sort_values(ascending=False)

# Create a bar chart to visualize the total sales for the top 15 cities
plt.figure(figsize=(10, 8), dpi=300)
sns.barplot(x=city_sales.head(15).index, y=city_sales.head(15).values)
plt.title("Total Sales by City")
plt.xlabel("City")
plt.ylabel("Sales")
plt.xticks(rotation=90)
plt.show()
After that, we will display the 10 best-selling and the 10 worst-selling of the company's products by quantity:
# Group by product code to calculate the total quantity sold for each code
product_code_quantity = df.groupby('Product Code')['Quantity'].sum().sort_values(ascending=False)

# Bar chart of the 10 best-selling product codes by total quantity
plt.figure(figsize=(10, 5), dpi=300)
sns.barplot(x=product_code_quantity.head(10).index, y=product_code_quantity.head(10).values)
plt.title("Total Quantity by Product Code (Top 10)")
plt.xlabel("Product Code")
plt.ylabel("Quantity")
plt.xticks(rotation=90)
plt.show()

# Bar chart of the 10 worst-selling product codes by total quantity
plt.figure(figsize=(10, 5), dpi=300)
sns.barplot(x=product_code_quantity.tail(10).index, y=product_code_quantity.tail(10).values)
plt.title("Total Quantity by Product Code (Bottom 10)")
plt.xlabel("Product Code")
plt.ylabel("Quantity")
plt.xticks(rotation=90)
plt.show()
Finally, we will display some pie charts to understand how sales are distributed across the quarters and months of each year:
# Group by year and quarter to calculate the total sales for each year-quarter
sales_by_year_quarter = df.groupby(['Year ID', 'Quarter ID'])['Sales'].sum()

# Get the unique years in the dataset
years = df['Year ID'].unique()

# Loop through the years and create a pie chart for each one
for year in years:
    # Filter the data to only include the selected year
    filtered_data = sales_by_year_quarter[sales_by_year_quarter.index.get_level_values('Year ID') == year]

    # Label only the quarters that actually appear for this year
    # (a partially covered year may not include all four quarters)
    labels = ['Q' + str(q) for q in filtered_data.index.get_level_values('Quarter ID')]

    # Create a pie chart of the total sales per quarter of the selected year
    plt.figure(figsize=(5, 5), dpi=120)
    plt.pie(filtered_data, labels=labels, autopct='%1.1f%%')
    plt.title("Total Sales by Quarter for " + str(year))
    plt.show()
# Group by year and month to calculate the total sales for each year-month
sales_by_year_month = df.groupby(['Year ID', 'Month ID'])['Sales'].sum()

# Get the unique years in the dataset
years = df['Year ID'].unique()

month_names = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Loop through the years and create a pie chart for each one
for year in years:
    # Filter the data to only include the selected year
    filtered_data = sales_by_year_month[sales_by_year_month.index.get_level_values('Year ID') == year]

    # Label only the months that actually appear for this year
    labels = [month_names[m - 1] for m in filtered_data.index.get_level_values('Month ID')]

    # Create a pie chart of the total sales per month of the selected year
    plt.figure(figsize=(5, 5), dpi=120)
    plt.pie(filtered_data, labels=labels, autopct='%1.1f%%')
    plt.title("Total Sales by Month for " + str(year))
    plt.show()
We will end our analysis here. We have gained useful information and insights that the company can use for future planning, expansion, understanding the nature of customers in different cities and countries, and much more.
Conclusion
We have completed our analysis of the dataset and come up with useful insights and statistics that the company can use for improving and planning. The results show that the company can consider expanding further in Spain, where it already has a good market. It can also increase the readiness of its warehouses and storage units before September, to handle the rise in sales that happens then. More insights can be found, but this is enough for our analysis. You can download the data and try to find new insights yourself!