Univariate and Bivariate Analysis

Univariate and bivariate analysis are two widely used data analysis techniques. They appear in almost every data science project: they help us clean the data, become familiar with it, and find patterns and relationships between the variables in our dataset. In this article, we will explain the concepts of univariate and bivariate analysis in detail, compare them, and apply both methods to an example dataset.

Univariate Analysis

Univariate analysis is a statistical technique used to analyze and understand a single variable. It is a simple form of analysis that looks at one variable at a time, to understand its distribution, central tendency, and dispersion. A famous example of when to use univariate analysis is in quality control, where it can be used to analyze the distribution of measurements for a single quality characteristic, such as the weight or dimensions of a product.

Some advantages of univariate analysis include:

  • It is simple and easy to understand
  • It can quickly identify outliers or anomalies in the data
  • It can provide a baseline for understanding more complex analyses

However, there are also some disadvantages to univariate analysis, including:

  • It only looks at one variable at a time, so it may not provide a complete picture of the data
  • It may not reveal any relationships between variables
  • It may not be able to identify underlying patterns or trends in the data that are only visible when looking at multiple variables simultaneously.

Univariate analysis is a widely used technique in data science and has several main uses, including:

  • Exploratory Data Analysis (EDA): Univariate analysis is often used as a first step in EDA to understand the distribution of a single variable and identify any outliers or anomalies.
  • Data Cleaning: Univariate analysis can be used to identify and correct errors in data, such as missing values or outliers.
  • Feature Selection: In machine learning, univariate analysis can be used to select the most informative features for a model by evaluating the relationship between each feature and the target variable.
  • Model Building: Univariate analysis can support simple model building; understanding a single variable’s distribution is the natural starting point before fitting models such as linear regression that predict one variable from another (strictly speaking, that prediction step is bivariate).
  • Quality Control: Univariate analysis can be used in quality control to ensure that products meet certain specifications by analyzing the distribution of measurements for a single quality characteristic.
  • Descriptive Analysis: Univariate analysis can describe the basic characteristics of a single variable by calculating measures of central tendency (mean, median, mode) and measures of dispersion (range, variance, standard deviation).
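These descriptive measures can be computed directly with pandas. A minimal sketch on a made-up score column (the values below are invented for illustration):

```python
import pandas as pd

# Invented example values standing in for a single numeric variable
scores = pd.Series([55, 62, 66, 66, 70, 74, 81, 90])

central = {
    "mean": scores.mean(),          # arithmetic average
    "median": scores.median(),      # middle value
    "mode": scores.mode().iloc[0],  # most frequent value
}
dispersion = {
    "range": scores.max() - scores.min(),
    "variance": scores.var(),       # sample variance (ddof=1)
    "std": scores.std(),            # sample standard deviation
}
print(central)
print(dispersion)
```

With a real dataset, you would pass a column such as `data["math score"]` instead of the invented series.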

Bivariate Analysis

Bivariate analysis is also a statistical technique, but it is used to analyze and understand the relationship between two variables. It determines whether there is a relationship or association between the two variables and, if so, the strength and direction of that relationship. An example of when to use bivariate analysis is in market research, where it can be used to analyze the relationship between a product’s price and its sales volume.

Some advantages of bivariate analysis include:

  • It can reveal relationships between variables that may not be apparent in univariate analysis.
  • It can help identify potential cause-and-effect relationships.
  • It can be used to predict one variable based on the other.
  • It can identify the relationship between two variables by calculating correlation (Pearson, Spearman, Kendall) and covariance.
  • It can also use visual representations such as scatter plots, box plots, and others to understand the relationship.
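As a sketch of the correlation and covariance measures just mentioned, computed on an invented price/sales sample (none of these numbers come from a real dataset; Spearman is computed here as Pearson on ranks, though pandas also accepts `method="spearman"` and `method="kendall"` directly):

```python
import pandas as pd

# Invented price/sales pairs for illustration
df = pd.DataFrame({
    "price": [10, 12, 15, 18, 20, 25],
    "sales": [200, 180, 150, 120, 100, 60],
})

pearson = df["price"].corr(df["sales"])                 # linear correlation
spearman = df["price"].rank().corr(df["sales"].rank())  # Spearman = Pearson on ranks
cov = df["price"].cov(df["sales"])                      # sample covariance

print(pearson, spearman, cov)
```

Here sales fall strictly as price rises, so the rank-based coefficient is exactly -1 while the Pearson coefficient is close to, but not exactly, -1 because the trend is not perfectly linear.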

As with univariate analysis, there are some disadvantages to bivariate analysis, including:

  • It only looks at two variables at a time, so it may not provide a complete picture of the data.
  • Many bivariate measures, such as the Pearson correlation, assume that the relationship between the two variables is linear, which may not always be the case.
  • It may not be able to identify underlying patterns or trends in the data that are only visible when looking at multiple variables simultaneously.
  • It can only identify correlation, not causality.

The main uses of bivariate analysis include:

  • Exploratory Data Analysis (EDA): Bivariate analysis is often used as a second step in EDA, after univariate analysis, to identify relationships between pairs of variables and understand the distribution of each variable with another variable.
  • Feature Selection: In machine learning, bivariate analysis can be used to select the most informative features for a model by evaluating the relationship between each feature and the target variable.
  • Model Building: Bivariate analysis can be used to build predictive models, such as simple linear regression or logistic regression with a single predictor, that predict the value of one variable based on the value of another.
  • Identifying Outliers: Bivariate analysis can be used to identify outliers in the data by identifying points that deviate from the expected relationship between two variables.
  • Identifying Correlation: Bivariate analysis can be used to identify the correlation between two variables. Correlation coefficients, such as Pearson, Kendall, and Spearman, can be used to determine the strength and direction of the relationship.
  • Identifying Patterns: Bivariate analysis can be used to identify patterns in the data by using visual representations.
  • Identifying Causality: Bivariate analysis can suggest a possible causal relationship between two variables, but correlation does not imply causality; further investigation is necessary to establish causation.
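The "predict one variable based on another" use above can be sketched with a least-squares line fit. The hours-studied/score pairs below are invented purely for illustration:

```python
import numpy as np

# Invented hours-studied vs. exam-score pairs for illustration
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
score = np.array([52, 58, 61, 67, 73, 79], dtype=float)

# Least-squares fit of score ≈ slope * hours + intercept
slope, intercept = np.polyfit(hours, score, 1)

# Use the fitted line to predict the score after 7 hours of study
predicted = slope * 7 + intercept
print(slope, intercept, predicted)
```

This is the simplest bivariate model: one predictor, one response, one straight line.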

Univariate Analysis vs Bivariate Analysis

This section summarizes the last two sections in the form of a comparison table between the two techniques.

Univariate Analysis | Bivariate Analysis
Used to study the distribution and frequency of a single variable; it can identify patterns and trends in the data and test hypotheses about the population from which the sample was drawn. | Used to study the relationship between two variables; it can identify patterns and trends in the data and test hypotheses about the relationship between the two variables.
Graphical methods such as histograms, bar charts, and box plots are commonly used to display the data. | Scatter plots are used to visualize the relationship between the two variables.
Simpler, as it only involves one variable. | More complex, as it involves two variables and requires a deeper understanding of the data.
Can be performed on both numerical and categorical data. | Typically used for numerical data only.

It is also worth mentioning that univariate analysis can be used as a preliminary step in bivariate analysis to understand the distribution of each variable individually before analyzing the relationship between them.

Univariate and Bivariate Analysis in Python

Let’s walk through an example to fully understand how to apply univariate and bivariate analysis to a real dataset, and discuss the insights we can find along the way.

First, we have to get a dataset. For this example, we are going to use a dataset that contains information on students’ performance in high school. You can find and download the dataset here.

We will start our code by importing the needed libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

After that, we will load the dataset.

data = pd.read_csv("StudentsPerformance.csv")
data = data.dropna()  # drop rows with missing values

In our dataset, we have three columns with numerical values (‘math score’, ‘reading score’, and ‘writing score’). We also have four columns with categorical values (‘gender’, ‘parental level of education’, ‘lunch’, and ‘test preparation course’).

We will start with univariate analysis.

# Plot histograms for numerical columns
data.hist(bins=30, figsize=(20,12))
plt.savefig("histograms.png", dpi=300)
plt.show()

We start our analysis by plotting histograms of the three numerical columns of our dataset. All three histograms have a similar shape: a roughly bell-shaped curve that is slightly left-skewed, with most students scoring in the upper range and a small tail of low scores. That is generally a good sign and reflects good overall performance from the students.
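The direction of the skew can be checked numerically with the skewness coefficient. A sketch on a small synthetic sample shaped like these histograms (with the real data, you would call `data[["math score", "reading score", "writing score"]].skew()`):

```python
import pandas as pd

# Synthetic scores clustered near the top with a short tail of low values,
# mimicking the shape of score histograms
sample = pd.Series([25, 55, 60, 65, 68, 70, 72, 75, 78, 80, 85, 90])

skewness = sample.skew()  # adjusted Fisher-Pearson skewness coefficient
print(skewness)  # a negative value indicates left (negative) skew
```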

# Count plots for the categorical columns
fig, axs = plt.subplots(2, 2, figsize=(24,12), dpi=600)
sns.countplot(x='gender', data=data, ax=axs[0, 0])
sns.countplot(x='parental level of education', data=data, ax=axs[0, 1])
sns.countplot(x='lunch', data=data, ax=axs[1, 0])
sns.countplot(x='test preparation course', data=data, ax=axs[1, 1])
plt.show()

Plotting the categorical columns gives us more information. We now know that there are more female than male students in our dataset; parents with master’s and bachelor’s degrees are far fewer than parents with lower education levels; about ⅔ of the students have standard lunch; and roughly the same share of students have not completed the test preparation course.
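These shares can also be read off numerically with value_counts(normalize=True). A minimal sketch on a synthetic categorical column (with the real dataset, you would pass a column such as `data['lunch']`):

```python
import pandas as pd

# Synthetic stand-in for a categorical column such as 'lunch'
lunch = pd.Series(["standard"] * 13 + ["free/reduced"] * 7)

shares = lunch.value_counts(normalize=True)  # fraction of rows in each category
print(shares)
```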

print(data[["math score", "reading score", "writing score"]].describe())

       math score  reading score  writing score
count  1000.00000    1000.000000    1000.000000
mean     66.08900      69.169000      68.054000
std      15.16308      14.600192      15.195657
min       0.00000      17.000000      10.000000
25%      57.00000      59.000000      57.750000
50%      66.00000      70.000000      69.000000
75%      77.00000      79.000000      79.000000
max     100.00000     100.000000     100.000000

Before moving on to the bivariate analysis, we printed summary statistics such as the count, mean, and standard deviation of the numeric columns of our dataset.

Moving on to the bivariate analysis: there are many relationships to study here. We will look at some of them and try to extract useful insights.

fig, axs = plt.subplots(2, 2, figsize=(15,10), dpi=600)
axs[0, 0].scatter(data['writing score'], data['math score'], s=4)
axs[0, 0].set_xlabel("Writing score")
axs[0, 0].set_ylabel("Math score")
axs[0, 1].scatter(data['reading score'], data['math score'], s=4)
axs[0, 1].set_xlabel("Reading score")
axs[0, 1].set_ylabel("Math score")
axs[1, 0].scatter(data['reading score'], data['writing score'], s=4)
axs[1, 0].set_xlabel("Reading score")
axs[1, 0].set_ylabel("Writing score")
fig.delaxes(axs[1, 1])  # remove the unused fourth subplot
plt.show()

In the first two plots, the points are more scattered and the relationship between the two variables is not as strong as in the last plot. This makes sense: the first two plots relate the math score to the writing and reading scores, which measure somewhat different skills. In the last plot we see a tighter relationship with less noise in the trend, because reading and writing fall into the same category of skills.

We can see the same if we print the correlation coefficient between these skills:

print(data["writing score"].corr(data["math score"]))
print(data["writing score"].corr(data["reading score"]))
print(data["reading score"].corr(data["math score"]))

0.8026420459498084
0.954598077146248
0.8175796636720539

We see that the largest coefficient is between reading and writing scores.
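All pairwise coefficients can also be computed at once as a correlation matrix and drawn as a heatmap. A sketch on synthetic score columns (the data here is randomly generated; with the real dataset, replace `scores` with `data[["math score", "reading score", "writing score"]]`):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when running interactively
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the three score columns: a shared "ability" component
# plus per-subject noise, so the columns are positively correlated
rng = np.random.default_rng(0)
ability = rng.normal(70, 10, 500)
scores = pd.DataFrame({
    "math score": ability + rng.normal(0, 6, 500),
    "reading score": ability + rng.normal(0, 3, 500),
    "writing score": ability + rng.normal(0, 3, 500),
})

corr = scores.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
print(corr)
```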

fig, axs = plt.subplots(2, 2, figsize=(20,12), dpi=600)
sns.boxplot(x='lunch', y='math score', data=data, ax=axs[0, 0])
sns.boxplot(x='lunch', y='reading score', data=data, ax=axs[0, 1])
sns.boxplot(x='lunch', y='writing score', data=data, ax=axs[1, 0])
fig.delaxes(axs[1, 1])
plt.show()

Here we can see that in all three subjects, students who have the standard lunch achieve higher grades than those who have the free/reduced lunch.
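The gap the box plots show can be quantified by grouping on the category and averaging. A minimal sketch with invented numbers (with the real dataset, you would group `data` itself):

```python
import pandas as pd

# Invented mini-dataset for illustration
demo = pd.DataFrame({
    "lunch": ["standard", "standard", "standard", "free/reduced", "free/reduced"],
    "math score": [72, 68, 76, 58, 62],
})

# Mean score per lunch type, and the gap between the two groups
group_means = demo.groupby("lunch")["math score"].mean()
gap = group_means["standard"] - group_means["free/reduced"]
print(group_means)
print(gap)
```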

fig, axs = plt.subplots(2, 2, figsize=(20,12), dpi=600)
sns.boxplot(x='test preparation course', y='math score', data=data, ax=axs[0, 0])
sns.boxplot(x='test preparation course', y='reading score', data=data, ax=axs[0, 1])
sns.boxplot(x='test preparation course', y='writing score', data=data, ax=axs[1, 0])
fig.delaxes(axs[1, 1])
plt.show()

Similarly, students who completed the test preparation course achieve higher grades than those who did not.

Finally, we will find the relationship between parents’ education level and their children’s marks.

score_cols = ["math score", "reading score", "writing score"]
# Select the score columns before averaging so non-numeric columns don't break mean()
data.groupby("parental level of education")[score_cols].mean().plot(kind='bar', figsize=(20,12))
plt.show()

We observe a relationship between the parents’ education level and their children’s grades: on average, the higher the parents’ education level, the higher the grades.
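One caveat: groupby sorts the education levels alphabetically, so the bars do not follow the natural order of educational attainment. Reindexing fixes that; a sketch with invented group means (the category names follow the dataset, the numbers do not):

```python
import pandas as pd

# Natural ordering of the dataset's education categories
order = ["some high school", "high school", "some college",
         "associate's degree", "bachelor's degree", "master's degree"]

# Invented group means for illustration
means = pd.Series({
    "associate's degree": 68.0, "bachelor's degree": 71.0, "high school": 64.0,
    "master's degree": 73.0, "some college": 67.0, "some high school": 63.0,
})

ordered = means.reindex(order)  # bars would now run from least to most education
print(ordered)
```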

Conclusion

In this article, we defined univariate and bivariate analysis, discussed their advantages and disadvantages, compared them, and tried them ourselves. There are other ways to perform such analysis; hopefully, we will cover some of them in a future article.
