Feature Analysis

Creating models to predict credit card defaults can help financial institutions make better decisions, reduce losses, and improve their performance. Feature analysis is an important step after building such a model, as it helps us understand the underlying reasons why certain customers are more likely to default, identify potential biases in the prediction model, and ensure that the model is fair and equitable.

In this article, we will discuss the Random Forest machine learning model, introduce two tools used in feature analysis (LIME and Eli5), and conclude with Python code that applies these concepts to a real dataset of credit card defaulters.

Random Forest Model

Random Forest is a popular machine learning algorithm that was first introduced in 2001 by Leo Breiman and Adele Cutler. It is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model. The algorithm builds multiple decision trees by randomly selecting a subset of the data and a subset of the features to train each tree. The final prediction is made by aggregating the predictions of all the trees: averaging them for regression, or taking a majority vote (or averaged class probabilities) for classification.

The concept behind Random Forest is to reduce the variance of a single decision tree by averaging the predictions of multiple decision trees. This results in a model that is less prone to overfitting. Random Forest can be used for both classification and regression problems and is particularly useful for datasets with a large number of features.
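
As a quick, hedged illustration of this variance-reduction idea, the sketch below compares a single decision tree with a forest on synthetic data; the dataset and parameter values are assumptions for illustration only, not part of the credit card example that follows.

# Illustrative only: a single tree vs. an averaged forest on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# A single deep tree tends to overfit; averaging many trees usually gives a more
# stable cross-validated score.
print('Tree   CV accuracy:', cross_val_score(single_tree, X_demo, y_demo, cv=5).mean())
print('Forest CV accuracy:', cross_val_score(forest, X_demo, y_demo, cv=5).mean())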

Random Forest has many advantages: it is robust to outliers, it copes well with datasets that have many features, it can handle categorical features once they are suitably encoded, it deals reasonably well with missing values in many implementations, and it is far less sensitive to small changes in the data than a single decision tree. However, it also has some disadvantages: it is less interpretable than a single decision tree, and it is computationally intensive both to train and to use for prediction.

The computational expense of Random Forest comes from building many decision trees. Each tree is trained on a bootstrap sample of the data and considers a random subset of the features, which can take a significant amount of time and resources, especially for large datasets with many features. When making predictions, the algorithm must also pass each example through every tree and aggregate the results, which adds further cost.

The Random Forest algorithm is widely used in many fields, such as finance, healthcare, and natural language processing. It is also a popular tool for feature selection and for estimating feature importance, which helps in understanding the underlying factors that contribute to a prediction.

LIME

Local Interpretable Model-Agnostic Explanations (LIME) is a method for interpreting the predictions of any machine learning model. It was introduced in 2016 by Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. The idea behind LIME is to explain a model's prediction by approximating the model locally with a simpler, interpretable one. To do this, LIME perturbs the input and examines the effect on the model's predictions: it generates a set of simulated samples around the input of interest, trains a simple interpretable model on those samples and their corresponding predictions, and then uses that interpretable model to explain the prediction of the original model.
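
To make the mechanism concrete, below is a minimal, simplified sketch of the local-surrogate idea LIME is built on. It is not the library's implementation; the function name, the Gaussian sampling scheme, and the parameter values are illustrative assumptions (the lime library itself is used later in the article).

# Conceptual sketch of the local-surrogate idea behind LIME (not the library's internals).
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(predict_proba, instance, n_samples=500, scale=0.5, kernel_width=1.0):
    rng = np.random.default_rng(0)
    # 1. Perturb the instance of interest with Gaussian noise.
    perturbed = instance + rng.normal(0.0, scale, size=(n_samples, instance.shape[0]))
    # 2. Query the black-box model for the probability of the positive class.
    preds = predict_proba(perturbed)[:, 1]
    # 3. Weight each perturbed sample by its proximity to the original instance.
    distances = np.linalg.norm(perturbed - instance, axis=1)
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))
    # 4. Fit a simple interpretable (linear) model on the weighted samples.
    surrogate = Ridge(alpha=1.0).fit(perturbed, preds, sample_weight=weights)
    # The coefficients approximate each feature's local contribution to the prediction.
    return surrogate.coef_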

One of the main advantages of LIME is that it is model-agnostic, meaning it can be used to explain the predictions of any machine learning model, regardless of its complexity. Additionally, LIME can provide explanations for individual predictions, which can be useful for understanding why a model made a particular prediction. LIME also allows the creation of perturbations that are tailored to the input data, which can be useful for understanding the features that have a higher importance in the model decision.

However, LIME also has some disadvantages. It can be computationally expensive, especially for large datasets or high-dimensional data. It is also sensitive to the choice of interpretable surrogate model, which involves a trade-off between the simplicity of the model and the fidelity of the explanations, and to the number of perturbations used, which affects both the quality of the explanations and the computational cost.

Eli5

Eli5 (Explain Like I'm 5) is an open-source library for model interpretability and feature selection. It was introduced in 2015 and integrates most closely with the popular machine learning library scikit-learn. The concept behind Eli5 is to provide easy-to-understand explanations for the predictions of machine learning models.

One of the main advantages of Eli5 is its simplicity and ease of use. It provides a user-friendly interface for interpreting the predictions of machine learning models and can be easily integrated into any scikit-learn workflow. Additionally, Eli5 can be used to analyze the feature importance of a model, which can be useful for feature selection and understanding the underlying factors that contribute to the predictions.

However, Eli5 is designed primarily around scikit-learn-style models, so it may not be compatible with every machine learning library. It can also be less flexible than other interpretability tools and may not work well for models with high-dimensional data.

Eli5 is widely used in many fields, such as natural language processing, computer vision, healthcare, and finance, and it is particularly useful for producing explanations that non-technical users can follow. For example, it can be used to explain the predictions of a credit risk model, which helps to understand the underlying factors that contribute to credit card default and to ensure that the model is fair and equitable.

Feature Analysis using Python

In this section, we will use a credit card dataset, build a Random Forest model to predict defaulters, and then interpret the model using LIME and Eli5.

We will start by importing the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import lime
import lime.lime_tabular

Then we will read the data:

data = pd.read_csv("./Credit_Card.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_1                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   30000 non-null  float64
 14  BILL_AMT3                   30000 non-null  float64
 15  BILL_AMT4                   30000 non-null  float64
 16  BILL_AMT5                   30000 non-null  float64
 17  BILL_AMT6                   30000 non-null  float64
 18  PAY_AMT1                    30000 non-null  float64
 19  PAY_AMT2                    30000 non-null  float64
 20  PAY_AMT3                    30000 non-null  float64
 21  PAY_AMT4                    30000 non-null  float64
 22  PAY_AMT5                    30000 non-null  float64
 23  PAY_AMT6                    30000 non-null  float64
 24  default.payment.next.month  30000 non-null  int64  
dtypes: float64(13), int64(12)
memory usage: 5.7 MB

We see that the file has 30,000 entries and 25 columns, with no null values according to the data info.
The dataset contains information on 30,000 credit card holders. Its features include the given credit limit, sex, education level, marital status, age, the past six months' payment and billing details, and finally the target feature, "default.payment.next.month".
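
As a quick optional check that is not part of the original walkthrough, it is worth looking at the class balance of the target, since the accuracy figure reported later should be read with this balance in mind:

# Share of defaulters vs. non-defaulters; defaults are the minority class.
print(data['default.payment.next.month'].value_counts(normalize=True))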

After studying the dataset, we can do some cleaning of the EDUCATION and MARRIAGE features to help the model give better predictions.
We will reduce the EDUCATION feature to graduate school (1), university (2), high school (3), and others (4).
And we will reduce the MARRIAGE options to married (1), single (2), or others (3).

# Map the undocumented EDUCATION codes (0, 5, 6) to the "others" category (4).
data['EDUCATION'] = np.where(data['EDUCATION'] == 5, 4, data['EDUCATION'])
data['EDUCATION'] = np.where(data['EDUCATION'] == 6, 4, data['EDUCATION'])
data['EDUCATION'] = np.where(data['EDUCATION'] == 0, 4, data['EDUCATION'])

# Map the undocumented MARRIAGE code (0) to the "others" category (3).
data['MARRIAGE'] = np.where(data['MARRIAGE'] == 0, 3, data['MARRIAGE'])

Next, we split the data into features and target, split that again into training and testing sets, and store the feature names in a list:

X = data.drop('default.payment.next.month', axis=1)
Y = data['default.payment.next.month']

x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.85, random_state=42)

feature_names = list(x_train.columns)

Next, we tune the Random Forest with a randomized search so that we get better accuracy.

param_dist = {'n_estimators': [50, 100, 150, 200, 250, 300],           # trees in the forest
              'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],  # features considered at each split
              'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],     # max depth of each tree
              'criterion': ['gini', 'entropy']}                          # measure of split quality

rf = RandomForestClassifier()

rf_cv = RandomizedSearchCV(rf, param_distributions=param_dist,
                           cv=5, random_state=0, n_jobs=-1)

rf_cv.fit(x_train, y_train)

print("Tuned Random Forest Parameters: %s" % (rf_cv.best_params_))

Tuned Random Forest Parameters: {'n_estimators': 100, 'max_features': 11, 'max_depth': 4, 'criterion': 'gini'}

We take the best parameters found by the search and build the Random Forest with them:

tuned_rf = RandomForestClassifier(criterion='gini', max_depth=4,
                                  max_features=11, n_estimators=100,
                                  random_state=0)
tuned_rf.fit(x_train, y_train)
y_pred = tuned_rf.predict(x_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8206666666666667

We got an accuracy of about 82 percent, which is acceptable for this walkthrough. Keep in mind that most customers in this dataset do not default, so accuracy alone can be flattering; the precision and recall of the default class are also worth checking. You can get better results with more thorough data cleaning, or by searching the hyperparameter space exhaustively rather than randomly, as sketched below.
We will accept the 82 percent accuracy for now.
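
If you do want to search more exhaustively, here is a minimal sketch of a grid search. The grid values are illustrative assumptions; running a full grid over all the values used in the randomized search above would be considerably slower.

# Optional: exhaustive search over a small, illustrative grid (slower than RandomizedSearchCV).
from sklearn.model_selection import GridSearchCV

small_grid = {'n_estimators': [100, 200],
              'max_features': [6, 11],
              'max_depth': [4, 8],
              'criterion': ['gini', 'entropy']}

grid_cv = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5, n_jobs=-1)
grid_cv.fit(x_train, y_train)
print('Grid search best parameters:', grid_cv.best_params_)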

Now it is time to plot a confusion matrix of the results:

plt.figure(figsize=(8, 6), dpi=300)
ConfMatrix = confusion_matrix(y_test, tuned_rf.predict(x_test))
sns.heatmap(ConfMatrix, annot=True, cmap='Blues', fmt='d',
            xticklabels=['Non-default', 'Default'],
            yticklabels=['Non-default', 'Default'])
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix - Random Forest')
plt.show()

Now we apply LIME to a couple of defaulters' rows to see how the different features relate to the target feature:

predict_fn_rf = lambda x: tuned_rf.predict_proba(x).astype(float)

explainer = lime.lime_tabular.LimeTabularExplainer(x_train.values, feature_names=feature_names,
                                                   class_names=["Won't Default", 'Will Default'],
                                                   verbose=False, mode='classification')

chosen_instance = X.iloc[[18256]].values[0]
exp = explainer.explain_instance(chosen_instance, predict_fn_rf, num_features=10)
exp.show_in_notebook(show_table=True)

chosen_instance = X.iloc[[0]].values[0]
exp = explainer.explain_instance(chosen_instance, predict_fn_rf, num_features=10)
exp.show_in_notebook(show_table=True)

As we see, in the two defaulters we analyzed, the 'PAY_1' and 'PAY_2' features are crucial in determining whether someone is going to default or not.
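
If you are working outside a notebook, the same explanation can also be read programmatically; as a small optional addition, the returned Explanation object exposes its feature weights as a list:

# Optional: print the explanation as (feature, weight) pairs instead of rendering HTML.
for feature, weight in exp.as_list():
    print(f'{feature}: {weight:.3f}')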

Now let’s apply Eli5:

import eli5
eli5.show_weights(tuned_rf, top=10)

We see again that features 6 through 10 (PAY_1 to PAY_5), which represent the repayment status in the five most recent months, are the most important for determining whether someone will default.
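
By default, show_weights labels the features by their position (x6, x7, and so on). As a small optional tweak, passing the feature names makes the table easier to read:

# Optional: show column names instead of positional feature indices.
eli5.show_weights(tuned_rf, feature_names=list(feature_names), top=10)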

Final Words

In conclusion, this article has discussed the use of Random Forest, LIME, and Eli5. These tools can be used to gain insights into the workings of a machine learning model and can be especially useful when working with complex models such as Random Forest. The practical part of the article demonstrated the application of these tools in building a model and interpreting its results. However, it is important to note that there are other interpretability tools available, such as SHAP, that can also be used to gain a further understanding of a model’s behavior. Overall, the use of interpretability tools is an important step in the development of any machine learning model, as it can help to ensure that the model is behaving as expected and that its predictions are trustworthy.
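
For completeness, here is a minimal, hedged sketch of how SHAP could be applied to the same tuned model; it is not part of the walkthrough above, and the exact structure of the returned values depends on the installed shap version.

# Optional sketch: SHAP values for the tuned Random Forest (assumes the shap package is installed).
import shap

shap_explainer = shap.TreeExplainer(tuned_rf)
shap_values = shap_explainer.shap_values(x_test)

# For binary classifiers, older shap versions return a list with one array per class,
# while newer versions may return a single 3-D array; here we assume the list form
# and plot the contributions for the "default" class.
shap.summary_plot(shap_values[1], x_test)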
