Credit Risk of Vehicle Loans: A Machine Learning-Based Prediction

Bharath Chandran · Published in The Startup · 8 min read · Jul 8, 2020

INTRODUCTION

In today’s world of economic expansion, credit risk is the biggest risk banks face. Vehicle loans are loans offered by banks to their customers to purchase a vehicle, where the customer agrees to terms and conditions that include repayment of the full loan amount along with interest; this interest is a source of income for the bank. Populous countries like India and China have huge numbers of loan applications on hold and awaiting approval. Loan approval is a weighty task for banks, because approving a loan for a defaulter leads to losses, while refusing a loan to a non-defaulter also costs the bank profit. Hence, banks rely on prediction models to figure out to whom a loan should be granted. It reminds me of how my father was denied a vehicle loan in the early 80s. It is therefore essential for a data analyst to build a predictive model that forecasts credit risk from the factors it depends on.

DATA MINING METHODOLOGIES

A. Logistic Regression

The main mathematical concept underlying logistic regression is the logit function, which is the natural logarithm of an odds ratio [11]. It can be explained by considering the distribution of one dichotomous outcome variable paired with another dichotomous variable. Generally, logistic regression is well suited for describing and testing hypotheses about relationships between a categorical outcome variable and one or more categorical or continuous predictor attributes. The simplest logistic regression is of the form

logit(Y) = ln(π / (1 − π)) = α + βX, where π = P(Y = 1) = e^(α + βX) / (1 + e^(α + βX))
where β is the regression coefficient, α is the Y-intercept, and e = 2.71828 is the base of the natural logarithm. X can be either continuous or categorical, depending on the chosen dataset, but Y is always categorical.
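As a quick numeric illustration (my own sketch, not from the original post), the logit can be inverted to turn α + βX into a probability:

```python
import math

def predict_probability(x, alpha, beta):
    """P(Y = 1 | X = x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical coefficients, for illustration only.
alpha, beta = -1.5, 0.8
for x in [0, 1, 2, 3]:
    print(f"x = {x}: P(Y = 1) = {predict_probability(x, alpha, beta):.3f}")
```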

Logistic regression is a powerful tool that allows simultaneous analysis of multiple explanatory variables, thereby reducing the effect of confounding factors [10]. In this paper I have chosen logistic regression to predict vehicle loan credit risk since the outcome is dichotomous: if the loan is given, will the customer be a defaulter or a non-defaulter?

B. Decision Tree

In this paper, I have applied the decision tree algorithm to the vehicle loan credit risk problem, which is a classification problem. Generally, classification is the task of assigning objects to categories based on their attributes.

classification of object attributes into categories

We already know that a normal tree consists of a root, branches, and leaves, and the decision tree algorithm follows the same structure: it comprises a root node, branches, and leaf nodes. Each internal node performs a test on a specific attribute, each branch represents an outcome of that test, and each leaf node holds a class label [12]. The topmost node in the tree is the root node. Hence a decision tree is a tree where each node tests a different attribute, each branch corresponds to a decision rule, and each leaf gives a result as a continuous or categorical value. Because it mirrors human-level reasoning, it is easy to work with the data and make effective interpretations. The whole idea is to build a tree over the entire data and arrive at a single result at every leaf, based on the objective of the problem. Below is one example of how the decision tree algorithm relates to real-world problems.

The decision tree with an example statement

Hence, I have adopted the decision tree algorithm to decide whether or not to approve customers’ vehicle loan claims.
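To make this structure concrete, here is a minimal scikit-learn sketch (my own illustration on toy data, not the loan dataset) that fits a small tree and prints its decision rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data for illustration; the loan dataset would be used in practice.
iris = load_iris()
X, y = iris.data, iris.target

# criterion may be "entropy" or "gini", the two criteria compared later.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=42)
tree.fit(X, y)

# Each printed rule is an internal-node test; each leaf reports a class.
print(export_text(tree, feature_names=list(iris.feature_names)))
```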

DATA PREPROCESSING AND TRANSFORMATION

It’s always better to pre-process the data before building a model. Pre-processing includes checking for missing values, detecting outliers if any, and performing univariate and bivariate analyses.

The dataset that I have collected for vehicle loan credit risk prediction consists of 233,154 rows and 40 different variables that might affect the prediction of defaulters. The summary of the dataset is as follows.

From the missing value analysis, I found that my dataset did not have any missing values, and hence I moved directly on to my univariate and bivariate analyses.
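A minimal sketch of that missing-value check, assuming the data has been loaded into a pandas DataFrame (the file name here is hypothetical; the post does not show its code):

```python
import pandas as pd

# Hypothetical file name, for illustration only.
df = pd.read_csv("vehicle_loan_data.csv")

# Count missing values per column; all zeros means no imputation is needed.
print(df.isnull().sum())
print(f"Total missing values: {df.isnull().sum().sum()}")
```

Sample plots of my univariate analysis are shown below.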

(a) univariate analysis plot of the disbursed amount (b) univariate analysis plot of LTV (c) univariate analysis plot of branch_id (d) univariate analysis plot of supplier_id

I have also compared all the independent variables with the predictor variable and, through the bivariate analyses, found the distribution of the dependent variable with respect to the independent variables. I have attached screenshots of the bivariate analyses below.

(a) bivariate analysis plot of employment type vs loan_default (b) bivariate analysis plot of Flag vs loan_default
bivariate analysis plot of CNS_score vs loan_default

Finally, the correlation map gives us insight into the important attributes that contribute to predicting the defaulters vs. non-defaulters variable.

Correlation heatmap for attributes of credit risk dataset

From this investigation I found that the columns holding secondary-account data, such as secondary accounts and secondary account balances, have little importance for the prediction, and hence I deleted those unwanted variables before moving on to model building.
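A sketch of how the heatmap and the column drop might look (my own reconstruction; the secondary-account column names are guesses, not the exact dataset schema):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap over the numeric columns.
corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation heatmap for attributes of the credit risk dataset")
plt.show()

# Drop the weakly contributing secondary-account columns (names are illustrative).
secondary_cols = [c for c in df.columns if "SEC_" in c.upper()]
df = df.drop(columns=secondary_cols)
```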

Since this is a classification problem, the string-valued categorical attributes need to be converted into model-usable categorical attributes. Hence I have used the OneHotEncoder function to encode the string categories of the employment type and manufacturer id columns.

Employment type attribute changed to categories
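A minimal sketch of that encoding step (my own reconstruction; the exact column names in the dataset may differ):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Columns named in the post; exact names in the dataset may differ.
cat_cols = ["Employment.Type", "manufacturer_id"]

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[cat_cols]).toarray()

# get_feature_names_out requires scikit-learn >= 1.0.
encoded_df = pd.DataFrame(
    encoded, columns=encoder.get_feature_names_out(cat_cols), index=df.index
)
df = pd.concat([df.drop(columns=cat_cols), encoded_df], axis=1)
```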

I also found the distribution of defaulters vs. non-defaulters in my dependent variable to be imbalanced, so the dependent variable needed to be balanced. Therefore, I adopted the under-sampling method to overcome this challenge. Initially my data had 182,543 non-defaulters and 50,611 defaulters in the dependent variable column. After under-sampling, both categories in the dependent variable column are equal, with 50,611 values each.

Under-sampling technique
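A minimal sketch of random under-sampling with pandas (the post does not show its code; loan_default is assumed to be the target column, with 1 marking a defaulter):

```python
import pandas as pd

# Split the majority (non-defaulters) and minority (defaulters) classes.
majority = df[df["loan_default"] == 0]
minority = df[df["loan_default"] == 1]

# Randomly down-sample the majority class to the minority class size.
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle: 50,611 rows per class, 101,222 rows in total.
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
print(balanced["loan_default"].value_counts())
```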

From here, I moved on to the model building process with 101,222 rows and 34 columns.

MODEL BUILDING AND EVALUATION

In this section, we are going to see the models that have been employed on the dataset. The models’ performance has also been evaluated using various evaluation metrics such as mean absolute error (MAE), mean squared error (MSE), R² score, confusion matrices, and ROC curves, based on the algorithm chosen.

The pre-processed dataset was split into two parts, X and Y, which were further split into x_train, x_test, y_train, and y_test. X contains all the rows and all the columns except the dependent column that needs to be predicted, while Y contains the dependent column with all its row values. x_train and y_train are the portions of actual values from which the model learns to predict. x_test is the portion to which the model applies itself to produce a prediction output. y_test holds the actual values against which the predictions on x_test are compared and evaluated for metrics like accuracy, precision, and so on.
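A minimal sketch of this split with scikit-learn (test_size=0.2 is an assumption; the post does not state the split ratio):

```python
from sklearn.model_selection import train_test_split

# X: all columns except the target; Y: the target column
# (balanced is the under-sampled DataFrame from the earlier sketch).
X = balanced.drop(columns=["loan_default"])
Y = balanced["loan_default"]

# test_size=0.2 is an assumed split ratio, not stated in the post.
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
```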

For the credit risk problem I have applied the logistic regression and decision tree algorithms to predict defaulters vs. non-defaulters. I chose these models since my output is dichotomous, or binary. The logistic regression and decision tree functions were imported from the sklearn.linear_model and sklearn.tree packages respectively. I have also evaluated the performance of these models using confusion matrices and AUC-ROC curves. The confusion matrix is a tabular representation of the true positives, true negatives, false positives, and false negatives that arise when the predicted values are compared with the actual values. Based on the confusion matrix, metrics like accuracy, precision, recall, and F1 score have been determined. Accuracy is the percentage of correctly predicted values out of all predictions. Precision is the percentage of true positives out of all predicted positives. Recall is the percentage of true positives out of all actual positives. The F1 score is a metric that combines precision and recall. A sketch of how these models might be trained and evaluated is shown below, followed by the output of logistic regression.
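A minimal sketch of the training and evaluation, assuming the split from above (the post does not show its exact code or hyperparameters):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Logistic regression baseline.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(x_train, y_train)
y_pred_lr = log_reg.predict(x_test)

print("Logistic regression:")
print(confusion_matrix(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

# Decision trees with the two criteria discussed earlier.
for criterion in ("entropy", "gini"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=42)
    clf.fit(x_train, y_train)
    print(f"Decision tree ({criterion}):")
    print(classification_report(y_test, clf.predict(x_test)))
```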

Output of logistic regression

I obtained a model accuracy of 52%, which on this balanced dataset is only slightly better than random guessing. I have also obtained the values of the other evaluation metrics. Further, on plotting the AUC-ROC curve, I obtained an AUC score of 0.55; the curve is shown below.

ROC curve of logistic regression
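A sketch of how such a curve might be plotted, assuming the fitted log_reg model from the sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Predicted probabilities for the positive (defaulter) class.
y_score = log_reg.predict_proba(x_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"Logistic regression (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```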

An AUC value of 0.55 means that a randomly chosen defaulter is ranked above a randomly chosen non-defaulter only 55% of the time, barely better than chance. Additionally, I applied the decision tree algorithm to the same dataset; as discussed previously, I considered both the entropy and Gini index criteria, and hence I received two outputs, one for each criterion.

Output of decision tree classifier — Entropy
Output of decision tree classifier — Gini

The output from my entropy model shows an accuracy of 57.42%, while the Gini index model has an accuracy of 57.24%. Below are the ROC curves from the decision tree algorithm.

ROC curve of decision tree classifier

The AUC value from the decision tree indicates that a randomly chosen defaulter is ranked above a randomly chosen non-defaulter about 57% of the time. The comparison is summarized below.

Comparison table

CONCLUSION
I can conclude that the entropy-criterion decision tree achieves better accuracy and is hence the more appropriate model, compared to logistic regression, for this binary classification dataset.
