Come, Let's Predict HR Attrition!


Human resource (HR) analytics is a field of analytics in which analytic processes are applied to the human resource department of an enterprise, with the aim of improving employee performance and gaining a better return on investment. It does not just involve gathering data on employee efficiency; it also aims to provide insight into the gathered data and to support decisions that improve key processes. HR attrition refers to the gradual loss of an enterprise's employees over time, and enterprises consider it one of their important problems. HR professionals are held responsible for high attrition rates, so they often take the lead in designing an enterprise's compensation program and work culture. They also make effective use of motivation systems to retain the top employees.

Every year, companies hire numerous employees and put a lot of money and time into training them to increase their effectiveness. The problem with a high attrition rate is that it is a direct cost to the enterprise: job postings, the recruiting process, paperwork and new-employee training are some of the common expenses of replacing a lost employee. High attrition also prevents the enterprise from building a collective knowledge base and experience over time. This is more problematic if the company runs a customer-facing business, since customers like to interact with the same familiar people. Errors and issues are also more common when the enterprise has many new employees.


K-Nearest Neighbours (KNN)

The KNN technique is generally applied to classify objects based on the nearest training examples in the space of the independent attributes. It can also be described as a kind of instance-based learning, where the decision is only made locally and all computation is deferred until classification. It is one of the most fundamental and straightforward classification methods, and a natural choice when there is little prior understanding of the distribution of the data. The rule retains the whole training set during learning and assigns to each query the class represented by the majority of its K nearest neighbours in the training dataset. The nearest neighbour (NN) rule is the simplest form of KNN, with K=1.
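To make the majority-vote rule concrete, here is a minimal from-scratch sketch. The function name knn_predict and the toy points are made up for illustration; this is not the code used for the results in this article.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count how often each label occurs
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two features per employee, 1 = left, 0 = stayed
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
                (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
train_labels = [1, 1, 1, 0, 0, 0]

prediction_k3 = knn_predict(train_points, train_labels, (0.2, 0.2), k=3)
prediction_k1 = knn_predict(train_points, train_labels, (0.88, 0.9), k=1)
print(prediction_k3, prediction_k1)
```

Note that nothing is "trained" here: the function simply keeps the training points and does all of its work when a query arrives, which is what deferring computation until classification means.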

The performance of a KNN classifier is essentially determined by the choice of K and of the distance metric applied, such as Euclidean or Manhattan distance. The estimate is affected by the neighbourhood size K, since the radius of the local neighbourhood is set by the distance to the Kth nearest neighbour. The smaller the value of K, the noisier the local estimate. So if I need to smooth the estimate further, I need to increase K and consider a larger neighbourhood around the query. But a very large K over-smooths the estimate, and this degrades classification performance because points from other classes are drawn into the neighbourhood. In spite of these challenges, KNN is still widely used in sectors like text mining, agriculture, finance and healthcare. I have used KNN to predict HR attrition, considering a range of K values from K=1 to K=17.
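The effect of K and of the distance metric can be sketched with scikit-learn's KNeighborsClassifier. The points below are made up: three class-1 points sit close to the query and seven class-0 points lie further away, so a small K follows the local cluster while a K as large as the whole dataset is outvoted by the other class.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): three class-1 points near the origin, seven class-0 points further away
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.2, 1.2],
              [0.9, 1.0], [1.0, 0.9], [1.1, 1.1]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
query = np.array([[0.05, 0.05]])

predictions = {}
for metric in ("euclidean", "manhattan"):
    for k in (3, 10):
        model = KNeighborsClassifier(n_neighbors=k, metric=metric).fit(X, y)
        predictions[(metric, k)] = int(model.predict(query)[0])

print(predictions)
```

With K=3 the query is assigned to class 1 under both metrics; with K=10 every point votes and the larger class wins, even though none of its points is close to the query.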


It's always better to pre-process the data before we build a model. Pre-processing includes checking for missing values, detecting outliers, if any, and carrying out univariate and bivariate analyses.

The dataset that I collected for HR attrition prediction consists of 14999 rows and 10 variables, including the dependent variable, that are relevant to predicting employee attrition. First, I imported the dataset; its summary is below.

My missing value analysis tells me that the dataset does not contain any missing values. The output of the missing value analysis is shown above.
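The import, summary and missing-value check can be sketched with pandas as follows. The small DataFrame is a made-up stand-in for the real file, and the column names and the file name in the comment are hypothetical.

```python
import pandas as pd

# Stand-in for the real file; in practice the data would come from something like
# pd.read_csv("hr_data.csv"), where the file name is hypothetical.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41],
    "monthly_hours": [157, 262, 272, 223, 159, 153],
    "salary": ["low", "medium", "medium", "low", "low", "high"],
    "left": [1, 0, 1, 0, 1, 0],
})

n_rows, n_cols = df.shape                 # number of rows and columns
summary = df.describe()                   # count, mean, std, min, quartiles, max of numeric columns
missing_per_column = df.isnull().sum()    # number of missing values in each column
total_missing = int(missing_per_column.sum())

print(n_rows, n_cols)
print(summary)
print(missing_per_column)
```

A total of zero missing values means no imputation or row dropping is needed before modelling.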

The univariate analysis tells me that the values in each independent variable's column are approximately normally distributed and not skewed. Below are sample outputs of the univariate analyses. The univariate analysis also tells me that my data lacks outliers, and hence I can proceed with model building.
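One common way to check skewness and outliers numerically is the sample skewness together with the interquartile-range (IQR) rule. The sketch below uses made-up values and a hypothetical column name; the 1.5 x IQR cut-off is a convention and an assumption here, not necessarily the check behind the outputs in this article.

```python
import pandas as pd

# Made-up values for one numeric column (hypothetical name)
hours = pd.Series([157, 262, 272, 223, 159, 153, 247, 259, 224, 142],
                  name="monthly_hours")

skewness = hours.skew()   # values near 0 suggest a roughly symmetric distribution

# IQR rule: flag points lying more than 1.5 * IQR outside the quartiles
q1, q3 = hours.quantile(0.25), hours.quantile(0.75)
iqr = q3 - q1
outliers = hours[(hours < q1 - 1.5 * iqr) | (hours > q3 + 1.5 * iqr)]

print(round(skewness, 3), len(outliers))
```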

Once the univariate analysis is done, we need to know how the independent attributes affect the dependent attribute. This can be analysed using bivariate analysis, as below.

The correlation map, shown below, gives me insight into which attributes I need to consider for model building. Although my correlation map shows that only seven of my nine independent variables are significant, I have used all nine variables so that I could build a better model.
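The numbers behind a correlation map can be computed with pandas, as in this sketch. The data and column names are made up; a plotting library can then render the resulting matrix as a heat map.

```python
import pandas as pd

# Made-up numeric data; "left" plays the role of the dependent variable
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.92, 0.10],
    "monthly_hours": [157, 262, 272, 223, 159, 153, 259, 247],
    "left": [1, 0, 1, 0, 1, 1, 0, 1],
})

corr = df.corr()   # pairwise Pearson correlation between all columns

# Correlation of each independent variable with the target, strongest first
target_corr = corr["left"].drop("left").sort_values(key=abs, ascending=False)
print(target_corr)
```

Variables whose correlation with the target is close to zero are the ones a correlation map would mark as weak candidates.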

Since my dependent variable was imbalanced, I used an oversampling technique to overcome this problem, and I have successfully balanced my dataset so that both categories of the dependent variable are equally represented, as shown below.
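Random oversampling is one simple way to balance the classes: rows of the smaller class are sampled with replacement until both classes are the same size. The sketch below uses sklearn.utils.resample on made-up data. Which oversampling technique produced the balanced dataset in this article is not stated here, so treat this as one possible approach.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced data: six rows of class 0, two rows of class 1
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.92, 0.66],
    "left": [1, 0, 1, 0, 0, 0, 0, 0],
})

majority = df[df["left"] == 0]
minority = df[df["left"] == 1]

# Sample minority rows with replacement until both classes have the same size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
class_counts = balanced["left"].value_counts()
print(class_counts)
```

One design point to keep in mind: if rows are duplicated before the train/test split, copies of the same row can land in both parts, which tends to flatter a nearest-neighbour model.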

Since a few of my columns hold categorical values, they need to be encoded before modelling. For this purpose, I have used the label encoder function to convert them into integer-coded categories.
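As a small illustration of what the label encoder does, here is scikit-learn's LabelEncoder applied to a made-up salary column (the values are hypothetical).

```python
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column
salary = ["low", "medium", "medium", "high", "low"]

encoder = LabelEncoder()
salary_codes = encoder.fit_transform(salary)   # one integer code per category
classes = list(encoder.classes_)               # categories in the order of their codes

print(classes)
print(list(salary_codes))
```

The codes follow alphabetical order of the labels (high=0, low=1, medium=2), not any natural ranking. Since KNN works on distances, that ordering matters for ordinal columns.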

I now have a balanced dataset with 22856 rows and 10 columns, including my dependent variable, and it is ready for any machine learning model.


In this section, we are going to see the various models that have been employed on the three datasets that we obtained. The models' performance has also been evaluated using various evaluation metrics such as mean absolute error (MAE), mean squared error (MSE), R2 score, confusion matrices and ROC curves, depending on the algorithm that has been chosen.

The pre-processed dataset was split into two parts, X and Y, which were further split into x_train, y_train, x_test and y_test. X holds the independent variables: all the rows and all the columns except the dependent column that needs to be predicted. Y holds the dependent column with all its row values. x_train and y_train are the parts from which the model learns to predict. x_test is the part on which the fitted model is applied to produce predictions. y_test holds the actual values against which the predictions for x_test are compared and evaluated using metrics such as accuracy and precision.
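The split described above can be sketched with scikit-learn's train_test_split. The arrays are made up, and the 75/25 ratio, the stratification and the random seed are assumptions, since the actual settings are not given here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 independent variables (X) and a binary dependent variable (y)
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% of the rows for testing; stratify keeps the class ratio in both parts
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```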

For the attrition problem, I have applied the KNN algorithm to predict whether an employee will leave the company or not. I have chosen this model since my output is dichotomous, or binary. The KNN function was imported from the sklearn.neighbors package. I have also evaluated the performance of this model using confusion matrices and AUC-ROC curves. I have employed KNN for K values varying from K=1 to K=17. Below are the outputs that I obtained for different K values.
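A loop over K=1 to K=17 could look like the sketch below. It assumes KNeighborsClassifier from sklearn.neighbors and uses synthetic data from make_classification in place of the HR dataset, so the printed accuracies are not the ones reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the HR data: 9 independent variables, binary target
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for k in range(1, 18):   # K = 1 .. 17
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    y_pred = model.predict(x_test)
    results[k] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion": confusion_matrix(y_test, y_pred),   # rows: actual, columns: predicted
    }

for k in (3, 9, 17):
    print(k, round(results[k]["accuracy"], 3))
    print(results[k]["confusion"])
```

Keeping the results in a dictionary keyed by K makes it easy to tabulate accuracy against K afterwards.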

It is evident that the accuracy varies with the value of K. The accuracy for K=3 is 95%, for K=9 it is 93% and for K=17 it is 91%. Now let's see how the ROC curve varies for different K values.

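From the above ROC curves, I obtained an AUC of 0.98 for most of the K values. An AUC of 0.98 means that the model ranks a randomly chosen employee who left above a randomly chosen employee who stayed about 98% of the time; it measures how well the two classes are separated, not the share of predictions that are correct. I have also compared the accuracy values for different K values and tabulated them as below.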
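The difference between AUC and accuracy is easy to see on a tiny made-up example. The scores below stand for predicted probabilities of the positive class (for a KNN classifier these would typically come from predict_proba); the 0.5 threshold is an assumption.

```python
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities of class 1
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

# AUC: share of (positive, negative) pairs in which the positive gets the higher score
auc = roc_auc_score(y_true, y_score)

# Accuracy depends on a threshold; 0.5 is assumed here
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
accuracy = accuracy_score(y_true, y_pred)

# Points of the ROC curve: false positive rate vs true positive rate
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(auc, accuracy)
```

Here 15 of the 16 positive-negative pairs are ranked correctly, so the AUC is 0.9375, while only 6 of the 8 predictions are correct at the 0.5 threshold, giving an accuracy of 0.75.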

The comparison table tells us that, among the K values reported, a lower K gives better prediction accuracy: K=3 has an accuracy of 95%.
