Buying a used car? Come, let's predict its price!


New cars keep getting more expensive, and many customers cannot afford them, so used car sales are rising worldwide. This creates a demand for a prediction system that can estimate the true value of a car from a range of its features. Manufacturers sell new cars at prices they fix themselves, plus taxes set by government rules, and a customer may find a new car worth buying because the purchase also comes with assurance. But not every customer can buy a new car, since that depends on factors such as budget and family situation. Many therefore turn to the second option, previously used cars, which tend to be cheaper than new ones.

There are now many channels for buying used cars, such as used car outlets and online websites. Before opting for a used car, it is always better to study its actual market value, which also gives a better understanding of the key factors that determine that value. A model that predicts the average price of a used car is therefore worth building.

Such a model can benefit sellers, online pricing services and buyers. Sellers would be highly interested because a forecast model helps them understand what makes a car more desirable and which features matter most, and they can later use that knowledge to offer a better service. As mentioned above, many websites already offer price estimates. Although they may have a prediction model of their own, they may find it useful to have another one as a backup, so online services would benefit as well. Finally, buyers would benefit because the model helps them avoid paying more than the actual market value of the car.


A. Multiple linear regression

Linear regression is an analytical technique used to determine the relationship between attributes that have a cause-and-effect relation. The goal of multiple regression is to relate the dependent attribute to two or more independent attributes. A technique that involves one dependent and one independent variable is called simple linear regression, and one that involves several independent variables is called multiple regression. Multiple regression is formulated as follows:

Y = β0 + β1X1 + β2X2 + … + βnXn

Y = dependent variable

X1, …, Xn = independent variables

β0 = y-intercept

β1, …, βn = regression coefficients (slopes) of the independent variables

Multiple regression assumes that the relationship between the dependent and independent attributes is linear, that the residuals are approximately normally distributed, that the data are free from influential outliers, and that the independent attributes are not strongly correlated with one another (no multicollinearity). This model has been chosen for the prediction of used car prices since the dataset has numerical values in most of its columns.
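To make the formula concrete, here is a minimal sketch that fits a multiple regression with scikit-learn and reads back the intercept and the coefficients. The data is synthetic and generated from known values (it is not the used cars dataset), so the fitted numbers can be checked against the values used to create it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the used cars dataset):
# Y = 5 + 2*X1 - 3*X2 + small random noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 5.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

intercept = model.intercept_   # estimate of the y-intercept
coefficients = model.coef_     # estimates of the coefficients of X1 and X2

print("intercept:", round(float(intercept), 2))
print("coefficients:", np.round(coefficients, 2))
```

The estimates should land close to 5, 2 and -3, the values the data was generated with.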

B. Random Forest regression

As the name suggests, a random forest is an ensemble of trees in which each tree depends on a collection of random variables. The trees used in this technique are binary recursive partitioning trees. They divide the data using a series of binary partitions, called splits, on the independent variables. The root node contains the entire sample. Each non-terminal node divides into two successor nodes, left and right, based on the value of one of the predictor variables. The nodes that are not divided further are called terminal nodes, and together they form the final partition of the predictor space. Statistically, random forests are appealing because they offer measures of variable importance, differential class weighting, handling of missing values and visualization. From a computational point of view, they are attractive because they naturally handle both regression and multiclass classification. They are also relatively quick to train and to predict with, and they mostly depend on only one or two tuning parameters. Another important aspect is that they can be trained in parallel and can be applied directly to high-dimensional problems where the list of independent attributes is large, like the dataset I have chosen for predicting used car prices.
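The points about tuning parameters, parallel training and variable importance can be sketched with scikit-learn's RandomForestRegressor. The data below is synthetic, and the parameter values are assumptions chosen for illustration, not the settings used in this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative assumption): five predictors, but only the
# first two actually influence the target.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 5))
y = 10 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(
    n_estimators=100,   # tuning parameter: number of trees in the forest
    max_features=0.5,   # tuning parameter: share of predictors tried at each split
    n_jobs=-1,          # build the trees in parallel on all available cores
    random_state=0,
)
forest.fit(X, y)

# Impurity-based variable importance, one value per predictor (sums to 1)
importances = forest.feature_importances_
print(np.round(importances, 3))
```

In this setup the first two importances should dominate, since the remaining three columns are pure noise.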


It's always better to pre-process the data before building a model. Pre-processing includes checking for missing values, detecting outliers if any, and running univariate and bivariate analyses. The dataset that I collected for used car price prediction consists of 31,172 rows and 20 variables that might affect the price of a used car. First, I imported the dataset and checked its summary, shown below.

The summary of the used cars dataset

The summary gives a rough idea of the dataset, including the mean, standard deviation and quartile values. Next, I checked for missing values in the dataset.

(a) List of attributes containing missing values (b) List of attributes showing no missing values after processing
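For readers who want to reproduce these two checks, a minimal pandas sketch follows. The small DataFrame is a stand-in I made up for illustration; the real dataset would be loaded from its file instead, and the 'gearbox' column name is an assumption.

```python
import pandas as pd

# Tiny stand-in frame (illustrative values only, not the real dataset)
df = pd.DataFrame({
    "price": [1500, 3200, 18000, 7400],
    "yearsofregistration": [2001, 2008, 2015, 2011],
    "gearbox": ["manual", None, "automatic", "manual"],
    "fuelType": ["petrol", "diesel", None, None],
})

# Count, mean, standard deviation, min, quartiles and max of numeric columns
summary = df.describe()

# Number of missing values in each column
missing = df.isnull().sum()

print(summary)
print(missing[missing > 0])   # only the columns that contain missing values
```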

My dataset had missing values in several variables. To deal with this, I filled them in using either related variables or the majority category. Since the brand of a car and its gearbox are related, I replaced missing gearbox values based on the brand column. Similarly, I replaced missing 'notRepairedDamage' values with the majority category, missing fuelType values with the majority category in that column, missing vehicle type values based on the fuel type, and missing model values with the majority category in that column. These steps left no missing values in any independent variable column. I also wanted to remove outliers from two columns, 'yearsofregistration' and 'price'. I removed rows with 'yearsofregistration' values before 1950 or after 2017, and rows with a price below $100 or above $200,000.
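The imputation and outlier steps could look roughly like the sketch below. The exact rule for filling gearbox values from the brand is not spelled out above, so the code assumes "most common gearbox within the same brand"; that rule, the 'brand' and 'gearbox' column names and the sample rows are all assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    "brand": ["bmw", "bmw", "bmw", "fiat", "fiat", "fiat"],
    "gearbox": ["automatic", "automatic", None, "manual", None, "manual"],
    "notRepairedDamage": ["no", None, "no", "yes", "no", "no"],
    "yearsofregistration": [2010, 1930, 2005, 2012, 2016, 2019],
    "price": [9000, 5000, 7000, 4000, 250000, 3000],
})

# Assumed rule: fill a missing gearbox with the most common gearbox of its brand
gearbox_by_brand = df.groupby("brand")["gearbox"].agg(lambda s: s.mode().iloc[0])
df["gearbox"] = df["gearbox"].fillna(df["brand"].map(gearbox_by_brand))

# Fill a missing category with the majority category of its own column
majority = df["notRepairedDamage"].mode().iloc[0]
df["notRepairedDamage"] = df["notRepairedDamage"].fillna(majority)

# Keep registration years 1950 to 2017 and prices 100 to 200,000 (inclusive)
in_range = df["yearsofregistration"].between(1950, 2017) & df["price"].between(100, 200_000)
clean = df[in_range].reset_index(drop=True)

print(clean)
```

The same majority-category pattern would apply to the other categorical columns mentioned above.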

I also carried out a univariate analysis to check how the counts are distributed within each variable column, so that irregular patterns in the data can be spotted.

(a) univariate analysis plot of fuel-type (b) univariate analysis plot of vehicle-type
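The numbers behind such a plot are simply the counts per category. A minimal sketch, using made-up fuel type values rather than the real column:

```python
import pandas as pd

# Made-up values for illustration only
fuel_type = pd.Series(["petrol"] * 5 + ["diesel"] * 3 + ["lpg"] * 2, name="fuelType")

counts = fuel_type.value_counts()                 # rows per category, largest first
shares = fuel_type.value_counts(normalize=True)   # the same as a fraction of all rows

print(counts)
print(shares)
```

Plotting `counts` as a bar chart gives a univariate plot of this kind.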

Finally, I drew a correlation heat map to study the correlation coefficients between my variables. Based on it, I kept the variables with a high correlation and deleted the others, which would not be of any use while building the model.

The correlation heatmap for attributes of used cars dataset

From the correlation table, I found that all the variables I kept for model building are correlated with the price variable, so I moved on to model building from this point.
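The selection step can be sketched as follows: compute the correlation matrix, look at the column for price, and keep the variables whose absolute correlation passes a threshold. The data is synthetic, and the column names 'powerPS', 'age' and 'unrelated' as well as the 0.2 threshold are assumptions for illustration, not the values used in this article.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative assumption): price depends on power and age,
# while the 'unrelated' column is pure noise.
rng = np.random.default_rng(0)
n = 300
power = rng.uniform(50, 300, n)
age = rng.uniform(0, 20, n)
price = 100 * power - 800 * age + rng.normal(scale=2000, size=n)

df = pd.DataFrame({
    "price": price,
    "powerPS": power,
    "age": age,
    "unrelated": rng.normal(size=n),
})

corr = df.corr()                            # the matrix a heat map visualises
price_corr = corr["price"].drop("price")    # correlation of each variable with price

threshold = 0.2                             # arbitrary cut-off for this sketch
selected = price_corr[price_corr.abs() >= threshold].index.tolist()

print(price_corr.round(2))
print("kept:", selected)
```

The absolute value is used so that strongly negative correlations, such as age against price here, are kept as well.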


In this section, we look at the models that have been applied to the three datasets that we obtained. Each model's performance has been evaluated using metrics suited to the chosen algorithm: mean absolute error (MAE), mean squared error (MSE) and r2 score for regression, and confusion matrices and ROC curves for classification.

The pre-processed dataset was split into two parts, X and Y, which were further split into x_train, y_train, x_test and y_test. X contains all the rows and all the columns except the dependent column that needs to be predicted. Y consists of the dependent column with all its row values. x_train and y_train are the features and actual values from which the model learns to predict. x_test is the part on which the trained model produces its predictions. y_test holds the actual values against which the predictions for x_test are compared to compute the evaluation metrics.
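A minimal sketch of this split with scikit-learn's train_test_split is shown below. The rows are made up, the 'kilometer' column name is hypothetical, and the 80/20 ratio and random seed are assumptions, since the split ratio is not stated here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only
df = pd.DataFrame({
    "price": [1500, 3200, 18000, 7400, 5600, 9900, 2100, 12500, 4300, 8800],
    "yearsofregistration": [2001, 2008, 2015, 2011, 2009, 2013, 2003, 2014, 2007, 2012],
    "kilometer": [150000, 125000, 30000, 90000, 100000, 60000, 150000, 40000, 125000, 70000],
})

X = df.drop(columns="price")   # every column except the one to predict
Y = df["price"]                # the dependent column

# Assumed 80/20 split; random_state only makes the shuffle repeatable
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```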

As mentioned earlier, I applied linear regression and random forest to this dataset since they seem apt for it. The linear regression model is imported from the sklearn.linear_model module and the random forest from the sklearn.ensemble module. Each model is built on the x_train and y_train values, used to predict on the x_test values, and compared with the y_test values. The common evaluation metrics for linear regression are the MAE, MSE and r2 score. MAE is the average of the absolute differences between predicted and actual values, while MSE is the average of the squares of the errors. The r2 score is the coefficient of determination, which is at most 1 and usually lies between 0 and 1 (it can be negative for a model that fits worse than always predicting the mean). In other words, it is the proportion of the variance in the dependent attribute that is explained by the independent attributes. When linear regression was applied to the used cars dataset, the result was as follows:

Output of linear regression

Although the dataset is well balanced, I obtained an r2 score of 34.2%, which indicates that the model is a poor one in terms of the evaluation metrics.
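To show what the three metrics measure, the sketch below computes them by hand on four made-up price pairs and checks the results against the functions in sklearn.metrics. The numbers are purely illustrative and unrelated to the scores reported here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only
y_true = np.array([3000.0, 5000.0, 7000.0, 9000.0])
y_pred = np.array([3500.0, 4500.0, 7000.0, 8000.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))    # average absolute error
mse = np.mean(errors ** 2)       # average squared error

ss_res = np.sum(errors ** 2)                     # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot                         # share of variation explained

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(mae, mse, r2)
```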

I also applied random forest to the same dataset. It is evaluated using the same metrics: MSE, MAE and r2 score. The output from the random forest is as follows:

Output of random forest regression

The random forest achieved an r2 score of 80.3%, which indicates a better model. I also compared both models to conclude which one is better. My comparison output is as follows:



The comparison shows that random forest outperforms linear regression when predicting the price of used cars.
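A side-by-side comparison like this one can be produced by fitting both models on the same split and collecting the metrics in one table. The sketch below uses synthetic data with a deliberately non-linear target, which is an assumption for illustration; it will not reproduce the scores reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, non-linear data (illustrative assumption, not the used cars dataset)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(600, 2))
y = 10 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=600)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=1),
}

rows = []
for name, model in models.items():
    # Fit on the training part, predict on the held-out part
    pred = model.fit(x_train, y_train).predict(x_test)
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

On data like this, where the target is not a linear function of the inputs, the random forest row should show the higher r2 score and the lower errors.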

