Car Price Prediction in Python

Check correlation from a car dataset and train different linear regression models to predict the price based on data points such as mileage and horsepower.

5 July, 2022

Contributors

Ander

@anderrv

The other day, we did some Exploratory Data Analysis on a car dataset. After working with the dataset and gathering many insights, we’ll focus on price prediction today.

The dataset comprises cars for sale in Germany, registered between 2011 and 2021, so we can assume that it is an accurate representation of current market prices.

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install the necessary libraries (pandas, seaborn, scikit-learn, CatBoost, and statsmodels all appear in this post) by running pip install.

Data Cleaning

Let’s say we want to predict car prices based on attributes. For that, we will train a model. Let’s start by replacing registration year with age and removing make and model — we won’t be using them for the predictions.
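As a minimal sketch of that first step, the snippet below computes age from the registration year and drops the unused columns. The tiny DataFrame and the column names (year, make, model, and so on) are assumptions standing in for the real dataset.

```python
from datetime import datetime

import pandas as pd

# Tiny stand-in for the real dataset; the column names are assumptions.
cars = pd.DataFrame({
    "make": ["BMW", "Volkswagen", "Opel"],
    "model": ["316", "Golf", "Corsa"],
    "year": [2011, 2016, 2021],
    "mileage": [150000, 80000, 5000],
    "hp": [116, 110, 75],
    "price": [6800, 12500, 15900],
})

# Replace the registration year with the car's age in years.
cars["age"] = datetime.now().year - cars["year"]

# Drop the columns that will not be used for the predictions.
cars = cars.drop(columns=["year", "make", "model"])
print(cars)
```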

We talked about outliers last week, and now it’s time to remove them. We’ll do that by removing the items in which the absolute z score for the price, horsepower, or mileage is higher than 3. In short, this means taking out the values that deviate more than three standard deviations from the mean, in either direction.

Before that, dropna will remove all lines with empty or null values.
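Both steps could look roughly like the following. This is a sketch on synthetic data, with the z score computed directly in pandas; the original code may well have used a helper from another library, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): 200 cars with plausible values.
rng = np.random.default_rng(0)
n = 200
cars = pd.DataFrame({
    "price": rng.normal(15000, 3000, n),
    "hp": rng.normal(120, 20, n),
    "mileage": rng.normal(80000, 20000, n),
})
cars.loc[0, "price"] = 500000  # an obvious outlier
cars.loc[1, "hp"] = np.nan     # a row with a missing value

# Remove the rows with empty or null values first.
cars = cars.dropna()

# z score: distance from the column mean, measured in standard deviations.
numeric = cars[["price", "hp", "mileage"]]
z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Keep only the rows where every absolute z score is below 3.
cars = cars[(z_scores.abs() < 3).all(axis=1)]
print(len(cars))
```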

Next, we will replace the category values (offer type and gear) with boolean markers. In practice, this means creating a new column for each category value (for example, for gear, it will be Automatic, Manual, and Semi-automatic).

There is a function in pandas to do that: get_dummies.
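A short illustration on made-up rows follows. The column names gear and offerType and the offer type values are assumptions. By default, pandas prefixes each new column with the original column name, so the markers end up as gear_Manual, gear_Automatic, and so on.

```python
import pandas as pd

# Made-up rows; column names and offer type values are assumptions.
cars = pd.DataFrame({
    "gear": ["Manual", "Automatic", "Semi-automatic", "Manual"],
    "offerType": ["Used", "Used", "New", "Demonstration"],
    "price": [6800, 12500, 15900, 9900],
})

# Each category value becomes its own marker column,
# and the original text columns are removed.
cars = pd.get_dummies(cars, columns=["gear", "offerType"])
print(cars.columns.tolist())
```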

Checking Correlation Visually

We are going to plot variable correlation using seaborn heatmap. It will show us graphically which variables are positively or negatively correlated.
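Here is one way to draw such a heatmap, sketched on synthetic data generated to mimic the relationships in the dataset (older cars have more mileage and lower prices). The data, the color map, and the output file name are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to use plt.show()
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (assumption), not the real dataset.
rng = np.random.default_rng(0)
n = 300
age = rng.integers(0, 11, n)
mileage = age * 15000 + rng.normal(0, 10000, n)
hp = rng.normal(120, 30, n)
price = 5000 + 120 * hp - 900 * age + rng.normal(0, 1500, n)
cars = pd.DataFrame({"age": age, "mileage": mileage, "hp": hp, "price": price})

# Pairwise correlation between all the numeric columns.
corr = cars.corr()

# Annotated heatmap, with colors ranging from -1 to 1.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.figure.savefig("correlation_heatmap.png")
```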

The highest correlations are age with mileage, which sounds fine, and price with horsepower, which is no big news either. As for negative correlation, price with age stands out, which also seems natural.

We can ignore the relation between Manual and Automatic: each car has only one or the other, and there are almost no Semi-automatics.

For a double check, we are going to plot horsepower and mileage variables with the price. We’ll do it with seaborn jointplot.

It will plot all the entries and a line for the regression. For brevity, there is only one code snippet. The second one would be the same, replacing hp with mileage.

Price Prediction

We are reaching the critical part. We will try three different prediction models and see which one performs better.

We need two variables, Y and X, containing the price and all the remaining columns, respectively. We will use these new variables for the other models too. Then, we split the data for training and testing in a 70%-30% distribution.
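The split can be sketched with train_test_split from sklearn. The synthetic data and the random_state value are assumptions; test_size=0.3 gives the 70%-30% distribution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the real dataset.
rng = np.random.default_rng(0)
n = 300
age = rng.integers(0, 11, n)
mileage = age * 15000 + rng.normal(0, 10000, n)
hp = rng.normal(120, 30, n)
price = 5000 + 120 * hp - 900 * age + rng.normal(0, 1500, n)
cars = pd.DataFrame({"age": age, "mileage": mileage, "hp": hp, "price": price})

# Y is the value to predict, X holds all the remaining columns.
Y = cars["price"]
X = cars.drop(columns=["price"])

# 70% of the rows for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```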

Disclaimer: we did several tests with all the models and chose the best results for each one. Not every model uses the same variables, and some “magic” numbers will appear. We adjusted those mainly through trial and error.

linear_model from sklearn

To train the first LinearRegression model, we will pass the train data to the fit method and then the test data to predict.

To check the results, we’ll be using R-squared for all of them. In this case, the result is 0.81237.
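Put together, the training and scoring steps might look like the sketch below. It runs on synthetic stand-in data, so the score it prints will not match the figure above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the real dataset.
rng = np.random.default_rng(0)
n = 300
age = rng.integers(0, 11, n)
mileage = age * 15000 + rng.normal(0, 10000, n)
hp = rng.normal(120, 30, n)
price = 5000 + 120 * hp - 900 * age + rng.normal(0, 1500, n)
X = pd.DataFrame({"age": age, "mileage": mileage, "hp": hp})
Y = pd.Series(price, name="price")

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

# Fit on the training data, then predict prices for the test data.
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# R-squared compares the predictions with the real test prices.
r2 = r2_score(y_test, predictions)
print(round(r2, 5))
```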

Regressor from CatBoost

Next, we’ll use Regressor from CatBoost. The model is created with some parameters that we adjusted by testing. As with the previous one, fit the model with the train data and check the score, which results in 0.92416.

There is a big difference in training time, though: this method is much slower, taking more than 20 seconds. It might be a closer match, but it is not a good option if it must run instantly.

OLS from statsmodels

For statsmodels, we will change X's value and take only mileage, hp, and age. With only these columns, the result is almost 10% better than with the previous value of X.

R-squared is 0.91823, and it runs in under two seconds — counting the data load.

Extra Ball: Best Prediction

What happens if we do not drop make and model? Two of the models would perform worse, but not CatBoost. Training will take much longer and use more space, since we would have more than 700 feature columns. But it is worth it if you are after accuracy.

For brevity, we will not reproduce all the manipulations we did previously. Instead of dropping make and model, create dummies for them and then continue as before.

This model exposes a method to obtain feature importance. We can use that data with a bar chart to check which features affect the prediction the most. We will limit them to 20, but as you’ll see, two of them — excluding price itself — carry all the weight.

Age not being significant might look suspicious at first glance. But it makes sense. As we saw in the correlation graph, age and mileage go hand in hand. So there is no need for the two of them to carry all the weight.

Estimate Car Price

Let’s say that you want to buy or sell your car. You collect the features we provide (mileage, year, etc.). How do you use the prediction model for that?

We’ll choose CatBoost again and use its predict method for that. We would need to go all the way again by transforming all the data with dummies, so we'll summarize. In a real-world app, this process would be extracted and performed equally for training, test, or actual data.

We will also need to add all the empty features (i.e., all the other makes) that the model supports.
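One way to add those empty features is to reindex the new data against the columns used for training. The column list and the two cars below are hypothetical and much shorter than the real ones.

```python
import pandas as pd

# Columns the model was trained on (a shortened, hypothetical list).
training_columns = [
    "mileage", "hp", "age",
    "make_BMW", "make_Opel", "make_Volkswagen",
    "gear_Automatic", "gear_Manual",
]

# New cars to estimate, entered by hand (made-up values).
new_cars = pd.DataFrame({
    "mileage": [45000, 120000],
    "hp": [150, 90],
    "age": [3, 8],
    "make": ["BMW", "Opel"],
    "gear": ["Automatic", "Manual"],
})

# Same dummy transformation as for the training data.
encoded = pd.get_dummies(new_cars, columns=["make", "gear"])

# Add the missing columns filled with 0 and match the training column order.
aligned = encoded.reindex(columns=training_columns, fill_value=0)
print(aligned)
```

The resulting frame has the same columns, in the same order, as the training data, so it can be passed to the model's predict method.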

Here we present an example with three cars for sale. We manually entered all the initial features (price included), so we can compare the output. As you’ll see, the predictions are not far from the actual price.

Conclusion

As a quick note on the three models, sklearn performs a bit worse. The other two are pretty similar in results, as long as makes and models are excluded, but not in time spent training. That might be a crucial aspect to consider when choosing between them.

If you are after high accuracy, train the CatBoost model with all the available data. It might take up to a minute, but it can be stored in a file and instantly loaded when needed.

As you’ve seen, loading a ZenRows generated dataset into pandas is quite simple. Then, there are some steps to perform: describe, explore manually, look at the values, and check for empty or null values.

These are everyday tasks when first testing a dataset. From there, we apply standard practices such as generating dummies or removing outliers.

And then comes the juicy part: in this case, price prediction using linear regression, but it could be anything.

Thanks for reading.


Originally published at https://www.zenrows.com
