Predicting NYC Taxi Fare Amount
using Python
23 November, 2020
Contributors
Team: Giovanna Nogueira Sa, Prakhar Jain, Ritika Srivastava, Xiaomei Wang
Background
New York City is known as one of the busiest cities in the world, and anyone who visits the Big Apple notices the number of yellow taxis driving around the city. This project focuses on predicting the New York City yellow taxi fare amount, i.e., the time-and-distance fare calculated by the meter. Today, the biggest advantage that ride-hailing apps have over yellow cabs is up-front price prediction: riders can plan around how much they will spend, which is especially useful for tourists. Providing passengers with a fare estimate for the ride they want to take could therefore help the yellow-taxi industry improve in this area.
What we did
We predicted the NYC yellow taxi fare using variables such as pickup/dropoff locations, trip distance, payment type, and trip and toll amounts. The project was done in Python, and the models were built with Python libraries such as scikit-learn, the statsmodels API, and NumPy.
Our approach
• Sampled the data using stratified random sampling (a preprocessing sketch covering the sampling, cleaning, and feature-engineering steps follows this list).
• There were no missing values, but we removed records with zero or negative fares, zero passengers, zero trip distance, or zero time in the taxi.
• Engineered features useful for the analysis, such as weekday (indicating weekday vs. weekend) and daytime (the time of day the ride was taken), and broke the pickup and dropoff timestamps down into year, month, day, weekday, and hour.
• Explored the data in Tableau to understand the relationships between the variables.
• Built models such as OLS, gradient boosting, and random forest (see the modeling sketch below).
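Below is a minimal preprocessing sketch, in pandas, of the sampling, cleaning, and feature-engineering steps described above. The column names follow the public TLC yellow-taxi schema (tpep_pickup_datetime, fare_amount, trip_distance, etc.), and the time-of-day bins and 5% sampling fraction are illustrative assumptions rather than the project's exact choices.

import pandas as pd

# Column names follow the public TLC yellow-taxi schema; they are assumptions,
# since the project's exact field names are not shown here.
df = pd.read_csv(
    "yellow_tripdata_sample.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

# Remove records with zero/negative fares, zero passengers,
# zero trip distance, or zero time in the taxi.
duration_min = (
    df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
).dt.total_seconds() / 60
df = df[
    (df["fare_amount"] > 0)
    & (df["passenger_count"] > 0)
    & (df["trip_distance"] > 0)
    & (duration_min > 0)
].copy()

# Feature engineering: calendar parts of the pickup time, a weekday/weekend
# flag, and a coarse time-of-day label (bins are illustrative).
pickup = df["tpep_pickup_datetime"]
df["year"] = pickup.dt.year
df["month"] = pickup.dt.month
df["day"] = pickup.dt.day
df["hour"] = pickup.dt.hour
df["weekday"] = (pickup.dt.dayofweek < 5).astype(int)  # 1 = weekday, 0 = weekend
df["daytime"] = pd.cut(
    df["hour"], bins=[-1, 5, 11, 17, 23],
    labels=["night", "morning", "afternoon", "evening"],
)

# Stratified random sample: sample within each pickup hour so the hourly
# mix of trips is preserved in the subset.
sample = df.groupby("hour", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=42)
)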
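Continuing from the preprocessing sketch, the following shows one way the OLS, gradient-boosting, and random-forest models could be fit with the statsmodels API and scikit-learn. The feature list and hyperparameters here are assumptions for illustration; the project also used variables such as pickup/dropoff locations, payment type, and tolls.

import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical feature list; the project's full model used more variables.
features = ["trip_distance", "hour", "weekday", "passenger_count"]
X, y = sample[features], sample["fare_amount"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# OLS baseline via the statsmodels API.
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())

# Tree-based ensembles via scikit-learn (hyperparameters are illustrative).
gbr = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)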
Conclusion
We selected random forest as the final model. Compared with the other models, it achieved an RMSE of 1.515 (R-squared 0.964) on the training set and 1.848 (R-squared 0.937) on the validation set. The small gap between training and validation performance also showed that the random forest kept overfitting in check, i.e., low bias without the high variance that overfitting would imply.
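For reference, train-versus-validation RMSE and R-squared values like those quoted above could be computed along these lines (continuing from the modeling sketch; the exact numbers depend on the data, split, and hyperparameters).

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Comparing train and validation metrics is how overfitting is checked:
# a small gap between the two suggests the model generalizes well.
for name, model in [("Gradient Boosting", gbr), ("Random Forest", rf)]:
    for split, X_, y_ in [("train", X_train, y_train), ("validation", X_val, y_val)]:
        pred = model.predict(X_)
        rmse = np.sqrt(mean_squared_error(y_, pred))
        print(f"{name} {split}: RMSE = {rmse:.3f}, R^2 = {r2_score(y_, pred):.3f}")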