Predicting NYC Taxi Fare Amount
using Python
23 November, 2020
Contributors
Team: Giovanna Nogueira Sa, Prakhar Jain, Ritika Srivastava, Xiaomei Wang
Background
New York City is known as one of the busiest cities in the world, and anyone who visits the Big Apple notices the number of yellow taxis driving around the city. This project focuses on predicting the New York City yellow taxi fare amount, i.e., the time-and-distance fare calculated by the meter. Today, the biggest advantage that ride-hailing apps have over yellow cabs is up-front price prediction: riders can plan around how much they will spend, which is especially useful for tourists. Providing passengers with a fare estimate for the ride they want to take could therefore help the yellow-taxi industry improve in this area.
What we did
We predicted the NYC yellow taxi fare using variables such as pickup/dropoff locations, trip distance, payment type, and trip and toll amounts. The project was done in Python, and the models were built with Python libraries such as scikit-learn, the statsmodels API, and NumPy.
Our approach
• Sampled the data using stratified random sampling (a preprocessing sketch covering the sampling, cleaning, and feature-engineering steps follows this list).
• There were no missing values, but we removed records with zero or negative fares, zero passengers, zero trip distance, or zero time in the taxi.
• Engineered features useful for the analysis, such as weekday (indicating weekday vs. weekend) and daytime (the time of day the ride was taken), and broke the pickup and dropoff timestamps down into year, month, day, weekday, and hour.
• Explored the data in Tableau to understand the relationships between the variables.
• Built models such as OLS, gradient boosting, and random forest (see the modeling sketch below).
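Below is a minimal preprocessing sketch, in pandas, of the sampling, cleaning, and feature-engineering steps described above. The column names follow the public TLC yellow-taxi schema (tpep_pickup_datetime, fare_amount, trip_distance, etc.), and the time-of-day bins and 5% sampling fraction are illustrative assumptions rather than the project's exact choices.

import pandas as pd

# Column names follow the public TLC yellow-taxi schema; they are assumptions,
# since the project's exact field names are not shown here.
df = pd.read_csv(
    "yellow_tripdata_sample.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

# Remove records with zero/negative fares, zero passengers,
# zero trip distance, or zero time in the taxi.
duration_min = (
    df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
).dt.total_seconds() / 60
df = df[
    (df["fare_amount"] > 0)
    & (df["passenger_count"] > 0)
    & (df["trip_distance"] > 0)
    & (duration_min > 0)
].copy()

# Feature engineering: calendar parts of the pickup time, a weekday/weekend
# flag, and a coarse time-of-day label (bins are illustrative).
pickup = df["tpep_pickup_datetime"]
df["year"] = pickup.dt.year
df["month"] = pickup.dt.month
df["day"] = pickup.dt.day
df["hour"] = pickup.dt.hour
df["weekday"] = (pickup.dt.dayofweek < 5).astype(int)  # 1 = weekday, 0 = weekend
df["daytime"] = pd.cut(
    df["hour"], bins=[-1, 5, 11, 17, 23],
    labels=["night", "morning", "afternoon", "evening"],
)

# Stratified random sample: sample within each pickup hour so the hourly
# mix of trips is preserved in the subset.
sample = df.groupby("hour", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=42)
)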
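Continuing from the preprocessing sketch, the following shows one way the OLS, gradient-boosting, and random-forest models could be fit with the statsmodels API and scikit-learn. The feature list and hyperparameters here are assumptions for illustration; the project also used variables such as pickup/dropoff locations, payment type, and tolls.

import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical feature list; the project's full model used more variables.
features = ["trip_distance", "hour", "weekday", "passenger_count"]
X, y = sample[features], sample["fare_amount"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# OLS baseline via the statsmodels API.
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())

# Tree-based ensembles via scikit-learn (hyperparameters are illustrative).
gbr = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)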
Conclusion
We selected random forest as the final model. Compared with the other models, it achieved an RMSE of 1.515 (R-squared 0.964) on the training set and 1.848 (R-squared 0.937) on the validation set. The small gap between training and validation performance also showed that the random forest kept overfitting in check, i.e., low bias without the high variance that overfitting would imply.
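For reference, train-versus-validation RMSE and R-squared values like those quoted above could be computed along these lines (continuing from the modeling sketch; the exact numbers depend on the data, split, and hyperparameters).

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Comparing train and validation metrics is how overfitting is checked:
# a small gap between the two suggests the model generalizes well.
for name, model in [("Gradient Boosting", gbr), ("Random Forest", rf)]:
    for split, X_, y_ in [("train", X_train, y_train), ("validation", X_val, y_val)]:
        pred = model.predict(X_)
        rmse = np.sqrt(mean_squared_error(y_, pred))
        print(f"{name} {split}: RMSE = {rmse:.3f}, R^2 = {r2_score(y_, pred):.3f}")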