I undertook an analysis of a massive dataset containing over 160 million trip records from New York City’s iconic yellow and green taxi cabs spanning January 2019 to April 2022.
With a dataset of this scale, Databricks and its PySpark API were well suited, providing a unified analytics platform on an Apache Spark cluster-computing framework for distributed, parallel processing of the data.
The key business questions involved analyzing monthly trends, comparing statistics between the two taxi types, understanding tipping behavior, identifying profitable trip durations based on distance per dollar, and building predictive models for fare amounts.
Link to Github
Datasets
The New York City Taxi and Limousine Commission publishes detailed trip records for both the iconic yellow taxis that can pick up street-hailing passengers anywhere in the city, as well as the newer green taxis introduced in 2013 to service the outer boroughs.
The datasets used span from January 2019 through April 2022 and contain a wealth of information about each trip, including pick-up and drop-off dates/times, locations, trip distances, itemized fare details like the base fare and taxes/surcharges, rate types based on times/locations, payment types, and driver-reported passenger counts.
The datasets were downloaded in Parquet format from the nyc.gov website here
Infrastructure Setup
Azure Cloud Storage
After downloading the datasets, a storage container was created in Azure Blob Storage and all the data files were uploaded to it. Parquet files compress better than CSVs, making them more storage-efficient, and they are also much faster to read.
Initially, two folders were created, one for green and one for yellow taxis. However, the data type of the yellow taxis' airport_fee column is inconsistent (a double in some months and an integer in others), so the yellow taxi data was further split into two folders.
For yellow taxis, data from all months in 2019, January to July 2020, and October 2020 went into an 'integer' folder, while data from August, September, November, and December 2020 and all months in 2021 and 2022 went into a 'double' folder.
Databricks Setup
A Community Edition Databricks account (free) was set up, and a new cluster was created from the Compute menu with Runtime 11.2 (Spark 3.3.0, Scala 2.12).
Note: In the Databricks Community Edition, the cluster terminates after a 2-hour idle period and must be cloned every time the workspace is reopened, as this edition does not allow restarting terminated clusters.
The DBFS File Browser was then enabled in the Admin Console settings (under Workspace Settings).
A new notebook was created in the Workspace, with Python selected for the Default Language, and the required Azure Blob Storage container was mounted using the following script.
```python
storage_account_name = "{STORAGE ACCOUNT NAME}"
storage_account_access_key = "{STORAGE ACCOUNT ACCESS KEY}"  # from the storage account, under 'Security + networking'
blob_container_name = "{BLOB CONTAINER NAME}"

dbutils.fs.mount(
    source=f'wasbs://{blob_container_name}@{storage_account_name}.blob.core.windows.net',
    mount_point=f'/mnt/{blob_container_name}/',
    extra_configs={f'fs.azure.account.key.{storage_account_name}.blob.core.windows.net': storage_account_access_key}
)
```
Data Cleaning & Preparation
- The files from the green taxis folder, the yellow taxis 'integer' folder, and the yellow taxis 'double' folder were read into three separate data frames using `spark.read.parquet()`, and each one was cached.
- The `airport_fee` column in the yellow taxis 'integer' data frame was cast to double, and the two yellow taxi data frames were combined into a single data frame.
- The rows in each data frame were counted: 9,390,483 rows for green taxis and 152,823,008 rows for the combined yellow taxis.
- Additional columns were created for both green and yellow taxi data frames to aid in cleaning and analysis:
  - `DurationInSeconds`: the difference between the finish time and the start time, in seconds.
  - `DurationInHours`: `DurationInSeconds` divided by 3600, rounded to 3 decimal places.
  - `DurationInMinutes`: `DurationInSeconds` divided by 60, rounded to 3 decimal places.
  - `trip_distance_km`: `trip_distance` multiplied by 1.60934.
  - `speed`: `trip_distance` divided by `DurationInHours`, rounded to 3 decimal places.
  - `speed_km_hr`: `trip_distance_km` divided by `DurationInHours`, rounded to 3 decimal places.
- The datasets were cleaned to remove unrealistic rows that could skew aggregate calculations:
- Trips that finished before the start time were removed.
- Trips with negative speed were removed.
- Trips starting outside January 2019 to the end of April 2022 were removed.
- Low and high percentiles for speed, duration, and distance were computed and used to filter the data.
- Trips with speeds over 36 mph (approximately the 99th percentile) were removed.
- Trips shorter than 60 seconds or longer than 4500 seconds were removed.
- Trips shorter than 0.4 miles or longer than 22 miles were removed.
- A column for the taxi's colour (green or yellow) was added, and the pickup and dropoff datetime columns were renamed to match across data frames. The data frames were then combined using `unionByName`, allowing missing columns.
- The new combined data frame was exported to a parquet file and loaded as a temporary view to enable SQL operations.
Data Analysis
Monthly Aggregates
The following monthly aggregates were computed and merged into a single dataframe:
- Total Trips: Counted rows per month.
- Busiest Day: Determined by counting trips per day of the week (1 = Sunday) and ranking.
- Busiest Hour: Counted trips per hour and identified the hour with the most trips each month.
- Average Passengers: Averaged `passenger_count` per month.
- Average Fare: Averaged `total_amount` per trip each month.
- Average Fare per Passenger: Averaged `total_amount / passenger_count` each month.
month_year | total_trips | dayofweek | hr | avg_passenger_count | avg_total_amount | avg_amt_per_passenger |
---|---|---|---|---|---|---|
2019-01 | 7906544 | 6 | 18 | 1.55 | 15.44 | 12.67 |
2019-02 | 7316958 | 6 | 18 | 1.56 | 18.3 | 15 |
2019-03 | 8128986 | 6 | 18 | 1.56 | 18.09 | 15.29 |
2019-04 | 7685437 | 6 | 18 | 1.56 | 18.99 | 15.49 |
2019-05 | 7716015 | 6 | 18 | 1.56 | 18.94 | 15.45 |
2019-06 | 7111000 | 7 | 18 | 1.56 | 19.02 | 15.55 |
2019-07 | 6441573 | 4 | 18 | 1.56 | 18.06 | 15.34 |
2019-08 | 6196916 | 5 | 18 | 1.56 | 18.98 | 15.37 |
2019-09 | 6653629 | 6 | 18 | 1.54 | 19.18 | 15.67 |
2019-10 | 7316092 | 6 | 18 | 1.53 | 19.15 | 15.63 |
2019-11 | 6939987 | 6 | 18 | 1.53 | 18.75 | 15.32 |
2019-12 | 6596217 | 5 | 18 | 1.54 | 18.87 | 15.29 |
2020-01 | 6559092 | 6 | 18 | 1.51 | 18.09 | 14.81 |
2020-02 | 6378824 | 7 | 18 | 1.49 | 18.08 | 14.91 |
2020-03 | 3606502 | 3 | 18 | 1.46 | 17.8 | 14.86 |
2020-04 | 2349705 | 3 | 16 | 1.29 | 15.85 | 13.39 |
2020-05 | 367893 | 6 | 16 | 1.31 | 17.54 | 13.7 |
2020-06 | 559263 | 8 | 15 | 1.38 | 17.99 | 14.31 |
2020-07 | 802560 | 6 | 15 | 1.38 | 18.09 | 14.15 |
2020-08 | 1008140 | 2 | 16 | 1.41 | 18.16 | 14.3 |
Taxi Colour Statistics:
For yellow and green taxis, average, median, minimum, and maximum were computed for the trip duration, distance, and speed.
cab_colour | feature | average | median | minimum | maximum |
---|---|---|---|---|---|
yellow | Trip Duration (mins) | 14.16 | 11.22 | 1 | 75 |
green | Trip Duration (mins) | 16.38 | 12.68 | 1 | 75 |
yellow | Trip distance (km) | 4.72 | 2.75 | 0.64 | 35.41 |
green | Trip distance (km) | 5.86 | 3.6 | 0.64 | 35.41 |
yellow | Trip speed (km/hr) | 18.58 | 16.48 | 0.52 | 57.94 |
green | Trip speed (km/hr) | 20.15 | 17.88 | 0.57 | 57.94 |
Tip Analysis:
- Percentage with Tips: Calculated the percentage of trips where `tip_amount` > 0.
- Percentage with Tips ≥ $10: Of the trips with tips, calculated the percentage where `tip_amount` ≥ $10.
Metric | Value |
---|---|
percentage_trips_with_tips | 69.67 |
percentage_tips_greater_10 | 2.85 |
Trip Duration Bins:
The trips were classified into 6 bins according to their duration. For each bin, the average speed (`speed_km_hr`) and average distance per dollar (`trip_distance_km / total_amount`) were computed.
DurationBins | avg_speed | distance_per_dollar |
---|---|---|
Under 5 Mins | 20.17 | 0.13 |
5-10 Mins | 16.88 | 0.17 |
10-20 Mins | 17.41 | 0.23 |
20-30 Mins | 20.43 | 0.28 |
30-60 Mins | 26.09 | 0.36 |
60 or More Mins | 22.67 | 0.4 |
Regression Modeling
To predict the fare amount, both a linear regression model and a decision tree model were developed. The process of constructing these models involved several key steps:
1. Feature Engineering
Additional features were created by extracting elements from the pickup and dropoff datetimes:
- `PUmonth` / `DOmonth`: month of the pickup / dropoff datetime.
- `PUyear` / `DOyear`: year of the pickup / dropoff datetime.
- `PUhour` / `DOhour`: hour of the pickup / dropoff datetime.
- `PUdayofweek` / `DOdayofweek`: weekday of the pickup / dropoff datetime.
2. Feature Selection
- Columns that are components of the final bill, such as additional fees, surcharges, and `total_amount`, were excluded, since they would leak the target into the features.
- Columns with missing values, such as `passenger_count`, were also excluded.
Features selected:
- Numerical Features: PUmonth, PUyear, PUhour, PUdayofweek, DOmonth, DOyear, DOhour, DOdayofweek, trip_distance, DurationInSeconds.
- Categorical Features: VendorID, PULocationID, DOLocationID, cab_colour.
3. Data Preprocessing
- Negative values of `fare_amount` were removed to avoid inaccuracies in the prediction.
- A pipeline was created to handle categorical features using string indexing and one-hot encoding.
- A vector assembler combined the encoded categorical features and numerical features into a single `features` column.
4. Model Training
- Data for April 2022 was set aside as the testing dataset.
- The remaining data was split 80:20 into training and validation datasets, using seed 8.
- Both linear regression and decision tree models were fitted on the training dataset, specifying `featuresCol` as 'features' and `labelCol` as 'fare_amount'.
5. Results
Model | Validation RMSE |
---|---|
Linear Regression | 219.41 |
Decision Tree | 160.689 |
- The decision tree model performed better with a lower RMSE but took longer to train (4.33 hours compared to 2.41 hours for linear regression).
- The RMSE for the decision tree model on the testing set (April 2022) was 4.50032.