I undertook an analysis of a massive dataset containing over 160 million trip records from New York City’s iconic yellow and green taxi cabs spanning January 2019 to April 2022.
With a dataset of this scale, Databricks and its PySpark API were well suited, providing a unified analytics platform on an Apache Spark cluster-computing framework for distributed, parallel processing of the data.
The key business questions involved analyzing monthly trends, comparing statistics between the two taxi types, understanding tipping behavior, identifying profitable trip durations based on distance per dollar, and building predictive models for fare amounts.
Link to Github
Datasets
The New York City Taxi and Limousine Commission publishes detailed trip records for both the iconic yellow taxis that can pick up street-hailing passengers anywhere in the city, as well as the newer green taxis introduced in 2013 to service the outer boroughs.
The datasets used span from January 2019 through April 2022 and contain a wealth of information about each trip, including pick-up and drop-off dates/times, locations, trip distances, itemized fare details like the base fare and taxes/surcharges, rate types based on times/locations, payment types, and driver-reported passenger counts.
The datasets were downloaded in Parquet format from the nyc.gov website here
Infrastructure Setup
Azure Cloud Storage
After downloading the datasets, a storage container was created in Azure Blob Storage and all the data files were uploaded to it. Parquet files compress better than CSVs, making them more storage-efficient, and they are also much faster to read.
Initially, two folders were created, one for green and one for yellow taxis. However, the data type of the yellow taxis' airport_fee column is inconsistent (a double in some months and an integer in others), so the yellow taxi data was further split into two folders.
For yellow taxis, data from all months in 2019, January to July 2020, and October 2020 went into an 'integer' folder, while data from August, September, November, and December 2020 and all months in 2021 and 2022 went into a 'double' folder.
Databricks Setup
A Community Edition Databricks account (free) was set up, and a new cluster was created from the Compute menu with Runtime 11.2 (Spark 3.3.0, Scala 2.12).
Note: In the Databricks Community Edition, the cluster terminates after a 2-hour idle period and must be cloned every time the workspace is reopened, as this edition does not allow restarting terminated clusters.
The DBFS File Browser was then enabled in the Admin Console settings (under Workspace Settings).
A new notebook was created in the Workspace, with Python selected for the Default Language, and the required Azure Blob Storage container was mounted using the following script.
```python
storage_account_name = "{STORAGE ACCOUNT NAME}"
storage_account_access_key = "{STORAGE ACCOUNT ACCESS KEY}"  # from the storage account, under 'Security + networking'
blob_container_name = "{BLOB CONTAINER NAME}"

dbutils.fs.mount(
    source=f'wasbs://{blob_container_name}@{storage_account_name}.blob.core.windows.net',
    mount_point=f'/mnt/{blob_container_name}/',
    extra_configs={f'fs.azure.account.key.{storage_account_name}.blob.core.windows.net': storage_account_access_key}
)
```
Data Cleaning & Preparation
- The files from the green taxis folder, the yellow taxis 'integer' folder, and the yellow taxis 'double' folder were read into three separate data frames using `spark.read.parquet()`, and each one was cached.
- The `airport_fee` column in the yellow taxis 'integer' data frame was cast to double, and the two yellow taxi data frames were combined into a single data frame.
- The rows in each data frame were counted: 9,390,483 rows for green taxis and 152,823,008 rows for the combined yellow taxis.
- Additional columns were created for both green and yellow taxi data frames to aid in cleaning and analysis:
  - `DurationInSeconds`: the difference between the finish time and the start time, in seconds.
  - `DurationInHours`: `DurationInSeconds` divided by 3600, rounded to 3 decimal places.
  - `DurationInMinutes`: `DurationInSeconds` divided by 60, rounded to 3 decimal places.
  - `trip_distance_km`: `trip_distance` multiplied by 1.60934.
  - `speed`: `trip_distance` divided by `DurationInHours`, rounded to 3 decimal places.
  - `speed_km_hr`: `trip_distance_km` divided by `DurationInHours`, rounded to 3 decimal places.
- The datasets were cleaned to remove unrealistic rows that could skew aggregate calculations:
- Trips that finished before the start time were removed.
- Trips with negative speed were removed.
- Trips starting outside January 2019 to the end of April 2022 were removed.
- Low and high percentiles for speed, duration, and distance were computed and used to filter the data.
- Trips with speeds over 36 mph (approximately the 99th percentile) were removed.
- Trips shorter than 60 seconds or longer than 4500 seconds were removed.
- Trips shorter than 0.4 miles or longer than 22 miles were removed.
- A column for the taxi's colour (green or yellow) was added, and the pickup and dropoff datetime columns were renamed to match across data frames. The data frames were then combined using `unionByName`, allowing missing columns.
- The new combined data frame was exported to a parquet file and loaded as a temporary view to enable SQL operations.
Data Analysis
Monthly Aggregates
The following monthly aggregates were computed and merged into a single dataframe:
- Total Trips: Counted rows per month.
- Busiest Day: Determined by counting trips per day of the week (1 = Sunday) and ranking.
- Busiest Hour: Counted trips per hour and identified the hour with the most trips each month.
- Average Passengers: Averaged `passenger_count` per month.
- Average Fare: Averaged `total_amount` per trip each month.
- Average Fare per Passenger: Averaged `total_amount / passenger_count` each month.
month_year | total_trips | dayofweek | hr | avg_passenger_count | avg_total_amount | avg_amt_per_passenger |
---|---|---|---|---|---|---|
2019-01 | 7906544 | 6 | 18 | 1.55 | 15.44 | 12.67 |
2019-02 | 7316958 | 6 | 18 | 1.56 | 18.3 | 15 |
2019-03 | 8128986 | 6 | 18 | 1.56 | 18.09 | 15.29 |
2019-04 | 7685437 | 6 | 18 | 1.56 | 18.99 | 15.49 |
2019-05 | 7716015 | 6 | 18 | 1.56 | 18.94 | 15.45 |
2019-06 | 7111000 | 7 | 18 | 1.56 | 19.02 | 15.55 |
2019-07 | 6441573 | 4 | 18 | 1.56 | 18.06 | 15.34 |
2019-08 | 6196916 | 5 | 18 | 1.56 | 18.98 | 15.37 |
2019-09 | 6653629 | 6 | 18 | 1.54 | 19.18 | 15.67 |
2019-10 | 7316092 | 6 | 18 | 1.53 | 19.15 | 15.63 |
2019-11 | 6939987 | 6 | 18 | 1.53 | 18.75 | 15.32 |
2019-12 | 6596217 | 5 | 18 | 1.54 | 18.87 | 15.29 |
2020-01 | 6559092 | 6 | 18 | 1.51 | 18.09 | 14.81 |
2020-02 | 6378824 | 7 | 18 | 1.49 | 18.08 | 14.91 |
2020-03 | 3606502 | 3 | 18 | 1.46 | 17.8 | 14.86 |
2020-04 | 2349705 | 3 | 16 | 1.29 | 15.85 | 13.39 |
2020-05 | 367893 | 6 | 16 | 1.31 | 17.54 | 13.7 |
2020-06 | 559263 | 8 | 15 | 1.38 | 17.99 | 14.31 |
2020-07 | 802560 | 6 | 15 | 1.38 | 18.09 | 14.15 |
2020-08 | 1008140 | 2 | 16 | 1.41 | 18.16 | 14.3 |
Taxi Colour Statistics:
For yellow and green taxis, average, median, minimum, and maximum were computed for the trip duration, distance, and speed.
cab_colour | feature | average | median | minimum | maximum |
---|---|---|---|---|---|
yellow | Trip Duration (mins) | 14.16 | 11.22 | 1 | 75 |
green | Trip Duration (mins) | 16.38 | 12.68 | 1 | 75 |
yellow | Trip distance (km) | 4.72 | 2.75 | 0.64 | 35.41 |
green | Trip distance (km) | 5.86 | 3.6 | 0.64 | 35.41 |
yellow | Trip speed (km/hr) | 18.58 | 16.48 | 0.52 | 57.94 |
green | Trip speed (km/hr) | 20.15 | 17.88 | 0.57 | 57.94 |
Tip Analysis:
- Percentage with Tips: Calculated the percentage of trips where `tip_amount` > 0.
- Percentage with Tips ≥ $10: Of the trips with tips, calculated the percentage where `tip_amount` ≥ $10.
Metric | Value |
---|---|
percentage_trips_with_tips | 69.67 |
percentage_tips_greater_10 | 2.85 |
Trip Duration Bins:
The trips were classified into 6 bins according to their duration. For each bin, the average speed (`speed_km_hr`) and average distance per dollar (`trip_distance_km / total_amount`) were computed.
DurationBins | avg_speed | distance_per_dollar |
---|---|---|
Under 5 Mins | 20.17 | 0.13 |
5-10 Mins | 16.88 | 0.17 |
10-20 Mins | 17.41 | 0.23 |
20-30 Mins | 20.43 | 0.28 |
30-60 Mins | 26.09 | 0.36 |
60 or More Mins | 22.67 | 0.4 |
Regression Modeling
To predict the fare amount, both a linear regression model and a decision tree model were developed. The process of constructing these models involved several key steps:
1. Feature Engineering
Additional features were created by extracting elements from the pickup and dropoff datetimes:
- `PUmonth` / `DOmonth`: month of the pickup / dropoff datetime.
- `PUyear` / `DOyear`: year of the pickup / dropoff datetime.
- `PUhour` / `DOhour`: hour of the pickup / dropoff datetime.
- `PUdayofweek` / `DOdayofweek`: weekday of the pickup / dropoff datetime.
2. Feature Selection
- Columns that are components of the final bill, such as additional fees, surcharges, and `total_amount`, were excluded, since they would leak the target into the features.
- Columns with missing values, such as `passenger_count`, were also excluded.
Features selected:
- Numerical Features: PUmonth, PUyear, PUhour, PUdayofweek, DOmonth, DOyear, DOhour, DOdayofweek, trip_distance, DurationInSeconds.
- Categorical Features: VendorID, PULocationID, DOLocationID, cab_colour.
3. Data Preprocessing
- Negative values of `fare_amount` were removed to avoid inaccuracies in the prediction.
- A pipeline was created to handle categorical features using string indexing and one-hot encoding.
- A vector assembler combined the encoded categorical features and numerical features into a single `features` column.
4. Model Training
- Data for April 2022 was set aside as the testing dataset.
- The remaining data was split 80:20 into training and validation datasets, using seed 8.
- Both linear regression and decision tree models were fitted on the training dataset, specifying `featuresCol` as 'features' and `labelCol` as 'fare_amount'.
5. Results
Model | Validation RMSE |
---|---|
Linear Regression | 219.41 |
Decision Tree | 160.689 |
- The decision tree model performed better with a lower RMSE but took longer to train (4.33 hours compared to 2.41 hours for linear regression).
- The RMSE for the decision tree model on the testing set (April 2022) was 4.50032.