Surviving the Dreaded Commute in NYC: A Case Study on Citi Bike & Ride-Hailing
Commuting in New York City can be really hard. It is a challenge that most New Yorkers face in their daily lives. The abundance of transportation options and the complicated traffic in the city make getting around painful. In my first few months living in the city, I often arrived late because I chose the wrong mode of transportation. As time went by, getting around the city became more relaxed for me. When a problem gets easier as experience accumulates, it suggests that we can find patterns in it from past data, and luckily, we have tons of transportation data in NYC.
That’s why I proposed running a case study with big data tools on two iconic and relatively new modes of transportation, Citi Bike and ride-hailing (including Uber, Lyft, and others), for the final project of my class at NYU: Processing Big Data for Analytics Applications. In this class, we learned how to manipulate large volumes of data using Hadoop MapReduce, Spark, Hive, Impala, and a variety of other classic big data tools. These tools are extremely useful and powerful for running analytics on my dataset, the ride-hailing data, which you can find on NYC TLC Trip Record Data (LINK). This data source ranges from 2019-02 to 2022-05, contains 0.6 billion rows, and is around 15 GB in size. My teammate Yanchen’s Citi Bike dataset covers the same time range and is also around 15 GB. Both are a good fit for these big data tools. (Figure 1)
Each row of these datasets is a trip record with duration and location data. I used PySpark for the whole data pipeline and ran the project on the NYU HPC platform. Here is our data flow diagram for the study design.
To avoid repeatedly downloading files and uploading them to NYU Peel, I used curl to fetch all the data from the TLC website directly onto Peel and then uploaded everything to HDFS at once. The files are stored in .parquet format, a column-oriented data file format designed for efficient data storage and retrieval, which works well with Apache Spark but is harder to work with in MapReduce.
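To make the bulk download concrete, here is a minimal sketch of how the list of monthly file URLs could be generated before handing them to curl. The base URL and the file-name pattern are hypothetical placeholders, not the actual TLC links, so check the TLC page for the real ones.

```python
# Assumption: BASE_URL and FILE_PATTERN are illustrative placeholders,
# not the real TLC download links.
BASE_URL = "https://example.com/trip-data"
FILE_PATTERN = "fhvhv_tripdata_{year:04d}-{month:02d}.parquet"


def month_range(start, end):
    """Yield (year, month) pairs from start to end, both inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def build_urls(start=(2019, 2), end=(2022, 5)):
    """Build one download URL per month in the study's time range."""
    return [
        f"{BASE_URL}/{FILE_PATTERN.format(year=y, month=m)}"
        for y, m in month_range(start, end)
    ]


urls = build_urls()
print(len(urls), "files to fetch")
```

Each URL could then be fetched on the cluster with `curl -O <url>`, and the downloaded files uploaded together with `hdfs dfs -put`.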
After using MapReduce and PySpark to clean and aggregate the 1.2 billion rows of data, we gained some insights into the usage and trip duration of Citi Bike and the ride-hailing companies. The first visualization (Figure 3) shows the overall historical usage (total trip counts) of Citi Bike and ride-hailing from 2019-02 to 2022-05, and Figure 4 shows the corresponding month-over-month percentage change. You might immediately notice a sharp drop around April 2020. That is precisely when the first wave of COVID-19 hit the city and residents were required to stay at home. After the first wave, both businesses recovered from the impact, but overall, Citi Bike recovered more quickly than ride-hailing, maybe because you do not need to worry about sharing the same space with anyone else when you are on a bike. Another significant pattern you will most likely notice is the three bell-shaped rises in Citi Bike usage across the three years. That is called seasonality, which is formally defined as “predictable fluctuation or pattern that recurs or repeats over a one-year period” (Wiki).
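For readers who want to see what a month-over-month percentage means in practice, here is a small sketch in plain Python. The trip counts are made-up sample numbers, not values from our dataset, and the function name is only illustrative.

```python
def month_over_month(counts):
    """Return the percentage change between consecutive months.

    counts: list of (month_label, trip_count), sorted chronologically.
    The first month has no predecessor, so it is not in the result.
    """
    changes = []
    for (_, prev), (label, curr) in zip(counts, counts[1:]):
        # Guard against division by zero for months with no trips.
        pct = (curr - prev) * 100 / prev if prev else None
        changes.append((label, pct))
    return changes


# Made-up sample counts that mimic a sharp drop followed by a recovery.
sample = [("2020-02", 1000), ("2020-03", 800), ("2020-04", 200), ("2020-05", 300)]
for label, pct in month_over_month(sample):
    print(label, pct)
```

A large negative value marks a sharp drop relative to the previous month, and a positive value marks a recovery.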
If you look more closely at the seasonality in Figure 5, which shows the total monthly trip counts, you will notice that Citi Bike ridership rises in the summer and falls when winter comes, especially from January to March, when the city is at its coldest. You may also wonder why the ride-hailing lines fluctuate. Our speculation is that COVID-19 hit ride-hailing companies severely from April to June, and that people are, in general, more willing to go out when the weather is warmer. Identifying seasonal patterns in Citi Bike and ride-hailing usage is helpful for future forecasting and correlation detection.
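One simple way to expose a seasonal pattern is to average the trip counts of the same calendar month across years. The sketch below assumes the monthly totals are already available as a dictionary; the numbers are made up for illustration and this is not necessarily how our figures were produced.

```python
from collections import defaultdict


def seasonal_profile(monthly_counts):
    """Average trip count per calendar month (1-12) across all years.

    monthly_counts: dict mapping 'YYYY-MM' labels to trip counts.
    """
    buckets = defaultdict(list)
    for label, count in monthly_counts.items():
        month = int(label.split("-")[1])
        buckets[month].append(count)
    return {m: sum(v) / len(v) for m, v in sorted(buckets.items())}


# Made-up sample: winter months are low, summer months are high.
sample_counts = {"2020-01": 100, "2020-07": 300, "2021-01": 140, "2021-07": 340}
profile = seasonal_profile(sample_counts)
print(profile)
```

If the averages for the summer months are consistently higher than those for the winter months, that is the seasonality described above.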
Next, we look at the breakdown by company (Figure 6) in daily average trip counts: Uber, Lyft, Citi Bike, and other ride-hailing companies, including Juno and Via. At the start of 2019, Uber, Lyft, and the other ride-hailing companies were competing with each other, but by the middle of 2022, the market was completely dominated by Uber and Lyft, and all the other companies were out of the game. An exciting trend is that Citi Bike usage is gradually increasing and catching up with Lyft. In general, the market for shared transportation has been growing over the years.
Then, let us come down to the day level and break down trip usage by hour of the day. Here are the graphs of the average hourly trips in different time ranges of the day (Figure 7) and the corresponding hour-over-hour percentage change (Figure 8). From the two graphs, we observe that the peak hours for ride-hailing are 12 pm to 3 pm, and the peak hours for Citi Bike are 3 pm to 6 pm. In fact, the correlation between ride-hailing and Citi Bike usage is around 0.69, indicating that the two kinds of shared transportation follow similar patterns: people go out during the daytime and make fewer trips around midnight. But the correlation is not close to 1, and if you examine the graph, you will see that people also use ride-hailing services around 0 to 3 AM, while few choose to bike at that time. This is where Citi Bike usage and ride-hailing usage diverge.
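The correlation mentioned above can be illustrated with a Pearson correlation between two hourly usage profiles. The sketch below is written in plain Python under the assumption that a Pearson coefficient is what was computed; the hourly numbers are made up and are not our data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up hourly averages for a few time slots (not real data).
ride_hailing = [50, 20, 60, 90, 80, 70]
citi_bike = [5, 2, 40, 70, 90, 30]
print(round(pearson(ride_hailing, citi_bike), 2))
```

A value near 1 means the two profiles rise and fall together, while a value well below 1 reflects differences such as the late-night ride-hailing trips.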
Last but not least, we look at trip duration (Figure 9). The graphs show the average trip duration for different months. The average duration of a ride-hailing or bike trip is between 12 and 26 minutes. You can see that people bike for longer in the summer and for less time in the winter. For ride-hailing services, on the other hand, the trip duration stays steadier, at around 18 minutes.
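As an illustration of how a monthly average duration can be derived from raw trip records, here is a small plain-Python sketch. It assumes each trip is a pair of start and end timestamps and drops trips with a non-positive duration; both the record layout and the sample trips are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime


def avg_duration_by_month(trips):
    """Average trip duration in minutes, keyed by 'YYYY-MM' of the start time.

    trips: iterable of (start, end) datetime pairs.
    Trips with a non-positive duration are treated as invalid and skipped.
    """
    totals = defaultdict(lambda: [0.0, 0])  # month -> [sum of minutes, count]
    for start, end in trips:
        minutes = (end - start).total_seconds() / 60
        if minutes <= 0:
            continue
        key = start.strftime("%Y-%m")
        totals[key][0] += minutes
        totals[key][1] += 1
    return {k: s / n for k, (s, n) in sorted(totals.items())}


# Made-up sample trips (not real data).
sample_trips = [
    (datetime(2021, 7, 1, 8, 0), datetime(2021, 7, 1, 8, 20)),
    (datetime(2021, 7, 2, 9, 0), datetime(2021, 7, 2, 9, 30)),
    (datetime(2021, 12, 1, 8, 0), datetime(2021, 12, 1, 8, 12)),
    (datetime(2021, 12, 2, 8, 0), datetime(2021, 12, 2, 8, 0)),  # invalid, skipped
]
durations = avg_duration_by_month(sample_trips)
print(durations)
```

On the full datasets the same grouping would be done with a distributed tool rather than a Python loop, but the logic of the aggregation is the same.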
Overall, the analysis reveals quite a lot of interesting patterns, which could inspire us to investigate further and perform predictive analysis on these data. Such analytics, turning information into insights, help interested readers and NYC residents get a sense of the top-level statistics on shared transportation in the city. Processing this data is not easy, especially when you have to manipulate around 1 billion rows. By leveraging the power of distributed computing with HDFS and Apache Spark, provided through the New York University High-Performance Computing Center, we were able to analyze this amount of data within a reasonable time. Lastly, special thanks to Professor Malavet for teaching me and my teammate, leading us into the world of big data processing, and helping us with a series of difficult data processing challenges.