Bicycle Trip Analysis.

Introduction.

This exercise presents an example of application of R language on data analysis as part of the Google Data Analytics Program available in Coursera. Most of the analysis was made with the help of Tydiverse , a collection of packages commonly used in the field. It is important to mention that all the information we are about to use is open data, and it is publicly available here, while the data is real, the company that will be mentioned below is fictional.

This exercise is divided into different sections:

Scenario. Here, I’ll give details about the business task and a brief description of the available data.
Data Cleaning and Aggregation. Here I’ll describe the changes that I’ve made on the original data to make it suitable for the analysis. For everyone interested, the appendix at the document will expand further into this.
Findings. In this section, I’ll present visualization made with ggplot2, that will support our main findings about the data.
Conclusions and next steps. Finally, in this section I’ll present my final recommendations to stakeholders and data analytics team.
References. It is always important to give credit to the developers and teams who make this possible, this sections is dedicated to them.

NOTE: Please, feel free to unhide the first cell above to see the entire code.

Scenario.

Cyclistic, a bike-share company in Chicago, offers to their customers two different ways to use the bike service, by casual rides or by subscribing to a membership. The senior managers of the company believe that the best strategy is to convince more customers to subscribe to a membership and thus, will design a marketing strategy to achieve this objective. But, to make this possible, it is necessary to have a better understanding of how these two groups of customers differ from one another.

Our business task is initially described as Understanding how casual customers differ from customers with a paid membership, however, this definition is vague, as there is little to no definition of exactly what we want to find out in our data. To come out with a better definition of our task, we need to make an initial inspection on the available data we will work with.

Our data consist of 12 csv files (1 file per month), the records are from 2020-12-01 to 2021-11-30, in total, we have 5.47M records. here is a description of the fields:

ride_id. string to identify individual records.
rideable_type. string with 3 possible values: classic_bike, electric_bike, docked_bike.
started_at. datetime data, indicates the start of the trip.
ended_at. datetime data, indicates the end of the trip.
start_station_name. String to identify the name of the given bicycle stations, including blank, there are 846 distinct values.
start_station_id. String to identify the ID of the given bicycle stations, including blank, there are 832 unique values.
end_station_name. String to identify the name of the given bicycle stations, including blank, there are 840 unique values.
end_station_id. String to identify the ID of the given bicycle stations. including blanks, there are 827 unique values.
start_lat. Decimal (double) data to identify the latitude of the starting point. There are 377733 unique values.
start_lng. Decimal (double) data to identify the longitude of the starting point. There are 358177 unique values.
end_lat. Decimal (double) data to identify the latitude of the ending point. There are 443644 unique values.
end_lng. Decimal (double) data to identify the longitude of the ending point. There are 403742 unique values.
member_casual. string to identify type of user, there are 2 possible values: casual and member.

Now, with a better understanding of our data, the business task could be defined as to identify observable differences among casual and members customers regarding:

Total amount of trips.
Amount differences throughout the year.
Preferred rideable type.
Ride duration.
Preferred times of the day to take a trip.
Preferred starting points and routes (stations).
Preferred days to take a trip.

Data Cleaning and Aggregation

Before using data for analysis, it has to be cleaned and transformed. For this exercise, here is the process I’ve used:

Cleaning

After taking a closer look into the data, I decided to remove some of the columns, as their information is not relevant for my study, and they were many inconsistencies. These removed fields are:

ride_id. This is clearly a primary / foreign key, thus, is quite useful when combining information from different sources. However, for the scope of this case study, I’m working with a single source (even if its distributed along many files), and for that reason, it is a string that is not useful for grouping records or identify any pattern.
start_station_name. The biggest problem with this field is that there are many blank records.
start_station_id. There are two major problems here a) the IDs are inconsistent, for example, some IDs are only numbers, some others combine letters with numbers and some others combine decimal places. b) There are too many blank records.
end_station_name. Same as start_station_name.
end_station_id. Same as start_station_id.

After creating a field to calculate trip duration (ride_duration_mins, described in the next section), I’ve observed that there are cases where, the trip ends before starting, for example, a trip lasting -5 minutes. This is, from my perspective, some kind of bug on the data gathering system. while there are unusual records that could have an human origin (like trips extremely long), this records can only be explained as something wrong on the computational systems. To prevent these records from impacting on the analysis, I decided to remove them, in total, 581 records are deleted . As last step I sorted data by start datetime field.

An example of a single record of the table looks like:

ride_id: 70B6A9A437D4C30D.
rideable_type: classic_bike.
started_at: 12/27/2020 12:44:29 PM.
ended_at: 12/27/2020 12:55:06 PM.
start_station_name: Aberdeen St & Jackson Blvd.
start_station_id: 13157.
end_station_name: Desplaines St & Kinzie St.
end_station_id: TA1306000003.
start_lat: 41.87772613.
start_lng: -87.65478743.
end_lat: 41.888716036.
end_lng: -87.6444478533.
member_casual: member.

Aggregation.

For this exercise, I have focused on getting the most out of date and time information, to be able to create categories based on year, month, day and time, I’ve aggregated the following fields calculated from started_at and ended_at:

ride_duration_mins. This is the interval between the start and end date-time presented in minutes.
trip_hour. The starting hour of a trip expressed as an integer (number), range 0 to 23.
trip_day. The day of the starting date time of a trip, presented as “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat” and “Sun”.
trip_month. The month of the starting date time of a trip, presented as “Dec”, “Jan”, “Feb”, “Mar”, “Apr”, “May”, “Jun”, “Jul”, “Aug”, “Sep”, “Oct” and “Nov”.
trip_year. The year of the starting date time of a trip, presented as “2020” and “2021”.
trip_year_month. This field merges trip_year and trip_month using “-” as a separator, the final output looks like “2020-Dec”, “2021-Jan” and so on.

An example of a single record of the table now looks like:

rideable_type: classic_bike.
started_at: 12/27/2020 12:44:29 PM.
ended_at: 12/27/2020 12:55:06 PM.
start_lat: 41.87772613.
start_lng: -87.65478743.
end_lat: 41.87772613.
end_lng: -87.65478743.
member_casual: member.
ride_duration_mins: 10
trip_hour: 12
trip_day: Sun
trip_month: Dec
trip_year: 2020
trip_year_month: 2020-Dec

Findings.

Total amount of trips.

Let’s begin by analyzing the total amount of records from both groups, it is clear that there are more registered trips coming from customers with a membership (the exact difference is 500,105 rides), however, there is a problem, it is out of the scope of this exercise to deal with any kind of Personally identifiable information (PII for short), and thus, it is impossible to know if: a) there is a similar or smaller amount of member customers taking a ride more often or b) every ride correspond to a different person and thus there are more members than casual users.

Regardless of the limitation, it is clear that the members group was more active overall throughout the year , the question is, is this result consistent? what if regardless of the overall count, there are seasons where casual users are more active? to find out, read the next section.

Amount differences throughout the year.

For 9 out of 12 months in the year, the members group was the more active and thus, its safe to say that the previous observation is consistent, however, it is quite interesting to see that our three exceptions to the rule, are also the months with the most trips throughout the year, June, July and August.

It is interesting to see how both groups clearly prefer the summer season (time between June to September) over the rest of the year. At first glance, there is a clear explanation for this, the weather. However it would be a great idea to compare some other years to confirm the pattern.

Preferred rideable type.

As a quick refresher, Cyclistic offers three different choices for their customers 1. Classic 2. Electric and 3. Docked bikes.

First, let’s focus on the business task, we can conclude that there are no differences among casuals and members group, in both cases classic bike is preferred, then electric bike takes the second place and the docked bike is more unpopular taking the third spot.

It is interesting to see how unpopular docked bikes are against the classic and electric bikes. As an external observer, it would be easy to conclude there is something wrong with this rideable, however, we need more context that unfortunately, is out of our scope for this exercise.

Ride duration.

There is the same pattern in both groups. The distribution of frequencies resembles the “bell curve”, of course, limited by zero on the x-axis.

is impressive to see how most part on records on both groups fall inside the range of 1 to 21 minutes (62% of casual rides and 83% of members’).

Preferred times of the day to take a trip.

If both groups behave similarly regarding the average trip duration, how about the hour of the day they prefer to take a ride? Again, both groups behave quite similarly.

The hours between 16hrs to 18hrs are the most active, and, as expected, the range between midnight and 7:00AM are the less popular for customers.

Preferred starting points (stations).

Now, let’s focus on a new question, how about the geographical location preferred by the customers from both groups? if we just plot every starting point, it’s difficult to see any significant difference, we can only confirm that most trips happens on the central part of the city, while the periphery has less activity.

There is no question! only plotting starting points won’t clearly reveal key differences among the groups, however, the red spots above allow to identify some of the preferred starting point locations, let’s investigate this further by filtering the top starting point by group

##   member_casual start_lat start_lng   percent
## 1        member  41.79000 -87.60000 0.8727767
## 2        member  41.90297 -87.63128 0.6387762
## 3        member  41.91213 -87.63466 0.5781267
## 4        member  41.88918 -87.63851 0.5576872
## 5        member  41.90322 -87.63432 0.5177449
## 6        member  41.91172 -87.62680 0.5124928

##   member_casual start_lat start_lng   percent
## 1        casual  41.89228 -87.61204 2.2152856
## 2        casual  41.88096 -87.61674 1.1867243
## 3        casual  41.91172 -87.62680 0.9733630
## 4        casual  41.92628 -87.63083 0.7181409
## 5        casual  41.88103 -87.62408 0.6970097
## 6        casual  41.90096 -87.62378 0.6909033

This is quite interesting! the preferred coordenates for the casual members, 41.892280, -87.612040, correspond to Addams Memorial Park, close to Lake Michigan and other parks. While, for members, the preferred coordenates, 41.79000, -87.60000, correpond with Hyde Park, that, is close to The University of Chicago, bookstores, laboratories, etc. While this is pure speculation, I think we are starting to see a pattern here, this is, casual users are more interested on trips for pleasure, while members use the service to get to their job location, school, stores, etc. Let’s try to take it one step further and try to see the main routes, by identifying the preferred combinations of start and end coordenates.

Note: The following map could be a little confusing, the arrow tail represents the starting point, and its the head, the destination, you’ll notice arrows having only a tail, this represents that the starting and ending points are the same. To avoid an overcomplicated map, I’ve filtered only records with at least 0.0001% occupancy rates. Also, the following map was designed following the amazing article by Jacqueline Araya, “Alternatives to ggmap for beautiful maps”

The map above is quite interesting and complements the previous one (about preferred start locations). Here we can confirm the main starting points we discovered before, but now, we can also see the whole rute. Casual users are showing a greater dispertion along the city, but mostly focused on the north section, on the other hand, members remain focused on the south part of the city, nearby The University of Chicago. Is it possible that the most part of our members group includes students from this university? I think it is possible, just remember, we can’t oversimplify this lets keep in mind that the first map showed how, as a whole, there is activity all around the city.

Preferred days to take a trip.

The groups behave differently regarding their preferred days of the weeks to take a ride.

Casual users clearly prefer weekends, thus, understanding the activities, places and motivations behind this difference is crucial to have a broader understanding of the needs among these two groups, and, as consequence, how to convince casual users to become members. At first glance, it seems that members use the service for daily activities, for example, getting to the office, their school, etc., while casual users prefer it for rides they mostly do on weekends, for example, to workout, for pleasure, etc. Unfortunately the data alone won’t give us these details, we need to ask them directly

As before, we want to confirm if the pattern is consistent.

By looking at the chart above, we can confirm the patter is consistent, and thus, we can conclude, this is the most important difference among the groups.

Conclusions and next steps.

With the available data, it is impossible to know if there are more member than casual users, as we can’t pair ride id with client id (that is missing). What we do know, is that, with approximately 0.5M more trips during the year (Dec, 2020 to Nov, 2021), the member group came on top on total amount of trips. It would be great to confirm the exact amount of members, and how often they take a ride, this way we can confirm exactly how much the customer’s habits impact on the observed results (unfortunately, this is out of the scope of this study).
The observation above is consistent throughout the year
It is clear, for both of our groups, that the rideable “docked bike” is, by far, the less popular out of our three available options. With the available data, it’s impossible to tell why, however, it would be great idea to explore in further detail why is this happening.
It is clear, that short trips (below 21 minutes) are the most common among both groups.
Most of trips happens between 16hrs to 18hrs on both groups, this is quite interesting, it would be great if we could go one step further and make an analysis by groups of age. For example, business hours, according to Wikipedia, are defined as “the hours between 9 am and 5 pm” so It would make sense to think some of those trips are people going back home after work, right? but how about students? how has the COVID-19 pandemic altered this pattern? if they are not going to the office or back home, what is their destination then and how it is related to these hours? maybe, if we compare these numbers with traffic reports we will discover bicycle rides is the fastest way to travel around the city during this time.
As expected, there are fewer trips starting on the periphery of the city. This could be happening due many factors, for example, how the population is distributed,their motivation, etc.
If we analyse only the top preferred starting point by group, we can see how members concentrate mostly on the south part of the city, while casual users concentrate on the north with a greater level of dispertion throughout the city
There is a clear difference among the groups, casual members prefer to take their rides on weekends,while members distribute their trips evenly through the week. On the visuals, its clear how, due Saturdays and Sundays being columns 1 and 7 on the charts, members have a clear “U” shape vs members, showing a more flat pattern. This pattern is consistent, as it is visible even separating the data by month.
If we want to have a better understanding of the customers and their needs we need to ask them directly their motivation to become members or remain casual users, surveys would be an effective approach.

References.

RStudio. RStudio Team (2021). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
Tydiverse. Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686.
Sf. Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009.
rnaturalearth.Andy South (2017). rnaturalearth: World Map Data from Natural Earth. R package version 0.1.0. https://CRAN.R-project.org/package=rnaturalearth.
rnaturalearthdata. Andy South (2017). rnaturalearthdata: World Vector Map Data from Natural Earth Used in ‘rnaturalearth’. R package version 0.1.0. https://CRAN.R-project.org/package=rnaturalearthdata.
udunits2. James Hiebert (2022). udunits2: Udunits-2 Bindings for R. R package version 0.13.2. https://CRAN.R-project.org/package=udunits2.
maps. Original S code by Richard A. Becker, Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn. (2021). maps: Draw Geographical Maps. R package version 3.4.0. https://CRAN.R-project.org/package=maps.
ggmap. D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2.The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
ggrepel. Kamil Slowikowski (2021). ggrepel: Automatically Position Non-Overlapping Text Labels with ‘ggplot2’. R package version 0.9.1. https://CRAN.R-project.org/package=ggrepel.
rmarkdown. JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.11. URL https://rmarkdown.rstudio.com.
Moreno, M. & Basille, M. (2018), “Drawing beautiful maps programmatically with R, sf and ggplot2”. URL https://r-spatial.org/r/2018/10/25/ggplot2-sf.html.
Araya, J. (2020), “Alternatives to ggmap for beautiful maps”. URL https://jacquelinearaya.github.io/2020/06/16/burlington_mapping.html

Bicycle Trip Analysis. Chicago City, Illinois. Google Data Analytics Program Case Study.

Cesar Perez