This exercise presents an example of applying the R language to data analysis, as part of the Google Data Analytics Program available on Coursera. Most of the analysis was done with Tidyverse, a collection of packages commonly used in the field. The dataset is publicly available here on Kaggle, but it comes from Fitabase rather than Bellabeat, the company mentioned in the exercise guideline.
This exercise is divided into different sections:
Scenario. Here, I’ll give details about the business task and a brief description of the available data.
Limitations and notes before starting. In this section, I’ll mention the limitations and possible biases that you’ll need to keep in mind while reading the rest of the article.
Data Cleaning and Processing. Here I’ll describe how the data is transformed from its original format into the structure needed for the visual representations I’ll use later.
Findings. In this section, I’ll present visualizations made with ggplot2 that support our main findings and conclusions. For this exercise I’ll present 2 types of findings: a) findings about the behavior of wearing the device and b) findings about what the devices report.
Conclusions and Discussion. While we can reach some conclusions with the available data, this is a complicated topic where we can’t overlook factors beyond these numbers. This section is dedicated to reflecting on what we can and can’t know by using this data alone.
References. It is always important to give credit to the developers, researchers and teams who make this possible; this section is dedicated to them.
I would like to say thanks to all the people who shared their personal data to make this open dataset possible.
NOTE: Please, feel free to unhide the first cell above to see the entire code.
This time, we’ll help Bellabeat (a high-tech manufacturer of health-focused products for women) analyse data coming from fitness devices. Their senior management team believes that finding trends and patterns will help guide the company’s marketing strategies.
Originally, the business task is defined as:
The first question triggers two different approaches: 1. studying the behaviour of wearing fitness devices, which in practice means comparing the number of valid records in our tables, and 2. studying trends in the physical activity of our participants that could be generalized to wider populations and allow us to make predictions about their health and habits. The second and third questions will guide how we apply our conclusions to our customers. Our final business task will be defined as:
Analyse Fitabase dataset records from April 12th to May 12th, 2016 to identify trends in the usage of their products and relationships between variables that could be generalized to Bellabeat customers and thus be useful for marketing purposes.
Before we begin, let’s identify the main limitations affecting our study. While analysing data gives us a sense of knowledge, it is important to keep in mind what we can and can’t know with certainty just by studying the data. This way, we will be able to separate what we know from what we assume.
We can’t be 100% sure all participants are women. Our exercise guideline says “… asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices”. So, while Bellabeat focuses on devices for women, the Fitabase mission statement says: “Our mission at Fitabase is to enable researchers to use the latest tools, devices, and apps to further our collective knowledge. We’re dedicated to making it as easy as possible for researchers to deploy the “new tools of wellness” to measure, track, and engage their participants”. There is no mention of their data being exclusively from women, so we can’t jump to that conclusion, and none of the tables we are going to use has a field for sex/gender either. So, while our conclusions are aimed at a female population, we can’t be sure our 33 participants are women.
We have to assume the data collected is representative of the “normal habits” of our participants. In other words, we assume they are not behaving differently because they know their data is being recorded. This happens a lot in social research: we all want to give the best impression if we know we are being observed, right?
The sample is quite small, which makes it vulnerable to sampling bias. This is a big problem, as we aim to generalize our conclusions to a wider population (Bellabeat customers) but our sample consists of only 33 participants. This means outliers or atypical values are more likely to lead us to wrong conclusions.
There are different numbers of observations per participant across different areas. This is related to the previous point: we not only have a small sample, but the participation rate also declined during the 31-day period in which the data was gathered. We will explore this trend in detail in the “Findings” section below.
We have no information about preexisting medical conditions affecting the participants, or any other background. In the findings section, you’ll see a correlation between physical activity and body mass index (BMI). However, overall physical status depends on many different factors, and not considering them may oversimplify a complex matter.
FINAL NOTE: In the findings section below, you’ll see terms such as overweight, obese, etc. Personally, I don’t like using these terms … but I have to, as I need to keep my language consistent with medical terminology and standard categories.
The Fitabase dataset consists of 18 files; however, for this exercise I’ve decided to work with only 5 of them. They are:
First, let’s mention the data that I’ve decided to remove:
What was the criterion for removing those records? They were removed because in all of those cases we can assume that the device (not the owner) remained static for the day. In other words, the device was active but not being used by the participant.
Let’s talk about date format standardization. Across the tables, date / datetime fields are formatted differently. This becomes an issue when trying to combine the information, so they have to be converted to a single format. The tables we had to manipulate were:
There is also a case where the analysis was focused on time instead of date. The table was:
The approach allowed me to extract the hour from a datetime field.
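As a minimal sketch of the two steps described above (standardizing dates and extracting the hour), the snippet below uses base R so that it runs without extra packages. The sample strings, the `%m/%d/%Y %I:%M:%S %p` layout, and the helper names `to_date` and `to_hour` are assumptions made for illustration, not taken from the original notebook; parsing AM/PM with `%p` also assumes an English or C locale.

```r
# Hypothetical sample values; the real files may use other layouts.
raw_dates     <- c("4/12/2016", "5/1/2016")
raw_datetimes <- c("4/12/2016 2:00:00 AM", "4/12/2016 11:00:00 PM")

# Parse a month/day/year string into a Date.
# Trailing characters (such as a time of day) are ignored by as.Date().
to_date <- function(x) as.Date(x, format = "%m/%d/%Y")

# Parse a 12-hour timestamp and return the hour of the day (0-23).
to_hour <- function(x) {
  parsed <- as.POSIXct(x, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
  as.integer(format(parsed, "%H"))
}

standard_dates       <- to_date(raw_dates)      # date-only fields
dates_from_datetimes <- to_date(raw_datetimes)  # datetime fields, date part only
hours                <- to_hour(raw_datetimes)  # hour used for hourly analysis
```

Once every table carries a `Date` column of the same class, the tables can be joined on participant and date without further string handling.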
Finally, let’s talk about how the data was summarized for the analysis. As mentioned before, participants differ in how consistently they reported: for example, some reported sleep hours every day while others reported just a couple of times. To handle these differences, I’ve summarized the data by averages. For example, say a participant reported being active for 10, 15, and 20 minutes from April 28th to April 30th. This participant gets a daily average of 15 minutes, and that is the value I’ll use to compare them with the rest of the participants. I’ve used the same approach in every table, so please keep this in mind while reading the article.
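The averaging described above can be sketched as follows. The data frame and its column names (`ActivityDate`, `ActiveMinutes`) are hypothetical stand-ins; the first participant reproduces the 10, 15, 20 example, and base R's `aggregate()` is used here rather than whatever the original notebook used.

```r
# Hypothetical records: participant 1 reports three days, participant 2 only one.
activity <- data.frame(
  Id            = c(1, 1, 1, 2),
  ActivityDate  = as.Date(c("2016-04-28", "2016-04-29", "2016-04-30", "2016-04-28")),
  ActiveMinutes = c(10, 15, 20, 30)
)

# One row per participant: the mean over the days that were actually reported.
daily_average <- aggregate(ActiveMinutes ~ Id, data = activity, FUN = mean)
names(daily_average)[2] <- "average_ActiveMinutes"

# Number of reported days per participant, kept alongside the average.
record_count <- aggregate(ActiveMinutes ~ Id, data = activity, FUN = length)
daily_average$record_count <- record_count$ActiveMinutes
```

Keeping `record_count` next to each average makes it easy to see later which averages rest on very few days.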
A common problem in research that lacks strict experimental control is some degree of data loss due to human error or abandonment. Outside research, the time we dedicate to an activity can also be affected by loss of motivation or interfering events. This study was no different (it is totally expected): the general participation rate decreased over the month in which the data was gathered. Here are a couple of examples.
The following chart represents the decline of valid records, that is, records where the number of sedentary minutes is different from 1440. To give more context, this table summarises the minutes spent at different levels of intensity, from sedentary to very active. When a record shows 1440 minutes of sedentary activity, it means that the device didn’t track any activity, and unless the user stayed completely motionless during the whole day, it’s safe to say we are looking at an invalid record that has to be removed. The remaining records, which hold valid information, were counted by day. As we have 33 participants, we expect to see that number of records per day, but as you are about to see, the count drops significantly.
tracker_1
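For readers who want to see the counting rule behind this chart, here is a minimal sketch under stated assumptions: the data frame is made up, and the column names `ActivityDate` and `SedentaryMinutes` are assumed to match the daily activity table. It drops records with 1440 sedentary minutes and counts what remains per day.

```r
# Hypothetical daily activity rows for three participants over two days.
daily_activity <- data.frame(
  Id               = c(1, 2, 3, 1, 2, 3),
  ActivityDate     = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12",
                               "2016-04-13", "2016-04-13", "2016-04-13")),
  SedentaryMinutes = c(728, 1440, 1100, 1440, 1440, 950)
)

# A full day (1440 minutes) of sedentary time is treated as "device not worn".
valid_records <- subset(daily_activity, SedentaryMinutes != 1440)

# Count the remaining (valid) records per day.
records_per_day <- aggregate(Id ~ ActivityDate, data = valid_records, FUN = length)
names(records_per_day)[2] <- "valid_records"
```

With the real table, the `valid_records` column would start near 33 and fall over the month.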
If we change perspective and look at the individual participation rate, we can see that most participants remained quite consistent. The table below summarizes their participation per day of the week, as a total, and as a percentage (please remember, this is for the daily activity table only).
rename(participant_records, "Total" = total_obs,
"Participation Percent" = part_percent)
## # A tibble: 33 × 10
## Id tue wed thu fri sat sun mon Total `Participation P…`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1624580081 5 5 5 4 4 4 4 31 100
## 2 2022484408 5 5 5 4 4 4 4 31 100
## 3 2026352035 5 5 5 4 4 4 4 31 100
## 4 2320127002 5 5 5 4 4 4 4 31 100
## 5 2873212765 5 5 5 4 4 4 4 31 100
## 6 4445114986 5 5 5 4 4 4 4 31 100
## 7 4558609924 5 5 5 4 4 4 4 31 100
## 8 5553957443 5 5 5 4 4 4 4 31 100
## 9 6962181067 5 5 5 4 4 4 4 31 100
## 10 8053475328 5 5 5 4 4 4 4 31 100
## # … with 23 more rows
Let’s summarize it one step further by grouping by participation rate. The table below presents the count of participants at each participation rate; 31 records represent 100% of the expected records.
rename(category_participant_records, "Participation Percent" = part_percent)
## # A tibble: 10 × 2
## # Groups: Participation Percent [10]
## `Participation Percent` count
## <dbl> <int>
## 1 100 12
## 2 96.8 7
## 3 90.3 1
## 4 80.6 3
## 5 74.2 1
## 6 71.0 1
## 7 64.5 2
## 8 58.1 3
## 9 54.8 2
## 10 9.68 1
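The participation percentage and the grouped count can be sketched as below. The input values are hypothetical, the names `total_obs` and `part_percent` are borrowed from the calls shown in the article, and rounding to two decimals is my own choice for the illustration.

```r
# Hypothetical totals of valid daily records for five participants.
participant_records <- data.frame(
  Id        = c(101, 102, 103, 104, 105),
  total_obs = c(31, 31, 30, 25, 3)
)

expected_days <- 31  # number of days in the observation window

# Share of expected days with a valid record, as a percentage.
participant_records$part_percent <-
  round(participant_records$total_obs / expected_days * 100, 2)

# Number of participants at each participation level, highest first.
counts <- aggregate(Id ~ part_percent, data = participant_records, FUN = length)
names(counts)[2] <- "count"
counts <- counts[order(-counts$part_percent), ]
```

A participant with 30 of 31 days lands at roughly 96.8%, which is the second row of the table above.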
This behaviour goes beyond this tracker (table) and remains consistent across different tables. Let’s see another two examples.
Here, the participation rate shows a similar pattern of decline; in addition, the participation rate was, in general, much lower.
tracker_2
Here, the pattern is almost identical to what we see on the first tracker.
tracker_3
To summarise these observations, let’s see the “whole picture” by merging the last 3 charts.
tracker_4
While the pattern is clear, the question remains: what causes it? It’s hard to say. There could be many causes, including lack of motivation, device malfunction, or misunderstanding of correct usage, and it is all speculation until we ask the participants directly. However, from these observations we’ve drawn some recommendations for our stakeholders:
Outside the context of research:
Eight of our participants shared data about their BMI with us. This is particularly useful because it opens the door to comparing levels of physical activity with overall physical state. According to the Centers for Disease Control and Prevention (CDC), adult BMI is categorized as:
Taking the BMI data from these eight participants, we can generate categories for what we can call “Overall State” (normal, overweight, etc.). We can also look for correlations between their average BMI and different behaviours (minutes spent at different levels of activity, daily steps, sleep, etc.). The first table looks like this.
rename(basic_profile, "Average BMI" = average_BMI,
"Overall State" = overall_state)
## # A tibble: 8 × 3
## Id `Average BMI` `Overall State`
## <dbl> <dbl> <chr>
## 1 2873212765 21.6 normal
## 2 1503960366 22.6 normal
## 3 6962181067 24.0 normal
## 4 8877689391 25.5 overweight
## 5 4558609924 27.2 overweight
## 6 4319703577 27.4 overweight
## 7 5577150313 28 overweight
## 8 1927972279 47.5 obese
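A labelling step like the one behind the “Overall State” column can be sketched with `cut()`. This is an illustration, not the notebook's own code: the helper name `bmi_category` is hypothetical, the cut-offs are the standard CDC adult ranges (below 18.5, 18.5 to under 25, 25 to under 30, 30 and above), and the labels follow the wording used in this article rather than the CDC's exact category names.

```r
# Classify adult BMI values using the standard CDC cut-offs.
bmi_category <- function(bmi) {
  as.character(cut(
    bmi,
    breaks = c(-Inf, 18.5, 25, 30, Inf),
    labels = c("underweight", "normal", "overweight", "obese"),
    right  = FALSE   # intervals are closed on the left, e.g. [18.5, 25)
  ))
}

# Average BMI values as displayed in the table above.
average_BMI   <- c(21.6, 22.6, 24.0, 25.5, 27.2, 27.4, 28, 47.5)
overall_state <- bmi_category(average_BMI)
```

Using `right = FALSE` matters at the boundaries: a BMI of exactly 25 is labelled overweight and exactly 30 is labelled obese.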
Now, remember that we have a table that provides the daily minutes spent at each level of activity by participant and date. After the data is summarized to show the average values per participant, the resulting table looks like the one below. Let’s show only the first 6 rows.
head(average_daily_activity) #first 6 rows, column names haven't been edited
## # A tibble: 6 × 11
## Id average_VeryAct… average_FairlyA… average_Lightly… average_Sedenta…
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 40 19.8 227. 828.
## 2 1624580081 8.68 5.81 153. 1258.
## 3 1644430081 9.57 21.4 178. 1162.
## 4 1844505072 0.18 1.82 163. 1111.
## 5 1927972279 2.28 1.33 66.4 1229.
## 6 2022484408 36.3 19.4 257. 1113.
## # … with 6 more variables: record_count <int>, total_mins <dbl>,
## # Percent_VeryActiveMinutes <dbl>, Percent_FairlyActiveMinutes <dbl>,
## # Percent_LightlyActiveMinutes <dbl>, Percent_SedentaryMinutes <dbl>
We can see 4 levels of activity, from very active to sedentary. One of them is particularly interesting: Lightly Active Minutes. There is a negative correlation between a participant’s BMI and their average daily lightly active minutes. Let’s see it in a table joining data from the 8 participants who reported BMI with their average lightly active minutes:
basic_profile_main %>% select(c("Id", "average_BMI", "overall_state", "average_LightlyActiveMinutes")) %>%
rename("Average Lightly Active Minutes" = average_LightlyActiveMinutes,
"Overall State" = overall_state,
"Average BMI" = average_BMI)
## Id Average BMI Overall State Average Lightly Active Minutes
## 1 2873212765 21.57000 normal 308.00
## 2 1503960366 22.65000 normal 227.27
## 3 6962181067 24.02800 normal 245.81
## 4 8877689391 25.48708 overweight 234.71
## 5 4558609924 27.21400 overweight 284.97
## 6 4319703577 27.41500 overweight 236.40
## 7 5577150313 28.00000 overweight 158.50
## 8 1927972279 47.54000 obese 66.44
Let’s create a scatter plot, add a regression line and, finally, calculate a correlation coefficient:
scatter_1
## `geom_smooth()` using formula 'y ~ x'
With a calculated Pearson’s value of -0.87, there is evidence of a strong negative correlation, although with only 8 participants the estimate should be read with caution.
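The coefficient can be reproduced from the eight rows printed in the table above. The sketch below copies those values by hand and uses base R's `cor()`; the linear model is included on the assumption that the regression line in the chart is an ordinary least-squares fit with BMI on the y-axis.

```r
# Values copied from the eight-row table above.
average_BMI <- c(21.57, 22.65, 24.028, 25.48708, 27.214, 27.415, 28.00, 47.54)
average_LightlyActiveMinutes <- c(308.00, 227.27, 245.81, 234.71,
                                  284.97, 236.40, 158.50, 66.44)

# Pearson correlation coefficient between the two columns.
r <- cor(average_BMI, average_LightlyActiveMinutes, method = "pearson")

# Least-squares line with BMI as the response (assumed to match the chart).
fit <- lm(average_BMI ~ average_LightlyActiveMinutes)

round(r, 2)
```

With so few points, a single participant (here the one with a BMI of 47.54) can have a large influence on both the coefficient and the fitted line.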
To complement our findings, let’s compare the Pearson’s value of average BMI vs each of the activity levels.
round(cor(basic_profile_main[,2], basic_profile_main[,4:7], method = "pearson"), digits = 2)
## average_VeryActiveMinutes average_FairlyActiveMinutes
## [1,] -0.28 -0.43
## average_LightlyActiveMinutes average_SedentaryMinutes
## [1,] -0.87 0.45
With the exception of Lightly Active Minutes, none of the activity levels shows a strong correlation with BMI.
The next question is: if this level of activity is so strongly related to people’s BMI, then, on average, what portion of our day do we spend at this level of activity compared to other intensities? To answer this question, let’s go back to our 8 participants with known BMI and see how they compare to each other.
bar_1
It is quite interesting to see that the activities that take less than 25% of our day are the ones most strongly associated with keeping us in shape. On the other hand, it is no surprise at all that we spend most of our day on sedentary activities.
Now, we can definitely take advantage of our fitness device to control the number of minutes we dedicate to each of these intensity levels, but what happens if this is not the case? For example, what if I’m using my device but not changing my routine? In that case we can expect to see a lot of variation throughout the day: in a single hour we can walk on the street, run to catch a bus, watch TV, etc. So, we can calculate an average activity level per hour that is representative of all those activities. hourlyIntensities_merged.csv allows us to get this information. Let’s see what the average activity per hour is across our whole sample.
line_1
The pattern is totally expected. From 21:00 to 4:00 (next day) we see activity levels below the general average, as most people rest and sleep during this time, while from 5:00 to 20:00 we wake up, go to work, and do all of our activities.
The chart above shows some truth about our sample, but it can also be misleading, as the mean alone won’t tell the whole story of our data. To get a better understanding of our sample, let’s generate a boxplot.
box_1
Let’s repeat this approach, this time using only data from our 8 participants with known BMI, to see how they compare.
line_2
The results are quite similar in general, but this time we can see that the participant with the highest BMI has lower hourly levels of activity, most of them below the general average.
Again, let’s generate a boxplot. We know the average activity per group and hour, but the question remains: how did these hourly values vary from day to day? Remember, the study covered 31 days, so was there little or no variation? Or, on the contrary, does the hourly average represent a wide range of values?
box_2
There is an alternative way to represent these results. It is less accurate, but it can also provide a general understanding of the differences we are studying among the groups.
line_3
The chart above has a limitation: to keep it simple, I calculated the standard deviation (sd) per hour and group and then, using an area geom, highlighted only the -1/+1 sd area to give you a general sense of the dispersion.
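The summary behind a band like this can be sketched as follows. The records are invented, the column names (`overall_state`, `hour`, `TotalIntensity`) are assumptions, and base R is used for the grouping; the resulting `lower` and `upper` columns are what an area or ribbon geom would draw around the mean line.

```r
# Hypothetical hourly intensity records: two groups, two hours, three days each.
hourly <- data.frame(
  overall_state  = rep(c("normal", "obese"), each = 6),
  hour           = rep(rep(c(8, 9), each = 3), times = 2),
  TotalIntensity = c(20, 30, 40, 10, 10, 10, 5, 10, 15, 2, 4, 6)
)

# Mean and standard deviation per group and hour.
means <- aggregate(TotalIntensity ~ overall_state + hour, data = hourly, FUN = mean)
sds   <- aggregate(TotalIntensity ~ overall_state + hour, data = hourly, FUN = sd)

# Combine both statistics into one table: TotalIntensity_mean, TotalIntensity_sd.
band <- merge(means, sds, by = c("overall_state", "hour"),
              suffixes = c("_mean", "_sd"))

# Lower and upper edges of the -1/+1 sd band.
band$lower <- band$TotalIntensity_mean - band$TotalIntensity_sd
band$upper <- band$TotalIntensity_mean + band$TotalIntensity_sd
```

A band of one standard deviation hides skew and outliers, which is why it is less informative than the boxplots.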
The last question is: what exactly is the lightly active intensity? Is there a specific set of activities we can do to trigger this physical state? According to Nikki Prosch, in her article Light, Moderate, And Vigorous Activity, light activity is:
“Light intensity activities require the least amount of effort, compared to moderate and vigorous activities. The definition for light intensity activity is an activity that is classified as < 3 METS. One MET, or metabolic equivalent, is the amount of oxygen consumed while sitting at rest. Thus, an activity classified as 2 METS would be equal to 2 times the amount of oxygen consumed while sitting at rest (1 MET). METS are a convenient and standard method for describing absolute intensity of physical activities. Some examples of light physical activities include: walking slowly (i.e. shopping, walking around the office), sitting at your computer, making the bed, eating, preparing food, and washing dishes.”
If walking is an example of light intensity activity, then it makes sense to think that the number of steps registered by our devices should be strongly correlated with the number of lightly active minutes. Then, given the correlation we’ve found between lightly active minutes and BMI, we can say we have found the way to stay in good shape, right? Let’s see.
scatter_2
## `geom_smooth()` using formula 'y ~ x'
Unfortunately, the answer to the question above is: not necessarily. With a value of 0.39 we can conclude the correlation is rather weak. As expected, we cannot oversimplify complicated matters.
What can we conclude from all these observations? There are a few things:
There is no question that the more active we are throughout the day, the more likely we are to stay in good physical shape.
However, it seems that “lightly active” is the level of intensity most closely related to BMI, rather than the more intense levels.
While walking is considered an example of a light intensity activity, it is not necessarily the only or main way to trigger this physical state. We need to keep in mind that each user may behave differently, and what works for some may not be the right fit for others.
Given that our devices are capable of measuring METs, the best approach is to learn from the users themselves and provide motivation by using their own MET indicators.
Generalizing our findings is difficult, mostly for 2 reasons: 1. the sample is very small and 2. we don’t really know anything about our participants’ background. For example, we don’t know whether they are from the same country, their age, their medical conditions, etc. This makes it difficult to compare our results with other research. To overcome this obstacle we’ll have to make one assumption: that our 33 participants are a representative sample of a population, which we will define as Fitabase users.
So, if our sample is representative of the population, what should our results look like if we were looking at the whole population? For this exercise, we’ll use 3 metrics:
As a starting point, our hypothesis will be that these behaviours are normally distributed.
Let’s start by using a normality test to see if data supports our hypothesis, for this exercise we’ll use the Shapiro–Wilk test.
norm_1
##
## Shapiro-Wilk normality test
##
## data: average_daily_activity$average_LightlyActiveMinutes
## W = 0.9785, p-value = 0.74
With p-value = 0.74 we cannot reject the hypothesis that the data is normally distributed, so the data is consistent with our hypothesis. Let’s compare our real observations (grey histogram) with what we can expect from the whole population (blue bell).
hist_2
With a calculated total average of 204.59 daily lightly active minutes and a standard deviation of 70.07, we can estimate that about 68% of our population ranges between 134.52 and 274.66 daily lightly active minutes. Now, if the correlation with BMI is accurate, we can estimate that their BMI ranges from approximately 23 to 36. Considering that BMI >= 30 is already considered obesity, we have a serious health problem inside our population. For more context, go back to the chart titled “Correlation Between Average BMI vs. Average Lightly Active Minutes”, find the mean on the x-axis (around 200) and see where it intersects the blue regression line on the y-axis: it’s around 30.
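The normality check and the one-standard-deviation range discussed above can be sketched as below. The sample is simulated with `rnorm()` purely as a stand-in for the 33 participant averages, so its test result will not match the output shown earlier; only the mean and standard deviation are taken from the text.

```r
# Simulated stand-in for the 33 per-participant averages (NOT the real data).
set.seed(42)
simulated_minutes <- rnorm(33, mean = 204.59, sd = 70.07)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal.
# A large p-value means normality cannot be rejected; it does not prove it.
normality <- shapiro.test(simulated_minutes)

# Share of a normal population within one standard deviation of the mean.
coverage <- pnorm(1) - pnorm(-1)   # roughly 0.68

# The one-sd range quoted in the text, from the reported mean and sd.
mean_minutes <- 204.59
sd_minutes   <- 70.07
one_sd_range <- c(lower = mean_minutes - sd_minutes,
                  upper = mean_minutes + sd_minutes)
```

The same two-sided calculation with `pnorm(2) - pnorm(-2)` gives the share within two standard deviations, if a wider range is of interest.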
Where do our known-BMI participants stand against the whole population? Let’s see.
hist_1
It’s quite worrying to think that, in strict theory, people inside the first 3 bins are likely to have obesity problems.
Again, let’s use the Shapiro–Wilk test to see if our data is normally distributed.
norm_3
##
## Shapiro-Wilk normality test
##
## data: average_dailysteps$average_StepTotal
## W = 0.98139, p-value = 0.8272
With a p-value = 0.8272, again, we cannot reject the hypothesis that our data is normally distributed. Let’s compare the real observations with the expected values.
hist_3
Finally, let’s take a look at this metric.
norm_2
##
## Shapiro-Wilk normality test
##
## data: average_TotalMinutesAsleep$average_TotalMinutesAsleep
## W = 0.88339, p-value = 0.009728
Here the results are different. With p-value = 0.009728 (p < 0.01) we reject the hypothesis that the data is normally distributed. This is very strange; honestly, I think it happened due to the small sample size. Let’s generate a chart like the ones above, but this time we’ll add a second bell (in green) with the adjusted distribution.
hist_4
If we take a closer look at our outliers, we can confirm that the participants who reported them were not consistent: they have a very low number of records during the month, so their reported values are unlikely to represent their daily habits and more likely reflect misuse of the device or, at least, an atypical record from the user.
head(arrange(average_TotalMinutesAsleep, record_count))
## # A tibble: 6 × 3
## Id average_TotalMinutesAsleep record_count
## <dbl> <dbl> <int>
## 1 2320127002 61 1
## 2 7007744171 68.5 2
## 3 1844505072 652 3
## 4 6775888955 350. 3
## 5 8053475328 297 3
## 6 1644430081 294 4
Now, how can our findings apply to Bellabeat customers? And how can the company benefit from them? Let’s see:
We can expect a significant portion of users to stop using the devices within a month of starting. From the perspective of behavioral psychology, this phenomenon happens due to a lack of proper positive feedback or, in other words, positive reinforcement of the behaviour related to the monitored activity. Without proper compensation, the behaviour is destined to disappear. I don’t have the details about how this data was gathered (for example, whether our participants were new customers, whether they received some compensation or simply allowed their data to be gathered, etc.), so this is mostly speculation. Still, our findings remind me a lot of what happens at gyms. There is a common scenario: lots of people sign up for a 6-month plan, motivated to get in better shape, but when they don’t see immediate results they abandon their goals and stop going. I feel something like that could have happened here but, again, it’s speculation. What is clear is that, most likely, this will happen to some degree with Bellabeat’s customers.
We can expect most behaviours to be normally distributed. It is not surprising at all that behaviours like lightly active minutes and daily steps are normally distributed; many social, biological and psychological phenomena show this pattern, and it’s the reason the normal distribution is an important topic in those sciences. What is really important are the parameters we’ve found, in particular the average daily lightly active minutes: if the correlation is accurate, we can expect to see half of our customers having weight problems. Let’s speculate one more time. It makes sense to think that a lot of people purchase these fitness devices because they are looking for a solution to existing weight / health problems, and thus our average would be higher than their respective national average. I would recommend that Bellabeat survey new customers about what motivated them to get the products; this way we can have a better understanding of their needs and motivations.
Finally, we can expect to see a strong correlation between Lightly Active Minutes and BMI. This finding is quite interesting, as the correlation is weaker for the physical intensities that we would often associate with exercise routines, like very active and fairly active. Even more interesting is the fact that the positive correlation with sedentary minutes is not as strong. If the correlation could be verified on a bigger sample, we could definitely use this to motivate Bellabeat’s customers, explaining that they really don’t need an exhausting gym routine to be in better shape, but rather simple habits they can keep day by day.
In the end, we need to keep in mind that there are important factors beyond physical activity that also contribute to our weight and overall health; alcoholism, nutrition, and drug use are just some examples. So if we really want to know more about how the products can or can’t help customers, we need to investigate in much further detail. If we want to focus on a small sample, we’ll need to gather data over a much longer period (for example, two years) and, of course, complement our data with interviews, surveys, etc.: all the information a participant can share about their routine.
RStudio. RStudio Team (2021). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
Tidyverse. Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686.
Moments. Lukasz Komsta and Frederick Novomestky (2015). moments: Moments, cumulants, skewness, kurtosis and related tests. R package version 0.14. https://CRAN.R-project.org/package=moments.
Centers for Disease Control and Prevention, “About Adult BMI”, https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html.
Prosch, N. (2018), “Light, Moderate, And Vigorous Activity”, https://extension.sdstate.edu/light-moderate-and-vigorous-activity.
Cherry, K. (2020), “Positive and Negative Reinforcement in Operant Conditioning. How Reinforcement Is Used in Psychology”, https://www.verywellmind.com/what-is-reinforcement-2795414.