Fitness Device Usage Analysis.

Introduction

This exercise presents an example of applying the R language to data analysis as part of the Google Data Analytics Program available on Coursera. Most of the analysis was done using the Tidyverse, a collection of packages commonly used in the field. The dataset is publicly available here on Kaggle, but it comes from Fitabase rather than Bellabeat, the company mentioned in the exercise guideline.

This exercise is divided into different sections:

  • Scenario. Here, I’ll give details about the business task and a brief description of the available data.

  • Limitations and notes before starting. In this section, I’ll mention the limitations and possible bias that you’ll need to keep in mind while reading the rest of the article.

  • Data Cleaning and Processing. Here I’ll describe how the data is transformed from the original format to the structure needed for the visual representations I’ll use later.

  • Findings. In this section, I’ll present visualizations made using ggplot2 that support our main findings and conclusions. For this exercise I’ll present 2 different types of findings: a) findings about the behavior of wearing the device and b) findings about what is being reported by the devices.

  • Conclusions and Discussion. While we can reach some conclusions with the available data, this is a complicated topic where we can’t overlook factors beyond these numbers. This section reflects on what we can and can’t know by using just this data.

  • References. It is always important to give credit to the developers, researchers and teams who make this possible; this section is dedicated to them.

I would like to say thanks to all the people who shared their personal data to make this open dataset possible.

NOTE: Please, feel free to unhide the first cell above to see the entire code.

Scenario

This time, we’ll help Bellabeat (a high-tech manufacturer of health-focused products for women) analyse data coming from fitness devices. Their senior management team believes that finding trends and patterns will help guide the company’s marketing strategies.

Originally, the business task is defined as:

  • What are some trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat marketing strategy?

The first question triggers two different approaches: 1. Studying the behaviour of wearing fitness devices, which in practice means comparing the number of valid records in our tables, and 2. Studying trends in the physical activity of our participants that could be generalized to wider populations and allow us to make predictions about their health and habits. The second and third questions will guide how we apply our conclusions to our customers. Our final business task will be defined as:

Analyse Fitabase dataset records from April 12th to May 12th, 2016 to identify trends in the usage of their products and relationships between variables that could be generalized to Bellabeat customers and thus be useful for marketing purposes.

Limitations and Notes Before Starting

Before we begin, let’s identify the main limitations affecting our study. While analysing data gives us a sense of knowledge, it is important to keep in mind what we can and can’t know with certainty just by studying the data. This way, we will be able to separate what we know from what we assume.

  • We can’t be 100% sure all participants are women. Our exercise guideline says “… asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices”. While Bellabeat focuses on devices for women, the Fitabase mission statement says: “Our mission at Fitabase is to enable researchers to use the latest tools, devices, and apps to further our collective knowledge. We’re dedicated to making it as easy as possible for researchers to deploy the “new tools of wellness” to measure, track, and engage their participants”. There is no mention of their data being exclusively from women, so we can’t jump to that conclusion, and none of the tables that we are going to use has a field for sex/gender either. So, while our conclusions are aimed at a female population, we can’t be sure our 33 participants are women.

  • We have to assume the data collected is representative of the “normal habits” of our participants. In other words, we assume they are not behaving differently because they know their data is being recorded. This happens a lot in social research: we all want to give the best impression if we know we are being observed, right?

  • The sample is quite small, which makes it prone to sampling bias. This is a big problem, as we aim to generalize our conclusions to a wider population (Bellabeat customers) but our sample consists of only 33 participants. This means outliers or atypical values are more likely to lead us to wrong conclusions.

  • There are different numbers of observations per participant across the different tables. This is related to the previous point: we not only have a small sample, but the participation rate also declined during the 31-day period in which the data was gathered. We will explore this trend in detail in the “Findings” section below.

  • We have no information about preexisting medical conditions affecting the participants, or any other background. In the findings section, you’ll see a correlation between physical activity and body mass index (BMI); however, overall physical status depends on many different factors, and not considering them may oversimplify a complex matter.

FINAL NOTE: In the findings section below, you’ll see terms such as overweight, obese, etc. Personally, I don’t like using these terms … but I have to, as I need to keep my language consistent with medical terminology and standard categories.

Data Cleaning and Processing

The Fitabase dataset consists of 18 files; however, for this exercise I’ve decided to work with only 5 of them:

  • weightLogInfo_merged.csv.
  • dailyActivity_merged.csv.
  • sleepDay_merged.csv.
  • dailySteps_merged.csv.
  • hourlyIntensities_merged.csv.

First, let’s mention the data that I’ve decided to remove:

  • dailyActivity_merged.csv. Records where SedentaryMinutes equals 1440.
  • dailySteps_merged.csv. Records where StepTotal equals 0.
  • hourlyIntensities_merged.csv. Records where AverageIntensity equals 0.

What were the criteria for removing those records? They were removed because in all of those cases we can assume that the device (not the owner) remained static for the day. In other words, the device was active but not being used by the participant.
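As a rough illustration of the removal rule for the daily activity table, here is a minimal sketch using dplyr. The data frame below is a made-up stand-in for dailyActivity_merged.csv (the Id values and minutes are invented), and the object names `daily_activity` and `valid_activity` are my own, not necessarily the ones used in the hidden code.

```r
library(dplyr)

# Toy stand-in for dailyActivity_merged.csv (values invented for illustration)
daily_activity <- data.frame(
  Id               = c(1, 1, 2, 2),
  ActivityDate     = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  SedentaryMinutes = c(728, 1440, 1440, 1100)
)

# Keep only the days on which the device recorded something other than
# a full 24 hours (1440 minutes) of sedentary time
valid_activity <- daily_activity %>%
  filter(SedentaryMinutes != 1440)

nrow(valid_activity)
```

The same pattern would apply to the other two tables by swapping the condition (StepTotal != 0 or AverageIntensity != 0).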

Let’s talk about standardizing date formats. Across the tables, we find date / datetime fields formatted differently. This becomes an issue when trying to combine the information, so they have to be converted to a single format. The tables we had to manipulate were:

  • dailyActivity_merged.csv. ActivityDate field is transformed using the function: “mdy()”.
  • sleepDay_merged.csv. SleepDay field is transformed using the function: “mdy_hms()”.
  • dailySteps_merged.csv. ActivityDay field is transformed using the function: “mdy()”.

There is also a case where the analysis was focused on time instead of date. The table was:

  • hourlyIntensities_merged.csv. Activity_Hour field is transformed using the function: “hour()”.

The approach allowed me to extract the hour from a datetime field.
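To make the date handling more concrete, here is a small sketch with lubridate. The example strings are my assumption about how the raw fields look (month/day/year, with a 12-hour clock for the datetime fields); the variable names are hypothetical.

```r
library(lubridate)

# Example strings in the formats assumed for the raw files
activity_date <- mdy("4/12/2016")                  # date only
sleep_day     <- mdy_hms("4/12/2016 12:00:00 AM")  # date plus time
activity_hour <- mdy_hms("4/12/2016 3:00:00 PM")   # date plus time

# Once the time part is dropped, both fields refer to the same calendar day
as_date(sleep_day) == activity_date

# hour() works on a parsed date-time, so the string is parsed first
hour(activity_hour)
```

Parsing everything into proper Date / POSIXct values first is what makes later joins and groupings by day or hour possible.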

Finally, let’s talk about how the data was summarized for the analysis. As mentioned before, participants differ in how consistently they reported data; for example, some reported sleep hours every day while others reported just a couple of times. To tackle these differences, I’ve summarized the data by averages. For example, let’s say we have a participant who reported being active 10, 15, and 20 minutes from April 28th to April 30th. This participant gets a daily average of 15 minutes, and that is the value I’ll use to compare them with the rest of the participants. I’ve used the same approach in every table, so please keep this in mind while reading the article.
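The averaging described above can be sketched with dplyr as follows. The records are hypothetical (participant 1 reproduces the 10, 15, 20 example, participant 2 reports only once), and the column name `ActiveMinutes` is an illustrative assumption.

```r
library(dplyr)

# Hypothetical records: participant 1 logs three days, participant 2 only one
active_minutes <- data.frame(
  Id            = c(1, 1, 1, 2),
  ActivityDate  = as.Date(c("2016-04-28", "2016-04-29", "2016-04-30", "2016-04-28")),
  ActiveMinutes = c(10, 15, 20, 30)
)

# One row per participant: their daily average and how many records back it
average_active <- active_minutes %>%
  group_by(Id) %>%
  summarise(
    average_ActiveMinutes = mean(ActiveMinutes),
    record_count          = n()
  )

average_active
```

Keeping a record count next to each average makes it easy to see later which averages rest on very few observations.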

Findings

Device Usage

A common problem in research that lacks strict experimental control is losing at least some data due to human error or abandonment. Outside research, the time we dedicate to an activity can also be affected by loss of motivation or interfering events. This study was no different (which is totally expected): the general participation rate decreased over the month the data was gathered. Here are a couple of examples.

Daily Levels of Intensity.

The following chart shows the decline in valid records, that is, records where the number of sedentary minutes is different from 1440. To give more context, this table summarises the minutes spent at different levels of intensity, from sedentary to very active. When a record shows 1440 minutes of sedentary activity, it means that the device didn’t track any activity. Unless the user stayed completely motionless during the whole day, it’s safe to say we are looking at an invalid record that has to be removed. The remaining records, having valid information, were counted by day. As we have 33 participants, we expect to see that many records per day but, as you are about to see, this number drops significantly.

tracker_1

If we change the perspective and look at the individual participation rate, we can see that most participants remained quite active. The table below summarizes their participation per day of the week, as a total, and as a percentage (please remember, this is for the daily activity table only).

rename(participant_records, "Total" = total_obs, 
       "Participation Percent" = part_percent)
## # A tibble: 33 × 10
##            Id   tue   wed   thu   fri   sat   sun   mon Total `Participation P…`
##         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>              <dbl>
##  1 1624580081     5     5     5     4     4     4     4    31                100
##  2 2022484408     5     5     5     4     4     4     4    31                100
##  3 2026352035     5     5     5     4     4     4     4    31                100
##  4 2320127002     5     5     5     4     4     4     4    31                100
##  5 2873212765     5     5     5     4     4     4     4    31                100
##  6 4445114986     5     5     5     4     4     4     4    31                100
##  7 4558609924     5     5     5     4     4     4     4    31                100
##  8 5553957443     5     5     5     4     4     4     4    31                100
##  9 6962181067     5     5     5     4     4     4     4    31                100
## 10 8053475328     5     5     5     4     4     4     4    31                100
## # … with 23 more rows

Let’s summarize it one step further by grouping by participation rate. The table below presents the count of participants according to their participation rate; 31 records represent 100% of the expected records.

rename(category_participant_records, "Participation Percent" = part_percent)
## # A tibble: 10 × 2
## # Groups:   Participation Percent [10]
##    `Participation Percent` count
##                      <dbl> <int>
##  1                  100       12
##  2                   96.8      7
##  3                   90.3      1
##  4                   80.6      3
##  5                   74.2      1
##  6                   71.0      1
##  7                   64.5      2
##  8                   58.1      3
##  9                   54.8      2
## 10                    9.68     1
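For reference, the participation percentage can be reproduced with simple arithmetic. This is a sketch under the assumption that the percentage is the number of valid records divided by the 31 expected records; the totals below are illustrative values, not the full sample.

```r
# Expected number of valid records per participant (one per day of the study)
expected_records <- 31

# Illustrative totals of valid records for four participants
total_obs <- c(31, 30, 28, 3)

# Share of expected records actually present, as a percentage
part_percent <- round(100 * total_obs / expected_records, 2)
part_percent
```

Note that the table above prints these values with three significant digits, so 96.77 appears there as 96.8.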

This behaviour goes beyond this tracker (table) and remains consistent across different tables. Let’s see another two examples.

Daily sleep minutes

Here, the participation rate shows a similar pattern of decline; in addition, the participation rate was, in general, much lower.

tracker_2

Steps by Day

Here, the pattern is almost identical to what we see on the first tracker.

tracker_3

Summary: Daily Use of Devices

To summarise these observations, let’s see the “whole picture” by merging the last 3 charts.

tracker_4

While the pattern is clear, the question remains: what causes it? It’s hard to say. There could be many reasons, including lack of motivation, device malfunction, or misunderstanding of correct usage; it is all speculation until we ask the participants directly. However, from these observations we can make some recommendations for our stakeholders:

  • Follow up with research participants to ensure data entries remain consistent and optimal. For example, by giving a quick call to participants if we notice their devices are no longer returning any valid data.
  • Motivate participation by providing compensation. For example, by giving some discount or gift to participants who kept participating during the whole study.

Outside the context of research:

  • Adding notifications and reports to mobile apps. Giving constant feedback is a great way to keep people committed to a goal.

Physical Activity and BMI

Eight of our participants shared data about their BMI with us. This is particularly useful because it opens the door to comparing levels of physical activity with overall physical state. According to the Centers for Disease Control and Prevention (CDC), adult BMI is categorized as:

  • BMI < 18.5 = underweight
  • BMI 18.5 to 24.9 = normal
  • BMI 25 to 29.9 = overweight
  • BMI >= 30 = obese
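The categories above can be turned into a small helper. This is only a sketch: the function name `bmi_category` is hypothetical, and I am assuming the ranges are meant to be continuous (so a value such as 24.95 counts as normal).

```r
# Map BMI values to the CDC adult categories listed above
bmi_category <- function(bmi) {
  cut(
    bmi,
    breaks = c(-Inf, 18.5, 25, 30, Inf),
    labels = c("underweight", "normal", "overweight", "obese"),
    right  = FALSE  # intervals are [lower, upper), so 30 falls in "obese"
  )
}

as.character(bmi_category(c(21.6, 25.5, 47.5)))
```

Using `right = FALSE` matters at the boundaries: with the default setting, a BMI of exactly 30 would be labelled overweight instead of obese.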

Taking the BMI data from these eight participants, we can generate categories for what we can call “Overall State” (normal, overweight, etc.). We can also try to find correlations between their average BMI and different behaviours (minutes spent on different levels of activity, daily steps, sleep, etc.). The first table looks like this.

rename(basic_profile, "Average BMI" = average_BMI,
      "Overall State" = overall_state) 
## # A tibble: 8 × 3
##           Id `Average BMI` `Overall State`
##        <dbl>         <dbl> <chr>          
## 1 2873212765          21.6 normal         
## 2 1503960366          22.6 normal         
## 3 6962181067          24.0 normal         
## 4 8877689391          25.5 overweight     
## 5 4558609924          27.2 overweight     
## 6 4319703577          27.4 overweight     
## 7 5577150313          28   overweight     
## 8 1927972279          47.5 obese

Now, remember that we have a table that provides the daily minutes spent on each level of activity by participant and date. After the data is summarized to show the average values per participant, the resulting table looks like the one below; let’s show only the first 6 rows.

head(average_daily_activity) #first 6 rows, column names haven't been edited
## # A tibble: 6 × 11
##           Id average_VeryAct… average_FairlyA… average_Lightly… average_Sedenta…
##        <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
## 1 1503960366            40               19.8             227.              828.
## 2 1624580081             8.68             5.81            153.             1258.
## 3 1644430081             9.57            21.4             178.             1162.
## 4 1844505072             0.18             1.82            163.             1111.
## 5 1927972279             2.28             1.33             66.4            1229.
## 6 2022484408            36.3             19.4             257.             1113.
## # … with 6 more variables: record_count <int>, total_mins <dbl>,
## #   Percent_VeryActiveMinutes <dbl>, Percent_FairlyActiveMinutes <dbl>,
## #   Percent_LightlyActiveMinutes <dbl>, Percent_SedentaryMinutes <dbl>

We can see 4 levels of activity, from very active to sedentary. One of them is particularly interesting: Lightly Active Minutes. There is a negative correlation between the participants’ BMI and their average daily lightly active minutes. Let’s see it in a table joining the data from our 8 participants who reported BMI with their average lightly active minutes:

basic_profile_main %>% select(c("Id", "average_BMI", "overall_state", "average_LightlyActiveMinutes")) %>%
rename("Average Lightly Active Minutes" = average_LightlyActiveMinutes, 
      "Overall State" = overall_state, 
      "Average BMI" = average_BMI)
##           Id Average BMI Overall State Average Lightly Active Minutes
## 1 2873212765    21.57000        normal                         308.00
## 2 1503960366    22.65000        normal                         227.27
## 3 6962181067    24.02800        normal                         245.81
## 4 8877689391    25.48708    overweight                         234.71
## 5 4558609924    27.21400    overweight                         284.97
## 6 4319703577    27.41500    overweight                         236.40
## 7 5577150313    28.00000    overweight                         158.50
## 8 1927972279    47.54000         obese                          66.44

Let’s create a scatter plot, let’s add a regression line and finally, let’s calculate a correlation coefficient:

scatter_1
## `geom_smooth()` using formula 'y ~ x'

With a calculated Pearson’s value of -0.87, there is evidence of a strong negative correlation, although with only 8 participants it should be read with caution.
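The coefficient can be checked directly from the eight rows of the table above. The sketch below uses base R only; the vector names `bmi` and `light` are mine, and `cor.test()` is added as one possible way to see how wide the uncertainty is with so few participants.

```r
# Values copied from the table of the 8 participants with known BMI
bmi   <- c(21.57, 22.65, 24.028, 25.48708, 27.214, 27.415, 28, 47.54)
light <- c(308.00, 227.27, 245.81, 234.71, 284.97, 236.40, 158.50, 66.44)

# Pearson correlation coefficient
r <- cor(bmi, light, method = "pearson")
round(r, 2)

# Same coefficient with a confidence interval and p-value
test <- cor.test(bmi, light, method = "pearson")
test$conf.int
test$p.value
```

With a sample this small, the confidence interval is a useful complement to the single coefficient.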

To complement our findings, let’s compare the Pearson’s value of average BMI vs each of the activity levels.

round(cor(basic_profile_main[,2], basic_profile_main[,4:7], method = "pearson"), digits = 2)
##      average_VeryActiveMinutes average_FairlyActiveMinutes
## [1,]                     -0.28                       -0.43
##      average_LightlyActiveMinutes average_SedentaryMinutes
## [1,]                        -0.87                     0.45

With the exception of Lightly Active Minutes, none of the activity levels shows a strong correlation with BMI.

The next question is: if this level of activity is so highly related to people’s BMI, then, on average, what portion of our day do we spend at this level of activity compared to other intensities? To answer this question, let’s go back to our 8 participants with known BMI and see how they compare to each other.

bar_1

It is quite interesting to see how the activities that take less than 25% of our day seem to be the ones most closely related to keeping us in shape. On the other hand, it is no surprise at all to see that we spend most of our day on sedentary activities.

Now, while we can definitely take advantage of our fitness device to control the number of minutes we dedicate to each of these intensity levels, what happens if this is not the case? For example, what if I’m using my device but not changing my routine? In that case we can expect to see a lot of variation throughout the day: in a single hour we can walk on the street, run to catch a bus, watch TV, etc. So, we can calculate an average activity level per hour that is representative of all those activities. hourlyIntensities_merged.csv allows us to get this information; let’s see the average activity level per hour across our whole sample.

line_1 

The pattern is totally expected: from 21:00 to 4:00 (next day) we see activity levels below the general average, as most people rest and sleep during this time, but from 5:00 to 20:00 we wake up, go to work, and do all of our activities.

While the chart above shows some truth about our sample, it can also be misleading, as the mean alone won’t tell the whole story of our data. To have a better understanding of our sample, let’s generate a boxplot.

box_1

Let’s repeat this approach, this time using only data from our 8 participants with known BMI to see how they compare.

line_2

The results are quite similar in general, but this time we can see that the participant with the highest BMI has lower hourly levels of activity, most of them below the general average.

Again, let’s now generate a boxplot. We know the average activity per group and hour, but the question remains: how did these hourly values vary from day to day? Remember, the study covered 31 days, so was there little or no variation? Or, on the contrary, does the hourly average represent a wide range of values?

box_2

There is an alternative way to represent these results. It is less accurate, but it can also provide a general understanding of the differences we are studying among the groups.

line_3

The chart above has a limitation: to keep it simple, I calculated the standard deviation (sd) per hour and group and then, using an area geom, highlighted only the -1/+1 sd area to give you a general sense of the dispersion.
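The -1/+1 sd band can be computed along these lines. This is a sketch on invented numbers; the column names `group`, `hour` and the object name `band` are assumptions for illustration, not the ones from the hidden code.

```r
library(dplyr)

# Toy hourly intensities for two groups and two hours (values invented)
hourly <- data.frame(
  group            = rep(c("normal", "obese"), each = 4),
  hour             = rep(c(8, 8, 9, 9), times = 2),
  AverageIntensity = c(0.2, 0.4, 0.5, 0.7, 0.1, 0.1, 0.2, 0.4)
)

# Mean and standard deviation per group and hour, plus the band limits
band <- hourly %>%
  group_by(group, hour) %>%
  summarise(
    mean_intensity = mean(AverageIntensity),
    sd_intensity   = sd(AverageIntensity),
    .groups        = "drop"
  ) %>%
  mutate(
    lower = mean_intensity - sd_intensity,
    upper = mean_intensity + sd_intensity
  )

band
```

The `lower` and `upper` columns are the kind of values that a ribbon or area layer in ggplot2 could shade around the mean line.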

The last question is: what exactly is lightly active intensity? Is there a specific set of activities we can do to trigger this physical state? According to Nikki Prosch, in her article Light, Moderate, And Vigorous Activity, light activity is:

“Light intensity activities require the least amount of effort, compared to moderate and vigorous activities. The definition for light intensity activity is an activity that is classified as < 3 METS. One MET, or metabolic equivalent, is the amount of oxygen consumed while sitting at rest. Thus, an activity classified as 2 METS would be equal to 2 times the amount of oxygen consumed while sitting at rest (1 MET). METS are a convenient and standard method for describing absolute intensity of physical activities. Some examples of light physical activities include: walking slowly (i.e. shopping, walking around the office), sitting at your computer, making the bed, eating, preparing food, and washing dishes.”

If walking is an example of light intensity activity, then it makes sense to think that the number of steps registered by our devices should be strongly correlated with the number of lightly active minutes. Then, given the correlation we’ve found between lightly active minutes and BMI, we can say we have found the way to stay in good shape, right? Let’s see.

scatter_2
## `geom_smooth()` using formula 'y ~ x'

Unfortunately, the answer to the question above is: not necessarily. With a value of 0.39 we can conclude the correlation is rather weak. As expected, we cannot oversimplify complicated matters.

Summary: BMI and Physical Activity

What can we conclude from all these observations? There are a few things:

  • There is no question that the more active we are throughout the day, the more likely we are to stay in good physical shape.

  • However, it seems that “lightly active” is the level of intensity most closely related to BMI, rather than the more intense levels.

  • While walking is considered an example of a light intensity activity, it is not necessarily the only or main way to trigger this physical state. We need to keep in mind that each user may behave differently, and what works for some of them may not be the right fit for others.

  • Given that our devices are capable of measuring METs, the best approach is to learn from the users themselves and provide motivation by using their own MET indicators.

Generalizing to a Wider Population

Generalizing our findings is difficult for 2 main reasons: 1. the sample is very small, and 2. we don’t really know anything about our participants’ background; for example, we don’t know their country, their age, their medical conditions, etc. This makes it difficult to compare our results with other research. To overcome this obstacle we’ll have to make one assumption: that our 33 participants are a representative sample of a population, and that population will be defined as Fitabase users.

So, if our sample is representative of the population, what should our results look like if we were looking at the whole population? For this exercise, we’ll use 3 metrics:

  • Average lightly active minutes.
  • Average daily steps.
  • Average sleep minutes.

As a starting point, our hypothesis will be that these behaviours are normally distributed.
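As a quick illustration of how such a hypothesis can be checked, here is a sketch of the Shapiro-Wilk test on simulated data. The simulated values (33 draws with a mean of 200 and a standard deviation of 70) and the 0.05 threshold are assumptions chosen for the example, not results from the dataset.

```r
# Simulate 33 participant averages from a normal distribution
set.seed(42)
simulated <- rnorm(33, mean = 200, sd = 70)

# Null hypothesis of the test: the data comes from a normal distribution
result <- shapiro.test(simulated)
result$statistic
result$p.value

# A small p-value is evidence against normality; a large one only means
# the test found no evidence against it
alpha <- 0.05
decision <- if (result$p.value < alpha) "reject normality" else "cannot reject normality"
decision
```

Keep in mind that a large p-value does not prove normality; it only means the sample is compatible with it.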

Average lightly active minutes

Let’s start by using a normality test to see if data supports our hypothesis, for this exercise we’ll use the Shapiro–Wilk test.

norm_1
## 
##  Shapiro-Wilk normality test
## 
## data:  average_daily_activity$average_LightlyActiveMinutes
## W = 0.9785, p-value = 0.74

With p-value = 0.74 we cannot reject the hypothesis that the data is normally distributed, so the data is consistent with our assumption. Let’s compare our real observations (grey histogram) with what we can expect from the whole population (blue bell).

hist_2

With a calculated total average of 204.59 daily lightly active minutes and a standard deviation of 70.07, we can estimate that about 68% of our population ranges between 134.52 and 274.66 daily lightly active minutes. Now, if the correlation with BMI is accurate, then we can estimate that their BMI ranges from 23 to 36 approximately. Considering that BMI >= 30 is already considered obesity, we have a serious health problem inside our population. For more context, go back to the chart titled “Correlation Between Average BMI vs. Average Lightly Active Minutes”, then find the mean on the x-axis (around 200) and see where it intersects the blue regression line on the y-axis: it’s around 30.

Where do our known-BMI participants stand against the whole population? Let’s see.
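The arithmetic behind these estimates can be sketched as follows. The mean and standard deviation are the ones reported above, and the regression uses the 8 participants with known BMI from the earlier table; fitting a simple linear model with `lm()` is my assumption about how the blue regression line was obtained.

```r
# Reported parameters for daily lightly active minutes
mean_light <- 204.59
sd_light   <- 70.07

# Range covered by one standard deviation around the mean
range_1sd <- c(lower = mean_light - sd_light, upper = mean_light + sd_light)

# Share of a normal distribution that falls within one sd of the mean
share_1sd <- pnorm(1) - pnorm(-1)

# The 8 participants with known BMI (values copied from the table above)
profile <- data.frame(
  average_BMI = c(21.57, 22.65, 24.028, 25.48708, 27.214, 27.415, 28, 47.54),
  average_LightlyActiveMinutes = c(308.00, 227.27, 245.81, 234.71,
                                   284.97, 236.40, 158.50, 66.44)
)

# Regress BMI on lightly active minutes and project the two range limits
fit <- lm(average_BMI ~ average_LightlyActiveMinutes, data = profile)
predicted_bmi <- unname(predict(
  fit,
  newdata = data.frame(average_LightlyActiveMinutes = unname(range_1sd))
))

range_1sd
round(share_1sd, 4)
round(predicted_bmi)
```

Since the line is fitted on only 8 points, these projected BMI values are a rough guide rather than a prediction for any individual.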

hist_1

It’s quite worrying to think that, in strict theory, people inside the first 3 bins are likely to have obesity problems.

Average daily steps

Again, let’s use the Shapiro–Wilk test to see if our data is normally distributed.

norm_3
## 
##  Shapiro-Wilk normality test
## 
## data:  average_dailysteps$average_StepTotal
## W = 0.98139, p-value = 0.8272

With a p-value = 0.8272, again, we cannot reject the hypothesis that our data is normally distributed. Let’s compare the real observations with the expected values.

hist_3

Average sleep minutes

Finally, let’s take a look at this metric.

norm_2
## 
##  Shapiro-Wilk normality test
## 
## data:  average_TotalMinutesAsleep$average_TotalMinutesAsleep
## W = 0.88339, p-value = 0.009728

Here the results are different. With p-value = 0.009728 (p < 0.01) we reject the hypothesis that the data is normally distributed. This is surprising; honestly, I think it happened due to the small sample size. Let’s generate a chart like the ones above, but this time we’ll add a second bell (in green) with the adjusted distribution.

hist_4

If we take a closer look at our outliers, we can confirm that the participants who reported them were not consistent: they have a very low number of records during the month, so their reported values are not likely to represent their daily habits but rather misuse of the device, or at least an atypical record from the user.

head(arrange(average_TotalMinutesAsleep, record_count))
## # A tibble: 6 × 3
##           Id average_TotalMinutesAsleep record_count
##        <dbl>                      <dbl>        <int>
## 1 2320127002                       61              1
## 2 7007744171                       68.5            2
## 3 1844505072                      652              3
## 4 6775888955                      350.             3
## 5 8053475328                      297              3
## 6 1644430081                      294              4

Conclusions and Discussion

Now, how can our findings apply to Bellabeat customers? And how can the company benefit from them? Let’s see:

  • We can expect a significant portion of users to stop using the devices within a month of starting. From the perspective of behavioral psychology, this phenomenon happens due to a lack of proper positive feedback or, in other words, positive reinforcement of the monitored activity. Without proper compensation, the behaviour is destined to disappear. I don’t have the details about how this data was gathered (for example, whether our participants were new customers, whether they received some compensation or simply allowed their data to be gathered, etc.), so this is mostly speculation. Still, our findings remind me a lot of what happens at gyms. There is a common scenario: lots of people sign up for a 6-month plan, motivated to get in better shape, but when they don’t see immediate results they abandon their goals and stop going. I feel something like that could have happened here but, again, it’s speculation. What is clear is that, most likely, this will happen to some degree with Bellabeat’s customers.

  • We can expect most behaviours to be normally distributed. It is not surprising at all to see that behaviours like lightly active minutes and daily steps are normally distributed; many social, biological and psychological phenomena show this pattern, and it is the reason the normal distribution is an important topic in those sciences. What is really important are the parameters we’ve found, in particular the average daily lightly active minutes: if the correlation is accurate, we can expect about half of our customers to have weight problems. Let’s speculate one more time: it makes sense to think that a lot of people purchase these fitness devices because they are looking for a solution to already existing weight / health problems, and thus our average would be higher than their respective national average. I would recommend that Bellabeat survey new customers about what motivated them to get the products; this way we can have a better understanding of their needs and motivations.

  • Finally, we can expect to see a strong correlation between Lightly Active Minutes and BMI. This finding is quite interesting, as the correlation is weaker for the physical intensities that we would often associate with exercise routines, like very active and fairly active. Even more interesting is the fact that the positive correlation with sedentary minutes is not nearly as high. If the correlation could be verified on a bigger sample, we could definitely use this to motivate Bellabeat’s customers, explaining that they don’t really need an exhausting gym routine to be in better shape, but rather simple habits they can practise day by day.

In the end, we need to keep in mind that there are important factors beyond physical activity that also contribute to our weight and overall health; alcoholism, nutrition and drug use are just some examples. So if we really want to know more about how the products can or can’t help the customers, we need to investigate in much further detail. If we want to focus on a small sample, we’ll need to gather data over a much longer time period, for example, two years and, of course, we’ll need to complement our data with interviews, surveys, etc.: all the information a participant can share about their routine.

References