7  Data management and wrangling

Session objectives:

  1. Learn the difference between “messy” and “tidy” data, including how to create tidy data to simplify your analysis.
  2. Perform simple transformations and subsetting of datasets, such as:
    • Subset specific columns and rows of a dataset, with filter() and select().
    • Sort rows of a dataset by a specific column, with arrange().
    • Create new or transform existing columns in a dataset, with mutate().
    • Calculate simple data summaries, with summarise().
  3. Learn about and apply the “split-apply-combine” method of analyses, with group_by() and summarise().
  4. Write “tidier” and more readable code by using the pipe (%>%) operator.

7.1 “Messy” vs. “tidy” data

This session usually takes a fair amount of time relative to the other sessions. The Quarto session comes right afterwards and usually finishes more quickly than the others, so it's fine if this session runs over into the Quarto session's time.

For the reading section, emphasize the characteristics of a “tidy” dataset.

Reading task: ~10 minutes

The concept of “tidy” data was popularized in an article by Hadley Wickham and is described in more detail in the Tidy Data chapter of the R for Data Science online book. Before we continue with tidy data, we need to cover something related to the concept of “tidy” that has already come up in this course: the tidyverse. The tidyverse is an ecosystem of R packages designed to work well together, all following a shared “design philosophy” and common style guide, which makes combining these packages much easier. These are the reasons we teach the tidyverse.

Ok, back to “tidy data”. A tidy dataset is when:

  • Each variable has its own column (e.g. “Body Weight”).
  • Each observation has its own row (e.g. “Person”).
  • Each value has its own cell (e.g. “Body weight for a person at a specific date”).

Take a look at the example “tidy” and “messy” data frames (also called “tibbles” in the tidyverse) below. Think about why each may be considered “tidy” or “messy”. What differences do you notice between the tidy versions and the messier versions?

# Datasets come from tidyr
# Tidy:
table1
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <dbl>  <dbl>      <dbl>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
# Partly tidy:
table2
#> # A tibble: 12 × 4
#>    country      year type            count
#>    <chr>       <dbl> <chr>           <dbl>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583
# Messier:
table3
#> # A tibble: 6 × 3
#>   country      year rate             
#>   <chr>       <dbl> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583
# Messy:
table4a
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#>   <chr>        <dbl>  <dbl>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
# Messy:
table4b
#> # A tibble: 3 × 3
#>   country         `1999`     `2000`
#>   <chr>            <dbl>      <dbl>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

The “most” tidy version is table1: each column describes its values (e.g. population is population size), each row is a unique observation (e.g. the first row contains the values for Afghanistan in 1999), and each cell is an explicit value representative of its column and row.

table2 is a “long” version of table1, so it is partly “tidy”, but it doesn’t satisfy the rule that each variable has its own column, since the count column holds values for both cases and population size.

On the other hand, table3 is messy because the rate column values are a composite of two other column values (cases and population), when each cell should contain a single value (e.g. the computed rate). Both table4a and table4b have columns with ambiguous values inside. For example, you can’t tell from the data alone what the values in the 1999 column represent.
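For a concrete sense of how a messy table can be made tidy, the tidyr package (where these example tables come from) has a separate() function that can split a composite column into its parts. This is just an illustrative sketch; we won't use separate() in this session.

# Split the composite rate column of table3 into cases and population
# (convert = TRUE turns the resulting text pieces into numbers)
tidyr::separate(table3, rate, into = c("cases", "population"), sep = "/", convert = TRUE)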

Tidy data has a few notable benefits:

  1. Time spent preparing your data to be tidy from the beginning can save days of added work and frustration in the long run.
  2. “Tidy data” is a conceptual framework that allows you to build on and wrangle (i.e. “manipulate”, “clean up”, “manage”) data in simpler, easier-to-interpret ways, especially when used within the framework of the tidyverse.

The concept of tidy data also gives rise to “tidy code” for wrangling. By using “verbs” (R functions) and chaining them together in “sentences” (in a sequential pipeline), you can construct meaningful and readable code that describes in plainer English what you are doing to the data. This is one simple way that you can enhance the reproducibility of your code.

7.2 Managing and working with data in R

After they read it, go over and emphasize the functions shown in the table that we will be teaching.

Reading task: ~5 minutes

When working with data, there are a few principles to follow:

  • You should always save your original raw dataset in the data-raw/ folder.
    • Note: Whether or not you save data to data-raw/ depends on how you collected the data and how many collaborators are on your team. You may end up storing and processing the data in another folder as a project of its own.
  • Never edit your raw data directly; keep the raw file untouched in its own location, separate from any edited versions.
  • Only work with your raw data using R code. Don’t manually edit it. Manual editing doesn’t leave a history of what you’ve done to it, so you can’t go back and see what you’ve done. Always keep a history of any changes you’ve made to the data, preferably by using R code.
  • Save the edited data as another dataset and store it in the data/ folder (a minimal sketch of this whole workflow is shown right after this list).
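To make these principles concrete, here is a minimal sketch of that workflow. The file name and columns (survey_raw.csv, id, age, weight) are made up for illustration only; reading and writing files is covered properly at the end of this session.

# Read the raw data from data-raw/ (the raw file itself is never edited)
survey_raw <- readr::read_csv(here::here("data-raw/survey_raw.csv"))

# Wrangle only with R code, so every change is recorded in the script
survey_clean <- dplyr::select(survey_raw, id, age, weight)

# Save the edited data as its own dataset in the data/ folder
readr::write_csv(survey_clean, here::here("data/survey.csv"))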

When wrangling your data with R code, make sure to:

  • Document what you did to your data and why you did it, to help you remember later on (by using the # symbol to write comments and other non-code text).
  • Write the code in a way that is as descriptive as possible and is readable enough to understand what is being done to the data.
  • Keep the code simple: Don’t be clever, be clear. Clear code is easier to understand than clever and sometimes overly-complex code.

In data wrangling, most tasks can be expressed by a few simple “verbs” (actions). Wrangling here refers to maneuvering, managing, controlling, and turning your data around to clean it up, better understand it, and prepare it for later analyses. The table below lists some common “verbs” from the dplyr package that come from the tidyverse:

List of common data wrangling tasks, along with an example and the function used for the wrangling.
Task | Example | Function
Select columns | Remove data entry columns such as the name of the person who entered the data. | select()
Rename columns | Change a column name from ‘Q1’ to ‘participant_name’. | rename()
Transform or modify columns | Multiply or take the log of a column’s values. | mutate()
Subset/filter rows | Keep rows with glucose values above 4. | filter()
Sort rows | Show rows with the smallest value at the top. | arrange()
Calculate summaries | Calculate the maximum, median, and minimum age. | summarise()
Run code by a group | Calculate mean age separately for males and females. | group_by() with summarise()

Tip: Sometimes you need to do some complicated wrangling to get your data “in shape” for later analyses. To save some time, you could save the wrangled data as an “output” dataset in the data/ folder. That way, you can easily use it again later rather than having to run the wrangling code every time you want to work with the data.

7.3 Load the packages and dataset

We’re going to use the US NHANES dataset to demonstrate some wrangling functions. There is an R package called {NHANES} that contains a teaching version of the original dataset, so we’ll use that for this course.

First, make sure that you have the LearningR R Project open. Second, open the R/learning.R script and write code to load the {NHANES} package. It should look like this now:

R/learning.R
library(tidyverse)
library(NHANES)

There are many ways of loading data into R. This method loads data that is included in a package, in this case {NHANES}. We will cover how to load a dataset from a file at the end of this session.

Then, below the library(), let’s take a glimpse() at the NHANES dataset. Run this function by hitting Ctrl-Enter while the cursor is on the code to send it from the script to the R Console.

R/learning.R
# Briefly glimpse contents of dataset
glimpse(NHANES)
#> Rows: 10,000
#> Columns: 76
#> $ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638,…
#> $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_…
#> $ Gender           <fct> male, male, male, male, female, male, mal…
#> $ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, …
#> $ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-…
#> $ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 54…
#> $ Race1            <fct> White, White, White, Other, White, White,…
#> $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ Education        <fct> High School, High School, High School, NA…
#> $ MaritalStatus    <fct> Married, Married, Married, NA, LivePartne…
#> $ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20…
#> $ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500,…
#> $ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33,…
#> $ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 1…
#> $ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own,…
#> $ Work             <fct> NotWorking, NotWorking, NotWorking, NA, N…
#> $ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2,…
#> $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ Height           <dbl> 165, 165, 165, 105, 168, 133, 131, 167, 1…
#> $ BMI              <dbl> 32.2, 32.2, 32.2, 15.3, 30.6, 16.8, 20.6,…
#> $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.…
#> $ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 6…
#> $ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118…
#> $ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 6…
#> $ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106…
#> $ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 6…
#> $ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118…
#> $ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 6…
#> $ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118…
#> $ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 6…
#> $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2…
#> $ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5…
#> $ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106…
#> $ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.11…
#> $ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, N…
#> $ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood…
#> $ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4…
#> $ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0…
#> $ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, No…
#> $ Depressed        <fct> Several, Several, Several, NA, Several, N…
#> $ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, N…
#> $ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA…
#> $ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, N…
#> $ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4,…
#> $ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, N…
#> $ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes…
#> $ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5…
#> $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, …
#> $ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, …
#> $ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes,…
#> $ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2,…
#> $ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100,…
#> $ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, …
#> $ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, N…
#> $ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, N…
#> $ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 1…
#> $ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes,…
#> $ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, N…
#> $ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, N…
#> $ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, N…
#> $ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes,…
#> $ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 1…
#> $ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, …
#> $ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1…
#> $ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Ye…
#> $ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual,…
#> $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

This gives us a quick overview of which variables are in the dataset and what data type each one is.

7.4 Selecting specific columns in a dataset

Selecting columns of a dataset is a very common data wrangling task. The dplyr function for this task is appropriately called select(). You would use select() to extract one or more variables in a dataset that you want to have a closer look at or to save as a new data frame to work with. For example, if you wish to explore the clinical characteristics of your study sample, you might select some basic demographic variables (e.g., the Age column) and clinical variables (e.g., the Weight and Height columns) to perform these analyses.

For the input arguments, select() takes the dataset as the first argument, which is the first input position right after the opening parenthesis (, and then takes the names of the columns you want to select. The argument after the data argument is ..., which indicates that you can add as many columns as you want, separated by commas (,).

R/learning.R
# Select one column by its name, without quotes
select(NHANES, Age)
#> # A tibble: 10,000 × 1
#>      Age
#>    <int>
#>  1    34
#>  2    34
#>  3    34
#>  4     4
#>  5    49
#>  6     9
#>  7     8
#>  8    45
#>  9    45
#> 10    45
#> # ℹ 9,990 more rows
# Select two or more columns by name, without quotes
select(NHANES, Age, Weight, BMI)
#> # A tibble: 10,000 × 3
#>      Age Weight   BMI
#>    <int>  <dbl> <dbl>
#>  1    34   87.4  32.2
#>  2    34   87.4  32.2
#>  3    34   87.4  32.2
#>  4     4   17    15.3
#>  5    49   86.7  30.6
#>  6     9   29.8  16.8
#>  7     8   35.2  20.6
#>  8    45   75.7  27.2
#>  9    45   75.7  27.2
#> 10    45   75.7  27.2
#> # ℹ 9,990 more rows
# To *exclude* a column, use minus (-)
select(NHANES, -HeadCirc)
#> # A tibble: 10,000 × 75
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct>
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA> 
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA> 
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA> 
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA> 
#>  5 51630 2009_10  female    49 " 40-49"        596 White <NA> 
#>  6 51638 2009_10  male       9 " 0-9"          115 White <NA> 
#>  7 51646 2009_10  male       8 " 0-9"          101 White <NA> 
#>  8 51647 2009_10  female    45 " 40-49"        541 White <NA> 
#>  9 51647 2009_10  female    45 " 40-49"        541 White <NA> 
#> 10 51647 2009_10  female    45 " 40-49"        541 White <NA> 
#> # ℹ 9,990 more rows
#> # ℹ 67 more variables: Education <fct>, MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>,
#> #   HomeRooms <int>, HomeOwn <fct>, Work <fct>, Weight <dbl>,
#> #   Length <dbl>, Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>,
#> #   BMI_WHO <fct>, Pulse <int>, BPSysAve <int>, BPDiaAve <int>,
#> #   BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, …

If some of your columns have similar patterns for naming at the beginning, middle, or end of the name, you can use some helper functions to choose these columns. Some commonly used helpers are:

R/learning.R
# All columns starting with letters "BP" (blood pressure)
select(NHANES, starts_with("BP"))
#> # A tibble: 10,000 × 8
#>    BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3
#>       <int>    <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1      113       85    114     88    114     88    112     82
#>  2      113       85    114     88    114     88    112     82
#>  3      113       85    114     88    114     88    112     82
#>  4       NA       NA     NA     NA     NA     NA     NA     NA
#>  5      112       75    118     82    108     74    116     76
#>  6       86       47     84     50     84     50     88     44
#>  7      107       37    114     46    108     36    106     38
#>  8      118       64    106     62    118     68    118     60
#>  9      118       64    106     62    118     68    118     60
#> 10      118       64    106     62    118     68    118     60
#> # ℹ 9,990 more rows
# All columns ending in letters "Day"
select(NHANES, ends_with("Day"))
#> # A tibble: 10,000 × 3
#>    TVHrsDay CompHrsDay AlcoholDay
#>    <fct>    <fct>           <int>
#>  1 <NA>     <NA>               NA
#>  2 <NA>     <NA>               NA
#>  3 <NA>     <NA>               NA
#>  4 <NA>     <NA>               NA
#>  5 <NA>     <NA>                2
#>  6 <NA>     <NA>               NA
#>  7 <NA>     <NA>               NA
#>  8 <NA>     <NA>                3
#>  9 <NA>     <NA>                3
#> 10 <NA>     <NA>                3
#> # ℹ 9,990 more rows
# All columns containing letters "Age"
select(NHANES, contains("Age"))
#> # A tibble: 10,000 × 9
#>      Age AgeDecade AgeMonths DiabetesAge Age1stBaby SmokeAge
#>    <int> <fct>         <int>       <int>      <int>    <int>
#>  1    34 " 30-39"        409          NA         NA       18
#>  2    34 " 30-39"        409          NA         NA       18
#>  3    34 " 30-39"        409          NA         NA       18
#>  4     4 " 0-9"           49          NA         NA       NA
#>  5    49 " 40-49"        596          NA         27       38
#>  6     9 " 0-9"          115          NA         NA       NA
#>  7     8 " 0-9"          101          NA         NA       NA
#>  8    45 " 40-49"        541          NA         NA       NA
#>  9    45 " 40-49"        541          NA         NA       NA
#> 10    45 " 40-49"        541          NA         NA       NA
#> # ℹ 9,990 more rows
#> # ℹ 3 more variables: AgeFirstMarij <int>, AgeRegMarij <int>,
#> #   SexAge <int>

For more information on using the pattern functions such as starts_with(), check out ?select_helpers (choose the “Select helpers” option in the menu that pops up).

You’ll notice that running these functions doesn’t actually change the data itself. When you run a function without assigning it using <-, the only action the function does is to send the output to your screen, and you won’t have saved that data anywhere for later use. But if you want to create a new dataset with only the columns you selected, you’ll need to assign the selected dataset to a new object.

The full NHANES dataset is 10,000 individuals (rows) with 76 parameters (columns). To make it easier to look at and since we are only interested in some of these parameters, we will subset the large dataset into something smaller and save it for later use as a new dataset.

R/learning.R
# Save the selected columns as a new data frame
# Recall the style guide for naming objects
nhanes_small <- select(NHANES, Age, Gender, BMI, Diabetes,
                       PhysActive, BPSysAve, BPDiaAve, Education)

# View the new data frame
nhanes_small
#> # A tibble: 10,000 × 8
#>      Age Gender   BMI Diabetes PhysActive BPSysAve BPDiaAve
#>    <int> <fct>  <dbl> <fct>    <fct>         <int>    <int>
#>  1    34 male    32.2 No       No              113       85
#>  2    34 male    32.2 No       No              113       85
#>  3    34 male    32.2 No       No              113       85
#>  4     4 male    15.3 No       <NA>             NA       NA
#>  5    49 female  30.6 No       No              112       75
#>  6     9 male    16.8 No       <NA>             86       47
#>  7     8 male    20.6 No       <NA>            107       37
#>  8    45 female  27.2 No       Yes             118       64
#>  9    45 female  27.2 No       Yes             118       64
#> 10    45 female  27.2 No       Yes             118       64
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: Education <fct>

7.5 Renaming all column names based on the style guide

In the interests of keeping data tidy and matching the style guide, we should change the column names to be all lower case with _ for spaces between words. There’s a package that can do that for us called snakecase.
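To get a feel for what the package does, you can try its main function on a single column name in the Console (this quick check isn't part of the script we're building):

# Convert one column name to snake case
snakecase::to_snake_case("BPSysAve")
#> [1] "bp_sys_ave"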

To change all the column names to snake case, we’ll use the function rename_with(). This function takes the data as the first argument, and the second argument needs to be a function, in our case snakecase::to_snake_case, written without the () at the end because we are passing the function itself rather than calling it. rename_with() will then apply that function to rename all columns.

R/learning.R
# Rename all columns to snake case
nhanes_small <- rename_with(nhanes_small, snakecase::to_snake_case)

# Have a look at the data frame
nhanes_small
#> # A tibble: 10,000 × 8
#>      age gender   bmi diabetes phys_active bp_sys_ave bp_dia_ave
#>    <int> <fct>  <dbl> <fct>    <fct>            <int>      <int>
#>  1    34 male    32.2 No       No                 113         85
#>  2    34 male    32.2 No       No                 113         85
#>  3    34 male    32.2 No       No                 113         85
#>  4     4 male    15.3 No       <NA>                NA         NA
#>  5    49 female  30.6 No       No                 112         75
#>  6     9 male    16.8 No       <NA>                86         47
#>  7     8 male    20.6 No       <NA>               107         37
#>  8    45 female  27.2 No       Yes                118         64
#>  9    45 female  27.2 No       Yes                118         64
#> 10    45 female  27.2 No       Yes                118         64
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>

7.6 Renaming specific columns

Depending on how your data was collected, it may have column names that aren’t very descriptive. So you’ll probably want to rename them to something more explanatory, which is particularly important if you’re sharing your work with others or in an environment where multiple people are working on the same data. We will use the function called rename(). Like select(), rename() takes the dataset as the first argument and then takes as many renaming arguments as you want (because the second argument position is ...). When renaming, it takes the form of newname = oldname.

The “gender” variable in the dataset actually describes “sex”, so let’s rename it to accurately reflect the data itself. Note that gender is a social construct, while sex is biological. This unfortunately is often incorrectly named in many scientific papers, where accuracy of terms is critical. Because we want to rename the variable in the dataset so it stays renamed, we need to assign it again to nhanes_small.

R/learning.R
nhanes_small <- rename(nhanes_small, sex = gender)
nhanes_small
#> # A tibble: 10,000 × 8
#>      age sex      bmi diabetes phys_active bp_sys_ave bp_dia_ave
#>    <int> <fct>  <dbl> <fct>    <fct>            <int>      <int>
#>  1    34 male    32.2 No       No                 113         85
#>  2    34 male    32.2 No       No                 113         85
#>  3    34 male    32.2 No       No                 113         85
#>  4     4 male    15.3 No       <NA>                NA         NA
#>  5    49 female  30.6 No       No                 112         75
#>  6     9 male    16.8 No       <NA>                86         47
#>  7     8 male    20.6 No       <NA>               107         37
#>  8    45 female  27.2 No       Yes                118         64
#>  9    45 female  27.2 No       Yes                118         64
#> 10    45 female  27.2 No       Yes                118         64
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>
Tip

We’ve re-assigned the object nhanes_small multiple times. If you make a mistake, the easiest solution is to re-run the code from the start again, which should fix any issues.

Now, you can see that the column has been renamed from gender to sex. What if you wanted to select some columns and then rename some of them? Would you have to create a new data object every time? No! We can make use of a very powerful tool called piping with the %>% operator.

7.7 Chaining functions with the pipe

Reading task: ~5 minutes

A key component of tidy data and tidy code is making use of the “pipe” operator, %>%. There is also the base R pipe |>, which works basically the same but with some small differences. For this course we will use %>%, though you can use either.
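As a quick comparison, the two pipes behave the same for the simple cases we use in this course. The snippet below is just an illustration (it assumes the tidyverse is loaded for %>%, and R 4.1 or newer for |>):

# Both take the result on the left and pass it to the function on the right
c(5, 1, 3) %>% sort()
#> [1] 1 3 5
c(5, 1, 3) |> sort()
#> [1] 1 3 5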

You would use the “pipe” operator when you are writing a piece of code with multiple operations or intermediate steps that would otherwise require saving and overwriting each step as an object (see below). One advantage of the “pipe” operator is that it helps keep your code free of redundant, temporary object names.

This operator allows you to “pipe” the output from one function into the input of another function, just like a plumbing pipe would do for water. This allows you to easily chain functions together into “sentences”. Let’s use an example based on everyday English actions. Here is the sentence:

We need some eggs. Drive to the grocery store and buy some eggs before arriving home from work.

There are basically two actions here (“drive” and “buy”) and four inputs (“work”, “grocery store”, “eggs”, “home”), where each step depends on the output of the previous one. Since an action in R is a function, the functions would be drive() and buy(). In regular R (without the “pipe” operator), we would have to nest the functions inside one another (reading from the inside out) to chain them:

drive(buy(drive(at_work, "grocery store"), "eggs"), "home")

This is difficult to read. Another way to chain functions would be to create temporary objects for each step:

at_grocery_store <- drive(at_work, "grocery store")
got_eggs <- buy(at_grocery_store, "eggs")
at_home <- drive(got_eggs, "home")

This still isn’t very “readable”, as we are having to re-name each intermediate object with reference to the object before it. The pipe %>% operator can really simplify this:

at_work %>% 
    drive("grocery store") %>% 
    buy("eggs") %>% 
    drive("home")

Do you find this more readable and understandable? It reads how it would be done, in order of the steps taken.

The idea of piping is to read the functions from left to right. This can help clarify and break down complex data processing workflows, and is the basis for the tidyverse and many other packages.

The operator %>% takes the output from the object or function from the left of the operator and puts it into the function on the right of the operator. All input goes into the first position argument of the function. Within the tidyverse packages, all functions take a data frame (or vector) as the first argument in order to work with the pipe.

Let’s try this out on NHANES. The keyboard shortcut for the pipe is Ctrl-Shift-M (i.e., M for the magrittr package that created the pipe).

R/learning.R
# These two ways are the same
colnames(nhanes_small)
#> [1] "age"         "sex"         "bmi"         "diabetes"   
#> [5] "phys_active" "bp_sys_ave"  "bp_dia_ave"  "education"
nhanes_small %>% 
    colnames()
#> [1] "age"         "sex"         "bmi"         "diabetes"   
#> [5] "phys_active" "bp_sys_ave"  "bp_dia_ave"  "education"

The pipe automatically takes nhanes_small and puts it into the first position, so we don’t need to specify nhanes_small inside colnames() when piping.

Let’s try using the pipe with the select() and rename() functions from the previous section. Remember, both select() and rename() take a dataset as the first argument, which makes them pipe-able.

R/learning.R
nhanes_small %>% 
    select(phys_active) %>% 
    rename(physically_active = phys_active)
#> # A tibble: 10,000 × 1
#>    physically_active
#>    <fct>            
#>  1 No               
#>  2 No               
#>  3 No               
#>  4 <NA>             
#>  5 No               
#>  6 <NA>             
#>  7 <NA>             
#>  8 Yes              
#>  9 Yes              
#> 10 Yes              
#> # ℹ 9,990 more rows

We can now “read” these actions as:

Take the nhanes_small dataset and then select the “phys_active” column and then rename the “phys_active” column to “physically_active”.

7.8 Exercise: Practice what we’ve learned

Time: 10 minutes.

In the R/learning.R script file, complete the following tasks, running the code after completing each task. Notice that you aren’t actually changing the data each time, since you aren’t assigning (<-) the result to a new object.

  1. Copy and paste the code below into the script file. Replace the ___ in the select() function, with the columns bp_sys_ave, and education.

    R/learning.R
    nhanes_small %>% 
        select(___)
  2. Copy and paste the code below and fill out the blanks. Rename the bp_ variables so they don’t end in _ave, so they look like bp_sys and bp_dia. Tip: Recall that renaming is in the form new = old.

    R/learning.R
    nhanes_small %>% 
        rename(___ = ___,
               ___ = ___)
  3. Re-write this piece of code using the “pipe” operator:

    R/learning.R
    select(nhanes_small, bmi, contains("age"))
  4. Read through (in your head) the code below. How intuitive is it to read? Now, re-write this code so that you don’t need to create the temporary blood_pressure object by using the pipe, then re-read the revised version. Which do you feel is easier to “read”?

    R/learning.R
    blood_pressure <- select(nhanes_small, starts_with("bp_"))
    rename(blood_pressure, bp_systolic = bp_sys_ave)
  5. Run styler on the R/learning.R file with the Palette (Ctrl-Shift-P, then type “style file”).

  6. Lastly, add and commit these changes to the Git history with the RStudio Git interface using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).

Click for the solution. Only click if you are struggling or are out of time.
# 1. Select specific columns
nhanes_small %>%
    select(bp_sys_ave, education)

# 2. Rename columns
nhanes_small %>%
    rename(bp_sys = bp_sys_ave,
           bp_dia = bp_dia_ave)

# 3. Re-write with pipe
nhanes_small %>% 
    select(bmi, contains("age"))

# 4. Re-write with pipe
nhanes_small %>% 
    select(starts_with("bp_")) %>% 
    rename(bp_systolic = bp_sys_ave)

7.9 Filtering data by row

Filtering data by row is a very common activity in data analysis, especially if you want to get rid of outliers or to subset by a categorical group. The function to subset or filter is called filter(). filter() is distinct from select() in the sense that it operates on rows, whereas select() operates on columns.

The filter() function takes a logical condition (TRUE or FALSE). As with the other functions, the first argument is the dataset and all others are the logical conditions that will be applied to the rows. Rows where the logical condition evaluates to TRUE are kept, and rows where it evaluates to FALSE are dropped.

Warning: Since filter() uses logical conditions, you need to be really careful when writing the logic. As you probably know, humans are really, really bad at logic. If your logical condition starts getting complex, double and triple check that you know for certain that your logic is doing what you think it is doing. It’s very easy to make mistakes at this stage, even for advanced R users.

The simplest kind of logical condition is a test for “equality”. In R, “equal to” is represented by ==. For example, if we wanted to keep only people who are not physically active in the dataset, we would write the logical condition like this (we use select() afterwards to show the change):

R/learning.R
nhanes_small %>%
    filter(phys_active == "No") %>%
    select(phys_active)

We would “read” this code as:

Take the nhanes_small dataset and then filter so that only rows where phys_active is equal to “No” are kept (showing just the phys_active column).

When a row in the phys_active column has the value "No", that row is kept. Otherwise, it is dropped.

There are other logic comparisons to use. Table 7.1 can be used as a reference for logical conditions in R.

Table 7.1: Logical operators in R.
Operator   Description
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to
==         equal to
!=         not equal to
!x         not x (turns TRUE into FALSE, and FALSE into TRUE)
x | y      x OR y (TRUE if at least one of x or y is TRUE)
x & y      x AND y (TRUE only if both x and y are TRUE)

Let’s try out a few of these logical conditions with filter().

R/learning.R
# Participants who are physically active
nhanes_small %>%
    filter(phys_active != "No") %>%
    select(phys_active)
#> # A tibble: 4,649 × 1
#>    phys_active
#>    <fct>      
#>  1 Yes        
#>  2 Yes        
#>  3 Yes        
#>  4 Yes        
#>  5 Yes        
#>  6 Yes        
#>  7 Yes        
#>  8 Yes        
#>  9 Yes        
#> 10 Yes        
#> # ℹ 4,639 more rows
# Participants who have BMI equal to 25
nhanes_small %>%
    filter(bmi == 25) %>%
    select(bmi)
#> # A tibble: 35 × 1
#>      bmi
#>    <dbl>
#>  1    25
#>  2    25
#>  3    25
#>  4    25
#>  5    25
#>  6    25
#>  7    25
#>  8    25
#>  9    25
#> 10    25
#> # ℹ 25 more rows
# Participants who have BMI equal to or more than 25
nhanes_small %>%
    filter(bmi >= 25) %>%
    select(bmi)
#> # A tibble: 5,422 × 1
#>      bmi
#>    <dbl>
#>  1  32.2
#>  2  32.2
#>  3  32.2
#>  4  30.6
#>  5  27.2
#>  6  27.2
#>  7  27.2
#>  8  26.0
#>  9  26.2
#> 10  26.6
#> # ℹ 5,412 more rows

We use | (“or”) and & (“and”) when we want to combine conditions across columns. Be careful when combining logical conditions with these operators, as they can sometimes work differently than our human brains interpret them (speaking from experience). For &, both sides must be TRUE in order for the combination to be TRUE. For |, only one side needs to be TRUE in order for the combination to be TRUE. To see how they work, try these:

TRUE & TRUE
#> [1] TRUE
TRUE & FALSE
#> [1] FALSE
FALSE & FALSE
#> [1] FALSE
TRUE | TRUE
#> [1] TRUE
TRUE | FALSE
#> [1] TRUE
FALSE | FALSE
#> [1] FALSE

When used in filter(), combinations of logic conditions may look like this:

R/learning.R
# When BMI is 25 AND phys_active is No
nhanes_small %>%
    filter(bmi == 25 & phys_active == "No") %>%
    select(bmi, phys_active)
#> # A tibble: 11 × 2
#>      bmi phys_active
#>    <dbl> <fct>      
#>  1    25 No         
#>  2    25 No         
#>  3    25 No         
#>  4    25 No         
#>  5    25 No         
#>  6    25 No         
#>  7    25 No         
#>  8    25 No         
#>  9    25 No         
#> 10    25 No         
#> 11    25 No
# When BMI is 25 OR phys_active is No
nhanes_small %>%
    filter(bmi == 25 | phys_active == "No") %>%
    select(bmi, phys_active)
#> # A tibble: 3,701 × 2
#>      bmi phys_active
#>    <dbl> <fct>      
#>  1  32.2 No         
#>  2  32.2 No         
#>  3  32.2 No         
#>  4  30.6 No         
#>  5  28.5 No         
#>  6  25.8 No         
#>  7  35.8 No         
#>  8  31.4 No         
#>  9  31.4 No         
#> 10  25.8 No         
#> # ℹ 3,691 more rows

7.10 Modify or add columns

To modify an existing column or to add a new one, we can use the function called mutate(). You can use mutate() to compute a new variable using existing columns in your dataset. You can multiply all values in a certain column by 2, or combine columns into a new variable. Like the other functions, the first input is the dataset and the other arguments are columns to add or modify.

Take this example: in some analyses, age is needed in months rather than years. Since NHANES has age in years, to convert it to months we would multiply by 12. So, we would use mutate() with the following instruction:

age = age * 12

This form works like assignment in math: the result of the expression on the right-hand side is stored in the column named on the left-hand side. When used inside mutate(), it looks like this:

R/learning.R
nhanes_small %>%
    mutate(age = age * 12)
#> # A tibble: 10,000 × 8
#>      age sex      bmi diabetes phys_active bp_sys_ave bp_dia_ave
#>    <dbl> <fct>  <dbl> <fct>    <fct>            <int>      <int>
#>  1   408 male    32.2 No       No                 113         85
#>  2   408 male    32.2 No       No                 113         85
#>  3   408 male    32.2 No       No                 113         85
#>  4    48 male    15.3 No       <NA>                NA         NA
#>  5   588 female  30.6 No       No                 112         75
#>  6   108 male    16.8 No       <NA>                86         47
#>  7    96 male    20.6 No       <NA>               107         37
#>  8   540 female  27.2 No       Yes                118         64
#>  9   540 female  27.2 No       Yes                118         64
#> 10   540 female  27.2 No       Yes                118         64
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>

As with filter(), you can add or modify several columns in one call by separating them with ,. So let’s do that and add a new column (e.g., log-transforming BMI):

R/learning.R
nhanes_small %>%
    mutate(age = age * 12,
           logged_bmi = log(bmi))
#> # A tibble: 10,000 × 9
#>      age sex      bmi diabetes phys_active bp_sys_ave bp_dia_ave
#>    <dbl> <fct>  <dbl> <fct>    <fct>            <int>      <int>
#>  1   408 male    32.2 No       No                 113         85
#>  2   408 male    32.2 No       No                 113         85
#>  3   408 male    32.2 No       No                 113         85
#>  4    48 male    15.3 No       <NA>                NA         NA
#>  5   588 female  30.6 No       No                 112         75
#>  6   108 male    16.8 No       <NA>                86         47
#>  7    96 male    20.6 No       <NA>               107         37
#>  8   540 female  27.2 No       Yes                118         64
#>  9   540 female  27.2 No       Yes                118         64
#> 10   540 female  27.2 No       Yes                118         64
#> # ℹ 9,990 more rows
#> # ℹ 2 more variables: education <fct>, logged_bmi <dbl>

We can also create a column whose values depend on a logical condition by using if_else(). Use Table 7.1 to help with writing the condition.

R/learning.R
nhanes_small %>%
    mutate(old = if_else(age >= 30, "Yes", "No"))
#> # A tibble: 10,000 × 9
#>      age sex      bmi diabetes phys_active bp_sys_ave bp_dia_ave
#>    <int> <fct>  <dbl> <fct>    <fct>            <int>      <int>
#>  1    34 male    32.2 No       No                 113         85
#>  2    34 male    32.2 No       No                 113         85
#>  3    34 male    32.2 No       No                 113         85
#>  4     4 male    15.3 No       <NA>                NA         NA
#>  5    49 female  30.6 No       No                 112         75
#>  6     9 male    16.8 No       <NA>                86         47
#>  7     8 male    20.6 No       <NA>               107         37
#>  8    45 female  27.2 No       Yes                118         64
#>  9    45 female  27.2 No       Yes                118         64
#> 10    45 female  27.2 No       Yes                118         64
#> # ℹ 9,990 more rows
#> # ℹ 2 more variables: education <fct>, old <chr>
Tip

Recall that the original dataset doesn’t change. If we want the added variable to be saved, we must assign it to something with <-. Putting it all together, you would enter something like this (no need to type this out yourself):

R/learning.R
nhanes_update <- nhanes_small %>%
    mutate(old = if_else(age >= 30, "Yes", "No"))

Before moving on, run styler with the Palette (Ctrl-Shift-P, then type “style file”) and commit all the files changes to the Git history with the RStudio Git interface using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).

7.11 Exercise: Piping, filtering, and mutating

Time: 20 minutes.

Copy and paste the code below into the learning.R script file.

R/learning.R
# 1. BMI between 20 and 40 with diabetes
nhanes_small %>%
    # Format should follow: variable >= number or character
    filter(___ >= ___ & ___ <= ___ & ___ == ___)

# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # Specifying dataset
    mutate(
        # 2. Calculate mean arterial pressure
        ___ = ___,
        # 3. Create young_child variable using a condition
        ___ = if_else(___, "Yes", "No")
    )

nhanes_modified

Then, start replacing the ___ with the appropriate code to complete the tasks below. (Hint: Create a new “Section” in the R script for this exercise by using Ctrl-Shift-R or with the Palette (Ctrl-Shift-P, then type “code section”)).

  1. Filter nhanes_small so that only participants with a BMI greater than or equal to 20 and less than or equal to 40, and who have diabetes, are kept.

  2. Create a new variable called mean_arterial_pressure by applying the formula:

    \[((2 \times DBP) + SBP) / 3\]

    (DBP = bp_dia_ave and SBP = bp_sys_ave) to calculate Mean Arterial Pressure. Hint: In R, use + to add, * to multiply, and / to divide.

  3. Create a new variable called young_child for cases where age is less than 6 years.

  4. Finally, add and commit these changes to the Git history with the RStudio Git Interface. Push to GitHub to synchronize with your GitHub repository.

Click for the solution. Only click if you are struggling or are out of time.
# 1. BMI between 20 and 40, with diabetes
nhanes_small %>%
    filter(bmi >= 20 & bmi <= 40 & diabetes == "Yes")

# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # dataset
    mutate(
        mean_arterial_pressure = ((2 * bp_dia_ave) + bp_sys_ave) / 3,
        young_child = if_else(age < 6, "Yes", "No")
    )

nhanes_modified

7.12 Split-apply-combine: Summarizing data

Reading task: ~5 minutes

Summarising or applying simple (or complex) statistics to data is a key component of any analysis. Simple summaries or statistics can be done either on all of the data or on groups of it. There are many data analysis tasks that can be approached using the split-apply-combine method, which involves splitting the data into groups, applying some analysis to each group, and then combining the results together.

In dplyr, you can use the function called summarise() to summarise on all the data. If you want to do a split-apply-combine analysis to, e.g., find the max height by education status, you would use the functions group_by() and then summarise(). Using group_by() splits the data up and summarise() applies an analysis, then immediately combines it back together.

The first position argument to group_by() is, as usual, the dataset. The next arguments are the columns that contain the values you want to group by; these columns should contain categorical data (e.g. education). group_by() tells R to compute the next operations on the data within each grouping, rather than on all the data.

As with the other functions, summarise() takes the dataset as the first position argument. The next arguments work similar to the arguments in mutate() with one difference: the output must create a single value (e.g. a mean). Like mutate(), you can add multiple “summaries” by adding new columns separated by ,. You would use summarise() to derive basic descriptive statistics of a certain variable, including min(), max(), mean(), median(), or sd() (standard deviation).

The group_by() function doesn’t do anything by itself so should always be used in combination with a summarise(), mutate(), arrange(), or other function. However, the summarise() function can be used on its own.

7.13 Calculating summary statistics

Let’s calculate the maximum value of the BMI variable. See what happens when you enter the following:

R/learning.R
nhanes_small %>%
    summarise(max_bmi = max(bmi))
#> # A tibble: 1 × 1
#>   max_bmi
#>     <dbl>
#> 1      NA

We get back a result of NA, which means “missing”. In R, NA values “propagate”, meaning that if there is one value missing, then the max or mean will also be missing. So, we need to tell max() to exclude any NA values from the calculation using the argument na.rm = TRUE.
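You can see this propagation with a small vector, outside of the NHANES data (just for illustration):

# One missing value makes the whole summary missing...
max(c(1, 2, NA))
#> [1] NA
# ...unless we tell R to remove the NA values first
max(c(1, 2, NA), na.rm = TRUE)
#> [1] 2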

R/learning.R
nhanes_small %>%
    summarise(max_bmi = max(bmi, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   max_bmi
#>     <dbl>
#> 1    81.2

To calculate another summary statistic, you would add another summary column using ,:

R/learning.R
nhanes_small %>%
    summarise(max_bmi = max(bmi, na.rm = TRUE),
              min_bmi = min(bmi, na.rm = TRUE))
#> # A tibble: 1 × 2
#>   max_bmi min_bmi
#>     <dbl>   <dbl>
#> 1    81.2    12.9

Before you start the following exercise, add and commit changes to the Git history with the RStudio Git interface.

7.14 Summary statistics by a group

While the summarise() function is useful enough on its own, it really shines when combined with group_by().

Let’s use these functions to find out the mean age and BMI between those with and without diabetes.

R/learning.R
nhanes_small %>%
    group_by(diabetes) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 3 × 3
#>   diabetes mean_age mean_bmi
#>   <fct>       <dbl>    <dbl>
#> 1 No          35.4      26.2
#> 2 Yes         59.2      32.6
#> 3 <NA>         1.20     29.6

Quick note: If you are using a version of dplyr >= 1.0.0, you’ll get a message informing you that it is regrouping output. This is a notification and can be ignored. If you don’t want the message displayed, you can add options(dplyr.summarise.inform = FALSE) to the top of your script and run it.
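If you do want to hide that message, the line mentioned above would look like this at the top of your script:

# Silence the "regrouping output" message from summarise()
options(dplyr.summarise.inform = FALSE)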

The output also contains a row for those with a missing diabetes status (NA), so we need to remove rows where the diabetes value is missing.

R/learning.R
nhanes_small %>%
    # Recall ! means "NOT", so !is.na means "is not missing"
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 2 × 3
#>   diabetes mean_age mean_bmi
#>   <fct>       <dbl>    <dbl>
#> 1 No           35.4     26.2
#> 2 Yes          59.2     32.6

Cool! We can add more columns to the grouping, so let’s do that. Let’s compare mean age and BMI by physical activity and diabetes status.

R/learning.R
nhanes_small %>%
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, phys_active) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 6 × 4
#>   diabetes phys_active mean_age mean_bmi
#>   <fct>    <fct>          <dbl>    <dbl>
#> 1 No       No             46.2      28.8
#> 2 No       Yes            37.9      26.8
#> 3 No       <NA>            6.02     17.9
#> 4 Yes      No             61.7      33.4
#> 5 Yes      Yes            55.7      31.2
#> 6 Yes      <NA>           13.7      26.1

Since we don’t need the dataset grouped anymore, it’s good practice to end the grouping with ungroup().

R/learning.R
nhanes_small %>%
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, phys_active) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE)) %>% 
    ungroup()
#> # A tibble: 6 × 4
#>   diabetes phys_active mean_age mean_bmi
#>   <fct>    <fct>          <dbl>    <dbl>
#> 1 No       No             46.2      28.8
#> 2 No       Yes            37.9      26.8
#> 3 No       <NA>            6.02     17.9
#> 4 Yes      No             61.7      33.4
#> 5 Yes      Yes            55.7      31.2
#> 6 Yes      <NA>           13.7      26.1

Before you start the following exercise, run styler using the Palette (Ctrl-Shift-P, then type “style file”) and commit changes to the Git history using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).

7.15 Saving datasets as files

Go over how to save data to a file, but don’t cover the here package right away. Let them read about it in the section below and then briefly cover it again.

The nhanes_small data frame you created exists only in your current R session, since you created it with code from NHANES. If you want to access it again later without re-running the wrangling code, you can save it as a .csv file in your data/ folder using the function readr::write_csv().

R/learning.R
readr::write_csv(
  nhanes_small,
  here::here("data/nhanes_small.csv")
)
Reading task: ~5 minutes

There’s a new thing here()! The here package uses a function called here() to make it easier to manage file paths within an R Project.

So, what is a file path and why is the here package necessary? A file path is the list of folders a file is found in. For instance, your CV may be found in /Users/Documents/personal_things/CV.docx. The problem with file paths when running code (like with R) is this: when you run a script interactively (e.g. what we do in class and normally), the working directory of the R session is at the Project level (where the .Rproj file is found). You can see the working directory by looking at the top of the RStudio Console.

But! When you source() an R script or otherwise run it non-interactively, the code will likely run with the working directory set to the folder the script is saved in, e.g. the R/ folder. In that case, the file path data/nhanes_small.csv won’t work, because there is no folder called data/ inside the R/ folder.

LearningR <-- R Project working directory starts here.
├── R
│   ├── README.md
│   └── learning.R <-- Working directory when running not interactively.
├── data
│   └── README.md
├── data-raw
│   └── README.md
├── doc
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearningR.Rproj <-- here() moves file path to start in this file's folder.
├── README.md
└── TODO.md

Often people use the function setwd() in scripts, but this is never a good idea since using it makes your script runnable only on your computer… which makes it no longer reproducible. We use the here() function to tell R to go to the project root (where the .Rproj file is found) and then use that file path. This simple function can make your work more reproducible and easier for you to use later on.
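To see what here() does, you can run it on its own in the Console. It simply builds the full file path, starting from the folder that contains the .Rproj file (the exact path it prints depends on where your project lives on your computer):

# Builds the complete path from the project root, no matter which
# folder the code happens to run from
here::here("data/nhanes_small.csv")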

For many projects, it isn’t necessary or advisable to save every single data object you create. It’s better to let the code create the data you’ll use rather than saving each new wrangled dataset you might create. However, sometimes you will want or need to save the dataset you’ve been working on, perhaps because you’ve done a lot of work to prepare it for later analyses, or because you’ve run an analysis and want to save the results. In these cases, you should definitely save the new cleaned dataset.

7.16 Loading in a dataset

Reading task: ~5 minutes

We’ve been using a teaching dataset that we load from a package, mainly so that we can focus on getting familiar with data wrangling. However, there will come a time when you want to wrangle your own data. There are several ways to load a dataset into R, listed below. Don’t run this code, just read it for now.

  1. Using the RStudio menu File -> Import Dataset -> From Text/Excel/SPSS/SAS/Stata (depending on your file type you want to import).

  2. If the file is a .csv file, use readr::read_csv() to import the dataset, for instance with the nhanes_small:

    nhanes_small <- readr::read_csv(here::here("data/nhanes_small.csv"))
  3. If the dataset is a .rda file, use load():

    load(here::here("data/dataset_name.rda"))    

    This loads the dataset into your R session so that you can use it again.

For SAS, SPSS, or Stata files, you can use the package haven to import those types of data files into R.
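As a rough sketch (the file names below are made up for illustration), importing those file types with haven looks like this:

# SAS, SPSS, and Stata files, respectively
imported_sas <- haven::read_sas(here::here("data-raw/dataset.sas7bdat"))
imported_spss <- haven::read_sav(here::here("data-raw/dataset.sav"))
imported_stata <- haven::read_dta(here::here("data-raw/dataset.dta"))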

7.17 Summary

  • With tidy data, each variable has its own column, each observation has its own row, and each value has its own cell.
  • Loading the tidyverse package gives you access to multiple packages for tidying up and wrangling data.
  • Never edit raw data manually. Instead, use R code to make changes and clean up the raw data, so that you keep a record of what was done.
  • Use the functions select(), rename(), filter(), mutate() (“change or modify”), and summarise() from the dplyr package to wrangle your data.
  • Use the pipe (%>%) to write easy-to-read code, similar to reading a text consisting of multiple sentences.