7 Data management and wrangling
Session objectives:
Learn the difference between “messy” and “tidy” data, including how to create tidy data to simplify your analysis.
Perform simple transformations and subsetting of datasets, such as:
Subset specific columns and rows of a dataset, with select() and filter().
Sort rows of a dataset by a specific column, with arrange().
Create new or transform existing columns in a dataset, with mutate().
Calculate simple data summaries, with summarise().
Learn about and apply the “split-apply-combine” method of analyses, with group_by() and summarise().
Write “tidier” and more readable code by using the pipe (%>%) operator.
7.1 “Messy” vs. “tidy” data
Instructor note
This session usually takes a fair amount of time relative to the other sessions. The R Markdown session comes right afterwards and is usually finished more quickly than the other sessions, so it’s fine if this session runs over into the R Markdown session’s time.
For the reading section, emphasize the characteristics of a “tidy” dataset.
Reading task: ~10 minutes
The concept of “tidy” data was popularized in an article by Hadley Wickham and is described in more detail in the Tidy Data chapter of the R for Data Science online book. Before we continue with tidy data, we need to cover something related that has already come up in this course: the tidyverse. The tidyverse is an ecosystem of R packages that are designed to work well together and that all follow a strong “design philosophy” and a common style guide. This makes combining packages from the tidyverse much easier, and it is why we teach the tidyverse in this course.
Ok, back to “tidy data”. A tidy dataset is when:
Each variable has its own column (e.g. “Body Weight”).
Each observation has its own row (e.g. “Person”).
Each value has its own cell (e.g. “Body weight for a person at a specific date”).
Take a look at the example “tidy” and “messy” data frames (also called “tibbles” in the tidyverse) below. Think about why each may be considered “tidy” or “messy”. What do you notice between the tidy versions and the messier versions?
# Datasets come from tidyr

# Tidy:
table1
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# Partly tidy:
table2
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <dbl> <chr> <dbl>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
# Messier:
table3
#> # A tibble: 6 × 3
#> country year rate
#> <chr> <dbl> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Messy:
table4a
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> <chr> <dbl> <dbl>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
# Messy:
table4b
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> <chr> <dbl> <dbl>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
The “most” tidy version is table1, as each column describes its values (e.g. population is population size), each row is a unique observation (e.g. the first row holds the values for Afghanistan in 1999), and each cell is an explicit value representative of its column and row.
table2 is a “long” version of table1 so it is partly “tidy”, but it doesn’t satisfy the rule that each variable has a column, since count represents both cases and population size.
On the other hand, table3 is messy because the rate column values are a composite of two other columns’ values (cases and population), when each cell should contain a single value (e.g. a calculated rate). Both table4a and table4b have columns with ambiguous values inside. For example, you can’t tell from the data alone what the values in the 1999 column represent.
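As an aside, layouts like these can be fixed with functions from the tidyr package. For example, table3’s composite rate column can be split into its two components. This is a sketch using tidyr’s separate() function, which isn’t covered in this session:

```r
library(tidyr)

# Split the composite "rate" column at the "/" into two columns,
# converting the resulting text into numbers
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
```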
Tidy data has a few notable benefits:
Time spent preparing your data to be tidy from the beginning can save days of added work and frustration in the long run.
“Tidy data” is a conceptual framework that allows you to easily build off and wrangle (i.e. “manipulate”, “clean up”, “manage”) data in simpler and easy-to-interpret ways, especially when used within the framework of the tidyverse.
The concept of tidy data also gives rise to “tidy code” for wrangling. By using “verbs” (R functions) and chaining them together in “sentences” (in a sequential pipeline), you can construct meaningful and readable code that describes in plainer English what you are doing to the data. This is one simple way that you can enhance the reproducibility of your code.
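To make the “verbs chained into sentences” idea concrete, here is a hedged sketch. The dataset and column names are placeholders, and the functions used are covered later in this session:

```r
# Take the dataset, then keep certain rows, then calculate a summary:
dataset %>%
  filter(group == "A") %>%
  summarise(mean_value = mean(value))
```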
7.2 Managing and working with data in R
Instructor note
After they read it, go over and emphasize the functions shown in the table that we will be teaching.
Reading task: ~5 minutes
When working with data, there are a few principles to follow:
You should always save your original raw dataset in the data-raw/ folder.
Note: Whether or not you save data to data-raw/ depends on how you collected the data and how many collaborators are on your team. You may end up storing and processing the data in another folder as a project of its own.
Never edit your raw data directly; instead, save any edited version in a separate location.
Only work with your raw data using R code. Don’t manually edit it. Manual editing doesn’t leave a history of what you’ve done to it, so you can’t go back and see what you’ve done. Always keep a history of any changes you’ve made to the data, preferably by using R code.
Save the edited data as another dataset and store it in the data/ folder.
When wrangling your data with R code, make sure to:
Document what you did to your data and why you did it, to help you remember later on (by using hashes to indicate comments and non-code text).
Write the code in a way that is as descriptive as possible and is readable enough to understand what is being done to the data.
Keep the code simple: Don’t be clever, be clear. Clear code is easier to understand than clever and sometimes overly-complex code.
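A minimal sketch of these principles in code. The file and column names here are hypothetical; read_csv() and write_csv() come from the readr package in the tidyverse:

```r
library(tidyverse)

# Load the raw data; never edit the file in data-raw/ by hand
participants_raw <- read_csv("data-raw/participants.csv")

# Keep a history of changes in code, with a comment on why:
# remove test entries that were recorded before the study started
participants <- filter(participants_raw, entry_date >= "2020-01-01")

# Save the edited data as a separate dataset in the data/ folder
write_csv(participants, "data/participants.csv")
```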
In data wrangling, most tasks can be expressed by a few simple “verbs” (actions). Wrangling here refers to maneuvering, managing, controlling, and turning your data around to clean it up, better understand it, and prepare it for later analyses. The table below lists some common “verbs” from the dplyr package that come from the tidyverse:
List of common data wrangling tasks, along with an example and the dplyr function used for the task.

Task                       Example                                                Function
Select columns             Remove data entry columns, such as the name of the     select()
                           person who entered the data.
Filter rows                Keep only participants who are physically active.      filter()
Rename columns             Rename a column called "Gender" to "sex".              rename()
Transform or add columns   Convert age in years to age in months.                 mutate()
Sort rows                  Sort rows by age, from youngest to oldest.             arrange()
Summarise data             Calculate the mean and maximum age of participants.    summarise()
Tip: Sometimes you need to do some complicated wrangling to get your data “in shape” for later analyses. To save some time, you could save the wrangled data as an “output” dataset in the data/ folder. That way, you can easily use it again later rather than having to run the wrangling code every time you want to work with the data.
7.3 Load the packages and dataset
We’re going to use the US NHANES dataset to demonstrate some wrangling functions. There is an R package called {NHANES} that contains a teaching version of the original dataset, so we’ll use that for this course.
First, make sure that you have the LearningR R Project open. Second, open the R/learning.R script and write code to load the {NHANES} package. It should look like this now:

library(NHANES)
There are many ways of loading data into R. This method loads data that is included in a package, in this case {NHANES}. We will cover how to load a dataset from a file at the end of this session.
Then, below the library(), let’s take a glimpse() at the NHANES dataset. Run this function by hitting Ctrl-Enter while the cursor is on the code to send it from the script to the R Console.
# Briefly glimpse contents of dataset
glimpse(NHANES)
This gives us a quick overview of what variables are in the dataset and the data types that they are.
7.4 Selecting specific columns in a dataset
Selecting columns of a dataset is a very common data wrangling task. The dplyr function for this task is appropriately called select(). You would use select() to extract one or more variables in a dataset that you want to have a closer look at or to save as a new data frame to work with. It may be that you wish to explore the clinical characteristics of your study sample, so you might select some basic demographic variables (e.g., the Age column) and clinical variables (e.g., the Weight and Height columns) to perform these analyses.
For the input arguments, select() takes the dataset as the first argument, which is the first input position right after the opening parenthesis (, and then takes the names of the columns you want to select. The argument after the data argument is ..., which indicates that you can add as many columns as you want, separated by commas.
# Select one column by its name, without quotes
select(NHANES, Age)
# To *exclude* a column, use minus (-)
select(NHANES, -HeadCirc)
#> # A tibble: 10,000 × 75
#> ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3
#> <int> <fct> <fct> <int> <fct> <int> <fct> <fct>
#> 1 51624 2009_10 male 34 " 30-39" 409 White <NA>
#> 2 51624 2009_10 male 34 " 30-39" 409 White <NA>
#> 3 51624 2009_10 male 34 " 30-39" 409 White <NA>
#> 4 51625 2009_10 male 4 " 0-9" 49 Other <NA>
#> 5 51630 2009_10 female 49 " 40-49" 596 White <NA>
#> 6 51638 2009_10 male 9 " 0-9" 115 White <NA>
#> 7 51646 2009_10 male 8 " 0-9" 101 White <NA>
#> 8 51647 2009_10 female 45 " 40-49" 541 White <NA>
#> 9 51647 2009_10 female 45 " 40-49" 541 White <NA>
#> 10 51647 2009_10 female 45 " 40-49" 541 White <NA>
#> # ℹ 9,990 more rows
#> # ℹ 67 more variables: Education <fct>, MaritalStatus <fct>,
#> # HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>,
#> # HomeRooms <int>, HomeOwn <fct>, Work <fct>, Weight <dbl>,
#> # Length <dbl>, Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>,
#> # BMI_WHO <fct>, Pulse <int>, BPSysAve <int>, BPDiaAve <int>,
#> # BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, …
If some of your columns have similar patterns for naming at the beginning, middle, or end of the name, you can use some helper functions to choose these columns. Use ?select_helpers (choose the “Select helpers” option in the menu that pops up) to read more about these functions and to get help on them. Some commonly used helpers are:
starts_with(): Select columns that begin with a pattern.
ends_with(): Select columns that end with a pattern.
contains(): Select columns that contain a pattern.
# All columns starting with letters "BP" (blood pressure)
select(NHANES, starts_with("BP"))
# All columns ending in letters "Day"
select(NHANES, ends_with("Day"))
#> # A tibble: 10,000 × 3
#> TVHrsDay CompHrsDay AlcoholDay
#> <fct> <fct> <int>
#> 1 <NA> <NA> NA
#> 2 <NA> <NA> NA
#> 3 <NA> <NA> NA
#> 4 <NA> <NA> NA
#> 5 <NA> <NA> 2
#> 6 <NA> <NA> NA
#> 7 <NA> <NA> NA
#> 8 <NA> <NA> 3
#> 9 <NA> <NA> 3
#> 10 <NA> <NA> 3
#> # ℹ 9,990 more rows
# All columns containing letters "Age"
select(NHANES, contains("Age"))
#> # A tibble: 10,000 × 9
#> Age AgeDecade AgeMonths DiabetesAge Age1stBaby SmokeAge
#> <int> <fct> <int> <int> <int> <int>
#> 1 34 " 30-39" 409 NA NA 18
#> 2 34 " 30-39" 409 NA NA 18
#> 3 34 " 30-39" 409 NA NA 18
#> 4 4 " 0-9" 49 NA NA NA
#> 5 49 " 40-49" 596 NA 27 38
#> 6 9 " 0-9" 115 NA NA NA
#> 7 8 " 0-9" 101 NA NA NA
#> 8 45 " 40-49" 541 NA NA NA
#> 9 45 " 40-49" 541 NA NA NA
#> 10 45 " 40-49" 541 NA NA NA
#> # ℹ 9,990 more rows
#> # ℹ 3 more variables: AgeFirstMarij <int>, AgeRegMarij <int>,
#> # SexAge <int>
You’ll notice that running these functions doesn’t actually change the data itself. When you run a function without assigning it using <-, the only action the function does is to send the output to your screen, and you won’t have saved that data anywhere for later use. But if you want to create a new dataset with only the columns you selected, you’ll need to assign the selected dataset to a new object.
The full NHANES dataset is 10,000 individuals (rows) with 76 parameters (columns). To make it easier to look at and since we are only interested in some of these parameters, we will subset the large dataset into something smaller and save it for later use as a new dataset.
# Save the selected columns as a new data frame
# Recall the style guide for naming objects
nhanes_small <- select(
  NHANES, Age, Gender, BMI, Diabetes,
  PhysActive, BPSysAve, BPDiaAve, Education
)

# View the new data frame
nhanes_small
#> # A tibble: 10,000 × 8
#> Age Gender BMI Diabetes PhysActive BPSysAve BPDiaAve
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 34 male 32.2 No No 113 85
#> 2 34 male 32.2 No No 113 85
#> 3 34 male 32.2 No No 113 85
#> 4 4 male 15.3 No <NA> NA NA
#> 5 49 female 30.6 No No 112 75
#> 6 9 male 16.8 No <NA> 86 47
#> 7 8 male 20.6 No <NA> 107 37
#> 8 45 female 27.2 No Yes 118 64
#> 9 45 female 27.2 No Yes 118 64
#> 10 45 female 27.2 No Yes 118 64
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: Education <fct>
7.5 Renaming all column names based on the style guide
In the interests of keeping data tidy and matching the style guide, we should change the column names to be all lower case with _ for spaces between words. There’s a package that can do that for us called snakecase.
To change all the column names to snakecase, we’ll use the function rename_with(). This function takes the data as the first argument but the second argument needs to be a function, which in our case is called snakecase::to_snake_case(), but exclude the () at the end. This function will rename all columns.
# Rename all columns to snake case
nhanes_small <- rename_with(nhanes_small, snakecase::to_snake_case)

# Have a look at the data frame
nhanes_small
#> # A tibble: 10,000 × 8
#> age gender bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 34 male 32.2 No No 113 85
#> 2 34 male 32.2 No No 113 85
#> 3 34 male 32.2 No No 113 85
#> 4 4 male 15.3 No <NA> NA NA
#> 5 49 female 30.6 No No 112 75
#> 6 9 male 16.8 No <NA> 86 47
#> 7 8 male 20.6 No <NA> 107 37
#> 8 45 female 27.2 No Yes 118 64
#> 9 45 female 27.2 No Yes 118 64
#> 10 45 female 27.2 No Yes 118 64
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>
7.6 Renaming specific columns
Depending on how your data was collected, it may have column names that aren’t very descriptive. So you’ll probably want to rename them to something more explanatory, which is particularly important if you’re sharing your work with others or in an environment where multiple people are working on the same data. We will use the function called rename(). Like select(), rename() takes the dataset as the first argument and then takes as many renaming arguments as you want (because the second argument position is ...). When renaming, it takes the form of newname = oldname.
The “gender” variable in the dataset actually describes “sex”, so let’s rename it to accurately reflect the data itself. Note that gender is a social construct, while sex is biological. This unfortunately is often incorrectly named in many scientific papers, where accuracy of terms is critical. Because we want to rename the variable in the dataset so it stays renamed, we need to assign it again to nhanes_small.
nhanes_small <- rename(nhanes_small, sex = gender)
nhanes_small
#> # A tibble: 10,000 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 34 male 32.2 No No 113 85
#> 2 34 male 32.2 No No 113 85
#> 3 34 male 32.2 No No 113 85
#> 4 4 male 15.3 No <NA> NA NA
#> 5 49 female 30.6 No No 112 75
#> 6 9 male 16.8 No <NA> 86 47
#> 7 8 male 20.6 No <NA> 107 37
#> 8 45 female 27.2 No Yes 118 64
#> 9 45 female 27.2 No Yes 118 64
#> 10 45 female 27.2 No Yes 118 64
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>
We’ve re-assigned the object nhanes_small multiple times. If you make a mistake, the easiest solution is to re-run the code from the start again, which should fix any issues.
Now, you can see that the column has been renamed from gender to sex. What if you wanted to select some columns and then rename some of them? Would you have to create a new data object every time? No! We can make use of a very powerful tool called piping, using the %>% operator.
7.7 Chaining functions with the pipe
Reading task: ~5 minutes
A key component of tidy data and tidy code is making use of the “pipe” operator, %>%. There is also the base R pipe |>, which works basically the same but with some small differences. For this course we will use %>%, though you can use either.
You would use the “pipe” operator when you are writing a piece of code with multiple operations or intermediate steps that require you to save and overwrite each step as an object (see below). One advantage of the “pipe” operator is that it will help to ensure that your code is less cluttered with redundant and temporary object names.
This operator allows you to “pipe” the output from one function to the input of another function, just like a plumbing pipe would do for water. This allows you to easily chain functions together into “sentences”. Let’s use an example based on English words for some action. This is the English sentence:
We need some eggs. Drive to the grocery store and buy some eggs before arriving home from work.
There are basically two actions here (“drive” and “buy”) with four inputs (“work”, “grocery store”, “eggs”, “home”), that are all based on the previous action. Since an action in R is a function, the functions would be drive() and buy(). In regular R (without the “pipe” operator), we would have to nest functions (reading from the inside to the outside) together to chain them:
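The original code chunk isn’t shown here; a sketch of that nested version, using the hypothetical drive() and buy() functions, might look like:

```r
# Hypothetical functions: read from the inside out.
# Starting at work, drive to the grocery store, buy eggs, then drive home.
drive(buy(drive(work, to = grocery_store), what = eggs), to = home)
```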
This nested form isn’t very “readable”: you have to read it from the innermost function outwards, or else save each step as a temporary object and re-name it with reference to the object before it. The pipe %>% operator can really simplify this:
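A sketch of the piped version, with the same hypothetical drive() and buy() functions:

```r
# The same actions, read left to right, in the order they happen
work %>%
  drive(to = grocery_store) %>%
  buy(what = eggs) %>%
  drive(to = home)
```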
Do you find this more readable and understandable? It reads how it would be done, in order of the steps taken.
The idea of piping is to read the functions from left to right. This can help clarify and break down complex data processing workflows, and is the basis for the tidyverse and many other packages.
The operator %>% takes the output from the object or function from the left of the operator and puts it into the function on the right of the operator. All input goes into the first position argument of the function. Within the tidyverse packages, all functions take a data frame (or vector) as the first argument in order to work with the pipe.
Let’s try this out on NHANES. The keyboard shortcut for the pipe is Ctrl-Shift-M (i.e., M for the magrittr package that created the pipe).
# These two ways are the same
colnames(nhanes_small)

nhanes_small %>%
  colnames()
The pipe automatically takes nhanes_small and puts it into the first position, so we don’t need to specify nhanes_small inside colnames() when piping.
Let’s try using the pipe with the select() and rename() functions from the previous sections. Remember, both select() and rename() take a dataset as the first argument, which makes them pipe-able.

nhanes_small %>%
  select(phys_active) %>%
  rename(physically_active = phys_active)
#> # A tibble: 10,000 × 1
#> physically_active
#> <fct>
#> 1 No
#> 2 No
#> 3 No
#> 4 <NA>
#> 5 No
#> 6 <NA>
#> 7 <NA>
#> 8 Yes
#> 9 Yes
#> 10 Yes
#> # ℹ 9,990 more rows
We can now “read” these actions as:
Take the nhanes_small dataset and then select the “phys_active” column and then rename the “phys_active” column to “physically_active”.
7.8 Exercise: Practice what we’ve learned
Time: 10 minutes.
In the R/learning.R script file, complete the following tasks.
Copy and paste the code below into the script file. Replace the ___ in the select() function with the columns bp_sys_ave and education.
nhanes_small %>%
  select(___)
Copy and paste the code below and fill out the blanks. Rename the bp_ variables so they don’t end in _ave, so they look like bp_sys and bp_dia. Tip: Recall that renaming is in the form new = old.
nhanes_small %>%
  rename(
    ___ = ___,
    ___ = ___
  )
Re-write this piece of code using the “pipe” operator:

select(nhanes_small, bmi, contains("age"))
Read through (in your head) the code below. How intuitive is it to read? Now, re-write this code so that you don’t need to create the temporary blood_pressure object by using the pipe, then re-read the revised version. Which do you feel is easier to “read”?

blood_pressure <- select(nhanes_small, starts_with("bp_"))
rename(blood_pressure, bp_systolic = bp_sys_ave)
Run styler on the R/learning.R file with Ctrl-Shift-P, then type “style file”.
Lastly, add and commit these changes to the Git history with the RStudio Git interface (Ctrl-Shift-P, then type “commit”).
Click for the solution. Only click if you are struggling or are out of time.
# 1. Select specific columns
nhanes_small %>%
  select(bp_sys_ave, education)

# 2. Rename columns
nhanes_small %>%
  rename(
    bp_sys = bp_sys_ave,
    bp_dia = bp_dia_ave
  )

# 3. Re-write with pipe
nhanes_small %>%
  select(bmi, contains("age"))

# 4. Re-write with pipe
nhanes_small %>%
  select(starts_with("bp_")) %>%
  rename(bp_systolic = bp_sys_ave)
7.9 Filtering data by row
Filtering data by row is a very common activity in data analysis, especially if you want to get rid of outliers or to subset by a categorical group. The function to subset or filter is called filter(). filter() is distinct from select() in the sense that it operates on rows, whereas select() operates on columns.
The filter() function takes one or more logical conditions (things that are TRUE or FALSE). As with the other functions, the first argument is the dataset, and the remaining arguments are the logical conditions that will apply to the row filtering. Rows where the conditions evaluate to TRUE are kept, and rows where they are FALSE are dropped.
Warning: Since filter() uses logical conditions, you need to be really careful when writing the logic. As you probably know, humans are really really bad at logic. If your logical condition starts getting complex, double and triple check that you know for certain that your logic code is doing what you think it is doing. It’s very easy to make mistakes at this stage, even for advanced R users.
The simplest kind of logical condition is a test for “equality”. In R, “equal to” is represented by ==. For example, if we wanted to keep only people who are not physically active in the dataset, we would use a logical condition like this:

nhanes_small %>%
  filter(phys_active == "No")
#> # A tibble: 3,677 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 34 male 32.2 No No 113 85
#> 2 34 male 32.2 No No 113 85
#> 3 34 male 32.2 No No 113 85
#> 4 49 female 30.6 No No 112 75
#> 5 33 male 28.5 No No 128 74
#> 6 60 male 25.8 No No 152 100
#> 7 38 male 35.8 No No 147 81
#> 8 44 male 31.4 Yes No 144 88
#> 9 44 male 31.4 Yes No 144 88
#> 10 59 female 25.8 Yes No 150 0
#> # ℹ 3,667 more rows
#> # ℹ 1 more variable: education <fct>
We would “read” this code as:
Take the nhanes_small dataset and then filter so that only rows where phys_active is equal to “No” are kept.
When a row in the phys_active column has the value "No", that row is kept. Otherwise, it is dropped.
There are other logical comparisons you can use. Table 7.1 can be used as a reference for logical conditions in R.

Table 7.1: Logical operators in R.

Operator   Description
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to
==         equal to
!=         not equal to
!x         not x (if x is TRUE, !x is FALSE)
x | y      x OR y
x & y      x AND y
Let’s try out a few of these logical conditions with filter().
# Participants who are physically active
nhanes_small %>%
  filter(phys_active != "No")
#> # A tibble: 4,649 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 45 female 27.2 No Yes 118 64
#> 2 45 female 27.2 No Yes 118 64
#> 3 45 female 27.2 No Yes 118 64
#> 4 66 male 23.7 No Yes 111 63
#> 5 58 male 23.7 No Yes 104 74
#> 6 54 male 26.0 No Yes 134 85
#> 7 58 female 26.2 No Yes 127 83
#> 8 50 male 26.6 No Yes 142 68
#> 9 16 male 24.7 No Yes 126 72
#> 10 56 female 19.7 No Yes 95 69
#> # ℹ 4,639 more rows
#> # ℹ 1 more variable: education <fct>
# Participants who have BMI equal to 25
nhanes_small %>%
  filter(bmi == 25)
#> # A tibble: 35 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 24 male 25 No Yes 109 61
#> 2 24 male 25 No Yes 109 61
#> 3 32 male 25 No No 104 73
#> 4 32 male 25 No No 104 73
#> 5 32 male 25 No No 104 73
#> 6 32 male 25 No No 104 73
#> 7 55 female 25 No Yes 106 63
#> 8 6 female 25 No <NA> NA NA
#> 9 53 male 25 No No 113 84
#> 10 57 female 25 No Yes 131 75
#> # ℹ 25 more rows
#> # ℹ 1 more variable: education <fct>
# Participants who have BMI equal to or more than 25
nhanes_small %>%
  filter(bmi >= 25)
#> # A tibble: 5,422 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 34 male 32.2 No No 113 85
#> 2 34 male 32.2 No No 113 85
#> 3 34 male 32.2 No No 113 85
#> 4 49 female 30.6 No No 112 75
#> 5 45 female 27.2 No Yes 118 64
#> 6 45 female 27.2 No Yes 118 64
#> 7 45 female 27.2 No Yes 118 64
#> 8 54 male 26.0 No Yes 134 85
#> 9 58 female 26.2 No Yes 127 83
#> 10 50 male 26.6 No Yes 142 68
#> # ℹ 5,412 more rows
#> # ℹ 1 more variable: education <fct>
We use the | (“or”) and & (“and”) when we want to combine conditions across columns. Be careful with these operators and when combining logic conditions, as they can sometimes work differently than our human brains interpret them (speaking from experience). For &, both sides must be TRUE in order for the combination to be TRUE. For |, only one side needs to be TRUE in order for the combination to be TRUE. To see how they work try these:
TRUE & TRUE
#> [1] TRUE
TRUE & FALSE
#> [1] FALSE
FALSE & FALSE
#> [1] FALSE
TRUE | TRUE
#> [1] TRUE
TRUE | FALSE
#> [1] TRUE
FALSE | FALSE
#> [1] FALSE
When used in filter(), combinations of logic conditions may look like this:
# When BMI is 25 AND phys_active is No
nhanes_small %>%
  filter(bmi == 25 & phys_active == "No")
#> # A tibble: 11 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 32 male 25 No No 104 73
#> 2 32 male 25 No No 104 73
#> 3 32 male 25 No No 104 73
#> 4 32 male 25 No No 104 73
#> 5 53 male 25 No No 113 84
#> 6 20 female 25 No No 111 76
#> 7 23 female 25 No No 125 70
#> 8 51 female 25 No No 144 94
#> 9 51 female 25 No No 144 94
#> 10 51 female 25 No No 144 94
#> 11 51 female 25 No No 144 94
#> # ℹ 1 more variable: education <fct>
# When BMI is 25 OR phys_active is No
nhanes_small %>%
  filter(bmi == 25 | phys_active == "No")
#> # A tibble: 3,701 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 34 male 32.2 No No 113 85
#> 2 34 male 32.2 No No 113 85
#> 3 34 male 32.2 No No 113 85
#> 4 49 female 30.6 No No 112 75
#> 5 33 male 28.5 No No 128 74
#> 6 60 male 25.8 No No 152 100
#> 7 38 male 35.8 No No 147 81
#> 8 44 male 31.4 Yes No 144 88
#> 9 44 male 31.4 Yes No 144 88
#> 10 59 female 25.8 Yes No 150 0
#> # ℹ 3,691 more rows
#> # ℹ 1 more variable: education <fct>
7.10 Arranging the rows of your data by column
You may want to sort your rows by a specific column so that values are arranged in ascending or descending order. This can be done using the function called arrange(). Again, arrange() takes the dataset as the first argument, followed by the columns that you wish to arrange data by. By default, arrange() orders in ascending order.
# Arranging data by age in ascending order
nhanes_small %>%
  arrange(age)
#> # A tibble: 10,000 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 0 female NA <NA> <NA> NA NA
#> 2 0 female NA <NA> <NA> NA NA
#> 3 0 male NA <NA> <NA> NA NA
#> 4 0 male NA <NA> <NA> NA NA
#> 5 0 female NA <NA> <NA> NA NA
#> 6 0 female NA <NA> <NA> NA NA
#> 7 0 female NA <NA> <NA> NA NA
#> 8 0 female NA <NA> <NA> NA NA
#> 9 0 female NA <NA> <NA> NA NA
#> 10 0 female NA <NA> <NA> NA NA
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>
arrange() also sorts columns of type character (or factor) alphabetically:

nhanes_small %>%
  arrange(education)
#> # A tibble: 10,000 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 64 female 27.2 Yes Yes 130 65
#> 2 59 female 25.8 Yes No 150 0
#> 3 59 female 25.8 Yes No 150 0
#> 4 56 male 25.0 Yes No 86 54
#> 5 78 female 22.6 Yes No 133 61
#> 6 31 female 29.0 No No 119 65
#> 7 31 female 29.0 No No 119 65
#> 8 26 female 25.9 No Yes 99 70
#> 9 26 female 25.9 No Yes 99 70
#> 10 26 female 25.9 No Yes 99 70
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>
If we want to order the column based on descending order, this can be done with desc().
# Arranging data by age in descending order
nhanes_small %>%
  arrange(desc(age))
#> # A tibble: 10,000 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 80 female 27.2 No No 172 86
#> 2 80 female 27.2 No No 172 86
#> 3 80 male 17.9 No No 132 60
#> 4 80 male 34.0 Yes No 125 56
#> 5 80 male 29.3 Yes Yes 149 75
#> 6 80 female 33.1 No No 117 55
#> 7 80 female 24.8 No No 138 35
#> 8 80 female 24.8 No No 138 35
#> 9 80 male 26.5 No No 113 65
#> 10 80 female 25.8 No No 137 73
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>
You can also order your data by multiple columns. For instance, we could arrange first by education and then by age.
# Arranging data by education then age in ascending order
nhanes_small %>%
  arrange(education, age)
#> # A tibble: 10,000 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 20 male 25.4 No Yes 115 67
#> 2 21 male 25.6 No No 110 52
#> 3 21 male 22.3 No No 107 67
#> 4 22 male 28.0 No No 135 70
#> 5 22 female 42.5 No No 94 72
#> 6 22 female 42.5 No No 94 72
#> 7 22 male 25.2 No No 123 63
#> 8 22 male 27.3 No No 115 78
#> 9 22 female 61.0 No No 103 64
#> 10 23 female 32.3 No No 118 55
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>
7.11 Transform or add columns
To modify an existing column or to add a new one, we can use the function called mutate(). You can use mutate() to compute a new variable using existing columns in your dataset. You can multiply all values in a certain column by 2, or combine columns into a new variable. Like the other functions, the first input is the dataset and the other arguments are columns to add or modify.
Take this example: In some analyses, sometimes age is reflected as months, not years. Since NHANES has age in years, to convert to months we would multiply by 12. So, we would use mutate() with the following instruction:
age = age * 12
This form is similar to how math works: the result of the right-hand side is put into the variable named on the left-hand side. When used inside mutate(), it looks like this:

nhanes_small %>%
  mutate(age = age * 12)
#> # A tibble: 10,000 × 8
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <dbl> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 408 male 32.2 No No 113 85
#> 2 408 male 32.2 No No 113 85
#> 3 408 male 32.2 No No 113 85
#> 4 48 male 15.3 No <NA> NA NA
#> 5 588 female 30.6 No No 112 75
#> 6 108 male 16.8 No <NA> 86 47
#> 7 96 male 20.6 No <NA> 107 37
#> 8 540 female 27.2 No Yes 118 64
#> 9 540 female 27.2 No Yes 118 64
#> 10 540 female 27.2 No Yes 118 64
#> # ℹ 9,990 more rows
#> # ℹ 1 more variable: education <fct>
Like with filter(), you can continue to add or modify more columns by separating the arguments with a ,. So let’s do that to add a new column (e.g., log-transforming BMI):

nhanes_small %>%
  mutate(
    age = age * 12,
    logged_bmi = log(bmi)
  )
#> # A tibble: 10,000 × 9
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <dbl> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 408 male 32.2 No No 113 85
#> 2 408 male 32.2 No No 113 85
#> 3 408 male 32.2 No No 113 85
#> 4 48 male 15.3 No <NA> NA NA
#> 5 588 female 30.6 No No 112 75
#> 6 108 male 16.8 No <NA> 86 47
#> 7 96 male 20.6 No <NA> 107 37
#> 8 540 female 27.2 No Yes 118 64
#> 9 540 female 27.2 No Yes 118 64
#> 10 540 female 27.2 No Yes 118 64
#> # ℹ 9,990 more rows
#> # ℹ 2 more variables: education <fct>, logged_bmi <dbl>
We can also assign different values based on a logical condition using if_else(). Use Table 7.1 to help with creating the logical condition.
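A sketch of such a call; note that the exact cutoff and labels are assumptions, since the output below only confirms a character column named old:

```r
# Label participants as "Yes"/"No" for old based on an age cutoff
# (the cutoff of 30 is an assumption for illustration)
nhanes_small %>%
    mutate(old = if_else(age >= 30, "Yes", "No"))
```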
#> # A tibble: 10,000 × 9
#> age sex bmi diabetes phys_active bp_sys_ave bp_dia_ave
#> <int> <fct> <dbl> <fct> <fct> <int> <int>
#> 1 34 male 32.2 No No 113 85
#> 2 34 male 32.2 No No 113 85
#> 3 34 male 32.2 No No 113 85
#> 4 4 male 15.3 No <NA> NA NA
#> 5 49 female 30.6 No No 112 75
#> 6 9 male 16.8 No <NA> 86 47
#> 7 8 male 20.6 No <NA> 107 37
#> 8 45 female 27.2 No Yes 118 64
#> 9 45 female 27.2 No Yes 118 64
#> 10 45 female 27.2 No Yes 118 64
#> # ℹ 9,990 more rows
#> # ℹ 2 more variables: education <fct>, old <chr>
Tip
Recall that the original dataset doesn’t change. If we want the added variable to be saved, we must assign the output to an object with <-. Putting it all together, you would enter something like this (no need to type this out yourself):
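A sketch of what "putting it all together" could look like (the object name nhanes_update is an assumption for illustration):

```r
# Save the modified dataset to a new object so the changes persist
# (nhanes_update is an assumed name)
nhanes_update <- nhanes_small %>%
    mutate(
        age = age * 12,
        logged_bmi = log(bmi)
    )
```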
Before moving on, run styler (Ctrl-Shift-P, then type “style file”) and commit all the files changes to the Git history with the RStudio Git interface (Ctrl-Shift-P, then type “commit”).
7.12 Exercise: Piping, filtering, and mutating
Time: 20 minutes.
Copy and paste the code below into the learning.R script file.
# 1. BMI between 20 and 40 with diabetes
nhanes_small %>%
    # Format should follow: variable >= number or character
    filter(___ >= ___ & ___ <= ___ & ___ == ___)

# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # Specifying dataset
    mutate(
        # 2. Calculate mean arterial pressure
        ___ = ___,
        # 3. Create young_child variable using a condition
        ___ = if_else(___, "Yes", "No")
    )
nhanes_modified
Then, start replacing the ___ with the appropriate code to complete the tasks below. (Hint: Create a new “Section” in the R script for this exercise by using Ctrl-Shift-R).
Filter nhanes_small so that it keeps only those participants with a BMI of at least 20 and at most 40 who have diabetes.
Create a new variable called mean_arterial_pressure by applying the formula:
\[((2 * DBP) + SBP) / 3\]
(DBP = bp_dia_ave and SBP = bp_sys_ave) to calculate Mean Arterial Pressure. Hint: In R, use + to add, * to multiply, and / to divide.
Create a new variable called young_child for cases where age is less than 6 years.
Finally, add and commit these changes to the Git history with the RStudio Git Interface. Push to GitHub to synchronize with your GitHub repository.
Click for the solution. Only click if you are struggling or are out of time.
# 1. BMI between 20 and 40, with diabetes
nhanes_small %>%
    filter(bmi >= 20 & bmi <= 40 & diabetes == "Yes")

# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # dataset
    mutate(
        mean_arterial_pressure = ((2 * bp_dia_ave) + bp_sys_ave) / 3,
        young_child = if_else(age < 6, "Yes", "No")
    )
nhanes_modified
7.13 Split-apply-combine: Summarizing data
Reading task: ~5 minutes
Summarising or applying simple (or complex) statistics to data is a key component of any analysis. Simple summaries or statistics can be done either on all of the data or on groups of it. There are many data analysis tasks that can be approached using the split-apply-combine method, which involves splitting the data into groups, applying some analysis to each group, and then combining the results together.
In dplyr, you can use the function called summarise() to summarise on all the data. If you want to do a split-apply-combine analysis to, e.g., find the max height by education status, you would use the functions group_by() and then summarise(). Using group_by() splits the data up and summarise() applies an analysis, then immediately combines it back together.
The first positional argument to group_by() is, as usual, the dataset. The next arguments are the columns that contain the values you want to group by. These columns must contain categorical data (e.g. education). group_by() tells R to compute the next operations on the data within each grouping, rather than on all the data.
As with the other functions, summarise() takes the dataset as the first positional argument. The next arguments work similarly to the arguments in mutate(), with one difference: the output must be a single value (e.g. a mean). Like mutate(), you can add multiple “summaries” by adding new columns separated by ,. You would use summarise() to derive basic descriptive statistics of a certain variable, including min(), max(), mean(), median(), or sd() (standard deviation).
The group_by() function doesn’t do anything by itself so should always be used in combination with a summarise(), mutate(), arrange(), or other function. However, the summarise() function can be used on its own.
7.14 Calculating summary statistics
Let’s calculate the maximum value of the BMI variable. See what happens when you enter the following:
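A minimal sketch of the call (the summary column name max_bmi is an assumption):

```r
# Try to calculate the maximum BMI; since bmi contains missing values,
# this returns NA
nhanes_small %>%
    summarise(max_bmi = max(bmi))
```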
We get back a result of NA, which means “missing”. In R, NA values “propagate”, meaning that if there is one value missing, then the max or mean will also be missing. So, we need to tell max() to exclude any NA values from the calculation using the argument na.rm = TRUE.
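A sketch of the same call with the missing values excluded:

```r
# Exclude NA values before taking the maximum
nhanes_small %>%
    summarise(max_bmi = max(bmi, na.rm = TRUE))
```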
Quick note: If you are using a version of dplyr >= 1.0.0, you’ll get a message informing you that it is regrouping output. This is a notification and can be ignored. If you don’t want the message displayed, you can add options(dplyr.summarise.inform = FALSE) to the top of your script and run it.
We also get a warning about there being missing values in diabetes, so we need to remove rows that have missing diabetes status values.
nhanes_small %>%
    # Recall ! means "NOT", so !is.na means "is not missing"
    filter(!is.na(diabetes)) %>%
    group_by(diabetes) %>%
    summarise(
        mean_age = mean(age, na.rm = TRUE),
        mean_bmi = mean(bmi, na.rm = TRUE)
    )
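You can also group by more than one column. The output below, which has both diabetes and phys_active columns, comes from adding phys_active to group_by(); a sketch:

```r
# Group by both diabetes status and physical activity before summarising
nhanes_small %>%
    filter(!is.na(diabetes)) %>%
    group_by(diabetes, phys_active) %>%
    summarise(
        mean_age = mean(age, na.rm = TRUE),
        mean_bmi = mean(bmi, na.rm = TRUE)
    )
```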
#> # A tibble: 6 × 4
#> diabetes phys_active mean_age mean_bmi
#> <fct> <fct> <dbl> <dbl>
#> 1 No No 46.2 28.8
#> 2 No Yes 37.9 26.8
#> 3 No <NA> 6.02 17.9
#> 4 Yes No 61.7 33.4
#> 5 Yes Yes 55.7 31.2
#> 6 Yes <NA> 13.7 26.1
Before you start the following exercise, run styler (Ctrl-Shift-P, then type “style file”) and commit changes to the Git history (Ctrl-Shift-P, then type “commit”).
7.16 Saving datasets as files
Instructor note
Go over how to save data to a file, but don’t cover the here package right away. Let them read about it in the section below and then briefly cover it again.
The nhanes_small data frame you created is only available after you’ve created it from NHANES, but if you want to access it later, you can save it as a .csv file in your data/ folder using the function readr::write_csv().
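A sketch of the saving step (the file name nhanes_small.csv is an assumption):

```r
# Save the wrangled dataset to the data/ folder of the project
# (file name assumed for illustration)
readr::write_csv(
    nhanes_small,
    here::here("data/nhanes_small.csv")
)
```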
There’s a new thing here()! The here package uses a function called here() to make it easier to manage file paths within an R Project.
So, what is a file path and why is this here package necessary? A file path is the list of folders a file is found in. For instance, your CV may be found in /Users/Documents/personal_things/CV.docx. The problem with file paths when running code (like with R) is this: when you run a script interactively (as we do in class, and as you normally would), the working directory of the R session is located at the Project level (where the .Rproj file is found). You can see the working directory by looking at the top of the RStudio Console.
But! When you source() an R script or otherwise run it non-interactively, the R code will likely run in the folder the script is saved in, e.g. in the R/ folder. So your file path data/nhanes_small.csv won’t work, because there isn’t a folder called data/ inside the R/ folder.
LearningR <-- R Project working directory starts here.
├── R
│ ├── README.md
│ └── learning.R <-- Working directory when running non-interactively.
├── data
│ └── README.md
├── data-raw
│ └── README.md
├── doc
│ └── README.md
├── .gitignore
├── DESCRIPTION
├── LearningR.Rproj <-- here() moves file path to start in this file's folder.
├── README.md
└── TODO.md
Often people use the function setwd() in scripts, but this is never a good idea since using it makes your script runnable only on your computer… which makes it no longer reproducible. We use the here() function to tell R to go to the project root (where the .Rproj file is found) and then use that file path. This simple function can make your work more reproducible and easier for you to use later on.
For many projects, it isn’t necessary or advisable to save every single data object you create. It’s better to let the code create the data you’ll use rather than saving each new wrangled dataset you might create. However, sometimes you will want or need to save the dataset you’ve been working on, perhaps because you’ve done a lot of work to prepare it for later analyses, or because you’ve run an analysis and want to save the results. In these cases, you should definitely save the new cleaned dataset.
7.17 Loading in a dataset
Reading task: ~5 minutes
We’ve been using a teaching dataset that we load from a package, mainly so that we can focus on getting familiar with data wrangling. However, there will come a time when you want to wrangle your own data. There are several ways to load in a dataset, with the most common being:
Using the RStudio menu File -> Import Dataset -> From Text/Excel/SPSS/SAS/Stata (depending on your file type you want to import).
If the file is a .csv file, use readr::read_csv() to import the dataset, for instance with nhanes_small:
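A sketch of the import, assuming the file was previously saved to data/nhanes_small.csv as above:

```r
# Read the saved dataset back in, using here() to build the file path
# from the project root (file name assumed for illustration)
nhanes_small <- readr::read_csv(
    here::here("data/nhanes_small.csv")
)
```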