docs/learning.R
select(post_meal_data, Age)
# A tibble: 31 × 1
Age
<dbl>
1 50
2 51
3 43
4 55
5 53
6 51
7 48
8 35
9 54
10 52
# ℹ 21 more rows
This session’s overall learning outcome is to:
Specific objectives are to:
Time: ~5 minutes
A large number of tasks in data wrangling can be broken down and expressed with only a few simple “verbs” (actions). Wrangling here refers to maneuvering, managing, controlling, and turning your data around to clean it up, better understand it, and prepare it for later analyses. The table below lists some common “verbs” from the dplyr package that gets loaded with tidyverse:
Task | Function | Example |
---|---|---|
Select columns | select() |
Remove data entry columns such as the person’s name who entered the data. |
Subset/filter rows | filter() |
Keeping rows with glucose values above 4. |
Rename columns | rename() |
Changing a column name from ‘Q1’ to ‘participant_name’. |
Transform or modify columns | mutate() |
Multiplying or taking the log of a column’s values. |
Sort rows | arrange() |
Show rows with the smallest value at the top. We won’t cover this function in this course, but we do have an extra section on it in Appendix E. |
Calculate summaries | summarise() |
Calculating the maximum, median, and minimum age. We’ll cover this in Chapter 13. |
Run code by a group |
group_by() with summarise()
|
Calculate means of age by males and females. We’ll cover this in Chapter 13. |
You might notice that we have used the word “variable” in the previous sessions and here we are starting to use “column”. While often these two words mean the same, they don’t always. A column is a list of values of the same type structured in a data frame. A variable is a list of values that have specific and consistent characteristics that represent something in the real world. For instance, in R, it is entirely possible to have all the columns in a data frame stacked on top of each other in a very long format, including stacking numbers and text. A variable however, would be something like a person’s weight or age. It is something that is important to us humans, that means something specific. When we do data wrangling, we often end up working with or creating columns, but often the goal to get our data into a format where the columns are also the variables.
Taking an example from our data, the Insulin10
column would be more of a column than a variable because it is insulin measured at 10 minutes, which is very specific and isn’t really meaningful to us humans as a concept. If we split it up into two new columns, Insulin
and Time
, then both Insulin
and Time
would be variables.
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor 👒 🎩
Take your time explaining the arguments and positions of things in the function.
Selecting columns of a dataset is a very common data wrangling task. The function for this task is appropriately called select()
. You would use select()
to select one or more columns in a dataset that you want to have a closer look at or to save as a new data frame to work with. It may be that you wish to explore the clinical characteristics of your study sample, so you may select some basic demographic variables (e.g., Age
) and clinical variables (e.g., Weight
and Height
) to perform these analyses.
For the input arguments, select()
takes the data object as the first argument, which is the first input position right after the opening bracket (
, and then takes the names of the columns you want to select after this first argument position. The argument after the data argument is shown as ...
, which indicates that you can add as many columns as you want, separated by a ,
. Let’s try it out on a few columns of our data.
First, go to the bottom of your Quarto file and create a new header called ## Selecting data
. Below that header, create a new code chunk with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”). Then we can start selecting some columns. Let’s select one column first. When you write the name of the column, you don’t need to use quotes around it but it will still work if you do.
docs/learning.R
select(post_meal_data, Age)
# A tibble: 31 × 1
Age
<dbl>
1 50
2 51
3 43
4 55
5 53
6 51
7 48
8 35
9 54
10 52
# ℹ 21 more rows
When you run this code with Ctrl-EnterCtrl-Enter, you will see that only the Age
column is shown in the output.
Open up and show the Environment pane to show that no new object has been created and explain the paragraph below. Reinforce that if you don’t assign the output to an object, it means you can’t do anything else with the output since nothing was created.
If we look at our environment pane in RStudio, you will notice that running these functions doesn’t create any new object in our environment. That’s because we haven’t assigned the output to a new object with <-
. When we run these actions, we are only telling R to send the output to our screen for us to look at. Shortly we will assign a new object with some selected columns to a new data object.
Let’s select a few more columns. Create a new line in the same code chunk and add the following code:
select(post_meal_data, Age, BMI, Weight)
# A tibble: 31 × 3
Age BMI Weight
<dbl> <dbl> <dbl>
1 50 27.5 92
2 51 33.7 106.
3 43 26.3 89.1
4 55 25.9 84
5 53 29.4 99.4
6 51 23.7 76.8
7 48 23.9 75.8
8 35 22 67.5
9 54 26.4 87.9
10 52 24.5 72.2
# ℹ 21 more rows
This time we have a lot more columns output. What if we want to exclude a column? We can use the minus sign -
to exclude a column. Let’s exclude the Group
column.
select(post_meal_data, -Group)
# A tibble: 31 × 84
OFS.ID Age BMI Length Weight Bone.mineral.DXA Fat.mass...DXA
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 OFS 301 50 27.5 1.83 92 3.54 30.2
2 OFS 302 51 33.7 1.77 106. 4.05 36.4
3 OFS 304 43 26.3 1.84 89.1 3.77 24.4
4 OFS 303 55 25.9 1.8 84 3.14 27.5
5 OFS 305 53 29.4 1.84 99.4 4.09 31.2
6 OFS 306 51 23.7 1.8 76.8 3.21 20
7 OFS 307 48 23.9 1.78 75.8 3.33 15.4
8 OFS 308 35 22 1.75 67.5 3.26 13.8
9 OFS 309 54 26.4 1.83 87.9 4.49 20
10 OFS 310 52 24.5 1.72 72.2 2.87 28.5
# ℹ 21 more rows
# ℹ 77 more variables: Fat.mass.DXA <dbl>, Fat.free.mass.DXA <dbl>,
# Fat.free.soft.tissue.DXA <dbl>, FP.Glucose.screen <dbl>,
# P.Glucose..5.OGTT <dbl>, P.Glucose.0.OGTT <dbl>,
# P.GLucose.30.OGTT <dbl>, P.Glucose.60.OGTT <dbl>,
# P.Glucose.90.OGTT <dbl>, P.Glucose.120.OGTT <dbl>,
# FS.Insulin.screen <dbl>, Insulin..5.OGTT.X <dbl>, …
Sometimes you might have many columns with that all follow a specific pattern, like our Insulin
or PG
columns. There are some “select” helper functions to choose these columns, the commonly used ones are:
starts_with()
: Select columns that begin with a pattern.ends_with()
: Select columns that end with a pattern.contains()
: Select columns that contain a pattern.For more information on using the pattern functions such as starts_with()
, check out ?select_helpers
(choose the “Select helpers” option in the menu that pops up).
Let’s try these out on our data. Create a new header called ## Select helpers
and then create a new code chunk below that header with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”). We’ll start by selecting all columns that start with “PG” (post-meal glucose).
docs/learning.qmd
select(post_meal_data, starts_with("PG"))
# A tibble: 31 × 15
PG.15 PG.5 PG1 PG2 PG3 PG5 PG8 PG10 PG15 PG20 PG30
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5.4 5.4 5.4 5.5 5.3 5.3 5.5 5.5 5.8 6.1 6.9
2 5.2 5.4 5.4 5.4 5.4 5.4 5.5 5.6 6 6.5 7.5
3 4.9 5 5.1 5.1 5.1 5 5.1 5.1 5 5.2 5.5
4 4.8 4.8 4.8 4.8 4.7 4.9 4.9 5.1 5.3 5.1 5.5
5 5.9 5.8 5.8 5.9 5.8 5.8 5.7 5.7 6.2 7.2 8
6 5 4.9 5.1 5.1 5.1 5 5 5 5.5 5.6 6.1
7 5 5.1 5 5 5.1 5.1 5.1 5.2 5.7 6.1 7.9
8 4.6 4.6 4.9 4.9 4.9 4.8 4.7 4.8 5.1 5.5 6.5
9 5.3 5.1 5.2 5.3 5.4 5.2 5.2 5.4 6 6.4 7.1
10 4.9 4.9 4.8 4.8 4.8 4.9 5.1 5 5.3 6 6.5
# ℹ 21 more rows
# ℹ 4 more variables: PG45 <dbl>, PG60 <dbl>, PG90 <dbl>, PG120 <dbl>
See how it outputs all the columns that start with “PG”? Now let’s try selecting all columns that end with “DXA”:
# A tibble: 31 × 5
Bone.mineral.DXA Fat.mass...DXA Fat.mass.DXA Fat.free.mass.DXA
<dbl> <dbl> <dbl> <dbl>
1 3.54 30.2 27.9 64.3
2 4.05 36.4 38.7 67.8
3 3.77 24.4 21.9 68.0
4 3.14 27.5 23.2 61.1
5 4.09 31.2 31.2 69.0
6 3.21 20 15.3 60.9
7 3.33 15.4 11.8 64.9
8 3.26 13.8 9.4 59.0
9 4.49 20 17.9 71.3
10 2.87 28.5 20.7 51.8
# ℹ 21 more rows
# ℹ 1 more variable: Fat.free.soft.tissue.DXA <dbl>
Just like the previous example, it outputs all the columns that end with “DXA”. Lastly, let’s try selecting all columns that contain the number “0” somewhere in the column name:
# A tibble: 31 × 29
P.Glucose.0.OGTT P.GLucose.30.OGTT P.Glucose.60.OGTT
<dbl> <dbl> <dbl>
1 5.1 9.4 6.9
2 5.4 8.4 8.4
3 4.9 6.8 4.3
4 5.15 8.8 8.4
5 5.55 9.9 10.6
6 5.6 8.9 9.1
7 5 7.6 6.3
8 5.15 6.4 7.6
9 4.9 7.4 6.8
10 5.3 8.7 10.1
# ℹ 21 more rows
# ℹ 26 more variables: P.Glucose.90.OGTT <dbl>,
# P.Glucose.120.OGTT <dbl>, Insulin.0.OGTT.X <dbl>,
# Insulin.0.OGTT <dbl>, Insulin.30.OGTT <dbl>, Insulin.60.OGTT <dbl>,
# Insulin.90.OGTT <dbl>, Insulin.120.OGTT <dbl>, PG10 <dbl>,
# PG20 <dbl>, PG30 <dbl>, PG60 <dbl>, PG90 <dbl>, PG120 <dbl>,
# CP10 <dbl>, CP20 <dbl>, CP30 <dbl>, CP60 <dbl>, CP90 <dbl>, …
Now that we’ve tested how to use select()
, let’s simplify our dataset a bit. While our data doesn’t have too many rows, there are a lot of columns to deal with, at least for this course. Let’s keep only a few columns for now, so it is easier to look at and explore.
Since we want this code to run every time we restart this Quarto document, we should put it into the setup
code chunk. Go to the setup
code chunk and at the bottom of it, add the following code:
docs/learning.qmd
small_post_meal_data <- select(post_meal_data, id, Age, BMI, Group, auc_pg, auc_ins, glykemi)
Run the code with Ctrl-EnterCtrl-Enter to assign the object, then in the Console, check what it looks like by typing out the newly created object there:
Console
small_post_meal_data
# A tibble: 31 × 7
id Age BMI Group auc_pg auc_ins glykemi
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 50 27.5 FDR 806. 28728. 0
2 2 51 33.7 FDR 944. 42243. 0
3 3 43 26.3 FDR 694. 7108. 0
4 4 55 25.9 FDR 731. 37593. 0
5 5 53 29.4 FDR 1004. 33863. 0
6 6 51 23.7 FDR 713. 15008. 0
7 7 48 23.9 FDR 965. 23196. 0
8 8 35 22 FDR 741. 15272. 0
9 9 54 26.4 FDR 848. 24426. 0
10 10 52 24.5 FDR 759. 22475. 0
# ℹ 21 more rows
We will now work with this smaller data object for a while. Let’s make sure to style our document with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”), then we will render the document with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”) to see the output. Once done, we’ll add and commit the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”). Finally, push the changes to your GitHub repository.
Briefly go over this again, especially the examples given.
Time: ~6 minutes
A key component of tidy data and tidy code is making use of the “pipe” |>
operator and that comes built-in with R. This base R pipe was inspired from the magrittr %>%
pipe. You may notice this style of piping in documentation online. Both pipes work exactly the same, but since the |>
comes built-in with R, we will use that one for this course.
You would use the pipe |>
operator when you are writing a piece of code with multiple operations or steps in sequence. One advantage of the pipe is that it makes your code a bit more readable and also a bit easier to modify to insert additional steps in your data analysis “pipeline”.
The pipe takes the output from one function and inserts it as input to another function, just like a plumbing pipe would do for water. This allows you to easily chain functions together into “sentences”.
Let’s use an example based on English words for some action. This is the English sentence:
We need some eggs. Drive to the grocery store and buy some eggs before coming home from work.
There are basically two actions here (“drive” and “buy”) with four inputs (“work”, “grocery store”, “eggs”, “home”), that are all based on the previous action. Since an action in R is a function, the functions would be drive()
and buy()
. In regular R (without the “pipe” operator), we would have to nest functions (reading from the inside to the outside) together to chain them:
drive(buy(drive(at_work, "grocery store"), "eggs"), "home")
This is a bit difficult to read. Another way to make it more readable would be to create temporary objects for each step:
at_grocery_store <- drive(at_work, "grocery store")
got_eggs <- buy(at_grocery_store, "eggs")
at_home <- drive(got_eggs, "home")
This still isn’t very “readable”, as we are having to re-name each intermediate object with reference to the object before it. The pipe |>
operator can really simplify this:
at_work |>
drive("grocery store") |>
buy("eggs") |>
drive("home")
Do you find this more readable and understandable? It reads how it would be done, in the order of the steps taken.
The idea of piping is to read the functions from left to right. This can help clarify and break down complex data processing workflows, and is the basis for the tidyverse and many other packages.
The pipe |>
takes the output from the object or function from the left of the operator and puts it into the function on the right of the operator. All input goes into the first position argument of the function. Within the tidyverse packages, all functions take a data frame (or vector) as the first argument in order to work with the pipe.
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor 👒 🎩
Take your time to explain the pipe operator and how it works.
Renaming columns is another very common data wrangling task. The function for this task is also pretty clear, called rename()
. Depending on how your data was collected, it may have column names that aren’t very descriptive. So you’ll probably want to rename them to something more explanatory, which is particularly important if you’re sharing your work with others or in an environment where multiple people are working on the same data.
Like select()
, rename()
takes the dataset as the first argument and then takes as many renaming arguments as you want (because the second argument position is ...
). When renaming, it takes the form of newname = oldname
. Almost all functions in the tidyverse take this form, where the first argument is the data object and the rest are the arguments that the function will use. The purpose of this design is to make the functions easy to use with the pipe, because the pipe automatically takes the data object and puts it into the first position of the function.
Let’s rename a few of the columns, while also starting to use the pipe operator. Create a new header called ## Renaming columns
and then create a new code chunk below that header with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”). Then write out the code below, so that we start using the pipe operator to rename some columns. The keyboard shortcut for the pipe is Ctrl-Shift-MCtrl-Shift-M. Let’s rename auc_pg
to be a bit clearer as auc_glucose
.
docs/learning.qmd
small_post_meal_data |>
rename(auc_glucose = auc_pg)
# A tibble: 31 × 7
id Age BMI Group auc_glucose auc_ins glykemi
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 50 27.5 FDR 806. 28728. 0
2 2 51 33.7 FDR 944. 42243. 0
3 3 43 26.3 FDR 694. 7108. 0
4 4 55 25.9 FDR 731. 37593. 0
5 5 53 29.4 FDR 1004. 33863. 0
6 6 51 23.7 FDR 713. 15008. 0
7 7 48 23.9 FDR 965. 23196. 0
8 8 35 22 FDR 741. 15272. 0
9 9 54 26.4 FDR 848. 24426. 0
10 10 52 24.5 FDR 759. 22475. 0
# ℹ 21 more rows
The pipe automatically takes post_meal_data
and puts it into the first position, so we don’t need to specify post_meal_data
inside the rename()
function. Run this code with Ctrl-EnterCtrl-Enter and you will see the output with the new column name. We also have a column auc_ins
that we can rename to auc_insulin
. Add this to the code chunk below the previous code.
docs/learning.qmd
small_post_meal_data |>
rename(
auc_glucose = auc_pg,
auc_insulin = auc_ins
)
# A tibble: 31 × 7
id Age BMI Group auc_glucose auc_insulin glykemi
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 50 27.5 FDR 806. 28728. 0
2 2 51 33.7 FDR 944. 42243. 0
3 3 43 26.3 FDR 694. 7108. 0
4 4 55 25.9 FDR 731. 37593. 0
5 5 53 29.4 FDR 1004. 33863. 0
6 6 51 23.7 FDR 713. 15008. 0
7 7 48 23.9 FDR 965. 23196. 0
8 8 35 22 FDR 741. 15272. 0
9 9 54 26.4 FDR 848. 24426. 0
10 10 52 24.5 FDR 759. 22475. 0
# ℹ 21 more rows
Run this code again with Ctrl-EnterCtrl-Enter to see the output. Before moving on, let’s style our document with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”), render the document with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”), add and commit the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”), then push to GitHub.
Time: ~10 minutes.
Complete the following tasks while in the docs/learning.qmd
document, running each code after completing the task. Remember that you aren’t actually changing the data each time, since you won’t assign (<-
) to a new object. Use the code below as a template to complete the tasks.
docs/learning.qmd
Create a new header called ## Select and rename tasks
at the bottom of the Quarto document. Then create a new code chunk below that header with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”) and copy and paste the template above into this code chunk. Complete the tasks below by filling in the blanks in the template:
select()
only columns that contain()
the text "OGTT"
. Run that code to get the output. How many columns are there?
select()
columns that contains()
the characters ".."
but exclude -
columns that contains()
the characters "..."
.
Based on the columns from task 2 above, use rename()
on those columns so that the ..
inside their name is .minus
. You will need to type out these column names, both the old and the new, manually. Recall that renaming is in the form new = old
.
Read through (in your head) the code below # Task 4.
above. How intuitive is it to read? Now, re-write this code to use the pipe |>
so that you don’t need to create the temporary basic_info
object, then re-read the revised version. Which do you feel is easier to “read”?
After you finish these tasks, run styler on the docs/learning.qmd
file with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”), then render the document with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”). Lastly, add and commit these changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”) before pushing to GitHub.
# Task 1.
post_meal_data |>
select(contains("OGTT"))
# Task 2.
post_meal_data |>
select(contains(".."), -contains("..."))
# Task 3.
post_meal_data |>
rename(
P.Glucose.minus5.OGTT = P.Glucose..5.OGTT,
Insulin.minus5.OGTT.X = Insulin..5.OGTT.X
)
# Task 4.
post_meal_data |>
select(id, BMI, Length, Weight, Age) |>
rename(Height = Length)
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor 👒 🎩
Emphasize the importance of being careful with logic conditions, that we humans are really bad at it.
The two functions to “subset” are select()
’ing on the column level, and filter()
’ing on the row level. You’ll use filtering to do tasks like removing outliers or incorrectly entered data, dropping missing values, or keeping only some rows based on a condition, like only participants in a certain group.
The filter()
function requires a logic condition (TRUE
or FALSE
). As with the other functions, the first argument is the dataset and all others are the logical conditions that will apply to the row filtering. When the logical conditions equal TRUE
, it means that those rows will be kept and those that are FALSE
will be dropped.
Since filter()
uses logical conditions, you need to be really careful when writing the logic. As you probably know, humans are really really bad at logic. If your logical condition starts getting complex, double and triple check that you know for certain that your logic code is doing what you think it is doing. It’s very easy to make mistakes at this stage, even for advanced R users.
Show this table on the screen and briefly explain the first few to them. Mention that they can use this table as a reference for the rest of the course.
Below is a table showing most of the logic comparisons to use in R. Table 11.2 can be used as a reference for logical conditions in R.
Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | equal to |
!= | not equal to |
!x | not x (if x is true, ! will make it false) |
x | y | x OR y |
x & y | x AND y |
The simplest kind of logic condition is to test for “equality”. In R, “equal to” is represented by ==
. For example, if we wanted to keep only people who are have (hyper-)glycemia in the dataset, we would use the logic condition like this. First, go to the bottom of the Quarto document and create a new header called ## Filtering data
. Below that header, create a new code chunk with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”). Then we can write out the code together.
docs/learning.qmd
small_post_meal_data |>
filter(glykemi == 1)
# A tibble: 4 × 7
id Age BMI Group auc_pg auc_ins glykemi
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 15 48 24.7 FDR 912. 18432. 1
2 19 30 33.4 CTR 956. 47198. 1
3 20 46 30.7 CTR 811. 43320. 1
4 22 51 30.5 CTR 959. 53492. 1
We would “read” this code as:
Take the post_meal_data dataset and then filter so that only rows where
glykemi
is equal to 1 are kept.
When a row in the glykemi
column has the value 1
, that row is kept. Otherwise, it is dropped. Let’s try out another logical condition with filter()
. Let’s see how many people are above “overweight” (BMI of 25) category, which we would use >=
for. Within the same code chunk, add the following code:
docs/learning.qmd
small_post_meal_data |>
filter(BMI >= 25)
# A tibble: 19 × 7
id Age BMI Group auc_pg auc_ins glykemi
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 50 27.5 FDR 806. 28728. 0
2 2 51 33.7 FDR 944. 42243. 0
3 3 43 26.3 FDR 694. 7108. 0
4 4 55 25.9 FDR 731. 37593. 0
5 5 53 29.4 FDR 1004. 33863. 0
6 9 54 26.4 FDR 848. 24426. 0
7 11 47 27.9 FDR 764. 23176. 0
8 16 52 27.5 FDR 824. 36487. 0
9 17 51 29.7 CTR 834. 39020. 0
10 19 30 33.4 CTR 956. 47198. 1
11 20 46 30.7 CTR 811. 43320. 1
12 21 34 28.7 CTR 720. 24708. 0
13 22 51 30.5 CTR 959. 53492. 1
14 23 53 28.9 CTR 773. 24246. 0
15 24 36 26.2 CTR 678. 16162. 0
16 26 35 25.1 CTR 823. 19637. 0
17 27 37 27.8 CTR 842. 21189. 0
18 29 40 31.8 CTR 783. 29644. 0
19 31 33 27.4 CTR 786. 43337. 0
Run this by using Ctrl-EnterCtrl-Enter. We’ll see that more than half of the participants are above overweight. Let’s see how many are in the "CTR"
group, again, within the same code chunk.
docs/learning.qmd
small_post_meal_data |>
filter(Group == "CTR")
# A tibble: 15 × 7
id Age BMI Group auc_pg auc_ins glykemi
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 17 51 29.7 CTR 834. 39020. 0
2 18 51 21.2 CTR 787. 12343. 0
3 19 30 33.4 CTR 956. 47198. 1
4 20 46 30.7 CTR 811. 43320. 1
5 21 34 28.7 CTR 720. 24708. 0
6 22 51 30.5 CTR 959. 53492. 1
7 23 53 28.9 CTR 773. 24246. 0
8 24 36 26.2 CTR 678. 16162. 0
9 25 30 21.2 CTR 735. 23830. 0
10 26 35 25.1 CTR 823. 19637. 0
11 27 37 27.8 CTR 842. 21189. 0
12 28 49 20.2 CTR 757. 17985. 0
13 29 40 31.8 CTR 783. 29644. 0
14 30 34 20.8 CTR 763. 16591. 0
15 31 33 27.4 CTR 786. 43337. 0
Looks like there is about half the people in the control group.
Reinforce the importance of being very careful with combining logic conditions and to double and triple check that how you understand the logic is correct.
For the code chunk comparing the true with the false, type this out yourself and tell the participants to just watch. Explain as you go along.
What if we wanted to see who is overweight and who is in the control group? We would use the |
(“or”) and &
(“and”) when we want to combine conditions across columns. Be careful with these operators and when combining logic conditions, as they can sometimes work differently than our human brains interpret them (speaking from experience). For &
, both sides must be TRUE
in order for the combination to be TRUE
. For |
, only one side needs to be TRUE
in order for the combination to be TRUE
. To see how they work try these:
TRUE & TRUE
[1] TRUE
TRUE & FALSE
[1] FALSE
FALSE & FALSE
[1] FALSE
TRUE | TRUE
[1] TRUE
TRUE | FALSE
[1] TRUE
FALSE | FALSE
[1] FALSE
But be careful with missing values, as they will “propagate”. For example:
TRUE & NA
[1] NA
FALSE & NA
[1] FALSE
TRUE | NA
[1] TRUE
FALSE | NA
[1] NA
The result will be NA
in all cases. This is because R doesn’t know what to do with the NA
value, so it returns NA
as the result. You as the human need to figure out what to do with the NA
values before you use the logic condition on them.
Return to having them code along with you.
So, let’s see how many people are in the control group and have a BMI of 25 or more. In the same code chunk, write this code below:
docs/learning.qmd
small_post_meal_data |>
filter(Group == "CTR", BMI >= 25)
# A tibble: 11 × 7
id Age BMI Group auc_pg auc_ins glykemi
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 17 51 29.7 CTR 834. 39020. 0
2 19 30 33.4 CTR 956. 47198. 1
3 20 46 30.7 CTR 811. 43320. 1
4 21 34 28.7 CTR 720. 24708. 0
5 22 51 30.5 CTR 959. 53492. 1
6 23 53 28.9 CTR 773. 24246. 0
7 24 36 26.2 CTR 678. 16162. 0
8 26 35 25.1 CTR 823. 19637. 0
9 27 37 27.8 CTR 842. 21189. 0
10 29 40 31.8 CTR 783. 29644. 0
11 31 33 27.4 CTR 786. 43337. 0
When we use ,
to separate our logic conditions in filter()
, it interprets that as using an &
. Looks like most people who are overweight are in the control group. Let’s replace the ,
with a |
to see how many people are in the control group OR have a BMI of 25 or more.
small_post_meal_data |>
filter(Group == "CTR" | BMI >= 25)
# A tibble: 23 × 7
id Age BMI Group auc_pg auc_ins glykemi
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 50 27.5 FDR 806. 28728. 0
2 2 51 33.7 FDR 944. 42243. 0
3 3 43 26.3 FDR 694. 7108. 0
4 4 55 25.9 FDR 731. 37593. 0
5 5 53 29.4 FDR 1004. 33863. 0
6 9 54 26.4 FDR 848. 24426. 0
7 11 47 27.9 FDR 764. 23176. 0
8 16 52 27.5 FDR 824. 36487. 0
9 17 51 29.7 CTR 834. 39020. 0
10 18 51 21.2 CTR 787. 12343. 0
# ℹ 13 more rows
This lists both the people who are in the control as well as anyone who is overweight.
Let’s create a new data object with the renamed code we used above plus removing the few participants with a 1
for glykemi
. This column is actually called hyperglycemia or impaired glucose tolerance, and is when the level of glucose in the blood is higher than normal, but not high enough to be diagnosed as diabetes. This is set as 0
and 1
, likely meaning “normal” and “high”. Since there are so few of them, we’ll just remove them for now. We’ll also remove the glykemi
column since we don’t need it anymore.
Since we want this code available to us easily, let’s put it into the setup
code chunk. Go to the setup
code chunk and at the bottom of it, write out the code below to create a new data object.
docs/learning.qmd
# A tibble: 27 × 6
id Age BMI Group auc_glucose auc_insulin
<dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 1 50 27.5 FDR 806. 28728.
2 2 51 33.7 FDR 944. 42243.
3 3 43 26.3 FDR 694. 7108.
4 4 55 25.9 FDR 731. 37593.
5 5 53 29.4 FDR 1004. 33863.
6 6 51 23.7 FDR 713. 15008.
7 7 48 23.9 FDR 965. 23196.
8 8 35 22 FDR 741. 15272.
9 9 54 26.4 FDR 848. 24426.
10 10 52 24.5 FDR 759. 22475.
# ℹ 17 more rows
Run this code with Ctrl-EnterCtrl-Enter to see the output and see what it looks like now. Great! Before moving on, let’s style our document with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”), render the document with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”), and then add and commit the changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”). Finally, push the changes to your GitHub repository.
Recall that the original dataset doesn’t change. If we want the added variable to be saved, we must assign it to something with <-
.
To modify an existing column or to add a new one, we can use the function called mutate()
. You can use mutate()
to compute a new column using existing columns in your dataset. You can multiply all values in a certain column by 2, or combine columns into a new column. Like the other functions, the first input is the dataset and the other arguments are columns to add or modify. The follow the form: new_column_name = action()
.
In our data, we have a few columns we could modify or create. The column we could create is a ratio between insulin and glucose in the blood, which is a very rough measure of insulin sensitivity. This is calculated by dividing the area under the curve of insulin by that of glucose. We’ll call this new column insulin_glucose_ratio
, where a high value means insulin insensitive (or resistant) and a low value means insulin sensitive. Let’s create a new header called ## Modifying columns
and then create a new code chunk below that header with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”). We’ll use some math with the /
symbol to divide values.
docs/learning.qmd
tidier_post_meal_data |>
mutate(insulin_glucose_ratio = auc_insulin / auc_glucose)
# A tibble: 27 × 7
id Age BMI Group auc_glucose auc_insulin insulin_glucose_ratio
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 50 27.5 FDR 806. 28728. 35.7
2 2 51 33.7 FDR 944. 42243. 44.7
3 3 43 26.3 FDR 694. 7108. 10.2
4 4 55 25.9 FDR 731. 37593. 51.4
5 5 53 29.4 FDR 1004. 33863. 33.7
6 6 51 23.7 FDR 713. 15008. 21.1
7 7 48 23.9 FDR 965. 23196. 24.0
8 8 35 22 FDR 741. 15272. 20.6
9 9 54 26.4 FDR 848. 24426. 28.8
10 10 52 24.5 FDR 759. 22475. 29.6
# ℹ 17 more rows
Run the code with Ctrl-EnterCtrl-Enter to see the new column.
The other column we could modify, rather than create, is the Group
column. This is a character column with abbreviations rather than full names as values. We can continue using the same mutate()
by adding a comma to the end of the previous line and then adding the new column. For this, we will need to use the if_else()
function to change the values. The if_else()
function takes three arguments: the condition, the value if the condition is TRUE
, and the value if the condition is FALSE
. The Table 11.2 is useful to use as a reference to help create the logic condition. So if the Group
column is equal to "CTR"
, we want to change it to "Control"
, otherwise change it to "First-degree relative"
.
docs/learning.qmd
# A tibble: 27 × 7
id Age BMI Group auc_glucose auc_insulin insulin_glucose_ratio
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 50 27.5 Firs… 806. 28728. 35.7
2 2 51 33.7 Firs… 944. 42243. 44.7
3 3 43 26.3 Firs… 694. 7108. 10.2
4 4 55 25.9 Firs… 731. 37593. 51.4
5 5 53 29.4 Firs… 1004. 33863. 33.7
6 6 51 23.7 Firs… 713. 15008. 21.1
7 7 48 23.9 Firs… 965. 23196. 24.0
8 8 35 22 Firs… 741. 15272. 20.6
9 9 54 26.4 Firs… 848. 24426. 28.8
10 10 52 24.5 Firs… 759. 22475. 29.6
# ℹ 17 more rows
Run this code with Ctrl-EnterCtrl-Enter to see the new column. To make use of this new data object, we should assign it to a new object. Similar to the small_post_meal_data
and tidier_post_meal_data
objects, we want these available to us easily, so let’s put this into the setup
code chunk. Go to the setup
code chunk and at the bottom of it, write out the code below to create a new data object.
Run this code with Ctrl-EnterCtrl-Enter to see the new column. We’ll use this new data object in our next session.
Before moving on, run styler with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”), render with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”), and then add and commit all the files changed to the Git history using Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”) before pushing to GitHub.
Time: ~15 minutes.
Time to practice using these two functions! First, create a new header called ## Practice with filter and mutate
at the bottom of the Quarto document. Then create a new code chunk below that header with Ctrl-Alt-ICtrl-Alt-I or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “new chunk”). Copy and paste the code template below into the code chunk before doing the tasks. Then complete the tasks by replacing the ___
with what is asked in the task.
docs/learning.qmd
Use filter()
to find how many participants have a BMI of more than or equal to 25 and less than or equal to 40, and are in the “FDR” Group
. How many are there?
Create a percent_body_fat
column with mutate()
by dividing the Fat.mass.DXA
by the Weight
, then multiplying by 100. Hint: In R, use *
to multiply and /
to divide. Then pipe to filter()
to find out how many participants have a percent_body_fat
greater than 30.
Once done, style the document with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”), render the document with Ctrl-Shift-KCtrl-Shift-K or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “render”), and add and commit these changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”). Finally, push to GitHub to synchronize with your GitHub repository.
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor 👒 🎩
select()
function to select specific columns in a dataset.rename()
function to rename columns in a dataset.filter()
function to filter rows based on a logical condition.mutate()
function to modify existing or add new columns in a dataset.|>
to write easy-to-read code, similar to reading a text consisting of multiple sentences.This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.
select(post_meal_data, Age)
select(post_meal_data, Age, BMI, Weight)
select(post_meal_data, -Group)
select(post_meal_data, starts_with("PG"))
select(post_meal_data, ends_with("DXA"))
# All columns containing the number "0"
select(post_meal_data, contains("0"))
small_post_meal_data <- select(post_meal_data, id, Age, BMI, Group, auc_pg, auc_ins, glykemi)
small_post_meal_data |>
rename(auc_glucose = auc_pg)
small_post_meal_data |>
rename(
auc_glucose = auc_pg,
auc_insulin = auc_ins
)
# Task 1.
post_meal_data |>
select(contains("OGTT"))
# Task 2.
post_meal_data |>
select(contains(".."), -contains("..."))
# Task 3.
post_meal_data |>
rename(
P.Glucose.minus5.OGTT = P.Glucose..5.OGTT,
Insulin.minus5.OGTT.X = Insulin..5.OGTT.X
)
# Task 4.
post_meal_data |>
select(id, BMI, Length, Weight, Age) |>
rename(Height = Length)
small_post_meal_data |>
filter(glykemi == 1)
small_post_meal_data |>
filter(BMI >= 25)
small_post_meal_data |>
filter(Group == "CTR")
small_post_meal_data |>
filter(Group == "CTR", BMI >= 25)
small_post_meal_data |>
filter(Group == "CTR" | BMI >= 25)
tidier_post_meal_data <- small_post_meal_data |>
rename(
auc_glucose = auc_pg,
auc_insulin = auc_ins
) |>
filter(glykemi == 0) |>
select(-glykemi)
tidier_post_meal_data
tidier_post_meal_data |>
mutate(insulin_glucose_ratio = auc_insulin / auc_glucose)
tidier_post_meal_data |>
mutate(
insulin_glucose_ratio = auc_insulin / auc_glucose,
Group = if_else(Group == "CTR", "Control", "First-degree relative")
)
tidied_post_meal_data <- tidier_post_meal_data |>
mutate(
insulin_glucose_ratio = auc_insulin / auc_glucose,
Group = if_else(Group == "CTR", "Control", "First-degree relative")
)
# Task 1.
post_meal_data |>
filter(BMI >= 25 & BMI <= 40 & Group == "FDR")
# Task 2.
post_meal_data |>
mutate(
percent_body_fat = (Fat.mass.DXA / Weight) * 100
) |>
filter(percent_body_fat > 30)