6  Analytically reproducible documents

6.1 Learning objectives

This session’s overall learning outcome is to:

  1. Describe what a reproducible document is, explain why it is important, and use it to write R code to create an output document.

Specific objectives are to:

  1. Describe what Quarto is and how it helps with reproducibility, and why it can save you time and effort.
  2. Write and use R code within a document, so that it will automatically insert the R output into the final document.
  3. Identify what some Markdown syntax does to the formatting of the generated HTML file and write some Markdown to make some simple formatting of your document.
  4. Define some basic objects and data types in R and write some simple R code within the Quarto document.

6.2 📖 Reading task: Why and how to be reproducible?

Time: ~10 minutes

Both reproducibility and replicability are cornerstones for doing rigorous and sound science. Unfortunately, in practice very little scientific work is reproducible. However, being reproducible isn’t just about doing better science. It can also mean that:

  1. You are much more efficient and productive, as less time is spent between coding and transferring your results to a document. No need to copy and paste!
  2. You can be confident in your results, since what you report and show as figures or tables will be exactly what you get from your analysis. Again, no copying and pasting required!

Have a more reproducible workflow by using Quarto. Artwork by @allison_horst.

One common way of being more reproducible when doing data analysis in R is to use Quarto. Quarto is a way of writing R (or Python) code alongside text in a way that allows you to create documents like HTML or Word where the output from R code is directly inserted into the document. For example, if you need to make a figure, you can write R code in the Quarto document so that when you generate a Word document the figure gets inserted automatically.

A main feature of Quarto is that, when you render the output document, it will always run the code in the order it appears in the file and in a fresh, empty environment (a new R session). This means that the output from the code and the results will be, at least within the document, reproducible.

Tip

If you have heard of R Markdown, Quarto is the next generation version of that.

A Quarto file is a plain text file (like R scripts, .csv, or .txt files) with the extension .qmd in which you write text with Markdown syntax. Markdown is a markup syntax, like HTML, that lets you describe the formatting of a document using only plain text. The Markdown text can then be converted into a vast range of other document types, e.g. HTML, PDF, Word documents, slides, posters, or websites. In fact, this website is built from Quarto! Check out Quarto’s Gallery to see a list of things you can create.

For now, we’re going to focus on the main reason that Quarto is used: to use R code and insert the R output into a document. By using R code in a document, you can easily switch between data analysis and document-writing. This means that:

  • There is less time between exploring a new dataset or analysis and sharing your findings with collaborators, because the writing and documenting is mixed in with your code for analysis.
  • If you have already created a report and later get new data or find out there are problems with the existing data, updating your report is as easy as clicking a button to regenerate the results.
  • How you found and present your results is based on the exact sequence of steps given in your Quarto document, so showing others how the analysis was done is easy because the how is explicitly shown in the document.
  • Likewise, by reading others’ Quarto documents, it is easier to learn what was done in their analysis because the logic and sequence is shown in the document itself.

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor 👒 🎩

6.3 Creating a Quarto file

Briefly explain what the Command Palette is as you use it here. Talk through this section slowly, especially the YAML header section.

Now, we will create and save a Quarto file. Open the Command Palette with Ctrl-Shift-P, type “quarto”, and select the item that says “Create a new Quarto document”. You can also use the menu by going to File -> New File -> Quarto Document .... A dialog box will then appear. Enter “Reproducible documents” in the title field and your name in the author field. HTML should be automatically selected as the output format. There’s also the option to use the “visual mode”. This mode is great if you are used to working with Word and you can test it out on your own later. For this course, we will focus on using the normal mode.

After clicking “Create”, the new file will open in RStudio. Before continuing, let’s save this file as learning.qmd in the docs/ folder.

In the newly saved docs/learning.qmd file, you will see some text that gives a brief overview of how to use the Quarto file. For now, let’s ignore the text. At the top of the file you will see something that looks a bit like this:

docs/learning.qmd
---
title: "Reproducible documents"
author: "Your Name"
format: html
---

This section is called the YAML header and it contains the metadata about the document and the settings for how Quarto should process it into another document. Most Quarto documents have a YAML header at the top of the document, and it is always surrounded by --- on the top and bottom of the section.

YAML is a data format that stores data as key: value pairs. The keys in this case are title, author, and format. The values are those that follow the key (e.g. “Your Name” for author). In the case of Quarto, these key: value pairs store the settings that Quarto will use to create the output document. The keys listed above are some of the many settings that Quarto has available to use.

In the case of this YAML header, the Quarto document will generate an HTML file because of the format: html setting. You can also create a Word document by changing this to format: docx.

Tip

It is possible to create PDF documents, though this requires installing a LaTeX distribution such as tinytex, which can sometimes be complicated to install.

So, how do we create an HTML (or Word) document from this document? We do that by “rendering” it. At the top of the pane near the “Save” button, there is a button with the word “Render”. To render, either click that button, use the shortcut Ctrl-Shift-K, or use the Palette (Ctrl-Shift-P, then type “render”) anywhere in the Quarto document. When you render, a bunch of processing messages should appear in a new pane beside the Console, followed by the newly created document opening either in a new window or in the Viewer pane.

You’ve now created an HTML document! Let’s try making a Word document. Change the value of the YAML key format: from html to docx. Then render the document again with the “Render” button, with the keybinding Ctrl-Shift-K, or with the Palette (Ctrl-Shift-P, then type “render”). A Word document should open up if you have a word processor installed. This is the basic approach to creating documents from Quarto.
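Rendering can also be started from R code instead of the “Render” button. The sketch below is only an illustration and assumes the quarto R package is installed alongside Quarto itself. The helper name render_if_possible() is made up for this example and is not part of the course material:

```r
# Hypothetical helper: render a Quarto file only when that is possible.
render_if_possible <- function(path, output_format = "html") {
  # The quarto R package (an assumption here) wraps the Quarto program.
  if (!requireNamespace("quarto", quietly = TRUE)) {
    message("The quarto R package is not installed, skipping the render.")
    return(invisible(FALSE))
  }
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(invisible(FALSE))
  }
  # output_format plays the same role as the format: key in the YAML header.
  quarto::quarto_render(input = path, output_format = output_format)
  invisible(TRUE)
}

# Render the file from this session as a Word document.
render_if_possible("docs/learning.qmd", output_format = "docx")
```

For this course the “Render” button or Ctrl-Shift-K is all you need. A function like this is mainly useful when you want to render several documents from a script.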

6.4 RStudio layout and usage

You’ve already gotten a bit familiar with RStudio in the pre-course tasks, but if you want more details, RStudio has a great cheat sheet on how to use RStudio. The items to know right now are the “Console”, “Files”/“Help”, and “Source” tabs.

Code is written in the “Source” tab, where the code and text are saved as a file. You can send selected code to the Console from the opened file by typing Ctrl-Enter (or clicking the “Run” button). In the “Source” tab (where R scripts and Quarto files are shown), there is a “Document Outline” button (top right beside the “Run” button) that shows you the headers or “Sections” (more on that later). To open it you can click the button, use the keybinding Ctrl-Shift-O, use the Palette (Ctrl-Shift-P, then type “outline”), or go through the menu to Code -> Show Document Outline. The Command Palette is a very useful tool to learn, since you can easily access almost all features and options inside RStudio through it. For this reason, we will be using it a lot throughout the course. Open it up with Ctrl-Shift-P and then in the pop-up search bar, type out “document outline”. The first item should be the one we want, so hit Enter to activate the Outline.

If you can’t remember a specific keybinding in RStudio, check out the help for it by going to the menu item Help -> Keyboard Shortcuts Help.

6.5 Writing R code

Being able to insert R code directly into a document is one of the most powerful features of Quarto. This frees you from having to switch between programs when simultaneously writing text and running R code to derive output that you’d then put into your scientific document.

Running and including R code in Quarto is done using “R code chunks”. You insert these chunks into the document by placing the cursor at the location where you want the chunk to be, then using the shortcut Ctrl-Alt-I or the Palette (Ctrl-Shift-P, then type “new chunk”). Alternatively, you can use the menu item Code -> Insert Chunk.

Before we insert the code chunk, let’s delete all the text in your document, except for the YAML header (including the dashes surrounding it). Make sure that the YAML key format: is set to html. Then, place your cursor two lines below the YAML header and insert a code chunk with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”). In the code chunk, type out 2 + 2. It should look something like:

```{r}
2 + 2
```

You can run R code inside the code chunk using Ctrl-Enter on the line, which will send the code 2 + 2 to the R Console, with the output appearing directly below the code chunk in the document. Note that this output is temporary.

To see how the output is inserted into the HTML document, let’s render the document using Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to see what happens. The output 4 should appear below the code chunk in the HTML document, something like this:

```{r}
2 + 2
```
[1] 4

6.6 📖 Reading task: Basics of R

Let them read it over, then briefly go over the content again. We don’t need to do most of this as a code-along, since we will be using them a lot throughout the course.

Emphasize that, in general, code with () means it is a function and that it does an action. Mention that, like everything, there are some situations where that isn’t completely true but it mostly is.

Time: ~5 minutes

Before moving on, let’s go over a bit about how R works, and what the “R session” means. An R session is the way you normally interact with R, where you would write code in the Console to tell R to do something. Normally, when you open an R session without an R Project, the session defaults to assuming you will be working in the ~/Desktop or ~ (your Home folder) location. This location is called the working directory. But this default location usually isn’t where you actually work and where your R code is. You normally work in the folder that has your R scripts, Quarto documents, or data files. The assumption with R Projects, on the other hand, is that the R session’s working directory should be where the R Project is, since that is where you have your R scripts and data files.
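To see which folder your own R session is using, you can ask R directly. This is a small sketch using the base R functions getwd() and file.exists(). The object name working_directory is only chosen for this example:

```r
# Store and show the folder the current R session treats as its working directory.
working_directory <- getwd()
working_directory

# Check whether the course file can be found from that folder. This is TRUE
# only when the session was started in the folder that contains docs/.
file.exists(file.path(working_directory, "docs", "learning.qmd"))
```

If the last line gives FALSE, the session was most likely not started from the R Project.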

In R, everything is an object and every action is a function. A function is also an object, but an object isn’t always a function. To create an object, also called a variable, we use the <- assignment operator. So, if we want to create an object called weight_kilos and assign it the value 100, we would write:

weight_kilos <- 100
weight_kilos
[1] 100

The new object now stores the value we assigned it. We can read it like:

“weight_kilos contains the number 100” or “put 100 into the object weight_kilos”

You can name an object in R almost anything you want, but it’s best to stick to a style guide. For example, we will use snake_case to name things.
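Once an object exists, you can use its name anywhere you would otherwise write the value. A minimal sketch, where weight_after_gain_kilos is a made-up name that follows the same snake_case style:

```r
# Store a value in an object.
weight_kilos <- 100

# Use the object in a calculation and store the result in a new object.
weight_after_gain_kilos <- weight_kilos + 5
weight_after_gain_kilos
#> [1] 105

# The original object is unchanged until you assign to it again.
weight_kilos
#> [1] 100
```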

There are also several main “classes” (or types) of objects in R: lists, vectors, matrices, and data frames. For now, the only two we will cover are vectors and data frames. A vector is an ordered sequence of values, while a data frame is multiple vectors of equal length put together as columns. Data frames are a form of data that you’d typically see as a spreadsheet. This type of data is called “rectangular data” since it has two dimensions: columns and rows.
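As a rough illustration of “vectors put together as columns”, the sketch below builds a small data frame with the base R function data.frame(). The values and object names are invented for the example, and c(), used here to make the vectors, is described further down:

```r
# Two vectors of the same length (made-up values).
participant <- c("a", "b", "c")
weight_kilos <- c(100, 72, 85)

# data.frame() places the vectors side by side, one column per vector.
measurements <- data.frame(participant, weight_kilos)
measurements

# Rectangular data: 3 rows and 2 columns.
dim(measurements)
#> [1] 3 2
```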

So these are vectors, which have different types like character, logical, numeric, or factor:

# Character vector
c("a", "b", "c")
# Logical vector
c(TRUE, FALSE, FALSE)
# Numeric vector
c(1, 5, 6)
# Factor vector (categorical values with a fixed set of levels)
factor(c("low", "high", "medium", "high"))

Notice how we use the # to write comments or notes. Whatever we write after the “hash” (#) tells R to ignore it and not run it.
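If you are unsure which type a vector has, the base R function class() will tell you. A short sketch using the same vectors as above:

```r
# class() reports the type of the object it is given.
class(c("a", "b", "c"))
#> [1] "character"
class(c(TRUE, FALSE, FALSE))
#> [1] "logical"
class(c(1, 5, 6))
#> [1] "numeric"
class(factor(c("low", "high", "medium", "high")))
#> [1] "factor"
```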

This is what a data frame looks like, if we look at the built-in dataset called airquality, which is a data frame object loaded by default when you start R:

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

The c() function puts values together and head() prints the first 6 rows. Both c() and head() are functions since they do an action and they can be recognized by the () at their end. Functions take inputs (known as arguments) and give back an output. Each argument is separated by a comma ,. Some functions can take an unlimited number of arguments (like c()). Others, like head(), can only take a few arguments. In the case of head(), the first argument is the object to show the first rows of, which here is a data frame.
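To make the idea of arguments more concrete, here is a small sketch. It uses the n argument of head(), which sets how many rows to return. The object names many_values and first_three_rows are only for illustration:

```r
# c() accepts as many arguments as you give it, each separated by a comma.
many_values <- c(1, 5, 6, 10, 12)
length(many_values)
#> [1] 5

# head() takes the object as its first argument. The n argument changes
# how many rows are returned (the default is 6).
first_three_rows <- head(airquality, n = 3)
nrow(first_three_rows)
#> [1] 3
```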

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor 👒 🎩

6.7 Adding the special “setup” code chunk

We showed a very basic example of how a code chunk works. Let’s continue by doing something slightly more complicated.

One of the major strengths of R, and many other programming languages, is that other people can create packages that simplify doing complex tasks. For example, if you need to use mixed effects models for your data analysis, you can use the lme4 package. Or if you want to create figures you can use the ggplot2 package. As you experienced in the pre-course tasks, installing packages is easy with install.packages(). Whenever we work with R, we very rarely work only with the base R functions. We usually use a lot of functions from many other packages, because that is one of the easiest ways for you to simplify your work! No need to re-invent the wheel 😁
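If you are not sure whether a package is already installed, you can check before installing it again. This is a cautious sketch using the base R function requireNamespace(). The object name is_installed is made up, and the install line is left as a comment so that nothing is downloaded by accident:

```r
# requireNamespace() gives TRUE when the package is installed and FALSE
# otherwise. It does not attach the package the way library() does.
is_installed <- requireNamespace("tidyverse", quietly = TRUE)
is_installed

# Installing is only needed once per computer, not in every session:
# install.packages("tidyverse")
```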

One “meta-package” we will use throughout the course is called tidyverse. So let’s load the package up so we can use the functions from inside it.

First, go to the top of the docs/learning.qmd file and create a new code chunk two lines below the YAML header with Ctrl-Alt-I or with the Palette (Ctrl-Shift-P, then type “new chunk”). In the newly created code chunk, type out setup right after the r. This area after the r is where you write the code chunk label. In this case, we labeled the code chunk with the name setup. Code chunk labels should be named without _, spaces, or . and instead should be one word or be separated by -.

Emphasize the warning note below.

Warning

If you use a space in your chunk label, an error or warning may not necessarily occur, but there can be unintended side effects that you may not notice. These can cause quite a bit of annoyance and frustration.

You also can’t use duplicate code chunk labels in your document.

Tip

A nifty thing about using chunk labels is that you can get an overview of your code chunks using the “Document Outline” with Ctrl-Shift-O or with the Palette (Ctrl-Shift-P, then type “outline”), but only if you have this option set up in: Tools -> Global Options -> R Markdown -> Show in document outline.

The name setup also has a special meaning in Quarto. When you run other code chunks in the document in a new (or restarted) session, Quarto will know to first look for and run the code in the setup chunk. So this is the perfect place to add packages to use or code to load your data, which we will do in a later session.

The way you load a package and get access to the functions inside is by using the library() function. So let’s load the tidyverse package by writing library(tidyverse) in the code chunk. It should look like this:

```{r setup}
library(tidyverse)
```

Let’s run this code chunk by placing the cursor over the code and using Ctrl-Enter. After you run the code, you should see some text below the setup chunk that might look something like this:

── Attaching core tidyverse packages ──────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
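The conflicts part of this message writes functions in the form package::function(), which tells R exactly which package a function should come from. A minimal sketch of that notation, using the stats package that comes with R rather than the functions named in the message:

```r
# Call median() and say explicitly that it comes from the stats package.
stats::median(c(1, 5, 6))
#> [1] 5

# The same function without the prefix. R finds it because the stats
# package is attached by default in every session.
median(c(1, 5, 6))
#> [1] 5
```

Read this way, the message says that after loading tidyverse, writing filter() on its own will use the version from dplyr, while the version from stats is still available as stats::filter().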

As we continue adding code to the docs/learning.qmd file, we’ll also add some text to help organize the document and maybe you’ll also find it useful to add notes to yourself as you are coding along. In order for us to do that, we need to learn a bit about Markdown syntax.

6.8 📖 Reading task: Formatting text with Markdown

Time: ~8 minutes

Formatting text in an output document like HTML or Word when using Markdown is done with the use of “special” characters or syntax. These special characters control whether the text will be bold, if it will be a header or a list, and so on. Almost every feature you will need to write a scientific document is available in Markdown, but it can’t do everything. If you can’t get Markdown to do what you want, our suggestion would be to try to fit your writing around Markdown, rather than force or fight with Markdown to do something it wasn’t designed to do. You might actually find that the simpler Markdown approach is easier than what you wanted or were thinking of doing, and that you can do quite a lot with Markdown’s capabilities.

Tip

You can access a quick guide to formatting features of Markdown using the RStudio menu: Help -> Cheat Sheets -> R Markdown Cheat Sheet. Quarto also has a great guide to the Basics of Markdown.

6.8.1 Headers

Creating headers (like chapters or sections) is done by using one or more # at the beginning of a line followed by some text. Headers should always be preceded and followed by an empty line:

Markdown

# Header 1

Paragraph.

## Header 2

Paragraph.

### Header 3

Paragraph.

Output

Header 1

Paragraph.

Header 2

Paragraph.

Header 3

Paragraph.

6.8.2 Lists

Lists are created by adding either - or 1. to the beginning of a line. An empty line must be at the start and end of the list.

For unnumbered lists, it looks like:

Markdown

- item 1
- item 2
- item 3

Output

  • item 1
  • item 2
  • item 3

And numbered lists look like:

Markdown

1. item 1
2. item 2
3. item 3

Output

  1. item 1
  2. item 2
  3. item 3

6.8.3 Text formatting

Markdown         Output
**bold**         bold
*italics*        italics
super^script^    superscript
sub~script~      subscript

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor 👒 🎩

6.9 🧑‍💻 Exercise: Practice using Markdown

Time: ~10 minutes.

Get some practice writing Markdown by completing these tasks in the docs/learning.qmd file. Use the scaffold below as a guide, replacing each ___ with the content described in the tasks that follow it:

docs/learning.qmd

## ___

-   ___
-   ___

___

## ___


```{r}
___
```

  1. Create one level 2 header (##) called “About me” below the setup code chunk.
  2. Below this new level 2 header, insert a list (either numbered or unnumbered) with your name and affiliation(s)/institution(s).
  3. Below the list, write one or two simple sentences about yourself. Include one word in bold and another in italics.
  4. Create another level 2 header (##) called “Simple code”.
  5. Insert a code chunk below this level 2 header by using Ctrl-Alt-I or the Palette (Ctrl-Shift-P, then type “new chunk”), write code in it that multiplies 3 * 3, and run the code by using Ctrl-Enter.
  6. Finally, render the document with Ctrl-Shift-K or with the Palette (Ctrl-Shift-P, then type “render”) to see how it looks.

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor 👒 🎩

6.10 Summary

  • Making your research reproducible not only improves the scientific quality of your work, but also makes you more efficient, more productive, and more confident in your results.
  • The R session is where you interact with R and where code gets run, which occurs in the R Console in RStudio.
  • The working directory in the R Console of the session determines where code execution happens, where files are saved, and where data is loaded from.
  • Running R code involves sending it to the R Console by using Ctrl-Enter.
  • There are many types of objects in R; the two important ones for us are vectors and data frames.
  • There are many data types in R, like character, numeric, and factor.
  • Everything in R is an object and every action is a function, which has a () at the end.
  • Use Quarto to create documents that can be turned into a variety of file types such as HTML or Word, while also ensuring the analysis is (more likely to be) reproducible.
  • Insert R code chunks in Quarto and automatically include the results in the final document.
  • Use Markdown syntax like headers (# Header 1), text formatting (**bold**) and lists (-) in the Quarto file to format the text in the output document.