5  Management of R projects

Session objectives:

  1. Create self-contained projects to be more reproducible.
  2. Use built-in tools in RStudio to make it easier to manage R projects.
  3. Become familiar with the very basics of R.
  4. Apply tools to use a consistent “grammar” and “styling” when writing R code and making files.
  5. Know of and use different approaches to getting and finding help.

5.1 What is a project and why use it?

Throughout this session, because it is the first session, take it slowly. Talk through the basics of R, including emphasizing how to troubleshoot or get help. Check for participants’ understanding using the stickies.

Reading task: ~5 minutes

Before we create a project, we should first define what we mean by “project”. What is a project? In this case, a project is a set of files that together lead to some type of scientific “output” (for instance a manuscript). Use data for your output? That’s part of the project. Do any analysis on the data to give some results? Also part of the project. Write a document, for instance a manuscript, based on the data and results? Have figures inserted into the output document? These are also part of the project.

More and more how we make a claim in a scientific product is just as important as the output describing the claim. This includes not only the written description of the methods but also the exact steps taken, that is, the code used. So, using a project setup can help with keeping things self-contained and easier to track and link with the scientific output. Here are some things to consider when working in projects:

  • Organise all R scripts and files in the same folder (also called “directory”) so it is more self-contained (doesn’t rely on other components in your computer).
  • Use a common and consistent folder and file structure for your projects.
  • Use version control to track changes to your files.
  • Make raw data “read-only” (don’t edit it directly) and use code to show what was done.
  • Whenever possible, use code to create output (figures, tables) rather than manually creating or editing them.
  • Think of your code and project like you do your manuscript or thesis: that other people will eventually look at it and review it, and that it will likely also be published or archived online.

These simple steps can be huge steps toward being reproducible in your analysis. And by managing your projects in a reproducible fashion, you’ll not only make your science better and more rigorous, it also makes your life easier too!

5.2 Exercise: How do you organise your files and projects?

Time: ~8 minutes.

This seems so basic, how files are organized on computers. We literally work with files all the time on computers. But consider, how do you organize them? Take some time to discuss and share with your neighbour.

  1. Take 1 minute to think to yourself.
  2. Take 5 minutes to discuss and share with your neighbour.
  3. For the remaining time, we will all share our thoughts with the group.

5.3 RStudio and R Projects

RStudio helps us with managing projects by making use of R Projects. RStudio R Projects make it easy to divide your work projects into a “container”, that have their own working directory (the folder where your analysis occurs), workspace (where all the R activity and output is temporarily saved), history, and documents.

Warning

File synchronizing and backup services like OneDrive or Dropbox are super common. Unfortunately, they also can be a major source of frustration and challenge when working with data analysis projects. This is mainly due to they way the synchronizing, by constantly looking at changes to files and then synchronizing when a change occurs. When doing data analysis, especially as you get more advanced and use reproducible documents and version control systems, changes to files can happen very often and very quickly. This can essentially cause these services to “spasm” and may overwrite the changes that are happening. Whenever possible, always save your work on your computer and not on these services.

There are many ways one could organise a project folder. We’ll be setting up a project folder and file structure using prodigenr We’ll use RStudio’s “New Project” menu item under:

File -> New Project.. -> New directory -> Scientific Analysis Project using prodigenr

We’ll call the new project LearningR. Save it on your Desktop/. See Figure 5.1 for the steps to do it:

Figure 5.1: Creating a new analysis project in RStudio.

You can also type the below function into the Console, but we won’t do that in this session.

Console
prodigenr::setup_project("~/Desktop/LearningR")

Emphasize and reinforce what this :: is doing and why we are doing it.

Just a reminder, when we use the :: colon here, we are saying:

Hey R, from the prodigenr package use the setup_project function.

That way, we are directly requesting R to look in the prodigenr package and use the setup_project() function. We do this because we want to be explicit about what we want to use and since we don’t need to load the full package.

After we’ve created a New Project in RStudio, we’ll have a bunch of new files and folders.

LearningR
├── R
│   └── README.md
├── data
│   └── README.md
├── data-raw
│   └── README.md
├── doc
│   └── README.md
├── .gitignore
├── DESCRIPTION
├── LearningR.Rproj
├── README.md
└── TODO.md

This forces a specific and consistent folder structure to all your work. Think of this like the “Introduction”, “Methods”, “Results”, and “Discussion” sections of your paper. Each project is then like a single manuscript or report, that contains everything relevant to that specific project. There is a lot of power in something as simple as a consistent structure. Projects are used to make life easier. Once a project is opened within RStudio the following actions are automatically taken:

  • A new R session (process) is started.
  • The R session’s working directory is set to the project directory.
  • RStudio project options are loaded.

Before moving on, let’s go over a bit about how R works, and what the “R session” means. An R session is the way you normally interact with R, where you would write code in the Console to tell R to do something. Normally, when you open an R session without an R Project, the session defaults to assuming you will be working in the ~/Desktop or ~ (your Home folder) location. But this location isn’t where you actually work. You normally work in the folder that has your R scripts or data files. The assumption with R Projects on the other hand, is that the R session working directory should be where the R Project is, since that is where you have your R scripts and data files.

Note

Each R project is designated with a .Rproj file. This file contains information about the file path and various metadata. As such, when opening an R project, you need to open it using the .Rproj file.

A project can be opened by either double clicking on the .Rproj from your file browser or from the file prompt within R Studio:

File -> Open Project

or

File -> Recent Project.. -> LearningR

Within the project we created, there are several README files in each folder that explain a bit about what should be placed there. Briefly:

  1. Documents like manuscripts, abstracts, and exploration type documents should be put in the doc/ directory (including R Markdown and Quarto files
  2. Data, raw data, and metadata should be in either the data/ directory or in data-raw/ for the raw data. We’ll explain the data-raw/ folder and create it later in the lesson.
  3. All R files and code should be in the R/ directory.
  4. Name all new files to reflect their content or function. Follow the tidyverse style guide for file naming. Either _ or - are recommended to be used instead of a space, though using - tends to be more commonly used.

Since we’ll be using Git for version control in Chapter 6, which we highly recommended to use for any project, we need to add Git to our newly created project by typing in the R Console while in the newly created LearningR project:

Console
prodigenr::setup_with_git()

This will add the .gitignore file to the project as well as to tell Git to track our project. We’ll cover this more later.

5.4 What’s in a (file) name?

It might seem so basic, but how you name your files can have a huge impact on how easy it is for others, yourself in the future, as well as computers, to work on your project.

Take some time to think about file naming. Look at the list of file names below. Which file names are good names and which aren’t? We’ll discuss afterwards why some are good names and others are not.

fit models.R
fit-models.R
foo.r
stuff.r
get_data.R
Manuscript version 10.docx
manuscript.docx
new version of analysis.R
trying.something.here.R
plotting-regression.R
utility_functions.R
code.R

Ask them explain why these might not be the best file names. It is a group activity. Use the text below as a guide for the above question.

# Bad: Has a space.
fit models.R
# Good: Descriptive with no space.
fit-models.R
# Bad: Not descriptive.
foo.r
stuff.r
# Good: Descriptive with no space.
get_data.R
# Bad: Has space
Manuscript version 10.docx
# Good: Descriptive.
manuscript.docx
# Bad: Not descriptive and has spaces.
new version of analysis.R
# Bad: Not descriptive and has dots.
trying.something.here.R
# Good: Descriptive with - or _
plotting-regression.R
utility_functions.R
# Bad: Not descriptive.
code.R

5.5 Next steps after creating the project

Now that we’ve created a project and associated folders, let’s add some more options to the project. One option to set is to ensure that every R session you start with is a “blank slate”, meaning no old data are automatically imported into the Environment. This is done by typing the following code in the Console:

Console
usethis::use_blank_slate("project")

Now, let’s add one R script that we will use in multiple sessions:

Console
usethis::use_r("learning")

The usethis::use_r() command creates R scripts in the R/ folder. As you may tell, the usethis package can be quite handy. For the first few sessions, we will be working the R scripts and then later will move over to Quarto files instead. Why? Working with R, you will be doing a lot of coding and writing in both types of files, so we want you to get practice using both.

Figure 5.2: How we can stand on the shoulders of “usethis” to be productive. Artwork by @allison_horst.

5.6 RStudio layout and usage

Open up the R/learning.R file now, which you will use to type in code for the code-along parts. You’ve already gotten a bit familiar with RStudio in the pre-course tasks, but if you want more details, RStudio has a great cheatsheet on how to use RStudio. The items to know right now are the “Console”, “Files”/“Help”, and “Source” tabs.

Code is written in the “Source” tab, where it saves the code and text as a file. You can send selected code to the Console from the opened file by typing Ctrl-Enter (or clicking the “Run” button). In the “Source” tab (where R scripts and Quarto files are shown), there is a “Document Outline” button (top right beside the “Run” button) that shows you the headers or “Sections” (more on that later). To open it you can either click the button, use the keybinding Ctrl-Shift-O or with the Palette (Ctrl-Shift-P, then type “outline”), go through the menu to Code -> Show Document Outline. The Command Palette is a very useful tool to learn, since you can easily access almost all features and options inside RStudio through it. Because of this reason, we will be using it a lot throughout the course. Open it up with Ctrl-Shift-P and then in the pop-up search bar, type out “document outline”. The first item should be the one we want, so hit Enter to activate the Outline.

If you can’t remember a specific keybinding in RStudio, check out the help for it by going to the menu item Help -> Keyboard Shortcuts Help.

5.7 Basics of using R

One useful thing to do to make your R script more readable and understandable is to use “Sections”. They’re like “headers” in Word and they split up an R script into sections, which then show up in the “Document Outline”. We can insert a section using Ctrl-Shift-R or with the Palette (Ctrl-Shift-P, then type “code section”). You can also insert the sections through the menu Code -> Insert Section.

Let them read it over, then briefly go over the content again. We don’t need to do most of this as a code-along, since we will be using them a lot over the later sessions. However, do a code-along showing how to assign data to objects, the difference between unassigned (not saved) and assigned (saved; this will be helpful in the wrangling section and piping without assigning), and how to send code to the Console.

Emphasize that, in general, code with () means it is a function and that it does an action. Mention that, like everything, there are some situations where that isn’t completely true but it mostly is.

Reading task: ~5 minutes

In R, everything is an object and every action is a function. A function is an object, but an object isn’t always a function. To create an object, also called a variable, we use the <- assignment operator:

weight_kilos <- 100
weight_kilos
[1] 100

The new object now stores the value we assigned it. We can read it like:

weight_kilos contains the number 100” or “put 100 into the object weight_kilos

You can name an object in R almost anything you want, but it’s best to stick to a style guide. For instance, we will use snake_case to name things.

There are also several main “classes” (or types) of objects in R: lists, vectors, matrices, and data frames. For now, the only two we will cover are vectors and data frames. A vector is a string of values, while a data frame is multiple vectors put together as columns. Data frames are a form of data that you’d typically see as a spreadsheet. This type of data is called “rectangular data” since it has two dimensions: columns and rows.

So these are vectors, which have different types like character, number, or factor:

# Character vector
c("a", "b", "c")
# Logic vector
c(TRUE, FALSE, FALSE)
# Numeric vector
c(1, 5, 6)
# Factor vector
factor(c("low", "high", "medium", "high"))

Notice how we use the # to write comments or notes. Whatever we write after the “hash” (#) tells R to ignore it and not run it.

This is what a data frame looks like:

head(airquality)
# A tibble: 6 × 6
  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <dbl> <int> <int> <int>
1    41     190   7.4    67     5     1
2    36     118   8      72     5     2
3    12     149  12.6    74     5     3
4    18     313  11.5    62     5     4
5    NA      NA  14.3    56     5     5
6    28      NA  14.9    66     5     6

The c() function puts values together and head() prints the first 6 rows. Both c() and head() are functions since they do an action and they can be recognized by the () at their end. Functions take an input (known as arguments) and give back an output. Each argument is separated by a comma ,. Some functions can take unlimited arguments (like c()). Others, like head() can only take a few arguments. In the case of head(), the first argument is reserved for the name of the data frame.

5.8 Using auto-completion in RStudio

Really emphasize how use auto-completion is.

To more quickly type out objects in R, use “tab-completion” to finish an object name for you. Normally RStudio will start auto-completing for you as you type code, but you can manually trigger auto-completion with Tab. As you type out an object name, hit the Tab key to see a list of objects available. RStudio will not only list the objects, but also shows the possible options and potential help associated with the object.

Try it out. In the RStudio Console, start typing:

Console
col

Then hit tab. You should see a list of functions to use. Hit tab again to finish with colnames(). This simple tool can save so much time and can prevent spelling mistakes.

If we want to get more information from data frames, we can use other functions like:

Console
# Column names
colnames(airquality)
[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"    
# Structure
str(airquality)
'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
# Summary statistics
summary(airquality)
     Ozone          Solar.R         Wind            Temp     
 Min.   :  1.0   Min.   :  7   Min.   : 1.70   Min.   :56.0  
 1st Qu.: 18.0   1st Qu.:116   1st Qu.: 7.40   1st Qu.:72.0  
 Median : 31.5   Median :205   Median : 9.70   Median :79.0  
 Mean   : 42.1   Mean   :186   Mean   : 9.96   Mean   :77.9  
 3rd Qu.: 63.2   3rd Qu.:259   3rd Qu.:11.50   3rd Qu.:85.0  
 Max.   :168.0   Max.   :334   Max.   :20.70   Max.   :97.0  
 NA's   :37      NA's   :7                                   
     Month           Day      
 Min.   :5.00   Min.   : 1.0  
 1st Qu.:6.00   1st Qu.: 8.0  
 Median :7.00   Median :16.0  
 Mean   :6.99   Mean   :15.8  
 3rd Qu.:8.00   3rd Qu.:23.0  
 Max.   :9.00   Max.   :31.0  
                              

5.9 R object naming practices

Reading task: ~5 minutes

If you’ve ever seen some old R code, you may notice that function and object names are usually short. For instance, str() is the function to see the “object structure”. Back then, there were no tab-completion tools, so typing out long names was painful. Now we have powerful auto-completion tools. So this also means that when you write R code, you should use descriptive names instead of short ones. For instance, the object weight_kilo could have been named something like x. But this doesn’t tell us what that is and doesn’t help us write better code.

The ability to read, understand, modify, and write simple pieces of code is an essential skill for a modern data analysts. So! Here’s some tips for writing R code:

  • Be descriptive with your names.
  • As with natural languages like English, write as if someone will read your code.
  • Stick to a style guide.

Even though R doesn’t care about naming, spacing, and indenting, it really matters how your code looks. Coding is just like writing. Even though you may go through a brainstorming note-taking stage of writing, you eventually need to write correctly so others can read and understand what you are trying to say. In coding, brainstorming is fine, but eventually you need to code in a readable way. That’s why using a style guide is really important.

5.10 Making code more readable

Go over this section together, not as a code-along, but instead with this section on the projector. Emphasize the “naming” vs “styling” issues topic.

The code below is in some way either wrong or incorrectly written. What is wrong with it? You don’t need to understand what the code does, just comment on the readability and anything else that might come up.

# Object names
DayOne
T <- FALSE
c <- 9

# Spacing
x[,1]
x[ ,1]
mean (x, na.rm = TRUE)
mean( x, na.rm = TRUE )
height<-feet*12+inches
df $ z
x <- 1 : 10

These issues can actually be broken down into two categories:

  • Naming issues: This issue is harder to fix and comes with experience and knowledge. For instance, the T <- FALSE is wrong because T already exists and is a short hand for TRUE while c <- 9 is wrong because c is already the name of the function c(). You normally don’t want to name code based on something that already exists in base R (“naming conflicts” between packages is fine though, since there are ways to identify and fix that). These issues can only be fixed manually.

  • Styling issues: This is much easier to fix and can largely be done automatically.

Rather than manually editing code to fit a style, we can instead do it automatically. RStudio itself has a built-in automatic styling tool, found in the menu item Code -> Reformat Code. Let’s try this styling out together. Copy and paste the code above into the R/learning.R file. Don’t run this code, we’ll just edit it to improve the styling. After pasting it, run the “Reformat Code” menu item.

The tidyverse style guide also has a package called styler that automatically fixes code to fit the style guide. With styler you can style multiple files at once, one file at a time, or based on code you select and highlight. We will make a lot of use of styling the file we are working on instead. We can do that through the Palette (Ctrl-Shift-P, then type “style file”), which should show the “Style active file” option. You’ll try it out in the next exercise.

The thing to note, is that styler isn’t perfect, for instance, it can’t change objects that are named T or c to something else. But styler is a good starting point to manually fixing up your code.

Paste the code again and run styler on the file with the the Palette (Ctrl-Shift-P, then type “style file”). Fixes it for us!

5.11 Packages, data, and file paths

A major strength of R is in its ability for others to easily create packages that simplify doing complex tasks (e.g. running mixed effects models with the lme4 package or creating figures with the ggplot2 package) and for anyone to easily install and use that package. So make use of packages!

For instance, a “metapackage” we will use throughout the course is called tidyverse, which we can load by writing this at the top of our script files:

R/learning.R

Managing which packages our analysis depends on is covered in the intermediate and advanced courses. In this course, we will get you to write library() at the top of the file for each package that the file’s code depends on. Open up the R/learning.R file and add it to the top of the script.

5.12 Encountering problems and finding help

Briefly go over this section with them, especially emphasize “Restart R”, reading the error or warning message, and checking for missing commas, brackets or misspelled words.

Reading task: ~10 minutes
Figure 5.3: A common and frequent experience when working in R. Artwork by @allison_horst.

You will encounter problems and errors when working with R, and you will encounter them all the time. In fact, a large amount of your time in R will be spent figuring out solutions to these errors (“debugging”). For this course, we have a short cheatsheet that lists the tools and functions we will cover, which can help with problems forgetting function names or their usage. RStudio also has many cheatsheets of its own, which you can find with the Command Palette (Ctrl-Shift-P, then type “cheatsheet”). However, even with these cheatsheets, you will still encounter other problems like errors or warnings. Error messages will appear in red text in your Console and will start with the word “Error:”. Warning messages are also in red text, but are often either harmless or informative, so make sure to read the message and see if it says “Error” or not. Here are some initial steps to take when you encounter an error:

  1. First, try to stay calm; problems happen to everyone, no matter their skill level. You can fix it! 😄
  2. Read through the error message and try to understand what R is telling you. Some common error messages include:
    • “Could not find function”: Usually means that you have misspelled the function or an R package has not loaded properly.
    • “Object not found”: Usually means that you have not initialized (created) the object or the object is initialized but empty.
    • “Error in…”: Usually means that you are referring to an object that doesn’t exist.
    • “Unexpected symbol in…”: Usually means that you misspelled a variable or object name, so R can’t find it.
  3. Go over the code again and carefully check for any mistakes:
    • Missing commas or pipes?
    • Missing end brackets like ], ), or }?
    • Capitalized something that shouldn’t be capitalized?
    • Object or column name misspelled?
    • Forgot to load your data before working on it?
    • Forgot to load or re-load your packages? Packages are automatically unloaded when you exit RStudio and R. So you need to load them each new session with the library() function.
  4. Go back to the start of the code and run each line one at a time, to see where the problem occurs. You will get an opportunity to practice this later, once you are working with bigger chunks of code.

If you still can’t find the problem, here are some other steps to take:

  1. Restart the R session with Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-P, then type “restart”) or with the menu item Session -> Restart R. Then load your packages (and data if needed) and run the code from the beginning, tracking which objects get created, and if the proper object name is used later on.

  2. (Rarely need to do) Close/re-open RStudio and try again.

  3. Use help() or ? to access built-in documentation about a function or package. You may be using the function incorrectly, so find out more about the function by looking at the built-in documentation. The documentation will open up in the “Help” pane of RStudio (bottom right-hand corner). Try it out: Enter either of the following commands into the Console and run it (hit Enter).

    Console
    ?colnames
    
    help(colnames)

    Sometimes, this documentation can be hard to read and seem overly complex for a beginner. You can also try finding the website for the package you are having trouble with, as they often have guides that are a little easier to understand. The tidyverse packages all have amazing documentation that you can use to help you with problems you may have.

  4. Consider explaining the problem out loud to a colleague or friend. (or even a rubber duck!) You might find that, in verbally going through the problem and explaining it, you will likely come up with the solution yourself.

  5. Take a break and come back to it later!

  6. Google it. Chances are that someone has already encountered that error and has asked about it online. In fact, those who are “experts” in coding languages like R are experts largely because of their skill in knowing the right words or terms or questions to ask Google. Usually googling the error message will be enough to find the answer, but sometimes you’ll need to include “R” or “rstats” and the relevant package or function as a keyword in your search.

  7. If all else fails, you can always turn to the trusty online R community. Check StackOverflow, a coding-related question and answer website, to see whether your issue has already been asked and solved by others. If it hasn’t and you are considering submitting a question, make sure to read the posting guides beforehand to ensure that you are asking the question in a helpful way.

Final words: It is important to always work towards writing “better” and “neater” code, as this can make it easier to break down pieces of code and troubleshoot problems. Ways to integrate this into your practice are to review documents like the tidyverse style guides regularly and perhaps join an online coding community.

5.13 Quality of life settings

Before ending, we’re going to set some RStudio options that will help you out a lot. Go to Tools -> Global Options... and do these tasks:

  1. In “General”, under the “Basic” tab, uncheck all boxes under “R Session”, “Workspaces”, and “History”, as well as changing the “Save workspace to .RData on exit” to “Never”.
  2. In “Code”, under the “Editing” tab, change the “Tab width” to 2. The tidyverse style guide as well as styler both use 2 spaces for tabs, and since we are using the package, we can set this option here to save us editing issues.
  3. In “Code”, under the “Saving” tab, check all the boxes under “General” and “Auto-save”. This last one, the “Auto-save”, will help out a lot, since one of the biggest “troubleshooting issues” we encounter when helping during the version control session is that people forget to save. This solves that problem.

5.14 Summary

  • Use R Projects in RStudio (e.g. with prodigenr).
  • Use a standard folder and file structure.
  • Use a consistent style guide for code and files.
  • Keep R scripts simple, focused, and short.
  • Use tab auto-completion when writing code.
  • Use ? to get help on an R object.