5 Management of R projects
Session objectives:
- Create self-contained projects to be more reproducible.
- Use built-in tools in RStudio to make it easier to manage R projects.
- Become familiar with the very basics of R.
- Apply tools to use a consistent “grammar” and “styling” when writing R code and making files.
- Know of and use different approaches to getting and finding help.
5.1 What is a project and why use it?
5.2 Exercise: How do you organise your files and projects?
Time: ~8 minutes.
This seems so basic, how files are organized on computers. We literally work with files all the time on computers. But consider, how do you organize them? Take some time to discuss and share with your neighbour.
- Take 1 minute to think to yourself.
- Take 5 minutes to discuss and share with your neighbour.
- For the remaining time, we will all share our thoughts with the group.
5.3 RStudio and R Projects
RStudio helps us with managing projects by making use of R Projects. RStudio R Projects make it easy to divide your work projects into “containers” that each have their own working directory (the folder where your analysis occurs), workspace (where all the R activity and output is temporarily saved), history, and documents.
File synchronizing and backup services like OneDrive or Dropbox are super common. Unfortunately, they can also be a major source of frustration and challenge when working with data analysis projects. This is mainly due to the way they synchronize: they constantly watch for changes to files and synchronize whenever a change occurs. When doing data analysis, especially as you get more advanced and use reproducible documents and version control systems, changes to files can happen very often and very quickly. This can essentially cause these services to “spasm” and may overwrite the changes that are happening. Whenever possible, save your work on your computer and not on these services.
There are many ways one could organise a project folder. We’ll be setting up a project folder and file structure using prodigenr. We’ll use RStudio’s “New Project” menu item under:
File -> New Project... -> New directory -> Scientific Analysis Project using prodigenr
We’ll call the new project LearningR. Save it in your Desktop/ folder. See Figure 5.1 for the steps to do it.
You can also type the below function into the Console, but we won’t do that in this session.
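The function call itself does not appear in this text. Going by the package and function named in the reminder just below, and the project name and location given above, it would look roughly like this sketch. The guard around the call is an addition made here, so that the snippet does nothing if prodigenr is missing or the folder already exists.

```r
# Name and location taken from the text above: a LearningR folder on the Desktop.
project_path <- file.path("~", "Desktop", "LearningR")

# Assumption: prodigenr is installed. Skip the call if it is not, or if the
# project folder already exists.
if (requireNamespace("prodigenr", quietly = TRUE) && !dir.exists(project_path)) {
  prodigenr::setup_project(project_path)
}
```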
Just a reminder, when we use the double colon :: here, we are saying:
Hey R, from the prodigenr package use the setup_project function.
That way, we are directly requesting R to look in the prodigenr package and use the setup_project() function. We do this because we want to be explicit about what we want to use and because we don’t need to load the full package.
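To see the :: syntax in isolation, here is a small sketch that uses the stats package, which ships with R, rather than prodigenr. The object names are made up for illustration.

```r
numbers <- c(2, 4, 4, 10)

# Explicit: ask for median() from the stats package by name.
median_explicit <- stats::median(numbers)

# Implicit: stats is attached by default, so R finds median() on its own.
median_implicit <- median(numbers)

# Both routes call the same function, so the results match.
identical(median_explicit, median_implicit)
```

The explicit form is most useful for packages that are not attached by default, since it works without a library() call.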
After we’ve created a New Project in RStudio, we’ll have a bunch of new files and folders.
LearningR
├── R
│ └── README.md
├── data
│ └── README.md
├── data-raw
│ └── README.md
├── doc
│ └── README.md
├── .gitignore
├── DESCRIPTION
├── LearningR.Rproj
├── README.md
└── TODO.md
This imposes a specific and consistent folder structure on all your work. Think of this like the “Introduction”, “Methods”, “Results”, and “Discussion” sections of your paper. Each project is then like a single manuscript or report that contains everything relevant to that specific project. There is a lot of power in something as simple as a consistent structure. Projects are used to make life easier. Once a project is opened within RStudio, the following actions are automatically taken:
- A new R session (process) is started.
- The R session’s working directory is set to the project directory.
- RStudio project options are loaded.
Before moving on, let’s go over a bit about how R works, and what the “R session” means. An R session is the way you normally interact with R, where you would write code in the Console to tell R to do something. Normally, when you open an R session without an R Project, the session defaults to assuming you will be working in the ~/Desktop or ~ (your Home folder) location. But this location isn’t where you actually work. You normally work in the folder that has your R scripts or data files. The assumption with R Projects, on the other hand, is that the R session working directory should be where the R Project is, since that is where you have your R scripts and data files.
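As a quick check of which folder a session is working in, base R can report the working directory. This is a minimal sketch; the object name is made up for illustration.

```r
# Where is this R session currently working?
working_directory <- getwd()
working_directory

# Relative file paths are resolved from that folder, so with an R Project
# open this lists the project's own files and folders.
list.files(working_directory)
```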
Each R project is designated with a .Rproj file. This file contains information about the file path and various metadata. As such, when opening an R project, you need to open it using the .Rproj file.
A project can be opened by either double clicking on the .Rproj file from your file browser or from the file prompt within RStudio:
File -> Open Project
or
File -> Recent Project.. -> LearningR
Within the project we created, there are several README files in each folder that explain a bit about what should be placed there. Briefly:
- Documents like manuscripts, abstracts, and exploration type documents should be put in the doc/ directory (including R Markdown and Quarto files). We will cover this later in Chapter 8.
- Data, raw data, and metadata should be in either the data/ directory or in data-raw/ for the raw data. We’ll explain the data-raw/ folder and create it later in the lesson.
- All R files and code should be in the R/ directory.
- Name all new files to reflect their content or function. Follow the tidyverse style guide for file naming. Either _ or - is recommended instead of a space, though - tends to be more commonly used.
Since we’ll be using Git for version control in Chapter 6, which we highly recommend for any project, we need to add Git to our newly created project by typing in the R Console while in the newly created LearningR project:
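The command is not reproduced in this text. As an assumption, one function that does what the next paragraph describes is usethis::use_git(); the course materials may use a different helper. It asks questions in the Console, so this sketch only runs in an interactive session.

```r
# Assumptions: usethis and Git are installed, and the LearningR project is open.
if (interactive() && requireNamespace("usethis", quietly = TRUE)) {
  # Starts tracking the project with Git and sets up the .gitignore file.
  usethis::use_git()
}
```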
This will add the .gitignore file to the project and tell Git to track our project. We’ll cover this more later.
5.4 What’s in a (file) name?
It might seem so basic, but how you name your files can have a huge impact on how easy it is for others, yourself in the future, as well as computers, to work on your project.
Take some time to think about file naming. Look at the list of file names below. Which file names are good names and which aren’t? We’ll discuss afterwards why some are good names and others are not.
fit models.R
fit-models.R
foo.r
stuff.r
get_data.R
Manuscript version 10.docx
manuscript.docx
new version of analysis.R
trying.something.here.R
plotting-regression.R
utility_functions.R
code.R
5.5 Next steps after creating the project
Now that we’ve created a project and associated folders, let’s add some more options to the project. One option to set is to ensure that every R session you start with is a “blank slate”, meaning no old data are automatically imported into the Environment. This is done by typing the following code in the Console:
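The code is not shown in this text. A sketch of one way to do it, assuming the usethis package (which the lesson uses just below) and its use_blank_slate() function:

```r
# Assumptions: usethis is installed and the LearningR project is open.
# scope = "project" applies the setting to this project only.
if (interactive() && requireNamespace("usethis", quietly = TRUE)) {
  usethis::use_blank_slate(scope = "project")
}
```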
Now, let’s add one R script that we will use in multiple sessions:
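The call is missing here, but the function named in the next paragraph and the R/learning.R file opened in the next section suggest something like this sketch. The object names are made up for illustration.

```r
# Name of the script without the folder or the .R extension.
script_name <- "learning"

# The file that use_r() is expected to create, relative to the project folder.
expected_file <- file.path("R", paste0(script_name, ".R"))

# Assumptions: usethis is installed and the LearningR project is open.
if (interactive() && requireNamespace("usethis", quietly = TRUE)) {
  # Creates the script if it does not exist yet and opens it for editing.
  usethis::use_r(script_name)
}
```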
The usethis::use_r() command creates R scripts in the R/ folder. As you may tell, the usethis package can be quite handy. For the first few sessions, we will be working in R scripts and then later will move over to Quarto files instead. Why? Working with R, you will be doing a lot of coding and writing in both types of files, so we want you to get practice using both.
5.6 RStudio layout and usage
Open up the R/learning.R file now, which you will use to type in code for the code-along parts. You’ve already gotten a bit familiar with RStudio in the pre-course tasks, but if you want more details, RStudio has a great cheatsheet on how to use RStudio. The items to know right now are the “Console”, “Files”/“Help”, and “Source” tabs.
Code is written in the “Source” tab, where it saves the code and text as a file. You can send selected code to the Console from the opened file by typing Ctrl-Enter (or clicking the “Run” button). In the “Source” tab (where R scripts and Quarto files are shown), there is a “Document Outline” button (top right beside the “Run” button) that shows you the headers or “Sections” (more on that later). To open it, you can click the button, use the keybinding Ctrl-Shift-O, use the Command Palette (Ctrl-Shift-P, then type “outline”), or go through the menu to Code -> Show Document Outline. The Command Palette is a very useful tool to learn, since you can easily access almost all features and options inside RStudio through it. For this reason, we will be using it a lot throughout the course. Open it up with Ctrl-Shift-P and then in the pop-up search bar, type out “document outline”. The first item should be the one we want, so hit Enter to activate the Outline.
If you can’t remember a specific keybinding in RStudio, check out the help for it by going to the menu item Help -> Keyboard Shortcuts Help.
5.7 Basics of using R
One useful thing to do to make your R script more readable and understandable is to use “Sections”. They’re like “headers” in Word and they split up an R script into sections, which then show up in the “Document Outline”. We can insert a section using Ctrl-Shift-R or with the Palette (Ctrl-Shift-P, then type “code section”). You can also insert the sections through the menu Code -> Insert Section.
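In a script, a section is just a comment line that ends in at least four dashes. A small sketch of what that looks like (the section titles and object names are made up for illustration):

```r
# Load data ---------------------------------------------------------------
# The comment above ends in dashes, so RStudio lists it as a section.
heights_cm <- c(172, 181, 165)

# Summarise ---------------------------------------------------------------
mean_height <- mean(heights_cm)
mean_height
```

Both section titles would then show up in the “Document Outline”.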
5.8 Using auto-completion in RStudio
To type out objects in R more quickly, use “tab-completion” to finish an object name for you. Normally RStudio will start auto-completing for you as you type code, but you can manually trigger auto-completion with Tab. As you type out an object name, hit the Tab key to see a list of objects available. RStudio will not only list the objects, but also show the possible options and potential help associated with the object.
Try it out. In the RStudio Console, start typing the first few letters of a function name, for example col. Then hit Tab. You should see a list of functions to use. Hit Tab again to finish with colnames(). This simple tool can save so much time and can prevent spelling mistakes.
If we want to get more information from data frames, we can use other functions like:
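The calls themselves are missing from this text. Judging by the column names and dimensions in the output that follows, they were most likely run on airquality, a data set that ships with R. A sketch under that assumption:

```r
# Names of the columns in the data frame.
colnames(airquality)

# Structure: number of rows and columns, plus the type of each column.
str(airquality)

# Summary statistics for each column.
summary(airquality)
```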
#> [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
#> 'data.frame': 153 obs. of 6 variables:
#> $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
#> $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
#> $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
#> $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
#> $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
#> Ozone Solar.R Wind Temp
#> Min. : 1.0 Min. : 7 Min. : 1.70 Min. :56.0
#> 1st Qu.: 18.0 1st Qu.:116 1st Qu.: 7.40 1st Qu.:72.0
#> Median : 31.5 Median :205 Median : 9.70 Median :79.0
#> Mean : 42.1 Mean :186 Mean : 9.96 Mean :77.9
#> 3rd Qu.: 63.2 3rd Qu.:259 3rd Qu.:11.50 3rd Qu.:85.0
#> Max. :168.0 Max. :334 Max. :20.70 Max. :97.0
#> NA's :37 NA's :7
#> Month Day
#> Min. :5.00 Min. : 1.0
#> 1st Qu.:6.00 1st Qu.: 8.0
#> Median :7.00 Median :16.0
#> Mean :6.99 Mean :15.8
#> 3rd Qu.:8.00 3rd Qu.:23.0
#> Max. :9.00 Max. :31.0
#>
5.9 R object naming practices
5.10 Making code more readable
The code below is in some way either wrong or incorrectly written. What is wrong with it? You don’t need to understand what the code does, just comment on the readability and anything else that might come up.
# Object names
DayOne
T <- FALSE
c <- 9
# Spacing
x[,1]
x[ ,1]
mean (x, na.rm = TRUE)
mean( x, na.rm = TRUE )
height<-feet*12+inches
df $ z
x <- 1 : 10
These issues can actually be broken down into two categories:
- Naming issues: This issue is harder to fix and comes with experience and knowledge. For instance, T <- FALSE is wrong because T already exists and is a shorthand for TRUE, while c <- 9 is wrong because c is already the name of the function c(). You normally don’t want to name objects after something that already exists in base R (“naming conflicts” between packages are fine though, since there are ways to identify and fix them). These issues can only be fixed manually.
- Styling issues: This is much easier to fix and can largely be done automatically.
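To make the naming problem concrete, here is a small sketch of what happens after T has been overwritten. The object names are made up for illustration.

```r
# T is an ordinary variable that holds TRUE, so it can be overwritten.
T <- FALSE
with_shortcut <- mean(c(1, NA, 3), na.rm = T)  # na.rm is now FALSE

# Remove our copy so that T refers to the built-in value again.
rm(T)
with_full_name <- mean(c(1, NA, 3), na.rm = TRUE)

with_shortcut   # NA, because the missing value was not removed
with_full_name  # 2
```

Writing TRUE and FALSE in full avoids the problem, since those two words cannot be reassigned.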
Rather than manually editing code to fit a style, we can instead do it automatically. RStudio itself has a built-in automatic styling tool, found in the menu item Code -> Reformat Code. Let’s try this styling out together. Copy and paste the code above into the R/learning.R file. Don’t run this code, we’ll just edit it to improve the styling. After pasting it, run the “Reformat Code” menu item.
The tidyverse style guide also has a package called styler that automatically fixes code to fit the style guide. With styler you can style multiple files at once, one file at a time, or only the code you select and highlight. We will mostly style the file we are currently working on. We can do that through the Palette (Ctrl-Shift-P, then type “style file”), which should show the “Style active file” option. You’ll try it out in the next exercise.
The thing to note is that styler isn’t perfect. For instance, it can’t change objects that are named T or c to something else. But styler is a good starting point for manually fixing up your code.
Paste the code again and run styler on the file with the Palette (Ctrl-Shift-P, then type “style file”). It fixes the styling for us!
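styler can also be called from code instead of the Palette. The sketch below assumes styler is installed and uses its style_text() function on three of the badly spaced lines from the exercise; the object names are made up for illustration.

```r
# Three of the badly spaced lines from the exercise, stored as text.
messy_code <- c(
  "height<-feet*12+inches",
  "mean (x, na.rm = TRUE)",
  "x <- 1 : 10"
)

# style_text() only rewrites the text. It does not run the code, so feet,
# inches, and x do not need to exist.
if (requireNamespace("styler", quietly = TRUE)) {
  styled_code <- styler::style_text(messy_code)
  print(styled_code)
}
```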
5.11 Packages, data, and file paths
A major strength of R is in its ability for others to easily create packages that simplify doing complex tasks (e.g. running mixed effects models with the lme4 package or creating figures with the ggplot2 package) and for anyone to easily install and use that package. So make use of packages!
For instance, a “metapackage” we will use throughout the course is called tidyverse, which we can load by writing library(tidyverse) at the top of our script files.
Managing which packages our analysis depends on is covered in the intermediate and advanced courses. In this course, we will get you to write library() at the top of the file for each package that the file’s code depends on. Open up the R/learning.R file and add it to the top of the script.
5.12 Encountering problems and finding help
5.13 Quality of life settings
Before ending, we’re going to set some RStudio options that will help you out a lot. Go to Tools -> Global Options... and do these tasks:
- In “General”, under the “Basic” tab, uncheck all boxes under “R Session”, “Workspaces”, and “History”, as well as changing the “Save workspace to .RData on exit” to “Never”.
- In “Code”, under the “Editing” tab, change the “Tab width” to 2. The tidyverse style guide as well as styler both use 2 spaces for tabs, and since we are using the package, we can set this option here to save us editing issues.
- In “Code”, under the “Saving” tab, check all the boxes under “General” and “Auto-save”. This last one, the “Auto-save”, will help out a lot, since one of the biggest “troubleshooting issues” we encounter when helping during the version control session is that people forget to save. This solves that problem.
5.14 Summary
- Use R Projects in RStudio (e.g. with prodigenr).
- Use a standard folder and file structure.
- Use a consistent style guide for code and files.
- Keep R scripts simple, focused, and short.
- Use tab auto-completion when writing code.
- Use ? to get help on an R object.
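As a brief illustration of the last point, either of these lines opens the help page for a function. mean() is used here only as an example.

```r
# Shorthand: a question mark followed by the name of the object.
?mean

# The same request written as a regular function call.
help("mean")
```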