1 Syllabus
We are in a special time in research. Researchers face several large-scale technological and societal changes:
- Researchers are experiencing higher demands from funding agencies, universities, and peers for transparency and rigour in their research.
- Research is becoming increasingly complex, requiring greater degrees of (potentially highly distributed and virtual) team-based science.
- Public attention on research is growing, with mass participation and engagement through the Internet and social media.
- Greater access to powerful computing resources and massive datasets is driving more complex analytics and data processing, such as through machine learning and AI.
- Increasingly, one's research output is someone else's research input, as with meta-research¹ or meta-analysis.
- Large language models (LLMs) and similar tools that can help with coding, writing, and other aspects of research are becoming more common, which makes it even more important to learn how to code and to understand what code means and what it is doing.
At the same time, institutional support, training, and incentive structures for researchers to adapt to these changes are far behind what is necessary to keep pace. Connected to many of these changes are reproducible and open scientific practices, which researchers rarely receive sufficient training in.
Reproducibility in particular requires more than just writing code. It requires using tools and practices that enforce or enable a higher degree of reproducibility, organisation, and transparency or record-keeping.
This workshop will introduce many of the core concepts and practices for doing reproducible and open data analysis to get you familiar with and prepared for the type of work needed in research now and in the future. We use a very practical approach based largely on code-along sessions (instructor and learner coding together), hands-on exercises, reading activities, and a team project.
This workshop lasts 3 days and is split into multiple sessions listed in the schedule (Chapter 3).
1.1 Learning outcome and objectives
The overall aim of this workshop is to enable you to:
- Describe the fundamentals of what an open and reproducible data analysis looks like and then create a project that applies some of the basics of these concepts using R, RStudio, Git, and Quarto.
Broken down into specific objectives for each session, we’ve designed the workshop to enable you to do the following:
- Describe why the contents of a project, like files, should be self-contained within a single folder and sub-folders, and explain how that helps with reproducibility and helping you keep organized.
- Use built-in tools in RStudio to make it easier to manage R projects.
- Explain why naming files is the first step to effectively communicating the contents of the file and describe a commonly used style guide for file naming.
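As a concrete (and entirely hypothetical) illustration of these two ideas, a self-contained project whose file names follow a consistent style, such as the tidyverse style guide's convention of lowercase names using only letters, numbers, `-`, and `_`, might look like:

```
my-project/                      # everything lives in this one folder
├── my-project.Rproj             # RStudio project file
├── data/
│   └── 2024-01-15_survey-raw.csv
├── doc/
│   └── report.qmd
└── R/
    └── functions.R
```

The folder and file names here are made up for illustration; the point is that anyone opening the folder can tell what each file contains from its name alone.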
Analytically reproducible documents
- Describe what Quarto is and how it helps with reproducibility, and why it can save you time and effort.
- Write and use R code within a document, so that it will automatically insert the R output into the final document.
- Identify what some Markdown syntax does to the formatting of the generated HTML file and write some Markdown to apply simple formatting to your document.
- Define some basic objects and data types in R and write some simple R code within the Quarto document.
- Explain what “formal” version control is and recognize its importance.
- Use the basic functionality of Git for version control through RStudio’s integrated Git interface.
- Describe and apply the basic workflow of Git version control: view changes to files, then record and save those changes to the history (called “committing”).
- Describe different ways of importing data into R and identify which method to use based on the file extension, then use one of these methods to import a dataset.
- Explain how to ensure your project is self-contained by using relative file paths within the R project, and use the `here::here()` function to make these paths.
- Use a tool called “auto-completion” to more quickly type out objects in R.
- Use options in Quarto Markdown’s code chunks to control what gets shown from the output of the code chunk when generating the output document.
- Describe the difference between “messy” and “tidy” data and explain why you should aim to make data tidier.
- Describe and list some features of the “Grammar of Graphics” approach to creating plots, and connect that to how ggplot2 works.
- Identify ways to present your figures in an output document by using Quarto chunk options.
- Explain why some commonly used graphs in science are inappropriate to use for certain data, like the barplot with mean and standard error.
- Use the `geom_point()` and `geom_smooth()` functions to create a scatterplot of two continuous variables, or `geom_bar()` for showing counts of discrete variables.
- Explain the importance of writing readable code that follows a consistent style guide and use the styler package to help with that.
- Describe the difference between a “remote” and a “local” repository.
- Explain why GitHub can be an effective way to collaborate with others on a project.
- Use GitHub to store your Git repository by connecting your R project to GitHub.
- Use “pushing” and “pulling” to synchronize any changes you make between the local and remote repositories.
- Select specific columns in a dataset using the `select()` function.
- Rename columns in a dataset using the `rename()` function.
- Filter rows in a dataset using the `filter()` function.
- Modify or add columns in a dataset using the `mutate()` function.
- Use the pipe operator `|>` to chain actions together.
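These wrangling steps can be sketched as follows. This is a minimal, hypothetical example: it uses R's built-in mtcars dataset rather than the workshop's data, and the new column names are made up.

```r
library(dplyr)

# In a real project you would first import your own data, e.g. with
# readr::read_csv(here::here("data", "dataset.csv")); here we convert
# the built-in mtcars dataset so the sketch is self-contained.
cars <- tibble::as_tibble(mtcars, rownames = "car")

cars |>
  select(car, mpg, cyl, hp) |>    # keep only the columns we need
  rename(horsepower = hp) |>      # give a column a clearer name
  filter(cyl == 4) |>             # keep only rows with 4 cylinders
  mutate(km_per_l = mpg * 0.425)  # add a column converted to km per litre
```

Each of these functions takes a dataset as its first argument and returns a modified dataset, which is what makes the `|>` pipe read as a step-by-step recipe.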
Data wrangling with visualizing
- Apply what was learned separately in the wrangling and visualizing sessions by piping the output of one into the other.
- Explain why the choice of colour matters when it comes to having figures be accessible and understandable to more people, especially to those with colour blindness.
- Use features of ggplot2 to make plots that contain three or more variables in them, through the use of colours and facetting.
- Describe the “split-apply-combine” method of analyses, then use `group_by()` together with `summarise()` to calculate summary statistics by categorical variables.
- Create tables of results in a document using `knitr::kable()`.
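A minimal sketch of the split-apply-combine pattern, again using the built-in mtcars dataset rather than workshop data:

```r
library(dplyr)

mtcars |>
  group_by(cyl) |>           # split: one group per cylinder count
  summarise(
    mean_mpg = mean(mpg),    # apply: a summary statistic per group
    sd_mpg = sd(mpg)
  ) |>                       # combine: results come back as one table
  knitr::kable(digits = 1)   # format as a table in the output document
```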
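Likewise, piping a small wrangling step directly into a plot that uses colours and facets might look like this sketch (built-in mtcars data; the variable choices are illustrative only):

```r
library(dplyr)
library(ggplot2)

mtcars |>
  mutate(cyl = as.factor(cyl)) |>             # wrangle, then pipe into the plot
  ggplot(aes(x = wt, y = mpg, colour = cyl)) +
  geom_point() +
  facet_wrap(vars(am)) +                      # facets add another variable
  scale_colour_viridis_d()                    # a colour-blind-friendly palette
```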
Because learning and coding are ultimately not just solo activities, in addition to the team project work, during this workshop we also aim to provide opportunities to chat with fellow participants, learn about their work and how they do analyses, and build networks of support and collaboration.
The workshop will place particular emphasis on research in diabetes, health, and metabolism; it will be taught by instructors working in this field and it will use relevant examples where possible.
1.2 Tangible goals
In this workshop, our main tangible goal is to:
- Create a project that has a report (in HTML or Word) where you reproducibly import some data, process it a bit, and create some figures and tables, all done in a way that makes it easier for you and others to collaborate together.
To achieve this, we will:
- Have a self-contained project (within a single folder).
- Have a record of changes made to the files.
- Make it easier for others to collaborate.
- Make it simpler to connect the project with a scientific output like a paper.
- Structure analyses to be more reproducible (or at least more easily inspectable).
Specifically, to achieve these goals, we will:
- Use RStudio to write and run R code.
- Use the tidyverse bundle of R packages to wrangle and visualize data.
- Use the Git interface in RStudio to track changes to your files.
- Use GitHub to store the Git “repository” (folder) to collaborate and share with others.
- Use Quarto to write reproducible documents.
¹ Evidence-based evaluation and development of research methods.