6  Version control with Git

Session objectives:

  1. Learn about “formal” version control and its importance.
  2. Learn about Git for version control and apply RStudio’s integrated Git tools.
  3. Learn and apply the basic workflow of Git version control: View changes to files, record and save those changes to a history, and synchronize those changes to an online repository (GitHub).
  4. Use GitHub to collaborate with others on a project.

6.1 What is version control?

Before starting this session, get them to share some of the things they discussed during the social activity.

They will have already read it once before, during the pre-course tasks. Tell them that they’ll read it again to refresh their memories and reinforce the concepts. This is important because Git and version control is something that is very difficult to conceptually understand and takes some time and practice.

For the reading parts, let them read it first and then walk through the material again, to reinforce the importance of version control and doing it formally. So give them a heads up that you’ll be repeating things, specifically to reinforce the concepts.

It’s important in this session to go slowly. Version control is a challenging topic and isn’t something most people have ever learned about or dealt with. So take it slow and make sure everyone is on the same page. Make use of stickies frequently to assess how everyone is doing.

Reading task: ~8 minutes

This session is very text and reading heavy compared to other sessions. This is mostly because this topic requires a mental paradigm shift in how you view files, and requires you to change your habits of how you normally work. While you may get more short-term use out of the R portion of this course, knowing and using version control concepts and tools will fundamentally change how you work over the long term. While the concepts are quite difficult, the tools to use the concepts aren’t, and using them often will make the concepts easier to understand.

Figure 6.1: ‘Final’ version of a document, using a common but manual ‘version control system’.
Figure 6.2: Filenaming used in the commonly used ‘version control’.

Does this way of saving files and keeping track of versions look familiar? While the above images are teasing a bit, there is truth to it: It is the most commonly used “version control”.

This form of version control, while common, is fairly primitive, informal, and very manual. It isn’t ideal because it requires making multiple copies of the same file, even if changes are made to only one small part of the file. This approach also makes it difficult to find specific changes.

There are, however, formal version control systems that automatically manage changes to a file or files. These formal version control systems take snapshots of changes done to files, which are usually called “revisions” or “commits”. These “commits” record what was changed since the previous “commit”. When you make these “commits”, you have to create a short message on what or why you made a change. These “commits” and their messages are stored as a log entry in a history. This history then has all this information, for each commit, on which file or files were changed, what was changed within the file(s), who changed it, and a short message about the change. This is extremely useful, especially when working in teams, or for yourself 6 months in the future (because you will forget things), since you can go back and quickly see what happened and why.

To understand how incredibly powerful version control is, think about these questions:

  • How many files of different versions of a manuscript or thesis do you have laying around after getting feedback from your supervisor or co-authors?
  • Have you ever wanted to experiment with your code or your manuscript and need to make a new file so that the original is not modified?
  • Have you ever deleted something and wish you hadn’t?
  • Have you ever forgotten what you were doing on a project, or why you chose a particular strategy or analysis?

All these problems can be fixed by using formal version control! There are so many good reasons to use version control, especially in science:

  • Transparency of work done to demonstrate or substantiate your scientific claim.
  • Claim to first discovery, since you have a time-stamped history of your work.
  • Defence against fraud, because of the transparency.
  • Evidence of contributions and work, since who does what is tracked.
  • Keeping track of changes to files easily, by looking at the history of changes.
  • Easy collaboration, because you can work on a single file/folder rather than emailing versions around.
  • Organized files and folders, since there is one single project folder and one single version of each file, rather than multiple versions of the same file.
  • Less time finding things, because everything is organized and in one place.

In this session we’ll be covering a version control tool called Git. While Git on its own can be quite difficult to use, RStudio thankfully has an amazing and straight-forward integration to it.

6.2 What is Git?

Reading task: ~5 minutes

Git is one of several version control system tools available. It was developed to help software programmers to develop and manage their work on Linux (an operating system like Mac or Windows). Sadly, it was designed by and for software programmers and not for non-programming users like us researchers. So why do we teach it? Because Git has so many great features that fit with how science and data analysis is done.

  • Like R, it is open source, so it’s free and anyone verify how trustworthy and correct the code is.
  • It is very popular and so has a very large online community that provides support, documentation, and tutorials on how to use it.
  • The vast majority of open source projects and work, such as making R packages, are done using Git and are hosted on GitHub, which is a company that hosts Git “repositories” (i.e. projects) online.
    • All RStudio code and tidyverse packages are on GitHub.
  • There are many open scientific projects that use Git and are hosted on GitHub, e.g. rOpenSci organization or MRC Integrated Epidemiology Unit.
  • RStudio has an amazing interface and integration with Git.

While learning Git and version control can be difficult and has a steep learning curve, like learning R, it is ultimately an investment into your future productivity and effectiveness as a researcher. It is very much worth it to learn and use it as often as you can.

While many people may use Git to manage their R scripts, you can also manage other non-code based files like Word or images in Git. Version control is useful for any project that uses any type of files since they can be saved in the Git history (though there are some limitations to using Git for files that are not plain text like R scripts are). You can save files either by adding and committing them (as we will learn to do shortly) or by uploading them to your repository on GitHub, as pictured below.

Figure 6.3: GitHub: Select “Upload files” from the “Add file” drop-down menu.

6.3 Basics of Git

Just like above, they’ve already read this section but should read it again to reinforce the concepts. Briefly go over it again, especially making sure everyone has their Git properly configured. Even though they do that already during the pre-course tasks, sometimes things happen in the meantime, so its good to check again.

Reading task: ~8 minutes

Git works by tracking changes to files at the project level (i.e. for every R Project). So you won’t track your entire, for instance, Documents/ folder. When file changes are saved and put into the history, this history is called a “repository” (also called a “repo” for short). We’ll explain more about what a repository is later. What you do with Git is more or less to:

  • Set up Git in your project or folder by starting it as a “repository”.
  • Tell Git to track a file by preparing it to be saved to the history.
  • Save changes to files in the history with a message you recorded about the change.

Other things you can do with Git:

  • Check what’s been changed or added in your files since the last save.
  • Check the history for what was previously changed or added.

When working with GitHub, there are extra things you can do (more on this later):

  • Synchronize the Git repository on your computer with the repository on your GitHub, called “push” (upload) and “pull” (download).

So first off, what exactly is the Git repository? The Git repository works at the project (the folder) level because it stores the version history in the hidden .git/ folder. In Windows, this folder will probably not be hidden, but in Mac and Linux, files and folders that start with . are automatically hidden. The .git/ folder itself is the repository used by Git to store the file changes and history of the project. So don’t delete it!

Another important file for managing the repository is the .gitignore file. This file tells Git to not track (or “watch”) certain files, such as temporary files. This is particularly important for making sure you don’t save personal data in the Git history. Personal data should be stored in a secure location and should only be accessed in the R Project by loading it from that secure location.

LearningR
├── .git/ <-- Git repository stored here
├── R/
├── data/
├── data-raw/
├── doc/
├── .gitignore <-- Tells Git which files NOT to save
├── LearningR.Rproj
├── DESCRIPTION
└── README.md

Setting up a Git repository can be done in several ways:

  • When starting a new R Project, the “Create a git repository” option can be selected from the “New Projects” setup instructions in RStudio.
  • For existing R Projects, you can type in the console usethis::use_git() (don’t do this for the course!).

6.4 Using Git in RStudio

Since they will be using the Git interface quite a bit, really take your time walking through it and describing it. Show where things are and what things to focus on for this course.

Git was initially created to be used in the terminal (i.e command-line). However, because RStudio has a very nice interface for working with Git, we’ll be using that interface so we don’t have to switch to another application. While the terminal provides full access to Git’s power and features, the vast majority of daily use can be done through RStudio’s interface.

To access the Git interface in RStudio, click the Git icon beside the “Go to” search bar (see Figure 6.4) and then click the “Commit…” option using Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).

Figure 6.4: The location of the Git button in RStudio.

The Git interface should look something like Figure 6.5 below. A short written description is given below the image.

Figure 6.5: The Git interface in RStudio.
  1. This is the “Changes” and “History” buttons that allow you to switch between views. Changes is what is currently changed in your files relative to the last history item. History is the record of what was done, to what file, when, and by whom.
  2. These are the “Push” and “Pull” buttons that are used to synchronize with GitHub, which we will cover later in the session.
  3. This is the panel that lists the files that have been modified in some way. You add (“stage”) files here that you want to be put into the history.
  4. This is the Commit Message box where you write the message about the changes that will be put into the history.
  5. This is the panel that shows what text has been modified, added, or removed from the file selected in panel 3. Green highlight indicates that something has been added, while red indicates a removal. Changes are detected at the line level (what line in the file). For files that aren’t plain text-based (e.g. Word), you can not see what specifically was changed, it will only say that there is a change.

Do this part as a code-along, after having explained the above first.

So far, it should show a bunch of files that we’ve added and used over the previous session. In the Git interface, select the README.md file. You should see the text in the file, all in green. Green means the text has been added. Red, which you will see shortly, means text was removed.

Now click the “Staged” checkbox besides the README.md file to get it ready to be saved into the history. You’ve now “added” it to the staged area. Note that when you have a lot of files to stage, you can stage them all at once by pressing Ctrl-A to highlight and select all the files and then clicking the “Staged” checkbox (or hitting Space) on one of the files. The box on the right side is where you type out your “commit” message. “Commit” means you save something to the history of changes. You “commit” it to the history, like you “commit” something to your own memory.

Reading task: ~5 minutes

Before we move on, there are some things to know about how Git works. In Git, there are three “states” that a file can be in, listed below and summarised in Figure 6.6.

  1. The Working folder state is where all files are, whether they are “untracked” or “tracked”. Untracked is when Git sees the file, but it has not yet entered the history. Tracked is when the file has been saved in the history and Git “watches” it for changes.
  2. The Staged state is when a file has a change that is different compared to the version in the history and it has been checked (“added”) into the “Staged” area (by ticking the checkbox beside the file in the Git interface).
  3. The History or Committed state is when a “commit” message has been written and the file with its changes has been saved into the repository history.
%%{init:{'themeCSS': ".actor {stroke: DarkBlue;fill: White;stroke-width:1.5px;}", 'sequence':{'mirrorActors': false}}}%%
sequenceDiagram
    participant W as Working folder
    participant S as Staged
    participant H as History
    W->>S: Add
    S->>H: Commit
Figure 6.6: The three states that files and folders can be in, when using Git.

This system allows us to keep a journal (a log) of what has been changed, why it has been changed, who changed it, and when. Figure 6.7 below shows an example log of the history of a previous version of this lesson, which makes it easy to get an overview of what is happening in a project.

Figure 6.7: An example of a history for this course’s repository.

You may notice that the messages in the log give a bit of detail about why a change was made, though it’s not always the case. Sometimes a message like “minor edit” is enough, because it was a minor edit.

A general tip for writing an effective commit message is to be concise but meaningful. Writing down meaningful messages can save you a lot of time in the future when you come back to a project after some time and forget what you were doing. With a well written history you can get a quick idea or reminder about the state of the project.

Verbally explain the above to reinforce the concepts, than do the rest as a code-along.

For going over the Git history pane, demonstrate how you can open a file at that commit, so that nothing is ever lost.

Ok, now write something like “Add initial README file” in the commit message box and commit the change. After clicking “Commit”, you’ll notice that the README.md file is no longer on the left side. That’s because we’ve put the change into the history. We can view the history by clicking the “History” button in the top-left corner of the RStudio Git interface. Here you can see what has been done in previous commits.

The “History” section is quite powerful. As long as you commit something into the Git history, it will never be completely gone1. For instance, we can see the full contents of a file at a specific commit by clicking the commit, moving to the file you want to look into, and clicking the link that says View file @ .... Try that with the first commit of the README file. See how it shows what was there before you made more changes?

1 This isn’t completely true, you can delete stuff, like if you accidentally add a password or personal data.

Next, open up the README.md file in RStudio using the Files tab. At the top of the file, write your name and your field of research, and then save the file. Open up the Git interface again (with the Git icon or with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”)). You should now see the added text in green. Alright, now “Stage” the change (click the checkbox), write a message like “added my name to README file”, and commit the change. Go back to the history and you should see the two commits in your repository. If you don’t see it in the history, you likely need to click the refresh button at the top.

A question that may come up is “how often should I commit”? In general, it’s better to commit fairly frequently and to commit changes that are related to each other and to the commit message. Following this basic principle will make your history easier for you to read and make it easier for others as well.

6.5 Exercise: Committing to history

Time: 15 minutes.

When working on your own projects and when you use Git, you will be committing a lot of changes to your files into the Git history. Part of the initial barrier is simply getting used to this workflow of committing what you’ve changed. Use this exercise to get some practice. We will be using this workflow often throughout the rest of the course.

  1. Practice the add-commit (“add to staging”-“committing to history”) sequence by adding and committing each of the remaining files in your R project one at a time into the Git history (e.g. the .gitignore, the .R files, and the .Rproj file). While you could add and commit them all at once, we want you to do them one at a time so you practice using this workflow.
    • Make sure to write a meaningful and short message about what you added and why. In this case, the “why” is simply that you are saving the file into the history for the first time.
  2. Once all the files have been added and committed, add a new line to the R/learning.R with an R comment (starts with a #). Type out something like “This will be used for testing out Git”. Add and commit that new line you’ve written.

6.6 “Remotes”: Storing your repository online

Briefly go over this next section, especially highlighting the image.

A version control system that didn’t include a type of external backup wouldn’t be a very good system, because if something happened to your computer, you’d lose your Git repository. In Git, this “external” backup is called a “remote” (meaning it is something that is separate from and in a different location, usually online, than the main repository). The remote repository is essentially a duplicate copy of the history (the .git/ folder) of your local repository (on your computer), so when you synchronize with the remote, as illustrated in Figure 6.8, it only copies over the changes made as commits in the history.

One of the biggest reasons why we teach Git is because of the popularity of several Git repository hosting sites. The most popular one is GitHub (which this course is hosted on). In this session, we’ll be covering GitHub not only because it is very popular, but also because the R community is almost entirely on GitHub.

graph TB
    linkStyle default interpolate basis
    A('Remote':<br>GitHub) --- B('Local':<br>Your computer)

    style A fill:White,stroke:DarkBlue,stroke-width:1.5px;
    style B fill:White,stroke:DarkBlue,stroke-width:1.5px;
Figure 6.8: The ‘remote’ vs ‘local’ repository, or online vs on your computer.

Let’s get familiar with GitHub. More details about manually creating repositories on GitHub is found in Section D.1.

Go over the interface of GitHub, especially where repositories are listed, the sidebar of the landing page (of your account), and where your account settings are.

Warning

When using GitHub, especially in relation to health research, you need to be mindful of what you save into the Git history and what you put up online. Some things to think about are:

  • Do not save any personal or sensitive data or files in your Git repository.
  • Generally don’t save very large files, like big image files or large datasets.

In both cases, it’s better to use another tool to store files like that, rather than through Git and GitHub.

6.7 Using GitHub as a remote

Reading task: ~3 minutes

Making and cloning a GitHub repository is the first step to linking a local repository to a remote one. After that, to keep your GitHub repository synchronized, you need to “push” (upload) and “pull” (download) any changes you make to the repository on your computer, as demonstrated in Figure Figure 6.9. It isn’t done automatically because Git is designed with having control in mind, so you must do this synchronization manually. “Pushing” is when changes to the history are uploaded to GitHub while “pulling” is when the history is downloaded from GitHub.

graph TB
    linkStyle default interpolate basis
    A('Remote':<br>GitHub) -- Pull --> B('Local':<br>Your computer)
    B -- Push --> A

    style A fill:White,stroke:DarkBlue,stroke-width:1.5px;
    style B fill:White,stroke:DarkBlue,stroke-width:1.5px;
Figure 6.9: Synchronizing with GitHub: ‘Pushing’ and ‘pulling’.

So, when we put the concepts back into the framework of the “states”, first introduced in Section 6.3, pushing and pulling happen only to the history. Things that you’ve changed and then saved to the history, either on the remote or the local repository, are synchronized from or to GitHub. So, as shown in Figure 6.10, pushing copies the history over to GitHub and pulling copies the history from GitHub. Since changes saved in the history also reflect the working folder (the files and folders you actually see and interact with), “pulling” also updates the files and folders.

%%{init:{'themeCSS': ".actor {stroke: DarkBlue;fill: White;stroke-width:1.5px;}", 'sequence':{'mirrorActors': false}}}%%
sequenceDiagram
    participant W as Working folder
    participant S as Staged
    participant H as History
    participant R as GitHub
    W->>S: Add
    S->>H: Commit
    H->>R: Push
    R->>H: Pull
    R->>W: Pull
Figure 6.10: Which states get ‘pushed’ and ‘pulled’.

Interacting with GitHub through R requires us to use something called a “personal access token”, which you will learn about and create in the next exercise.

6.8 Exercise: Creating a GitHub token with usethis

Time: ~20 minutes

Since we use R, there is a really useful set of functions from usethis to make it easy interact with and setup connections to GitHub from RStudio. Complete the Connect to GitHub guide for this exercise. In the end, you should have your LearningR project on GitHub.

On your own, run the commands in the guides. After they complete the tasks, make sure that they have the “Push” and “Pull” buttons as well as having the LearningR on GitHub.

6.9 Synchronizing with GitHub

After creating the token, we can now push and pull any changes you make to the files.

  1. Make sure you are in the LearningR R Project, which you should see in the top right corner, above the Console pane. If you aren’t, switch to it by clicking the button in the top right corner and selecting the LearningR project from the menu.
  2. Open up the README.md and add a random sentence somewhere near the top of the file.
  3. Save the file.
  4. Open the Git interface, by hitting Ctrl-Alt-M (or Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”)) anywhere in RStudio or going to the Git button -> Commit.
  5. Stage the file.
  6. Add a commit message.
  7. Commit the new change by clicking the “Commit” button.
  8. Click the “Push” button in the top right corner of the Git interface (Box 2 of Figure 6.5). A pop-up will indicate that it’s pushing and will tell you when it’s done.

Now let’s try the opposite by committing and pulling changes from GitHub to your local repository.

  1. Go to your LearningR GitHub repository. You should see the new change is also on the GitHub repository.
  2. Click the README.md file on the GitHub website and then click the “Edit” button (see the video below, which shows it for random repository called learning-github).
  3. Add another random sentence somewhere near the top of the file.
  4. Scroll down to the commit message box, and type out a commit message.
  5. Click the “Commit” button.
  6. Go back to RStudio, open the Git interface and now click the “Pull” button in the top right corner beside the “Push” button.
  7. Wait for it to finish pulling and check your README.md file for the new change. You’ve now updated your project!

6.10 Collaborating using Git and GitHub

After they’ve read the activity, briefly go over the image and emphasize why collaborating this way makes things easier. If you have some personal experiences, please share them!

Reading task: ~10 minutes

While all of the previous Git tools we covered are extremely useful when working alone, we’ve been building up to using Git for it’s main and biggest advantage: to easily collaborate with others on a project.

Using the concept and structure of remote repositories like GitHub combined with the idea of saving changes to files in a history, collaborating with others on a common project is much easier and more powerful. Think of it like Dropbox on steroids.

Let’s go back to the concept of remote repositories. Since a local repository is simply a copy of a remote repository, anyone else can collaborate on your project by copying the remote repository. When they want to contribute back, they make commits to their local copy and push up to the remote. Then you can pull to your local copy and do the same thing. This is illustrated in Figure Figure 6.11.

graph TB
    linkStyle default interpolate basis
    A('Remote':<br>GitHub) -- Pull --> B('Local':<br>Your computer)
    B -- Push --> A
    A -- Pull --> C('Local':<br>Collaborator's<br>computer)
    C -- Push --> A

    style A fill:White,stroke:DarkBlue,stroke-width:1.5px;
    style B fill:White,stroke:DarkBlue,stroke-width:1.5px;
    style C fill:White,stroke:DarkBlue,stroke-width:1.5px;
Figure 6.11: Collaborating with others using Git and GitHub by having a shared central GitHub repository.

For public GitHub repositories, anyone can copy your repository and contribute back, so working with collaborators is easy. When you have a private repository, you need to explicitly add collaborators in GitHub.

You add someone by going to Settings -> Manage Access -> Invite a collaborator (also shown in the video below).

We won’t have you do this for this course, since you’ve all been added as collaborators to your group’s repository. Instead, you can focus on getting practice collaborating on a Git project in the group work later in the course.

Note: A big challenge you’ll encounter with becoming better with this way of collaborating is that most of your collaborators will likely not be familiar with it, at least until they take this course. 😉 Sadly, even experienced people struggle with this and there is no easy answer on how to handle this. The best way (in our opinion) is to start training any colleague who is interested in collaborating this way and slowly you’ll surround yourself with collaborators who also work this way.

6.11 Summary

  • Use the version control system Git to track changes to your files, to more easily manage your project, and to more easily collaborate with others.
  • Git tracks files in three states: “Working directory”, “Staged”, and “History”.
    • The Git repository contains the history.
  • The main actions to move between states are:
    • “Add to staging”
    • “Commit to history”
  • When committing to history, keep messages short and meaningful. Focus more on why the change was made, not what.
  • “Remotes” are external storage locations for your Git repository. GitHub is a popular remote repository hosting service.
    • Downloading a Git repository from GitHub is called “cloning”.
  • “Pushing” and “pulling” are actions to upload and download to the remote repository (which usually is called “origin”).
  • Almost all Git actions can be done using RStudio’s Git interface.