19  Collaborating with GitHub

GitHub is a popular online service for hosting Git repositories. It also makes collaborating on projects much easier. In this session, we will cover what GitHub is and how to use it with your R project that is a Git repository.

19.1 Learning objectives

  1. Describe the difference between a “remote” and a “local” repository.
  2. Explain why GitHub can be an effective way to collaborate with others on a project.
  3. Use GitHub to store your Git repository by connecting your R project to GitHub.
  4. Use “pushing” and “pulling” to synchronize any changes you make between the local and remote repositories.

19.2 “Remotes”: Storing your repository online

Briefly go over this next section, especially highlighting the image.

A version control system without some kind of external backup wouldn’t be a very good system, because if something happened to your computer, you’d lose your Git repository. In Git, this “external” backup is called a “remote”, meaning it is separate from and in a different location, usually online, than the main repository. The remote repository is essentially a duplicate copy of the history, found in the .git/ folder, of your local repository on your computer. So when you synchronize with the remote, as illustrated in Figure 19.1, it only copies over the changes made as commits in the history.

One of the biggest reasons why we teach Git is because of the popularity of several Git repository hosting sites. The most popular one is GitHub (which this workshop is hosted on). In this session, we’ll be covering GitHub not only because it is very popular, but also because the R community is almost entirely on GitHub.

[Diagram: the ‘remote’ repository on GitHub, connected to the ‘local’ repository on your computer.]

Figure 19.1: The ‘remote’ vs ‘local’ repository, or online vs on your computer.

Let’s get familiar with the GitHub interface.

Go over the interface of GitHub, especially where repositories are listed, the sidebar of the landing page (of your account), and where your account settings are.

Also, reinforce the warning and note below.

Warning

When using GitHub, especially in relation to health research, you need to be mindful of what you save into the Git history and what you put up online. Some things to think about are:

  • Do not save any personal or sensitive data or files in your Git repository.
  • Generally don’t save very large files, like big image files or large (non-personal) datasets.

In both cases, it’s better to use another tool to store files like that, rather than through Git and GitHub.
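
One way to reduce the risk of accidentally committing such files is to list them in the project's .gitignore file, which tells Git which files to leave untracked. Below is a minimal sketch using usethis::use_git_ignore(), assuming (hypothetically) that the sensitive or large files are kept in a folder called data-raw/ and that large images are saved as .tiff files:

```r
# Hypothetical paths: replace them with wherever your sensitive or large files are kept.
# This adds the patterns to the project's .gitignore file,
# so Git leaves matching files untracked.
usethis::use_git_ignore(c("data-raw/", "*.tiff"))
```

This only affects files that haven't been committed yet. Anything already saved in the history stays there, so it is best to set this up before adding such files to the project folder.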

Note

Some research projects require working on restricted server environments (such as Denmark Statistics when doing research on the Danish register data), where access to the internet is not available. This means that you can’t use GitHub or any other online Git repository hosting service. However, you can still use Git on those servers without using a remote.

19.3 📖 Reading task: Using GitHub as a remote

Time: ~3 minutes

Making and cloning a GitHub repository is the first step to linking a local repository to a remote one. We are creating a GitHub repository from an existing local one, but you can also create one on GitHub first. More details about manually creating repositories on GitHub are found in .

After connecting your local Git repository, you keep your GitHub repository synchronized by “pushing” (uploading) and “pulling” (downloading) any changes, as shown in Figure 19.2. This isn’t done automatically, because Git is designed to give you control, so you must do the synchronization manually. “Pushing” is when changes to the history are uploaded to GitHub, while “pulling” is when the history is downloaded from GitHub.

[Diagram: the ‘remote’ repository on GitHub and the ‘local’ repository on your computer, connected by a ‘Push’ arrow and a ‘Pull’ arrow.]

Figure 19.2: Synchronizing with GitHub: ‘Pushing’ and ‘pulling’.

So, when we put these concepts back into the framework of the “states” introduced earlier, pushing and pulling happen only to the history. Things that you’ve changed and then saved to the history, either on the remote or the local repository, are synchronized from or to GitHub. So, as shown in Figure 19.3, pushing copies the history over to GitHub and pulling copies the history from GitHub. Since changes saved in the history are also reflected in the working folder (the files and folders you actually see and interact with), “pulling” also updates the files and folders.

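[Diagram: the states ‘Working folder’, ‘Staged’, and ‘History’, plus ‘GitHub’. ‘Add’ and ‘Commit’ move changes towards the history, ‘Push’ sends the history to GitHub, and ‘Pull’ brings changes from GitHub into both the history and the working folder.]

Figure 19.3: Which states get ‘pushed’ and ‘pulled’.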

Interacting with GitHub through R requires us to use something called a “personal access token”, which you will learn about and create in the next exercise.

Caution: Origami hats up!

When you’re ready to continue, place the paper hat on your computer to indicate this to the teacher 👒 🎩

19.4 📖 Reading task: Authenticating with GitHub

Very briefly reinforce the importance of using a PAT rather than your own password. Also emphasize the use of a password manager.

Time: ~3 minutes.

Any time we do anything on the internet, there is some risk to having our information maliciously hacked. This is no different when using GitHub, so if you can, you should try to be more secure with what you send across the internet. In fact, most functions that relate to Git or using GitHub require using more secure features in order to work. usethis makes this much easier, thankfully, with several functions. The usethis website has a really well written guide on setting it up. Here is a very simplified version of what they recommend that is relevant for what we are doing in this workshop.

  • Use personal access tokens (PAT, or simply called a “token”) when interacting with your GitHub remote repositories from outside the GitHub website (e.g. when using R or usethis). PATs are like temporary passwords that provide limited access to your GitHub account, like being able to upload to or download from your GitHub repositories, but not being able to delete them. They are useful because you can easily delete a PAT if you think someone got access to it, which stops it from being used for harmful purposes.

  • Use a password manager to save the PAT for later use. Using password managers is basically a requirement for having secure online accounts, because they can generate random and long passwords that you don’t have to remember. This is why we recommended you install one during the pre-workshop tasks.

  • Use packages like gitcreds to give usethis access to the PAT and to interact with your GitHub repositories. You normally would use gitcreds every time you restart your computer or after a certain period of time.

Tip

As a reminder, a password manager is an app or web service that lets you save or create passwords for all your accounts, like banking or social media. Instead of having to remember multiple passwords used across multiple accounts, or the very insecure approach of one or two passwords for all your accounts, you only need to remember one very secure password that unlocks all your other very secure passwords. You can google “password manager” and your operating system (Windows, macOS) to search for possible ones to install or use. We recommend using Bitwarden.

Caution: Origami hats up!

When you’re ready to continue, place the paper hat on your computer to indicate this to the teacher 👒 🎩

19.5 Authenticating with GitHub

Go through this with them slowly, explaining things along the way. Especially emphasize the importance of saving the PAT in a password manager, at the least to not save it in a file in the project itself.

Since we use R, there is a really useful set of functions from usethis that makes it easy to interact with and set up connections to GitHub from RStudio. So, let’s connect our projects to GitHub! Before we can link the project’s Git repository to GitHub, we need to inform GitHub that we are the owner of the account by authenticating ourselves.

Very likely no one has set up a PAT yet with GitHub, but we’ll run a check just to see by running this function in the Console.

Console
usethis::gh_token_help()

Which should output something like this:

• GitHub host: 'https://github.com'
• Personal access token for 'https://github.com': <unset>
• To create a personal access token, call `create_github_token()`
• To store a token for current and future use, call `gitcreds::gitcreds_set()`
ℹ Read more in the 'Managing Git(Hub) Credentials' article:
  https://usethis.r-lib.org/articles/articles/git-credentials.html

The output is saying the token is <unset>, which means we need to set it up on our computer. We do that by typing the next function in the Console to create the token on GitHub (if one isn’t set already).

Console
usethis::create_github_token()

This function sends us to the GitHub “Generate new token” webpage with all the necessary settings checked. Set the “Expiration” to 90 days (this is a good security feature). Then, scroll down to the bottom without changing anything else and click the green button at the bottom called “Generate token”. Afterwards, a very long string that starts with ghp_ will have been generated. Save this token in your password manager. Alternatively, leave this page open to copy and paste the token whenever it’s needed. If this token gets lost, no worries! It’s very easy to create a new one by re-running the usethis::create_github_token() function.

Important

Do not save the token in the project we’re working in for the workshop! Save it somewhere else.

This token we created is what we’ll use every time we open up RStudio and interact with GitHub through R. A token does not need to be created for each of our R projects. A new token is only needed when the current token expires (typically every couple of months) or if we’ve lost the token.

In the Console, run:

Console
gitcreds::gitcreds_set()

And then copy and paste your token into the prompt in the Console. The token usually gets saved for the day (it gets cached), but after restarting your computer you will likely need to run this function again. If it asks to replace an existing one, select the “yes” option. Doing this is a bit like the two-factor authentication (2FA) you have to do when you, for instance, want to access your online bank account or a government website. In this case, you are telling GitHub (when interacting with it through RStudio, like uploading and downloading your changes) that you are who you claim to be.
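
To check that the token was stored, one option is to re-run the earlier check, which should no longer report the token as <unset>. Another is to ask GitHub which account the token belongs to. This is only a sketch, and it assumes the gh package is installed (usethis depends on it, so it usually is):

```r
# Re-run the check from before; the token should no longer be reported as <unset>.
usethis::gh_token_help()

# Ask GitHub which account the stored token belongs to.
# This needs an internet connection.
gh::gh_whoami()
```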

Mention this helpful function below, but you don’t need to run it.

Tip

There is another great helper function that runs a lot of checks and gives some advice when it finds potential problems.

usethis::git_sitrep()

Just be aware that this function outputs a lot of information, most of which you probably don’t need to know or won’t understand yet. That’s ok, since it is meant as a diagnostic tool.

19.6 Linking your project to GitHub

As with above, go through this section slowly, explaining things as you go along. Briefly expand on and reinforce what the word “origin” means when it comes up.

Visually show the diagrams and explain it.

Now that we have authenticated ourselves to GitHub, we can connect our project’s Git repository to GitHub. If you are new to Git and GitHub, we strongly recommend starting your first work project(s) as private, in case you accidentally add files you aren’t supposed to. It will also help you get more comfortable with using Git and GitHub. However, for this workshop, we will be keeping it public. To make it private, we would add the argument private = TRUE to the function below. For now, go to the Console and run this function to make a public repository:

Console
usethis::use_github()

Note

You may have to manually enter your username and password, even though you used gitcreds::gitcreds_set().

If you have trouble logging in, you may need to update Git.

Tip

You might notice the word origin when referring to remotes. The word origin is the default short name to refer to the location of the remote (the GitHub URL). You will probably see this word in many other places to refer to a remote.
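
To see where origin points for the current project, usethis has a small helper that lists the remotes. A minimal sketch, assuming it is run from inside the project after use_github() has finished:

```r
# List the remotes connected to the current project.
# After running use_github(), there should be an entry named "origin"
# that contains the URL of the repository on GitHub.
usethis::git_remotes()
```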

The use_github() function will take your project and upload it to GitHub. Now, after you save your changes to the Git history, “pushing” sends those changes to your project on GitHub. The diagram below shows what this looks like conceptually:

[Diagram: ‘Your local’ connected to ‘GitHub’ by a ‘Push’ arrow and a ‘Pull’ arrow.]

Figure 19.4: Schematic showing a local repository connected to GitHub’s remote repository.

“Your local” is your own computer. Whenever you “push” to GitHub, your committed file changes are uploaded (like synchronizing in Dropbox). Whenever you “pull” from GitHub, any changes made on GitHub are downloaded to your “local” computer.

Using GitHub (because of Git) is one of the most effective ways to collaborate on a project. Hundreds of companies and hundreds of thousands of workers use Git and services like GitHub to work together on massive projects. The way collaboration works would conceptually look like:

[Diagram: ‘Your local’ and ‘Collaborator's local’, each connected to ‘GitHub’ by their own ‘Push’ and ‘Pull’ arrows.]

Figure 19.5: Schematic showing a local repository, GitHub’s remote repository, and a collaborator’s repository.

This approach to collaborating makes it much easier to contribute directly (not through emails) to projects and to more easily help others out with issues.

19.7 Synchronizing with GitHub

Go through this section slowly, remembering to make use of the stickies/origami hats to check that everyone is following along.

After we’ve created the token and put our project onto GitHub, we can now push and pull any changes we make to the files. Let’s practice how it works.

Open up the docs/learning.qmd file and write a random sentence somewhere near the top of the file (below the YAML header). Then save the file. Open the Git interface with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”), “stage” the file, and write a commit message. Click the “Commit” button.

Now it’s time to test out pushing the change to GitHub. Click the “Push” button in the top right corner of the Git interface. A pop-up will indicate that it’s pushing and will show some text after it’s pushed. Go to our GitHub repository to see that it worked!

Now let’s try the opposite way, by making a change on GitHub, committing the changes there, and then pulling changes from GitHub to your local repository (on your computer).

While in your LearningR GitHub repository, click into the docs/ folder and then click the learning.qmd file there. Then click the “Edit” button (the pencil icon) in the top right corner of the file view. You’ll be taken to a web-based editor.

Write another random sentence somewhere near the top of the file (below the YAML header). Scroll down to the commit message box, and type out a commit message. Then, click the “Commit” button. We’ve now made a change on the repository in GitHub. Let’s synchronize it to our local repository.

Go back to RStudio, open the Git interface and now click the “Pull” button in the top right corner beside the “Push” button. Wait for it to finish pulling and check your docs/learning.qmd file for the new change. You’ve now updated your project!
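
The same steps can also be run from the Console instead of clicking through the Git interface. This is only a sketch and assumes the gert package is available (it is installed together with usethis). The commit message is made up for illustration:

```r
# After editing the file locally: stage it and save it to the history.
gert::git_add("docs/learning.qmd")
gert::git_commit("Add a sentence to the learning document")

# "Push": upload the new commit(s) in the history to GitHub.
gert::git_push()

# After a change was committed on GitHub:
# "Pull": download those commits and update the local files.
gert::git_pull()
```

In this workshop we stick with the buttons in the Git interface, but it can be useful to know that the buttons and these functions do the same thing.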

19.8 📖 Reading task: Collaborating using Git and GitHub

After they’ve read the text, briefly go over the image and emphasize why collaborating this way makes things easier. If you have some personal experiences, please share them!

Time: ~10 minutes

While Git and GitHub are useful even when you work alone, their biggest advantage is that they make it much easier to collaborate with others on a project.

Academia is unfortunately far behind when it comes to using modern tools to effectively collaborate together. Most researchers still use emails to send Word or other files back and forth, and while this is a very simple and non-technical way to collaborate as it requires very little learning or training to use, it is not very effective.

Usually this style of collaborating revolves around one or two people doing most of the actual writing or direct contributing, while others give feedback or indirectly contribute through discussions or meetings. You might be familiar with using “Track changes” in Word when doing this style of collaborating. If your collaborators are a bit more technical, you all might be using Google Docs to do real-time collaboration together.

When you use Git and GitHub and write in Quarto documents, this style of collaborating isn’t possible. For one, there is no “track changes” feature in Quarto documents. Instead, you need to use another way of collaborating, one that is much more effective and extremely powerful. This workflow of using Git and GitHub has been tried and tested by tens of thousands of teams in thousands of companies globally. One of the goals of this workshop is to slowly move researchers into the modern era, using more modern technology, tools, and workflows so that we can produce better research faster.

What does the workflow look like when using Git and GitHub? It works by using the concept of remotes that we introduced earlier. Since a local repository is a copy of a remote repository, anyone else can collaborate on your project by copying the remote repository. When they want to contribute back, they make commits to their local copy and push those changes up to the remote. Then you can pull those changes to your local repository and do the same thing by committing and then pushing. This is illustrated in Figure 19.6.

[Diagram: the ‘remote’ repository on GitHub, with ‘Push’ and ‘Pull’ arrows to the ‘local’ repository on your computer and to the ‘local’ repository on your collaborator's computer.]

Figure 19.6: Collaborating with others using Git and GitHub by having a shared central GitHub repository.
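
For a collaborator, the first step of this workflow is getting their own local copy of the shared repository. Below is a hedged sketch using usethis::create_from_github(), where "owner/LearningR" is a placeholder for the account and repository name and the destination folder is only an example:

```r
# Clone the shared GitHub repository to the collaborator's computer.
# "owner/LearningR" and "~/Desktop" are placeholders, not real values.
# fork = FALSE clones the repository directly, which assumes the
# collaborator has been given permission to push to it.
usethis::create_from_github(
  "owner/LearningR",
  destdir = "~/Desktop",
  fork = FALSE
)
```

From then on, pulling before starting to work and pushing after committing keeps everyone's local copy in step with the shared repository.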

A disadvantage of this workflow is that it takes time to learn and get used to before you can use it effectively. For instance, you may think that you can just collaborate with others by making changes directly to the same file at the same time. But a problem comes up when you both push and pull changes to the same file. You will encounter something called a “merge conflict”, which you’ll have to learn how to resolve. Git knows that changes were made to the same file, but it doesn’t know which change to keep and which to discard. You have to manually resolve the conflict by opening the file and deciding which change to keep.

So, how do you manage this when collaborating with others? Always dealing with merge conflicts sounds time-consuming and frustrating, doesn’t it? Well, that’s because Git wasn’t designed for that style of collaborating. Instead, Git uses a concept called “branches” to more effectively manage multiple collaborators working on the same project. However, branches are a bit more of an advanced topic that we won’t be covering in this workshop. Instead, when collaborating with others, we recommend that each collaborator create their own separate file to work on. For instance, if you are collaborating on writing a report together, each collaborator would create their own Quarto document to write in. When everyone is finished writing, one person can then merge all the documents together into one final report. This way, you avoid merge conflicts entirely. This is the approach we will get you to use for the team project at the end of this workshop.

Tip

For public GitHub repositories, anyone can copy your repository and contribute back (only if you want their contribution), so working with collaborators is easy. When you have a private repository, you need to explicitly add collaborators in GitHub.

You add someone to a private (or public) repository by going to “Settings > Manage Access > Invite a collaborator”. We won’t do this for the workshop, but we’re telling you how you can just in case you want to after the workshop.

Tip

As we mentioned, when working with others (or even yourself) through GitHub, you will eventually encounter “merge conflicts”. This happens when a change has been made to the same line in the same file, but in different commits, either by you or someone else. This usually happens if you make a change on GitHub as the remote while also making a change on your local repository without pulling first.

When this happens, Git will not know which change to keep and will ask you to resolve the conflict. You resolve the conflict by opening the file in RStudio, finding the conflict, and deciding which change to keep and which to remove. After you’ve resolved the conflict, you would then stage the file and commit it.

We won’t go into merge conflicts in this workshop, though you might deal with them during the team project. If you want to learn more about them after the workshop, GitHub has a practical tutorial on it. We also have an extra appendix section on dealing with merge conflicts.

Note

A big challenge you’ll encounter with becoming better with this way of collaborating is that most of your collaborators will likely not be familiar with it, at least until they take this workshop. 😉

Sadly, even experienced researchers struggle with this workflow (mainly due to not being familiar with Git) and there is no easy answer on how to handle this. The best way (in our opinion) is to start training any colleague who is interested in collaborating this way and slowly surround yourself with collaborators who also work this way.

Tip

Want to see how others use Git and GitHub in their research as examples? Check out the Examples section of the Guides website.

Caution: Origami hats up!

When you’re ready to continue, place the paper hat on your computer to indicate this to the teacher 👒 🎩

19.9 Summary

  • “Remotes” are external storage locations for your Git repository. GitHub is a popular remote repository hosting service.
    • Downloading a Git repository from GitHub is called “cloning”.
  • “Pushing” and “pulling” are actions to upload and download to the remote repository (which usually is called “origin”), so that you can synchronize your changes.
  • Collaborating together using Git and GitHub is a powerful way of working, but does take some learning. Collaborating involves each collaborator creating a copy of the remote repository that you each push and pull to as you work together.

19.10 Survey

Please complete the survey for this session:

Feedback survey! 🎉