10. Introduction to version control#
Section author: Gavin Huttley
“Version control” refers to software tools that are designed to efficiently keep track of changes to your plain text files. As the phrase implies, the support relates to recording different versions of files. Here, the word version is not limited to a release version [1].
While used predominantly for programming [2], these tools can be applied to any application that uses plain text files [3]. Familiarity with version control is thus crucial for bioinformaticians. Here’s a short list of some advantages of using it:
makes it easier to experiment with different solutions to a problem
makes it easier to collaborate with other people
makes moving your code between computers easier
makes it easier to ensure the reproducibility of your work
[2] Plain text files are the medium for recording programs in nearly all programming languages.
[3] For example, many data files from genomics are plain text files.
As a budding professional computational scientist, writing code is your core business so anything that makes that job easier is a good thing!
In this topic I provide a functional introduction [4] to the version control tool git. git is a sophisticated (and very complex) command line tool with extensive capabilities. It is not the only such tool available [5]. You will become familiar with using it in a terminal. Note that some IDEs expose a sophisticated graphical user interface to git that makes using it much easier.
[4] Only the bare minimum necessary to use the basic capabilities of git.
[5] I personally use mercurial, but git is way more popular.
10.1. Getting set up#
These instructions are focussed on having a repository (see Glossary of key version control terms for definitions) that is hosted at GitHub and for which there will be a clone on a computer that you will use for writing your code. If you don’t already have one, sign up for a GitHub account.
You also need git installed on the machine where you will be writing / running the code [6].
[6] If you are doing my course, git has already been installed on the class server.
On your developer machine you need to inform git what your user name and email address are. These details are used to “sign” every commit you make. This is your attribution and informs others who made what changes. On your developer machine, in the terminal:
$ git config --global user.name "Your Name"
$ git config --global user.email "YourEmail@example.com"
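It is worth checking that git recorded what you intended. The following is a minimal sketch, not part of the course setup: it assumes a throwaway directory made with mktemp (the variable name scratch is my own) and leaves out --global, so the settings apply only to that one repository and your real configuration is untouched.

```shell
# Work in a throwaway repository so nothing outside it is changed
scratch="$(mktemp -d)"
cd "$scratch"
git init --quiet

# Without --global, the settings apply to this repository only
git config user.name "Your Name"
git config user.email "YourEmail@example.com"

# Print the values git will use for commits made in this repository
git config --get user.name
git config --get user.email
```

Running the same --get commands with --global added shows the values that apply to all your repositories.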
I also strongly recommend changing the log message editor from the default (vim) to nano [7].
[7] Which reminds me of a joke – “How do you generate a random string? Put a first year Computer Science student in vim and ask them to save and exit.”
$ git config --global core.editor nano
10.2. A demo project#
10.2.1. Create a demo project on GitHub#
Once your account is set up, create a new repository. For the purpose of demonstration, I’m going to assume you name it demo.
Check the “Add a README file” option. Check the “Add .gitignore” option and select Python from the popup. Check the “Choose a license” option and pick whichever one you like.
10.2.2. Cloning the repository to your development computer#
In this case, you will clone onto the machine where you will be developing your code. I assume you have gone through the process of creating an SSH key and followed GitHub’s instructions for adding that to your account [8].
[8] You will make life easier for yourself if you upload an SSH key to your GitHub account. This requires that you create an SSH key. See instructions here for doing both of these things.
$ git clone git@github.com:YourUserName/YourRepo.git
This creates a directory named YourRepo on the system.
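A clone remembers where it came from, under the name origin. Here is a rough sketch of checking that without needing a GitHub account. Note the assumptions: a local bare repository called origin.git stands in for GitHub, and both it and the scratch variable are hypothetical names I made up for the illustration.

```shell
# Throwaway directory for the demonstration
scratch="$(mktemp -d)"
cd "$scratch"

# A bare repository plays the role of GitHub here (an assumption of this sketch)
git init --quiet --bare origin.git

# Clone it; git warns that the repository is empty, which is expected
git clone --quiet origin.git YourRepo

# List the remote(s) the clone knows about, with the address used to fetch and push
git -C YourRepo remote -v
```

For a clone made from GitHub, the same remote -v command should show the git@github.com address you cloned from.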
10.2.3. Add a python file to your repository#
You first need to change into the directory that contains your repository. In the terminal, this is
$ cd YourRepo
When you list all [9] the contents of this directory you will see the .git directory.
[9] Use ls -a, which shows hidden files and folders too.
10.2.3.1. Create a file to add#
Note
Skip this step if you already have a file you want to add!
Now create a python file that contains just a print statement
$ echo 'print("Hello World")' > demo.py
10.2.3.2. Add a file#
We tell git that we want to add a file to the repository using
$ git add demo.py
This command just “stages” the file, meaning you have told git to include this change when you make the next commit.
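You can see the effect of staging with git status. The sketch below assumes a throwaway repository (the scratch variable is my own name) so it can be run anywhere. The --short flag gives a compact listing with a two character code in front of each file name.

```shell
# Throwaway repository for the demonstration
scratch="$(mktemp -d)"
cd "$scratch"
git init --quiet

echo 'print("Hello World")' > demo.py

# Before staging: "??" marks demo.py as untracked
git status --short

git add demo.py

# After staging: "A" marks demo.py as added, ready for the next commit
git status --short
```

Nothing has been recorded in the history at this point. That only happens at the commit step.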
10.2.3.3. Commit the file!#
You have not finished this until you commit the staged change!
$ git commit -m "Added a demo python script"
10.2.4. Look at the history of your repository#
$ git log
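The full log can get long, so here is a small sketch of two shorter views. It assumes a throwaway repository with a single commit, using a made-up scratch variable and a repository-local identity so the commit works on any machine.

```shell
# Throwaway repository with one commit in it
scratch="$(mktemp -d)"
cd "$scratch"
git init --quiet
git config user.name "Your Name"
git config user.email "YourEmail@example.com"

echo 'print("Hello World")' > demo.py
git add demo.py
git commit --quiet -m "Added a demo python script"

# One line per commit: abbreviated hash followed by the log message
git log --oneline

# Full details (author, date, message) of the most recent commit only
git log -1
```

The author shown by git log -1 is the name and email you configured, which is why setting those up matters.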
10.2.5. Push your change to GitHub#
$ git push
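To try a push without touching GitHub, the following sketch again uses a local bare repository as a stand-in remote. That stand-in (origin.git) and the scratch variable are assumptions of the example. Because the stand-in starts out empty, I name the remote and branch explicitly (origin HEAD) rather than relying on plain git push.

```shell
# Throwaway directory; a bare repository plays the role of GitHub
scratch="$(mktemp -d)"
cd "$scratch"
git init --quiet --bare origin.git
git clone --quiet origin.git YourRepo
cd YourRepo
git config user.name "Your Name"
git config user.email "YourEmail@example.com"

# Make one commit locally
echo 'print("Hello World")' > demo.py
git add demo.py
git commit --quiet -m "Added a demo python script"

# Send the current branch to the remote called origin
git push --quiet origin HEAD

# Ask the remote which branches it now holds
git ls-remote --heads origin
```

In a clone of the GitHub repository created earlier, the branch already exists on the remote, so plain git push is enough.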
10.2.6. Tips for effective use of version control#
10.2.6.1. Do#
track text files
commit changes that are logically related
think of log messages as your lab notebook entries to help you (and others) to understand what you were thinking when you changed the files
write meaningful log messages
commit often
push to GitHub often [10]
[10] It’s your backup!
10.2.6.2. Do NOT#
add really big files to a repository
add binary files to a repository
add secrets [11] to a repository!
include a massive number of changes in one commit
[11] Any type of information that would allow someone to cause you trouble! For example, passwords, application tokens, account names.
10.3. Glossary of key version control terms#
- add
Adding a file to your repository.
- clone
An independent copy of a repository. It is not required to be identical to the original.
- commit
The act of recording changes to a file by version control software.
- config
Configure the version control software.
- conflict
Where someone else has made a change to a repository affecting the same lines as your change.
- diff
A comparison of contents of two files / directories that shows only the differences.
- .gitignore
A file that contains patterns that match files you do not want to be included in the repository.
- log
Command to show the history of commits.
- log message
Text that describes the purpose of the changes being committed to a repository.
- manifest
Listing of files that are being tracked in a repository.
- merge
The step of resolving conflicting repository versions.
- repository
Short for software repository. This is a directory of (typically plain text source code) files pertaining to a project.
- repo
See repository.
- tracked
Refers to files whose contents are being recorded by version control software.
- pull
Updating a repository by pulling changes from another (possibly on another computer) repository.
- push
Pushing changes recorded locally to another (possibly on another computer) repository.
- reset
See revert.
- revert
To remove all changes made to the working copy of a file.
- stage
Staging a file means informing git that changes to that file are to be included on the next commit step.
- working copy
The files in a repository that are visible (they are not under the .git directory).
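A few of the glossary terms (.gitignore, diff and revert) are easier to grasp by trying them. The following is a sketch only, in a throwaway repository. The file run.log and the scratch variable are hypothetical names, and I use git checkout -- as one way of discarding changes to the working copy (newer versions of git also offer other commands for this).

```shell
# Throwaway repository with one committed file
scratch="$(mktemp -d)"
cd "$scratch"
git init --quiet
git config user.name "Your Name"
git config user.email "YourEmail@example.com"
echo 'print("Hello World")' > demo.py
git add demo.py
git commit --quiet -m "Added a demo python script"

# .gitignore: files matching a pattern are not reported as untracked
echo '*.log' > .gitignore
echo 'some output' > run.log
git status --short     # lists .gitignore but not run.log

# diff: change the tracked file, then show only what differs from the last commit
echo 'print("Goodbye")' >> demo.py
git diff

# revert: discard the uncommitted change to the working copy of demo.py
git checkout -- demo.py
git diff               # prints nothing, demo.py matches the last commit again
```

Be careful with the last step on real work: the discarded change was never committed, so it cannot be recovered from the repository.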