Version Control

Purpose
#

Modern Version Control Systems (VCS) let you easily (and often automatically) answer questions like:

Who wrote this module?
When was this particular line of this particular file edited? By whom? Why was it edited?
Over the last 1000 revisions, when/why did a particular unit test stop working?

Git’s data model
#

snapshot
#

Git models the history of a collection of files and folders within some top-level directory as a series of snapshots.

Git terminology:

In Git, a file is called a “blob” (a bunch of bytes).
A directory is called a “tree”; it maps names to blobs or trees (so directories can contain other directories).
A snapshot is the top-level tree that is being tracked.

Below is an example of a tree/snapshot:

<root> (tree)
|
+- foo (tree)
|  |
|  + bar.txt (blob, contents = "hello world")
|
+- baz.txt (blob, contents = "git is wonderful")

This top-level tree contains two elements, a tree “foo” (that itself contains one element, a blob “bar.txt”), and a blob “baz.txt”.

Modeling history: relating snapshots
#

In Git, a history is a directed acyclic graph (DAG) of snapshots.

Each snapshot in Git refers to a set of “parents”, the snapshots that preceded it. It’s a set of parents rather than a single parent (as would be the case in a linear history) because a snapshot might descend from multiple parents, for example, due to combining (merging) two parallel branches of development.

Git calls these snapshots “commit”s. Visualizing a commit history might look something like this:

o <-- o <-- o <-- o
            ^
             \
              --- o <-- o

In the ASCII art above, the os correspond to individual commits (snapshots). The arrows point to the parent of each commit (it’s a “comes before” relation, not “comes after”). After the third commit, the history branches into two separate branches. This might correspond to, for example, two separate features being developed in parallel, independently from each other. In the future, these branches may be merged to create a new snapshot that incorporates both of the features, producing a new history that looks like this, with the newly created merge commit shown in bold:

o <-- o <-- o <-- o <---- **o**
            ^            /
             \          v
              --- o <-- o

Commits in Git are immutable. This doesn’t mean that mistakes can’t be corrected, however; it’s just that “edits” to the commit history are actually creating entirely new commits, and references (see below) are updated to point to the new ones.
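As a rough illustration, the merged history above can be written as a map from each commit to its parents. The commit names (c1 to c6, m) are hypothetical labels chosen for this sketch, not real Git IDs. It shows that a merge commit is simply a commit with two parents, and that following parent pointers recovers everything that came before a commit.

```python
# Hypothetical commit names; each commit maps to the list of its parents.
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "c4": ["c3"],        # last commit on the upper branch
    "c5": ["c3"],        # the lower branch starts from c3
    "c6": ["c5"],
    "m": ["c4", "c6"],   # merge commit: two parents
}


def ancestors(commit):
    """Return every commit reachable from `commit` by following parent pointers."""
    seen = set()
    stack = list(parents[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("c4")))  # history of the upper branch only
print(sorted(ancestors("m")))   # the merge commit sees both branches
```

Before the merge, c5 and c6 are not part of the history of c4; after the merge, m has both branches in its ancestry.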

Data model, as pseudocode
#

// a file is a bunch of bytes
type blob = array<byte>

// a directory contains named files and directories
type tree = map<string, tree | blob>

// a commit has parents, metadata, and the top-level tree
type commit = struct {
    parents: array<commit>
    author: string
    message: string
    snapshot: tree
}
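The pseudocode above translates fairly directly into Python. The following is only a sketch of the same model: the author name and commit messages are made up for illustration, and real Git stores references to objects by hash rather than nesting them like this.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

Blob = bytes                             # a file is a bunch of bytes
Tree = Dict[str, Union["Tree", Blob]]    # a directory maps names to trees or blobs


@dataclass(frozen=True)
class Commit:
    parents: Tuple["Commit", ...]   # a tuple, so the list of parents cannot be changed
    author: str
    message: str
    snapshot: Tree                  # the top-level tree


# The example directory structure from the snapshot section.
root: Tree = {
    "foo": {"bar.txt": b"hello world"},
    "baz.txt": b"git is wonderful",
}

# Hypothetical author and messages, for illustration only.
first = Commit(parents=(), author="alice", message="initial snapshot", snapshot=root)
second = Commit(parents=(first,), author="alice", message="second snapshot", snapshot=root)

print(second.parents[0].message)
print(sorted(root))
```

Using a frozen dataclass mirrors the idea that commits are immutable: assigning to a field of an existing Commit raises an error.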

Objects and content-addressing
#

An object is a blob, tree, or commit:

type object = blob | tree | commit

In Git’s data store, all objects are content-addressed by their SHA-1 hash.

objects = map<string, object>

def store(object):
    id = sha1(object)
    objects[id] = object

def load(id):
    return objects[id]

Blobs, trees, and commits are unified in this way: they are all objects. When they reference other objects, they don’t actually contain them in their on-disk representation, but have a reference to them by their hash.

For example, the tree for the example directory structure above (visualized using git cat-file -p 698281bc680d1995c5f4caaf3359721a5a58d48d), looks like this:

100644 blob 4448adbf7ecd394f42ae135bbeed9676e894af85    baz.txt
040000 tree c68d233a33c5c06e0340e4c224f0afca87c8ce87    foo

The tree itself contains pointers to its contents, baz.txt (a blob) and foo (a tree). If we look at the contents addressed by the hash corresponding to baz.txt with git cat-file -p 4448adbf7ecd394f42ae135bbeed9676e894af85, we get the following:

git is wonderful
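A minimal runnable sketch of such a content-addressed store is shown below. It assumes the scheme Git uses for hashing objects (the SHA-1 of a short header, "<type> <size>" followed by a NUL byte, then the content), but keeps everything in an in-memory dictionary; real Git also compresses objects and writes them under .git/objects. The function names are borrowed from the pseudocode above.

```python
import hashlib

objects = {}  # maps hex object ID -> raw bytes (header + content)


def store(kind: str, data: bytes) -> str:
    """Store an object and return its ID (SHA-1 of header + content)."""
    header = f"{kind} {len(data)}".encode() + b"\x00"
    object_id = hashlib.sha1(header + data).hexdigest()
    objects[object_id] = header + data
    return object_id


def load(object_id: str) -> bytes:
    """Return the content of a stored object, without its header."""
    raw = objects[object_id]
    return raw.split(b"\x00", 1)[1]


blob_id = store("blob", b"git is wonderful\n")
print(blob_id, load(blob_id))

# Storing identical content again yields the same ID, so it is kept only once.
assert store("blob", b"git is wonderful\n") == blob_id
print(len(objects))
```

Because the ID is derived from the content, two files with identical bytes share one object, and any change to the content produces a different ID.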

References
#

Now, all snapshots can be identified by their SHA-1 hashes. That’s inconvenient, because humans aren’t good at remembering strings of 40 hexadecimal characters.

Git’s solution to this problem is human-readable names for SHA-1 hashes, called “references”. References are pointers to commits. Unlike objects, which are immutable, references are mutable (can be updated to point to a new commit). For example, the master reference usually points to the latest commit in the main branch of development.

references = map<string, string>

def update_reference(name, id):
    references[name] = id

def read_reference(name):
    return references[name]

def load_reference(name_or_id):
    if name_or_id in references:
        return load(references[name_or_id])
    else:
        return load(name_or_id)

With this, Git can use human-readable names like “master” to refer to a particular snapshot in the history, instead of a long hexadecimal string.

One detail is that we often want a notion of “where we currently are” in the history, so that when we take a new snapshot, we know what it is relative to (how we set the parents field of the commit). In Git, that “where we currently are” is a special reference called “HEAD”.
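The role of HEAD can be sketched as follows. This is a simplified model, not how Git implements it: HEAD is stored as a reference that names the current branch, and the short commit IDs "a1" and "b2" are hypothetical. Creating a commit reads HEAD to find the parent, then moves the current branch forward.

```python
references = {"HEAD": "master"}   # HEAD names the current branch
commits = {}                      # commit ID -> {"parents": [...], "message": ...}


def current_commit():
    """Return the commit ID that HEAD's branch points to (None before the first commit)."""
    return references.get(references["HEAD"])


def commit(commit_id, message):
    """Record a new commit whose parent is wherever HEAD currently is."""
    parent = current_commit()
    commits[commit_id] = {
        "parents": [parent] if parent else [],
        "message": message,
    }
    references[references["HEAD"]] = commit_id   # move the current branch forward


commit("a1", "first")    # hypothetical short IDs
commit("b2", "second")
print(references["master"], commits["b2"]["parents"])
```

The commit objects themselves are never modified; only the mutable "master" entry in the references map changes.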

Repositories
#

Finally, we can define what (roughly) is a Git repository: it is the data objects and references.

On disk, all Git stores are objects and references: that’s all there is to Git’s data model. All git commands map to some manipulation of the commit DAG by adding objects and adding/updating references.

Whenever you’re typing in any command, think about what manipulation the command is making to the underlying graph data structure. Conversely, if you’re trying to make a particular kind of change to the commit DAG, e.g. “discard uncommitted changes and make the ‘master’ ref point to commit 5d83f9e”, there’s probably a command to do it (e.g. in this case, git checkout master; git reset --hard 5d83f9e).

Staging area
#

Motivation:
#

Imagine you have implemented two separate features and want to create two separate commits: the first introduces the first feature, and the second adds the second feature on top of it.

o (original snapshot) <-- o (feature 1) <-- o (feature 2)

Git accommodates such scenarios by allowing you to specify which modifications should be included in the next snapshot through a mechanism called the “staging area”.
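The two-feature scenario can be modeled with a toy staging area. The file names and helper functions below are hypothetical and only imitate the idea behind git add and git commit: a commit snapshots what has been staged, not everything in the working directory.

```python
working_dir = {
    "feature1.py": "print('feature 1')",
    "feature2.py": "print('feature 2')",
}
staging_area = {}   # what the next snapshot will contain
history = []        # list of snapshots, oldest first


def add(filename):
    """Copy one file's current content into the staging area."""
    staging_area[filename] = working_dir[filename]


def commit(message):
    """Snapshot the staging area (not the working directory)."""
    history.append({"message": message, "snapshot": dict(staging_area)})


add("feature1.py")
commit("feature 1")      # contains only the first feature
add("feature2.py")
commit("feature 2")      # contains both features

for entry in history:
    print(entry["message"], sorted(entry["snapshot"]))
```

Both files exist in the working directory the whole time, yet the first snapshot contains only feature1.py because that was the only file staged.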

Git command-line interface (CLI)
#

Basics
#

git help <command>: get help for a git command
git init: creates a new git repo, with data stored in the .git directory
git status: tells you what’s going on
git add <filename>: adds files to staging area
git commit: creates a new commit
- Write good commit messages!
- Even more reasons to write good commit messages!
git log: shows a flattened log of history
git log --all --graph --decorate: visualizes history as a DAG
git diff <filename>: shows changes you made relative to the staging area
git diff <revision> <filename>: shows differences in a file between snapshots
git checkout <revision>: updates HEAD (and switches the current branch, if <revision> is a branch name)

Branching and merging
#

git branch: shows branches
git branch <name>: creates a branch
git checkout <branch name>: switches to another branch
git merge <revision>: merges into current branch
git mergetool: use a fancy tool to help resolve merge conflicts
git rebase: rebases a set of patches onto a new base

Merge Conflict:
#

Consider the following history, in which two branches start from the original commit and are later merged:

o(org) <--- o(feature 1) <--- o
^                            /
 \                          /
  --- o(feature 2) <--------

# org file
Line 1: Hello World
Line 2: This is an example file.

# feature 1
Line 1: Hello World
Line 2: This is an example file.
Line 3: Adding a line for clarity.

# feature 2
Line 1: Hello World
Line 2: This is a modified example file.

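Relative to the original file, the two features modify different lines (feature 1 adds line 3, feature 2 changes line 2), so in principle the changes can be combined without a conflict. If instead both features modify the same line (for example, each adds a different line 3), the merge produces a conflict, and Git leaves it to the developer to resolve manually. In practice, Git may also flag edits to directly adjacent lines as a conflict.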
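The reasoning above can be sketched as a toy three-way merge. This is a deliberate simplification that compares lines by position; Git's real merge works on diffs against the common ancestor and handles insertions and deletions that shift later lines. The function name merge_lines and the conflict marker format are made up for this example.

```python
def merge_lines(base, ours, theirs):
    """Merge two edited versions of `base`, comparing line by line by position."""
    merged, conflicts = [], []
    length = max(len(base), len(ours), len(theirs))
    for i in range(length):
        b = base[i] if i < len(base) else None
        o = ours[i] if i < len(ours) else None
        t = theirs[i] if i < len(theirs) else None
        if o == t:            # both sides agree (or neither changed the line)
            chosen = o
        elif o == b:          # only "theirs" changed this line
            chosen = t
        elif t == b:          # only "ours" changed this line
            chosen = o
        else:                 # both changed it differently: conflict
            conflicts.append(i + 1)
            chosen = f"<<< {o} ||| {t} >>>"
        if chosen is not None:
            merged.append(chosen)
    return merged, conflicts


org = ["Hello World", "This is an example file."]
feature1 = org + ["Adding a line for clarity."]
feature2 = ["Hello World", "This is a modified example file."]

merged, conflicts = merge_lines(org, feature1, feature2)
print(merged, conflicts)

# A second variant in which both branches add a different line 3.
other = org + ["A different third line."]
print(merge_lines(org, feature1, other)[1])
```

In the first call each line was changed by at most one side, so the result takes the modified line 2 from feature 2 and the new line 3 from feature 1. In the second call both sides wrote a different line 3, which is reported as a conflict on that line.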

Remote
#

git remote: list remotes
git remote add <name> <url>: add a remote
git push <remote> <local branch>:<remote branch>: send objects to remote, and update remote reference
git branch --set-upstream-to=<remote>/<remote branch>: set up correspondence between local and remote branch
git fetch: retrieve objects/references from a remote
git pull: same as git fetch; git merge
git clone: download repository from remote

Others
#

git config: Git is highly customizable
git clone --depth=1: shallow clone, without entire version history
git add -p: interactive staging
git rebase -i: interactive rebasing
git blame: show who last edited which line
git stash: temporarily remove modifications to working directory
git bisect: binary search history (e.g. for regressions)
.gitignore: specify intentionally untracked files to ignore
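The idea behind git bisect, binary search over history, can be sketched in a few lines. This is not Git's implementation: it assumes a linear list of hypothetical revision numbers, that the first revision is good and the last is bad, and that once a revision is bad every later one is bad too.

```python
def first_bad(revisions, is_good):
    """Return the first revision for which is_good() is False.

    Assumes revisions are ordered oldest to newest, the first one is good,
    the last one is bad, and once a revision is bad all later ones are too.
    """
    low, high = 0, len(revisions) - 1    # low is known good, high is known bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(revisions[mid]):
            low = mid
        else:
            high = mid
    return revisions[high]


revisions = list(range(1, 1001))   # hypothetical revision numbers 1..1000
tested = []


def is_good(rev):
    """Pretend that revision 638 is the one that broke a unit test."""
    tested.append(rev)
    return rev < 638


print(first_bad(revisions, is_good), len(tested))
```

Each test halves the range that can still contain the breaking change, so even 1000 revisions need only about ten checks rather than one per revision.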

TODO and future learning plan:
#

Workflows: we taught you the data model, plus some basic commands; we didn’t tell you what practices to follow when working on big projects (and there are many different approaches).
GitHub: Git is not GitHub. GitHub has a specific way of contributing code to other projects, called pull requests.

References:
#

YouTube video: https://www.youtube.com/watch?v=2sjqTHE0zok&t=48s
The missing semester: https://missing.csail.mit.edu/2020/version-control/
