An intuition on branching and
collaborative workflows

The what and the why and the how

Git workflows in datasets

Code: psychoinformatics-de.github.io/rdm-course/91-branching

Collaborative failures

	"I can't continue my work on the project because my colleague is working on it at the moment"
"I have such a good rephrasing of the discussion, but my PI wanted to work on this part of the manuscript for the past two weeks"
	"Alright team, I propose everyone reviews the proposal and adds changes and comments, and [poor scientific coordinator] will go through all documents and merge everything!"

Collaboration in parallel improves things:

Your Git revision history is a timeline of changes

This timeline develops on a "branch" (by default "main" or "master")

Branch names

Datasets can have unlimited branches, each with their own timeline of changes
Each branch has a unique name, and this name serves as an identifier of the timeline
The default branch is typically called main or master

This default name can be configured in general using
```
git config --global init.defaultbranch main
```

Or initialized during dataset creation using

datalad create mydataset --initial-branch main

Running git status shows you which branch you're on

$ git status
On branch main
nothing to commit, working tree clean

Running git branch shows you which branch you're on and which other branches you have
```
$ git branch
  git-annex
* main
```
The git-annex branch is special, and only modified by git-annex
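
The commands above can be tried in a throwaway directory. This is a minimal sketch using plain Git only; the temporary directory, the demo identity, and the extra "scratch" branch are assumptions made for illustration, and a real DataLad dataset would also list the git-annex branch.

```shell
set -eu
demo=$(mktemp -d)                        # throwaway directory standing in for a dataset
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main    # name the default branch for this repository only
git config user.name "Demo User"         # local identity so the commit works anywhere
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"
git branch scratch                       # a second, hypothetical branch to make the listing interesting

current=$(git rev-parse --abbrev-ref HEAD)   # name of the branch that is checked out
echo "On branch: $current"
git branch                                   # '*' marks the current branch
```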

How to do branching - basic workflow and commands

The default branch will be
created together with the dataset

$ datalad create mydataset

How to do branching - basic workflow and commands

Every commit (datalad save)
on this branch
progresses its timeline

$ datalad save -m \
"adding preprocessing pipeline"

But sometimes you're not sure if a new thing you're trying will work out in the end

My own very first version controlled project:

How to do branching - basic workflow and commands

You can create new
branches for transparency,
structure, sandboxing new
developments, collaboration, or fun

$ git branch preproc
$ git checkout preproc
# or shorter:
$ git checkout -b preproc
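
A new branch starts from the history of its base branch and then collects its own commits. The sketch below illustrates this with plain Git in a temporary directory; `git commit` stands in for `datalad save`, and the file name and identity are hypothetical.

```shell
set -eu
demo=$(mktemp -d); cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
echo "pipeline v1" > preproc.sh
git add preproc.sh
git commit -q -m "adding preprocessing pipeline"

git checkout -q -b preproc                 # create the branch and switch to it
echo "parametrization A" >> preproc.sh
git commit -q -am "Added parametrization A"

main_commits=$(git rev-list --count main)        # main "stays in the past"
preproc_commits=$(git rev-list --count preproc)  # preproc = shared history + new commit
echo "main: $main_commits commit(s), preproc: $preproc_commits commit(s)"
```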

The new branch shares the
history with its base
branch but adds
independent new changes

$ datalad save -m \
"Added parametrization A"

How to do branching - basic workflow and commands

You can add as many
changes to the branch
as you want - the default
branch "stays in the past"
while you test new changes
After a few changes,
you might be confident
to run your script on
data and save the results

$ datalad save -m \
"Tweak parameter, add comments"
$ ...
$ datalad save -m \
"Compute results"

How to do branching - basic workflow and commands

When done with sandboxing
and the results
look ok, you could integrate
the changes from preproc
into the default branch.
You can jump between
branches, and merge
one or more branches into
another branch

$ git checkout main
$ git merge preproc

Advantages:
- Transparency
- Cleanliness
- If sandboxing fails, don't
merge, and your default
branch stays orderly
- Keep different preprocessing
in parallel
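
As a rough illustration of these points, the sketch below merges one successful sandbox branch into main and simply leaves a failed one unmerged. It uses plain Git in a temporary directory; the branch name "failed-idea" and all file contents are made up for the example.

```shell
set -eu
demo=$(mktemp -d); cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
echo "pipeline v1" > preproc.sh
git add preproc.sh
git commit -q -m "adding preprocessing pipeline"

git checkout -q -b preproc                  # sandbox that works out
echo "tweak" >> preproc.sh
git commit -q -am "Tweak parameter, add comments"

git checkout -q -b failed-idea main         # second sandbox, also starting from main
echo "dead end" > notes.txt
git add notes.txt
git commit -q -m "Try something that does not work out"

git checkout -q main
git merge -q preproc                        # integrate only the successful sandbox
git branch --merged main                    # lists main and preproc, not failed-idea
main_commits=$(git rev-list --count main)
```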

How to do branching - Time is fluid

Branches allow parallel
developments. While you
tweaked the parameters,
a path problem
was fixed in a new branch
and merged to main

# create & enter a new branch
# from main
$ git branch fix-paths
$ git checkout fix-paths
$ datalad save -m \
"Fix:Change abs to rel paths"

# merge the fix into main
$ git checkout main
$ git merge fix-paths

However: How does preproc get the crucial fix
from main?

How to do branching - Time is fluid

You can merge main
(contains the fix) into
preproc to keep preproc
up to date with main's new developments

$ git checkout preproc
$ git merge main
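
A sketch of this situation with plain Git, assuming the fix and the parameter changes touch different files (here the hypothetical paths.cfg and preproc.sh), so the merge needs no manual work. `--no-edit` accepts the default merge message instead of opening an editor.

```shell
set -eu
demo=$(mktemp -d); cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
echo "pipeline v1" > preproc.sh
echo "/abs/path/to/data" > paths.cfg
git add preproc.sh paths.cfg
git commit -q -m "adding preprocessing pipeline"

git checkout -q -b preproc                  # your sandbox
echo "parametrization A" >> preproc.sh
git commit -q -am "Added parametrization A"

git checkout -q -b fix-paths main           # the fix is developed from main
echo "data" > paths.cfg
git commit -q -am "Fix: Change abs to rel paths"
git checkout -q main
git merge -q fix-paths                      # main now contains the fix

git checkout -q preproc
git merge -q --no-edit main                 # bring the fix into preproc
fixed_path=$(cat paths.cfg)
echo "preproc now uses path: $fixed_path"
```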

Results can safely be
computed when the fix has
made it into preproc's
timeline

$ datalad save -m \
"Compute results"

How to do branching - Time is fluid

Merging preproc into
main adds all the
changes main doesn't
yet know about from preproc

$ git checkout main
$ git merge preproc

Development that doesn't
require sandboxing and
won't lead to disorder
can continue on main

$ datalad save -m \
"add DOI to README"

Summary - solitary branching

Branching relies completely on Git commands. The most important are:

git branch [branchname]: Create a new branch
git checkout [branchname]: Switch to a different, existing branch
git checkout -b [branchname]: Create a new branch and switch to it (shortcut)
git merge [branchname]: Integrate the changes from one branch into the one currently checked out

Some advantages of branching in a dataset only you work on are:

Sandboxing developments
Keeping parallel developments (e.g., different preprocessing flavours)
Cleanliness and order, and slightly more exciting visualizations of your history
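
To look at such a history on the command line, `git log` can draw the branch structure as a text graph. A minimal sketch with plain Git and made-up files:

```shell
set -eu
demo=$(mktemp -d); cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
echo "a" > a.txt; git add a.txt
git commit -q -m "Initial commit"

git checkout -q -b preproc
echo "b" > b.txt; git add b.txt
git commit -q -m "Added parametrization A"

git checkout -q main
echo "c" > c.txt; git add c.txt
git commit -q -m "add DOI to README"
git merge -q --no-edit preproc              # histories diverged, so this creates a merge commit

git log --oneline --graph --all             # text drawing of both timelines and the merge
merges=$(git rev-list --count --merges main)
```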

Questions!

Branching workflows in collaborations

How to do branching - across time and space

In collaborative workflows
each collaborator has their
own copy of a dataset,
and there also is a
central dataset used
to let collaborators
synchronize their work
To let others collaborate
on your dataset, you
put it to a central place
e.g., repository hosting
services like GitHub

$ datalad create-sibling-github \
mydataset -n upstream \
--access-protocol ssh

Once you have created
the central sibling
you can push your changes

$ datalad push --to upstream
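
The same idea can be tried offline. In this sketch a local bare repository is a hypothetical stand-in for the repository on GitHub, and `git push` stands in for `datalad push` (which would additionally take care of annexed file content).

```shell
set -eu
work=$(mktemp -d)
git init -q --bare "$work/central.git"      # stand-in for the central repository
git init -q "$work/mydataset"
cd "$work/mydataset"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"

git remote add upstream "$work/central.git" # register the sibling under the name "upstream"
git push -q upstream main                   # publish the main branch
pushed=$(git ls-remote --heads upstream main)
echo "$pushed"
```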

Detour: Authentication and access

Have an account on that service
Create a personal access token for authentication
Set up SSH keys to use the SSH protocol for repository access

Detour: Authentication and access

Personal tokens for authentication

How to do branching - across time and space

Your own dataset now
has a sibling on GitHub
The repo on GitHub is
called "mydataset",
and your local dataset
knows this sibling
under the name upstream

Detour: Authentication and access

Set up SSH keys to use the SSH protocol for repository access

Different protocols exist to synchronize changes between dataset siblings (e.g., pushing local changes upstream or pulling/updating from upstream). The most important ones are "HTTPS" and "SSH"
If you want to use SSH (which can be more convenient), you need an account and an SSH key pair. This is a set of two files with character gibberish. You create them from the command line with an OS-specific command, e.g.
```
ssh-keygen -t ed25519 -C "your_email@example.com"
```
(see here for instructions for each OS)
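
To see what the key pair looks like without touching existing keys, the sketch below writes one into a temporary directory. The `-f` path and the empty passphrase (`-N ""`) are assumptions for a non-interactive demo; real keys usually live in ~/.ssh and should be protected with a passphrase.

```shell
set -eu
keydir=$(mktemp -d)                          # throwaway location for the demo keys
ssh-keygen -q -t ed25519 -C "your_email@example.com" \
    -f "$keydir/id_ed25519" -N ""            # -N "" = no passphrase (demo only)
ls "$keydir"                                 # id_ed25519 (secret) and id_ed25519.pub (public)
cat "$keydir/id_ed25519.pub"                 # this is the content to add to the hosting service
```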

Detour: Authentication and access

Set up SSH keys to use the SSH protocol for repository access

One file is secret, one is public (ends with .pub).
Add the contents of the public file to your GitHub account:

How to do branching - across time and space

Your collaborator gets a
copy of the central dataset
by cloning (via preferred protocol) from GitHub.

# via ssh
$ datalad clone \
git@github.com:adswa/mydataset.git
# via https:
$ datalad clone \
https://github.com/adswa/mydataset.git

You can get the clone
URL right from GitHub.
By default, the dataset one clones from is
known as "origin" to the local clone.
For consistency, they can
rename this sibling to
upstream as well

$ git remote rename origin upstream
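
A sketch of the renaming step with plain Git. A local repository stands in for the dataset on GitHub (an assumption for the demo), and `git clone` stands in for `datalad clone`.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/central"                  # stand-in for the dataset on GitHub
git -C "$work/central" symbolic-ref HEAD refs/heads/main
git -C "$work/central" -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q --allow-empty -m "Initial commit"

git clone -q "$work/central" "$work/clone"
cd "$work/clone"
git remote                                   # the clone knows its source as "origin"
git remote rename origin upstream
remotes=$(git remote)
echo "$remotes"
```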

How to do branching - across time and space

All collaborators can
work in parallel.
They could work on the default branch, but this is bad practice and impractical - it's better to use new branches

How to do branching - integrating others' changes

GitHub lets you add
"collaborators" to repos.
If collaborators are added,
they can push their
changes directly to the
central repo

# your collaborator runs
$ datalad push --to upstream
# or with Git
$ git push upstream fix-paths

If they are not a collaborator
they need to create a fork
of the repository under their
account and clone & push there.
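
The direct-push case can be sketched offline with plain Git. The bare repository below is a hypothetical stand-in for the central repo, and `git clone -o upstream` names the remote "upstream" right away instead of renaming it afterwards.

```shell
set -eu
work=$(mktemp -d)
git init -q --bare "$work/central.git"       # stand-in for the central repo
git --git-dir="$work/central.git" symbolic-ref HEAD refs/heads/main

git init -q "$work/owner"; cd "$work/owner"  # the owner publishes main first
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"
git remote add upstream "$work/central.git"
git push -q upstream main

git clone -q -o upstream "$work/central.git" "$work/collaborator"
cd "$work/collaborator"                      # the collaborator works on a new branch
git config user.name "Collaborator"; git config user.email "collab@example.com"
git checkout -q -b fix-paths
git commit -q --allow-empty -m "Fix: Change abs to rel paths"
git push -q upstream fix-paths               # only the new branch is pushed

git --git-dir="$work/central.git" branch     # central repo now has fix-paths and main
```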

How to do branching - integrating others' changes

When pushing the branch
to upstream, GitHub
prompts you to create a
pull request. Other repository hosting services call this a merge request, because it is a request to merge the new branch into the default one.
Merging the pull request
merges the collaborator's
fix-paths branch into
the default branch.

How to do branching - integrating others' changes

Others can integrate the
new changes if they need them

# you run on branch preproc
$ git pull upstream main
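
A sketch of this pull with plain Git and a local bare repository as a hypothetical central repo. Newer Git versions may ask how to reconcile diverged histories when pulling, so the sketch requests a merge explicitly with `--no-rebase`; `--no-edit` accepts the default merge message.

```shell
set -eu
work=$(mktemp -d)
git init -q --bare "$work/central.git"
git --git-dir="$work/central.git" symbolic-ref HEAD refs/heads/main

git init -q "$work/me"; cd "$work/me"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
echo "/abs/path" > paths.cfg; echo "pipeline v1" > preproc.sh
git add paths.cfg preproc.sh
git commit -q -m "adding preprocessing pipeline"
git remote add upstream "$work/central.git"
git push -q upstream main

git clone -q -o upstream "$work/central.git" "$work/collaborator"
cd "$work/collaborator"                      # the fix reaches main on the central repo
git config user.name "Collaborator"; git config user.email "collab@example.com"
echo "rel/path" > paths.cfg
git commit -q -am "Fix: Change abs to rel paths"
git push -q upstream main

cd "$work/me"                                # meanwhile, you work on preproc
git checkout -q -b preproc
echo "parametrization A" >> preproc.sh
git commit -q -am "Added parametrization A"
git pull -q --no-rebase --no-edit upstream main   # fetch main from upstream and merge it
pulled_path=$(cat paths.cfg)
echo "preproc now uses path: $pulled_path"
```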

How to do branching - integrating others' changes

Once ready, it's your time
to push the changes and do a PR

# you run on branch preproc
$ datalad push --to upstream

Summary - collaborative branching

Branching workflows ensure clean, parallel development by multiple people
Collaborative workflows require a network of datasets:

clone: A dataset that was cloned from elsewhere.
sibling/remote: A dataset (clone) that a given dataset knows about. Can be established automatically (e.g., a clone knows its original dataset), or by hand (via "datalad siblings add --name [name] --url [url]" or "git remote add [name] [url]").
fork: A clone on a repository hosting site. “Forking” a repo from a different user “clones” it to your user account. Necessary when you don’t have permissions to push changes to the other user’s repository but still want to propose changes. Not necessary when you are a collaborator on the repository via the hosting service’s web interface.
upstream vs origin: Any clone knows its origin as a remote (by default called "origin"). A dataset can have multiple remotes (e.g., a different user’s dataset on GitHub, your own fork of this repository on GitHub). Convention: the original dataset on GitHub is "upstream", your fork of it is "origin". This involves adding a sibling/remote by hand and potentially renaming siblings/remotes (via git remote rename [name] [newname]).
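
The naming convention can be sketched with plain Git. The two bare repositories are hypothetical local stand-ins for the original dataset and your fork on a hosting service.

```shell
set -eu
work=$(mktemp -d)
git init -q --bare "$work/original.git"      # stand-in for the original dataset
git init -q --bare "$work/fork.git"          # stand-in for your fork of it
git init -q "$work/local"
cd "$work/local"

git remote add origin "$work/fork.git"       # convention: your fork is "origin"
git remote add upstream "$work/original.git" # convention: the original is "upstream"
git remote -v                                # lists each remote with its URL
upstream_url=$(git config --get remote.upstream.url)
```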

Questions!

Merge conflicts

Merge conflicts arise when a file version-controlled in Git contains conflicting changes, for example when two collaborators modified the exact same line with different changes, and Git does not have a strategy to resolve the conflict
A merge conflict indicates:

"Before I merge, help me choose which modification to keep"
A merge conflict looks like this:

$ git pull upstream master
From github.com:adswa/mydataset
 * branch            master     -> FETCH_HEAD
Auto-merging code/preproc.sh
CONFLICT (content): Merge conflict in code/preproc.sh
Automatic merge failed; fix conflicts and then commit the result.
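
A conflict like this can be reproduced safely in a throwaway repository. In this sketch (plain Git, made-up file contents) two branches change the same line of code/preproc.sh, so the merge stops and asks for help.

```shell
set -eu
demo=$(mktemp -d); cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
mkdir code
echo "this is a script" > code/preproc.sh
git add code/preproc.sh
git commit -q -m "adding preprocessing pipeline"

git checkout -q -b preproc
echo "some parameter changes" > code/preproc.sh   # preproc rewrites the line ...
git commit -q -am "Tweak parameter"
git checkout -q main
echo "fixed paths!" > code/preproc.sh             # ... and main rewrites the same line
git commit -q -am "Fix: Change abs to rel paths"

git checkout -q preproc
merge_failed=no
git merge main || merge_failed=yes                # a conflicting merge exits non-zero
conflicted=$(git diff --name-only --diff-filter=U) # files with unresolved conflicts
echo "conflicted: $conflicted"
```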

Tips for resolving merge conflicts

git status can guide you through resolving the merge conflict. Run it frequently

$ git status
On branch preproc
You have unmerged paths.
  (fix conflicts and run "git commit")
  (use "git merge --abort" to abort the merge)

Unmerged paths:
  (use "git add file..." to mark resolution)
	both modified:   code/preproc.sh

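no changes added to commit (use "git add" and/or "git commit -a")

How to read this output:
- "You have unmerged paths." means "I'm in a merge conflict!"
- "git merge --abort" is how to emergency-abort
- "fix conflicts and run git commit" is what to do next
- "Unmerged paths:" lists which files contain conflicts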
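
The emergency exit can be tried in a throwaway repository. This sketch (plain Git, made-up file contents) provokes a conflict and then aborts the merge, which puts the branch back into its state from before the merge.

```shell
set -eu
demo=$(mktemp -d); cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
mkdir code
echo "this is a script" > code/preproc.sh
git add code/preproc.sh
git commit -q -m "adding preprocessing pipeline"
git checkout -q -b preproc
echo "some parameter changes" > code/preproc.sh
git commit -q -am "Tweak parameter"
git checkout -q main
echo "fixed paths!" > code/preproc.sh
git commit -q -am "Fix: Change abs to rel paths"
git checkout -q preproc

git merge main >/dev/null || echo "merge stopped with a conflict"
git status --short                           # "UU" marks a file modified on both sides
git merge --abort                            # give up on the merge for now
after_abort=$(git status --porcelain)        # empty output = clean working tree
content=$(cat code/preproc.sh)
```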

Tips for resolving merge conflicts

Take a look into the file(s), in an editor or from the command line. Git has special mark-up to indicate conflicting changes.
"<<<<<<<" followed by a label naming the revision (e.g., HEAD, a branch name, a commit SHA) until "=======" indicates one set of changes.
Everything after "=======" until ">>>>>>>" followed by a label indicates the other set of changes

$ git diff
diff --cc code/preproc.sh
index fc3f8e8,14a0a13..0000000
--- a/code/preproc.sh
+++ b/code/preproc.sh
@@@ -1,3 -1,1 +1,7 @@@
++<<<<<<< HEAD
 +this is a script for processing
 +some parameter changes
 +some more parameter tweaks
++=======
+ fixed paths!
++>>>>>>> 9217a4b101159e6b5aab0a548aeb75fb82cca798

The label after the marker shows you where the change is from (e.g., "HEAD" means "most recent commit on this branch")

This is your most recent change

This is a conflicting change

There can be multiple conflicts in a single file

Tips for resolving merge conflicts

To fix a merge conflict...

Remove any lines you don't want to keep. You can keep lines from both change sets!
Remove the "<<<<<<<", ">>>>>>>", and "=======" conflict mark-up afterwards

$ git diff
diff --cc code/preproc.sh
index fc3f8e8,14a0a13..0000000
--- a/code/preproc.sh
+++ b/code/preproc.sh
@@@ -1,3 -1,1 +1,4 @@@
 +this is a script for processing
 +some parameter changes
 +some more parameter tweaks
 +fixed paths!

git add the file
When all files with conflicts are added, git commit to resolve the merge

$ git add code/preproc.sh
$ git commit
[preproc 1cab31a] Merge branch 'master' of github.com:adswa/mydataset into preproc
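
The whole resolution can be rehearsed in a throwaway repository. In this sketch (plain Git, made-up file contents) the file is overwritten from the command line instead of being edited by hand, keeping one line from each side; `git commit --no-edit` concludes the merge with the default message.

```shell
set -eu
demo=$(mktemp -d); cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
mkdir code
echo "this is a script" > code/preproc.sh
git add code/preproc.sh
git commit -q -m "adding preprocessing pipeline"
git checkout -q -b preproc
echo "some parameter changes" > code/preproc.sh
git commit -q -am "Tweak parameter"
git checkout -q main
echo "fixed paths!" > code/preproc.sh
git commit -q -am "Fix: Change abs to rel paths"
git checkout -q preproc
git merge main >/dev/null || echo "merge stopped with a conflict"

# Resolve: keep lines from both change sets, without any conflict mark-up
printf 'some parameter changes\nfixed paths!\n' > code/preproc.sh
git add code/preproc.sh                      # mark the conflict as resolved
git commit -q --no-edit                      # conclude the merge
merges=$(git rev-list --count --merges HEAD)
remaining=$(git status --porcelain)
```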

Tips for resolving merge conflicts

Merge conflicts are usually harmless
As with many other problems, git status will tell you what to do next and which commands to run
You could configure Git to use merge strategies resulting in fewer manual resolutions, e.g., "always keep my changes if others' changes conflict". More information: git-scm.com/docs/merge-strategies
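
As one example of such an option, `git merge -X ours` resolves conflicting hunks in favour of the branch that is checked out. The sketch below (plain Git, made-up file contents) applies it to a single merge; note that the other side's conflicting change is dropped without any prompt, so this is best used deliberately.

```shell
set -eu
demo=$(mktemp -d); cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"; git config user.email "demo@example.com"
mkdir code
echo "this is a script" > code/preproc.sh
git add code/preproc.sh
git commit -q -m "adding preprocessing pipeline"
git checkout -q -b preproc
echo "some parameter changes" > code/preproc.sh
git commit -q -am "Tweak parameter"
git checkout -q main
echo "fixed paths!" > code/preproc.sh
git commit -q -am "Fix: Change abs to rel paths"
git checkout -q preproc

git merge -q --no-edit -X ours main          # on conflict, keep the change from preproc
kept=$(cat code/preproc.sh)
echo "kept: $kept"
```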
