Motivation & Basics of version control

Participation modes

Prerequisites: Installation and Configuration

  • Your installed version of DataLad should be 0.16.1
  • datalad --version
    0.16.1
  • DataLad relies on Git to create a revision history with detailed information on what was changed, when, and how. Therefore, you should tell Git who you are and configure a Git identity (name and email)
  • $ git config --list
    user.name=Adina Wagner
    user.email=adina.wagner@t-online.de
    [...]
    
  • Set a Git identity using
    $ git config --global user.name "Adina Wagner"
    $ git config --global user.email "adina.wagner@t-online.de"
    Find installation and configuration instructions at handbook.datalad.org

Using DataLad

  • DataLad can be used from the command line
  • datalad create mydataset
  • ... or with its Python API
  • import datalad.api as dl
    dl.create(path="mydataset")
  • ... and other programming languages can use it via system call
  • # in R
    > system("datalad create mydataset")
    

Using DataLad

  • Every DataLad command consists of a main command followed by a sub-command. The main and the sub-command can have options.
  • Example (main command, subcommand, several subcommand options):
    $ datalad save -m "Saving changes" --recursive 
  • Use --help to find out more about any (sub)command and its options, including detailed description and examples (q to close). Use -h to get a short overview of all options
    $ datalad save -h
          Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
                        [-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
                        [--version]
                        [PATH ...]
    
    Use '--help' to get more comprehensive information.
              

DataLad Datasets

  • DataLad's core data structure
    • Dataset = A directory managed by DataLad
    • Any directory of your computer can be managed by DataLad.
    • Datasets can be created (from scratch) or installed
    • Datasets can be nested: linked subdirectories
  • Let's start by creating a dataset:
  • $ datalad create -c text2git my-dataset
Code: psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#getting-started-create-an-empty-dataset

DataLad Datasets

A DataLad dataset is a joint Git + git-annex repository

What is version control?

Illustration adapted from Scriberia and The Turing Way
  • keep things organized
  • keep track of changes
  • revert changes or go back to previous states

Why version control?


Version Control

  • DataLad knows two things: Datasets and files

  • Every file you put into a dataset can be easily version-controlled, regardless of size, with the same command: datalad save
  • Local version control

    Procedurally, version control is easy with DataLad!


    Advice:
    • Save meaningful units of change
    • Attach helpful commit messages

    Preview: Start to record provenance

    • Have you ever saved a PDF to read later onto your computer, but forgot where you got it from?
    • Digital Provenance = "The tools and processes used to create a digital file, the responsible entity, and when and where the process events occurred"
    • The history of a dataset already contains provenance, but there is more to record - for example: Where does a file come from? datalad download-url is helpful

    Summary - Local version control

    datalad create creates an empty dataset.
    Configurations (-c yoda, -c text2git) are useful (details soon).

    A dataset has a history to track files and their modifications.
    Explore it with Git (git log) or external tools (e.g., tig).

    datalad save records the dataset or file state to the history.
    Concise commit messages should summarize the change for future you and others.

    datalad download-url obtains web content and records its origin.
    It even takes care of saving the change.

    datalad status reports the current state of the dataset.
    A clean dataset status (no modifications, no untracked files) is good practice.

    Questions!

    Awkward silence can be bridged with awkward MC questions :)

    Teaser: Time-travelling

    Comprehensive walk-through handbook.datalad.org/basics/101-137-history.html
    • Mistakes are not forever anymore: Past changes can transparently be undone
    • Become a time-bender: Travel back in time or rewrite history
    • Prerequisite: Understand Git IDs and "refs"
      • Commit hash/Commit SHA: A 40-character string identifying each commit
      • Branch names, e.g., main
      • Tags, e.g., v.0.1
      • A pointer to the checked-out (current) commit on the current branch, HEAD
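To make the 40-character IDs less mysterious, here is a minimal sketch of how Git derives an object ID from content. It assumes Git's default SHA-1 object format and uses a file ("blob") object for brevity. Commit hashes are built the same way, from a "commit" header plus the commit text (tree, parents, author, message). The function name git_object_id is made up for this illustration.

```python
import hashlib


def git_object_id(content: bytes, obj_type: str = "blob") -> str:
    """Return the SHA-1 ID Git assigns to an object with this content.

    Git hashes a small header ("<type> <size>" plus a NUL byte)
    followed by the raw content.
    """
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


# The ID of an empty file; compare with `git hash-object` on an empty file.
empty_id = git_object_id(b"")
print(empty_id, len(empty_id))
```

Because the ID depends only on the content, any change to a file or a commit yields a different hash.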

    Code: psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#breaking-things-and-repairing-them

    Summary: Interacting with Git's history (teaser)

    Interactions with Git's history require Git commands, but are immensely powerful
    More in handbook.datalad.org/basics/101-137-history.html

    git restore is a dangerous (!), but sometimes useful command:
    It removes unsaved modifications to restore files to a past, saved state. What it removes cannot be brought back!

    git revert [hash] transparently undoes a past commit
    It will create a new entry in the revision history about this.

    Commands that will be introduced later:
    git checkout lets you time-travel.
    Commands that are out of scope but useful to know:
    git rebase changes and git reset rewinds history without creating a commit about it (see Handbook chapter for examples).
    A life-saver that is not well-known: git reflog
    A time-limited log of every action performed in the repository; it can undo every mistake except git restore and git clean.

    Questions!

    Awkward silence can be bridged with awkward MC questions :)

    A look underneath the hood

    (In-depth explanations how and why things work, with plenty of teasers to additional features)

    There are two version control tools at work - why?

    Git does not handle large files well.

    And repository hosting services refuse to handle large files:

    git-annex to the rescue! Let's take a look how it works

    Consuming datasets

    • Here's how to get a dataset:

    Consuming datasets

    • Here's how a dataset looks after installation:

    Plenty of data, but little disk-usage

    • Cloned datasets are lean. "Meta data" (file names, availability) are present, but no file content:
    • $ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
        install(ok): /tmp/studyforrest-data-phase2 (dataset)
        $ cd studyforrest-data-phase2 && du -sh
        18M	.
    • File contents can be retrieved on demand:
    $ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
      get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
  • Have access to more data on your computer than you have disk-space:
  • # eNKI dataset (1.5TB, 34k files):
    $ du -sh
    1.5G	.
    # HCP dataset (~200TB, >15 million files)
    $ du -sh
    48G	. 

    Git versus Git-annex

    Data in datasets is either stored in Git or git-annex
    By default, everything is annexed, i.e., stored in a dataset annex by git-annex


    Git:
    • Handles small files well (text, code)
    • File contents are in the Git history and will be shared upon git/datalad push
    • Shared with every dataset clone
    • Useful: small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files

    git-annex:
    • Handles all types and sizes of files well
    • File contents are in the annex and are not necessarily shared
    • Can be kept private on a per-file level when sharing the dataset
    • Useful: large files, private files


    Git versus Git-annex

    Useful background information for demo later. Read this handbook chapter for details
    Git and Git-annex handle files differently: annexed files are stored in an annex. File content is hashed & only content-identity is committed to Git.
    • Files stored in Git are modifiable, files stored in Git-annex are content-locked
    • Annexed contents are not available right after cloning, only content identity and availability information (as they are stored in Git). Everything that is annexed needs to be retrieved with datalad get from wherever it is stored.

    Git versus Git-annex

      When sharing datasets with someone without access to the same computational infrastructure, annexed data is not necessarily stored together with the rest of the dataset (more in the session on publishing).
      Transport logistics exist to interface with all major storage providers. If the one you use isn't supported, let us know!

    Git versus Git-annex

      Users can decide which files are annexed:

    • Pre-made run-procedures, provided by DataLad (e.g., text2git, yoda) or created and shared by users (Tutorial)
    • Self-made configurations in .gitattributes (e.g., based on file type, file/path name, size, ...; rules and examples)
    • Per-command basis (e.g., via datalad save --to-git)
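As a rough illustration of the second option, the sketch below writes annex.largefiles rules into a dataset's .gitattributes file. The helper add_annex_rules, the 100kb threshold, and the *.md pattern are assumptions chosen for this example. The rule syntax follows git-annex's largefiles expressions. New rules apply only to files saved afterwards, and the changed .gitattributes file itself still needs to be saved.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def add_annex_rules(dataset: Path, rules: Iterable[str]) -> Path:
    """Append rules to <dataset>/.gitattributes, skipping rules already present."""
    attrs = dataset / ".gitattributes"
    lines = attrs.read_text().splitlines() if attrs.exists() else []
    for rule in rules:
        if rule not in lines:
            lines.append(rule)
    attrs.write_text("\n".join(lines) + "\n")
    return attrs


# Hypothetical rules: annex everything larger than 100kb,
# but always keep Markdown files in Git.
example_rules = [
    "* annex.largefiles=(largerthan=100kb)",
    "*.md annex.largefiles=nothing",
]

with tempfile.TemporaryDirectory() as tmp:  # stands in for a real dataset
    print(add_annex_rules(Path(tmp), example_rules).read_text())
```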

    text2git: Text versus binary files

    An overview of text- versus binary files and implications for version control is in psychoinformatics-de.github.io/rdm-course/02-structuring-data/index.html#file-types-text-vs-binary

    Disk-space aware workflows

    • Clone the input data:
    • $ datalad clone git@github.com:datalad-datasets/machinelearning-books.git
      install(ok): /tmp/machinelearning-books (dataset)
      $ cd machinelearning-books && du -sh
      348K	.
      $ ls
      A.Shashua-Introduction_to_Machine_Learning.pdf
      B.Efron_T.Hastie-Computer_Age_Statistical_Inference.pdf
      C.E.Rasmussen_C.K.I.Williams-Gaussian_Processes_for_Machine_Learning.pdf
      D.Barber-Bayesian_Reasoning_and_Machine_Learning.pdf
      [...]
    • Retrieve annexed file contents on demand:
    • $ datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
        get(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [from web...]
    • Drop annexed file contents when done:
    • $ datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf
        drop(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]

    Distributed availability

    • git-annex conceptualizes file availability information as a decentralized network. A file can exist in multiple different locations. git annex whereis tells you which are known:
    • $ git annex whereis inputs/images/chinstrap_02.jpg
      whereis inputs/images/chinstrap_02.jpg (1 copy)
      	00000000-0000-0000-0000-000000000001 -- web
      	c1bfc615-8c2b-4921-ab33-2918c0cbfc18 -- adina@muninn:/tmp/my-dataset [here]
      
        web: https://unsplash.com/photos/8PxCm4HsPX8/download?force=true
      ok
      
    • If a file has no other known storage locations, drop will warn
      • Here is a file with a registered remote location (the web)
      • $ datalad drop inputs/images/chinstrap_02.jpg
        drop(ok): /home/my-dataset/inputs/images/chinstrap_02.jpg (file)
        $ datalad get inputs/images/chinstrap_02.jpg
        get(ok): inputs/images/chinstrap_02.jpg (file)
        
      • Here is a file without a registered remote location
      • $ datalad drop inputs/images/chinstrap_01.jpg
        drop(error): inputs/images/chinstrap_01.jpg (file)
                     [unsafe; Could only verify the existence of 0 out of 1 necessary copy;
                     (Use --reckless availability to override this check, or adjust numcopies.)]
    • Delineation and advantages of decentral versus central RDM: In defense of decentralized research data management
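The safety check behind the drop examples above can be pictured with a small model. This is a simplified sketch, not git-annex's actual implementation: the class AnnexedFile, the function can_drop, and the reckless switch are hypothetical names that mirror the numcopies setting and the override mentioned in the error message.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class AnnexedFile:
    name: str
    # Locations other than the local annex where the content is known to exist.
    other_copies: Set[str] = field(default_factory=set)


def can_drop(f: AnnexedFile, numcopies: int = 1, reckless: bool = False) -> bool:
    """Allow dropping local content only if enough other copies are known."""
    if reckless:  # models overriding the availability check
        return True
    return len(f.other_copies) >= numcopies


downloaded = AnnexedFile("inputs/images/chinstrap_02.jpg", {"web"})
local_only = AnnexedFile("inputs/images/chinstrap_01.jpg")

print(can_drop(downloaded))  # a copy on the web is registered
print(can_drop(local_only))  # no other copy known, so the drop is refused
```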

    Data protection

    Why are annexed contents write-protected? (part 1)

    • Where the filesystem allows it, annexed files are symlinks:
      $ ls -l inputs/images/chinstrap_01.jpg
      lrwxrwxrwx 1 adina adina 132 Apr  5 20:53 inputs/images/chinstrap_01.jpg -> ../../.git/annex/objects/1z/
      xP/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg
      
      (PS: especially useful in datasets with many identical files)
    • The symlink reveals git-annex internal data organization based on identity hash:
      $ md5sum inputs/images/chinstrap_01.jpg
      2e043a5654cec96aadad554fda2a8b26  inputs/images/chinstrap_01.jpg
      
    • git-annex write-protects files to keep this symlink functional - Changing file contents without git-annex knowing would make the hash change and the symlink point to nothing
    • To (temporarily) remove the write-protection one can unlock the file
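The key in the symlink target above follows the pattern MD5E-s<size>--<md5 checksum><extension>. The sketch below rebuilds such a key from a file's content, assuming the MD5E backend shown in the example. The function md5e_key and the demo file are made up for illustration, and the extension handling is simplified to the last suffix only.

```python
import hashlib
import tempfile
from pathlib import Path


def md5e_key(path: Path) -> str:
    """Build an MD5E-style key: MD5E-s<size in bytes>--<md5 hex digest><extension>."""
    data = path.read_bytes()  # fine for a sketch; hash large files in chunks instead
    digest = hashlib.md5(data).hexdigest()
    # Simplification: keep only the last extension of the file name.
    return f"MD5E-s{len(data)}--{digest}{path.suffix}"


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jpg"  # hypothetical file standing in for an image
    demo.write_bytes(b"hello")
    print(md5e_key(demo))
```

Two files with identical content get the same key, so their content is stored only once. Modified content would need a different key, which is why the content behind the symlink must not change unnoticed.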

    Detour & Teaser: Reproducible data analysis

    Your past self is the worst collaborator: Full comic at http://phdcomics.com/comics.php?f=1979

    Code: psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#data-processing

    Reproducible execution & provenance capture

    datalad run wraps a command execution and records its impact on a dataset.

    commit 9fbc0c18133aa07b215d81b808b0a83bf01b1984 (HEAD -> main)
    Author: Adina Wagner <adina.wagner@t-online.de>
    Date:   Mon Apr 18 12:31:47 2022 +0200
    
        [DATALAD RUNCMD] Convert the second image to greyscale
    
        === Do not change lines below ===
        {
         "chain": [],
         "cmd": "python code/greyscale.py inputs/images/chinstrap_02.jpg outputs/im>
         "dsid": "418420aa-7ab7-4832-a8f0-21107ff8cc74",
         "exit": 0,
         "extra_inputs": [],
         "inputs": [],
         "outputs": [],
         "pwd": "."
        }
        ^^^ Do not change lines above ^^^
    
    diff --git a/outputs/images_greyscale/chinstrap_02_grey.jpg b/outputs/images_gr>
    new file mode 120000
    index 0000000..5febc72
    --- /dev/null
    +++ b/outputs/images_greyscale/chinstrap_02_grey.jpg
    @@ -0,0 +1 @@
    +../../.git/annex/objects/19/mp/MD5E-s758168--8e840502b762b2e7a286fb5770f1ea69.>
    \ No newline at end of file
    

    The resulting commit's hash (or any other identifier) can be used to automatically re-execute a computation (more on this tomorrow)
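Since the run record is plain JSON inside the commit message, it can be read back by any program. The following sketch extracts it using the two marker lines shown in the log above. The function extract_run_record and the example message (its command and dataset ID) are hypothetical.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message: str) -> dict:
    """Return the JSON run record embedded between the two marker lines."""
    start = commit_message.index(START) + len(START)
    end = commit_message.index(END, start)
    return json.loads(commit_message[start:end])


# Hypothetical commit message in the layout written by datalad run.
message = """[DATALAD RUNCMD] Say hello

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "echo hello",
 "dsid": "00000000-0000-0000-0000-000000000000",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = extract_run_record(message)
print(record["cmd"], record["exit"])
```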

    Data protection

    Why are annexed contents write-protected? (part 2)

    • When you try to modify an annexed file without unlocking you will see "Permission denied" errors.
      Traceback (most recent call last):
        File "/home/bob/Documents/rdm-warmup/example-dataset/code/greyscale.py", line 20, in <module>
          grey.save(args.output_file)
        File "/home/bob/Documents/rdm-temporary/venv/lib/python3.9/site-packages/PIL/Image.py", line 2232, in save
          fp = builtins.open(filename, "w+b")
      PermissionError: [Errno 13] Permission denied: 'outputs/images_greyscale/chinstrap_02_grey.jpg'
      
    • Use datalad unlock to make the file modifiable. Underneath the hood (given the file system initially supported symlinks), this removes the symlink:
      $ datalad unlock outputs/images_greyscale/chinstrap_02_grey.jpg
      $ ls -l outputs/images_greyscale/chinstrap_02_grey.jpg
      -rw-r--r-- 1 adina adina 758168 Apr 18 12:31 outputs/images_greyscale/chinstrap_02_grey.jpg
    • datalad save locks the file again. Locking and unlocking ensures that git-annex always finds the right version of a file.
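The locking and unlocking described above can be imitated with plain filesystem operations. This is a toy model under simplifying assumptions, not what git-annex does internally: the functions lock and unlock are hypothetical, content is stored under the file name instead of a hash-based key, and a filesystem with symlink support is assumed.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def lock(path: Path, store: Path) -> None:
    """Move content into a store, write-protect it, and leave a symlink behind."""
    store.mkdir(parents=True, exist_ok=True)
    target = store / path.name  # simplification: real keys are hash-based
    shutil.move(str(path), str(target))
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # read-only
    path.symlink_to(target)


def unlock(path: Path) -> None:
    """Replace the symlink with a regular, writable copy of the content."""
    target = path.resolve()
    path.unlink()  # removes the symlink, not the stored content
    shutil.copyfile(target, path)
    path.chmod(0o644)


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "image.jpg"  # hypothetical file
    image.write_bytes(b"pixels")
    lock(image, Path(tmp) / "store")
    print("locked, symlink:", image.is_symlink())
    unlock(image)
    print("unlocked, symlink:", image.is_symlink())
```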

    Reproducible execution & provenance capture

    datalad run wraps a command execution and records its impact on a dataset.
    In addition, it can take care of data retrieval and unlocking

    datalad rerun

    • datalad rerun is helpful to spare others and yourself the short- or long-term memory task, or the forensic skills to figure out how you performed an analysis
    • But it is also a digital and machine-readable provenance record
    • Important: The better the run command is specified, the better the provenance record
    • Note: run and rerun only create an entry in the history if the command execution leads to a change.


    • Task: Use datalad rerun to rerun the script execution. Find out if the output changed

    Summary - Underneath the hood

      Files are either kept in Git or in git-annex.
      datalad save is used for both, but configurations (e.g., text2git), dataset rules (e.g., in a .gitattributes file), or flags change the default behavior of annexing everything.

      Annexed files behave differently from files kept in Git:
      They can be retrieved and dropped from local or remote locations, they are write-protected, and their content is unknown to Git (and thus easy to keep private).

      datalad clone installs datasets from URLs or local or remote paths
      Annexed file contents can be retrieved or dropped on demand; contents of files stored in Git are available right away.

      datalad unlock makes annexed files modifiable, datalad save locks them again.
      (It is generally easier to get accidentally saved files out of the annex than out of Git - see handbook.datalad.org/basics/101-136-filesystem.html for examples)

      datalad run records the impact of any command execution in a dataset.
      Data/directories specified as --input are retrieved prior to command execution, data/directories specified as --output are unlocked.

      datalad rerun can automatically re-execute run-records later.
      They can be identified with any commit-ish (hash, tag, range, ...)

    Questions!

    Awkward silence can be bridged with awkward MC questions :)

    Before we continue...

    Let your energy level define how we progress: