Brainhack Global 2020 Ankara
🧠💻

An introduction to DataLad

Adina Wagner
@AdinaKrik

Psychoinformatics lab,
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7)
Research Center JΓΌlich
ReproNim/INCF fellow


DataLad can help
with small or large-scale
data management
Free,
open source,
command line tool & Python API

Some basics

  • A command-line tool, available for all major operating systems (Linux, macOS/OSX, Windows), MIT-licensed
  • Built on top of Git and git-annex
  • Allows...
    • ... version-controlling arbitrarily large content:
      version control data and software alongside your code!
    • ... transport mechanisms for sharing and obtaining data:
      consume and collaborate on data (analyses) like software
    • ... (computationally) reproducible data analysis:
      track and share provenance of all digital objects
    • ... and much more
  • Completely domain-agnostic

A few things that DataLad can help with

  • Getting data
  • Keeping a project clean and orderly
  • Computationally reproducible data analysis


There is much more, and you can read about it in
The DataLad Handbook (handbook.datalad.org)

Acknowledgements

Software
  • Michael Hanke
  • Yaroslav Halchenko
  • Joey Hess (git-annex)
  • Kyle Meyer
  • Benjamin Poldrack
  • 26 additional contributors
Documentation project
  • Michael Hanke
  • Laura Waite
  • 28 additional contributors
Funders

Collaborators

Everything happens in DataLad datasets

  • DataLad's core data structure
    • Dataset = A directory managed by DataLad
    • A Git/git-annex repository
    • Any directory on your computer can be managed by DataLad.
    • Datasets can be created (from scratch) or installed
[Figure: file viewer and terminal view of a DataLad dataset]

Using DataLad

  • DataLad can be used from the command line
  • datalad create mydataset
  • ... or with its Python API
  • import datalad.api as dl
    dl.create(path="mydataset")
  • ... and other programming languages can use it via system call
  • # in R
    > system("datalad create mydataset")
    

Getting data

  • Datasets can be used to distribute data
  • You can clone a dataset from a public or private place and get access to the data it tracks
[Screen recording: cloning the studyforrest dataset from GitHub]

  • Datasets are lightweight: upon installation, only small files and metadata about file availability are retrieved, but no file content.
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
  install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
  18M	.          # it's tiny!

Getting data

  • A cloned dataset gets you access to plenty of data, but takes up only little disk space
  • Specific file contents can be retrieved on demand via datalad get:
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
  get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
  • You can also drop file content if you don't need it anymore with datalad drop:
$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
  drop(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]
  • Feature: have access to more data than your computer has disk space!
  • # eNKI dataset (1.5TB, 34k files):
    $ du -sh
      1.5G	.
    # HCP dataset (80TB, 15 million files)
    $ du -sh
      48G	.
      


    Keeping a project clean and orderly

    adapted from https://dribbble.com/shots/3090048-Front-end-vs-Back-end
    ⬆
    This is a metaphor for most projects after publication

    Keeping a project clean and orderly

    • Much of neuroscientific research is computationally intensive, with complex workflows from raw data to result, and plenty of researchers' degrees of freedom
    Poline et al., 2011

    Complex analysis ➝ chaotic projects

    "Shit, which version of which script produced these outputs from which version of what data?"
    CC-BY Scriberia and The Turing Way

    Keeping a project clean and orderly

    CC-BY Scriberia & The Turing Way
      Version control
    • keep things organized
    • keep track of changes
    • revert changes or go
      back to previous states

    Keeping a project clean and orderly

    First, let's create a new data analysis dataset with datalad create
    $ datalad create -c yoda myanalysis
    [INFO   ] Creating a new annex repo at /tmp/myanalysis
    [INFO   ] Scanning for unlocked files (this may take some time)
    [INFO   ] Running procedure cfg_yoda
    [INFO   ] == Command start (output follows) =====
    [INFO   ] == Command exit (modification check follows) =====
    create(ok): /tmp/myanalysis (dataset) 
  • -c yoda applies useful pre-structuring and configurations:
  • $ tree
    .
    ├── CHANGELOG.md
    ├── code
    │   └── README.md
    └── README.md
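The scaffold above can be sketched with plain Python to see exactly what ends up on disk. This is only an illustration of the layout, not what DataLad does internally; the real cfg_yoda procedure also applies Git/git-annex configuration.

```python
from pathlib import Path
import tempfile

# Hypothetical illustration: recreate the yoda layout shown above
# in a throwaway temporary directory.
root = Path(tempfile.mkdtemp()) / "myanalysis"
(root / "code").mkdir(parents=True)       # creates myanalysis/ and code/
for name in ["CHANGELOG.md", "README.md", "code/README.md"]:
    (root / name).touch()                 # empty placeholder files

layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
print(layout)  # ['CHANGELOG.md', 'README.md', 'code', 'code/README.md']
```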
    

    Version Control

    • DataLad knows two things: Datasets and files

  • Every file you put into a dataset can be easily version-controlled, regardless of size, with the same command: datalad save
    Version control

  • Example: Add a new file into a dataset
  • # create a data analysis script
    $ datalad status
    untracked: code/script.py (file)
    $ git status
    On branch master
    Untracked files:
      (use "git add file..." to include in what will be committed)
    	code/script.py
    
    nothing added to commit but untracked files present (use "git add" to track)
        

  • Save the dataset modification
  •  $ datalad save -m "Add a k-nearest-neighbour clustering analysis" code/script.py 

    Version controlling data allows you to track changes to data and to uniquely identify the precise versions that were used in your analysis

    Local version control

    Procedurally, version control is easy with DataLad!


      Stay flexible:

    • A simple DataLad core API (easier than Git)
    • Pure Git or git-annex commands (for regular Git or git-annex users, or to use specific functionality)

    Advice:
    • Save meaningful units of change
    • Attach helpful commit messages

    Intuitive data analysis structure

  • You can link datasets together in superdataset-subdataset hierarchies:

  • $ cd myanalysis
    # we can install analysis input data as a subdataset to the dataset
    $ datalad clone -d . https://github.com/datalad-handbook/iris_data.git input/
    [INFO   ] Scanning for unlocked files (this may take some time)
    [INFO   ] Remote origin not usable by git-annex; setting annex-ignore
    install(ok): input (dataset)
    add(ok): input (file)
    add(ok): .gitmodules (file)
    save(ok): . (dataset)
    action summary:
      add (ok: 2)
      install (ok: 1)
      save (ok: 1)
    

    Intuitive data analysis structure

  • You can link datasets together in superdataset-subdataset hierarchies:

  • $ tree
    .
    ├── CHANGELOG.md
    ├── code
    │   ├── README.md
    │   └── script.py
    └── input
        └── iris.csv

    Basic organizational principles for datasets

    Keep everything clean and modular
  • An analysis is a superdataset, its components are subdatasets, and its structure modular
  • ├── code/
      │   ├── tests/
      │   └── myscript.py
      ├── docs
      │   ├── build/
      │   └── source/
      ├── envs
      │   └── Singularity
      ├── inputs/
      │   └── data/
      │       ├── dataset1/
      │       │   └── datafile_a
      │       └── dataset2/
      │           └── datafile_a
      ├── outputs/
      │   └── important_results/
      │       └── figures/
      └── README.md
    • Do not touch/modify raw data: save any results or computations outside of input datasets
    • Keep a superdataset self-contained: Scripts reference subdatasets or files with relative paths
    Find out more about organizational principles in the YODA principles!
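The self-contained principle can be sketched in a few lines of Python: a script refers to its inputs and outputs relative to the dataset root, never via absolute paths, so the superdataset keeps working wherever it is cloned. The "myanalysis" name and file paths below follow the example in these slides.

```python
from pathlib import Path

# Sketch: express every path relative to the dataset root.
# "myanalysis" is the hypothetical superdataset from the slides.
root = Path("myanalysis")                  # dataset root (a relative path!)
iris = root / "input" / "iris.csv"         # input lives in a subdataset
report = root / "prediction_report.csv"    # results are saved outside input/

# A relative reference like this survives cloning the dataset elsewhere:
print(iris.relative_to(root).as_posix())   # input/iris.csv
```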

    Computationally reproducible data analysis


    This is a metaphor for reproducing (your own) research
    a few months after publication
    ⬇
    Write-up: handbook.datalad.org/en/latest/basics/101-130-yodaproject.html

    A classification analysis on the iris flower dataset

    Write-up: handbook.datalad.org/en/latest/basics/101-130-yodaproject.html

    Reproducible execution & provenance capture

    datalad run

    Computational reproducibility

    How can I execute the analysis script on my input data in a computationally reproducible manner?
    $ datalad run -m "analyze iris data with classification analysis" \
      --input "input/iris.csv" \
      --output "prediction_report.csv" \
      --output "pairwise_relationships.png" \
      "python3 code/script.py"
    [INFO   ] Making sure inputs are available (this may take some time)
    get(ok): input/iris.csv (file) [from web...]
    [INFO   ] == Command start (output follows) =====
    [INFO   ] == Command exit (modification check follows) =====
    add(ok): pairwise_relationships.png (file)
    add(ok): prediction_report.csv (file)
    save(ok): . (dataset)
    action summary:
      add (ok: 2)
      get (notneeded: 2, ok: 1)
      save (notneeded: 1, ok: 1)
          


    Computational reproducibility

  • A datalad run command produces a machine-readable record, identifiable via commit hash
  • $ git log
    commit df2dae9b5af184a0c463708acf8356b877c511a8 (HEAD -> master)
    Author: Adina Wagner <adina.wagner@t-online.de>
    Date:   Tue Dec 1 11:58:18 2020 +0100
    
        [DATALAD RUNCMD] analyze iris data with classification analysis
    
        === Do not change lines below ===
        {
         "chain": [],
         "cmd": "python3 code/script.py",
         "dsid": "9ffdbfcd-f4af-429a-b64a-0c81b48b7f62",
         "exit": 0,
         "extra_inputs": [],
         "inputs": [
          "input/iris.csv"
         ],
         "outputs": [
          "prediction_report.csv",
          "pairwise_relationships.png"
         ],
         "pwd": "."
        }
        ^^^ Do not change lines above ^^^
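Because the record between the sentinel lines is plain JSON, it can also be consumed programmatically. A sketch that extracts and parses it from a commit message (the message text is copied from the git log output above):

```python
import json

# The run record sits between two sentinel lines in the commit message.
message = """\
[DATALAD RUNCMD] analyze iris data with classification analysis

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python3 code/script.py",
 "dsid": "9ffdbfcd-f4af-429a-b64a-0c81b48b7f62",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [
  "input/iris.csv"
 ],
 "outputs": [
  "prediction_report.csv",
  "pairwise_relationships.png"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

start_marker = "=== Do not change lines below ==="
end_marker = "^^^ Do not change lines above ^^^"
start = message.index(start_marker) + len(start_marker)
end = message.index(end_marker)
record = json.loads(message[start:end])   # the machine-readable provenance

print(record["cmd"])      # python3 code/script.py
print(record["outputs"])  # ['prediction_report.csv', 'pairwise_relationships.png']
```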
    

    Computational reproducibility

  • A datalad run command produces a machine-readable record, identifiable via commit hash
  • $ git log
    commit df2dae9b5af184a0c463708acf8356b877c511a8 (HEAD -> master)
    Author: Adina Wagner <adina.wagner@t-online.de>
    Date:   Tue Dec 1 11:58:18 2020 +0100
    
        [DATALAD RUNCMD] analyze iris data with classification analysis
    
        [...]
    
  • You can point datalad rerun at this hash to repeat the analysis:
     $ datalad rerun df2dae9b5af1
    [INFO   ] run commit df2dae9; (analyze iris data...)
    [INFO   ] Making sure inputs are available (this may take some time)
    unlock(ok): pairwise_relationships.png (file)
    unlock(ok): prediction_report.csv (file)
    [INFO   ] == Command start (output follows) =====
    [INFO   ] == Command exit (modification check follows) =====
    add(ok): pairwise_relationships.png (file)
    add(ok): prediction_report.csv (file)
    action summary:
      add (ok: 2)
      get (notneeded: 3)
      save (notneeded: 2)
      unlock (ok: 2)
            
    Computational reproducibility

    • Code may fail (to reproduce) if run with different software
    • Datasets can store (and share) software environments (Docker or Singularity containers) and reproducibly execute code inside of the software container, capturing software as additional provenance
    • DataLad extension: datalad-container

    datalad containers-run

    Computational reproducibility

  • You can add (any number of) software containers to your dataset to link a software environment to your analysis
  • $ datalad containers-add software --url shub://adswa/resources:2
    [INFO   ] Initiating special remote datalad
    add(ok): .datalad/config (file)
    save(ok): . (dataset)
    containers_add(ok): /tmp/myanalysis/.datalad/environments/software/image (file)
    action summary:
      add (ok: 1)
      containers_add (ok: 1)
      save (ok: 1)
    
    Write-up: http://handbook.datalad.org/en/latest/basics/101-133-containersrun.html

    Computational reproducibility

  • datalad containers-run will execute the command in the specified software environment
  • $ datalad containers-run -m "rerun analysis in container" \
      --container-name software \
      --input "input/iris.csv" \
      --output "prediction_report.csv" \
      --output "pairwise_relationships.png" \
      "python3 code/script.py"
    [INFO] Making sure inputs are available (this may take some time)
    [INFO] == Command start (output follows) =====
    [INFO] == Command exit (modification check follows) =====
    unlock(ok): pairwise_relationships.png (file)
    unlock(ok): prediction_report.csv (file)
    add(ok): pairwise_relationships.png (file)
    add(ok): prediction_report.csv (file)
    save(ok): . (dataset)
    action summary:
      add (ok: 2)
      get (notneeded: 4)
      save (notneeded: 1, ok: 1)
      unlock (ok: 2)
  • ... And a datalad rerun will repeat the analysis in the specified software environment
    A quick summary of this sneak peek

    • Getting data
      • You can retrieve DataLad datasets with "datalad clone url/path"
      • A dataset allows you to retrieve data on demand via "datalad get"
      • You can drop unused data to free up disk space with "datalad drop"
    • Keeping projects clean
      • Create a dataset for data analysis using "datalad create -c yoda mydatasetname"
      • In this dataset, DataLad can version control data of any size with "datalad save"
      • You can link individual datasets as reusable and intuitive modular components, for example your input data to your analysis, with "datalad clone -d . url"
    • Computational reproducibility
      • "datalad run" can create a digital, machine-readable, and re-executable record of how you did your data analysis
      • You or others can redo the analysis automatically with "datalad rerun"
      • You can even link software environments to your analysis with the "datalad-container" extension, and run analyses with "datalad containers-run"

    Is there more?

    Resources and Further Reading

    Comprehensive user documentation in the
    DataLad Handbook (handbook.datalad.org)
    • High-level function/command overviews,
      Installation, Configuration, Cheatsheet
    • Narrative-based code-along course
    • Independent of background/skill level,
      suitable for data management novices
    • Step-by-step solutions to common
      data management problems, like
      how to make a reproducible paper