DataLad - an Introduction

Dr. Adina Wagner
mas.to/@adswa

Institute of Neuroscience and Medicine,
Brain & Behavior (INM-7)
Research Center Jülich



Slides: DOI 10.5281/zenodo.13806403
files.inm7.de/adina/talks/html/andani.html

DataLad

(datalad.org)


  • Domain-agnostic command-line tool (+ graphical user interface), built on top of Git & git-annex
  • Major features:
    • Version-controlling arbitrarily large content:
      version control data & software alongside code!
    • Transport mechanisms for sharing & obtaining data:
      consume & collaborate on data (analyses) like software
    • (Computationally) reproducible data analysis:
      track and share provenance of all digital objects
    • (... and much more)

Let's try DataLad

For convenience, we work online today:
datalad-hub.inm7.de
username:
The spice or herb you got as a user name: iffmd.fz-juelich.de/s/lxzHBdTia#
password:
Set at first login, at least 8 characters


On your own machine, DataLad is available via pip, conda, apt, or brew, on all major operating systems: see handbook.datalad.org/r.html?install
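For example (a minimal sketch; pick the installation channel that matches your setup):

    # install via pip
    pip install datalad

    # or via conda (conda-forge channel)
    conda install -c conda-forge datalad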

Acknowledgements

DataLad software
& ecosystem
  • Psychoinformatics Lab,
    Research Center Jülich
  • Center for Open
    Neuroscience,
    Dartmouth College
  • Joey Hess (git-annex)
  • >100 additional contributors
Funders
Collaborators

DataLad use cases

  • Publish or consume datasets via GitHub, GitLab, OSF, the European Open Science Cloud, or similar services
  • Behind-the-scenes infrastructure component for data transport and versioning (e.g., used by OpenNeuro, brainlife.io, the Canadian Open Neuroscience Platform (CONP), CBRAIN)
  • Central data management and archival system
  • Decentralized data and metadata catalog
  • Creating and sharing reproducible, open science: sharing data, software, code, and provenance
  • [Screen recordings: cloning the studyforrest dataset from GitHub; browsing OpenNeuro; cloning the REMODNAV paper dataset from GitHub]


    Let's try DataLad

    datalad-hub.inm7.de
    username:
    The spice or herb you got as a user name: iffmd.fz-juelich.de/s/lxzHBdTia#
    password:
    Set at first login, at least 8 characters

    Git identity setup

    Check your Git identity:

        git config --get user.name
        git config --get user.email

    Configure your Git identity:

        git config --global user.name "Adina Wagner"
        git config --global user.email "adina.wagner@t-online.de"

    Configure DataLad to use the latest features:

        git config --global --add datalad.extensions.load next

    Using DataLad in a terminal

    Check the installed version:

        datalad --version

    For help on using DataLad from the command line:

        datalad --help

    (The help may be displayed in a pager; exit it by pressing "q".)

    For extensive information about the installed package, its dependencies, and extensions, use datalad wtf. Let's find out what kind of system we're on:

        datalad wtf -S system
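    The -S flag selects a single section of the report; other sections work the same way (the available section names are listed in datalad wtf --help):

        # for example, report only the configuration section
        datalad wtf -S configuration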

    Using DataLad via its Python API

    Open a Python environment:

        ipython

    Import and start using:

        import datalad.api as dl
        dl.create(path='mydataset')

    Exit the Python environment:

        exit
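    Most commands are also available as methods of a Dataset object; a minimal sketch continuing from the create call above (same path):

        import datalad.api as dl

        ds = dl.Dataset('mydataset')  # the dataset created above
        ds.status()                   # same operation as 'datalad status' in that directory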

    DataLad datasets...

    ...DataLad datasets

    Create a dataset (here, with the yoda configuration, which adds a helpful structure and configuration for data analyses):

        datalad create -c yoda my-analysis

    Let's have a look inside. Navigate using cd (change directory):

        cd my-analysis

    List the directory content, including hidden files, with ls:

        ls -la .

    Version control...

    ...Version control

    The yoda configuration added a README placeholder to the dataset. Let's add Markdown text (a project title) to it:

        echo "# My example DataLad dataset" > README.md

    Now we can check the status of the dataset:

        datalad status

    We can save this state with save:

        datalad save -m "Add project title into the README"

    Further modifications:

        echo "Contains a small data analysis for my project" >> README.md

    You can also check out what has changed:

        git diff

    Save again:

        datalad save -m "Add information on the dataset contents to the README"

    ...Version control

    Now, let's check the dataset history:

        git log

    We can also make the history prettier:

        tig

    (Navigate with the arrow keys and Enter; press "q" to go back and exit the program.)

    Convenience functions make downloads easier. Let's add code for a data analysis from an external source:

        datalad download-url -m "Add an analysis script" \
          -O code/classification_analysis.py \
          https://raw.githubusercontent.com/datalad-handbook/resources/master/classification_analysis.py

    Check out the file's history:

        git log code/classification_analysis.py

    Local version control

    Procedurally, version control is easy with DataLad!


    Advice:
    • Save meaningful units of change, e.g., by limiting a save to the paths that belong together (see the sketch below)
    • Attach helpful commit messages
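    A minimal sketch: giving save a path records only that file's modifications, leaving any other changes in the dataset unsaved for a later, separate commit (file name taken from the steps above):

        # record only the README change as one meaningful unit
        datalad save -m "Extend the project description" README.md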

    Computationally reproducible execution I...

    • which script/pipeline version
    • was run on which version of the data
    • to produce which version of the results?

    ... Computationally reproducible execution I

    A variety of processes can modify files. A simple example: code formatting.

        black code/classification_analysis.py

    Version control makes changes transparent:

        git diff

    But it's useful to keep track beyond that. Let's discard the latest changes...

        git restore code/classification_analysis.py

    ...and record precisely what we did:

        datalad run -m "Reformat code with black" \
          "black code/classification_analysis.py"

    Let's take a look:

        git show

    ...and repeat!

        datalad rerun
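    Without arguments, rerun re-executes the run record of the most recent commit; it also accepts any commit-ish to target older run records. A sketch (the reference below is illustrative):

        # re-execute the run record captured two commits back
        datalad rerun HEAD~2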

    Data consumption & transport...

    ...Data consumption & transport...

    You can install a dataset from a remote URL (or local path) using clone. Either as a stand-alone entity:

        # just an example:
        datalad clone \
          https://github.com/psychoinformatics-de/studyforrest-data-phase2.git

    Or as a linked dataset, nested in another dataset in a superdataset-subdataset hierarchy:

        # just an example:
        datalad clone -d . \
          https://github.com/psychoinformatics-de/studyforrest-data-phase2.git

    • Helps with scaling (see, e.g., the Human Connectome Project dataset)
    • Version control tools struggle with >100k files
    • Modular units improve intuitive structure and reuse potential
    • Versioned linkage of inputs aids reproducibility

    ...Dataset nesting

    Let's make a nest!
    Clone a dataset with analysis data into a specific location ("input/") in the existing dataset, making it a subdataset:

        datalad clone --dataset . \
          https://github.com/datalad-handbook/iris_data.git \
          input/

    Let's see what changed in the dataset, using the subdatasets command:

        datalad subdatasets

    ...and also git show:

        git show

    We can now view the cloned dataset's file tree:

        cd input
        ls

    ...and also its history:

        tig

    Let's check the dataset size on disk (with the du disk-usage command):

        du -sh

    Let's check the actual dataset size:

        datalad status --annex

    Let's try to print the file contents into the terminal (cat):

        cat iris.csv
            

    ...Data consumption & transport

    We can retrieve actual file content with get:

        datalad get iris.csv

    If we don't need a file locally anymore, we can drop its content:

        datalad drop iris.csv

    There is no need to store all files locally, or to archive results together with gigabytes or terabytes of source data:

        import datalad.api as dl

        dl.get('input/sub-01')
        # [really complex analysis]
        dl.drop('input/sub-01')

    If data is published anywhere, your data analysis can carry an actionable link to it, with barely any space requirements.

    Git versus Git-annex

    Data in datasets is stored either in Git or in git-annex.
    By default, everything is annexed, i.e., stored in a dataset annex by git-annex.

  • With annexed data, only content identity (hash) and location information are put into Git, rather than file content. The annex, and transport to and from it, is managed with git-annex.
    Git versus Git-annex

    Configurations (e.g., YODA), custom rules, or command parametrization determine whether a file is annexed.
    Storing files in Git or git-annex has distinct advantages:


    Git:
    • handles small files well (text, code)
    • file contents are part of the Git history and will be shared upon git/datalad push
    • contents are shared with every dataset clone
    • useful for: small, non-binary, frequently modified, need-to-be-accessible files (e.g., DUA, README)

    git-annex:
    • handles all types and sizes of files well
    • file contents are in the annex, and not necessarily shared
    • contents can be kept private on a per-file level when sharing the dataset
    • useful for: large files, private files


    YODA configures the contents of the code/ directory and the dataset descriptions (e.g., README files) to be in Git. There are many other configurations, and you can also write your own.
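    Under the hood, such rules are git-annex "largefiles" settings in .gitattributes. An illustrative example (not necessarily the exact rule the yoda procedure writes) that keeps everything under code/ in Git instead of the annex:

        # illustrative only: tell git-annex to never annex files under code/
        echo 'code/** annex.largefiles=nothing' >> .gitattributes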

    ...Computationally reproducible execution...

    Try to execute the downloaded analysis script. Does it work?

        cd ..
        python code/classification_analysis.py

    • Software can be difficult or impossible to install (e.g., due to conflicts with existing software, or on HPC systems) for you or your collaborators
    • Different software versions/operating systems can produce different results: Glatard et al., doi.org/10.3389/fninf.2015.00012
    • Software containers encapsulate a software environment and isolate it from the surrounding operating system. Two common solutions: Docker, Singularity

    ...Computationally reproducible execution...

    • datalad run executes any command in a way that links the command or script to the results it produces and the data it was computed from
    • datalad rerun can take this recorded provenance and recompute the command
    • datalad containers-run (from the extension "datalad-container") captures software provenance in the form of software containers, in addition to the provenance that datalad run captures


    ...Computationally reproducible execution

    With the datalad-container extension, we can add software containers to datasets and work with them. Let's add a software container with the Python software needed to run the script:

        datalad containers-add python-env --url shub://adswa/resources:2

    Inspect the list of registered containers:

        datalad containers-list

    Now, let's try out the containers-run command:

        datalad containers-run -m "run classification analysis in python environment" \
          --container-name python-env \
          --input "input/iris.csv" \
          --output "pairwise_relationships.png" \
          --output "prediction_report.csv" \
          "python3 code/classification_analysis.py {inputs} {outputs}"

    What changed after the containers-run command completed?
    We can use datalad diff (based on git diff):

        datalad diff -f HEAD~1

    We see that some files were added to the dataset!
    And we have a complete provenance record as part of the Git history:

        git log -n 1

    Publishing datasets...


    We will use GIN (gin.g-node.org):

    Publishing datasets...

    • Create a GIN user account and log in: gin.g-node.org/user/sign_up
    • Create an SSH key:

        ssh-keygen -t ed25519 -C "your-email"
        eval "$(ssh-agent -s)"
        ssh-add ~/.ssh/id_ed25519

    • Upload the SSH key to GIN:

        cat ~/.ssh/id_ed25519.pub

    • Publish your dataset!

    ...Publishing datasets

    DataLad has convenience functions to create sibling repositories on various infrastructure and third-party services (GitHub, GitLab, OSF, WebDAV-based services, DataVerse, ...), to which data can then be published with push.

        datalad create-sibling-gin example-analysis --access-protocol ssh

    You can verify the dataset's siblings with the siblings command:

        datalad siblings

    And we can push our complete dataset (Git repository and annex) to GIN:

        datalad push --to gin
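    The same workflow applies to other hosting services. As a sketch, publishing to GitHub instead (the repository name is illustrative; the command requires valid GitHub credentials and names the sibling "github" by default):

        datalad create-sibling-github my-analysis
        datalad push --to github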

    Using published data...

    Let's see how the analysis feels to others:

        cd ../
        datalad clone \
          https://gin.g-node.org/adswa/example-analysis \
          myclone

        cd myclone

    Get results:

        datalad get prediction_report.csv

    ...and drop them again:

        datalad drop prediction_report.csv

    Or recompute results:

        datalad rerun

    The YODA principles

    DataLad Datasets for data analysis

    • A DataLad dataset can have any structure, and use as many or as few features of a dataset as required.
    • However, for data analyses it is beneficial to make use of DataLad features and to structure datasets according to the YODA principles:

    P1: One thing, one dataset
    P2: Record where you got it from, and where it is now
    P3: Record what you did to it, and with what

    Find out more about the YODA principles in the handbook, and more about structuring datasets at psychoinformatics-de.github.io/rdm-course/02-structuring-data
    P1: One thing, one dataset

    • Create modular datasets: Whenever a particular collection of files could be useful in more than one context (e.g., data), put them in their own dataset and install it as a subdataset.
    • Keep everything structured: Bundle all components of one analysis into one superdataset, and within this dataset, separate code, data, output, and execution environments.
    • Keep a dataset self-contained, with relative paths in scripts to subdatasets or files. Do not use absolute paths.

    Why Modularity?

    1. Reuse and access management
    2. Scalability
    3. Transparency

    • Original:
      
      /dataset
      ├── sample1
      │   └── a001.dat
      ├── sample2
      │   └── a001.dat
      ...
      
      Without modularity, after an applied transform (preprocessing, analysis, ...):
      
      /dataset
      ├── sample1
      │   ├── ps34t.dat
      │   └── a001.dat
      ├── sample2
      │   ├── ps34t.dat
      │   └── a001.dat
      ...
      
      Without expert/domain knowledge, no distinction between original and derived data is possible.

    Why Modularity?

    3. Transparency

    • Original:
      
      /raw_dataset
      ├── sample1
      │   └── a001.dat
      ├── sample2
      │   └── a001.dat
      ...
      
      With modularity, after an applied transform (preprocessing, analysis, ...):
      
      /derived_dataset
      ├── sample1
      │   └── ps34t.dat
      ├── sample2
      │   └── ps34t.dat
      ├── ...
      └── inputs
          └── raw
              ├── sample1
              │   └── a001.dat
              ├── sample2
              │   └── a001.dat
              ...
      
      Clearer separation of semantics, through the use of a pristine version of the original dataset within a new, additional dataset that holds the outputs.
    P2: Record where you got it from, and where it is now

    • Link individual datasets to declare data dependencies (e.g., as subdatasets).
    • Record data's origin with appropriate commands, for example to record access URLs for individual files obtained from (unstructured) sources "in the cloud".
    • Share and publish datasets for collaboration or back-up.

    Dataset linkage

    $ datalad clone --dataset . http://example.com/ds inputs/rawdata
    
    $ git diff HEAD~1
    diff --git a/.gitmodules b/.gitmodules
    new file mode 100644
    index 0000000..c3370ba
    --- /dev/null
    +++ b/.gitmodules
    @@ -0,0 +1,3 @@
    +[submodule "inputs/rawdata"]
    +       path = inputs/rawdata
    +       url = http://example.com/importantds
    diff --git a/inputs/rawdata b/inputs/rawdata
    new file mode 160000
    index 0000000..fabf852
    --- /dev/null
    +++ b/inputs/rawdata
    @@ -0,0 +1 @@
    +Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
    
    Each (sub)dataset is a separate, but jointly version-controlled entity. If none of its data is retrieved, a subdataset is an extremely lightweight and yet actionable data dependency (datalad get retrieves contents on demand).
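    A sketch of acting on such a dependency (the subdataset path follows the example above; the file path is illustrative):

        # install the subdataset itself, without retrieving any file content
        datalad get -n inputs/rawdata

        # later, fetch individual file content on demand
        datalad get inputs/rawdata/sample1/a001.dat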
    P3: Record what you did to it, and with what

    • Collect and store provenance of all contents of a dataset that you create.
    • "Which script produced which output?", "From which data?", "In which software environment?" ...
    • Record it in an ideally machine-readable way with datalad (containers-)run.

    Take home messages

    What does DataLad add to Git and git-annex?
    • A simple(r) core API to unify Git and git-annex functionality
    • The ability to record provenance
    • Support for software container solutions (Singularity, Docker)
    • Subdatasets and linkage, with a mono-repo-like user experience
    • Interoperability adapters to publish to a variety of hosting services
    • Open data distribution: to date, more than 600 TB of open neuro data are available via datasets.datalad.org

    Thank you for your attention!



    Slides: DOI 10.5281/zenodo.10118794



    Women are underrepresented in neuroscience. You can use the Repository for
    Women in Neuroscience to find and recommend neuroscientists for conferences,
    symposia, or collaborations, and help make neuroscience more open & diverse.

    Command summaries

    Summary - Local version control

    datalad create creates an empty dataset.
    Configurations (-c yoda, -c text2git) add useful structure and/or configurations.

    A dataset has a history to track files and their modifications.
    Explore it with Git (git log) or external tools (e.g., tig).

    datalad save records the dataset or file state to the history.
    Concise commit messages should summarize the change for future you and others.

    datalad download-url obtains web content and records its origin.
    It even takes care of saving the change.

    datalad status reports the current state of the dataset.
    A clean dataset status (no modifications, no untracked files) is good practice.

    Summary - Dataset consumption & nesting

      datalad clone installs a dataset.
      It can be installed “on its own”: Specify the source (url, path, ...) of the dataset, and an optional path for it to be installed to.

      Datasets can be installed as subdatasets within an existing dataset.
      The --dataset/-d option needs a path to the root of the superdataset.

      Only small files and metadata about file availability are present locally after an install.
      To retrieve actual file content of annexed files, datalad get downloads file content on demand.

      Datasets preserve their history.
      The superdataset records only the version state of the subdataset.

    Summary - Reproducible execution

      datalad run records a command and its impact on the dataset.
      All dataset modifications are saved - use it in a clean dataset.

      Data/directories specified as --input are retrieved prior to command execution.
      Use one flag per input.

      Data/directories specified as --output will be unlocked for modification prior to a rerun of the command.
      Outputs are optional to specify, but helpful for recomputations.

      datalad containers-run can be used to capture the software environment as provenance.
      It ensures computations are run in the desired software setup. Supports Docker and Singularity containers.

      datalad rerun can automatically re-execute run-records later.
      They can be identified with any commit-ish (hash, tag, range, ...)