
Decentralized Management of Digital Objects for Open Science

Adina Wagner

Psychoinformatics lab,
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7)
Research Center Jülich

Slides: DOI 10.5281/zenodo.10556597 (Scan the QR code)


DataLad software
& ecosystem
  • Psychoinformatics Lab,
    Research Center Jülich
  • Center for Open Neuroscience,
    Dartmouth College
  • Joey Hess (git-annex)
  • >100 additional contributors
Goal: improve scientific workflows, coming from the perspective of software distribution and development

"Share and treat data like software"

DataLad Datasets

A DataLad dataset is a joint Git + git-annex repository

What makes scientific workflows special?

Scientific building blocks are not static.
The building blocks of a scientific result are rarely static
Analysis code, manuscripts, ... evolve
(Rewrite, fix bugs, add functions, refactor, extend, ...)
Based on Piled Higher and Deeper 1531
Scriberia and The Turing Way (CC-BY)

Version control

CC-BY Scriberia & The Turing Way
  • keep things organized
  • keep track of changes
  • revert changes or go back to previous states
  • collect and share digital provenance
  • industry standard: Git
The building blocks of a scientific result are rarely static
Data changes, too
(errors are fixed, data is extended,
naming standards change, an analysis
requires only a subset of your data...)
Piled Higher and Deeper 1323

Sadly, Git does not handle large files well.

Version control beyond text files

Using git-annex, DataLad version controls large data

Version control beyond text files

  • Datasets can have an optional annex for tracking (large) files without placing their content into Git
  • For annex'ed files, identity (hash) and location information is put into Git, rather than their content:
    • Where the filesystem allows it, annexed files are symlinks:
$ ls -l sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
lrwxrwxrwx 1 adina adina 142 Jul 22 19:45 sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz ->
(PS: especially useful in datasets with many identical files)
    • The symlink reveals this internal data organization based on identity hash:
$ md5sum sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
aeb0e5f2e2d5fe4ade97117a8cc5232f  sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
    • The (tiny) symlink instead of the (potentially large) file content is committed - version controlling precise file identity without checking contents into Git
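
The idea of tracking identity instead of content can be illustrated with a minimal Python sketch. It derives a content-based key loosely modeled on the naming visible in git-annex's MD5E backend (file size, MD5 checksum, file extension). The function name `content_key` and the extension handling are assumptions for illustration; this is not git-annex's implementation.

```python
import hashlib
from pathlib import Path


def content_key(path):
    """Return an identity key of the form MD5E-s<size>--<md5><suffixes>.

    Illustrative only: git-annex applies its own rules for which
    extensions are kept; this sketch simply keeps all suffixes.
    """
    path = Path(path)
    digest = hashlib.md5()
    size = 0
    with path.open("rb") as f:
        # Read in chunks so large files never need to fit into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return f"MD5E-s{size}--{digest.hexdigest()}{''.join(path.suffixes)}"
```

Two files with identical content and extension yield the same key, which is why identical files need to be stored only once.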

    • File availability information is stored to record a decentral network of file content. A file can exist in multiple different locations.
    • $ git annex whereis sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
      whereis sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (2 copies)
        	8c3680dd-6165-4749-adaa-c742232bc317 -- git@8242caf9acd8:/data/repos/adswa/bidsdata.git [gin]
         	fff8fdbc-3185-4b78-bd12-718717588442 -- adina@muninn:~/bids-data [here]
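
A toy model of such availability tracking, as a hedged sketch: each file key maps to the set of locations known to hold its content, and a copy may only be dropped while enough other copies remain. The class `AvailabilityLog` and its methods are hypothetical names; the refusal to drop the last copy is loosely inspired by git-annex's behavior, not a reimplementation of it.

```python
class AvailabilityLog:
    """Toy model of per-file availability tracking across locations."""

    def __init__(self, min_copies=1):
        self.min_copies = min_copies
        self._locations = {}  # key -> set of location names

    def record(self, key, location):
        """Note that `location` holds the content identified by `key`."""
        self._locations.setdefault(key, set()).add(location)

    def whereis(self, key):
        """Return all known locations of `key`, sorted by name."""
        return sorted(self._locations.get(key, set()))

    def drop(self, key, location):
        """Remove a copy only if enough other copies remain."""
        others = self._locations.get(key, set()) - {location}
        if len(others) < self.min_copies:
            raise RuntimeError(f"refusing to drop {key}: not enough other copies")
        self._locations[key] = others


log = AvailabilityLog()
log.record("bold.nii.gz", "gin")
log.record("bold.nii.gz", "here")
print(log.whereis("bold.nii.gz"))
```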

Delineation and advantages of decentral versus central RDM: Hanke et al., (2021). In defense of decentralized research data management

Version Control

  • DataLad knows two things: Datasets and files

  • Every file you put into a dataset can be easily version-controlled, regardless of size, with the same command: datalad save
  • Version control

    • Example: Add a new file into a dataset
    • # create a data analysis script
    • Save the dataset modification...
      • ... with DataLad
      • $ datalad save \
            -m "Add a k-nearest-neighbour clustering analysis" \
      • ... versus with Git
      • $ git add code/
        $ git commit -m "Add a k-nearest-neighbour clustering analysis"
      • ... versus with git-annex
      • $ git annex add code/
        $ git commit -m "Add a k-nearest-neighbour clustering analysis"

    Local version control

    Procedurally, version control is easy with DataLad!

      Stay flexible:

    • Non-complex DataLad core API (easier than Git)
    • Pure Git or git-annex commands (for regular Git or git-annex users, or to use specific functionality)

    • Save meaningful units of change
    • Attach helpful commit messages

    Git versus git-annex

    Data in datasets is either stored in Git or git-annex
    By default, everything is annexed, i.e., stored in a dataset annex

    Git:
    • handles small files well (text, code)
    • file contents are in the Git history and will be shared upon git/datalad push
    • shared with every dataset clone
    • useful for small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files

    git-annex:
    • handles all types and sizes of files well
    • file contents are in the annex and are not necessarily shared
    • can be kept private on a per-file level when sharing the dataset
    • useful for large files and private files
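
The division of labor between Git and git-annex can be sketched as a simple decision rule. The suffix list, the file names, and the size threshold below are assumed values chosen for illustration; they are not DataLad defaults. In practice such rules are configured through git-annex (for example via its `annex.largefiles` setting) rather than written in Python.

```python
from pathlib import Path

# Assumed, illustrative values; not DataLad defaults.
TEXT_SUFFIXES = {".txt", ".md", ".py", ".sh", ".json", ".tsv", ".csv"}
ALWAYS_IN_GIT = {"README", "LICENSE", "DUA"}
MAX_GIT_BYTES = 100_000


def storage_for(path, size_bytes):
    """Return "git" or "annex" for a file, following the rules of thumb above."""
    p = Path(path)
    # Files everyone needs to read stay in Git, whatever their size.
    if p.stem in ALWAYS_IN_GIT or p.name in ALWAYS_IN_GIT:
        return "git"
    # Small text files are what Git handles well.
    if p.suffix.lower() in TEXT_SUFFIXES and size_bytes <= MAX_GIT_BYTES:
        return "git"
    # Everything else (large or binary) goes to the annex.
    return "annex"


print(storage_for("code/analysis.py", 5_000))
print(storage_for("sub-02/func/bold.nii.gz", 24_000_000))
```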

    What makes scientific workflows special?

    Scientific building blocks are not static.
    Version control beyond text
    Science is built from modular units.

    Git submodules

    • Built-in Git feature: Add a repository to another repository, treating them as separate projects (e.g., use third party project, but keep commits separate)
    Make a project with a submodule:
    $ git init myproject
    Initialized empty Git repository in /tmp/myproject/.git/
    $ cd myproject
    $ git submodule add \
    Cloning into '/tmp/myproject/multimatch_gaze'...
    $ git commit -am 'Add multimatch module'
    [main fb9093c] Add multimatch module
     2 files changed, 4 insertions(+)
     create mode 100644 .gitmodules
     create mode 160000 multimatch_gaze
    Get a repository with a submodule:
    $ git clone
    Cloning into 'myproject'...
    $ cd myproject
    $ git submodule init
    Submodule 'multimatch_gaze' (
    registered for path 'multimatch_gaze'

    Dataset Nesting

    • Seamless nesting mechanisms:
      • hierarchies of datasets in super-/sub-dataset relationships
      • based on Git submodules, but more seamless: Mono-repo feel thanks to recursive operations
    • Overcomes scaling issues with large amounts of files
    • adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
      15530572 annex'd files (77.9 TB recorded total size)
      nothing to save, working tree clean
    • Modularizes research components for transparency, reuse, and access management

    Keeping a project clean and orderly

    CC-BY Scriberia & The Turing Way
      Version control
    • keep things organized
    • keep track of changes
    • revert changes or go back to previous states
      Intuitive structure
    • Keep projects lean
    • Link project dependencies easily
    • Follow the YODA principles

    Keeping a project clean and orderly

    First, let's create a new data analysis dataset with datalad create
    $ datalad create -c yoda myanalysis
    [INFO   ] Creating a new annex repo at /tmp/myanalysis
    [INFO   ] Scanning for unlocked files (this may take some time)
    [INFO   ] Running procedure cfg_yoda
    [INFO   ] == Command start (output follows) =====
    [INFO   ] == Command exit (modification check follows) =====
    create(ok): /tmp/myanalysis (dataset) 
  • -c yoda applies useful pre-structuring and configurations:
  • $ tree
    ├── code
    │   └──

    Intuitive data analysis structure

  • You can link datasets together in superdataset-subdataset hierarchies:

  • $ cd myanalysis
    # we can install analysis input data as a subdataset to the dataset
    $ datalad clone -d . input/
    [INFO   ] Scanning for unlocked files (this may take some time)
    [INFO   ] Remote origin not usable by git-annex; setting annex-ignore
    install(ok): input (dataset)
    add(ok): input (file)
    add(ok): .gitmodules (file)
    save(ok): . (dataset)
    action summary:
      add (ok: 2)
      install (ok: 1)
      save (ok: 1)

    Intuitive data analysis structure

  • You can link datasets together in superdataset-subdataset hierarchies:

  • $ tree
    ├── code
    │   ├──
    │   └──
    └── input
        └── iris.csv

    Seamless dataset nesting & linkage

    $ datalad clone --dataset . input/
    $ git diff HEAD~1
    diff --git a/.gitmodules b/.gitmodules
    new file mode 100644
    index 0000000..c3370ba
    --- /dev/null
    +++ b/.gitmodules
    @@ -0,0 +1,3 @@
    +[submodule "input"]
    +       path = input
    +       datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
    +       datalad-url =
    diff --git a/input b/input
    new file mode 160000
    index 0000000..fabf852
    --- /dev/null
    +++ b/input
    @@ -0,0 +1 @@
    +Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
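
The diff shows that a superdataset registers a subdataset as a small text entry plus a commit reference. As a hedged sketch, the registration can be read back with a few lines of Python. `parse_gitmodules` is a hypothetical helper written for this example, and it only handles the simple layout shown in the diff.

```python
import re

SECTION = re.compile(r'^\[submodule "(?P<name>.+)"\]$')


def parse_gitmodules(text):
    """Return {submodule name: {key: value}} from .gitmodules-style text."""
    modules = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        match = SECTION.match(line)
        if match:
            current = modules.setdefault(match.group("name"), {})
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return modules


example = '''[submodule "input"]
	path = input
	datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
'''
for name, settings in parse_gitmodules(example).items():
    print(name, "->", settings["path"])
```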

    What makes scientific workflows special?

    Scientific building blocks are not static.
    Version control beyond text
    Science is built from modular units.
    Science is exploratory, iterative, multi-stepped, and complex.

    Reusing past work isn't necessarily simple

    Your past self is the worst collaborator:

    Full comic at

    Leaving a trace

    "Shit, which version of which script produced these outputs from which version of what data?"

    "Shit, which buttons did I click and in which order did I use all those tools?"

    CC-BY Scriberia and The Turing Way

    Leaving a trace

    datalad run wraps around anything expressed in a command line call and saves the dataset modifications resulting from the execution.

    datalad rerun repeats captured executions. If the outcomes differ, it saves a new state of them.

    datalad containers-run executes command line calls inside a tracked software container and saves the dataset modifications resulting from the execution.

    data analysis provenance

    Enshrine the analysis in a script


    $ datalad containers-run \
      --message "Time series extraction from Locus Coeruleus" \
      --container-name nilearn \
      --input 'mri/*_bold.nii' \
      --output 'sub-*/LC_timeseries_run-*.csv' \
      "python3 code/"
    -- Git commit --
        commit 5a7565a640ff6de67e07292a26bf272f1ee4b00e
        Author:     Adina Wagner
        AuthorDate: Mon Nov 11 16:15:08 2019 +0100
        Commit:     Adina Wagner
        CommitDate: Mon Nov 11 16:15:08 2019 +0100
        [DATALAD RUNCMD] Time series extraction from Locus Coeruleus
        === Do not change lines below ===
         "cmd": "singularity exec --bind {pwd} .datalad/environments/nilearn.simg bash..",
         "dsid": "92ea1faa-632a-11e8-af29-a0369f7c647e",
         "inputs": [
         "outputs": ["sub-*/LC_timeseries_run-*.csv"],
        ^^^ Do not change lines above ^^^
     sub-01/LC_timeseries_run-1.csv | 1 +
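
Because the record sits between two fixed marker lines, it can be read back by a program. The following is a minimal sketch, assuming the lines between the markers form a JSON object as the excerpt suggests. `extract_run_record` is a hypothetical helper, and the sample message (including the script name `code/my_script.py`) is made up for illustration.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message):
    """Return the JSON run record embedded between the marker lines."""
    lines = [line.strip() for line in commit_message.splitlines()]
    try:
        start = lines.index(START)
        end = lines.index(END, start + 1)
    except ValueError:
        return None  # no markers: not a run commit
    return json.loads("\n".join(lines[start + 1:end]))


# Made-up sample message for illustration.
message = """[DATALAD RUNCMD] Time series extraction from Locus Coeruleus

=== Do not change lines below ===
{
 "cmd": "python3 code/my_script.py",
 "inputs": ["mri/*_bold.nii"],
 "outputs": ["sub-*/LC_timeseries_run-*.csv"]
}
^^^ Do not change lines above ^^^
"""
record = extract_run_record(message)
print(record["cmd"], record["outputs"])
```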

    data analysis provenance

    Result: a machine-readable record of which data, code, and
    software produced a result, and how, when, and why.


    data analysis provenance

    ... to have a machine recompute and verify past work

    $ datalad rerun 5a7565a640ff6de67
    [INFO   ] run commit 5a7565a640ff6de67; (Time series extraction from Locus Coeruleus)
    [INFO   ] Making sure inputs are available (this may take some time)
    get(ok): mri/sub-01_bold.nii (file)
    get(ok): mri/sub-02_bold.nii (file)
    [INFO   ] == Command start (output follows) =====
    [INFO   ] == Command exit (modification check follows) =====
    add(ok): sub-01/LC_timeseries_run-*.csv (file)
    add(ok): sub-02/LC_timeseries_run-*.csv (file)
    action summary:
      add (ok: 30)
      get (ok: 30)
      save (ok: 2)
      unlock (ok: 30)
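
What "verify past work" means can be sketched independently of DataLad: checksum the outputs before and after the recomputation and report every file that differs. DataLad itself relies on Git and git-annex to detect changes; the helpers `checksums` and `changed_outputs` below are hypothetical and only illustrate the idea.

```python
import hashlib
from pathlib import Path


def checksums(paths):
    """Map each path to the SHA-256 digest of its content."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_outputs(before, after):
    """Return outputs whose checksum differs, or that appeared or vanished."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))
```

An empty result means the recomputation reproduced the recorded outputs bit for bit.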

    Lack of provenance can be devastating

    • Data analyses typically start with data wrangling:
      • Move/Copy/Rename/Reorganize/... data
    • Mistakes propagate through the complete analysis pipeline - early mistakes are especially hard to find!
    CC-BY Scriberia and The Turing Way

    Example: "Let me just copy those files..."

    • Researcher builds an analysis dataset and moves events.tsv files (different per subject) to the directory with functional MRI data
    • import shutil
      from glob import glob
      from os import path
      from pathlib import Path
      for sourcefile, dest in zip(glob(path_to_events),          # note: not sorted!
                                  glob(path_to_fMRI_subjects)):  # note: not sorted!
          destination = path.join(dest, Path(sourcefile).name)
          shutil.move(sourcefile, destination)
    eventfiles/                            analysis/
    ├── sub-01                             ├── sub-01
    │   ├── events.tsv                     │   ├── bold.nii.gz
    ├── sub-02                             │   └── events.tsv  # from subject 8
    │   ├── events.tsv                     ├── sub-02
    ├── sub-03                 --->        │   ├── bold.nii.gz
    │   ├── events.tsv                     │   └── events.tsv  # from subject 42
    ├── sub-04                             ├── sub-03
    │   ├── events.tsv                     │   ├── bold.nii.gz
    [...]                                  │   └── events.tsv  # from subject 21
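
One way to guard against this particular mistake is to pair source and destination by subject ID instead of by list position. The sketch below assumes the directory layout from the diagram (`<root>/sub-XX/events.tsv`); `move_event_files` is a hypothetical helper, not part of DataLad.

```python
import shutil
from pathlib import Path


def move_event_files(events_root, analysis_root):
    """Move <events_root>/sub-XX/events.tsv to <analysis_root>/sub-XX/events.tsv.

    Source and destination are paired by subject directory name,
    never by list position. Returns the list of moved subject IDs.
    """
    events_root, analysis_root = Path(events_root), Path(analysis_root)
    moved = []
    for source in sorted(events_root.glob("sub-*/events.tsv")):
        subject = source.parent.name
        dest_dir = analysis_root / subject
        if not dest_dir.is_dir():
            # Fail loudly instead of guessing a destination.
            raise FileNotFoundError(f"no analysis directory for {subject}")
        shutil.move(str(source), str(dest_dir / source.name))
        moved.append(subject)
    return moved
```

Even with such a guard, recording the operation (as on the following slide) keeps the step inspectable later.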

    Researcher shares analysis with others

    "I would never make such a mistake, I'm way more
    • organized
    • knowledgeable
    • experienced

    Everyone makes mistakes - the earlier we find them or guard against them, the better for science!

    Leave a trace!

    $ datalad run -m "Copy event files" \
      'for sub in eventfiles/sub-*; do
           mv "$sub/events.tsv" "analysis/$(basename "$sub")/events.tsv";
       done'
    $ datalad copy-file ../eventfiles/sub-01/events.tsv sub-01/ -d .
    copy_file(ok): /data/project/coolstudy/eventfiles/events.tsv [/data/project/coolstudy/analysis/sub-01/events.tsv]
    save(ok): /data/project/coolstudy/analysis (dataset)
    action summary:
      copy_file (ok: 1)
      save (ok: 1)

    Research data management is tied to reproducibility

    Based on (CC-BY) Reproducibility Management in Neuroscience - Specific Issues and Solutions (DOI 10.5281/zenodo.4285927)

    What makes scientific workflows special?

    Scientific building blocks are not static.
    Version control beyond text
    Science is built from modular units.
    Science is exploratory, iterative, multi-stepped, and complex.
    Science is collaborative.


    • Scientific workflows can be idiosyncratic across institutions / departments / labs / any two scientists

    Decentral operation, also for annexed files

    Sadly, Git does not handle large files well.

    And repository hosting services refuse to handle large files:

    Publishing datasets

    • Most public datasets separate content in Git versus git-annex behind the scenes


    • DataLad is built to maximize interoperability and streamline routines across hosting and storage technology

    Publishing datasets

      Seamless connections:
    • Datasets are exposed via a private or public repository on a repository hosting service
    • Data can't be stored on the hosting service, but can be kept in almost any third-party storage
    • Publication dependencies automate interactions to both places, e.g.,
      $ git config --local remote.github.datalad-publish-depends gdrive # or
      $ datalad siblings add --name origin --url --publish-depends s3
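
The effect of a publication dependency is an ordering: the sibling that holds the file content is served before the sibling that exposes the repository. As a hedged sketch (not DataLad's implementation), this can be modeled as a topological sort. The sibling names are taken from the example above, and `push_order` is a hypothetical helper. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def push_order(publish_depends):
    """Order siblings so every dependency is pushed before its dependent.

    `publish_depends` maps a sibling name to the siblings it depends on.
    """
    return list(TopologicalSorter(publish_depends).static_order())


# "github" depends on "gdrive": file content goes to storage first.
print(push_order({"github": {"gdrive"}}))
```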

    Publishing datasets

    Special case 1: repositories with annex support

    Publishing datasets

    Special case 2: Special remotes with repositories

    Transport logistics

    • Share data like source code
    a screenrecording of cloning studyforrest data from github

    Transport logistics: Lots of data, little disk-usage

    • Cloned datasets are lean. "Meta data" (file names, availability) are present, but no file content:
    • $ datalad clone
        install(ok): /tmp/studyforrest-data-phase2 (dataset)
      $ cd studyforrest-data-phase2 && du -sh
        18M	.
    • files' contents can be retrieved on demand:
    $ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
      get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
  • Have access to more data on your computer than you have disk-space:
  • # eNKI dataset (1.5TB, 34k files):
    $ du -sh
    1.5G	.
    # HCP dataset (~200TB, >15 million files)
    $ du -sh
    48G	. 

    (Raw) data mismanagement

    • Multiple large datasets are available on a compute cluster 🏞
    • Each researcher creates their own copies of data ⛰
    • Multiple different derivatives and results are computed from it 🏔
    • Data, copies of data, half-baked data transformations, results, and old versions of results are kept - undocumented 🌋

    Example: eNKI dataset

    • Raw data size: 1.5 TB
    • + Back-up: 1.5 TB
    • + A BIDS structured version: 1.5 TB
    • + Common, minimal derivatives (fMRIprep): ~ 4.3TB
    • + Some other derivatives: "Some other" x 5TB
    • + Copies of it all or of subsets in home and project directories

    Example: eNKI dataset

    "Can't we buy more hard drives?"


    DataLad way

    • Download the data, have a back-up
    • Transform it into a DataLad dataset
    • $ datalad create -f .
      $ datalad save -m "Snapshot raw data"
    • Move it to a common location. Everyone who needs it installs it and gets required data
    • $ datalad create my_enki_analysis
      $ datalad clone -d . /data/enki data
    • Compute results with provenance capture. Drop input data and, potentially, everything else that is no longer needed and can be recomputed automatically.

    What makes scientific workflows special?

    Scientific building blocks are not static.
    Version control beyond text
    Science is built from modular units.
    Science is exploratory, iterative, multi-stepped, and complex.
    Science is collaborative.
    Transport logistics

    Examples of what DataLad can be used for:

    • Publish or consume datasets via GitHub, GitLab, OSF, the European Open Science Cloud, or similar services
      (screen recording: browsing OpenNeuro)
    • Create and share reproducible, open science: share data, software, code, and provenance
      (screen recording: cloning the REMODNAV paper dataset from GitHub)
    • Central data management and archival system
    • Scalable computing framework for reproducible science

    Command summaries

    Summary - Local version control

    datalad create creates an empty dataset.
    Configurations (-c yoda, -c text2git) add useful structure and/or configurations.

    A dataset has a history to track files and their modifications.
    Explore it with Git (git log) or external tools (e.g., tig).

    datalad save records the dataset or file state to the history.
    Concise commit messages should summarize the change for future you and others.

    datalad status reports the current state of the dataset.
    A clean dataset status (no modifications, no untracked files) is good practice.

    Summary - Dataset consumption & nesting

      datalad clone installs a dataset.
      It can be installed “on its own”: Specify the source (url, path, ...) of the dataset, and an optional path for it to be installed to.

      Datasets can be installed as subdatasets within an existing dataset.
      The --dataset/-d option needs a path to the root of the superdataset.

      Only small files and metadata about file availability are present locally after an install.
      To retrieve actual file content of annexed files, datalad get downloads file content on demand.

      Datasets preserve their history.
      The superdataset records only the version state of the subdataset.

    Summary - Reproducible execution

      datalad run records a command and its impact on the dataset.
      All dataset modifications are saved - use it in a clean dataset.

      Data/directories specified as --input are retrieved first.
      Use one flag per input.

      Data/directories specified as --output will be unlocked for modifications prior to a rerun of the command.
      It's optional to specify, but helpful for recomputations.

      datalad containers-run can be used to capture the software environment as provenance.
      It ensures computations are run in the desired software setup. Supports Docker and Singularity containers.

      datalad rerun can automatically re-execute run-records later.
      They can be identified with any commit-ish (hash, tag, range, ...)
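
The points above (one flag per input, declared outputs, optional container) can be collected in a small helper that assembles the command line. This is a sketch under assumptions: `build_run_command` is a hypothetical function, the script name in the usage example is made up, and only the long-form options already shown in this talk are used.

```python
import shlex


def build_run_command(command, message, inputs=(), outputs=(), container=None):
    """Assemble an argument list for `datalad run` or `datalad containers-run`."""
    argv = ["datalad", "containers-run" if container else "run", "--message", message]
    if container:
        argv += ["--container-name", container]
    for path in inputs:
        argv += ["--input", path]    # one flag per input
    for path in outputs:
        argv += ["--output", path]   # one flag per output
    argv.append(command)
    return argv


argv = build_run_command(
    "python3 code/my_script.py",     # hypothetical script name
    "Time series extraction from Locus Coeruleus",
    inputs=["mri/*_bold.nii"],
    outputs=["sub-*/LC_timeseries_run-*.csv"],
    container="nilearn",
)
print(shlex.join(argv))
```

The resulting list could be handed to `subprocess.run(argv, check=True)` inside a clean dataset.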

    Take home messages

    Science has specific requirements that can impede efficiency and reproducibility.
    DataLad is one of many tools in an ecosystem of resources, infrastructure, and experts to assist you.
    DataLad sits on top of, and complements, Git and git-annex.
    Even outside of science, data deserves version control.
    It changes and evolves just like code, and exhaustive tracking lays a foundation for reproducibility.
    Data management with tools like Git or DataLad can feel technical and complex.
    But effort pays off: Increased transparency, better reproducibility, easier accessibility, efficiency through automation and collaboration, streamlined procedures for synchronizing and updating your work, ...
    The biggest beneficiary of RDM? Yourself
    The second biggest beneficiary of RDM? Yourself in 6 months
    The consequence of good RDM? Better science

    Further resources and stay in touch

    Thanks for your attention

    Slides at DOI 10.5281/zenodo.10556597

    Women neuroscientists are underrepresented in neuroscience. You can use the
    Repository for Women in Neuroscience to find and recommend neuroscientists for
    conferences, symposia or collaborations, and help make neuroscience more open & diverse.