Data Management
for Open Science



Dr. Adina Wagner
mas.to/@adswa

Institute of Neuroscience and Medicine, Brain & Behavior (INM-7)
Research Center Jülich


DOI: doi.org/10.5281/zenodo.10869053
Slides: files.inm7.de/adina/talks/html/neuraltraces.html

Research data management?






adapted from https://dribbble.com/shots/3090048-Front-end-vs-Back-end

www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates

What is DataLad?

  • Domain-agnostic command-line tool (+ graphical user interface), built on top of Git & git-annex
  • 10+ year open source project (100+ contributors), available for all major operating systems
  • Major features:
    • Version-controlling arbitrarily large content
      (version control data & software alongside code!)
    • Transport mechanisms for sharing, updating & obtaining data
    • Consume & collaborate on data (analyses) like software
    • (Computationally) reproducible data analysis
    • Track and share provenance of all digital objects
    • (... and much more)
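
A minimal first-contact sketch of these basics (dataset and file names are hypothetical, not taken from the talk):

  $ datalad create my-project                    # new dataset: a Git repository with an annex
  $ cd my-project
  $ mkdir data && cp /path/to/acquisition.nii.gz data/    # some hypothetical large file
  $ datalad save -m "Add first acquisition"      # version-control content of any size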

The building blocks of a scientific result are rarely static
Analysis code, manuscripts, ... evolve
(Rewrite, fix bugs, add functions, refactor, extend, ...)
Based on Piled Higher and Deeper 1531
The building blocks of a scientific result are rarely static
Data changes, too
(errors are fixed, data is extended,
naming standards change, an analysis
requires only a subset of your data...)
Piled Higher and Deeper 1323

Version control - beyond text files

CC-BY Scriberia & The Turing Way
  • keep things organized
  • keep track of changes
  • revert changes or go back to previous states
  • collect and share digital provenance
  • industry standard: Git

Building on Git and git-annex, DataLad version-controls any data

Version Control

  • DataLad knows two things: datasets and files
  • A DataLad dataset is a Git repository:
    • Content and domain agnostic
    • Minimal custom procedures or data structures (users must not lose data or data access if DataLad vanished)
    • Uncompromised decentralization

Version Control: Data

  • Datasets have an optional annex for (large or sensitive) data (or text/code).
  • Identity (hash) and location information goes into Git rather than file content. The annex, and transport to and from it, is managed with git-annex → decentralized version control for files of any size.
  • DataLad aims to wrap Git and git-annex in a simple core API (helpful for data management novices).
  • The flexibility and commands of Git and git-annex are preserved (useful for experienced Git/git-annex users).
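
A short sketch of what this looks like on disk (the file name is hypothetical; git annex whereis is plain git-annex and works alongside DataLad):

  $ ls -l data/sub-01_bold.nii.gz               # annexed files are symlinks into .git/annex/objects,
                                                # named after a checksum of their content
  $ git annex whereis data/sub-01_bold.nii.gz   # list the locations known for this file's content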

Delineation and advantages of decentralized versus centralized RDM: Hanke et al. (2021), In defense of decentralized research data management

Version Control: Nesting

  • Seamless nesting mechanisms:
    • hierarchies of datasets in super-/sub-dataset relationships
    • based on Git submodules, but with more streamlined handling
  • Overcomes scaling issues with large numbers of files
  • adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
    15530572 annex'd files (77.9 TB recorded total size)
    nothing to save, working tree clean
    (github.com/datalad-datasets/human-connectome-project-openaccess)
  • Modularizes research components for transparency, reuse, and access management
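
A sketch of explicit nesting (dataset names and layout are hypothetical):

  $ datalad create parent-study                       # superdataset
  $ datalad create -d parent-study parent-study/raw   # create & register a subdataset
  $ datalad clone -d parent-study \
      https://github.com/datalad-datasets/human-connectome-project-openaccess.git \
      parent-study/hcp                                # register an existing dataset as a subdataset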

Use DataLad for ...

  • ... self-descriptive, reusable projects
[Screen recording: cloning a reproducible paper that links code and data, from GitHub]
github.com/psychoinformatics-de/paper-remodnav

Use DataLad to ...

  • ... share and consume data like source code
[Screen recording: cloning studyforrest data from GitHub]

Publishing datasets

  • The best of both worlds: Publish to Git hosting services, storage providers, or both
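
A hedged sketch of one way to do this (the repository name is hypothetical, and GitHub credentials are assumed to be configured):

  $ datalad create-sibling-github my-project    # creates a sibling, named "github" by default
  $ datalad push --to github                    # publish the dataset's Git history
  # Annexed file content can additionally be pushed to a storage sibling,
  # e.g. one set up with datalad create-sibling-ria or a git-annex special remote.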

Transport logistics

  • Scientific workflows can be idiosyncratic across institutions / departments / labs / any two scientists
  • DataLad is built to maximize interoperability and streamline routines across hosting and storage technology

Lots of data, little disk-usage

  • Cloned datasets are lean: metadata (file names, availability) is present, but no file content:
  • $ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
    install(ok): /tmp/studyforrest-data-phase2 (dataset)
    $ cd studyforrest-data-phase2 && du -sh
    18M	.
  • File content can be retrieved on demand, and dropped again when no longer needed:
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
drop(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]
  • Have access to more data on your computer than you have disk space:
  • # eNKI dataset (1.5TB, 34k files):
    $ du -sh
    1.5G	.
    # HCP dataset (~200TB, >15 million files)
    $ du -sh
    48G	. 
    # Python API (import datalad.api as dl): fetch inputs just in time, drop them when done
    dl.get('input/sub-01')
    # ... [really complex analysis] ...
    dl.drop('input/sub-01')
    

    Have yourself some data

    > 500TB of open data available at datasets.datalad.org
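
    A sketch of browsing it with DataLad ("///" is a shortcut for datasets.datalad.org; the name of the resulting directory may differ between versions):

    $ datalad clone ///               # clone the superdataset: lean, no file content yet
    $ cd datasets.datalad.org
    $ datalad subdatasets             # list the datasets nested inside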

    Reusing past work

    ... isn't necessarily simple

    Your past self is the worst collaborator:

    Full comic at http://phdcomics.com/comics.php?f=1979

    Lack of provenance can be devastating

    • Data analysis and data wrangling are complex
      • Move/Copy/Rename/Reorganize/Transform/Compute/... data
    • Mistakes propagate through the complete analysis pipeline, and early mistakes are especially hard to find!
    CC-BY Scriberia and The Turing Way

    Leaving a trace

    "Shit, which version of which script produced these outputs from which version of what data?"

    "Shit, why buttons did I click and in which order did I use all those tools?"


    CC-BY Scriberia and The Turing Way

    Leaving a trace

    datalad run wraps around anything expressed in a command line call and saves the dataset modifications resulting from the execution.

    datalad rerun repeats captured executions. If the outcomes differ, it saves their new state.

    datalad containers-run executes command line calls inside a tracked software container and saves the dataset modifications resulting from the execution.
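
    A minimal datalad run sketch (script and file names are hypothetical):

    $ datalad run \
      -m "Compute summary statistics" \
      --input 'raw/*.csv' \
      --output 'results/summary.csv' \
      "python3 code/summarize.py"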

    Data analysis provenance

    Enshrine the analysis in a script (here: extract_lc_timeseries.py) and record its execution together with input data, output files, and the software environment in a single command. The result is a machine-readable record of which data, code, and software produced a result, how, when, and why.

    $ datalad containers-run \
      --message "Time series extraction from Locus Coeruleus"
      --container-name nilearn \
      --input 'mri/*_bold.nii' \
      --output 'sub-*/LC_timeseries_run-*.csv' \
      "python3 code/extract_lc_timeseries.py"
    
    -- Git commit --
        commit 5a7565a640ff6de67e07292a26bf272f1ee4b00e
        Author:     Adina Wagner adina.wagner@t-online.de
        AuthorDate: Mon Nov 11 16:15:08 2019 +0100
        Commit:     Adina Wagner adina.wagner@t-online.de
        CommitDate: Mon Nov 11 16:15:08 2019 +0100
    
        [DATALAD RUNCMD] Time series extraction from Locus Coeruleus
        === Do not change lines below ===
        {
         "cmd": "singularity exec --bind {pwd} .datalad/environments/nilearn.simg bash..",
         "dsid": "92ea1faa-632a-11e8-af29-a0369f7c647e",
         "inputs": [
          "mri/*.bold.nii.gz",
          ".datalad/environments/nilearn.simg"
         ],
         "outputs": ["sub-*/LC_timeseries_run-*.csv"],
         ...
        }
        ^^^ Do not change lines above ^^^
    ---
     sub-01/LC_timeseries_run-1.csv | 1 +
    ...
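
    For completeness: containers-run requires a container registered beforehand via the datalad-container extension. A sketch (the container URL is hypothetical):

    $ datalad containers-add nilearn \
      --url shub://hypothetical/nilearn-container:latest \
      --call-fmt "singularity exec {img} {cmd}"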
    


    Data analysis provenance

    Use the unique identifier of the execution record to have a machine recompute and verify past work:

    $ datalad rerun 5a7565a640ff6de67
    [INFO   ] run commit 5a7565a640ff6de67; (Time series extraction from Locus Coeruleus)
    [INFO   ] Making sure inputs are available (this may take some time)
    get(ok): mri/sub-01_bold.nii (file)
    get(ok): mri/sub-02_bold.nii (file)
            [...]
    [INFO   ] == Command start (output follows) =====
    [INFO   ] == Command exit (modification check follows) =====
    add(ok): sub-01/LC_timeseries_run-*.csv (file)
    add(ok): sub-02/LC_timeseries_run-*.csv (file)
            [...]
    action summary:
      add (ok: 30)
      get (ok: 30)
      save (ok: 2)
      unlock (ok: 30)
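
    Beyond a single commit, entire ranges of recorded executions can be replayed or inspected. A sketch (the commit reference is hypothetical):

    $ datalad rerun --since=HEAD~5            # re-execute every run record since that commit
    $ datalad rerun --report --since=HEAD~5   # only report what would be re-executed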
        


    Provenance at the largest scale:

    www.nature.com/articles/s41597-022-01163-2

    Research data management is tied to reproducibility

    Based on xkcd.com/2347/ (CC-BY) Reproducibility Management in Neuroscience - Specific Issues and Solutions (DOI 10.5281/zenodo.4285927)

    Further Information



    Install DataLad on your own hardware: handbook.datalad.org/r.html?install

    Acknowledgements

    Software
    • Joey Hess (git-annex)
    • The DataLad team & contributors


    Thanks!



    (scan the QR code for slides)
    Funders
    Collaborators