Research Data Management with DataLad
🚀
for easier, open, and transparent science

Adina Wagner
@AdinaKrik

Psychoinformatics lab,
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7)
Research Center Jülich

Research data management (RDM)

  • (Research) Data = every digital object involved in your project: code, software/tools, raw data, processed data, results, manuscripts ...
  • Data needs to be managed FAIRly, from creation through use, publication, sharing, archiving, re-use, or destruction.
  • Research data management is a key component for reproducibility, efficiency, and impact/reach of data analysis projects
JISC; CC-BY-SA-ND

Why data management?

adapted from https://dribbble.com/shots/3090048-Front-end-vs-Back-end
⬆
This is a metaphor for most projects after publication

Why data management?


This is a metaphor for reproducing (your own) research
a few months after publication
⬇

Why data management?

This is a metaphor for
many computational ➡
clusters without RDM
https://infostory.files.wordpress.com/2013/03/big_data_cartoon.jpeg

Why data management? Different view points


    "Oh well if others say so": External requirements and expectations
    Funders & publishers require it
    Scientific peers increasingly expect it
    "There is no other way": Some datasets require it
    Exciting datasets (UKBiobank, HCP, ...) are so large that neither computational infrastructure nor typical analysis workflows scale to their sizes
    "OMG when can I start?": Intrinsic motivation and personal & scientific benefits
    The quality, efficiency and replicability of your work improves

(All of these are valid reasons for RDM, but it's more fun with a Minion attitude.)

Today

Further resources

  • Everything I'm talking about is documented in text and video tutorials, and you can reach out with any questions!

  • Comprehensive user documentation in the DataLad Handbook (handbook.datalad.org)

polling system for live-feedback

Let's start

Requirements

  • DataLad version 0.12.x or later (Installation instructions at handbook.datalad.org)

  • A configured Git identity:
  • $ git config --global --add user.name "Bob McBobface"
    $ git config --global --add user.email bob@example.com


    (You still have about 5 minutes to install it.)

DataLad can help
with small or large-scale
data management
Free,
open source,
command line tool & Python API

Acknowledgements

Software
  • Michael Hanke
  • Yaroslav Halchenko
  • Joey Hess (git-annex)
  • Kyle Meyer
  • Benjamin Poldrack
  • 26 additional contributors
Documentation project
  • Michael Hanke
  • Laura Waite
  • 28 additional contributors
Funders

Collaborators

Core Features:

  • Joint version control (Git, git-annex) for code, software, and data
  • Provenance capture: Create and share machine-readable, re-executable records of your data analysis for reproducible, transparent, and FAIR research
  • Data transport mechanisms: Install or share complete projects in a lightweight fashion, retrieve data on demand, drop it to free up space without losing data access or provenance, and collaborate remotely on scientific projects

Examples of what DataLad can be used for:

  • Publish or consume datasets via GitHub, GitLab, OSF, or similar services
  • [Screen recording: cloning the studyforrest dataset from GitHub]

Examples of what DataLad can be used for:

  • Creating and sharing reproducible, open science: Sharing data, software, code, and provenance
  • [Screen recording: cloning the REMODNAV paper dataset from GitHub]

Examples of what DataLad can be used for:

  • Central data management and archival system

Examples of what DataLad can be used for:

    ... and much more!

Code along

Code to follow along: handbook.datalad.org/en/latest/code_from_chapters/MPI_code.html

Version control

Why version control?


  • keep things organized
  • keep track of changes
  • revert changes or go back to previous states

Version Control

  • DataLad knows two things: Datasets and files
  • A DataLad dataset is a Git/git-annex repository:
    • For Git users: Use workflows from software development for science!
    • Content and domain agnostic - Manage science, or your music library
    • Minimization of custom procedures or data structures - A PDF stays a PDF, and users won't lose data or data access if DataLad vanishes

Local version control

Procedurally, version control is easy with DataLad!


Advice:
  • Save meaningful units of change
  • Attach helpful commit messages

Summary - Local version control

datalad create creates an empty dataset.
Configurations (-c yoda, -c text2git) are useful (details soon).

A dataset has a history to track files and their modifications.
Explore it with Git (git log) or external tools (e.g., tig).

datalad save records the dataset or file state to the history.
Concise commit messages should summarize the change for future you and others.

datalad download-url obtains web content and records its origin.
It even takes care of saving the change.

datalad status reports the current state of the dataset.
A clean dataset status is good practice.
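
As a minimal sketch of this workflow (dataset name, file name, and URL are placeholders, not part of the demo):

$ datalad create -c text2git my-dataset                    # new dataset with a helpful configuration
$ cd my-dataset
$ datalad download-url https://example.com/some_file.txt   # records origin and saves automatically
$ echo "initial notes" > notes.txt
$ datalad save -m "Add initial notes"                      # concise message for future you
$ datalad status                                           # should report a clean dataset
$ git log --oneline                                        # inspect the history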

Questions!

Consuming datasets

  • Here's how a dataset looks after installation:

Plenty of data, but little disk-usage

  • Cloned datasets are lean: metadata (file names, availability) is present, but no file content:
  • $ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
    install(ok): /tmp/studyforrest-data-phase2 (dataset)
    $ cd studyforrest-data-phase2 && du -sh
    18M	.
    $ ls
    code/
    src/
    stimuli
    sub-01/
    sub-02/
    sub-03/
    sub-04/
    [...]
  • File contents can be retrieved on demand:
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
  • Have access to more data than you have disk space:
  • # eNKI dataset (1.5TB, 34k files):
    $ du -sh
      1.5G	.
    # HCP dataset (80TB, 15 million files)
    $ du -sh
    48G	.
    

    Sharing datasets

    • Share data with others and keep them up to date, or get data from someone and stay up to date (datalad update --merge)
    • Have all updates in your dataset history, but pick the version you want to work with

    Summary - Dataset consumption & nesting

      datalad clone installs a dataset.
      It works with local and remote sources (paths or URLs).

      Datasets can be installed as subdatasets within an existing dataset
      using the --dataset/-d option. This is useful for transparency, cleanliness, and scalability.

      Only small files and file availability metadata are present.
      datalad get retrieves file contents on demand, datalad drop can remove file content on demand.

      Datasets preserve their history.
      Superdatasets record the exact version state of their subdatasets.
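
      A minimal sketch of these commands (the superdataset name and subdataset path are placeholders; the URL is the studyforrest dataset shown earlier):

      $ datalad create my-analysis
      $ cd my-analysis
      $ datalad clone -d . \
          https://github.com/psychoinformatics-de/studyforrest-data-phase2.git \
          inputs/forrest                                  # registered and saved as a subdataset
      $ datalad get inputs/forrest/sub-01/ses-movie/func  # retrieve file contents on demand
      $ datalad drop inputs/forrest/sub-01/ses-movie/func # free up space again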

    Questions!

    Reproducible data analysis

    Your past self is the worst collaborator: Full comic at http://phdcomics.com/comics.php?f=1979

    Basic organizational principles for datasets

    Keep everything clean and modular
  • An analysis is a superdataset, its components are subdatasets, and its structure modular
  • ├── code/
    │   ├── tests/
    │   └── myscript.py
    ├── docs
    │   ├── build/
    │   └── source/
    ├── envs
    │   └── Singularity
    ├── inputs/
    │   └── data/
    │       ├── dataset1/
    │       │   └── datafile_a
    │       └── dataset2/
    │           └── datafile_a
    ├── outputs/
    │   └── important_results/
    │       └── figures/
    └── README.md
    • Do not touch/modify raw data: save any results/computations outside of input datasets
    • Keep a superdataset self-contained: Scripts reference subdatasets or files with relative paths
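
    A sketch of how such a structure could be set up (dataset names and the input path are hypothetical):

    $ datalad create -c yoda myanalysis                              # superdataset with code/, README, ...
    $ cd myanalysis
    $ datalad clone -d . /path/to/raw/dataset1 inputs/data/dataset1  # register input data as a subdataset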

    Basic organizational principles for datasets

    Record where you got it from, where it is now, and what you do to it
  • Link datasets (as subdatasets), record data origin
  • Collect and store provenance of all contents of a dataset that you create
  • Document everything:
  • Which script produced which output? From which data? In which software environment? ...
  • Find out more about organizational principles in the YODA principles!

    A classification analysis on the iris flower dataset

    Reproducible execution & provenance capture

    datalad run
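
    A sketch of a provenance-captured command (the script and file names are hypothetical):

    $ datalad run -m "Extract LC timeseries for sub-01" \
        --input "inputs/data/sub-01/ses-movie/func/*_bold.nii.gz" \
        --output "sub-01/LC_timeseries_run-1.csv" \
        "python code/extract_lc_timeseries.py sub-01"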

    Provenance capture

    • Those "run records" are stored in a dataset's history and can be automatically rerun:
    • $ datalad rerun eee1356bb7e8f921174e404c6df6aadcc1f158f0
      [INFO] == Command start (output follows) =====
      [INFO] == Command exit (modification check follows) =====
      add(ok): sub-01/LC_timeseries_run-1.csv (file)
      ...
      save(ok): . (dataset)
      action summary:
        add (ok: 45)
        save (notneeded: 45, ok: 1)
        unlock (notneeded: 45)
      ...

    Computational reproducibility

    • Code may fail (to reproduce) if run with different software
    • Datasets can store (and share) software environments (Docker or Singularity containers) and reproducibly execute code inside the container, capturing the software environment as additional provenance
    • DataLad extension: datalad-container

    datalad containers-run
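
    A sketch, assuming the datalad-container extension is installed (container name, URL, and script are placeholders):

    $ datalad containers-add nilearn \
        --url shub://adswa/nilearn-container:latest        # register a software container in the dataset
    $ datalad containers-run -m "Compute results in the container" \
        --container-name nilearn \
        --input "inputs/data/sub-01" \
        --output "sub-01/LC_timeseries_run-1.csv" \
        "python code/extract_lc_timeseries.py sub-01"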

    Summary - Reproducible execution

      datalad run records a command and its impact on the dataset.
      All dataset modifications are saved - use it in a clean dataset.

      Data/directories specified as --input are retrieved prior to command execution.
      Use one flag per input.

      Data/directories specified as --output will be unlocked for modifications prior to a rerun of the command.
      It is optional to specify outputs, but helpful for recomputations.

      datalad containers-run can be used to capture the software environment as provenance.
      It ensures computations are run in the desired software setup.

      datalad rerun can automatically re-execute run-records later.
      They can be identified with any commit-ish (hash, tag, range, ...)

    Questions!

    Interested in more about computational reproducibility? Check out the use case on DataLad for machine-learning analysis at handbook.datalad.org

    Datasets for yourself and others

    • DataLad is built to maximize interoperability and use with hosting and storage technology: Share datasets with the services you use anyway


    Everything you need to know about sharing datasets is in the chapter on third party infrastructure

    Why use DataLad?

    • Mistakes are not forever anymore: Easy version control, regardless of file size
    • Who needs short-term memory when you can have run-records?
    • Disk-usage magic: Have access to more data than your hard drive has space
    • Collaboration and updating mechanisms: Alice shares her data with Bob. Alice fixes a mistake and pushes the fix. Bob says "datalad update" and gets her changes. And vice-versa.
    • Transparency: Shared datasets keep their history. No need to track down a former student to ask what was done in their project.
    • No need to ask colleagues what they did, you can ask the files how they came to be:
    • $ git log some_result_file
      commit 593aa8018116ca9d198ce4bfd9e09af3476c7a9b
      Author: Elena Piscopia <elena@example.net>
      Date:   Thu Sep 3 13:35:51 2020 +0200
      
          [DATALAD RUNCMD] Re-create the results with most recent data
      
          === Do not change lines below ===
          {
           "chain": [
            "38e18c0cd73627e10b620b1ba08e4be2caba18e7"
           ],
           "cmd": "bash code/mycode.sh",
           "dsid": "57ce4457-a29b-4bd0-be6f-a9da8d46aee3",
           "exit": 0,
           "extra_inputs": [],
           "inputs": data/input_data/*.nii.gz,
           "outputs": [],
           "pwd": "."
          }
          ^^^ Do not change lines above ^^^
      
    • ... and then have a machine re-do it:
    • $ datalad rerun 593aa8018116ca

    Questions!

    Real-life examples

    (Raw) data mismanagement

    • Multiple large datasets are available on a compute cluster 🏞
    • Each researcher creates their own copies of data ⛰
    • Multiple different derivatives and results are computed from it 🏔
    • Data, copies of data, half-baked data transformations, results, and old versions of results are kept - undocumented 🌋

    Example: eNKI dataset

    • Raw data size: 1.5 TB
    • + Back-up: 1.5 TB
    • + A BIDS structured version: 1.5 TB
    • + Common, minimal derivatives (fMRIprep): ~ 4.3TB
    • + Some other derivatives: "Some other" x 5TB
    • + Copies of it all or of subsets in home and project directories

    Example: eNKI dataset

    "Can't we buy more hard drives?"

    No.

    DataLad way

    • Download the data, have a back-up
    • Transform it into a DataLad dataset
    • $ datalad create -f .
      $ datalad save -m "Snapshot raw data"
    • Move it to a common location. Everyone who needs it installs it and gets required data
    • $ datalad create my_enki_analysis
      $ datalad clone -d . /data/enki data
    • Compute results with provenance capture. Drop input data and, potentially, everything that's not relevant and automatically re-computed.
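
    A sketch of that last step (the script and paths are hypothetical):

      $ datalad run -m "Compute results" \
          --input "data/sub-*" --output "results/" \
          "python code/analysis.py"
      $ datalad drop -r data          # free up space; content stays retrievable from /data/enki
      $ datalad get data/sub-01       # ... and can be re-obtained on demand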

    Lack of provenance can be devastating

    • Data analyses typically start with data wrangling:
      • Move/Copy/Rename/Reorganize/... data
    • Mistakes propagate through the complete analysis pipeline - especially those early ones are hard to find!
    CC-BY Scriberia and The Turing Way

    Example: "Let me just copy those files..."

    • Researcher builds an analysis dataset and moves events.tsv files (different per subject) to the directory with functional MRI data
  • import shutil
    from glob import glob
    from os import path
    from pathlib import Path
    for sourcefile, dest in zip(glob(path_to_events),          # note: not sorted!
                                glob(path_to_fMRI_subjects)):  # note: not sorted!
        destination = path.join(dest, Path(sourcefile).name)
        shutil.move(sourcefile, destination)
    eventfiles/                  analysis/
    ├── sub-01                   ├── sub-01
    │   ├── events.tsv           │   ├── bold.nii.gz
    ├── sub-02                   │   └── events.tsv  # from subject 8
    │   ├── events.tsv           ├── sub-02
    ├── sub-03          --->     │   ├── bold.nii.gz
    │   ├── events.tsv           │   └── events.tsv  # from subject 42
    ├── sub-04                   ├── sub-03
    │   ├── events.tsv           │   ├── bold.nii.gz
    [...]                        │   └── events.tsv  # from subject 21
    

    Researcher shares analysis with others
    😱

    "I would never make such a mistake, I'm way more
    • organized
    • knowledgeable
    • experienced
    "

    Everyone makes mistakes - the earlier we find them or guard against them, the better for science!

    Leave a trace!

    $ datalad run -m "Copy event files" \
      'for sub in eventfiles/sub-*; do
           mv "$sub"/events.tsv analysis/"$(basename "$sub")"/events.tsv;
       done'
    $ datalad copy-file ../eventfiles/sub-01/events.tsv sub-01/ -d .
    copy_file(ok): /data/project/coolstudy/eventfiles/events.tsv [/data/project/coolstudy/analysis/sub-01/events.tsv]
    save(ok): /data/project/coolstudy/analysis (dataset)
    action summary:
      copy_file (ok: 1)
      save (ok: 1)

    Writing a reproducible paper

    Live-Demo!

    Writing a reproducible paper

    • The details of how the reproducible paper was created (Makefiles, Python code, LaTeX-based manuscript) are arbitrary - there are many ways of creating them.
    • What I regard as important is the backbone that DataLad provides: a vehicle to link data to code and distribute them together, and the means to collaborate on science as one would in software development

    Thank you!

    Back-up/Details

    Git versus Git-annex

    Data in datasets is either stored in Git or git-annex
    By default, everything is stored in git-annex


    Git
    • handles small files well (text, code)
    • file contents are in the Git history and will be shared upon git/datalad push
    • shared with every dataset clone
    • useful for: small, non-binary, frequently modified, need-to-be-accessible files (DUA, README)

    git-annex
    • handles all types and sizes of files well
    • file contents are in the annex and not necessarily shared
    • can be kept private on a per-file level when sharing the dataset
    • useful for: large files, private files

    Git versus Git-annex

    Useful background information for the demo later. Read the handbook chapter on this topic for details.
    Git and Git-annex handle files differently: annexed files are stored in an annex. File content is hashed & only content-identity is committed to Git.
    • Files stored in Git are modifiable, files stored in Git-annex are content-locked
    • Annexed contents are not available right after cloning; only content identity and availability information are (as they are stored in Git)

    Git versus Git-annex

      When sharing datasets with someone without access to the same computational infrastructure, annexed data is not necessarily stored together with the rest of the dataset.
      Transport logistics exist to interface with all major storage providers. If the one you use isn't supported, let us know!

    Git versus Git-annex

      Users can decide which files are annexed:

    • Pre-made run-procedures, provided by DataLad (e.g., text2git, yoda) or created and shared by users (Tutorial at handbook.datalad.org)
    • Self-made configurations in .gitattributes (e.g., based on file type, file/path name, size, ...)
    • Per-command basis (e.g., via datalad save --to-git)
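
    A sketch of the latter two options (patterns and file name are only examples): the first rule keeps Python code in Git, the second annexes only files larger than 100 KB.

    $ cat .gitattributes
    *.py annex.largefiles=nothing
    ** annex.largefiles=(largerthan=100kb)
    $ datalad save --to-git -m "Add analysis script" code/myscript.py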

    Datasets scale!

  • Nesting overcomes scaling issues with large numbers of files. Largest dataset so far: 80TB, 15 million files.
  • adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
    15530572 annex'd files (77.9 TB recorded total size)
    nothing to save, working tree clean
    (github.com/datalad-datasets/human-connectome-project-openaccess)

    Find out more

    Comprehensive user documentation in the
    DataLad Handbook (handbook.datalad.org)
    • High-level function/command overviews,
      Installation, Configuration
    • Narrative-based code-along course
    • Independent of background/skill level,
      suitable for data management novices
    • Step-by-step solutions to common
      data management problems, like
      how to make a reproducible paper

    Further info and reading

    Everything I am talking about is documented in depth elsewhere:

    Open an issue on GitHub if you have more questions!