Adina Wagner
@AdinaKrik
Psychoinformatics lab,
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7), Research Center Jülich
This is a metaphor for many computational clusters without RDM
(all of those are valid reasons for RDM, but it's fun if you have a Minion attitude)
$ git config --add user.name "Bob McBobface"
$ git config --add user.email bob@example.com
Funders
Collaborators
Procedurally, version control is easy with DataLad!
datalad create creates an empty dataset.
datalad save records the dataset or file state to the history.
datalad download-url obtains web content and records its origin.
datalad status reports the current state of the dataset.
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .
$ ls
code/
src/
stimuli
sub-01/
sub-02/
sub-03/
sub-04/
[...]
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
# eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (80TB, 15 million files)
$ du -sh
48G .
datalad clone installs a dataset.
datalad get retrieves file contents on demand.
datalad drop can remove file content on demand.
datalad update --merge brings an installed dataset up to date with its source.
datalad run
$ datalad rerun eee1356bb7e8f921174e404c6df6aadcc1f158f0
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
add(ok): sub-01/LC_timeseries_run-1.csv (file)
...
save(ok): . (dataset)
action summary:
add (ok: 45)
save (notneeded: 45, ok: 1)
unlock (notneeded: 45)
...
datalad-container
datalad containers-run
datalad run records a command and its impact on the dataset.
Files specified with --input are retrieved prior to command execution.
Files specified with --output will be unlocked for modifications prior to a rerun of the command.
datalad containers-run can be used to capture the software environment as provenance.
datalad rerun can automatically re-execute run-records later.
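As a rough illustration of how the options summarized above fit together, the following sketch assembles a `datalad run` call from a message, input paths, output paths, and a command. The helper `build_run_command` and the commit message are hypothetical, invented for this example. Only the `-m`, `--input`, and `--output` options come from the slides. The script prints the command line and does not execute it.

```python
import shlex


def build_run_command(message, command, inputs=(), outputs=()):
    """Assemble the argument list for a `datalad run` call (hypothetical helper)."""
    args = ["datalad", "run", "-m", message]
    for path in inputs:
        # Each input is retrieved before the command is executed.
        args += ["--input", path]
    for path in outputs:
        # Each output is unlocked so that the command may modify it.
        args += ["--output", path]
    # The command to record goes last, as a single string.
    args.append(command)
    return args


args = build_run_command(
    message="Extract timeseries",  # assumed commit message
    command="bash code/mycode.sh",
    inputs=["data/input_data/*.nii.gz"],
    outputs=["sub-01/LC_timeseries_run-1.csv"],
)

# Print a shell-quoted version that could be pasted into a terminal.
print(shlex.join(args))
```

Quoting the glob pattern keeps the shell from expanding it, so that the pattern itself ends up in the run record.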
Everything you need to know about sharing datasets is in the chapter "Third party infrastructure".
$ git log some_result_file
commit 593aa8018116ca9d198ce4bfd9e09af3476c7a9b
Author: Elena Piscopia <elena@example.net>
Date: Thu Sep 3 13:35:51 2020 +0200
[DATALAD RUNCMD] Re-create the results with most recent data
=== Do not change lines below ===
{
"chain": [
"38e18c0cd73627e10b620b1ba08e4be2caba18e7"
],
"cmd": "bash code/mycode.sh",
"dsid": "57ce4457-a29b-4bd0-be6f-a9da8d46aee3",
"exit": 0,
"extra_inputs": [],
"inputs": ["data/input_data/*.nii.gz"],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
$ datalad rerun 593aa8018116ca
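Because the run record is plain JSON between two marker lines, it can also be read by other tools. The sketch below is an illustration only and not DataLad's own parser. The function name `extract_run_record` is hypothetical, and the example message is a shortened stand-in for the commit shown above.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message):
    """Return the embedded run record as a dict, or None if there is none."""
    if START not in commit_message or END not in commit_message:
        return None
    # Keep only the text between the two marker lines and parse it as JSON.
    body = commit_message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)


# Shortened example message (assumed, modeled on the log output above)
message = """[DATALAD RUNCMD] Re-create the results with most recent data

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "bash code/mycode.sh",
 "exit": 0,
 "inputs": ["data/input_data/*.nii.gz"],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = extract_run_record(message)
print(record["cmd"], record["inputs"])
```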
$ datalad create -f .
$ datalad save -m "Snapshot raw data"
$ datalad create my_enki_analysis
$ datalad clone -d . /data/enki data
Script to move events.tsv files (different per subject) to the directory with functional MRI data:
import shutil
from glob import glob
from os import path
from pathlib import Path

for sourcefile, dest in zip(glob(path_to_events),          # note: not sorted!
                            glob(path_to_fMRI_subjects)):  # note: not sorted!
    destination = path.join(dest, Path(sourcefile).name)
    shutil.move(sourcefile, destination)
Researcher shares analysis with others
😱
Everyone makes mistakes - the earlier we find them or guard against them, the better for science!
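One way to guard against this particular mistake is to pair files by subject ID instead of relying on the order in which two glob calls return their results. The sketch below assumes a layout of `eventfiles/sub-XX/events.tsv` and `analysis/sub-XX/`. The function name `pair_by_subject` and the demo directory tree are hypothetical and exist only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def pair_by_subject(event_root, fmri_root):
    """Map each subject's events.tsv to the same subject's fMRI directory."""
    pairs = {}
    for events in sorted(Path(event_root).glob("sub-*/events.tsv")):
        subject = events.parent.name  # e.g. "sub-01"
        target_dir = Path(fmri_root) / subject
        if not target_dir.is_dir():
            # Fail loudly instead of silently pairing with the wrong subject.
            raise FileNotFoundError(f"no fMRI directory for {subject}")
        pairs[events] = target_dir / events.name
    return pairs


# Demo on a throwaway directory tree (assumed layout)
root = Path(tempfile.mkdtemp())
for sub in ["sub-02", "sub-01", "sub-10"]:
    (root / "eventfiles" / sub).mkdir(parents=True)
    (root / "eventfiles" / sub / "events.tsv").write_text(sub)
    (root / "analysis" / sub).mkdir(parents=True)

for source, destination in pair_by_subject(root / "eventfiles", root / "analysis").items():
    shutil.move(str(source), str(destination))

# Each moved file should contain the ID of the directory it landed in.
moved = {p.parent.name: p.read_text() for p in (root / "analysis").glob("sub-*/events.tsv")}
print(moved)
```

Deriving the destination from the subject ID makes the pairing explicit, and a missing directory raises an error rather than shifting every later subject by one.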
$ datalad run -m "Copy event files" \
  'for sub in eventfiles/*; do
     mv "$sub/events.tsv" "analysis/$(basename "$sub")/events.tsv";
   done'
$ datalad copy-file ../eventfiles/sub-01/events.tsv sub-01/ -d .
copy_file(ok): /data/project/coolstudy/eventfiles/events.tsv [/data/project/coolstudy/analysis/sub-01/events.tsv]
save(ok): /data/project/coolstudy/analysis (dataset)
action summary:
copy_file (ok: 1)
save (ok: 1)
| Git | git-annex |
| --- | --- |
| Handles small files well (text, code) | Handles all types and sizes of files well |
| File contents are in the Git history and will be shared upon git/datalad push | File contents are in the annex and are not necessarily shared |
| Shared with every dataset clone | Can be kept private on a per-file level when sharing the dataset |
| Useful for small, non-binary, frequently modified, need-to-be-accessible files (DUA, README) | Useful for large files and private files |
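Which files are annexed can be configured:
with run procedures provided by DataLad (e.g., text2git, yoda) or created and shared by users (tutorial at handbook.datalad.org),
with rules in .gitattributes (e.g., based on file type, file/path name, size, ...),
or per command (e.g., datalad save --to-git).
adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r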
15530572 annex'd files (77.9 TB recorded total size)
nothing to save, working tree clean
(github.com/datalad-datasets/human-connectome-project-openaccess)
Comprehensive user documentation in the DataLad Handbook (handbook.datalad.org)
Open an issue on GitHub if you have more questions!