Adina Wagner @AdinaKrik
Psychoinformatics lab,
Institute of Neuroscience and Medicine (INM-7), Research Center Jülich




Code for hands-on: handbook.datalad.org
$ datalad create mydataset
# in Python
import datalad.api as dl
dl.create(path="mydataset")
# in R
> system("datalad create mydataset")
$ datalad save -m "Saving changes" --recursive
$ datalad save -h
Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
[-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
[--version]
[PATH ...]
Use '--help' to get more comprehensive information.
[Figure: Terminal view]
[Figure: File viewer]
Stay flexible:
- Analysis code evolves (fix bugs, add functions, refactor, ...)
- Data changes (errors are fixed, data is extended, naming standards change, an analysis requires only a subset of your data, ...)
Provenance = "The tools and processes used to create a digital file, the responsible entity, and when and where the process events occurred"
$ datalad clone --dataset . http://example.com/ds inputs/rawdata
$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "inputs/rawdata"]
+ path = inputs/rawdata
+ datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
+ datalad-url = http://example.com/importantds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
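The diff above shows that a subdataset is registered as a record in .gitmodules. As a minimal sketch of how such a record could be read programmatically, the following uses Python's standard configparser on the text from the diff. The function name read_subdatasets is hypothetical, and this is not how DataLad itself reads the file.

```python
import configparser

# Record copied from the diff above (leading "+" markers removed).
GITMODULES = """\
[submodule "inputs/rawdata"]
    path = inputs/rawdata
    datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
    datalad-url = http://example.com/importantds
"""


def read_subdatasets(text):
    """Return {submodule name: {key: value}} from .gitmodules-style text."""
    parser = configparser.ConfigParser(interpolation=None)
    # Strip indentation so no line is mistaken for a continuation line.
    parser.read_string("\n".join(line.strip() for line in text.splitlines()))
    records = {}
    for section in parser.sections():
        name = section.split('"')[1]  # section looks like: submodule "name"
        records[name] = dict(parser[section])
    return records


subdatasets = read_subdatasets(GITMODULES)
for name, record in subdatasets.items():
    print(name, record["datalad-id"])
```

The mode 160000 entry in the second half of the diff pins the subdataset to one exact commit; the .gitmodules record only says where that subdataset lives and where it came from.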
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
# eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (~200TB, >15 million files)
$ du -sh
48G .
Git does not handle large files well.
And repository hosting services refuse to handle large files:


git-annex to the rescue! Let's take a look at how it works.
$ ls -l sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
lrwxrwxrwx 1 adina adina 142 Jul 22 19:45 sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz ->
../../.git/annex/objects/kZ/K5/MD5E-s24180157--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz/MD5E-s24180157
--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz
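Because an annexed file is a symlink into the object store, one way to tell whether its content is locally available is to check whether the link resolves. The sketch below assumes locked files in a default (non-adjusted) repository on a POSIX file system; the helper name and the demo file names are hypothetical.

```python
import os
import tempfile


def annex_content_present(path):
    """True if path is a symlink whose target exists, False if it dangles.

    Raises ValueError for anything that is not a symlink.
    """
    if not os.path.islink(path):
        raise ValueError(f"{path} is not a symlink")
    return os.path.exists(path)  # exists() follows the link


# Demonstration in a throwaway directory (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "objects")
    os.mkdir(store)
    with open(os.path.join(store, "KEY-A"), "w") as f:
        f.write("content")
    present = os.path.join(tmp, "present.nii.gz")
    missing = os.path.join(tmp, "missing.nii.gz")
    os.symlink(os.path.join("objects", "KEY-A"), present)  # target exists
    os.symlink(os.path.join("objects", "KEY-B"), missing)  # dangling link
    status = {
        "present": annex_content_present(present),
        "missing": annex_content_present(missing),
    }
print(status)
```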
(PS: especially useful in datasets with many identical files)
$ md5sum sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
aeb0e5f2e2d5fe4ade97117a8cc5232f sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
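The checksum matches the hash embedded in the symlink target, which is why identical files end up under the same key. A minimal sketch of building a key of the shape seen above (MD5E-s<size>--<md5><extension>) follows. The function name is hypothetical, the extension is passed in explicitly rather than derived by git-annex's own rules, and the file names are made up for the demonstration.

```python
import hashlib
import os
import tempfile


def md5e_style_key(path, extension):
    """Build a key shaped like MD5E-s<size>--<md5><extension> for a file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    size = os.path.getsize(path)
    return f"MD5E-s{size}--{digest.hexdigest()}{extension}"


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.txt")
    copy = os.path.join(tmp, "b.txt")
    for name in (first, copy):
        with open(name, "wb") as f:
            f.write(b"same bytes\n")
    key_first = md5e_style_key(first, ".txt")
    key_copy = md5e_style_key(copy, ".txt")

print(key_first)
print(key_first == key_copy)  # identical content gives an identical key
```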

$ git annex whereis code/nilearn-tutorial.pdf
whereis code/nilearn-tutorial.pdf (2 copies)
cf13d535-b47c-5df6-8590-0793cb08a90a -- [datalad]
e763ba60-7614-4b3f-891d-82f2488ea95a -- jovyan@jupyter-adswa:~/my-analysis [here]
datalad: https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf
Delineation and advantages of decentralized versus central RDM: Hanke et al. (2021). In defense of decentralized research data management
Two consequences:
datalad get retrieves file content from wherever it is stored.
| Git | git-annex |
|---|---|
| handles small files well (text, code) | handles all types and sizes of files well |
| file contents are in the Git history and will be shared upon git/datalad push | file contents are in the annex. Not necessarily shared |
| Shared with every dataset clone | Can be kept private on a per-file level when sharing the dataset |
| Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files | Useful: Large files, private files |
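How files are split between the two columns can be steered with git-annex's annex.largefiles attribute in a dataset's .gitattributes file. The sketch below writes three such rules into a throwaway directory. The rule syntax is git-annex's, but the patterns, the 100kb threshold, and the helper name write_gitattributes are illustrative assumptions, not DataLad defaults.

```python
import os
import tempfile

# Later lines override earlier ones for the same attribute, so the
# catch-all size rule comes first and the exceptions follow it.
RULES = [
    "* annex.largefiles=(largerthan=100kb)",  # large files go to the annex
    "*.md annex.largefiles=nothing",          # always keep these in Git
    "*.nii.gz annex.largefiles=anything",     # always annex these
]


def write_gitattributes(dataset_dir, rules):
    """Write the rules to <dataset_dir>/.gitattributes and return its path."""
    path = os.path.join(dataset_dir, ".gitattributes")
    with open(path, "w") as f:
        f.write("\n".join(rules) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    attributes_path = write_gitattributes(tmp, RULES)
    with open(attributes_path) as f:
        written = f.read()
print(written)
```

Note that this helper overwrites the file. In an existing dataset it would be safer to append the rules to the .gitattributes that is already there and then save the change.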
- Procedures provided by DataLad (e.g., text2git, yoda) or created and shared by users (Tutorial)
- Rules in .gitattributes (e.g., based on file type, file/path name, size, ...; rules and examples)
- Per command (e.g., datalad save --to-git)

The datalad-container extension gives DataLad commands to register software containers as "just another file" in your dataset, and to run an analysis inside the container with datalad containers-run, capturing software as additional provenance.
"In defense of decentralized Research Data Management", doi.org/10.1515/nf-2020-0037
datalad tag
Funders
Collaborators
$ datalad drop input/sub-02
drop(ok): input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (file)
drop(ok): input/sub-02 (directory)
action summary:
drop (ok: 2)
$ datalad drop --what all input
uninstall(ok): input (dataset)
# The command operates outside of the to-be-removed dataset!
$ datalad remove inputs
uninstall(ok): inputs (dataset)
$ datalad drop figures/sub-02_mean-epi.png
drop(error): figures/sub-02_mean-epi.png (file) [unsafe; Could only verify the existence of 0 out of 1 necessary
copy; (Use --reckless availability to override this check, or
adjust numcopies.)]
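The error above reflects a simple rule: content may only be dropped if enough other copies could be verified. The following is a toy model of that decision, not git-annex's or DataLad's implementation; the function name and parameters are hypothetical.

```python
def can_drop(verified_other_copies, numcopies=1, reckless=False):
    """Toy model: allow a drop only if enough other copies were verified."""
    if reckless:
        # Mirrors overriding the availability check: skip it entirely.
        return True
    return verified_other_copies >= numcopies


decisions = {
    "no other copy": can_drop(0),
    "one remote copy": can_drop(1),
    "no other copy, reckless": can_drop(0, reckless=True),
}
for case, allowed in decisions.items():
    print(f"{case}: {'drop' if allowed else 'refuse'}")
```

The first case corresponds to the message "0 out of 1 necessary copy": the file exists only locally, so dropping it would lose the content.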
$ datalad drop figures/sub-02_mean-epi.png --reckless availability
$ datalad remove -d my-analysis
uninstall(error): . (dataset) [to-be-dropped dataset has revisions that are not available at any known
sibling. Use `datalad push --to ...` to push these before dropping the local dataset,
or ignore via `--reckless availability`. Unique revisions: ['main']]
$ datalad remove -d my-analysis --reckless availability
$ rm -rf local-dataset
rm: cannot remove 'local-dataset/.git/annex/objects/Kj/44/MD5E-s42--8f008874ab52d0ff02a5bbd0174ac95e.txt/
MD5E-s42--8f008874ab52d0ff02a5bbd0174ac95e.txt': Permission denied
$ chmod +w -R local-dataset
$ rm -rf local-dataset
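The same two steps (restore write permission, then delete) can be sketched in Python with the standard library only. The function name is hypothetical and the demo builds its own throwaway directory tree with a read-only file inside a read-only directory, mimicking the write protection of .git/annex/objects. Like the shell version, this deletes content irreversibly.

```python
import os
import shutil
import stat
import tempfile


def remove_tree_with_readonly_content(root):
    """Give the owner write permission everywhere below root, then delete it."""
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in paths:
            if os.path.islink(path):
                continue  # never chmod through a (possibly dangling) symlink
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            os.chmod(path, mode | stat.S_IWUSR)
    shutil.rmtree(root)


with tempfile.TemporaryDirectory() as tmp:
    dataset = os.path.join(tmp, "local-dataset")
    objects = os.path.join(dataset, "objects")
    os.makedirs(objects)
    target = os.path.join(objects, "key.txt")
    with open(target, "w") as f:
        f.write("annexed content")
    os.chmod(target, 0o444)   # read-only file
    os.chmod(objects, 0o555)  # read-only directory: entries cannot be removed
    remove_tree_with_readonly_content(dataset)
    removed = not os.path.exists(dataset)
print(removed)
```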