Adina Wagner |
Psychoinformatics Lab,
Institute of Neuroscience and Medicine (INM-7), Research Center Jülich
Comprehensive user documentation in the DataLad Handbook (handbook.datalad.org)
Overview of most tutorials, talks, videos, ... at github.com/datalad/tutorials

Please try to log in now
$ datalad save -m "Saving changes" --recursive
$ datalad save -h
Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
[-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
[--version]
[PATH ...]
Use '--help' to get more comprehensive information.
datalad --version
datalad --help
datalad wtf
Let's find out what kind of system we're on:
datalad wtf -S system
git config --get user.name
git config --get user.email
git config --global user.name "Adina Wagner"
git config --global user.email "adina.wagner@t-online.de"
git config --global --add datalad.extensions.load next
ipython
import datalad.api as dl
dl.create(path='mydataset')
exit
datalad create mydataset
import datalad.api as dl
dl.create(path="mydataset")
# in R
> system("datalad create mydataset")
(Figure: Terminal view)
(Figure: File viewer)
Create a dataset (here with the text2git configuration, which adds a helpful default: text files are kept in Git):
datalad create -c text2git my-analysis
Navigate into the dataset with cd (change directory):
cd my-analysis
List its contents with ls:
ls -la .
Stay flexible: use whatever editor or tool you prefer to make changes, e.g.:
echo "# My example DataLad dataset" > README.md
Check the status of the dataset:
datalad status
Save the change with save:
datalad save -m "Create a short README"
echo "This dataset contains a toy data analysis" >> README.md
git diff
datalad save -m "Add information on the dataset contents to the README"
git log
tig
(navigate with arrow keys and enter, press "q" to go back and exit the program)
Analysis code evolves (fix bugs, add functions, refactor, ...)
Data changes (errors are fixed, data is extended, naming standards change, an analysis requires only a subset of your data, ...)
With the datalad-container extension, we can add not only code or data but also
software containers to datasets and work with them.
Let's add a software container with Python software for later:
datalad containers-add nilearn \
--url shub://adswa/nilearn-container:latest
datalad containers-list
Provenance = "The tools and processes used to create a digital file, the responsible entity, and when and where the process events occurred"
wget -P code/ \
https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
The wget command downloaded a script for extracting a brain mask:
datalad status
datalad save -m "Adding a nilearn-based script for brain masking"
datalad download-url -m "Add a tutorial on nilearn" \
-O code/nilearn-tutorial.pdf \
https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf
datalad status
git log code/nilearn-tutorial.pdf
black code/get_brainmask.py
git diff
git restore code/get_brainmask.py
datalad run -m "Reformat code with black" \
"black code/get_brainmask.py"
git show
datalad rerun
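What makes rerun possible is that datalad run stores a machine-readable record of the command, its inputs, and its outputs alongside the commit. A minimal sketch of that idea (the field names here are illustrative, not DataLad's exact schema):

```python
import json

# Illustrative run record: the command plus its declared inputs/outputs
record = {
    "cmd": "python code/get_brainmask.py",
    "inputs": ["input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz"],
    "outputs": ["figures/*", "sub-02*"],
}

# Serializing it makes the provenance machine-readable ...
serialized = json.dumps(record)

# ... so a later "rerun" can recover the exact command to execute again
recovered = json.loads(serialized)
print(recovered["cmd"])
```

Because the record travels with the Git history, anyone with a clone can re-execute the captured command.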
$ datalad clone --dataset . http://example.com/ds inputs/rawdata
$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "inputs/rawdata"]
+ path = inputs/rawdata
+ datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
+ datalad-url = http://example.com/importantds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
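The `.gitmodules` entry shown in the diff is plain INI-style configuration, so it can also be inspected programmatically. A sketch using Python's standard configparser on the exact snippet above:

```python
import configparser

# The .gitmodules content from the diff above
gitmodules = """
[submodule "inputs/rawdata"]
path = inputs/rawdata
datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
datalad-url = http://example.com/importantds
"""

config = configparser.ConfigParser()
config.read_string(gitmodules)

section = config['submodule "inputs/rawdata"']
print(section["path"])         # where the subdataset is checked out
print(section["datalad-url"])  # where a clone can obtain it
```

This is how a clone of the superdataset knows where each subdataset lives and where its content can come from.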
datalad clone -d . \
https://gin.g-node.org/adswa/bids-data \
input
Query registered subdatasets with the subdatasets command:
datalad subdatasets
Inspect the latest commit with git show:
git show
cd input
ls
tig
Check the dataset's size with the du (disk usage) command:
du -sh
datalad status --annex
You can get or drop annexed file contents depending on your needs:
datalad get sub-02
datalad drop sub-02
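get and drop work because git-annex tracks which locations hold each file's content (as the whereis output later shows). A toy model of that availability tracking, with made-up names, assuming drop refuses to remove the last known copy:

```python
# Toy model of git-annex location tracking (not DataLad's implementation):
# each file maps to the set of locations that hold its content.
locations = {
    "sub-02_bold.nii.gz": {"gin.g-node.org", "here"},
}

def drop(key):
    """Remove local content, but only if another copy remains elsewhere."""
    remotes = locations[key] - {"here"}
    if not remotes:
        raise RuntimeError("unsafe: this is the last known copy")
    locations[key].discard("here")

def get(key):
    """Re-obtain content from any location that has it."""
    if not locations[key] - {"here"}:
        raise RuntimeError("no known source for this content")
    locations[key].add("here")

drop("sub-02_bold.nii.gz")  # content gone locally, still on the remote
get("sub-02_bold.nii.gz")   # fetched back on demand
```

The real safety check is git-annex's numcopies accounting; the sketch only illustrates why dropping is cheap and reversible.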
cd ..
datalad run -m "Compute brain mask" \
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
--output "figures/*" \
--output "sub-02*" \
"python code/get_brainmask.py"
The datalad-container extension gives DataLad commands to register software containers as "just another file" in your dataset, and datalad containers-run runs analyses inside the container, capturing the software environment as additional provenance.
Execute the analysis with the containers-run command:
datalad containers-run -m "Compute brain mask" \
-n nilearn \
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
--output "figures/*" \
--output "sub-02*" \
"python code/get_brainmask.py"
git log sub-02_brain-mask.nii.gz
datalad rerun
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
# eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (~200TB, >15 million files)
$ du -sh
48G .
Git does not handle large files well.
And repository hosting services refuse to handle large files:


git-annex to the rescue! Let's take a look at how it works:
$ ls -l sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
lrwxrwxrwx 1 adina adina 142 Jul 22 19:45 sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz ->
../../.git/annex/objects/kZ/K5/MD5E-s24180157--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz/MD5E-s24180157--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz
(PS: especially useful in datasets with many identical files)
$ md5sum sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
aeb0e5f2e2d5fe4ade97117a8cc5232f sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
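The symlink target encodes a git-annex key of the form `MD5E-s<size>--<md5><extension>`: the file's size plus its MD5 checksum, as the md5sum output confirms. Because the key is derived purely from content, identical files map to the same object and are stored only once. A sketch of the key construction (illustrative, not git-annex's code):

```python
import hashlib

def annex_key(content: bytes, extension: str) -> str:
    """Build an MD5E-style key: size + MD5 checksum + original extension."""
    md5 = hashlib.md5(content).hexdigest()
    return f"MD5E-s{len(content)}--{md5}{extension}"

a = annex_key(b"same bytes", ".nii.gz")
b = annex_key(b"same bytes", ".nii.gz")
print(a == b)  # identical content -> identical key -> deduplicated storage
```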

$ git annex whereis code/nilearn-tutorial.pdf
whereis code/nilearn-tutorial.pdf (2 copies)
cf13d535-b47c-5df6-8590-0793cb08a90a -- [datalad]
e763ba60-7614-4b3f-891d-82f2488ea95a -- jovyan@jupyter-adswa:~/my-analysis [here]
datalad: https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf
Delineation and advantages of decentral versus central RDM: Hanke et al. (2021), In defense of decentralized research data management
Two consequences:
datalad get retrieves file content from wherever it is stored.
| Git | git-annex |
| handles small files well (text, code) | handles all types and sizes of files well |
| file contents are in the Git history and will be shared upon git/datalad push | file contents are in the annex. Not necessarily shared |
| Shared with every dataset clone | Can be kept private on a per-file level when sharing the dataset |
| Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files | Useful: Large files, private files |
Which files are annexed is configurable:
via run procedures shipped with DataLad (e.g., text2git, yoda) or created and shared by users (Tutorial)
via .gitattributes (e.g., based on file type, file/path name, size, ...; rules and examples)
ad hoc at save time (datalad save --to-git)
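Such .gitattributes rules use git-annex's largefiles expressions; a sketch of what they can look like (the paths and the size threshold here are made up for illustration):

```
# Illustrative .gitattributes rules (hypothetical paths/thresholds):
# keep everything under code/ in Git
code/* annex.largefiles=nothing
# annex only files larger than 100 kB
* annex.largefiles=(largerthan=100kb)
```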
$ git config --local remote.github.datalad-publish-depends gdrive
# or
$ datalad siblings add --name origin --url git@git.jugit.fzj.de:adswa/experiment-data.git --publish-depends s3
Special case 1: repositories with annex support
Special case 2: Special remotes with repositories
Publishing to OSF
Publishing to OSF requires the datalad-osf and datalad-next extensions. First, set up credentials with datalad osf-credentials:
datalad osf-credentials
datalad create-sibling-osf -d . -s my-osf-sibling \
--title 'my-osf-project-title' --mode export --public
datalad push -d . --to my-osf-sibling
cd ..
datalad clone osf://my-osf-project-id my-osf-clone