Adina Wagner (mas.to/@adswa)
Psychoinformatics lab, Institute of Neuroscience and Medicine (INM-7), Research Center Jülich
Comprehensive user documentation in the DataLad Handbook (handbook.datalad.org)
Overview of most tutorials, talks, videos, ... at github.com/datalad/tutorials
Please try to log in now
$ datalad save -m "Saving changes" --recursive
$ datalad save -h
Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
[-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
[--version]
[PATH ...]
Use '--help' to get more comprehensive information.
datalad --version
datalad --help
datalad wtf
Let's find out what kind of system we're on:
datalad wtf -S system
git config --get user.name
git config --get user.email
git config --global user.name "Adina Wagner"
git config --global user.email "adina.wagner@t-online.de"
git config --global --add datalad.extensions.load next
ipython
import datalad.api as dl
dl.create(path='mydataset')
exit
datalad create mydataset
import datalad.api as dl
dl.create(path="mydataset")
# in R
> system("datalad create mydataset")
Create a dataset (here with the text2git configuration, which adds a helpful setup that keeps text files in Git):
datalad create -c text2git my-analysis
Enter the dataset with cd (change directory):
cd my-analysis
List its contents, including hidden files, with ls:
ls -la .
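The hidden files reveal that a dataset is a Git repository underneath. A minimal stand-in sketch using plain Git (no DataLad required; the `demo` directory name is made up for illustration):

```shell
# Create a throwaway directory and initialize a Git repository in it,
# mimicking the version-control layer underneath `datalad create`
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q demo
cd demo
# Like a DataLad dataset, the repository keeps its bookkeeping in hidden files
ls -la .
```

In a real DataLad dataset you would additionally see a hidden .datalad/ directory with dataset-level configuration.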
Stay flexible: use any tool of your choice to add or modify files, e.g., a README:
echo "# My example DataLad dataset" > README.md
Check the status of the dataset:
datalad status
Save the change with save:
datalad save -m "Create a short README"
echo "This dataset contains a toy data analysis" >> README.md
git diff
datalad save -m "Add information on the dataset contents to the README"
git log
tig
(navigate with arrow keys and enter, press "q" to go back and exit the program)
- Analysis code evolves (fix bugs, add functions, refactor, ...)
- Data changes (errors are fixed, data is extended, naming standards change, an analysis requires only a subset of your data, ...)
With the datalad-container extension, we can add not only code or data, but also software containers to datasets and work with them.
Let's add a software container with Python software for later:
datalad containers-add nilearn \
--url shub://adswa/nilearn-container:latest
datalad containers-list
= "The tools and processes used to create a digital file, the responsible entity, and when and where the process events occurred"
wget -P code/ \
https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
The wget command downloaded a script for extracting a brain mask:
datalad status
datalad save -m "Adding a nilearn-based script for brain masking"
datalad download-url -m "Add a tutorial on nilearn" \
-O code/nilearn-tutorial.pdf \
https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf
datalad status
git log code/nilearn-tutorial.pdf
black code/get_brainmask.py
git diff
git restore code/get_brainmask.py
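`git restore` discards unsaved modifications by checking the file out again from its last committed state. A self-contained sketch with plain Git (the throwaway repository and file name are made up for illustration):

```shell
# Set up a throwaway repository with one committed file
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo "print('hello')" > script.py
git add script.py
git -c user.name=demo -c user.email=demo@example.com commit -qm "Add script"
# Introduce an unwanted modification ...
echo "broken" > script.py
# ... and discard it again, restoring the committed content
git restore script.py
cat script.py
```

This only works for changes that are not yet saved; once committed, you would revert with history-rewriting or reverting commands instead.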
datalad run -m "Reformat code with black" \
"black code/get_brainmask.py"
git show
datalad rerun
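`datalad run` executes a command, saves the resulting changes, and records the executed command in the commit, so that `datalad rerun` can re-execute it later. The mechanics can be sketched with plain Git (a deliberately simplified stand-in, not DataLad's actual record format):

```shell
# Throwaway repository standing in for a dataset
ds=$(mktemp -d)
cd "$ds"
git init -q .
# "run": execute a command and commit its outputs together with the command
cmd="echo hello > out.txt"
eval "$cmd"
git add out.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm "record: $cmd"
# "rerun": read the recorded command back from history and execute it again
recorded=$(git log -1 --pretty=%s | sed 's/^record: //')
eval "$recorded"
cat out.txt
```

DataLad stores a structured, machine-readable run record in the commit message, which is what makes exact re-execution possible.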
$ datalad clone --dataset . http://example.com/ds inputs/rawdata
$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "inputs/rawdata"]
+ path = inputs/rawdata
+ datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
+ datalad-url = http://example.com/importantds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
datalad clone -d . \
https://gin.g-node.org/adswa/bids-data \
input
List registered subdatasets with the subdatasets command:
datalad subdatasets
git show
cd input
ls
tig
Check how much space the clone occupies with du (the disk-usage command):
du -sh
datalad status --annex
Use get or drop to retrieve or remove annexed file contents depending on your needs:
datalad get sub-02
datalad drop sub-02
cd ..
datalad run -m "Compute brain mask" \
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
--output "figures/*" \
--output "sub-02*" \
"python code/get_brainmask.py"
The datalad-container extension gives DataLad commands to register software containers as "just another file" in your dataset, and to run analyses inside the container with the containers-run command, capturing the software environment as additional provenance:
datalad containers-run -m "Compute brain mask" \
-n nilearn \
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
--output "figures/*" \
--output "sub-02*" \
"python code/get_brainmask.py"
git log sub-02_brain-mask.nii.gz
datalad rerun
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
# eNKI dataset (1.5TB, 34k files); a clone takes only:
$ du -sh
1.5G .
# HCP dataset (~200TB, >15 million files); a clone takes only:
$ du -sh
48G .
Git does not handle large files well.
And repository hosting services refuse to handle large files:
git-annex to the rescue! Let's take a look at how it works:
$ ls -l sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
lrwxrwxrwx 1 adina adina 142 Jul 22 19:45 sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz ->
../../.git/annex/objects/kZ/K5/MD5E-s24180157--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz/MD5E-s24180157
--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz
(PS: especially useful in datasets with many identical files)
$ md5sum sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
aeb0e5f2e2d5fe4ade97117a8cc5232f sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
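The symlink target encodes a checksum-based key, so identical content maps to a single annexed object. The idea can be sketched with coreutils (a simplified stand-in for git-annex's object store; the paths and file names are made up for illustration):

```shell
# Two files with identical content in a throwaway directory
store=$(mktemp -d)
cd "$store"
mkdir -p .annex/objects
echo "same content" > a.dat
echo "same content" > b.dat
# Derive a content-addressed key from the checksum
key=$(md5sum a.dat | cut -d' ' -f1)
# Store the content once, then point both file names at the same object
mv a.dat ".annex/objects/$key"
rm b.dat
ln -s ".annex/objects/$key" a.dat
ln -s ".annex/objects/$key" b.dat
# Reading either file resolves the symlink to the single stored object
cat a.dat
```

Because both symlinks share one target, the duplicated content occupies disk space only once.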
$ git annex whereis code/nilearn-tutorial.pdf
whereis code/nilearn-tutorial.pdf (2 copies)
cf13d535-b47c-5df6-8590-0793cb08a90a -- [datalad]
e763ba60-7614-4b3f-891d-82f2488ea95a -- jovyan@jupyter-adswa:~/my-analysis [here]
datalad: https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf
Delineation and advantages of decentral versus central RDM: Hanke et al., (2021). In defense of decentralized research data management
Two consequences: annexed file contents are not automatically present in a clone, and datalad get retrieves them from wherever they are stored.
| Git | git-annex |
| --- | --- |
| handles small files well (text, code) | handles all types and sizes of files well |
| file contents are in the Git history and will be shared upon git/datalad push | file contents are in the annex; not necessarily shared |
| shared with every dataset clone | can be kept private on a per-file level when sharing the dataset |
| useful for: small, non-binary, frequently modified, need-to-be-accessible files (DUA, README) | useful for: large files, private files |
What ends up in Git versus the annex can be controlled via dataset configurations that ship with DataLad (e.g., text2git, yoda) or are created and shared by users (Tutorial), via .gitattributes rules (e.g., based on file type, file/path name, size, ...; rules and examples), or per command call (e.g., datalad save --to-git).
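Such rules live in the dataset's .gitattributes file. A hypothetical example (the patterns are invented for illustration, not taken from a shipped configuration) that annexes only large files and always keeps Markdown text in Git:

```
# annex only files larger than 100 KB
* annex.largefiles=(largerthan=100kb)
# always keep Markdown text in Git
*.md annex.largefiles=nothing
```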
$ git config --local remote.github.datalad-publish-depends gdrive
# or
$ datalad siblings add --name origin --url git@git.jugit.fzj.de:adswa/experiment-data.git --publish-depends s3
Special case 1: repositories with annex support
Special case 2: Special remotes with repositories
Publishing to OSF
Requires the datalad-osf and datalad-next extensions. First, set up authentication with datalad osf-credentials:
datalad osf-credentials
datalad create-sibling-osf -d . -s my-osf-sibling \
--title 'my-osf-project-title' --mode export --public
datalad push -d . --to my-osf-sibling
cd ..
datalad clone osf://my-osf-project-id my-osf-clone