adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
15530572 annex'd files (77.9 TB recorded total size)
nothing to save, working tree clean
(github.com/datalad-datasets/human-connectome-project-openaccess)
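Despite the recorded size, the superdataset itself clones in seconds, because file content stays remote until requested. A sketch (the get path is a placeholder; retrieving actual HCP content additionally requires valid credentials):
$ datalad clone \
    https://github.com/datalad-datasets/human-connectome-project-openaccess.git hcp
$ cd hcp
$ datalad get <path/to/file>   # fetch individual file content on demand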
/dataset
├── sample1
│ └── a001.dat
├── sample2
│ └── a001.dat
...
Without modularity, after a transform is applied (preprocessing, analysis, ...):
/dataset
├── sample1
│ ├── ps34t.dat
│ └── a001.dat
├── sample2
│ ├── ps34t.dat
│ └── a001.dat
...
Without expert/domain knowledge, it is impossible to distinguish original from derived data.
/raw_dataset
├── sample1
│ └── a001.dat
├── sample2
│ └── a001.dat
...
With modularity, after a transform is applied (preprocessing, analysis, ...):
/derived_dataset
├── sample1
│ └── ps34t.dat
├── sample2
│ └── ps34t.dat
├── ...
└── inputs
└── raw
├── sample1
│ └── a001.dat
├── sample2
│ └── a001.dat
...
Clearer separation of semantics: a pristine version of the original dataset is nested within a new, additional dataset that holds the outputs.
$ datalad clone --dataset . http://example.com/ds inputs/rawdata
$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "inputs/rawdata"]
+ path = inputs/rawdata
+ url = http://example.com/ds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
Each (sub)dataset is a separate, yet jointly version-controlled entity.
As long as none of its data is retrieved, a subdataset is an extremely lightweight data dependency,
yet it remains fully actionable: datalad get retrieves contents on demand.
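A minimal sketch of this on-demand behavior, reusing the hypothetical paths from the example above:
$ datalad get inputs/rawdata/sample1/a001.dat    # installs the subdataset and fetches one file
$ datalad drop inputs/rawdata/sample1/a001.dat   # content can be dropped and re-retrieved any time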
Link precisely versioned inputs to version-controlled outputs
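Because inputs are pinned to an exact commit, updating an input is itself a versioned change. A sketch, assuming a recent DataLad version with the --how flag:
$ datalad update -d . --how merge inputs/rawdata          # fetch and merge the new subdataset state
$ datalad save -d . -m "Update raw data" inputs/rawdata   # record the new pinned version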
Science has many different building blocks: code, software, and data produce research outputs.
The more of them you share, the more likely others can reproduce your results.
The datalad-container extension provides DataLad commands to add, track, retrieve, and
execute Docker or Singularity containers. It can be installed via pip or conda.
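For example (assuming availability from the conda-forge channel):
$ pip install datalad-container
$ conda install -c conda-forge datalad-container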
# create a superdataset
$ datalad create -c text2git bids-data
[INFO ] Creating a new annex repo at /home/adina/bids-data
[INFO ] scanning for unlocked files (this may take some time)
[INFO ] Running procedure cfg_text2git
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
create(ok): /home/adina/bids-data (dataset)
$ cd bids-data
# create a README
$ echo "# A BIDS structured dataset for my input data" > README.md
$ datalad status
untracked: README.md (file)
$ datalad save -m "Add a short README"
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
save (ok: 1)
# add the input data (DICOMs) as a subdataset
$ datalad clone --dataset . \
https://github.com/datalad/example-dicom-functional.git \
inputs/rawdata
install(ok): inputs/rawdata (dataset)
add(ok): inputs/rawdata (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
add (ok: 3)
install (ok: 1)
save (ok: 2)
$ datalad clone -d . \
https://github.com/ReproNim/containers.git \
code/containers
[INFO ] scanning for unlocked files (this may take some time)
[INFO ] Remote origin not usable by git-annex; setting annex-ignore
install(ok): code/containers (dataset)
add(ok): code/containers (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
add (ok: 3)
install (ok: 1)
save (ok: 2)
# list all available containers across the dataset hierarchy
$ datalad containers-list --recursive
[...]
code/containers/repronim-reproin -> code/containers/images/repronim/repronim-reproin--0.9.0.sing
[...]
# list the direct subdatasets of bids-data
$ datalad subdatasets
subdataset(ok): code/containers (dataset)
subdataset(ok): inputs/rawdata (dataset)
# use datalad containers-run to run the conversion and save its provenance
$ datalad containers-run -m "Convert subject 02 to BIDS" \
--container-name code/containers/repronim-reproin \
--input inputs/rawdata/dicoms \
--output sub-02 \
"-f reproin -s 02 --bids -l "" --minmeta -o . --files inputs/rawdata/dicoms"
[...]
save(ok): . (dataset)
action summary:
add (ok: 18)
get (notneeded: 4, ok: 1)
save (notneeded: 2, ok: 1)
$ datalad siblings add -d . \
--name gin \
--url git@gin.g-node.org:/adswa/bids-data.git
$ datalad siblings
.: here(+) [git]
[WARNING] Could not detect whether gin carries an annex. If gin is a pure Git remote, this is expected.
.: gin(-) [git@gin.g-node.org:/adswa/bids-data.git (git)]
$ datalad push --to gin
copy(ok): sourcedata/sub-02/func/sub-02_task-oneback_run-01_bold.dicom.tgz (file) [to gin...]
copy(ok): sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (file) [to gin...]
publish(ok): . (dataset) [refs/heads/git-annex->gin:refs/heads/git-annex 68523d8..b4c1ee0]
publish(ok): . (dataset) [refs/heads/master->gin:refs/heads/master [new branch]]
action summary:
copy (ok: 2)
publish (ok: 2)
$ cd ../
# create a new dataset for your analysis. The yoda procedure pre-structures it
# and applies configurations that ensure that scripts are versioned in Git
$ datalad create -c yoda myanalysis
[INFO ] Creating a new annex repo at /home/adina/myanalysis
[INFO ] scanning for unlocked files (this may take some time)
[INFO ] Running procedure cfg_yoda
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
create(ok): /home/adina/myanalysis (dataset)
$ cd myanalysis
$ tree
.
├── CHANGELOG.md
├── code
│ └── README.md
└── README.md
1 directory, 3 files
# add the BIDS-structured data as input - in the form of a subdataset
$ datalad clone -d . \
https://gin.g-node.org/adswa/bids-data \
input
[INFO ] scanning for unlocked files (this may take some time)
install(ok): input (dataset)
add(ok): input (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
add (ok: 3)
install (ok: 1)
save (ok: 2)
# Get a script for the analysis
$ datalad download-url -m "Download code for brain masking from GitHub" \
  -O code/get_brainmask.py \
  https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
[INFO ] Downloading 'https://raw.githubusercontent.com/datalad-handbook/...
https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py:
download_url(ok): /home/adina/myanalysis/code/get_brainmask.py (file)
add(ok): code/get_brainmask.py (file)
save(ok): . (dataset)
action summary:
add (ok: 1)
download_url (ok: 1)
save (ok: 1)
# Add a container with all relevant Python software
$ datalad containers-add nilearn \
--url shub://adswa/nilearn-container:latest \
--call-fmt "singularity exec {img} {cmd}"
[INFO ] Initiating special remote datalad
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /home/adina/myanalysis/.datalad/environments/nilearn/image (file)
action summary:
add (ok: 1)
containers_add (ok: 1)
save (ok: 1)
# run your containerized analysis reproducibly
$ datalad containers-run -m "Compute brain mask" \
-n nilearn \
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
--output figures/ \
--output "sub-02*" \
"python code/get_brainmask.py"'
[INFO ] Making sure inputs are available (this may take some time)
get(ok): input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz (file) [from origin...]
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
add(ok): figures/sub-02_brainmask.png (file)
add(ok): figures/sub-02_mean-epi.png (file)
add(ok): sub-02_brain-mask.nii.gz (file)
save(ok): . (dataset)
action summary:
add (ok: 3)
get (notneeded: 2, ok: 1)
save (notneeded: 1, ok: 1)
# Ask your results how they came to be
$ git log sub-02_brain-mask.nii.gz
commit d2d35eb31a93a0a82163835de0e3c14946504811 (HEAD -> master)
Author: Adina Wagner
Date: Wed Apr 20 16:05:40 2022 +0200
[DATALAD RUNCMD] Compute brain mask
=== Do not change lines below ===
{
"chain": [],
"cmd": "singularity exec .datalad/environments/nilearn/image python code/get_brainmask.py",
"dsid": "421d677c-2873-49f0-a1a9-9c7bb0100e69",
"exit": 0,
"extra_inputs": [
".datalad/environments/nilearn/image"
],
"inputs": [
"input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz"
],
"outputs": [
"figures/",
"sub-02*"
],
"pwd": "."
}
^^^ Do not change lines above ^^^
# ... or recompute them
$ datalad rerun
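Without arguments, datalad rerun re-executes the command recorded in the most recent run commit; it can also be pointed at a specific commit, e.g. the run commit from the log above:
$ datalad rerun d2d35eb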
datalad run records a command and its impact on the dataset.
Data/directories specified as --input are retrieved prior to command execution.
Data/directories specified as --output are unlocked for modification prior to a rerun of the command.
datalad containers-run from the datalad-container extension extends datalad run and can be used to capture the software environment as provenance.
datalad run wraps any command: ipython/R/matlab/...
Did you know you can archive a dataset to Figshare?
If you have a Figshare account, you can do the following:
$ datalad export-to-figshare
[INFO ] Exporting current tree as an archive under /tmp/comics since figshare does not support directories
[INFO ] Uploading /tmp/comics/datalad_ce82ff1f-e2b3-4a84-9e56-87d8eb6e5b27.zip to figshare
Article
Would you like to create a new article to upload to? If not - we will list existing articles (choices: yes, no): yes
New article
Please enter the title (must be at least 3 characters long). [comics#ce82ff1f-e2b3-4a84-9e56-87d8eb6e5b27]: acomictest
[INFO ] Created a new (private) article 13247186 at https://figshare.com/account/articles/13247186. Please visit it, enter additional meta-data and make public
[INFO ] 'Registering' /tmp/comics/datalad_ce82ff1f-e2b3-4a84-9e56-87d8eb6e5b27.zip within annex
[INFO ] Adding URL https://ndownloader.figshare.com/files/25509824 for it
[INFO ] Registering links back for the content of the archive
[INFO ] Adding content of the archive /tmp/comics/datalad_ce82ff1f-e2b3-4a84-9e56-87d8eb6e5b27.zip into annex AnnexRepo(/tmp/comics)
[INFO ] Initiating special remote datalad-archives
[INFO ] Finished adding /tmp/comics/datalad_ce82ff1f-e2b3-4a84-9e56-87d8eb6e5b27.zip: Files processed: 4, removed: 4, +git: 2, +annex: 2
[INFO ] Removing generated and now registered in annex archive
export_to_figshare(ok): Dataset(/tmp/comics) [Published archive https://ndownloader.figshare.com/files/25509824]
The datalad-container extension can help you use and share software environments in your dataset:
$ singularity pull docker://nipy/heudiconv:0.5.4
$ singularity pull shub://adswa/python-ml:1
INFO: Downloading shub image
265.56 MiB / 265.56 MiB [==================================================] 100.00% 10.23 MiB/s 25s
$ sudo singularity build myimage Singularity.2
INFO: Starting build...
Getting image source signatures
Copying blob 831751213a61 done
[...]
INFO: Creating SIF file...
INFO: Build complete: myimage
$ ls
heudiconv_0.5.4.sif
python-ml_1.sif
$ datalad containers-add software --url /home/me/singularity/myimage
[INFO ] Copying local file myimage to /home/adina/repos/resources/.datalad/environments/software/image
add(ok): .datalad/environments/software/image (file)
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /home/adina/repos/resources/.datalad/environments/software/image (file)
action summary:
add (ok: 2)
containers_add (ok: 1)
save (ok: 1)
$ datalad containers-list
software -> .datalad/environments/software/image
$ datalad containers-add software --url shub://adswa/python-ml:1
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /tmp/bla/.datalad/environments/software/image (file)
action summary:
add (ok: 1)
containers_add (ok: 1)
save (ok: 1)
$ docker pull repronim/neurodocker:latest
latest: Pulling from repronim/neurodocker
$ sudo docker build -t adwagner/somedockercontainer .
Sending build context to Docker daemon 6.656kB
Step 1/4 : FROM python:3.6
[...]
Successfully built 31d6acc37184
Successfully tagged adwagner/somedockercontainer:latest
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
repronim/neurodocker latest 84b9023f0019 7 months ago 81.5MB
adwagner/min_preproc latest fca4a144b61f 8 months ago 5.96GB
[...]
$ datalad containers-add --url dhub://busybox:1.30 bb
[INFO] Saved busybox:1.30 to C:\Users\datalad\testing\blablablabla\.datalad\environments\bb\image
add(ok): .datalad\environments\bb\image\64f5d945efcc0f39ab11b3cd4ba403cc9fefe1fa3613123ca016cf3708e8cafb.json (file)
add(ok): .datalad\environments\bb\image\a57c26390d4b78fd575fac72ed31f16a7a2fa3ebdccae4598513e8964dace9b2\VERSION (file)
add(ok): .datalad\environments\bb\image\a57c26390d4b78fd575fac72ed31f16a7a2fa3ebdccae4598513e8964dace9b2\json (file)
add(ok): .datalad\environments\bb\image\a57c26390d4b78fd575fac72ed31f16a7a2fa3ebdccae4598513e8964dace9b2\layer.tar (file)
add(ok): .datalad\environments\bb\image\manifest.json (file)
add(ok): .datalad\environments\bb\image\repositories (file)
add(ok): .datalad\config (file)
save(ok): . (dataset)
containers_add(ok): C:\Users\datalad\testing\blablablabla\.datalad\environments\bb\image (file)
action summary:
add (ok: 7)
containers_add (ok: 1)
save (ok: 1)
datalad containers-run executes any command inside the specified container. How does it work?
$ cat .datalad/config
[datalad "containers.midterm-software"]
updateurl = shub://adswa/resources:1
image = .datalad/environments/midterm-software/image
cmdexec = singularity exec {img} {cmd}
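Given this configuration, containers-run fills the cmdexec template with the registered image and the given command; a sketch with a hypothetical script name:
$ datalad containers-run -n midterm-software "python code/script.py"
# which executes, roughly:
#   singularity exec .datalad/environments/midterm-software/image python code/script.py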
$ datalad containers-add fmriprep \
--url shub://ReproNim/containers:bids-fmriprep--20.1.1 \
--call-fmt 'singularity run --cleanenv -B $PWD,$PWD/.tools/license.txt {img} {cmd}'