Adina Wagner
@AdinaKrik
Psychoinformatics lab,
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7), Research Center Jülich
ReproNim/INCF fellow
Funders
Collaborators
$ datalad create mydataset

# in Python
import datalad.api as dl
dl.create(path="mydataset")

# in R
> system("datalad create mydataset")
clone a dataset from a public or private place
and get access to the data it tracks
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .  # it's tiny!
datalad get:
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
datalad drop:
$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
drop(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]
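This get/drop economy works because annexed file content is addressed by a checksum-derived key rather than stored in Git itself. Below is an illustrative sketch of how a key in the style of git-annex's SHA256E backend is built from a file's content, size, and extension; the exact key format used by git-annex has more details, so treat this as an approximation:

```python
import hashlib

def annex_style_key(content: bytes, extension: str = "") -> str:
    """Sketch of a git-annex SHA256E-style key:
    SHA256E-s<size>--<sha256 hex><extension> (illustrative, not exact)."""
    digest = hashlib.sha256(content).hexdigest()
    return f"SHA256E-s{len(content)}--{digest}{extension}"

# The working tree holds a symlink pointing at the key; dropping content
# removes the keyed payload while the symlink (and the key) stay in Git.
print(annex_style_key(b"hello world\n", ".nii.gz"))
```

Because the key survives in Git after a drop, `datalad get` can always re-fetch the identical content from any registered source.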
# eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (80TB, 15 million files)
$ du -sh
48G .
$ datalad clone https://github.com/OpenNeuroDatasets/ds003171.git
$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess.git
$ datalad clone ///  # '///' is a shortcut for the DataLad superdataset at datasets.datalad.org
datalad create
$ datalad create -c yoda myanalysis
[INFO ] Creating a new annex repo at /tmp/myanalysis
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Running procedure cfg_yoda
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
create(ok): /tmp/myanalysis (dataset)
-c yoda applies useful pre-structuring and configurations:
$ tree
.
├── CHANGELOG.md
├── code
│   └── README.md
└── README.md
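Besides the directory skeleton, the yoda procedure configures how files are tracked: content in code/ goes into Git directly instead of the annex, so scripts stay human-readable in clones. A sketch of the rule it writes (the exact rules set by cfg_yoda may differ slightly between versions):

```
# code/.gitattributes (as configured by the yoda procedure)
* annex.largefiles=nothing
```

With `annex.largefiles=nothing`, `datalad save` commits everything under code/ to Git rather than annexing it.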
datalad save
# create a data analysis script
$ datalad status
untracked: code/script.py (file)
$ git status
On branch master
Untracked files:
(use "git add file..." to include in what will be committed)
code/script.py
nothing added to commit but untracked files present (use "git add" to track)
$ datalad save -m "Add a k-nearest-neighbour clustering analysis" code/script.py

Procedurally, version control is easy with DataLad!
Stay flexible:
$ cd myanalysis
# we can install analysis input data as a subdataset to the dataset
$ datalad clone -d . https://github.com/datalad-handbook/iris_data.git input/
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Remote origin not usable by git-annex; setting annex-ignore
install(ok): input (dataset)
add(ok): input (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
add (ok: 2)
install (ok: 1)
save (ok: 1)
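Cloning with -d . registers the subdataset in the superdataset's .gitmodules, which is why `add(ok): .gitmodules` appears above. The recorded entry looks roughly like this (DataLad also stores a dataset ID alongside, omitted here):

```
[submodule "input"]
	path = input
	url = https://github.com/datalad-handbook/iris_data.git
```

This record lets any clone of the analysis dataset retrieve the exact input dataset on demand.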
$ tree
.
├── CHANGELOG.md
├── code
│   ├── README.md
│   └── script.py
└── input
    └── iris.csv
Write-up:
handbook.datalad.org/en/latest/basics/101-130-yodaproject.html
datalad run
$ datalad run -m "analyze iris data with classification analysis" \
--input "input/iris.csv" \
--output "prediction_report.csv" \
--output "pairwise_relationships.png" \
"python3 code/script.py"
[INFO ] Making sure inputs are available (this may take some time)
get(ok): input/iris.csv (file) [from web...]
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
add(ok): pairwise_relationships.png (file)
add(ok): prediction_report.csv (file)
save(ok): . (dataset)
action summary:
add (ok: 2)
get (notneeded: 2, ok: 1)
save (notneeded: 1, ok: 1)
$ git log
commit df2dae9b5af184a0c463708acf8356b877c511a8 (HEAD -> master)
Author: Adina Wagner adina.wagner@t-online.de
Date: Tue Dec 1 11:58:18 2020 +0100
[DATALAD RUNCMD] analyze iris data with classification analysis
=== Do not change lines below ===
{
"chain": [],
"cmd": "python3 code/script.py",
"dsid": "9ffdbfcd-f4af-429a-b64a-0c81b48b7f62",
"exit": 0,
"extra_inputs": [],
"inputs": [
"input/iris.csv"
],
"outputs": [
"prediction_report.csv",
"pairwise_relationships.png"
],
"pwd": "."
}
^^^ Do not change lines above ^^^
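The block between the two sentinel lines is plain JSON, so the provenance record is machine-readable. A minimal sketch of extracting it from a commit message with only the standard library (the commit message below is the one shown above):

```python
import json

COMMIT_MSG = """\
[DATALAD RUNCMD] analyze iris data with classification analysis

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python3 code/script.py",
 "dsid": "9ffdbfcd-f4af-429a-b64a-0c81b48b7f62",
 "exit": 0,
 "extra_inputs": [],
 "inputs": ["input/iris.csv"],
 "outputs": ["prediction_report.csv", "pairwise_relationships.png"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

def parse_run_record(msg: str) -> dict:
    """Extract the JSON run record embedded between the sentinel lines."""
    start_marker = "=== Do not change lines below ==="
    end_marker = "^^^ Do not change lines above ^^^"
    payload = msg[msg.index(start_marker) + len(start_marker):msg.index(end_marker)]
    return json.loads(payload)

record = parse_run_record(COMMIT_MSG)
print(record["cmd"])      # python3 code/script.py
print(record["outputs"])
```

It is this structured record that `datalad rerun` reads to repeat the command with the same inputs and outputs.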
Rerun this hash to repeat the analysis:
$ datalad rerun df2dae9b5af1
[INFO ] run commit df2dae9; (analyze iris data...)
[INFO ] Making sure inputs are available (this may take some time)
unlock(ok): pairwise_relationships.png (file)
unlock(ok): prediction_report.csv (file)
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
add(ok): pairwise_relationships.png (file)
add(ok): prediction_report.csv (file)
action summary:
add (ok: 2)
get (notneeded: 3)
save (notneeded: 2)
unlock (ok: 2)
datalad-container: datalad containers-run
$ datalad containers-add software --url shub://adswa/resources:2
[INFO ] Initiating special remote datalad
add(ok): .datalad/config (file)
save(ok): . (dataset)
containers_add(ok): /tmp/myanalysis/.datalad/environments/software/image (file)
action summary:
add (ok: 1)
containers_add (ok: 1)
save (ok: 1)
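containers-add records the image in the dataset's .datalad/config, so the software environment itself is version-controlled alongside the data and code. The resulting entry looks roughly like this (shown for a Singularity image; the exact cmdexec template may differ):

```
[datalad "containers.software"]
	image = .datalad/environments/software/image
	cmdexec = singularity exec {img} {cmd}
```

Any command passed to containers-run is substituted into the cmdexec template and executed inside the registered image.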
Write-up:
http://handbook.datalad.org/en/latest/basics/101-133-containersrun.html
datalad containers-run will execute the command in the specified
software environment:
$ datalad containers-run -m "rerun analysis in container" \
--container-name software \
--input "input/iris.csv" \
--output "prediction_report.csv" \
--output "pairwise_relationships.png" \
"python3 code/script.py"
[INFO] Making sure inputs are available (this may take some time)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
unlock(ok): pairwise_relationships.png (file)
unlock(ok): prediction_report.csv (file)
add(ok): pairwise_relationships.png (file)
add(ok): prediction_report.csv (file)
save(ok): . (dataset)
action summary:
add (ok: 2)
get (notneeded: 4)
save (notneeded: 1, ok: 1)
unlock (ok: 2)
datalad rerun will repeat the analysis in the specified software environment
Comprehensive user documentation in the DataLad Handbook (handbook.datalad.org)