Julian Kosciessa
@JulianKosciessa
(kindly power-point-karaoke-ing with slides from Adina Wagner)
Cognitive Neuromodulation Group,
Donders Institute for Brain, Cognition and Behaviour
Important! The Hub is a shared resource. Don't fill it up :)
git config --get user.name
git config --get user.email
git config --global user.name "Adina Wagner"
git config --global user.email "adina.wagner@t-online.de"
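If you are unsure whether an identity is already configured, the checks and the settings above can be combined into a small guard. The sketch below writes to a throwaway config file for illustration (the file path and the example identity are placeholders); drop `--file` or use `--global` to act on your real Git configuration:

```shell
# Only set an identity if none is configured yet.
# --file points at a throwaway config here for illustration.
cfg=./example-gitconfig
git config --file "$cfg" user.name >/dev/null 2>&1 \
    || git config --file "$cfg" user.name "Adina Wagner"
git config --file "$cfg" user.email >/dev/null 2>&1 \
    || git config --file "$cfg" user.email "adina.wagner@t-online.de"
git config --file "$cfg" user.name
```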
git config --global --add datalad.extensions.load next
datalad --version
datalad --help
The help may be displayed in a pager - exit it by pressing "q"
datalad wtf
Let's find out what kind of system we're on:
datalad wtf -S system
ipython
import datalad.api as dl
dl.create(path='mydataset')
exit
Create a dataset (here, with the yoda configuration, which adds
a helpful structure and configuration for data analyses):
datalad create -c yoda my-analysis
Enter it with cd (change directory):
cd my-analysis
List the dataset contents, including hidden files, with ls:
ls -la .
echo "# My example DataLad dataset" > README.md
Check the status of the dataset:
datalad status
Save the modification with the save command:
datalad save -m "Add project title into the README"
echo "Contains a small data analysis for my project" >> README.md
git diff
datalad save -m "Add information on the dataset contents to the README"
git log
tig
(navigate with arrow keys and enter, press "q" to go back and exit the program)
datalad download-url -m "Add an analysis script" \
-O code/mne_time_frequency_tutorial.py \
https://raw.githubusercontent.com/datalad-handbook/resources/master/mne_time_frequency_tutorial.py
git log code/mne_time_frequency_tutorial.py
Procedurally, version control is easy with DataLad!
black code/mne_time_frequency_tutorial.py
git diff
git restore code/mne_time_frequency_tutorial.py
datalad run -m "Reformat code with black" \
"black code/mne_time_frequency_tutorial.py"
git show
datalad rerun
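How can rerun know what to re-execute? datalad run embeds a machine-readable run-record in the commit message it creates. The snippet below is an illustrative, abridged sketch of such a record; the field values are placeholders, and the dataset ID is elided:

```
[DATALAD RUNCMD] Reformat code with black

=== Do not change lines below ===
{
 "cmd": "black code/mne_time_frequency_tutorial.py",
 "dsid": "...",
 "exit": 0,
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
```

Because the record lives in the Git history, it travels with every clone of the dataset.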
Existing datasets can be installed with the clone command.
Either as a stand-alone entity:
# just an example:
datalad clone \
https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
# just an example:
datalad clone -d . \
https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
datalad clone --dataset . \
https://github.com/OpenNeuroDatasets/ds003104.git \
input/
Registered subdatasets can be listed with the subdatasets command:
datalad subdatasets
Inspect the most recent commit with git show:
git show
cd input
ls
tig
Check how much space the clone occupies (with the du
disk-usage command):
du -sh
datalad status --annex
Retrieve file content with the get command:
datalad get sub-01
When the content is no longer needed locally, drop it:
datalad drop sub-01
dl.get('input/sub-01')
# [really complex analysis]
dl.drop('input/sub-01')
If data is published anywhere, your data analysis can carry an actionable link to it,
with barely any space requirements.
| Git | git-annex |
| --- | --- |
| handles small files well (text, code) | handles all types and sizes of files well |
| file contents are in the Git history and will be shared upon git/datalad push | file contents are in the annex; not necessarily shared |
| shared with every dataset clone | can be kept private on a per-file level when sharing the dataset |
| useful for: small, non-binary, frequently modified, need-to-be-accessible files (DUA, README) | useful for: large files, private files |
The yoda configuration, for example, configures the code/
directory and the dataset descriptions (e.g., README files) to be in Git.
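Under the hood, this routing between Git and git-annex is driven by annex.largefiles rules in the dataset's .gitattributes file. A hypothetical fragment (the patterns and the size threshold are only examples, not what yoda actually writes) might look like this:

```
* annex.largefiles=(largerthan=10kb)
*.md annex.largefiles=nothing
code/* annex.largefiles=nothing
```

Files matching a rule with annex.largefiles=nothing stay in Git; everything else goes to the annex.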
There are many other configurations, and you can also
write your own.
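As a sketch of what "writing your own" could look like: DataLad discovers procedures in a dataset's .datalad/procedures/ directory; a procedure is simply an executable that receives the dataset path as its first argument. Everything below (the procedure name and what it sets up) is hypothetical:

```shell
# Hypothetical custom procedure "myproject" (all names are examples).
mkdir -p .datalad/procedures
cat > .datalad/procedures/cfg_myproject.sh <<'EOF'
#!/bin/sh
set -e
ds="$1"
# Create a project skeleton inside the dataset
mkdir -p "$ds/code" "$ds/input" "$ds/figures"
echo "# Project dataset" > "$ds/README.md"
EOF
chmod +x .datalad/procedures/cfg_myproject.sh
```

Once saved to the dataset, such a procedure could be applied with datalad run-procedure cfg_myproject, or at creation time via datalad create -c myproject.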
cd ..
python code/mne_time_frequency_tutorial.py
datalad run can run any command in a way that links the command or script to the
results it produces and the data it was computed from.
datalad rerun can take this recorded provenance and recompute the command.
datalad containers-run (from the extension "datalad-container") can capture software
provenance in the form of software containers, in addition to the provenance that
datalad run captures.
With the datalad-container extension, we can add software containers
to datasets and work with them.
Let's add a software container with Python software to run the script:
datalad containers-add python-env \
--url https://files.inm7.de/adina/resources/mne \
--call-fmt "singularity exec {img} {cmd}"
datalad containers-list
Instead of datalad run, we now use the containers-run command:
datalad containers-run -m "run classification analysis in python environment" \
--container-name python-env \
--input "input/sub-01/meg/sub-01_task-somato_meg.fif" \
--output "figures/*" \
"python3 code/mne_time_frequency_tutorial.py {inputs}"
What has changed after the containers-run command has completed?
Find out with datalad diff (based on git diff):
datalad diff -f HEAD~1
git log -n 1
ssh-keygen -t ed25519 -C "your-email"
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
cat ~/.ssh/id_ed25519.pub
DataLad can create sibling repositories on various infrastructure and third-party
services (GitHub, GitLab, OSF, WebDAV-based services, DataVerse, ...),
to which data can then be published with push.
datalad create-sibling-gin example-analysis --access-protocol ssh
Registered siblings can be listed with the siblings command:
datalad siblings
datalad push --to gin
cd ../
datalad clone \
https://gin.g-node.org/adswa/example-analysis \
myclone
cd myclone
datalad get figures/inter_trial_coherence.png
datalad drop figures/inter_trial_coherence.png
datalad rerun
Data changes (errors are fixed, data is extended, naming standards change, an analysis requires only a subset of your data...)
Provenance = "The tools and processes used to create a digital file, the responsible entity, and when and where the process events occurred"
/dataset
├── sample1
│   └── a001.dat
├── sample2
│   └── a001.dat
...
/dataset
├── sample1
│   ├── ps34t.dat
│   └── a001.dat
├── sample2
│   ├── ps34t.dat
│   └── a001.dat
...
Without expert/domain knowledge, no distinction between original and derived data
is possible.
/raw_dataset
├── sample1
│   └── a001.dat
├── sample2
│   └── a001.dat
...
With modularity after an applied transform (preprocessing, analysis, ...):
/derived_dataset
├── sample1
│   └── ps34t.dat
├── sample2
│   └── ps34t.dat
├── ...
└── inputs
    └── raw
        ├── sample1
        │   └── a001.dat
        ├── sample2
        │   └── a001.dat
        ...
Clearer separation of semantics, through the use of a pristine version of the original
dataset within a new, additional dataset holding the outputs.
$ datalad clone --dataset . http://example.com/ds inputs/rawdata
$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "inputs/rawdata"]
+ path = inputs/rawdata
+ url = http://example.com/importantds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
Each (sub)dataset is a separately, but jointly version-controlled entity.
If none of its data is retrieved, a subdataset is an extremely lightweight data dependency,
and yet actionable (datalad get retrieves contents on demand).
Women are underrepresented in neuroscience. You can use the Repository for Women in Neuroscience to find and recommend neuroscientists for conferences, symposia, or collaborations, and help make neuroscience more open & diverse.
datalad create creates an empty dataset.
datalad save records the dataset or file state to the history.
datalad download-url obtains web content and records its origin.
datalad status reports the current state of the dataset.
datalad clone installs a dataset.
datalad get downloads file content on demand.
datalad run records a command and its impact on the dataset.
Files given as --input are retrieved prior to command execution.
Files given as --output will be unlocked for modifications prior to a rerun of the command.
datalad containers-run can be used to capture the software environment as provenance.
datalad rerun can automatically re-execute run-records later.