$ datalad --version
0.16.1
$ git config --list
user.name=Adina Wagner
user.email=adina.wagner@t-online.de
[...]
$ git config --global user.name "Adina Wagner"
$ git config --global user.email "adina.wagner@t-online.de"
Find installation and configuration instructions at handbook.datalad.org
# in the terminal
$ datalad create mydataset
# in Python
import datalad.api as dl
dl.create(path="mydataset")
# in R
> system("datalad create mydataset")
$ datalad save -m "Saving changes" --recursive
$ datalad save -h
Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
[-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
[--version]
[PATH ...]
Use '--help' to get more comprehensive information.
$ datalad create -c text2git my-dataset
Procedurally, version control is easy with DataLad!
datalad download-url is helpful
datalad create creates an empty dataset.
datalad save records the dataset or file state to the history.
datalad download-url obtains web content and records its origin.
datalad status reports the current state of the dataset.
git restore is a dangerous (!), but sometimes useful command.
git revert [hash] transparently undoes a past commit.
git checkout lets you time-travel.
git rebase changes and git reset rewinds history without creating a commit about it (see Handbook chapter for examples).
git reflog
git restore and git clean.
Git does not handle large files well.
And repository hosting services refuse to handle large files:


git-annex to the rescue! Let's take a look at how it works.
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
# eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (~200TB, >15 million files)
$ du -sh
48G .
| Git | git-annex |
| --- | --- |
| handles small files well (text, code) | handles all types and sizes of files well |
| file contents are in the Git history and will be shared upon git/datalad push | file contents are in the annex. Not necessarily shared |
| Shared with every dataset clone | Can be kept private on a per-file level when sharing the dataset |
| Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files | Useful: Large files, private files |
datalad get from wherever it is stored.
text2git, yoda) or created and shared by users (Tutorial)
.gitattributes (e.g., based on file type, file/path name, size, ...; rules and examples)
datalad save --to-git)
$ datalad clone git@github.com:datalad-datasets/machinelearning-books.git
install(ok): /tmp/machinelearning-books (dataset)
$ cd machinelearning-books && du -sh
348K .
$ ls
A.Shashua-Introduction_to_Machine_Learning.pdf
B.Efron_T.Hastie-Computer_Age_Statistical_Inference.pdf
C.E.Rasmussen_C.K.I.Williams-Gaussian_Processes_for_Machine_Learning.pdf
D.Barber-Bayesian_Reasoning_and_Machine_Learning.pdf
[...]
$ datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
get(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [from web...]
$ datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf
drop(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]
$ git annex whereis inputs/images/chinstrap_02.jpg
whereis inputs/images/chinstrap_02.jpg (1 copy)
00000000-0000-0000-0000-000000000001 -- web
c1bfc615-8c2b-4921-ab33-2918c0cbfc18 -- adina@muninn:/tmp/my-dataset [here]
web: https://unsplash.com/photos/8PxCm4HsPX8/download?force=true
ok
$ datalad drop inputs/images/chinstrap_02.jpg
drop(ok): /home/my-dataset/inputs/images/chinstrap_02.jpg (file)
$ datalad get inputs/images/chinstrap_02.jpg
get(ok): inputs/images/chinstrap_02.jpg (file)
$ datalad drop inputs/images/chinstrap_01.jpg
drop(error): inputs/images/chinstrap_01.jpg (file)
[unsafe; Could only verify the existence of 0 out of 1 necessary copy;
(Use --reckless availability to override this check, or adjust numcopies.)]
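The two drop attempts above differ in one respect: the first file has a second known copy (the web source), the second does not. A minimal Python sketch of that safety rule, assuming it reduces to counting verified copies elsewhere against the required number; the function name `can_drop` is hypothetical and this is not how git-annex is implemented:

```python
def can_drop(verified_other_copies, numcopies=1):
    """Allow a drop only if enough copies were verified in other locations.

    verified_other_copies: copies confirmed to exist outside this clone.
    numcopies: required number of copies (1 in the error message above).
    """
    return verified_other_copies >= numcopies


# chinstrap_02.jpg: the web source counts as one verified copy.
print(can_drop(verified_other_copies=1))
# chinstrap_01.jpg: "0 out of 1 necessary copy" could be verified.
print(can_drop(verified_other_copies=0))
```

Under this reading, overriding the check or lowering numcopies is what makes a drop without any remaining copy possible.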
$ ls -l inputs/images/chinstrap_01.jpg
lrwxrwxrwx 1 adina adina 132 Apr 5 20:53 inputs/images/chinstrap_01.jpg -> ../../.git/annex/objects/1z/xP/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg
(PS: especially useful in datasets with many identical files)
$ md5sum inputs/images/chinstrap_01.jpg
2e043a5654cec96aadad554fda2a8b26 inputs/images/chinstrap_01.jpg
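The symlink target and the md5sum output carry the same checksum: the object name in the annex (the key) appears to encode the backend, the size in bytes, and the MD5 hash of the content. A minimal Python sketch, assuming the `MD5E-s<size>--<md5><extension>` layout seen in the listing above; the helper names `parse_md5e_key`, `content_present`, and `matches_key` are hypothetical and not part of DataLad or git-annex:

```python
import hashlib
import os
import re

# Layout assumed from the symlink target shown above:
#   MD5E-s<size in bytes>--<md5 hex digest><file extension>
KEY_PATTERN = re.compile(
    r"^MD5E-s(?P<size>\d+)--(?P<md5>[0-9a-f]{32})(?P<ext>\..*)?$"
)


def parse_md5e_key(key):
    """Split an MD5E key into its size, checksum, and extension."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not an MD5E key: {key!r}")
    return {
        "size": int(match.group("size")),
        "md5": match.group("md5"),
        "ext": match.group("ext") or "",
    }


def content_present(path):
    """True if the file content is available locally.

    After a drop the symlink remains but points to a missing object;
    os.path.exists() follows symlinks and therefore returns False.
    """
    return os.path.exists(path)


def matches_key(path, key):
    """Check that the content at `path` has the size and MD5 named in `key`."""
    info = parse_md5e_key(key)
    with open(path, "rb") as f:
        data = f.read()
    return (
        len(data) == info["size"]
        and hashlib.md5(data).hexdigest() == info["md5"]
    )


key = "MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg"
print(parse_md5e_key(key))
```

Because the key is derived from the content, two files with identical content resolve to the same object, which is why this scheme helps in datasets with many identical files.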
datalad run wraps a command execution and records its impact on a dataset.
commit 9fbc0c18133aa07b215d81b808b0a83bf01b1984 (HEAD -> main)
Author: Adina Wagner <adina.wagner@t-online.de>
Date: Mon Apr 18 12:31:47 2022 +0200
[DATALAD RUNCMD] Convert the second image to greyscale
=== Do not change lines below ===
{
"chain": [],
"cmd": "python code/greyscale.py inputs/images/chinstrap_02.jpg outputs/im>
"dsid": "418420aa-7ab7-4832-a8f0-21107ff8cc74",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
diff --git a/outputs/images_greyscale/chinstrap_02_grey.jpg b/outputs/images_gr>
new file mode 120000
index 0000000..5febc72
--- /dev/null
+++ b/outputs/images_greyscale/chinstrap_02_grey.jpg
@@ -0,0 +1 @@
+../../.git/annex/objects/19/mp/MD5E-s758168--8e840502b762b2e7a286fb5770f1ea69.>
\ No newline at end of file
The resulting commit's hash (or any other identifier) can be used to automatically re-execute a computation (more on this tomorrow)
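The run record in the commit message above is plain JSON between two marker lines, so it can be read back programmatically. A minimal Python sketch, assuming the marker strings exactly as they appear in the log; the function name `extract_run_record` is hypothetical, and the `cmd` value in the sample message is a placeholder because the command in the log is truncated:

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message):
    """Return the JSON run record embedded in a commit message as a dict."""
    start = commit_message.index(START) + len(START)
    end = commit_message.index(END, start)
    return json.loads(commit_message[start:end])


# Hypothetical commit message modelled on the log above.
message = """[DATALAD RUNCMD] Convert the second image to greyscale

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python code/example.py input.jpg output.jpg",
 "dsid": "418420aa-7ab7-4832-a8f0-21107ff8cc74",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = extract_run_record(message)
print(record["cmd"], record["exit"])
```

The record holds what a re-execution needs: the command, the working directory, and the declared inputs and outputs.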
Traceback (most recent call last):
File "/home/bob/Documents/rdm-warmup/example-dataset/code/greyscale.py", line 20, in <module>
grey.save(args.output_file)
File "/home/bob/Documents/rdm-temporary/venv/lib/python3.9/site-packages/PIL/Image.py", line 2232, in save
fp = builtins.open(filename, "w+b")
PermissionError: [Errno 13] Permission denied: 'outputs/images_greyscale/chinstrap_02_grey.jpg'
$ datalad unlock outputs/images_greyscale/chinstrap_02_grey.jpg
$ ls -l outputs/images_greyscale/chinstrap_02_grey.jpg
-rw-r--r-- 1 adina adina 758168 Apr 18 12:31 outputs/images_greyscale/chinstrap_02_grey.jpg
datalad run wraps a command execution and records its impact on a dataset.
In addition, it can take care of data retrieval and unlocking.
datalad rerun is helpful to spare others and yourself
the short- or long-term memory task, or the forensic skills to figure
out how you performed an analysis
datalad rerun to rerun the script execution.
Find out if the output changed
Data/directories specified as --input are retrieved prior to command execution, data/directories specified as --output unlocked.
datalad rerun can automatically re-execute run-records later.
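To make the --input and --output handling concrete, here is a minimal Python sketch that only assembles a `datalad run` command line and does not execute it. The helper names `build_run_command` and `build_rerun_command` are hypothetical, and the script arguments are an assumed illustration, since the recorded command in the log is truncated:

```python
import shlex


def build_run_command(cmd, message, inputs=(), outputs=()):
    """Assemble the argument list for a `datalad run` call."""
    args = ["datalad", "run", "-m", message]
    for path in inputs:
        # Retrieved before the command executes.
        args += ["--input", path]
    for path in outputs:
        # Unlocked so the command can overwrite it.
        args += ["--output", path]
    args.append(cmd)
    return args


def build_rerun_command(commit):
    """Assemble the argument list for re-executing a recorded run."""
    return ["datalad", "rerun", commit]


# Assumed paths, for illustration only.
source = "inputs/images/chinstrap_02.jpg"
target = "outputs/images_greyscale/chinstrap_02_grey.jpg"
run_args = build_run_command(
    cmd=f"python code/greyscale.py {source} {target}",
    message="Convert the second image to greyscale",
    inputs=[source],
    outputs=[target],
)
print(shlex.join(run_args))
print(shlex.join(build_rerun_command("9fbc0c18133aa07b215d81b808b0a83bf01b1984")))
```

Declaring the output this way is what would avoid the PermissionError from the traceback, because the locked file is unlocked before the script writes to it.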