Adina Wagner
@AdinaKrik
Psychoinformatics lab,
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7), Research Center Jülich
ReproNim/INCF fellow

(Yes, 13 TB of data. Yes, real-life example)
Sadly, Git does not handle large files well.
And repository hosting services refuse to handle large files:


$ datalad --version
0.15.6
$ git config --list
user.name=Adina Wagner
user.email=adina.wagner@t-online.de
[...]
$ datalad create mydataset
# in Python
import datalad.api as dl
dl.create(path="mydataset")
# in R
> system("datalad create mydataset")
Procedurally, version control is easy with DataLad!
datalad create creates an empty dataset.
datalad save records the dataset or file state to the history.
datalad download-url obtains web content and records its origin.
datalad status reports the current state of the dataset.
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
# eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (80TB, 15 million files)
$ du -sh
48G .
datalad get retrieves file content from wherever it is stored.
text2git, yoda) or created and shared by users (Tutorial)
.gitattributes (e.g., based on file type, file/path name, size, ...; rules and examples)
datalad save --to-git)
adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
15530572 annex'd files (77.9 TB recorded total size)
nothing to save, working tree clean
(github.com/datalad-datasets/human-connectome-project-openaccess)
$ datalad clone --dataset . http://example.com/ds inputs/rawdata
$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "inputs/rawdata"]
+ path = inputs/rawdata
+ url = http://example.com/importantds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
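The .gitmodules entry in the diff above is a plain INI-style record, so the link between a superdataset and its subdataset can be inspected with standard tools. The following is a minimal sketch using only Python's standard library; the helper name list_subdatasets and the inline GITMODULES string are illustrative assumptions, not part of DataLad, which offers its own datalad subdatasets command for this.

```python
import configparser

# Content of .gitmodules as recorded in the diff above.
GITMODULES = """\
[submodule "inputs/rawdata"]
    path = inputs/rawdata
    url = http://example.com/importantds
"""


def list_subdatasets(text):
    """Map each registered subdataset path to the URL it was cloned from.

    Hypothetical helper: strips the indentation Git uses in .gitmodules
    so that configparser treats every line as a regular key.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string("\n".join(line.strip() for line in text.splitlines()))
    return {parser[name]["path"]: parser[name]["url"] for name in parser.sections()}


subdatasets = list_subdatasets(GITMODULES)
print(subdatasets)
```

The commit hash of the subdataset is not stored in .gitmodules; it lives in the superdataset's tree as the "Subproject commit" entry shown in the second half of the diff.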
datalad clone installs a dataset.
datalad get downloads file content on demand.
datalad run
datalad-container
datalad containers-run
datalad run records a command and its impact on the dataset.
Files or directories specified with --input are retrieved prior to command execution.
Files or directories specified with --output will be unlocked for modifications prior to a rerun of the command.
datalad containers-run can be used to capture the software environment as provenance.
datalad rerun can automatically re-execute run-records later.
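To make the idea of a run-record concrete, here is a toy sketch in plain Python. It is not DataLad's implementation or record format: the functions run_and_record and rerun, the field names, and the file names in.txt and out.txt are assumptions chosen for illustration. It only shows the principle of checking inputs, executing a command, storing a machine-readable description, and re-executing from that description later.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_and_record(cmd, inputs, outputs, cwd):
    """Check that inputs exist, run cmd in cwd, and return a record of the run."""
    missing = [p for p in inputs if not os.path.exists(os.path.join(cwd, p))]
    if missing:
        raise FileNotFoundError(f"inputs not available: {missing}")
    result = subprocess.run(cmd, cwd=cwd)
    return {"cmd": cmd, "inputs": inputs, "outputs": outputs, "exit": result.returncode}


def rerun(record, cwd):
    """Re-execute the command stored in a record and return its exit code."""
    return subprocess.run(record["cmd"], cwd=cwd).returncode


with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "in.txt"), "w") as f:
        f.write("hello")

    # The "analysis": upper-case the input file into the output file.
    script = "open('out.txt', 'w').write(open('in.txt').read().upper())"
    record = run_and_record([sys.executable, "-c", script], ["in.txt"], ["out.txt"], workdir)

    # The record survives serialization, so it can be stored and replayed later.
    stored = json.dumps(record)
    os.remove(os.path.join(workdir, "out.txt"))
    rerun_exit = rerun(json.loads(stored), workdir)
    with open(os.path.join(workdir, "out.txt")) as f:
        rerun_output = f.read()

print(stored)
```

DataLad additionally ties such a record to a commit in the dataset history, which is what allows datalad rerun to find and replay it.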
$ datalad siblings add -d . \
--name gin \
--url git@gin.g-node.org:/adswa/bids-data.git
$ datalad push --to gin
$ datalad clone https://gin.g-node.org/adswa/bids-data
datalad rerun spares others and yourself the short- or long-term memory task,
or the forensic skills, otherwise needed to figure
out how you performed an analysis.
Funders
Collaborators
datalad run "unlocks" everything specified as --output.
Outside of datalad run, you can use datalad unlock:
$ ls -l myfile
lrwxrwxrwx 1 adina adina 108 Nov 17 07:08 myfile -> .git/annex/objects/22/Gw/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af
# unlocking
$ datalad unlock myfile
unlock(ok): myfile (file)
$ ls -l myfile
-rw-r--r-- 1 adina adina 7 Nov 17 07:08 myfile # not a symlink anymore!
datalad save "locks" the file again:
$ datalad save
add(ok): myfile (file)
action summary:
add (ok: 1)
save (notneeded: 1)
$ ls -l myfile
lrwxrwxrwx 1 adina adina 108 Nov 17 07:08 myfile -> .git/annex/objects/22/Gw/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af
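The lock and unlock cycle above can be imitated with a few lines of Python. This is a toy model under stated assumptions, not git-annex's actual code: the functions annex_key, lock, and unlock and the flat objects directory are hypothetical, and the key only mimics the MD5E-s<size>--<md5> pattern visible in the symlink targets for a file without an extension.

```python
import hashlib
import os
import shutil
import stat
import tempfile


def annex_key(path):
    """Build an MD5E-style key from the file's size and MD5 digest."""
    with open(path, "rb") as f:
        data = f.read()
    return f"MD5E-s{len(data)}--{hashlib.md5(data).hexdigest()}"


def lock(path, store):
    """Move content into the store, write-protect it, and leave a symlink behind."""
    key = annex_key(path)
    target = os.path.join(store, key)
    os.makedirs(store, exist_ok=True)
    if os.path.exists(target):
        os.remove(path)  # identical content is already stored
    else:
        shutil.move(path, target)
        os.chmod(target, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    os.symlink(target, path)
    return key


def unlock(path):
    """Replace the symlink with a writable copy of the content it points to."""
    target = os.path.realpath(path)
    os.remove(path)
    shutil.copyfile(target, path)
    os.chmod(path, 0o644)


with tempfile.TemporaryDirectory() as workdir:
    store = os.path.join(workdir, "objects")
    myfile = os.path.join(workdir, "myfile")
    with open(myfile, "w") as f:
        f.write("content")  # 7 bytes

    first_key = lock(myfile, store)
    locked = os.path.islink(myfile)

    unlock(myfile)
    unlocked = not os.path.islink(myfile)
    with open(myfile, "a") as f:
        f.write(" changed")

    second_key = lock(myfile, store)  # modified content gets a new key
    relocked = os.path.islink(myfile)

print(first_key, second_key)
```

Because the key is derived from the content, saving a modified file produces a new object while the old one stays in the store, which is what keeps earlier versions retrievable.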
If you try to rm -rf a dataset, this happens:
$ rm -rf mydataset
rm: cannot remove 'mydataset/.git/annex/objects/70/GM/MD5E-s27246--8b7ea027f6db1cda7af496e97d4eb7c9.png/MD5E-s27246--8b7ea027f6db1cda7af496e97d4eb7c9.png': Permission denied
rm: cannot remove 'mydataset/.git/annex/objects/70/GM/MD5E-s35756--af496e97d4eb7c98b7ea027f6db1cda7.png/MD5E-s27246--af496e97d4eb7c98b7ea027f6db1cda7.png': Permission denied
[...]
😱
$ chmod -R +w mydataset
$ rm -rf mydataset # success!
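The same chmod-then-remove workaround can be scripted. The sketch below uses only the Python standard library; the function make_writable and the miniature directory layout (with the placeholder name KEY) are assumptions built for the demonstration, mirroring the write-protected object directories that cause the "Permission denied" errors above.

```python
import os
import shutil
import stat
import tempfile


def make_writable(root):
    """Add owner write permission below root, similar to chmod -R +w."""
    for dirpath, dirnames, filenames in os.walk(root):
        os.chmod(dirpath, os.stat(dirpath).st_mode | stat.S_IWUSR)
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):  # do not follow annex symlinks
                os.chmod(full, os.stat(full).st_mode | stat.S_IWUSR)


# Build a miniature "dataset" with a write-protected object directory.
workdir = tempfile.mkdtemp()
dataset = os.path.join(workdir, "mydataset")
objdir = os.path.join(dataset, ".git", "annex", "objects", "70", "GM", "KEY")
os.makedirs(objdir)
objfile = os.path.join(objdir, "KEY")
with open(objfile, "w") as f:
    f.write("data")
os.chmod(objfile, 0o444)
os.chmod(objdir, 0o555)

# Restore write permission first, then delete the whole tree.
make_writable(dataset)
shutil.rmtree(dataset)
removed = not os.path.exists(dataset)
os.rmdir(workdir)
print(removed)
```

Unlike datalad remove, this approach performs no check that file content is still available elsewhere, so it deletes the only copy if there is no other.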
datalad remove:
$ datalad remove -d ds001241
remove(ok): . (dataset)
action summary:
drop (notneeded: 1)
remove (ok: 1)
$ datalad remove -d mydataset
[WARNING] Running drop resulted in stderr output: git-annex: drop: 1 failed
[ERROR ] unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.) [drop(/tmp/mydataset/interdisciplinary.png)]
drop(error): interdisciplinary.png (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.)]
[WARNING] could not drop some content in /tmp/mydataset ['/tmp/mydataset/interdisciplinary.png'] [drop(/tmp/mydataset)]
drop(impossible): . (directory) [could not drop some content in /tmp/mydataset ['/tmp/mydataset/interdisciplinary.png']]
action summary:
drop (error: 1, impossible: 1)
Use --nocheck to force removal:
$ datalad remove -d mydataset --nocheck
remove(ok): . (dataset)
If a dataset still has subdatasets present, datalad remove will also error:
$ datalad remove -d myds
drop(ok): README.md (file) [locking gin...]
drop(ok): . (directory)
[ERROR ] to be uninstalled dataset Dataset(/tmp/myds) has present subdatasets, forgot --recursive? [remove(/tmp/myds)]
remove(error): . (dataset) [to be uninstalled dataset Dataset(/tmp/myds) has present subdatasets, forgot --recursive?]
action summary:
drop (ok: 3)
remove (error: 1)
Use --recursive to remove all subdatasets, too:
$ datalad remove -d myds --recursive
uninstall(ok): input (dataset)
remove(ok): . (dataset)
action summary:
drop (notneeded: 2)
remove (ok: 1)
uninstall (ok: 1)