import datalad.api as dl
dl.create(path="mydataset")
... and other programming languages can use it via system calls:
# in R
> system("datalad create mydataset")
DataLad Datasets
DataLad's core data structure
Dataset = A directory managed by DataLad
Any directory on your computer can be managed by DataLad.
Datasets can be created (from scratch) or installed
Datasets can be nested: linked subdirectories
Let's start by creating a dataset
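From the command line, this is a one-liner; a minimal sketch (the dataset name and configuration are just examples):
$ datalad create -c text2git mydataset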
DataLad Datasets
A DataLad dataset is a joint Git + git-annex repository
Why version control?
keep things organized
keep track of changes
revert changes or go back to previous states
Version Control
DataLad knows two things: Datasets and files
Every file you put into a dataset can be easily version-controlled,
regardless of size, with the same command.
Local version control
Procedurally, version control is easy with DataLad!
Advice:
Save meaningful units of change
Attach helpful commit messages
Start to record provenance
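A hypothetical example following this advice (file name and message are made up):
$ datalad save -m "Add participant questionnaire" questionnaire.txt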
Have you ever saved a PDF to your computer to read later, but forgot
where you got it from?
Digital Provenance = "The tools and processes used to create a
digital file, the responsible entity, and when and where the process
events occurred"
The history of a dataset already contains provenance, but there is more
to record - for example: Where does a file come from?
datalad download-url is helpful
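For example, with a placeholder URL, the command downloads the file, records its origin, and saves the change:
$ datalad download-url https://example.com/paper.pdf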
Summary - Local version control
datalad create creates an empty dataset.
Configurations (-c yoda, -c text2git) are useful (details soon).
A dataset has a history to track files and their modifications.
Explore it with Git (git log) or external tools (e.g., tig).
datalad save records the dataset or file state to the history.
Concise commit messages should summarize the change for future you and others.
datalad download-url obtains web content and records its origin.
It even takes care of saving the change.
datalad status reports the current state of the dataset.
A clean dataset status (no modifications, no untracked files) is good practice.
Questions!
Consuming datasets
Here's how a dataset looks after installation:
Datasets are lightweight: Upon installation, only small
files and metadata about file availability are retrieved.
Plenty of data, but little disk usage
Cloned datasets are lean.
"Meta data" (file names, availability) are present, but no file content:
$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .
files' contents can be retrieved on demand:
$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]
Have access to more data on your computer than you have disk space:
# eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (80TB, 15 million files)
$ du -sh
48G .
Git versus Git-annex
Data in datasets is either stored in Git or git-annex
By default, everything is annexed, i.e., stored in a dataset annex by git-annex
Git:
handles small files well (text, code)
file contents are in the Git history and will be shared upon git/datalad push
shared with every dataset clone
git-annex:
handles all types and sizes of files well
file contents are in the annex; not necessarily shared
can be kept private on a per-file level when sharing the dataset
With annexed data, only content identity (hash)
and location information are put into Git, rather than file content.
The annex, and transport to and from it is managed with git-annex
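This location information can be inspected with git-annex directly; a small sketch (the file name is hypothetical):
$ git annex whereis sub-01_bold.nii.gz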
Git versus Git-annex
Useful background information for the demo later. Read
this handbook chapter for details
Git and Git-annex handle files differently: annexed files are stored in an annex.
File content is hashed & only content-identity is committed to Git.
Files stored in Git are modifiable, files stored in Git-annex are content-locked
Annexed contents are not available right after cloning,
only content identity and availability information (as they are stored in Git).
Everything that is annexed needs to be retrieved with datalad get from wherever it is stored.
Git versus Git-annex
When sharing datasets with someone without access to the same computational
infrastructure, annexed data is not necessarily stored together with the rest
of the dataset (more in the session on publishing).
Transport logistics exist to interface with all major storage providers.
If the one you use isn't supported, let us know!
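As one illustration, annexed content can be sent to a git-annex special remote; a minimal, hypothetical S3 sketch (remote and bucket names are placeholders, credentials are assumed to be set in the environment):
$ git annex initremote mys3 type=S3 encryption=none bucket=my-bucket
$ git annex copy myfile.nii.gz --to mys3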
Git versus Git-annex
Users can decide which files are annexed:
Pre-made run-procedures, provided by DataLad (e.g., text2git, yoda)
or created and shared by users (Tutorial)
Self-made configurations in .gitattributes (e.g., based on file type,
file/path name, size, ...; see rules and examples)
Per-command basis (e.g., via datalad save --to-git)
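For illustration, a hypothetical .gitattributes rule set (example rules, not defaults) that keeps small files and all text files in Git and annexes everything else:
* annex.largefiles=(largerthan=100kb)
*.txt annex.largefiles=nothing
And a per-command example with a made-up file name:
$ datalad save --to-git -m "add script" code/script.py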
Transport logistics
Disk-space aware workflows: Cloned datasets are lean (only Git):
$ datalad clone git@github.com:datalad-datasets/machinelearning-books.git
install(ok): /tmp/machinelearning-books (dataset)
$ cd machinelearning-books && du -sh
348K .
$ ls
A.Shashua-Introduction_to_Machine_Learning.pdf
B.Efron_T.Hastie-Computer_Age_Statistical_Inference.pdf
C.E.Rasmussen_C.K.I.Williams-Gaussian_Processes_for_Machine_Learning.pdf
D.Barber-Bayesian_Reasoning_and_Machine_Learning.pdf
[...]
annexed files' contents can
be retrieved & dropped on demand:
$ datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
get(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [from web...]
$ datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf
drop(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]
git-annex protects your files
If git-annex does not know any other storage location for a file, it will
warn you and refuse to drop content (this can be configured)
Here is a file with a registered remote location (the web):
$ datalad drop .easteregg
drop(ok): /demo/myanalysis/.easteregg (file) [checking https://imgs.xkcd.com/comics/fuck_grapefruit.png...]
Here is a file without a registered remote location:
$ datalad drop compiling.png
[WARNING] Running drop resulted in stderr output: git-annex: drop: 1 failed
[ERROR ] unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.) [drop(/demo/myanalysis/compiling.png)]
drop(error): /demo/myanalysis/compiling.png (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.)]
If a different location for file content is known,
datalad get can retrieve file content after dropping
Dataset nesting
Seamless nesting mechanisms:
Overcomes scaling issues with large numbers of files
adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
15530572 annex'd files (77.9 TB recorded total size)
nothing to save, working tree clean
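Registering a subdataset is a clone with a -d flag pointing to the superdataset; a sketch reusing the dataset from earlier (the target path is arbitrary):
$ datalad clone -d . git@github.com:datalad-datasets/machinelearning-books.git inputs/books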
A classification analysis on the iris flower dataset
Reproducible execution & provenance capture
datalad run
Computational reproducibility
Code may fail (to reproduce) if run with different software
Datasets can store (and share) software environments (Docker or Singularity containers)
and reproducibly execute code inside the software container, capturing the software as additional
provenance
DataLad extension: datalad-container
datalad containers-run
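A hedged sketch of this workflow (container name and URL are placeholders):
$ pip install datalad-container
$ datalad containers-add mycontainer --url shub://example/container:latest
$ datalad containers-run -n mycontainer "python code/script.py"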
Summary - Reproducible execution
datalad run records a command and
its impact on the dataset.
All dataset modifications are saved - use it
in a clean dataset.
Data/directories specified as --input
are retrieved prior to command execution.
Use one flag per input.
Data/directories specified as --output
will be unlocked for modifications prior to a rerun of the command.
It's optional to specify, but helpful for recomputations.
datalad containers-run can be used
to capture the software environment as provenance.
It ensures computations are run in the desired software setup.
Supports Docker and Singularity containers
datalad rerun can automatically re-execute run-records later.
They can be identified with any commit-ish (hash, tag, range, ...)
datalad rerun
datalad rerun is helpful to spare others and yourself
the short- or long-term memory task, or the forensic skills to figure
out how you performed an analysis
But it is also a digital and machine-readable provenance record
Important: The better the run command is specified, the better the
provenance record
Note: run and rerun only create an entry in the history if the command execution
leads to a change.
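Putting it together, a hypothetical run record and its re-execution (message, paths, and command are placeholders):
$ datalad run -m "Compute brain mask" \
    --input "sub-01/anat/T1w.nii.gz" \
    --output "sub-01/anat/brainmask.nii.gz" \
    "python code/make_mask.py"
$ datalad rerun HEAD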
Questions!
Unlocking things
datalad run "unlocks" everything specified as --output
Outside of datalad run, you can use datalad unlock
This makes annexed files writable:
$ ls -l myfile
lrwxrwxrwx 1 adina adina 108 Nov 17 07:08 myfile -> .git/annex/objects/22/Gw/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af
# unlocking
$ datalad unlock myfile
unlock(ok): myfile (file)
$ ls -l myfile
-rw-r--r-- 1 adina adina 7 Nov 17 07:08 myfile # not a symlink anymore!
datalad save "locks" the file again
$ datalad save
add(ok): myfile (file)
action summary:
add (ok: 1)
save (notneeded: 1)
$ ls -l myfile
lrwxrwxrwx 1 adina adina 108 Nov 17 07:08 myfile -> .git/annex/objects/22/Gw/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af
Some tools (e.g., MATLAB) don't like
symlinks. Unlocking, or running MATLAB with datalad run, helps!
Removing datasets
As mentioned before, annexed data is write-protected,
so trying to rm -rf a dataset fails with permission errors on the annex objects.
If a dataset contains files for which no other remote copy is known, you'll
get a warning:
$ datalad remove -d mydataset
[WARNING] Running drop resulted in stderr output: git-annex: drop: 1 failed
[ERROR ] unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.) [drop(/tmp/mydataset/interdisciplinary.png)]
drop(error): interdisciplinary.png (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.)]
[WARNING] could not drop some content in /tmp/mydataset ['/tmp/mydataset/interdisciplinary.png'] [drop(/tmp/mydataset)]
drop(impossible): . (directory) [could not drop some content in /tmp/mydataset ['/tmp/mydataset/interdisciplinary.png']]
action summary:
drop (error: 1, impossible: 1)
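Following the error message's advice, one hypothetical way out (the sibling name is a placeholder) is to move the content elsewhere first, then remove:
$ cd mydataset && git annex move interdisciplinary.png --to mysibling
$ cd .. && datalad remove -d mydataset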
Compare different models, repeat with updated data
Imagenette dataset
Prepare an input dataset
Create a stand-alone input dataset
Either add data and datalad save it, or use commands such as datalad download-url
or datalad addurls to retrieve it from web sources
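A minimal sketch (dataset name and URL are placeholders):
$ datalad create imagenette
$ cd imagenette
$ datalad download-url https://example.com/imagenette.tgz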
Configure and setup an analysis dataset
Given the purpose of an analysis dataset, configurations can make it easier to use:
-c yoda prepares a useful structure
-c text2git keeps text files such as scripts in Git
The input dataset is installed as a subdataset
Required software is containerized and added to the dataset
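A hypothetical setup combining these steps (all names, paths, and URLs are placeholders):
$ datalad create -c yoda -c text2git myanalysis
$ cd myanalysis
$ datalad clone -d . ../imagenette inputs/imagenette
$ datalad containers-add software --url shub://example/sklearn:latest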
Prepare data
Add a script for data preparation (labels the train and validation images)
Execute it using datalad containers-run
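For example (container name, script, and paths are hypothetical):
$ datalad containers-run -n software -m "Prepare the data" \
    --input "inputs/imagenette" --output "data/" \
    "python code/prepare.py"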
Train models and evaluate them
Add scripts for training and evaluation.
This dataset state can be tagged to identify it easily at a later point
Execute the scripts using datalad containers-run
By dumping a trained model as a joblib object, the trained classifier stays reusable
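A hedged sketch (tag, script, and output names are examples); the tag can be attached right when saving:
$ datalad save -m "Add training and evaluation scripts" --version-tag ready4analysis
$ datalad containers-run -n software -m "Train classifier" \
    --output "model.joblib" "python code/train.py"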
Tips and tricks for ML applications
Standalone input datasets keep input data extendable and reusable
Subdatasets can be registered in precise versions, and updated to the newest state
Software containers aid greatly with reproducibility
The correct software environment is preserved and can be shared
Re-executable run-records can capture all provenance
This can also capture command-line parametrization
Git workflows can be helpful elements in ML workflows
DataLad is no workflow manager, but by checking out tags
or branches one can switch easily and quickly between the results of different models
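For instance, with hypothetical tags marking two model runs:
$ git checkout svm            # dataset state with the SVM results
$ git checkout randomforest   # dataset state with the random forest results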
Why use DataLad?
Mistakes are not forever anymore: Easy version control, regardless of file size
Who needs short-term memory when you can have run-records?
Disk-usage magic: Have access to more data than your hard drive has space
Collaboration and updating mechanisms: Alice shares her data with Bob. Alice fixes a mistake and pushes the fix.
Bob says "datalad update" and gets her changes. And vice-versa.
Transparency: Shared datasets keep their history. No need to track down a former student
to ask what was done in their project.