You write a paper about an algorithm and stay up
late to generate good-looking figures, but you have to tweak parameters and
display options to make it work AND look good. The next morning, you have no
idea which parameters produced which figures, or which of the figures
matches what you report in the paper.
Illustration adapted from Scriberia and The Turing Way
Common problems in science
Your research project produces phenomenal results, but your laptop,
the only place that stores the source code for the results, is
stolen or breaks.

https://co.pinterest.com/pin/551128073121451139/
Common problems in science
A graduate student approaches their supervisor, complaining that the
supervisor's research idea does not work. After weeks of discussion,
it becomes apparent that oral communication doesn't suffice: the
student can't sufficiently explain the environment (data, algorithms,
...) they constructed, and if the supervisor can't enter and use the
student's project, there's no way to find a fix.

http://phdcomics.com/comics.php?f=1693
Common problems in science
A postdoc wrote a script during their PhD that applied a specific
method to a dataset. Now, with new data and a new project, they
try to reuse the script, but have forgotten how it works.

http://phdcomics.com/comics.php?f=1693
Common problems in science
You try to recreate results from another lab's published paper.
You base your re-implementation on everything reported in their paper,
but the results you obtain look nothing like the original.

http://phdcomics.com/comics.php?f=1693
Common old problems in science
Why don't we make our lives easier?
Both for you and your future self, as well as for science as a whole?
The tools exist, and are getting easier and
easier to use.
Sometimes, you only need to know that something exists and ...
👏 just 👏 get 👏 started! 👏
... but also don't be too hard on yourself! 🤗

DataLad can help with small or large-scale data management
Free, open source, command line tool & Python API

- A command-line tool, available for all major operating systems
  (Linux, macOS/OSX, Windows), MIT-licensed
- Built on top of Git and git-annex
- Allows...
  - ... version-controlling arbitrarily large content
    - version control data and software alongside code!
  - ... transport mechanisms for sharing and obtaining data
    - consume and collaborate on data (analyses) like software
  - ... (computationally) reproducible data analysis
    - track and share provenance of all digital objects
  - ... and much more
- Completely domain-agnostic
Everything happens in DataLad datasets
Dataset = Git/git-annex repository
- content agnostic
- no custom data structures
- complete decentralization
- Looks and feels like a directory on your computer:
File viewer and terminal view of a DataLad dataset
Version control arbitrarily large files
Stay flexible:
- Non-complex DataLad core API (easy for data management novices)
- Pure Git or git-annex commands (for regular Git or git-annex users, or to use specific functionality)
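One way to picture how arbitrarily large files can be version controlled is to keep the content in a separate store under a checksum-based key and let the version control system track only a tiny pointer. The following is a conceptual sketch under that assumption, not the actual code or on-disk layout of git-annex or DataLad; the function names `content_key` and `add_to_store`, the `store` directory, and the file name `scan.dat` are hypothetical and chosen only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def content_key(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a checksum-based key for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    # Key layout is illustrative: backend name, size, then the checksum.
    return f"SHA256-s{size}--{digest.hexdigest()}"


def add_to_store(path: Path, store: Path) -> str:
    """Move the file content into the store and leave a small pointer behind."""
    key = content_key(path)
    store.mkdir(parents=True, exist_ok=True)
    target = store / key
    if target.exists():
        path.unlink()  # identical content is already stored once
    else:
        path.replace(target)
    # The pointer file is what a version control system would track.
    path.write_text(key + "\n")
    return key


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        big = root / "scan.dat"  # hypothetical large data file
        big.write_bytes(b"\0" * 5_000_000)
        key = add_to_store(big, root / "store")
        print("key:", key)
        print("pointer size in bytes:", big.stat().st_size)
```

Because the key is derived from the content, two identical files end up stored once, and any change to the data produces a new key that shows up as an ordinary small change in the tracked pointer.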
Use a dataset's history
- reset your dataset (or subset of it) to a previous state,
- revert changes or bring them back,
- find out what was done when, how, why, and by whom
- Identify precise versions: Use data in the most recent version, or the one from 2018, or...
Consume and collaborate
Machine-readable, re-executable provenance
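To make the idea concrete: a provenance record can be as simple as a structured description of which command was run and which outputs it produced, so that the command can later be executed again and its results compared. DataLad provides the `datalad run` and `datalad rerun` commands for this; the sketch below is not their implementation, only a minimal stdlib illustration of the principle. The names `run_with_record` and `rerun`, the record fields, and the file `result.txt` are assumptions made for this example.

```python
import hashlib
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a (small) output file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_with_record(cmd: list[str], outputs: list[str], cwd: Path) -> dict:
    """Execute cmd in cwd and return a machine-readable record of the run."""
    subprocess.run(cmd, cwd=cwd, check=True)
    return {
        "cmd": cmd,
        "outputs": {name: sha256_of(cwd / name) for name in outputs},
    }


def rerun(record: dict, cwd: Path) -> bool:
    """Re-execute a recorded command; report whether outputs are identical."""
    for name in record["outputs"]:
        (cwd / name).unlink(missing_ok=True)  # start from a clean state
    fresh = run_with_record(record["cmd"], list(record["outputs"]), cwd)
    return fresh["outputs"] == record["outputs"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Hypothetical analysis step: writes a single result file.
        cmd = [
            sys.executable,
            "-c",
            "from pathlib import Path; Path('result.txt').write_text('42\\n')",
        ]
        record = run_with_record(cmd, ["result.txt"], workdir)
        print(json.dumps(record, indent=2))
        print("reproduced:", rerun(record, workdir))
```

Storing such a record next to the results answers the "which parameters produced which figure" question from the opening slides, because the exact command is kept together with checksums of what it created.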
Seamless nesting and dataset linkage
Third party integrations
Apart from local computing infrastructure (from private laptops to computational clusters),
datasets can be hosted in major third party repository hosting and cloud storage services.
More info: Chapter on Third party infrastructure.
Examples of what DataLad can be used for:
- Publish or consume datasets via GitHub, GitLab, OSF, or similar services
- Creating and sharing reproducible, open science: Sharing data, software, code, and provenance
- Central data management and archival system
... and many more!
Let's have a ☕, and then get started