Welcome! The session starts soon. Have a ☕!

Research data management
👩‍💻👨‍💻
with DataLad

Adina Wagner
@AdinaKrik
Michał Szczepanik

Psychoinformatics lab,
Institute of Neuroscience and Medicine (INM-7)
Research Center Jülich



Slides: https://github.com/datalad-handbook/course/

welcome!

A few logistical things first:

Questions/interaction throughout the workshop

  • If you have a question during a lecture, please type it in the chat first. There are no stupid questions :)
  • It would be great to have lively discussions. Unless it's interrupting others, please feel encouraged to unmute or turn on your video to interact with us.
  • We're happy to discuss specific use cases at the end. Please make a note about them in the "Shared notes".

Questions/interaction after the workshop

Resources and Further Reading

Comprehensive user documentation in the
DataLad Handbook (handbook.datalad.org)
  • High-level function/command overviews,
    Installation, Configuration, Cheatsheet
  • Narrative-based code-along course
  • Independent of background/skill level,
    suitable for data management novices
  • Step-by-step solutions to common
    data management problems, like
    how to make a reproducible paper

Overview of most tutorials, talks, videos, ... at github.com/datalad/tutorials

Live polling system

Please use your phone to scan the QR code, or open the link in a new browser window

What's your mood today?

What's your level of excitement?

Video recordings

The recording would be edited or stopped to exclude certain or all discussions.
This poll must be unanimous: the workshop will only be recorded if everyone votes "yes".

What will we do today?

  • The workshop centers around DataLad (version 0.16)
  • We aim to do more than a standard introduction by providing in-depth explanations, hands-on exercises, and discussions throughout the workshop
  • (Help us by asking any question that comes up!)

Motivation

Common problems in science

You write a paper about an algorithm and stay up late to generate good-looking figures, but you have to tweak parameters and display options to make it work AND look good. The next morning, you have no idea which parameters produced which figures, or which of the figures matches what you report in the paper.
Illustration adapted from Scriberia and The Turing Way

Common problems in science

Your research project produces phenomenal results, but your laptop, the only place that stores the source code for the results, is stolen or breaks.
https://co.pinterest.com/pin/551128073121451139/

Common problems in science

A graduate student approaches their supervisor, complaining that the supervisor's research idea does not work. After weeks of discussion, it becomes apparent that oral communication doesn't suffice: the student can't sufficiently explain the environment (data, algorithms, ...) they constructed, and if the supervisor can't enter and use the student's project, there's no way to find a fix.
http://phdcomics.com/comics.php?f=1693

Common problems in science

A postdoc wrote a script during their PhD that applied a specific method to a dataset. Now, with new data and a new project, they try to reuse the script but have forgotten how it works.
http://phdcomics.com/comics.php?f=1693

Common problems in science

You try to recreate results from another lab's published paper. You base your re-implementation on everything reported in their paper, but the results you obtain look nothing like the original.
http://phdcomics.com/comics.php?f=1693

Common old problems in science

All these problems were paraphrased from Buckheit & Donoho, 1995
Let's do better!

DataLad can help
with small or large-scale
data management
Free,
open source,
command line tool & Python API

  • A command-line tool, available for all major operating systems (Linux, macOS/OSX, Windows), MIT-licensed
  • Built on top of Git and git-annex
  • Allows...
    ... version-controlling arbitrarily large content:
        version control data and software alongside your code!
    ... transport mechanisms for sharing and obtaining data:
        consume and collaborate on data (analyses) like software
    ... (computationally) reproducible data analysis:
        track and share provenance of all digital objects
    ... and much more
  • Completely domain-agnostic

Acknowledgements

Software
  • Joey Hess (git-annex)
  • The DataLad team & contributors
Illustrations
  • The Turing Way
    project & Scriberia
Funders
Collaborators

Core concepts & features

Everything happens in DataLad datasets


Dataset = Git/git-annex repository

  • content agnostic
  • no custom data structures
  • complete decentralization
  • Looks and feels like a directory on your computer:


File viewer and terminal view of a DataLad dataset

version control arbitrarily large files


    Stay flexible:

  • Non-complex DataLad core API (easy for data management novices)
  • Pure Git or git-annex commands (for regular Git or git-annex users, or to use specific functionality)
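
As a rough sketch of the most basic workflow, the snippet below assembles the two commands involved, `datalad create` and `datalad save`, and only runs them when asked to. The helper name `basic_workflow`, the dataset name `my-dataset`, the file name `notes.txt`, and the commit message are illustrative assumptions, not part of DataLad. Running with `execute=True` additionally assumes that DataLad, Git, and git-annex are installed and that a Git identity is configured.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List


def basic_workflow(dataset: Path, execute: bool = False) -> List[List[str]]:
    """Create a dataset and save one file in it.

    Hypothetical helper: with execute=False (the default) the commands
    are only returned, so the sketch can be read without DataLad installed.
    """
    create = ["datalad", "create", str(dataset)]
    # -d points save at the dataset, -m sets the commit message
    save = ["datalad", "save", "-d", str(dataset), "-m", "Add first notes"]

    if execute:
        subprocess.run(create, check=True)
        # A dataset looks like a directory: files are written as usual
        (dataset / "notes.txt").write_text("My first tracked file\n")
        subprocess.run(save, check=True)
    return [create, save]


# Dry run: print the commands as they would be typed in a terminal
for command in basic_workflow(Path("my-dataset")):
    print(shlex.join(command))
```

The dry-run default is a deliberate choice for this sketch: the commands can be inspected first and typed into a terminal by hand, which is also how the hands-on parts of the workshop proceed.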

Use a dataset's history

  • reset your dataset (or subset of it) to a previous state,
  • revert changes or bring them back,
  • find out what was done when, how, why, and by whom
  • Identify precise versions: Use data in the most recent version, or the one from 2018, or...
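
Because a dataset is a Git repository, the history tasks listed above can be approached with plain Git commands. The following is a minimal sketch that maps each task to one possible command; it only builds and prints the commands and runs nothing. The helper name `history_commands` and the placeholder values (`my-dataset`, `HEAD~1`, `notes.txt`) are assumptions made for illustration, and other Git commands can achieve the same results.

```python
import shlex
from typing import Dict, List


def history_commands(dataset: str, commit: str, path: str) -> Dict[str, List[str]]:
    """Map each history task to a plain Git command.

    Hypothetical helper: it only builds the commands and runs nothing.
    """
    git = ["git", "-C", dataset]  # -C runs Git inside the dataset directory
    return {
        # find out what was done when, why, and by whom
        "inspect": git + ["log", "--stat"],
        # bring one file back to the state it had at `commit`
        "restore file": git + ["checkout", commit, "--", path],
        # undo the changes of `commit` by adding a new commit
        "revert": git + ["revert", "--no-edit", commit],
        # list tagged versions of the dataset
        "versions": git + ["tag", "--list"],
    }


for task, command in history_commands("my-dataset", "HEAD~1", "notes.txt").items():
    print(f"{task}: {shlex.join(command)}")
```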

Consume and collaborate


machine-readable, re-executable provenance


Seamless nesting and dataset linkage


Third party integrations


Apart from local computing infrastructure (from private laptops to computational clusters), datasets can be hosted in major third party repository hosting and cloud storage services. More info: Chapter on Third party infrastructure.

Examples of what DataLad can be used for:

  • Creating and sharing reproducible, open science: Sharing data, software, code, and provenance
  • Screen recording: cloning the REMODNAV paper dataset from GitHub

Examples of what DataLad can be used for:

  • Central data management and archival system

Examples of what DataLad can be used for:

  • Scalable computing framework for reproducible science

... and many more!




Let's fire up a terminal and get started with the Basics