Research data management
👩‍💻👨‍💻
with DataLad

Adina Wagner
@AdinaKrik
Michael Hanke
@eknahm

Psychoinformatics lab,
Institute of Neuroscience and Medicine (INM-7)
Research Center Jülich



Slides: https://github.com/datalad-handbook/course/

Welcome!

Approximate workshop schedule

Session 1 (now, 13.30-15.00)
Logistics & Intro 🧑‍🏫,
Hands-on Terminal Basics 💻,
Demo of core functionality 🧑‍🏫💻

Session 2 (today, 16.00-18.00)
Hands-on DataLad Basics & Exercises 💻

Session 3 (tomorrow, 11.00-12.30)
Sharing and Collaboration 🧑‍🏫,
Hands-on Data publication 💻

Session 4 (tomorrow, 13.30-15.00)
Computational reproducibility 🧑‍🏫💻,
Outro 🧑‍🏫,
Final QA ❔

Logistics and links

Interactivity



  • The workshop centers around DataLad (version 0.16 and up) for real-world research data management use cases
  • There are no stupid questions; ask anything any time
  • Something doesn't look right on your system? Stick a post-it to your screen. We'll take a look together
  • We're available outside of sessions, too. Chat about your use cases or questions over a coffee or meal


  • 4 sessions = time for more than a standard introduction.
  • Materials are available online and persistent, so we can be flexible and spontaneous if specific topics interest you.
  • After the workshop

    Audience response system

    Use your phone to scan the QR code, or open the link in a new browser window

    On a scale of rubber ducks...

    Research data management

    Common problems in science

    You write a paper and stay up late to generate good-looking figures, but you have to tweak many parameters and display options. The next morning, you have no idea which parameters produced which figures, or which figures match what you report in the paper.
    Illustration adapted from Scriberia and The Turing Way

    Common problems in science

    Your research project produces phenomenal results, but your laptop, the only place that stores the source code for the results, is stolen or breaks.
    https://co.pinterest.com/pin/551128073121451139/

    Common problems in science

    A graduate student complains that a research idea does not work. Their supervisor can't figure out what the student did and how, and the student can't sufficiently explain their approach (data, algorithms, software). Weeks of discussion and miscommunication ensue because the supervisor can't explore or use the student's project first-hand.
    http://phdcomics.com/comics.php?f=1693

    Common problems in science

    You wrote a script during your PhD that applied a specific method to a dataset. Now, with new data and a new project, you try to reuse the script, but you have forgotten how it works.
    http://phdcomics.com/comics.php?f=1693

    Common problems in science

    You try to recreate results from another lab's published paper. You base your re-implementation on everything reported in their paper, but the results you obtain look nothing like the original.
    http://phdcomics.com/comics.php?f=1693

    Common old problems in science

    All these problems were paraphrased from Buckheit & Donoho, 1995
    Let's do better!

    DataLad can help
    with small or large-scale
    data management
    Free,
    open source,
    command line tool & Python API

    • A command-line tool, available for all major operating systems (Linux, macOS/OSX, Windows), MIT-licensed
    • Built on top of Git and git-annex
    • Allows...
    • ... version-controlling arbitrarily large content: version control data and software alongside your code!
    • ... transport mechanisms for sharing and obtaining data: consume and collaborate on data (analyses) like software
    • ... (computationally) reproducible data analysis: track and share provenance of all digital objects
    • ... and much more
    • Completely domain-agnostic

    Acknowledgements

    Software
    • Joey Hess (git-annex)
    • The DataLad team & contributors
    Illustrations
    • The Turing Way project & Scriberia
    Funders
    Collaborators

    Examples of what DataLad can be used for:

    • Creating and sharing reproducible, open science: Sharing data, software, code, and provenance
    • (Screen recording of cloning the REMODNAV paper dataset from GitHub)

    Examples of what DataLad can be used for:

    • Central data management and archival system

    Examples of what DataLad can be used for:

    • Scalable computing framework for reproducible science

    Prerequisites: Terminal

    • DataLad can be used from the command line
    • datalad create mydataset
    • ... or with its Python API
    • import datalad.api as dl
      dl.create(path="mydataset")
    • ... and other programming languages can use it via system call
    • # in R
      > system("datalad create mydataset")
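The system-call route shown for R works from Python as well. This can be handy when a script cannot import `datalad.api`, for example because DataLad lives in a different environment. The following is a minimal sketch, not an official recipe: the helper names `datalad_available` and `create_dataset` are made up for illustration, and the only DataLad command used is the `datalad create` call from the slide.

```python
import shutil
import subprocess


def datalad_available():
    """Return True if a `datalad` executable can be found on the PATH."""
    return shutil.which("datalad") is not None


def create_dataset(path):
    """Run `datalad create PATH` as a child process (hypothetical helper).

    check=True raises subprocess.CalledProcessError if the command fails.
    """
    return subprocess.run(["datalad", "create", path], check=True)
```

A script could call `create_dataset("mydataset")` after confirming that `datalad_available()` returns True.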
      


    Prerequisites: Terminal

    datalad-hub.inm7.de

    Unix terminal cheatsheet (incl. Windows equivalents)

    Prerequisites: Installation and Configuration

    • Your installed version of DataLad should be 0.17.2
    • datalad --version
      0.17.2
    • DataLad relies on Git to create a revision history with detailed information on what was changed, when, and how. Therefore, you should tell Git who you are by configuring a Git identity (name and email). Find out whether an identity is set by running either of:
    • $ git config --get user.name
      Adina Wagner
      $ git config --get user.email
      adina.wagner@t-online.de

      $ datalad configuration get user.name user.email
      Adina Wagner
      adina.wagner@t-online.de

    • Set a Git identity using either of
      $ git config --global \
        user.name "Adina Wagner"
      $ git config --global \
        user.email "adina.wagner@t-online.de"

      $ datalad configuration --scope global \
        set user.name="Adina Wagner"
      $ datalad configuration --scope global \
        set user.email="adina.wagner@t-online.de"
    • Allow brand-new DataLad functionality:
      datalad configuration --scope global set datalad.extensions.load=next
    • Find installation and configuration instructions at handbook.datalad.org
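The checks above can also be bundled into a small script to run before the hands-on sessions. This is a sketch under stated assumptions: the helper names (`parse_version`, `git_config_value`, `report`) are invented for illustration, treating versions newer than 0.17.2 as acceptable is an assumption, and only the `datalad --version` and `git config --get` calls from the slide are used.

```python
import re
import subprocess

# Version named on the slide; accepting newer versions is an assumption.
RECOMMENDED = (0, 17, 2)


def parse_version(text):
    """Extract (major, minor, patch) from strings such as '0.17.2' or 'datalad 0.17.2'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def git_config_value(key):
    """Return the Git configuration value for `key`, or None if it is unset or Git is missing."""
    try:
        result = subprocess.run(
            ["git", "config", "--get", key],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return None
    value = result.stdout.strip()
    return value if result.returncode == 0 and value else None


def report():
    """Print a short summary of the prerequisites (hypothetical helper)."""
    try:
        out = subprocess.run(
            ["datalad", "--version"], capture_output=True, text=True
        ).stdout
        version = parse_version(out)
        status = "ok" if version >= RECOMMENDED else "older than recommended"
        print("DataLad version:", ".".join(map(str, version)), f"({status})")
    except (FileNotFoundError, ValueError):
        print("DataLad version: not found")
    for key in ("user.name", "user.email"):
        print(f"{key}:", git_config_value(key) or "NOT SET")
```

Calling `report()` prints one line for the DataLad version and one line each for `user.name` and `user.email`.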

    Prerequisites: Using DataLad

    • Every DataLad command consists of a main command followed by a subcommand. Both the main command and the subcommand can take options.
    • Example (main command, subcommand, several subcommand options):
      $ datalad save -m "Saving changes" --recursive
    • Use --help to find out more about any (sub)command and its options, including a detailed description and examples (press q to close). Use -h to get a short overview of all options:
      $ datalad save -h
      Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
                          [-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
                          [--version]
                          [PATH ...]
      
      Use '--help' to get more comprehensive information.
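The command anatomy described above (main command, optional main-command options, subcommand, subcommand options) can be made explicit with a few lines of Python. This is only an illustrative sketch: `datalad_command` is a hypothetical helper that assembles the argument list and does not run anything.

```python
import shlex


def datalad_command(subcommand, *sub_options, main_options=()):
    """Assemble: datalad [main options] <subcommand> [subcommand options]."""
    return ["datalad", *main_options, subcommand, *sub_options]


# The example from the slide: no main-command options, two subcommand options.
cmd = datalad_command("save", "-m", "Saving changes", "--recursive")
print(shlex.join(cmd))  # datalad save -m 'Saving changes' --recursive

# With DataLad installed, the Python API counterpart would be:
#   import datalad.api as dl
#   dl.save(message="Saving changes", recursive=True)
```

Keeping main-command options and subcommand options in separate arguments mirrors the rule that each part of the command takes its own options.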
                

    Backup

    Core concepts & features

    Everything happens in DataLad datasets


    Dataset = Git/git-annex repository

    • content agnostic
    • no custom data structures
    • complete decentralization
    • Looks and feels like a directory on your computer:


    File viewer and terminal view of a DataLad dataset

    version control arbitrarily large files


      Stay flexible:

    • Non-complex DataLad core API (easy for data management novices)
    • Pure Git or git-annex commands (for regular Git or git-annex users, or to use specific functionality)

    Use a dataset's history

    • reset your dataset (or subset of it) to a previous state,
    • revert changes or bring them back,
    • find out what was done when, how, why, and by whom
    • Identify precise versions: Use data in the most recent version, or the one from 2018, or...

    Consume and collaborate


    machine-readable, re-executable provenance


    Seamless nesting and dataset linkage


    Third party integrations


    In addition to local computing infrastructure (from private laptops to computational clusters), datasets can be hosted on major third-party repository hosting and cloud storage services. More info: Chapter on Third party infrastructure.
