Research data management
👩‍💻👨‍💻
with DataLad

Adina Wagner
@AdinaKrik
More authors

Psychoinformatics lab,
Institute of Neuroscience and Medicine (INM-7)
Research Center Jülich



Slides: DOI 10.5281/zenodo.6346849 (Scan the QR code)

Ice breaker questions

Acknowledgements

Software
  • Joey Hess (git-annex)
  • The DataLad team & contributors
Illustrations
  • The Turing Way project & Scriberia
Funders
Collaborators

Some code

On demand file access via git-annex/DataLad:
# clone the repository
$ git clone https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
# get one or more files/directories/... on demand
$ git annex get file/directory/...
# or
$ datalad clone https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
$ datalad get file/directory/...
Fortunate side-effect: Cloned repos/datasets are small in size,
but can be browsed for existing files and can provide access to
their content regardless of where it is hosted.
You can have access to more files than your computer has disk space!
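The effect can be tried locally without a network connection. The following is a minimal sketch, assuming DataLad and git-annex are installed; the dataset names origin-ds and clone-ds and the file big.dat are made up for illustration, and `datalad drop` (which frees local disk space again) is not shown on the slide above.

```shell
#!/bin/sh
# Sketch: a clone lists annexed files without holding their content.
# Dataset and file names are hypothetical; requires DataLad and git-annex.
set -eu
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# report whether a file's content is available locally
content_state() { if [ -e "$1" ]; then echo present; else echo absent; fi; }

if command -v datalad >/dev/null 2>&1 && command -v git-annex >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    cd "$workdir"

    # a source dataset with one annexed file
    datalad create origin-ds
    echo "pretend this is a large image" > origin-ds/big.dat
    datalad save -d origin-ds -m "Add example data file"

    # the clone knows the file exists, but its content stays at the source
    datalad clone origin-ds clone-ds
    cd clone-ds
    before=$(content_state big.dat)

    # fetch the content on demand ...
    datalad get big.dat
    after_get=$(content_state big.dat)

    # ... and release the local copy once it is no longer needed
    datalad drop big.dat
    after_drop=$(content_state big.dat)

    result="$before,$after_get,$after_drop"
else
    result="skipped"
fi
echo "$result"
```

Dropping only succeeds while another copy of the content is known to exist, here in origin-ds, so the file can be retrieved again later with datalad get.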

Summary - something

datalad create creates an empty dataset.
Configurations (-c yoda, -c text2git) are useful (details soon).

A dataset has a history to track files and their modifications.
Explore it with Git (git log) or external tools (e.g., tig).

datalad save records the dataset or file state to the history.
Concise commit messages should summarize the change for future you and others.

datalad download-url obtains web content and records its origin.
It even takes care of saving the change.

datalad status reports the current state of the dataset.
A clean dataset status (no modifications, no untracked files) is good practice.
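The commands in this summary can be strung together in a throwaway directory. This is a minimal sketch, assuming DataLad and git-annex are installed; the dataset name my-analysis, the file notes.txt, and the commit message are made up for illustration, and datalad download-url is left out because it needs network access.

```shell
#!/bin/sh
# Sketch of the create / save / status cycle.
# Dataset name, file name, and commit message are hypothetical.
set -eu
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

if command -v datalad >/dev/null 2>&1 && command -v git-annex >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    cd "$workdir"

    # create an empty dataset; text2git keeps text files in Git
    datalad create -c text2git my-analysis
    cd my-analysis

    # add a file and record it in the history
    echo "Notes on preprocessing" > notes.txt
    datalad save -m "Add preprocessing notes" notes.txt

    # explore the history and the current state
    git log --oneline
    datalad status

    # an empty porcelain status means nothing is modified or untracked
    if [ -z "$(git status --porcelain)" ]; then
        state="clean"
    else
        state="dirty"
    fi
else
    state="skipped"
fi
echo "$state"
```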

This one floats

  • git status can guide you through resolving the merge conflict. Run it frequently.
$ git status
On branch preproc
You have unmerged paths.
  (fix conflicts and run "git commit")
  (use "git merge --abort" to abort the merge)

Unmerged paths:
  (use "git add file..." to mark resolution)
	both modified:   code/preproc.sh

no changes added to commit (use "git add" and/or "git commit -a")

"I'm in a merge conflict!"

How to emergency-abort

What to do next

Which files contain conflicts
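To practice reading this output safely, a conflict can be reproduced in a scratch repository. The sketch below uses plain Git only; the branch name other and the file contents are made up for illustration, while preproc and code/preproc.sh mirror the output above.

```shell
#!/bin/sh
# Sketch: provoke a merge conflict in a throwaway repository,
# find the conflicted file, then abort the merge.
# The branch "other" and the file contents are hypothetical.
set -eu
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b preproc
mkdir code
echo "echo step 1" > code/preproc.sh
git add code/preproc.sh
git commit -q -m "Add preprocessing script"

# a second branch changes the line ...
git checkout -q -b other
echo "echo step 1, variant A" > code/preproc.sh
git commit -q -am "Change step 1 (variant A)"

# ... and preproc changes the same line differently
git checkout -q preproc
echo "echo step 1, variant B" > code/preproc.sh
git commit -q -am "Change step 1 (variant B)"

# the merge stops with a conflict; "|| true" keeps the script running
git merge other >/dev/null 2>&1 || true

# which files contain conflicts ("UU" marks a file modified on both sides)
git status --short
conflicted=$(git diff --name-only --diff-filter=U)

# emergency abort: return to the state before the merge
git merge --abort
after_abort=$(git status --porcelain)
echo "conflicted: $conflicted"
```

Instead of aborting, the conflict could be resolved by editing code/preproc.sh, marking it with git add, and finishing with git commit, as the hints in the git status output suggest.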

Here's an r-stack

Example: A Git repository with annexed data in a public S3 bucket
Did you know? All of OpenNeuro's datasets are DataLad datasets.
Do you want to publish data to S3? handbook.datalad.org has a walkthrough.