FDM.NRW Werkstatt 2023

Michael Hanke
Adina Wagner
Stephan Heunis

@datalad datalad

Psychoinformatics lab
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7), Germany


Slides: psychoinformatics-de.github.io/fdm-datalad
files.inm7.de/adina/fdm-datalad

Acknowledgements

Agenda


  1. Talk: Using DataLad to interface RDM infrastructures
  2. Code-along demo: Publishing datasets

1

Using DataLad to interface RDM infrastructures

whatisdatalad_2023.html

2

Code-along demonstration

Practical aspects

  • We'll work in the browser on a cloud server with JupyterHub
  • Cloud-computing environment:
       - datalad-hub.inm7.de
  • We have pre-installed DataLad and other requirements
  • We will work via the terminal
  • Draw a username, and set a password of your choice when logging in for the first time; remember it!

Using DataLad in the Terminal

Check the installed version:
        
            datalad --version
        
        

For help on using DataLad from the command line:
            
                datalad --help
            
        
For extensive info about the installed package, its dependencies, and extensions, use wtf:
            
                datalad wtf
            
        

git identity

Check git identity:
        
            git config --get user.name
            git config --get user.email
        
    
Configure git identity:
            
                git config --global user.name "Stephan Heunis"
                git config --global user.email "s.heunis@fz-juelich.de"
            
        
Use the latest datalad features:
            
                git config --global --add datalad.extensions.load next
            
        

Using DataLad via its Python API

Open a Python environment:
        
            ipython
        
    
Import and start using:
            
                import datalad.api as dl
                dl.create(path='mydataset')
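                # more commands are available through the same API; for instance
                # (a minimal sketch, assuming the 'mydataset' dataset created above):
                dl.save(dataset='mydataset', message='Initial state')  # record the current state
                dl.status(dataset='mydataset')                         # report the file tree status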
            
        
Exit the Python environment:
            
                exit
            
        

Quick Basics




Our dataset: Midterm YODA Data Analysis Project

  • DataLad dataset: https://github.com/datalad-handbook/midterm_project
  • Find out more: A Data Analysis Project with DataLad
  • All inputs (i.e. building blocks from other sources) are located in the input/ subdataset
  • Custom code is located in code/
  • Relevant software is included as a software container
  • Outcomes are generated with a provenance-tracked run command, and located in the root of the dataset:
    • prediction_report.csv contains the main classification metrics
    • pairwise_relationships.png is a plot of the relations between features.
        
            [DS~0] ~/midterm_project
            ├── CHANGELOG.md
            ├── README.md
            ├── code/
            │   ├── README.md
            │   └── script.py
            ├── [DS~1] input/
            │   └── iris.csv -> .git/annex/objects/...
            ├── pairwise_relationships.png -> .git/annex/objects/...
            └── prediction_report.csv -> .git/annex/objects/...
        
    

Our dataset: Midterm YODA Data Analysis Project

  • Let's explore the dataset briefly
  • Install the dataset:
            
                datalad clone \
                https://github.com/datalad-handbook/midterm_project.git
            
            
    Find out about its subdatasets
            
                cd midterm_project
                datalad subdatasets
            
            
    Get some contents
            
                datalad get input
            
            
    Drop some contents
            
                datalad drop input
            
            
    Find out about its history
            
                tig
            
            
    Reproduce an analysis
            
                datalad rerun HEAD~2
            
            

    Our plan


    1. Learn - How do we publish data?
    2. Do - Publish data to:
      • GitLab
      • OSF
      • Sciebo (webdav)
      • Dataverse

    "Share data like source code"

    • Datasets can be cloned, pushed, and updated from and to local and remote paths, repository hosting services, and external special remotes
    • Examples:
      Local path: ../my-projects/experiment_data
      Remote path: myuser@myinstitutes.hcp.system:/home/myuser/my-projects/experiment_data
      Hosting service: git@github.com:myuser/experiment_data.git
      External special remote: osf://my-osf-project-id
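
      As a rough sketch, each of these sources can be passed directly to datalad clone (placeholder names and URLs; the osf:// scheme requires the datalad-osf extension):

      datalad clone ../my-projects/experiment_data
      datalad clone myuser@myinstitutes.hcp.system:/home/myuser/my-projects/experiment_data
      datalad clone git@github.com:myuser/experiment_data.git
      datalad clone osf://my-osf-project-id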

    Interoperability

    • DataLad is built to maximize interoperability with a wide range of hosting and storage technologies
    See the handbook chapter Third party infrastructure for walk-throughs for different services


    Publishing datasets

    I have a dataset on my computer. How can I share it, or collaborate on it?

    Glossary

    Sibling (remote)
    Linked clones of a dataset. You can usually update (from) siblings to keep them all in sync (e.g., an ongoing data acquisition stored on the experiment computer, backed up on a cluster and an external hard drive)
    Repository hosting service
    Web services that host Git repositories, such as GitHub, GitLab, Bitbucket, GIN, ...
    Third-party storage
    Infrastructure (private/commercial/free/...) that can host data. A "special remote" protocol is used to publish or pull data to and from it
    Publishing datasets
    Pushing dataset contents (Git and/or annex) to a sibling using datalad push
    Updating datasets
    Pulling new changes from a sibling using datalad update --merge
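
    Putting the glossary together, a typical publish/update cycle might look like this (a sketch with placeholder names and URLs):

      # one-time: register a repository on a hosting service as a sibling
      datalad siblings add -d . --name myremote --url git@example.com:myuser/mydataset.git
      # publish local changes (Git history and, where supported, annexed data)
      datalad push --to myremote
      # later, in another clone of the dataset: pull in new changes from that sibling
      datalad update --merge -s myremote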

    Publishing datasets

    • Most published datasets separate their content behind the scenes: the file tree and small text files live in Git, while (large) file content is managed by git-annex
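
    To see this split in one of your own datasets, git-annex and DataLad can report it directly (a sketch; exact output varies):

      # summary of annexed content (total size, locally present size)
      datalad status --annex all
      # list files whose content is managed by git-annex
      git annex list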

    Publishing datasets

    Typical case:
    • Datasets are exposed via a private or public repository on a repository hosting service
    • Annexed data typically can't be stored on the repository hosting service itself, but it can be kept in almost any third-party storage
    • Publication dependencies automate pushing to the correct place, e.g.,
                      
      $ git config --local remote.github.datalad-publish-depends gdrive
      # or
      $ datalad siblings add --name origin --url git@git.jugit.fzj.de:adswa/experiment-data.git --publish-depends s3
                  
                  

    Publishing datasets

    Special case 1: repositories with annex support
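
    With a hosting service that understands git-annex (e.g., GIN), a single sibling can receive both the Git history and the annexed data, so publishing is a plain push (a sketch with a placeholder URL; see the GIN walk-through at the end of these slides):

      datalad siblings add -d . --name gin --url git@gin.g-node.org:/myuser/mydataset.git
      datalad push --to gin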

    Publishing datasets

    Special case 2: Special remotes with repositories


    Publishing to GitLab

    https://gitlab.com/


    create-sibling-gitlab (docs)
    1. Log into GitLab
    2. Create personal access token
    3. Create a top-level group
    4. Create a gitlab config file (replace relevant items)
                
                    cat << EOF > ~/.python-gitlab.cfg
                    [my-site]
                    url = https://gitlab.com/
                    private_token = my-gitlab-token
                    api_version = 4
                    EOF
                
            
    5. Configure create-sibling-gitlab in the midterm_project dataset:
                
                    datalad configuration set datalad.gitlab-default-site='my-site'
                    datalad configuration set datalad.gitlab-'my-site'-project='my-top-level-group'
                
            
    6. Create the sibling:
                
                    datalad create-sibling-gitlab -d . --recursive -s 'my-gitlab-sibling'
                
            
    7. Push to the sibling:
                
                    datalad push -d . --recursive --to 'my-gitlab-sibling'
                
            

    Publishing to OSF

    https://osf.io/


    create-sibling-osf (docs)
    1. Log into OSF
    2. Create personal access token
    3. Enter credentials using datalad osf-credentials:
                
                    datalad osf-credentials
                
            
    4. Create the sibling:
                
                    datalad create-sibling-osf -d . -s my-osf-sibling \
                    --title 'my-osf-project-title' --mode export --public
                
            
    5. Push to the sibling:
                
                    datalad push -d . --to my-osf-sibling
                
            
    6. Clone from the sibling:
                
                    cd ..
                    datalad clone osf://my-osf-project-id my-osf-clone
                
            

    Publishing to Sciebo

    https://hochschulcloud.nrw/en/


    create-sibling-webdav (docs)
    1. Log into Sciebo
    2. Create a new folder, e.g., datalad-fdm
    3. Find your WebDAV URL (Menu > Files > Settings > WebDAV) and append the folder name at the end
      E.g.: https://fz-juelich.sciebo.de/remote.php/dav/files/s.heunis%40fz-juelich.de/datalad-fdm
    4. Create the sibling:
                
                    cd midterm_project
                    datalad create-sibling-webdav \
                    -d . \
                    -r \
                    -s my-webdav-sibling \
                    --mode filetree 'my-webdav-url'
                
            
    At this point, DataLad should ask for credentials if you have not entered them before. Enter your Sciebo username and password.
    5. Push to the sibling:
                
                    datalad push -d . --recursive --to my-webdav-sibling
                
            
    6. Clone from the sibling:
                
                    cd ..
                    datalad clone \
                    'datalad-annex::?type=webdav&encryption=none&exporttree=yes&url=my-webdav-url/dataset-name' \
                    my-webdav-clone
                
            

    Publishing to Dataverse

    https://dataverse.org/


    add-sibling-dataverse (docs)
    1. Create an account and log into demo.dataverse.org (or your instance)
    2. Find your API token (Username > API Token)
    3. Create a new Dataverse dataset
    4. Add required metadata and save dataset
    5. Retrieve the dataset DOI and the Dataverse instance URL
    6. Create the sibling:
                
                    cd midterm_project
                    datalad add-sibling-dataverse -d . -s my-dataverse-sibling \
                    'my-dataverse-instance-url' doi:'my-dataset-doi'
                
            
    for example:
                
                    datalad add-sibling-dataverse -d . -s dataverse \
                    https://demo.dataverse.org  doi:10.70122/FK2/3K9FOD
                
            
    (DataLad asks for credentials (token) if you haven't entered them before)
    7. Push to the sibling:
                
                    datalad push -d . --to my-dataverse-sibling
                
            
    8. Clone from the sibling:
                
                    cd ..
                    datalad clone \
                    'datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=my-dataverse-instance-url&doi=my-dataset-doi' \
                    my-dataverse-clone
                
            

    Discussion

    • Any questions?
    • Any comments?
    • Any feature requests?

    Extra Walk-through

    DataLad Basics

    DataLad datasets...

    ...DataLad datasets

    Create a dataset (here, with the text2git config):
            
                datalad create -c text2git bids-data
            
        
    Let's have a look inside. Navigate using cd (change directory):
                
                    cd bids-data
                
            
    List the directory content, including hidden files, with ls:
                
                    ls -la .
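                    # In a freshly created dataset you should see roughly the following
                    # (a sketch; details vary with the DataLad version):
                    #   .datalad/       DataLad's own configuration and metadata
                    #   .git/           the Git (and git-annex) repository
                    #   .gitattributes  the text2git rule deciding which files go to Git vs. the annex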
                
            

    Version control...

    ...Version control

    Let's add some Markdown text to a README file in the dataset
            
                echo "# A BIDS structured dataset for my input data" > README.md
            
        
    Now we can check the status of the dataset:
                
                    datalad status
                
            
    We can save the state with save:
                
                    datalad save -m "Add a short README"
                
            
    Further modifications:
                
                    echo "Contains functional task data of one subject" >> README.md
                
            
    Save again:
                
                    datalad save -m "Add information on the dataset contents to the README"
                
            
    Now, let's check the dataset history:
                
                    git log
                
            

    Data consumption & transport...

    ...Data consumption & transport...

    Install a dataset from remote URL (or local path) using clone:
            
                cd ../
                datalad clone \
                https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
            
        
    We can now view the cloned dataset's file tree:
                
                    cd studyforrest-data-phase2
                    ls
                
            
    Let's check the dataset size (i.e. git repository):
                
                    du -sh # print the size of the directory in human-readable units
                
            
    Let's check the actual dataset size (i.e. git repository + annexed content):
                
                    datalad status --annex
                
            
    The DataLad dataset is just the Git repository, i.e. the metadata of all files in the dataset, including the content of all files committed to Git. The actual file content in the annex can be retrieved as needed.

    ...Data consumption & transport

    We can retrieve actual file content with get (here, multiple files):
            
                # get all files of sub-01 for all functional runs of the localizer task
                datalad get \
                sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-*.nii.gz
            
        
    If we don't need a file locally anymore, we can drop it:
                
                    # drop a specific file
                    datalad drop \
                    sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz
                
            
    And it's no problem if you need that exact file again, just get it:
                
                    # get a specific file
                    datalad get \
                    sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz
                
            
    Therefore: no need to store all files locally. Data just needs to be available from at least one location, then you can get what you want when you need it, and drop the rest.
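
    To check where a file's content is available before (or after) dropping it, git-annex can report all known locations (a sketch, using one of the files from this dataset):

                    git annex whereis \
                    sub-01/ses-localizer/func/sub-01_ses-localizer_task-objectcategories_run-4_bold.nii.gz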

    Dataset nesting...

    Datasets can be nested in superdataset-subdataset hierarchies:
    • Helps with scaling (see e.g. the Human Connectome Project dataset)
    • Version control tools struggle with >100k files
    • Modular units improve intuitive structure and reuse potential
    • Versioned linkage of inputs for reproducibility

    ...Dataset nesting

    Let's make a nest! First we navigate into the top-level dataset:
            
                cd ../bids-data
            
        
    Then we clone the input dataset into a specific location in the file tree of the existing dataset, making it a subdataset (using the -d/--dataset flag):
                
                    datalad clone --dataset . \
                    https://github.com/datalad/example-dicom-functional.git  \
                    inputs/rawdata
                
            
    Similarly, we can clone the analysis container (actually, a set of containers from ReproNim) as a subdataset:
                
                    datalad clone -d . \
                    https://github.com/ReproNim/containers.git \
                    code/containers
                
            
    Let's see what changed in the dataset, using the subdatasets command:
                
                    datalad subdatasets
                
            

    Computationally reproducible execution...

    • which script/pipeline version
    • was run on which version of the data
    • to produce which version of the results?

    ...Computationally reproducible execution...

    • The datalad run command can run any command in a way that links the command or script to the results it produces and the data it was computed from
    • The datalad rerun command can take this recorded provenance and recompute the command
    • The datalad containers-run command (from the datalad-container extension) additionally captures software provenance in the form of software containers
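
    In its simplest form, a provenance-tracked execution looks like this (a generic sketch; the command and file names are placeholders, not part of the demo below):
                
                    datalad run -m "Smooth the input image" \
                    --input raw/image.nii.gz \
                    --output derivatives/image_smoothed.nii.gz \
                    "my-smoothing-command raw/image.nii.gz derivatives/image_smoothed.nii.gz"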


    With the datalad-container extension, we can inspect the list of registered containers (recursively):
                
                    datalad containers-list --recursive
                
            
    We'll use the repronim-reproin container for DICOM conversion.

    ...Computationally reproducible execution

    Now, let's try out the containers-run command:
            
                datalad containers-run -m "Convert subject 02 to BIDS" \
                --container-name code/containers/repronim-reproin \
                --input inputs/rawdata/dicoms \
                --output sub-02 \
                "-f reproin -s 02 --bids -l '' --minmeta -o . --files inputs/rawdata/dicoms"
            
        
    What changed after the containers-run command has completed?
    We can use datalad diff (based on git diff):
                
                    datalad diff -f HEAD~1
                
            
    We see that some files were added to the dataset!
    And we have a complete provenance record as part of the git history:
                
                    git log -n 1
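
    The commit created by containers-run embeds a machine-readable run record in its message, roughly of the following form (abbreviated sketch, not verbatim output):
                
                    [DATALAD RUNCMD] Convert subject 02 to BIDS
                    
                    === Do not change lines below ===
                    {
                     "cmd": "... the executed command ...",
                     "inputs": ["inputs/rawdata/dicoms"],
                     "outputs": ["sub-02"],
                     ...
                    }
                    ^^^ Do not change lines above ^^^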
                
            

    Publishing datasets...


    We will use GIN (gin.g-node.org):

    1. Create a GIN user account and log in
    2. Create and upload an SSH key to GIN
    3. Create a new empty repository named "bids-data" and copy its SSH URL

    ...Publishing datasets

    DataLad allows you to add repositories as siblings to a dataset, to which data can be published with push.
            
                datalad siblings add -d . \
                --name gin \
                --url git@gin.g-node.org:/your-gin-username/bids-data.git
            
        
    You can verify the dataset's siblings with the siblings command:
                
                    datalad siblings
                
            
    And we can push our complete dataset (git repository and annex) to GIN:
                
                    datalad push --to gin
                
            

    Using published data...

    Let's use our published data in a new analysis, to demonstrate reusability and the usefulness of modularity.

    First, let's create a new dataset using the YODA principles:
            
                cd ../
                datalad create -c yoda myanalysis
            
        
    Then we can clone our GIN-published dataset as a subdataset
    (NB: use the browser URL without ".git" suffix):
                
                    cd myanalysis
                    datalad clone -d . \
                    https://gin.g-node.org/your-gin-username/bids-data \
                    input
                
            

    ...Using published data...

    We have data, and now we need an analysis script. We will use DataLad's download-url which gets the content of a script and registers its source:
            
                datalad download-url -m "Download code for brain masking from Github" \
                -O code/get_brainmask.py \
                https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
            
        
    Now we have data and an analysis script, and we still need the correct software environment within which to run the analysis. We will again use the datalad-container extension to register a software container with the new dataset:
                
                    datalad containers-add nilearn \
                    --url shub://adswa/nilearn-container:latest \
                    --call-fmt "singularity exec {img} {cmd}"
                
            

    ...Using published data

    Finally, we can run the analysis:
            
                datalad containers-run -m "Compute brain mask" \
                -n nilearn \
                --input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
                --output figures/ \
                --output "sub-02*" \
                "python code/get_brainmask.py"
            
        
    Afterwards, we can inspect how specific files came to be, e.g.:
                
                    git log sub-02_brain-mask.nii.gz
                
            
    And since the run-record is part of the dataset's git history, we know the provenance. DataLad can use this machine-readable information to rerun the analysis without you having to specify any information again:
                
                    datalad rerun