Research data management
👩‍💻🧠👨‍💻
for Neuroimagers

Adina Wagner
@adswa@mas.to @AdinaKrik

Psychoinformatics lab,
Institute of Neuroscience and Medicine (INM-7)
Research Center Jülich
Institute of Experimental Psychology, HHU Düsseldorf



Slides: DOI 10.5281/zenodo.7419377 (Scan the QR code)

Research data management?






adapted from https://dribbble.com/shots/3090048-Front-end-vs-Back-end

www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates

What exactly is RDM?

JISC, www.jisc.ac.uk/guides/rdm-toolkit (CC-BY)

RDM - for whom?




Funders & publishers require it

Scientific peers & the public increasingly expect it

Win over academic staff (librarians, system administrators)

Your future self will be grateful

Without good RDM, any project becomes dreadful.

RDM in Neuroimaging

    Some peculiarities of our field...

    • Depending on acquisition hardware and analysis software, some data are in proprietary formats (e.g., Neuromag, BrainVoyager, BrainVision)
    • Depending on field, data can be sizeable (e.g., (f)MRI, CT, EEG, PET, MEG)
    • Heterogeneous data from complex acquisitions with multiple data channels and modalities
    • Datasets are getting bigger and bigger (Bzdok & Yeo, 2017), e.g. multi-modal imaging, behavioral + genetics data in HCP (humanconnectome.org) or UKBiobank (ukbiobank.ac.uk/)
    • Some data fall under the General Data Protection Regulation (GDPR)
    • Complex, multi-step analyses


... make RDM more difficult, but also more relevant

BIDS; CC-BY


├── memento_001
│   ├── Move_correc_SSS_alignedinitial_nonfitiso
│   │   ├── 1_memento_001_ml83_mc_transforminitial.fif
│   │   ├── 2_memento_001_ml83-1_mc_transforminitial.fif
│   │   ├── 3_memento_001_ml83-2_mc_transforminitial.fif
│   │   ├── data_fix1.mat
│   │   ├── data_fix_ft1.mat
│   │   ├── data_fix_new1.mat
│   │   ├── data_fix_reduced1.mat
│   │   ├── delay_photodiode_subject_long_default_realign_only_ICA1.mat
│   │   ├── memento_results_ICA_newall_alignedinitial228.mat
│   │   ├── memento_results_ICA_newall_alignedinitial461.mat
│   │   ├── memento_results_ICA_newall_alignedinitial511.mat
│   │   ├── num_trials_old_ICA.mat
│   │   ├── resultfile_probs-1.mat
│   │   └── trial_out_ind.mat
│   ├── Move_correc_SSS_realigneddefault_nonfittoiso
│   │   ├── 1_memento_001_ml83_mc_realigneddefault.fif
│   │   ├── 2_memento_001_ml83-1_mc_realigned_default.fif
│   │   ├── 3_memento_001_ml83-2_mc_realigneddefault.fif
│   │   ├── memento_results_ICA228.mat
│   │   ├── memento_results_ICA455.mat
│   │   ├── memento_results_ICA461.mat
│   │   ├── memento_results_ICA511.mat
│   │   ├── memento_results_ICA_newall228.mat
│   │   ├── memento_results_ICA_newall455.mat
│   │   ├── memento_results_ICA_newall461.mat
│   │   ├── memento_results_ICA_newall511.mat
│   │   ├── mri_aligned.mat
│   │   ├── num_trials_old_ICA.mat
│   │   ├── outfile_new_all.mat
│   │   ├── resultfile_new_all.mat
│   │   ├── template_grid.mat
│   │   └── trial_out_ind.mat
│   └── Raw
│       ├── 1_memento_001_ml83.fif
│       ├── 2_memento_001_ml83-1.fif
│       └── memento_001_ml83-2.fif

    

├── dataset_description.json
├── participants.json
├── participants.tsv
├── README
├── sub-001
│   ├── meg
│   │   ├── sub-001_acq-calibration_meg.dat
│   │   ├── sub-001_acq-crosstalk_meg.fif
│   │   ├── sub-001_coordsystem.json
│   │   ├── sub-001_task-memento_channels.tsv
│   │   ├── sub-001_task-memento_events.tsv
│   │   ├── sub-001_task-memento_log.tsv
│   │   ├── sub-001_task-memento_meg.json
│   │   ├── sub-001_task-memento_split-01_meg.fif
│   │   ├── sub-001_task-memento_split-02_meg.fif
│   │   └── sub-001_task-memento_split-03_meg.fif
│   └── sub-001_scans.tsv
├── sub-002
│   ├── meg
│   │   ├── sub-002_acq-calibration_meg.dat
│   │   ├── sub-002_acq-crosstalk_meg.fif
│   │   ├── sub-002_coordsystem.json
│   │   ├── sub-002_task-memento_channels.tsv
│   │   ├── sub-002_task-memento_events.tsv
│   │   ├── sub-002_task-memento_log.tsv
│   │   ├── sub-002_task-memento_meg.json
│   │   ├── sub-002_task-memento_split-01_meg.fif
│   │   ├── sub-002_task-memento_split-02_meg.fif
│   │   └── sub-002_task-memento_split-03_meg.fif
│   └── sub-002_scans.tsv
...
    

BIDS is an established and evolving community standard for many types of neuroimaging data (MRI, (i)EEG, MEG, ...). It defines a data organization, naming schemes for files, and metadata descriptors.
bids.neuroimaging.io
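
Because BIDS file names are built from underscore-separated key-value pairs followed by a suffix and an extension, they can be read by a program as easily as by a person. The following is a minimal sketch of that idea, not the official BIDS tooling: the function name parse_bids_name is made up for illustration, and the sketch ignores most of the rules in the specification (allowed entities, their order, and so on).

```python
from pathlib import PurePosixPath


def parse_bids_name(filename):
    """Split a BIDS-style file name into entities, suffix, and extension.

    Illustrative only: it checks the key-value pattern, not the full
    BIDS specification.
    """
    name = PurePosixPath(filename).name
    # Everything from the first dot onwards counts as the extension,
    # so double extensions such as ".nii.gz" stay in one piece.
    stem, dot, ext = name.partition(".")
    *entity_parts, suffix = stem.split("_")
    entities = {}
    for part in entity_parts:
        key, sep, value = part.partition("-")
        if not sep or not key or not value:
            raise ValueError(f"not a key-value entity: {part!r}")
        entities[key] = value
    return {"entities": entities, "suffix": suffix, "extension": dot + ext}


info = parse_bids_name("sub-001/meg/sub-001_task-memento_split-01_meg.fif")
print(info["entities"], info["suffix"], info["extension"])
```

A name from the unstructured example, such as 1_memento_001_ml83.fif, makes this sketch raise a ValueError, since parts like "1" carry no key that says what they mean.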

Open {software,standards}





  • remove accessibility barriers
  • allow transparent digital provenance
The building blocks of a scientific result are rarely static
Analysis code evolves
(Fix bugs, add functions, refactor, ...)
Based on Piled Higher and Deeper 1531
Scriberia and The Turing Way (CC-BY)

Version control

CC-BY Scriberia & The Turing Way
    Version control
  • keep things organized
  • keep track of changes
  • revert changes or go
    back to previous states
  • collect and share digital provenance
  • industry standard: Git
The building blocks of a scientific result are rarely static
Data changes
(errors are fixed, data is extended,
naming standards change, an analysis
requires only a subset of your data...)
Piled Higher and Deeper 1323

git-annex and DataLad version control large data

Leaving a trace

"Shit, which version of which script produced these outputs from which version of what data?"

"Shit, which buttons did I click and in which order did I use all those tools?"


CC-BY Scriberia and The Turing Way

1) Create an intuitive structure, and

2) write (plenty! of) documentation as you go, and

3) make your processes machine-readable
Tools and tricks: Perkel, 2020, checklist for computational reproducibility
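
One way to make a process machine-readable is to store, next to each result, a small record of which script and which data produced it. The sketch below assumes plain files and uses only the Python standard library; the function names, the file names, and the ".prov.json" sidecar are hypothetical choices for illustration. Version control tools such as Git capture this kind of information far more robustly.

```python
import hashlib
import json
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def provenance_record(script, inputs, outputs):
    """Describe which script turned which inputs into which outputs."""
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "script": {"path": str(script), "sha256": sha256_of(script)},
        "inputs": {str(p): sha256_of(p) for p in inputs},
        "outputs": {str(p): sha256_of(p) for p in outputs},
    }


# Demo with throwaway files; all file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    script = tmp / "analysis.py"
    data = tmp / "events.tsv"
    result = tmp / "result.txt"
    script.write_bytes(b"print('placeholder analysis')\n")
    data.write_bytes(b"onset\tduration\n0.0\t1.5\n")
    result.write_bytes(b"mean duration: 1.5\n")

    record = provenance_record(script, [data], [result])
    # Store the record as a JSON sidecar next to the result.
    sidecar = tmp / "result.prov.json"
    sidecar.write_text(json.dumps(record, indent=2))
    reloaded = json.loads(sidecar.read_text())

print(json.dumps(record, indent=2))
```

If a script or an input file changes later, its checksum no longer matches the record, which answers the "which version of which script" question without relying on memory.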

Research data management is tied to reproducibility

Based on xkcd.com/2347/ (CC-BY) Reproducibility Management in Neuroscience - Specific Issues and Solutions (DOI 10.5281/zenodo.4285927)

Back-Ups and Archival

Ensure that your data are regularly backed up, and eventually deposited in an appropriate archive or repository
Back-ups
Keep back-ups on different infrastructure, ideally even different physical locations
Synchronize regularly
My personal workflow: Distributed version control
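
A back-up only helps if it matches the current state of the project. As a rough sketch of how to check that, the code below compares file checksums between a project directory and its copy. It is not the workflow described above: the function names and directory layout are hypothetical, and shutil.copytree merely stands in for a real back-up to different infrastructure.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup_differences(source, backup):
    """List files that are missing from the back-up or differ from it."""
    src, bak = checksums(source), checksums(backup)
    return {
        "missing_in_backup": sorted(src.keys() - bak.keys()),
        "changed": sorted(k for k in src.keys() & bak.keys() if src[k] != bak[k]),
    }


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "project"
    backup = Path(tmp) / "backup"
    (source / "sub-001").mkdir(parents=True)
    (source / "README").write_text("first draft\n")
    (source / "sub-001" / "events.tsv").write_text("onset\n0.0\n")

    shutil.copytree(source, backup)  # stand-in for a real back-up run
    in_sync = backup_differences(source, backup)

    # The project keeps changing after the back-up was made.
    (source / "README").write_text("second draft\n")
    (source / "sub-002").mkdir()
    (source / "sub-002" / "events.tsv").write_text("onset\n0.5\n")
    out_of_sync = backup_differences(source, backup)

print(in_sync)
print(out_of_sync)
```

Right after the copy, both lists are empty; after the later edits, the check reports README as changed and sub-002/events.tsv as missing, which is the signal to synchronize again.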
Software
E.g., Software Heritage, Zenodo (both have automatic GitHub integrations)
Data
E.g., Zenodo, Gin.g-node.org, Neurovault, DataVerse, Data DRYAD, FigShare


Further reading: the-turing-way.netlify.app/reproducible-research/rdm/rdm-sharing.html

Digital Object Identifiers

  • Example: 10.5281/zenodo.7419377 (uniquely & persistently identifies this talk)
  • Provide a persistent, trusted reference. Resolve any DOI at doi.org
  • Make your work citeable
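
Resolving a DOI amounts to appending it to the doi.org resolver address. The helper below is a small sketch of that step; the name doi_to_url and the list of accepted prefixes are illustrative assumptions, and the check for a leading "10." is a simplification rather than a full validation of DOI syntax.

```python
from urllib.parse import quote

# Ways a DOI is commonly written; this list is an assumption for the sketch.
PREFIXES = ("https://doi.org/", "http://doi.org/", "doi.org/", "doi:", "doi ")


def doi_to_url(doi):
    """Turn a DOI, written in one of a few common ways, into a resolver URL."""
    doi = doi.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # Simplified sanity check: DOIs start with "10." and contain a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode unusual characters, but keep the slash separators.
    return "https://doi.org/" + quote(doi, safe="/")


print(doi_to_url("DOI 10.5281/zenodo.7419377"))
print(doi_to_url("doi.org/10.1038/sdata.2016.18"))
```

The same function accepts the bare form "10.5281/zenodo.7419377" and rejects strings that lack the "10." prefix.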

The Turing Way
    Get a DOI from
  • free academic services and archives that you already use, such as Zenodo, FigShare, or OSF
  • your own institutions (e.g., library, DataVerse, ...)
  • Preprint servers, publishers
  • My tip for large datasets: Gin.g-node.org

Licenses

  • Everything you (co-)create has a copyright, and you're a/the copyright holder
  • Without a license your work is unusable by others
  • Use an established license rather than creating one yourself
  • Different licenses are suitable for different types of work
  • Software Data (e.g., comics, text)
    License picker
    The Turing Way
    Creative Commons
    www.homoempresarius.com

    Beware!
    "Non-commercial"
    can have undesired side-effects

Further reading: the-turing-way.netlify.app/reproducible-research/licensing

FAIR

Wilkinson et al., 2016: "The FAIR Guiding Principles for scientific data management and stewardship", doi.org/10.1038/sdata.2016.18

Scriberia and The Turing Way (CC-BY)

Data Management Plans (DMP)

Scriberia and The Turing Way (CC-BY)
    A Data Management Plan (DMP) is a brief plan to define:
  • how the data will be created or used
  • how it will be documented
  • who will be able to access it
  • where it will be stored
  • who will back it up
  • whether (and how) it will be shared and preserved.



Further reading: https://ukdataservice.ac.uk/learning-hub/research-data-management

Take home messages

Thanks for your attention


Slides at DOI 10.5281/zenodo.7419377



Women are underrepresented in neuroscience. You can use the
Repository for Women in Neuroscience to find and recommend neuroscientists for
conferences, symposia or collaborations, and help make neuroscience more open & diverse.