Open research software infrastructure
in Neuro-Medicine

Adina Wagner
mas.to/@adswa

Psychoinformatics lab,
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7)
Research Center Jülich



Slides: DOI 10.5281/zenodo.10149349 (Scan the QR code)
files.inm7.de/adina/talks/html/zimannheim.html

Open Science and Open Software go hand in hand

  • Science has specific requirements; research software from within science ("from scientists, for scientists") can fulfill them. Open formats, protocols, and code allow re-use, interoperability, and customization.

  • Open and reproducible science has specific needs for transparency and accessibility: Open source software provides the necessary auditability.

  • Creating software becomes increasingly possible for scientists: The San Francisco Declaration on Research Assessment (DORA; signed by FZJ), the Agreement on Reforming Research Assessment (CoARA), and the DFG recognize software as academic output.

The Institute for Neuroscience and Medicine (INM-7)


  • Interdisciplinary institute with 11 research groups
  • Research foci:
    • Infrastructure and method development: Digital biomarkers, machine learning, meta-analysis, research data management
    • Basic research in human brain mapping: Connectomics, genetic gradients, in-vivo brain mapping, multimodal integration
    • AI applications in medical research: Cognition, personality, aging & neurodegenerative disease, schizophrenia
    • Ethical implications of medical AI: Bias in AI applications, medical AI and society, individualized predictions

Software @ INM-7

  • The institute has a history of open source software, starting with the SPM Anatomy Toolbox (Eickhoff, 2005)
  • Multiple groups develop and maintain open source research software for their respective subdomains
  • Recent integration efforts connect our open software stack to open research software infrastructure for neuro-medicine

DataLad

  • Domain-agnostic data management tool (command-line + graphical user interface), built on top of Git & git-annex
  • 10+ year open source project (100+ contributors), available for all major OS
  • Born from rethinking data:
    • Just like code, data is not static.
    • Just like code, data is subject to collaboration. Streamlined workflows for sharing and collaborating, mirroring those in software development, should be possible.
    • Provenance of data is essential for reproducible, trustworthy, and FAIR science
    • Flexibility and interoperability with existing tools is the key to sustainability and ease of use

  • Major features (see the sketch below):
    • Version control for arbitrarily large content: version data & software alongside code!
    • Transport mechanisms for sharing, updating & obtaining data
    • Consume & collaborate on data (analyses) like software
    • (Computationally) reproducible data analysis
    • Track and share provenance of all digital objects
    • (... and much more)
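
  A minimal sketch of these basics, using DataLad's Python API (the same operations exist as command-line commands); the dataset path, file, and commit message are example values:

    import datalad.api as dl

    # Create a dataset: a Git repository with git-annex support
    ds = dl.create(path="my-dataset")

    # ... place (arbitrarily large) files into my-dataset/ ...

    # Version-control the content; large file content is managed by git-annex, not Git itself
    ds.save(message="Add raw acquisition data")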

DataLad use cases

  • Publish or consume datasets via GitHub, GitLab, OSF, the European Open Science Cloud, or similar services
  • Behind-the-scenes infrastructure component for data transport and versioning (e.g., used by OpenNeuro, brainlife.io, the Canadian Open Neuroscience Platform (CONP), CBRAIN)
  • Central data management and archival system
  • Decentral data and metadata catalog
  • Creating and sharing reproducible, open science: Sharing data, software, code, and provenance
  • Reproducibility at the largest scale: Framework for computationally reproducible processing
  (Screen recordings: cloning the studyforrest dataset from GitHub; browsing OpenNeuro; cloning the REMODNAV paper dataset from GitHub)
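
  A hedged sketch of the consumption workflow shown in the recordings; the dataset URL is one example (any DataLad dataset works the same way):

    import datalad.api as dl

    # Cloning fetches metadata and file stubs only -- fast even for huge datasets
    ds = dl.clone(source="https://github.com/OpenNeuroDatasets/ds000001.git",
                  path="ds000001")

    # File content is retrieved only on demand
    ds.get("sub-01")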

    Acknowledgements

    DataLad software & ecosystem:
    • Psychoinformatics Lab, Research Center Jülich
    • Center for Open Neuroscience, Dartmouth College
    • Joey Hess (git-annex)
    • >100 additional contributors

    DataLad Office Hour
    Funders
    Collaborators

    JTrack: Digital biomarkers from your smartphone

    • Objective: Close monitoring of patients/participants in non-clinical settings
    • Modern smartphones contain a variety of sensors for passive monitoring and active acquisition:
      • gyroscope
      • accelerometer
      • location
      • human activity recognition
      • application usage
      • screen time
      • microphone
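
      A purely illustrative sketch of what a single passive-sensing sample could look like; all field names are hypothetical, not JTrack's actual data schema:

        sample = {
            "subject_id": "anon-1234",                     # hypothetical identifier
            "timestamp": "2023-11-20T14:32:05Z",
            "sensor": "accelerometer",
            "values": {"x": 0.02, "y": -0.98, "z": 0.11},  # example readings
        }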

    JTrack

      Flexible components for different users:
    • JTrack Social: Smartphone app for participants (passive monitoring & active tasks)
    • JTrack EMA: Smartphone app for participants (ecological momentary assessment)
    • JDash: Monitoring and analytics tool for study owners

    JTrack components: JTrack Social

      Smartphone App for active labeling and passive monitoring

      • Sensor data (default passive collection mode: accelerometer & gyroscope)
      • Application usage statistics
      • Human activity recognition (e.g., walking, running, driving)
      • Location information (anonymized)
      • Active recording, e.g., free-speech generation tasks

    Available for Android + iPhone


    JTrack components: JTrack EMA

      Smartphone App for Ecological Momentary Assessment

      • Binary questions
      • Date and time questions
      • Sliding questions
      • Multiple/single choice questions

    Available for Android + iPhone

    JTrack components: JDash

      Dashboard for Study Administration

      • Investigator's study & user management
      • Data quality control
      • Notification center

    Behind the scenes

    • Servers in Germany
    • Data versioning via DataLad
    • Authenticated data access via JDash
    • Data transfer via HTTPS
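
    A hedged sketch of authenticated data access over HTTPS; the endpoint URL and token scheme are hypothetical stand-ins, not JDash's actual API:

        import requests

        # Fetch study data over HTTPS with token-based authentication (hypothetical endpoint)
        resp = requests.get(
            "https://jdash.example.org/api/studies/<study-id>/data",
            headers={"Authorization": "Bearer <access-token>"},  # hypothetical auth scheme
            timeout=30,
        )
        resp.raise_for_status()
        records = resp.json()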

    Participant's / Investigator's point of view

    Participant:
    • Install JTrack
    • Scan QR code and give permissions to the app
    • Resume daily life

    Investigator:
    • Install JTrack with the participant
    • Provide study- and subject-specific QR code from JDash
    • Monitor study and communicate with participants via JDash

    Advantages

    • Free, easy-to-deploy environment for collecting real-world data (RWD) at virtually no cost
    • Standardized data collection across centers
    • High-density longitudinal data with fully customizable data collection
    • Opportunity for citizen science

    Acknowledgements

    Team

    Julearn

    • Open source Python library for easy-to-use ML pipelines, built on top of scikit-learn
    • Domain-general, but aims to simplify entry into ML for domain scientists, with built-in guarantees against the most common pitfalls:
      • Data leakage
      • Overfitting of hyperparameters
      The problem: expensive AI mistakes. A solution: user-friendly solutions to common complex use cases.
      • Simplifies common use cases for supervised ML pipelines, with features such as:
        • Automatic use of nested cross-validation for proper evaluation during hyperparameter tuning
        • Preprocessing based on feature types, incl. confound removal
        • Built-in visualization for model inspection and comparison
      • Plug-and-play with scikit-learn transformers

    Visualization

    Interactive "Scores Viewer" for easier model comparison

    Julearn vs scikit-learn

    Simple CV pipeline

    julearn:

    from julearn import run_cross_validation

    run_cross_validation(
        X=X, y=y, data=data,
        preprocess=["zscore"], model="svm",
        problem_type="classification",
        X_types={"continuous": X},  # X_types optional here
    )

    scikit-learn:

    from sklearn.model_selection import cross_validate
    from sklearn.svm import SVC  # SVR in case of regression
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(StandardScaler(), SVC())
    cross_validate(estimator=pipeline, X=data.loc[:, X], y=data.loc[:, y])
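
    For context, a usage sketch of the inputs both snippets assume: julearn expects a pandas DataFrame plus column names for X and y (the iris data is a stand-in example):

    from sklearn.datasets import load_iris

    data = load_iris(as_frame=True).frame  # pandas DataFrame with a "target" column
    y = "target"
    X = [col for col in data.columns if col != y]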

    Julearn vs scikit-learn

    Nested CV with hyperparameter tuning

    julearn:

    from julearn import run_cross_validation, PipelineCreator

    creator = PipelineCreator(problem_type="classification")
    creator.add("zscore", with_mean=[True, False])
    creator.add("pca", n_components=2)
    creator.add("svm", C=[1, 2], degree=[3, 4])

    # X_types optional
    run_cross_validation(
        X=X, y=y, data=data, model=creator, X_types={"continuous": X})

    scikit-learn:

    from sklearn.model_selection import cross_validate, GridSearchCV
    from sklearn.svm import SVC  # SVR in case of regression
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(StandardScaler(), PCA(), SVC())
    param_grid = {
        "standardscaler__with_mean": [True, False],
        "pca__n_components": [2],
        "svc__C": [1, 2],
        "svc__degree": [3, 4],
    }
    grid_pipeline = GridSearchCV(estimator=pipeline, param_grid=param_grid)
    cross_validate(estimator=grid_pipeline, X=data.loc[:, X], y=data.loc[:, y])
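
    A brief follow-up, assuming (hedged) that run_cross_validation returns per-fold results as a pandas DataFrame with a test_score column, as in recent julearn versions:

    scores = run_cross_validation(
        X=X, y=y, data=data, model=creator, X_types={"continuous": X})
    print(scores["test_score"].agg(["mean", "std"]))  # summarize across folds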

    Acknowledgements





    Fede Raimondo, Sami Hamdan, Kaustubh Patil, Shammi More, Vera Komeyer, Synchon Mandal, Leonard Sasse

    ML Hours (open to everyone)
    Consultancy on machine learning, every second Thursday, 2-4 pm
    Chat: https://matrix.to/#/#ml:inm7.de

    Even better together

    The ABCD-J platform

    An open source platform for digital biomarkers in neuro-medicine in NRW


      A collaboration between clinical, academic, and industry partners:

      Research Center Jülich

      RWTH Aachen
      University Bonn
      University Cologne
      HHU Düsseldorf

      LVR Clinics
      DZNE

      PeakProfiling
      CanControl
      IXP

      (open to future additions)

    Goals

    • Social: Promote and facilitate collaboration between multiple centers
    • Technical: Accelerate research through homogenization of workflows and processes, with emphasis on digital biomarker development; elevate existing open technical solutions for adoption in research practice

    Open research infrastructure

    Clinicians' point of view:
    • "Deep phenotyping": ecologically valid, multimodal data
    • Decentral data acquisition, standardized and reproducible
    • Focus on patient well-being and optimal treatment

    Patients' point of view:
    • Accurate diagnosis and optimal treatment
    • Strict data protection
    • Individual patient is central
    • Minimal disturbance in daily life

    Open Research Infrastructure

    Research data management's point of view
    • Technical solutions exist, but need elevation for adoption in research practice
    • Homogenization of workflows and processes fosters collaboration across sites
    • Decentralized approach with centralized services and web-based multi-center integration

    Front-end and Back-end

    • JTrack for decentral, ecologically valid acquisitions, complementing in-clinic assessments
    • JDash for study management, participant management, and analytics overview (derived study data at subject & group level)
    • Central data overview and analytics at FZJ
      • Provenance-tracked analysis and modeling
      • Automated meta-data extraction for data discoverability
      • Result overview for clinical decision making
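
    A minimal sketch of such provenance tracking via DataLad's run command; script and file names are example values:

        import datalad.api as dl

        ds = dl.Dataset("my-dataset")
        # Executes the command, records inputs/outputs, and saves the exact call in Git history
        ds.run(
            cmd="python code/extract_biomarker.py {inputs} {outputs}",
            inputs=["raw/sensor_data.csv"],
            outputs=["derivatives/biomarker.csv"],
            message="Extract digital biomarker",
        )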

    Opportunities

    Software improves with its use cases
    • JTrack integration with different types of wearables
    • JTrack integration of cognitive tasks and feedback to participants
    • Julearn integration into JDash
    • More meta-data extractors for DataLad
    • ...

    Current (first) steps

      Data cataloging
    • Leveraging legacy data via data census and meta-data catalog:
      • Improved discovery without direct data transfer
      • Homogenization of access request procedures
      • Establishing a legal basis for (re-)use
      • Example: data.sfb1451.de
    • Demonstrator for data infrastructure based on §21 data (standardized, anonymized hospital performance data that all German hospitals are legally required to submit yearly to InEK)
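
    As a hypothetical illustration, a minimal catalog record supporting discovery without direct data transfer; field names are illustrative, not the actual schema of data.sfb1451.de:

        record = {
            "title": "Example clinical acquisition",
            "keywords": ["digital biomarker", "EMA"],
            "license": "data use agreement required",
            "access_request": "datamanagement@example.org",  # hypothetical contact
            "files": {"count": 1200, "total_size_gb": 85},
        }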

    • Feasibility/Proof-of-concept study
    • Recommendations for common digital tools and workflows for recurring tasks and processes
    • Selection of digital measures from clinical routine for first trials

    Summary

    • Open Source (Research) Software supports a wide range of domain-general and domain-specific applications.
    • Open Science needs Open Source: for transparency and reproducibility, for science-specific requirements, for open formats, for re-use, and to enable interoperability across tools.
    • Collaboration across clinical and research settings is a technical, social, and legal challenge. Technical solutions alone won't save us, but they are a good first step.
    • We build clinical research infrastructure on open tools, for better science.

    Thanks!




    Questions?

    Inputs, Resources, Synergies