I have a dataset on my computer. How can I share it, or collaborate on it?
Glossary
Sibling (remote)
Linked clones of a dataset. You can usually update from (and push to) siblings to keep all of them in sync
(e.g., ongoing data acquisition stored on the experiment computer and backed up on a cluster and an external hard drive)
Repository hosting service
Web services that host Git repositories, such as GitHub, GitLab, Bitbucket, Gin, ...
Third-party storage
Infrastructure (private/commercial/free/...) that can host data. A "special remote" protocol
is used to publish or pull data to and from it
Publishing datasets
Pushing dataset contents (Git and/or annex) to a sibling using datalad push
Updating datasets
Pulling new changes from a sibling using datalad update --merge
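The two glossary operations above can be tried without any online account. The following is a minimal sketch, assuming DataLad and git-annex are installed and a Git identity is configured. The dataset name my-dataset, the sibling name backup, and the file data.csv are made up for illustration, and a local directory stands in for a remote machine.

```shell
# Throwaway working area; all names and paths below are hypothetical.
work=$(mktemp -d)
cd "$work"

# Create a dataset and save one file in it.
datalad create my-dataset
cd my-dataset
echo "subject,score" > data.csv
datalad save -m "Add first data file"

# Create a sibling named "backup" at a local path, then publish to it.
datalad create-sibling -s backup "$work/backup"
datalad push --to backup

# Later: fetch and merge whatever has changed in the sibling.
datalad update -s backup --merge
```

Here the update step finds nothing new, because the sibling was just pushed to; it becomes useful once someone else has changed the sibling.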
Publishing datasets
Behind the scenes, most public datasets keep part of their content in Git and the rest in git-annex
Publishing datasets
Typical case:
Datasets are exposed via a private or public repository on a
repository hosting service
Annexed data usually can't be stored in the repository hosting service, but can be
kept in almost any third-party storage
Publication dependencies automate pushing to the correct place
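A publication dependency could be set up roughly as follows. This is a sketch under stated assumptions, not the setup from the original slide: two local directories stand in for the real services, and the sibling names storage (third-party storage) and hosting (repository hosting service) are hypothetical. With a real hosting service, only the Git history would land there, while annexed file content would go to the storage sibling.

```shell
# Throwaway working area; all names and paths below are hypothetical.
work=$(mktemp -d)
cd "$work"
datalad create my-dataset
cd my-dataset
echo "example" > data.txt
datalad save -m "Add data"

# "storage" stands in for third-party storage,
# "hosting" for a repository hosting service.
datalad create-sibling -s storage "$work/storage"
datalad create-sibling -s hosting "$work/hosting"

# Declare that a push to "hosting" must first push to "storage".
datalad siblings configure -s hosting --publish-depends storage

# A single push now serves both siblings, storage first.
datalad push --to hosting
```

The dependency is recorded in the dataset's Git configuration, so later pushes to the hosting sibling keep publishing to storage first without extra arguments.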
DataLad can create siblings from the command line for the following services:
GitHub: datalad create-sibling-github
GitLab: datalad create-sibling-gitlab
Gin: datalad create-sibling-gin
Gogs: datalad create-sibling-gogs
Local or remote paths: datalad create-sibling
RIA stores: datalad create-sibling-ria
Open Science Framework (needs datalad-osf): datalad create-sibling-osf
(Additional services are being worked on: WebDAV-based services such as Sciebo,
EBRAINS; if you need something else, get in touch)
Cloning DataLad datasets
What does cloning a dataset feel like for a consumer?
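From the consumer's side, the experience can be sketched as below. This assumes DataLad and git-annex are installed and a Git identity is configured. The dataset called published is a local stand-in, created here only so the example is self-contained; in practice the clone source would be a URL on a hosting service. All names are hypothetical.

```shell
# Throwaway working area; all names and paths below are hypothetical.
work=$(mktemp -d)
cd "$work"

# Stand-in for a published dataset (normally it lives on a hosting service).
datalad create published
echo "some data" > published/data.txt
datalad save -d published -m "Add data"

# Consumer side: cloning brings file names and history, not annexed content.
datalad clone "$work/published" my-clone
cd my-clone
ls    # data.txt is listed, but its content is not yet available locally

# Retrieve the file content on demand, then read it.
datalad get data.txt
cat data.txt
```

The clone stays lightweight because annexed file content is only fetched when datalad get asks for it.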
Cloning DataLad datasets
Let's take a look at the special cases:
Requires the DataLad extension datalad-osf
Summary: Data publication
Datasets can have "siblings", linked clones in other places
Those can be local or remote, on commercial, free, or personal infrastructure
Typical repository hosting services do not host annexed contents
A notable exception is Gin
Typical storage providers do not host Git repositories
but DataLad extensions can make it possible for certain services, such as the OSF
Despite the different possible services, operations are streamlined
clone installs datasets, get retrieves data,
push publishes (new changes in) datasets,
update pulls dataset updates.
This remains the case even if the underlying data hosting changes.
Siblings serve multiple purposes:
Personal backup that's easy to sync;
Publicly or privately exposed files to share with (selected) others;
Entrypoints for collaborations or others' contributions; ...