There is "data management" in the title of my talk, and data management sometimes sounds like an unbelievably boring topic. I want to spice that up a bit by telling you about some scientific nightmares that you might know personally or fear. [EXAMPLES] Does some of that sound familiar to anyone? If yes, you're not alone - in fact, these problems have existed for so long that I paraphrased them from a paper that's older than a significant chunk of the audience. All of these things are, in some way or the other, some with higher and some with lower severity, data management failures. And this is actually where data management often gets the most attention - whenever it fails. It can fail on these smaller scales, affecting individual projects or individual scientists, or it can fail on a much larger scale, contributing to inefficiency or reproducibility crises in entire fields.

And obviously, problems like the ones I described aren't impossible to fix. There were people 30 years ago who fixed them already, and there are people now who still fix them; it's just that these fixes and best practices that individual people or labs might have are not widespread enough. In many cases, improvements in efficiency and reproducibility are not hard to achieve. There are vast amounts of tools and resources available - many are even being developed by people here in this room. The important first step is simply knowing that they exist.

Today I want to tell you a bit about the tool DataLad, a free and open source command line tool that can help with data and reproducibility management and, among other things, with those nightmares. Who in this room has heard about it already - a quick show of hands? Okay, for those who have never heard about it, I want to show you some example use cases to give you a sense of what the tool can do. As an individual scientist, you can use it to publish or consume DataLad datasets via common hosting services, from GitHub to the Open Science Framework or the Open Science Cloud. For example: this is a GitHub repo for an open neuroimaging dataset. You can take its URL and clone it with a datalad clone command, and after a few seconds of installation you can browse all of the available files in that dataset and then download individual contents on demand. DataLad is also used internally by a number of large hosting portals for data management - OpenNeuro, for example, uses it. And you profit from that, because you get a similar URL for a dataset like this one, which contains more than a TB of data, and in the same streamlined fashion you can clone it, browse it, and retrieve files on demand. Beyond data, you can use it to share and collaborate on open science projects. This is a paper by a few colleagues and me, again exposed as an easily findable GitHub repo, that shares not only the manuscript but also the code and data for that manuscript - and beyond that, the paper actually recomputes itself. So you can clone it, check the Git revision history, and recompute all results from scratch to check whether they are robust on your system, or whether they hold up if you tweak our code. And the next example is even cooler, because it illustrates the scale this can take: this here is an R bookdown-based supplement of a paper with 1.5 TB of openly shared data that others can explore and build upon. And a final interesting use case is as an institutional data management and archival system.
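To make that clone-and-browse workflow concrete, here is a minimal sketch; the dataset URL and the file path are illustrative stand-ins, and any URL that points to a DataLad dataset works the same way:

  datalad clone https://github.com/OpenNeuroDatasets/ds000001.git
  cd ds000001
  ls sub-01/anat                                # file names are browsable right away
  datalad get sub-01/anat/sub-01_T1w.nii.gz     # fetch just this file's content

A clone like this takes seconds because only the lightweight file listing is transferred; the actual data comes in later, with datalad get, and only for the files you ask for.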
So our institute uses many of the well-known large datasets, and to make discovery, updates, and access management possible, we have all of our original, preprocessed, and archived data available in a single bundle that can be cloned and browsed. And if a particular researcher is authorized to access a given dataset, for example ABCD, they can get its contents.

So this might have given you some ideas about DataLad. The three core features of this tool are:
- joint version control for data of any size or type, which allows you to version your data and software together with your code;
- provenance capture, which lets you create and share machine-readable and even re-executable provenance records;
- and decentralized data transport mechanisms, which allow you to install, share, and collaborate on projects; to publish, update, and retrieve project contents in a streamlined fashion and on demand; and to distribute files in a decentralized network using the services or infrastructure of your choice.

And I want to use the next hour or so to give you some first-hand experience with some of its commands, so that you get an idea of where it can help you personally with data management. But some prerequisites beforehand: like many other tools, DataLad is a command line tool that you use in the terminal, but it also has a Python API so that you can use it in scripts, and if you don't use Python, you can always call it via system calls. For this hands-on we can use the JupyterHub, and there we won't open a notebook, but the terminal.

The first part of a command always consists of the main datalad command, but that alone is not enough - if we run it, we get a message to add more. We can either add a flag to the main command, for example --version, or add one of many possible subcommands to make DataLad do a specific thing. For example, we can run datalad wtf, and DataLad tells us wtf is up with the system we're on. Those subcommands in most cases also take additional options to influence their behavior; for example, wtf has a -S flag which I can use to limit the output to a specific section, in this case the system. And if I want to know more about DataLad or a subcommand, I can just use the -h or --help flag to see commands, options, explanations, and examples.

Okay. We start with DataLad's core data structure, the so-called dataset. You can create a dataset from scratch, turn an existing directory into one, or clone datasets from other places. And once you have it, a dataset is simply a directory on your computer that DataLad manages - importantly, completely domain agnostic and without any custom data structure. [DATALAD CREATE] On a technical level, a dataset is a Git repository with an additional annex. On a practical level, this means that this dataset can version control files regardless of size or type. It also means that you have free choice - you can use DataLad's API, which adds functionality or makes some workflows easier, or stick to Git and git-annex commands. Let's take a look: [DATALAD SAVE]

Datasets thus allow exhaustive tracking of everything that might be relevant to a research project. And this is important, because those building blocks of science are rarely ever static. You know this from constantly evolving analysis code or manuscripts, but it affects data in the very same way. Data evolves, too. A dataset might just get extended with more data, but it can also contain errors that get fixed over time, or a future version of the BIDS standard might require naming changes.
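If you want to follow along, the command anatomy we just walked through looks like this in the terminal:

  datalad --version        # a flag on the main command
  datalad wtf              # a subcommand: report on the system DataLad runs on
  datalad wtf -S system    # a subcommand option: limit the report to one section
  datalad wtf --help       # help for the main command or any subcommand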
And you will want to know which version of your data was used when you ran your analysis, because the results depend on it. Here is a real-world example, where I discovered a bug in a public dataset and submitted a fix - but this means the data from 2016 and the data from 2019 are different. And with all of the building blocks changing, you can find yourself in the situation: "Shit, which version of which script produced these outputs from which version of what data... and with which software version?" Version control lays the foundation for resolving those questions, because the revision history you create is not only a diary to read up on what was done, when, and by whom - it can also be used actionably, to set or reset the states of your building blocks.

One way to think about the information captured in a revision history is digital provenance. This entails the tools and processes used to create a digital file, the responsible entity, and when and where the process events occurred. Much of this is captured in a commit automatically. But in a practical sense, we can add provenance on top of that. For example, did you ever get a script or file from a colleague, but forget where you got it from after a while? Or you find a figure, but when you revisit your project after some time, you have forgotten which script produced it? [DOWNLOAD URL] [RUN]

datalad run is a fairly simple wrapper. In its basic form, it will take anything expressed as a command line call, execute it, inspect the resulting modifications in the dataset after execution, and save those modifications together with a machine-readable record of the command that caused them. But even this basic functionality goes quite a long way: you can revisit a project after a while, and the files themselves carry information about how they were generated and modified. And you can repeat the process automatically.

We'll dive a bit more into provenance capture later, but we still have a data analysis to finish. And for this, we need data. Now, from the introduction you have already seen the basic process of cloning a dataset and then getting data, so I will not only talk about cloning datasets, but also about what we call nesting datasets. To motivate this, I want you to think of the evolution of a scientific project. A project is typically not a monolithic, stand-alone thing. Instead, it progresses through different stages, for example from raw data to preprocessed data, and from preprocessed data to analysis results. Many of these stages are multi-use: it makes sense to reuse the derived data for more than a single analysis, and it should be easy to do so. So one way we advocate to approach this is to keep the different phases of a project in modular units - in DataLad terms, as individual datasets - and then link them together as dependencies. We call the structure that emerges from this a superdataset-subdataset hierarchy. A superdataset is a dataset that contains another dataset, and the contained dataset is a subdataset. You can make these hierarchies as deep as you want, and once you have a hierarchy, you can issue DataLad commands recursively through it. The linkage does not affect the subdataset at all, while the superdataset gains lightweight and actionable provenance information about the subdataset: where it came from, and in which exact version it is registered. I will get back to the topic of modularity and nesting in a bit, but let's now look at data transport. Cloning datasets is always much faster than downloading the files they track.
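To give you a flavor of what such a re-executable provenance record looks like in practice, here is a minimal sketch; the script and file names are made up for illustration:

  datalad run -m "Create a summary figure" \
      --input "data/raw.csv" \
      --output "figures/summary.png" \
      "python code/plot.py"

  git log -1       # the run record lives in the commit message
  datalad rerun    # re-execute the most recent recorded command

--input tells DataLad to retrieve the needed content before execution, and --output makes sure the result can be updated and saved afterwards.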
In other words, freshly cloned datasets are lean - only a fraction of the size of the data they give you access to. What they contain is something I like to simplify as metadata: the names of the available files and where to get them from. They do not contain the files' content; file contents can be retrieved on demand. There is one immediate advantage to this: you can have access to more data on your computer than you have disk space for. My own computer has access to at least half a petabyte of data, and I can flexibly get only those files that I actually need. You can also revert a get, and drop file content that is not needed, to free up disk space without affecting your ability to re-retrieve it. This opens the door for disk-space-aware computations that retrieve data at the start and drop it again at the end.

Now, this behavior is different from what you are used to seeing with Git, and explaining this transport is tightly connected to the question of why there are two version control tools at work. The reason starts with the fact that Git, as powerful as it is, does not handle large files well. I'm oversimplifying Git internals now, but here is how you can approximately imagine that Git works: with every commit you make, Git inspects the state of the files in your repository. If a file has changed, it needs to store the new version as well as the previous version for you to be able to go back in time. If a file hasn't changed, Git simply points to its previous version. As Git was made to handle text files, it has very clever and efficient strategies for representing differences between file versions - it doesn't actually need to store two full copies of a file in different versions, it has means to represent just the difference between them. However, it does not have those means for all file types. If you commit a 5 GB NIfTI image into a Git repository and then make a tiny change to the header metadata - just a few bytes - Git has no way of representing that difference. Instead, it is forced to store an entire second copy. So when you version control large binary files, you inflate your Git repository. And if you do that too much, performance will drop, cloning will take ages, and you may eventually break the repository. Nevertheless, you could keep this up for quite a stretch; the practical problem is often not your Git repository, but your repository hosting service, because most services will refuse to accept files beyond a certain size in your history - after all, they would be hosting all of that data.

And this is where git-annex saves the day. What it allows is to version control large files with Git, but without putting their content into Git. On a conceptual level, a dataset has a Git part and a git-annex part. You've seen plenty of the Git part, including the revision history, so let's take a look at the annex. [symlinks, distributed availability] Keeping things in the annex has two major consequences: file content is not available right after cloning and must be retrieved on demand, and annexed files are protected against accidental modifications in order not to break the association between file identity and storage. This means the two version control tools are differentially suited for different types of files.
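You can see both consequences for yourself in any dataset with annexed files; the file name below is hypothetical:

  ls -l input/sub-01_T1w.nii.gz               # an annexed file is a symlink into .git/annex/objects
  datalad get input/sub-01_T1w.nii.gz         # retrieve its content
  git annex whereis input/sub-01_T1w.nii.gz   # list all known locations of the content
  datalad drop input/sub-01_T1w.nii.gz        # free the disk space again

drop only succeeds if git-annex can verify that at least one other copy of the content remains available somewhere - that check is what keeps re-retrieval safe.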
You might want to keep text files that you frequently modify in Git, or files like data usage agreements that you want to have immediately available after cloning, but you would want larger or binary files in the annex, and also private files whose content you don't necessarily want to share with everyone who gets a copy of the dataset. So which files go where is a decision you make based on your use case, and DataLad has many means to configure it. You can use run-procedures, such as text2git, that come with premade configurations. text2git is very simple: it will keep any text file in Git, and the rest goes into the annex. But you can also make up your own rules, and annex files based on their size, file type, location, name pattern, and so forth.

What git-annex does, summarized in a few sentences, is the following: it calculates a unique identifier based on a hash of a file's content. Instead of committing the file content to Git, which could be large, it commits this unique identifier, which is just a few characters long. If the file system allows it, it then moves the file content into the dataset's annex and puts a symlink with the file's name in its place, pointing to that content.

Ninety minutes are not enough time for a complete DataLad course, but I want to introduce you to the three core features and give you an idea of how they can aid with data management. You can use as many or as few features as you like - solely consuming data is fine, too. I want to give you a run-down of these features by setting up a neuroimaging analysis.

VERSION CONTROL

  datalad create -c text2git my-analysis
  cd my-analysis
  ls -a

  # create a README
  echo "# Example DataLad dataset" > README
  datalad status
  datalad save -m "Adding a README placeholder"
  echo "This dataset contains a toy analysis" >> README
  git diff
  datalad save -m "Extending the README"

  # register a Singularity image as a software container
  datalad containers-add nilearn --url shub://adswa/nilearn
  datalad containers-list

PROVENANCE TRACKING

  mkdir code
  # what kind of analysis? -> brain mask extraction
  wget -O code/get_brainmask.py https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
  datalad save -m "Download script for brainmasking from colleague"

  # download-url saves the file together with a record of its origin
  datalad download-url -O code/nilearn-tutorial.pdf https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf

  # maybe change this script somehow - e.g., apply black formatting
  black code/get_brainmask.py
  git diff
  git restore code/get_brainmask.py
  datalad run "black code/get_brainmask.py"

DATA TRANSPORT

  # get input data in a specific version; 'bids-data' stands for the URL or path of the input dataset
  datalad clone -d . bids-data input
  datalad -f json_pp status input
  datalad get input/sub-02/...
  datalad drop input/sub-02
  git-annex whereis input/sub-02

  # run the analysis script inside the registered container, capturing provenance
  datalad containers-run -n nilearn "python code/get_brainmask.py"
  datalad rerun

CLEAN UP

(This talk wouldn't have "data management" in its title if we didn't clean up after ourselves.)

  datalad drop input/sub-02/func/...
  datalad drop --what all input      # drop all file content in the subdataset
  ls input                           # the file listing is still there
  datalad get input                  # ...and content can come back on demand
  datalad remove input               # unregister and delete the subdataset entirely

  datalad drop one-of-the-figures                           # fails: no other copy of the content is known
  datalad drop --reckless availability one-of-the-figures   # override the availability check

  cd ../
  datalad remove -d my-analysis      # remove the whole analysis dataset
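One follow-up on the configuration question from the beginning of this part: custom annexing rules live in .gitattributes and use git-annex's largefiles expressions. A minimal sketch, with an arbitrary size threshold:

  # annex only files larger than 100 KB; keep everything under code/ in Git
  echo '* annex.largefiles=(largerthan=100kb)' >> .gitattributes
  echo '* annex.largefiles=nothing' >> code/.gitattributes
  datalad save -m "Add custom annexing rules"

Files matching the annex.largefiles expression go to the annex when you save; 'nothing' means a file is never annexed and ends up in Git instead.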