Data management

Recap reproducible execution, Git-annex basics, siblings

Outline

Create a data analysis project (keep the YODA principles in mind)
Install input data as a subdataset
Write a script to analyze the input data (Python and MATLAB templates exist)
Execute the analysis reproducibly

Use the following commands

datalad create -c yoda
datalad install
datalad save -m "..."
datalad run
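
The four commands map onto the outline steps roughly as follows. This is a minimal sketch, not the workshop's reference solution: the function name `create_analysis_project`, its two arguments, the `input/` path, and the script name `code/script.py` are placeholders chosen for illustration.

```shell
# Sketch of the outline steps; all names and paths are illustrative assumptions.
create_analysis_project() {
    project=$1      # name of the new analysis dataset (hypothetical)
    input_url=$2    # clone URL of the dataset that holds the input data

    # 1. create a dataset with the YODA configuration applied
    datalad create -c yoda "$project" || return 1
    cd "$project" || return 1

    # 2. install the input data as a subdataset of the project
    datalad install -d . -s "$input_url" input/ || return 1

    # 3. after writing the analysis script (assumed: code/script.py), save it
    datalad save -m "Add analysis script" code/script.py || return 1

    # 4. execute the script and record the command together with its results
    datalad run -m "Run the analysis" "python3 code/script.py"
}
```

Because the function changes the working directory, it is best called in a subshell, for example `(create_analysis_project myanalysis <url-of-input-dataset>)`.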

A classification analysis on the iris flower dataset

Either in MATLAB or Python (whichever you prefer)

Git versus Git-annex

Reminder: Git and Git-annex handle files differently:

Files stored in Git are modifiable
Files stored in Git-annex are content-locked

Understanding the reasons behind this can be helpful

 for i in recordings/longnow/Long_Now__Seminars*/*.mp3; do
    # get the filename
    base=$(basename "$i");
    # strip the extension
    base=${base%.mp3};
    # date as yyyy-mm-dd
    printf '%s\t' "${base%%__*}" | tr '_' '-';
    # name and title without underscores
    printf '%s\n' "${base#*__}" | tr '_' ' ';
 done

⮊ This for loop in the shell prints each file name to the terminal as a date followed by speaker and title.

⮊ Redirection to a file with > writes the stream to a file instead of the terminal.

⮊ Note: This could be any script or shell command!
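
To see what the parameter expansions in the loop do, they can be applied to a single file name. This is a small sketch: the helper name `format_title` and the sample path are assumptions made for illustration (the file name is modeled on the first entry of the resulting podcast list).

```shell
# Turn "<yyyy_mm_dd>__<Speaker>__<Title>.mp3" into "<yyyy-mm-dd><TAB><Speaker  Title>".
format_title() {
    base=$(basename "$1")   # drop the directory part
    base=${base%.mp3}       # strip the extension
    # ${base%%__*} removes the longest suffix starting at "__": only the date remains
    printf '%s\t' "${base%%__*}" | tr '_' '-'
    # ${base#*__} removes the shortest prefix ending in "__": speaker and title remain
    printf '%s\n' "${base#*__}" | tr '_' ' '
}

# Hypothetical sample path
format_title "recordings/longnow/2003_11_15__Brian_Eno__The_Long_Now.mp3"
```

The call prints `2003-11-15`, a tab, and `Brian Eno  The Long Now`. The double space is what remains of the `__` separator between speaker and title.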

A basic datalad run command

datalad run

* Running scripts from the command line, using tools from the command line, ...
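
A minimal sketch of such a call, using the command and the commit message that appear in the run record of this example. The wrapper function `list_titles` is only an illustrative name; the `datalad run` call inside it is the relevant part, and it assumes that the script was saved as `code/list_titles.sh`.

```shell
# Hypothetical wrapper; call it from the root of the dataset.
list_titles() {
    # -m sets the human-readable commit message; the quoted string is the
    # command that DataLad executes and records, including the redirection.
    datalad run -m "create a list of podcast titles" \
        "bash code/list_titles.sh > recordings/podcasts.tsv"
}
```

Quoting the whole command keeps the `>` redirection inside the recorded command instead of redirecting the output of `datalad run` itself.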

Run-records link dataset modifications to commands

commit f4a35c8841062eb58f65dbf3cde70ccdc3c9df68 (HEAD -> master)
Author: Adina Wagner <adina.wagner@t-online.de>
Date:   Mon Nov 11 09:55:02 2019 +0100

    [DATALAD RUNCMD] create a list of podcast titles

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
     "dsid": "02a84dae-faf5-11e9-ba9f-e86a64c8054c",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

diff --git a/recordings/podcasts.tsv b/recordings/podcasts.tsv
new file mode 100644
index 0000000..f691b53
--- /dev/null
+++ b/recordings/podcasts.tsv
@@ -0,0 +1,206 @@
+2003-11-15     Brian Eno  The Long Now
+2003-12-13     Peter Schwartz  The Art Of The Really Long View
+2004-01-10     George Dyson  There s Plenty of Room at the Top  Long term Thinking About Large scale Computing
[...]

It follows logically: If a command does not lead to any modification in a dataset, it will not be recorded!

Oh! An error in the code...

DataLad-101 layout:

datalad rerun

Re-execute previous datalad run commands

What shall be rerun can be specified via its commit hash:

datalad rerun f4a35c884106

... but also via tag, revision specifications with HEAD, ..., or by giving a range of commits.
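
The two ways of specifying what to rerun could be sketched like this. The function names `rerun_commit` and `rerun_since` are hypothetical helpers; `--since` is the `datalad rerun` option for re-executing a range of commits.

```shell
# Hypothetical helpers around "datalad rerun".
rerun_commit() {
    # re-execute the run record stored in one commit (hash, tag, HEAD~2, ...)
    datalad rerun "$1"
}

rerun_since() {
    # re-execute every run record made after the given revision, up to HEAD
    datalad rerun --since "$1"
}
```

For example, `rerun_commit f4a35c884106` repeats the podcast-list command from the run record above.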

Summary - Basic datalad run

datalad run records a command's impact on a dataset.

A record is only made if the command leads to dataset modifications

The command captures provenance for humans and machines

A machine-readable run record is created automatically; you only need to provide a commit message.

datalad rerun can take any previous datalad run commit hash and re-execute it.

This saves you from having to remember the exact command!

datalad diff and git diff are useful helpers to explore changes between version states of a dataset.

... but there is more that this command can do for you:

The anatomy of DataLad error messages

"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
[INFO   ] == Command start (output follows) =====
convert-im6.q16: unable to open image `recordings/longnow/.datalad/feed_metadata/logo_salt.jpg': No such file or directory @ error/blob.c/OpenBlob/2874.
convert-im6.q16: no images defined `recordings/salt_logo_small.jpg' @ error/convert.c/ConvertImageCommand/3258.
[INFO   ] == Command exit (modification check follows) =====
[INFO   ] The command had a non-zero exit code. If this is expected, you can save the changes with 'datalad save -d . -r -F .git/COMMIT_EDITMSG'
CommandError: command 'convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' failed with exitcode 1
Failed to run 'convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg' under '/demo/DataLad-101'. Exit code=1.

--input in datalad run

datalad get

Content-locked files (vastly simplified)

Files are given to Git-annex or Git, based on dataset configuration about file type, size, or name.
Git-annex removes write permission from the file content it stores. This prevents accidental modifications.
datalad unlock can unlock content for modification. datalad save will lock content again.
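
Done by hand, the unlock, modify, save cycle might look like the sketch below. The helper name `append_to_locked_file` and its arguments are illustrative assumptions, not part of DataLad.

```shell
# Hypothetical helper: append one line to an annexed (content-locked) file.
append_to_locked_file() {
    file=$1
    line=$2
    # make the file content writable
    datalad unlock "$file" || return 1
    # modify the content
    printf '%s\n' "$line" >> "$file"
    # save the modification; this locks the content again
    datalad save -m "Append a line to $file" "$file"
}
```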

--output in datalad run

datalad unlock
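
With both options, the resize command from the error-message example could be written as in the following sketch. The wrapper name `resize_logo` and the commit message are assumptions; `{inputs}` and `{outputs}` are placeholders that `datalad run` expands to the declared paths.

```shell
# Hypothetical wrapper around the resize command; call it from the dataset root.
resize_logo() {
    # --input is retrieved before execution, --output is unlocked before execution
    datalad run -m "Resize logo for slides" \
        --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
        --output "recordings/salt_logo_small.jpg" \
        "convert -resize 400x400 {inputs} {outputs}"
}
```

Declaring the input addresses the "No such file or directory" failure shown earlier, where the image content had not been retrieved yet.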

Analysis provenance capture

Easy provenance capture!

Advice:

Use --input and --output
Attach helpful commit messages
Make sure to have a clean dataset state

Summary - Reproducible execution with datalad run

datalad run records a command's impact on a dataset.

This usually requires a "clean" dataset status (no unsaved modifications)

Files given with --input to the datalad run command are retrieved (if necessary) before the command is executed.

This is done with a datalad get in the background.

Files given with --output to the datalad run command are unlocked (if necessary) for modification before the command is executed.

This is done with a datalad unlock in the background.

Outlook: computational reproducibility

It may not be enough to record inputs, code, and outputs of an analysis!
Without sufficient information about required software (versions), analyses may fail to reproduce or even run.
The DataLad extension datalad-container can also capture complete software environments.
Get a preview soon: the chapter on extensions is close to being finished
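
As a preview, a container-based run might be sketched as follows. It assumes that the datalad-container extension is installed; the function names, the container name, the image URL, and `code/script.py` are placeholders for illustration.

```shell
# Hypothetical helpers; they require the datalad-container extension.
register_container() {
    # record a software container image in the dataset under a name
    datalad containers-add "$1" --url "$2"
}

run_in_container() {
    # like "datalad run", but the command is executed inside the named container
    datalad containers-run -m "Run the analysis in a container" \
        --container-name "$1" \
        "python3 code/script.py"
}
```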

Now what can I do with that?

Reproducible analysis with datalad run

Practice @home

Wrap any simple shell command (e.g., cp) in a datalad run, and (later) also scripts of yours

How does a here-document work?


    $ cat << EOT > notes.txt
    One can create a new dataset with 'datalad create [--description] PATH'.
    The dataset is created empty

    EOT

Two delimiting identifiers (EOT) wrap any amount of text into a stream
The << characters redirect the stream into standard input for the cat command
The > character redirects the standard output of cat and writes it into a new file notes.txt

Why is it used?

Allows pretty formatting (e.g., line breaks)
Allows writing documents from the terminal
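
One detail matters when a here-document is used to write a script: with an unquoted delimiter the shell expands variables in the text before cat receives it. The sketch below shows the three variants; the file names and the variable `name` are made up for illustration.

```shell
workdir=$(mktemp -d)
name="DataLad"

# Unquoted delimiter: $name is expanded before cat receives the text.
cat << EOT > "$workdir/expanded.txt"
Hello from $name
EOT

# Escaping the dollar sign keeps it literal in the written file.
cat << EOT > "$workdir/escaped.txt"
Hello from \$name
EOT

# Quoting the delimiter disables expansion for the whole text.
cat << 'EOT' > "$workdir/quoted.txt"
Hello from $name
EOT

cat "$workdir/expanded.txt" "$workdir/escaped.txt" "$workdir/quoted.txt"
```

Only the first file contains `Hello from DataLad`; the other two contain the literal text `Hello from $name`.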
