Part 2: Tracking Changes in DataLad Datasets

Modifying a Dataset

DataLad keeps track of all changes made to your dataset. In this section, you will add new content to the penguins dataset and see how these changes are tracked in the git log of your repository.

Command	Description
`datalad status`	Show any untracked changes in the current dataset
`datalad save`	Save any untracked changes in the current dataset
`datalad save -m "hi"`	Save untracked changes and add the message `"hi"`
`datalad unlock file.txt`	Unlock `file.txt` to make it modifiable
`git log`	View the dataset’s history, stored in the `git log`
`git log -3`	View the last `3` entries in the `git log`

Exercise 1 Create a new file in the penguins folder called penguin_species.txt and add the species names gentoo and adelie. Then, save the file and run datalad status to see the untracked changes.

Solution

On Linux/macOS:

echo -e "gentoo\nadelie" > penguin_species.txt
datalad status

On Windows:

(echo gentoo & echo adelie) > penguin_species.txt
datalad status

or just use a text editor of your choice.
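If the differences between shells get in the way, the file can also be written from Python, which behaves the same on Linux, macOS and Windows. The following is only a sketch: the helper name `write_species` is made up for illustration and is not part of DataLad.

```python
def write_species(path, species):
    """Write one species name per line and return the text that was written."""
    text = "\n".join(species) + "\n"
    # newline="\n" keeps the line endings identical on every platform
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
    return text
```

Calling `write_species("penguin_species.txt", ["gentoo", "adelie"])` from the penguins folder creates the file; `datalad status` should then list it as untracked.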

Exercise 2 Use datalad save to save the untracked changes with the message "list penguin species".

Solution

datalad save -m "list penguin species"

Exercise 3 Open the git log and find the entry created by the previous datalad save command.

Solution

On Linux/macOS this should be the most recent (i.e. top) entry; on Windows it will be the second most recent.

git log

Exercise 4 Use datalad unlock to unlock the penguin_species.txt file and append chinstrap to the list. Then, run datalad save again with a message to save the changes.

Solution

On Linux/macOS:

datalad unlock penguin_species.txt
echo "chinstrap" >> penguin_species.txt
datalad save -m "add chinstrap"

On Windows:

datalad unlock penguin_species.txt
echo chinstrap >> penguin_species.txt
datalad save -m "add chinstrap"

Exercise 5 Open the git log again and find the entry from the datalad save command above.

Solution

git log

Running Scripts with DataLad

Often, we do not edit a dataset by hand but run scripts that modify it. In this section, you will use DataLad to run Python scripts and track the changes they make. You will also use the dataset’s history to re-run those commands.

Command	Description
`datalad run "python script.py"`	Run the `python` script `script.py`
`datalad run --input "data.csv" --output "figure.png" "python script.py"`	Run `script.py` with input `"data.csv"` and output `"figure.png"`
`git log`	View the dataset’s history stored in the `git log`
`datalad rerun a268d8ca22b6`	Rerun the command from the `git log` whose checksum starts with `a268d8ca22b6`
`datalad rerun --since a268d8ca22b6`	Rerun ALL commands `--since` the one whose checksum starts with `a268d8ca22b6`

Exercise 6 Try to run the Python script code/aggregate_culmen_data.py. What error message do you observe?

Solution

datalad run "python code/aggregate_culmen_data.py"

Currently, the dataset does not contain the annexed content for the required files. On Linux/macOS, this results in FileNotFoundError: [Errno 2] No such file or directory. On Windows, you’ll see KeyError: "None of [Index(['Culmen Length (mm)', 'Culmen Depth (mm)', 'Species'], dtype='object')] are in the [columns]", because Python actually opens the pointer file and crashes when it can’t find the required data in it.
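To see which of the two situations applies to a given file, a rough check can be written in Python. This is a heuristic sketch, not a DataLad feature: the function name `annex_state` is hypothetical, and the code assumes that git-annex pointer files begin with the text `/annex/objects/` and that missing content on Linux/macOS shows up as a symlink whose target does not exist.

```python
import os

# Assumption: git-annex pointer files start with this prefix
POINTER_PREFIX = b"/annex/objects/"


def annex_state(path):
    """Classify a file as 'dangling-link', 'pointer' or 'present'."""
    # A symlink whose target is missing: opening it raises FileNotFoundError
    if os.path.islink(path) and not os.path.exists(path):
        return "dangling-link"
    # Read only the first few bytes; pointer files are tiny text files
    with open(path, "rb") as f:
        head = f.read(len(POINTER_PREFIX))
    return "pointer" if head == POINTER_PREFIX else "present"
```

A result of `"dangling-link"` or `"pointer"` for a file under data/ would suggest that its content has not been retrieved yet.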

Exercise 7 Run the same script with the data/ folder as --input and the file "results/penguin_culmens.csv" as --output.

Solution

datalad run --input "data/" --output "results/penguin_culmens.csv" "python code/aggregate_culmen_data.py"
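When the same pattern is needed for several scripts, the call can be assembled in Python. The sketch below only builds and prints the argument list and does not execute anything; the helper name `datalad_run_args` is hypothetical, and it uses only the `--input` and `--output` flags shown above.

```python
import shlex


def datalad_run_args(command, inputs=(), outputs=()):
    """Build the argument list for a `datalad run` call (nothing is executed)."""
    args = ["datalad", "run"]
    for path in inputs:
        args += ["--input", path]
    for path in outputs:
        args += ["--output", path]
    # The command is passed as one argument, like the quoted string in the shell
    args.append(command)
    return args


args = datalad_run_args(
    "python code/aggregate_culmen_data.py",
    inputs=["data/"],
    outputs=["results/penguin_culmens.csv"],
)
print(shlex.join(args))  # shell-quoted form of the same call
```

Passing the list to `subprocess.run(args, check=True)` from the dataset root would then execute the command.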

Exercise 8 Open the git log to view the entry created by the datalad run command. Then, copy the checksum of that commit.

Solution

git log

The git log entry should look like this:

commit af78031c9ca45d3c349a692b0afd332178639e64 (master)
Author: Ole Bialas <ole.bialas@posteo.de>
Date:   Thu Jul 31 11:14:00 2025 +0200

    [DATALAD RUNCMD] python code/aggregate_culmen_data.py

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "python code/aggregate_culmen_data.py",
     "dsid": "3a8aacc5-85f0-4114-adee-fcfa7d21a5df",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "data"
     ],
     "outputs": [
      "results/penguin_culmens.csv"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

The checksum is displayed on the first line, after “commit”: af78031c9ca45d3c349a692b0afd332178639e64
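The block between the two "Do not change" markers is a JSON record of the command together with its inputs and outputs. As an illustration of how that record is structured, it can be pulled out of a commit message with a few lines of Python. This is a sketch based only on the log entry shown above; `extract_run_record` is a made-up helper, not a DataLad function.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(message):
    """Return the JSON run record embedded in a commit message as a dict."""
    start = message.index(START) + len(START)
    end = message.index(END, start)
    # json.loads ignores the surrounding whitespace and indentation
    return json.loads(message[start:end])


# Commit message copied from the log entry above
message = """[DATALAD RUNCMD] python code/aggregate_culmen_data.py

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "python code/aggregate_culmen_data.py",
     "dsid": "3a8aacc5-85f0-4114-adee-fcfa7d21a5df",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "data"
     ],
     "outputs": [
      "results/penguin_culmens.csv"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^
"""

record = extract_run_record(message)
print(record["cmd"], record["inputs"], record["outputs"])
```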

Exercise 9 Use the copied checksum to rerun the previous datalad run command.

Solution

The command looks like this but the checksum will be different for everyone:

datalad rerun af78031c9ca45d3c349a692b0afd332178639e64

Exercise 10 Run the script code/plot_culmen_length_vs_depth.py. It takes results/penguin_culmens.csv as --input and produces results/culmen_length_vs_depth.png as --output.

Solution

datalad run --input "results/penguin_culmens.csv" --output "results/culmen_length_vs_depth.png" "python code/plot_culmen_length_vs_depth.py"

Exercise 11 Use the checksum of the very first commit (that says [DATALAD] new dataset) to re-run everything --since this commit (i.e. the whole analysis).

Solution

As in Exercise 9, the checksum will be different for everyone:

datalad rerun --since 4c2b9dcd5c745b519095f7bf6612bbe20f7ae9bb
