Part 2: Tracking Changes in DataLad Datasets

Modifying a Dataset

DataLad keeps track of all changes made to your dataset. In this section, you will add new content to the penguins dataset and see how these changes are tracked in the git log of your repository.

Command                  Description
datalad status           Show any untracked changes in the current dataset
datalad save             Save any untracked changes in the current dataset
datalad save -m "hi"     Save untracked changes and add the message "hi"
datalad unlock file.txt  Unlock file.txt to make it modifiable
git log                  View the dataset’s history, stored in the git log
git log -3               View the last 3 entries in the git log

Exercise 1 Create a new file in the penguins folder called penguin_species.txt and add the species names gentoo and adelie. Then, save the file and run datalad status to see the untracked changes.

On Linux/macOS:

echo -e "gentoo\nadelie" > penguin_species.txt
datalad status

On Windows:

(echo gentoo & echo adelie) > penguin_species.txt

or just use a text editor of your choice.
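
How echo treats "\n" depends on the shell, so a more portable way to write the file is printf. The following is a minimal sketch, assuming a POSIX shell on Linux/macOS. It runs in a throwaway directory (the workdir variable is introduced here only for illustration); in the exercise you would run the printf line inside the penguins dataset instead.

```shell
# Throwaway directory so the sketch does not touch a real dataset
# (an assumption for illustration only).
workdir="$(mktemp -d)"
cd "$workdir"

# printf interprets \n the same way in every POSIX shell
printf 'gentoo\nadelie\n' > penguin_species.txt

# Show the result: one species per line
cat penguin_species.txt
```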

Exercise 2 Use datalad save to save the untracked changes with the message "list penguin species".

datalad save -m "list penguin species"

Exercise 3 Open the git log and find the entry created by the previous datalad save command.

On Linux/macOS, this should be the most recent (i.e. top) entry; on Windows, it will be the second entry from the top.

git log

Exercise 4 Use datalad unlock to unlock the penguin_species.txt file and append chinstrap to the list. Then, run datalad save again with a message to save the changes.

On Linux/macOS:

datalad unlock penguin_species.txt
echo "chinstrap" >> penguin_species.txt
datalad save -m "add chinstrap"

On Windows:

datalad unlock penguin_species.txt
echo chinstrap >> penguin_species.txt
datalad save -m "add chinstrap"

Exercise 5 Open the git log again and find the entry from the datalad save command above.

git log
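
Because the dataset's history is an ordinary git history, the usual git log options can be used to shorten the output. The sketch below is an illustration under stated assumptions: it builds a scratch repository with plain git commits standing in for the two datalad save calls above, and the user name and e-mail are made up.

```shell
# Scratch repository standing in for the dataset. Assumption: plain git
# commits replace `datalad save` here so the sketch runs without DataLad.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"   # hypothetical identity

printf 'gentoo\nadelie\n' > penguin_species.txt
git add penguin_species.txt
git commit -q -m "list penguin species"

echo "chinstrap" >> penguin_species.txt
git commit -q -am "add chinstrap"

# One line per commit, newest first
git log --oneline

# Only the messages of the two most recent commits
git log -2 --format=%s
```

In a real dataset, only the two git log lines are needed.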

Running Scripts with DataLad

Often, we won’t edit our dataset manually but run scripts that do so. In this section you will use DataLad to run Python scripts and track the changes made by them. You are also going to use the dataset’s history to re-run the commands.

Command                                                                  Description
datalad run "python script.py"                                           Run the Python script script.py
datalad run --input "data.csv" --output "figure.png" "python script.py"  Run script.py with input "data.csv" and output "figure.png"
git log                                                                  View the dataset’s history, stored in the git log
datalad rerun a268d8ca22b6                                               Rerun the command recorded in the commit whose checksum starts with a268d8ca22b6
datalad rerun --since a268d8ca22b6                                       Rerun all commands recorded after the commit whose checksum starts with a268d8ca22b6

Exercise 6 Try to run the Python script code/aggregate_culmen_data.py. What error message do you observe?

datalad run "python code/aggregate_culmen_data.py"

Currently, the dataset does not contain the annexed content of the required files. On Linux/macOS, this results in FileNotFoundError: [Errno 2] No such file or directory. On Windows, you’ll see KeyError: "None of [Index(['Culmen Length (mm)', 'Culmen Depth (mm)', 'Species'], dtype='object')] are in the [columns]" because Python opens the pointer file in place of the data and then fails to find the expected columns in it.
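
The two failure modes can be imitated without DataLad. The sketch below is only an illustration, assuming a POSIX shell: on Linux/macOS, a file whose content is missing is typically a symlink pointing at a target that does not exist, while on Windows it is a small text file that holds only a pointer. The file names, link target and key are made up.

```shell
demo="$(mktemp -d)"
cd "$demo"

# Linux/macOS case: a symlink whose target does not exist (hypothetical path)
ln -s .git/annex/objects/missing-content locked.csv
if [ -e locked.csv ]; then
  echo "locked.csv: content present"
else
  echo "locked.csv: dangling link, opening it fails"
fi

# Windows case: a text file that contains only a pointer (made-up key)
printf '/annex/objects/MD5E-s123--0123456789abcdef.csv\n' > pointer.csv
head -n 1 pointer.csv

# The file can be opened, but the expected column header is not in it
if grep -qF 'Culmen Length (mm)' pointer.csv; then
  echo "pointer.csv: header found"
else
  echo "pointer.csv: header missing"
fi
```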

Exercise 7 Run the same script with the data/ folder as --input and the file "results/penguin_culmens.csv" as --output.

datalad run --input "data/" --output "results/penguin_culmens.csv" "python code/aggregate_culmen_data.py"

Exercise 8 Open the git log to view the entry created by the datalad run command. Then, copy the checksum of that commit.

git log

The git log entry should look like this:

commit af78031c9ca45d3c349a692b0afd332178639e64 (master)
Author: Ole Bialas <ole.bialas@posteo.de>
Date:   Thu Jul 31 11:14:00 2025 +0200

    [DATALAD RUNCMD] python code/aggregate_culmen_data.py

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "python code/aggregate_culmen_data.py",
     "dsid": "3a8aacc5-85f0-4114-adee-fcfa7d21a5df",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "data"
     ],
     "outputs": [
      "results/penguin_culmens.csv"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

The checksum is displayed on the first line, after “commit”: af78031c9ca45d3c349a692b0afd332178639e64
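
As an alternative to copying the checksum by hand, git can print it directly. The sketch below is a hedged illustration: it uses a scratch repository with empty plain-git commits whose messages imitate the dataset's history, and the run_commit variable and the identity are made up for the example.

```shell
# Scratch repository with commit messages imitating the dataset's history.
# Assumption: empty plain-git commits stand in for real DataLad commits.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"   # hypothetical identity
git commit -q --allow-empty -m "[DATALAD] new dataset"
git commit -q --allow-empty -m "[DATALAD RUNCMD] python code/aggregate_culmen_data.py"
git commit -q --allow-empty -m "unrelated later change"

# Full checksum of the most recent commit whose message contains the marker
run_commit="$(git log -1 --grep='DATALAD RUNCMD' --format=%H)"
echo "$run_commit"

# In a real dataset, the checksum could then be passed on directly:
# datalad rerun "$run_commit"
```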

Exercise 9 Use the copied checksum to rerun the previous datalad run command.

The command looks like this but the checksum will be different for everyone:

datalad rerun af78031c9ca45d3c349a692b0afd332178639e64

Exercise 10 Run the script code/plot_culmen_length_vs_depth.py. It takes results/penguin_culmens.csv as --input and produces results/culmen_length_vs_depth.png as --output.

datalad run --input "results/penguin_culmens.csv" --output "results/culmen_length_vs_depth.png" "python code/plot_culmen_length_vs_depth.py"

Exercise 11 Use the checksum of the very first commit (the one that says [DATALAD] new dataset) to rerun everything --since this commit (i.e. the whole analysis).

As before, the checksum will be different for everyone:

datalad rerun --since 4c2b9dcd5c745b519095f7bf6612bbe20f7ae9bb
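
The first commit can also be looked up without scrolling through the whole log. The sketch below assumes the history has a single root commit (a commit without a parent). It again uses a scratch repository with empty plain-git commits, and the first_commit variable and the identity are made up for the example.

```shell
# Scratch repository imitating the dataset's history (assumption: empty
# plain-git commits stand in for the real DataLad commits).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"   # hypothetical identity
git commit -q --allow-empty -m "[DATALAD] new dataset"
git commit -q --allow-empty -m "list penguin species"
git commit -q --allow-empty -m "add chinstrap"

# Root commits have no parent; list those reachable from HEAD
first_commit="$(git rev-list --max-parents=0 HEAD)"

# Show the checksum together with its message to confirm it is the right one
git log -1 --format='%H %s' "$first_commit"

# In a real dataset:
# datalad rerun --since "$first_commit"
```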

Further reading