Part 2: Tracking Changes in DataLad Datasets
Modifying a Dataset
DataLad keeps track of all changes made to your dataset. In this section, you will add new content to the penguins dataset and see how these changes are tracked in the git log of your repository.
Command | Description |
---|---|
datalad status | Show any untracked changes in the current dataset |
datalad save | Save any untracked changes in the current dataset |
datalad save -m "hi" | Save untracked changes and add the message "hi" |
datalad unlock file.txt | Unlock file.txt to make it modifiable |
git log | View the dataset’s history, stored in the git log |
git log -3 | View the last 3 entries in the git log |
Exercise 1 Create a new file in the penguins folder called penguin_species.txt and add the species names gentoo and adelie. Then, save the file and run datalad status to see the untracked changes.
On Linux/macOS:
echo -e "gentoo\nadelie" > penguin_species.txt
datalad status
On Windows:
(echo gentoo & echo adelie) > penguin_species.txt
datalad status
Alternatively, create the file with a text editor of your choice and then run datalad status.
Exercise 2 Use datalad save to save the untracked changes with the message "list penguin species".
datalad save -m "list penguin species"
Exercise 3 Open the git log and find the entry created by the previous datalad save command.
On Linux/macOS this should be the most recent (i.e. top) entry; on Windows it will be the second entry from the top.
git log
Exercise 4 Use datalad unlock to unlock the penguin_species.txt file and append chinstrap to the list. Then, run datalad save again with a message to save the changes.
On Linux/macOS:
datalad unlock penguin_species.txt
echo -e "chinstrap" >> penguin_species.txt
datalad save -m "add chinstrap"
On Windows:
datalad unlock penguin_species.txt
echo chinstrap >> penguin_species.txt
datalad save -m "add chinstrap"
Exercise 5 Open the git log again and find the entry from the datalad save command above.
git log
Running Scripts with DataLad
Often, we won’t edit our dataset manually but run scripts that do so. In this section you will use DataLad to run Python scripts and track the changes made by them. You are also going to use the dataset’s history to re-run the commands.
Command | Description |
---|---|
datalad run "python script.py" | Run the Python script script.py |
datalad run --input "data.csv" --output "figure.png" "python script.py" | Run script.py with input "data.csv" and output "figure.png" |
git log | View the dataset’s history stored in the git log |
datalad rerun a268d8ca22b6 | Rerun the command from the git log entry whose checksum starts with a268d8ca22b6 |
datalad rerun --since a268d8ca22b6 | Rerun all commands recorded since the entry whose checksum starts with a268d8ca22b6 |
Exercise 6 Try to run the Python script in code/aggregate_culmen_data.py. What error message do you observe?
datalad run "python code/aggregate_culmen_data.py"
Currently, the dataset does not contain the annexed content for the required files. On Linux/macOS, this results in FileNotFoundError: [Errno 2] No such file or directory.
On Windows, you’ll instead see KeyError: "None of [Index(['Culmen Length (mm)', 'Culmen Depth (mm)', 'Species'], dtype='object')] are in the [columns]".
There, Python actually opens the pointer file and then crashes because the file does not contain the required data.
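To see which files are affected before running the script, you can look for files whose content is not available locally. The following is a minimal sketch, assuming the dataset uses git-annex's symlink layout (typical on Linux/macOS), where a file without local content shows up as a dangling symlink. The helper names `annex_content_missing` and `list_missing` are made up for this illustration, and the check does not detect the pointer files used on Windows.

```python
import os


def annex_content_missing(path):
    """Return True if `path` is a symlink whose target does not exist.

    Assumption: annexed files are symlinks (default on Linux/macOS).
    Pointer files, as used on Windows, are NOT detected by this check.
    """
    # islink() inspects the link itself; exists() follows it to the target.
    return os.path.islink(path) and not os.path.exists(path)


def list_missing(folder):
    """Collect all files below `folder` whose content is not available locally."""
    missing = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            full = os.path.join(root, name)
            if annex_content_missing(full):
                missing.append(full)
    return sorted(missing)
```

Calling `list_missing("data")` from the dataset root would list the files that have to be retrieved, which is what the --input option in the next exercise takes care of.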
Exercise 7 Run the same script with the data/ folder as --input and the file "results/penguin_culmens.csv" as --output.
datalad run --input "data/" --output "results/penguin_culmens.csv" "python code/aggregate_culmen_data.py"
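If you want to launch such a command from a Python script, it helps to see how the pieces fit together: each input and each output gets its own --input or --output flag, and the command to run comes last as a single argument. The sketch below only assembles the argument list; `build_run_command` is a hypothetical helper for this illustration, not part of DataLad.

```python
import shlex


def build_run_command(script_cmd, inputs=(), outputs=()):
    """Assemble the argument list for a `datalad run` call.

    Every path gets its own --input / --output flag; the command
    that DataLad should execute is appended last as one argument.
    """
    argv = ["datalad", "run"]
    for path in inputs:
        argv += ["--input", path]
    for path in outputs:
        argv += ["--output", path]
    argv.append(script_cmd)
    return argv


if __name__ == "__main__":
    argv = build_run_command(
        "python code/aggregate_culmen_data.py",
        inputs=["data/"],
        outputs=["results/penguin_culmens.csv"],
    )
    # Print the command as it could be typed in a POSIX shell.
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from the dataset root, assuming the datalad executable is on your PATH.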
Exercise 8 Open the git log to view the entry created by the datalad run command. Then, copy the checksum of that commit.
git log
The git log entry should look like this:
commit af78031c9ca45d3c349a692b0afd332178639e64 (master)
Author: Ole Bialas <ole.bialas@posteo.de>
Date:   Thu Jul 31 11:14:00 2025 +0200

    [DATALAD RUNCMD] python code/aggregate_culmen_data.py

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "python code/aggregate_culmen_data.py",
     "dsid": "3a8aacc5-85f0-4114-adee-fcfa7d21a5df",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "data"
     ],
     "outputs": [
      "results/penguin_culmens.csv"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^
The checksum is displayed on the first line, after “commit”: af78031c9ca45d3c349a692b0afd332178639e64
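The block between the two "Do not change" markers is plain JSON, so the recorded command, inputs and outputs can also be read programmatically. The following is a minimal sketch that relies only on the marker lines shown in the log entry above; `extract_run_record` is a hypothetical helper written for this illustration, not a DataLad function.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(message):
    """Return the run record embedded in a commit message as a dict.

    Returns None if the message does not contain both marker lines,
    e.g. for commits created by `datalad save`.
    """
    if START not in message or END not in message:
        return None
    # Keep only the text between the two markers; json.loads ignores
    # the surrounding whitespace and the indentation added by git log.
    body = message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)
```

The commit message itself can be obtained with `git log -1 --format=%B` followed by the checksum. For the entry above, `extract_run_record(message)["outputs"]` would give the list containing results/penguin_culmens.csv.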
Exercise 9 Use the copied checksum to rerun the previous datalad run command.
The command looks like this, but the checksum will be different for everyone:
datalad rerun af78031c9ca45d3c349a692b0afd332178639e64
Exercise 10 Run the script code/plot_culmen_length_vs_depth.py, which takes results/penguin_culmens.csv as --input and produces results/culmen_length_vs_depth.png as --output.
datalad run --input "results/penguin_culmens.csv" --output "results/culmen_length_vs_depth.png" "python code/plot_culmen_length_vs_depth.py"
Exercise 11 Use the checksum of the very first commit (the one that says [DATALAD] new dataset) to re-run everything --since this commit (i.e. the whole analysis).
datalad rerun --since 4c2b9dcd5c745b519095f7bf6612bbe20f7ae9bb
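Instead of scrolling through the git log to find that first commit, you can search the log by its message. The sketch below assumes the log is printed with `git log --format="%H %s"`, which puts the full checksum and the subject line of each commit on one line. The helpers `find_commit` and `read_log` are hypothetical names used for this illustration.

```python
import subprocess


def find_commit(log_text, subject):
    """Return the checksum of the first log line whose subject equals `subject`.

    `log_text` is expected to hold one "<checksum> <subject>" pair per line,
    as printed by `git log --format="%H %s"`. Returns None if nothing matches.
    """
    for line in log_text.splitlines():
        checksum, _, rest = line.partition(" ")
        if rest.strip() == subject:
            return checksum
    return None


def read_log(dataset_path="."):
    """Run git inside `dataset_path` and return the formatted log (requires git)."""
    result = subprocess.run(
        ["git", "log", "--format=%H %s"],
        cwd=dataset_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Run from the dataset root, `find_commit(read_log(), "[DATALAD] new dataset")` should return the checksum to pass to datalad rerun --since.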
Further reading
- For more on DataLad run and the comparison of how Git and git-annex handle files, see the corresponding chapters of the DataLad Handbook.
- For even more on git-annex under the hood, see the git-annex documentation.