Part 2: Tracking Changes in DataLad Datasets
Modifying a Dataset
DataLad keeps track of all changes made to your dataset. In this section, you will add new content to the penguins dataset and see how these changes are tracked in the git log of your repository.
| Command | Description |
|---|---|
| `datalad status` | Show any untracked changes in the current dataset |
| `datalad save` | Save any untracked changes in the current dataset |
| `datalad save -m "hi"` | Save untracked changes and add the message "hi" |
| `datalad unlock file.txt` | Unlock file.txt to make it modifiable |
| `git log` | View the dataset’s history, stored in the git log |
| `git log -3` | View the last 3 entries in the git log |
Exercise 1 Create a new file in the penguins folder called penguin_species.txt and add the species names gentoo and adelie. Then, save the file and run datalad status to see the untracked changes.
On Linux/macOS:
echo -e "gentoo\nadelie" > penguin_species.txt
datalad status
On Windows:
(echo gentoo & echo adelie) > penguin_species.txt
Alternatively, create the file in a text editor of your choice.
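How `echo -e` treats `\n` can differ between shells, so a more portable way to write the two lines on Linux/macOS is `printf`. The following is a minimal sketch of that alternative, not part of the original exercise solution:

```shell
# printf is specified by POSIX and interprets \n in its format string.
printf 'gentoo\nadelie\n' > penguin_species.txt

# Print the file to confirm that it contains one species per line.
cat penguin_species.txt
```

Running datalad status afterwards should report the new file as untracked, just as in the solution above.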
Exercise 2 Use datalad save to save the untracked changes with the message "list penguin species".
datalad save -m "list penguin species"
Exercise 3 Open the git log and find the entry created by the previous datalad save command.
On Linux/macOS, this should be the most recent (i.e. top) entry; on Windows, it will be the second entry from the top.
git log
Exercise 4 Use datalad unlock to unlock the penguin_species.txt file and append chinstrap to the list. Then, run datalad save again with a message to save the changes.
On Linux/macOS:
datalad unlock penguin_species.txt
echo -e "chinstrap" >> penguin_species.txt
datalad save -m "add chinstrap"
On Windows:
datalad unlock penguin_species.txt
echo chinstrap >> penguin_species.txt
datalad save -m "add chinstrap"
Exercise 5 Open the git log again and find the entry from the datalad save command above.
git log
Running Scripts with DataLad
Often, we won’t edit our dataset manually but run scripts that do so. In this section you will use DataLad to run Python scripts and track the changes made by them. You are also going to use the dataset’s history to re-run the commands.
| Command | Description |
|---|---|
| `datalad run "python script.py"` | Run the Python script script.py |
| `datalad run --input "data.csv" --output "figure.png" "python script.py"` | Run script.py with input "data.csv" and output "figure.png" |
| `git log` | View the dataset’s history, stored in the git log |
| `datalad rerun a268d8ca22b6` | Rerun the command from the git log entry whose checksum starts with a268d8ca22b6 |
| `datalad rerun --since a268d8ca22b6` | Rerun ALL commands recorded since the commit whose checksum starts with a268d8ca22b6 |
Exercise 6 Try to run the python script in code/aggregate_culmen_data.py. What error message do you observe?
datalad run "python code/aggregate_culmen_data.py"
Currently, the dataset does not contain the annexed content for the required files. On Linux/macOS, this results in FileNotFoundError: [Errno 2] No such file or directory. On Windows, you’ll see KeyError: "None of [Index(['Culmen Length (mm)', 'Culmen Depth (mm)', 'Species'], dtype='object')] are in the [columns]" because Python can open the pointer file but does not find the required data columns in it.
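On Linux/macOS, git-annex typically represents an annexed file as a symbolic link into .git/annex/objects. Until the content has been retrieved, that link points to nothing, which is why Python reports that the file does not exist. The sketch below imitates this situation in a throwaway directory; the file name and the link target are made up for illustration and are not taken from the penguins dataset:

```shell
# Throwaway directory, so the real dataset is not touched.
demo=$(mktemp -d)

# Hypothetical annexed file: a symlink whose target does not exist yet.
ln -s .git/annex/objects/missing-key "$demo/penguins.csv"

# -L is true for any symlink; -e follows the link and fails if the target is absent.
if [ -L "$demo/penguins.csv" ] && [ ! -e "$demo/penguins.csv" ]; then
    state="content missing"
else
    state="content present"
fi
echo "$state"
```

In a real dataset, datalad get followed by a path retrieves the missing content. Declaring the files as --input to datalad run, as in Exercise 7, makes DataLad retrieve them before the command is executed.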
Exercise 7 Run the same script with the data/ folder as --input and the file "results/penguin_culmens.csv" as --output.
datalad run --input "data/" --output "results/penguin_culmens.csv" "python code/aggregate_culmen_data.py"
Exercise 8 Open the git log to view the entry created by the datalad run command. Then, copy the checksum of that commit.
git log
The git log entry should look like this:
commit af78031c9ca45d3c349a692b0afd332178639e64 (master)
Author: Ole Bialas <ole.bialas@posteo.de>
Date:   Thu Jul 31 11:14:00 2025 +0200

    [DATALAD RUNCMD] python code/aggregate_culmen_data.py

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "python code/aggregate_culmen_data.py",
     "dsid": "3a8aacc5-85f0-4114-adee-fcfa7d21a5df",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "data"
     ],
     "outputs": [
      "results/penguin_culmens.csv"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^
The checksum is displayed on the first line, after “commit”: af78031c9ca45d3c349a692b0afd332178639e64
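Instead of copying the checksum by hand, git log can also print it directly. The sketch below builds a throwaway repository with three empty commits whose messages imitate a DataLad history, then looks up the most recent run commit. The helper demo_commit and the placeholder identity are assumptions made for this illustration only:

```shell
# Throwaway repository, so the penguins dataset is left untouched.
repo=$(mktemp -d)
git -C "$repo" init -q

# Hypothetical helper: create an empty commit with the given message.
demo_commit() {
    git -C "$repo" -c user.name="Demo" -c user.email="demo@example.com" \
        commit -q --allow-empty -m "$1"
}

demo_commit "[DATALAD] new dataset"
demo_commit "[DATALAD RUNCMD] python code/aggregate_culmen_data.py"
demo_commit "add chinstrap"

# -F matches the pattern literally; without it, the square brackets
# would be interpreted as a regular-expression character class.
run_commit=$(git -C "$repo" log -1 -F --grep="[DATALAD RUNCMD]" --format=%H)
echo "$run_commit"
```

Inside the dataset itself, the same git log call without -C "$repo" should print the full checksum of the latest datalad run entry.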
Exercise 9 Use the copied checksum to rerun the previous datalad run command.
The command looks like this, but the checksum will be different for everyone:
datalad rerun af78031c9ca45d3c349a692b0afd332178639e64
Exercise 10 Run the script code/plot_culmen_length_vs_depth.py. It takes results/penguin_culmens.csv as --input and produces results/culmen_length_vs_depth.png as --output.
datalad run --input "results/penguin_culmens.csv" --output "results/culmen_length_vs_depth.png" "python code/plot_culmen_length_vs_depth.py"
Exercise 11 Use the checksum of the very first commit (the one that says [DATALAD] new dataset) to re-run everything --since this commit (i.e. the whole analysis).
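As a hint, the very first commit does not have to be found by scrolling to the bottom of the log: a commit without parents is a root commit, and git rev-list can list those. The sketch below demonstrates this in a throwaway repository; the helper demo_commit and the placeholder identity are assumptions made for this illustration only:

```shell
# Throwaway repository that imitates a short dataset history.
repo=$(mktemp -d)
git -C "$repo" init -q

# Hypothetical helper: create an empty commit with the given message.
demo_commit() {
    git -C "$repo" -c user.name="Demo" -c user.email="demo@example.com" \
        commit -q --allow-empty -m "$1"
}

demo_commit "[DATALAD] new dataset"
demo_commit "list penguin species"
demo_commit "add chinstrap"

# --max-parents=0 keeps only commits that have no parent, i.e. root commits.
root=$(git -C "$repo" rev-list --max-parents=0 HEAD)
git -C "$repo" log -1 --format='%H %s' "$root"
```

A history normally has a single root commit, but merged histories can have several, so it is worth checking that the printed message is the expected one before passing the checksum to datalad rerun --since.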
datalad rerun --since 4c2b9dcd5c745b519095f7bf6612bbe20f7ae9bb
Further reading
- For more on DataLad run and the comparison of how Git and git-annex handle files, see these chapters of the DataLad Handbook:
- For even more on git-annex under the hood, see git-annex documentation: