Exercise Review 2
Tracking Changes in DataLad Datasets
Michał Szczepanik
Research Center Juelich
Ole Bialas
University of Bonn
August 19, 2025
Git Versus git-annex
- Data in datasets is stored either in Git or in git-annex
- Matter of configuration; by default, everything is annexed
Git Versus git-annex
Git and git-annex handle files differently:
- Files in Git are downloaded during clone; annexed contents are retrieved on demand (get).
- On datalad save, annexed contents are hashed, moved to .git/annex/objects, and symlinked.
- Git versions the symlink (content-identity), not the content.
- Content is “locked” (write-protected) against accidental modifications.
- Files stored in Git stay modifiable; annexed files are write-protected.
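The hash, move, and symlink steps listed above can be imitated in a few lines of Python. This is only a toy sketch of the idea, not how git-annex is implemented: the function name toy_annex_add, the flat object directory, and the SHA256-based key are assumptions made for illustration (real keys depend on the configured backend, and the real object store adds hash directories).

```python
import hashlib
import os
import stat
import tempfile
from pathlib import Path


def toy_annex_add(dataset: Path, relpath: str) -> str:
    """Imitate, in simplified form, what happens to an annexed file on save.

    Hypothetical helper for illustration only; not part of DataLad or git-annex.
    """
    path = dataset / relpath
    data = path.read_bytes()

    # 1. Hash the content: the key identifies the content, not the file name.
    key = f"SHA256-s{len(data)}--{hashlib.sha256(data).hexdigest()}"

    # 2. Move the content into the object store (flat layout, an assumption).
    objects = dataset / ".git" / "annex" / "objects"
    objects.mkdir(parents=True, exist_ok=True)
    target = objects / key
    os.replace(path, target)

    # 3. Write-protect the stored content ("locked").
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

    # 4. Leave a relative symlink behind; the link is what Git would version.
    path.symlink_to(os.path.relpath(target, path.parent))
    return key


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ds = Path(tmp)
        (ds / "data.txt").write_text("hello")
        key = toy_annex_add(ds, "data.txt")
        print("key:", key)
        print("link target:", os.readlink(ds / "data.txt"))
```

Because the key is derived from the content alone, two files with identical content map to the same object, and the symlink that Git tracks changes only when the content changes.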
Annexed Files on Windows
Windows’ file system does not support symlinks. Under these conditions, git-annex automatically operates in “adjusted unlocked mode”:
- File contents are duplicated: One copy to edit, one to keep safe.
- No (un)locking – trade-off: disk space use vs easy modification.
- Annexed files are pointer files instead of symlinks: A text file with the contents “/annex/objects/” followed by the key.
- git-annex uses adjusted branches, which present unlocked files on top of the locked counterpart branch. This keeps datasets compatible with other systems, at the cost of a confusing Git log and issues in some nesting use cases.
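To make the pointer-file idea concrete, here is a minimal Python sketch that recognizes a pointer file by its prefix and extracts the key. It assumes exactly the layout described above ("/annex/objects/" followed by the key on the first line); the helper name key_from_pointer and the example key are made up for illustration.

```python
from typing import Optional

# Prefix that marks a pointer file, as described above.
POINTER_PREFIX = "/annex/objects/"


def key_from_pointer(text: str) -> Optional[str]:
    """Return the key if `text` looks like a pointer file, else None.

    Hypothetical helper for illustration; not part of git-annex or DataLad.
    """
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    if not first_line.startswith(POINTER_PREFIX):
        return None
    key = first_line[len(POINTER_PREFIX):]
    # An empty key means the file is not a usable pointer.
    return key or None


if __name__ == "__main__":
    # Made-up example key in the BACKEND-sSIZE--HASH.ext style.
    pointer = "/annex/objects/MD5E-s5--5d41402abc4b2a76b9719d911017c592.txt\n"
    print(key_from_pointer(pointer))
    print(key_from_pointer("ordinary file content"))
```

A check like this is one way to tell whether a file in a clone holds its actual content or only a placeholder whose content still has to be retrieved.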
Hands-on: Creating Backups and Sharing DataLad Datasets