Exercise Review 2

Tracking Changes in DataLad Datasets

Michał Szczepanik

Research Center Juelich

Ole Bialas

University of Bonn

August 19, 2025

Git Versus git-annex

  • Data in datasets is either stored in Git or git-annex
  • Which of the two is used is a matter of configuration; by default, everything is annexed
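
One common way to express this configuration is through `annex.largefiles` rules in a dataset's `.gitattributes` file. The sketch below only writes such rules to a file. The 100 kB threshold, the `*.md` pattern, and the helper name `write_gitattributes` are illustrative assumptions, not defaults of DataLad or git-annex. The file still has to be saved in the dataset before the rules apply.

```python
from pathlib import Path

# Illustrative rules (assumptions, not defaults):
# files larger than 100 kB go to git-annex, smaller files go to Git,
# and Markdown files are always kept in Git regardless of size.
EXAMPLE_RULES = [
    "* annex.largefiles=(largerthan=100kb)",
    "*.md annex.largefiles=nothing",
]


def write_gitattributes(dataset: Path, rules=EXAMPLE_RULES) -> Path:
    """Append the given rules to <dataset>/.gitattributes, skipping duplicates."""
    attrs = dataset / ".gitattributes"
    existing = attrs.read_text().splitlines() if attrs.exists() else []
    new = [rule for rule in rules if rule not in existing]
    attrs.write_text("\n".join(existing + new) + "\n")
    return attrs
```

Later lines in `.gitattributes` take precedence over earlier ones, which is why the specific `*.md` rule follows the general `*` rule.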

Git and git-annex handle files differently:

  • Files in Git are downloaded during clone; annexed contents are retrieved on demand (datalad get).
  • On datalad save, annexed contents are hashed, moved to .git/annex/objects and symlinked.
  • Git versions the symlink (content-identity), not the content.
  • Content is “locked” (write-protected) against accidental modifications.
  • Files stored in Git are modifiable; annexed files are write-protected.
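
The hash, move, and symlink steps above can be illustrated with a toy model. This is a simplified sketch, not git-annex's implementation: the function name `toy_annex_add` is hypothetical, the object directory layout is flattened (git-annex adds hashed subdirectories), and the code assumes a POSIX file system where symlinks can be created.

```python
import hashlib
import os
import stat
from pathlib import Path


def toy_annex_add(repo: Path, relpath: str) -> str:
    """Toy model of annexing one file; returns the content key."""
    src = repo / relpath
    data = src.read_bytes()

    # 1. Hash the content. The key records backend, size, checksum and extension.
    digest = hashlib.sha256(data).hexdigest()
    key = f"SHA256E-s{len(data)}--{digest}{src.suffix}"

    # 2. Move the content into the object store (simplified layout).
    obj = repo / ".git" / "annex" / "objects" / key / key
    obj.parent.mkdir(parents=True, exist_ok=True)
    src.replace(obj)

    # 3. "Lock" the content by making it read-only.
    obj.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

    # 4. Leave a relative symlink at the original path.
    #    This symlink is what Git would version.
    os.symlink(os.path.relpath(obj, start=src.parent), src)
    return key
```

Reading the file through the symlink still works as before, while an attempt to write to it fails for a regular user because the target is read-only.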

Annexed Files on Windows

Windows’ file system does not support symlinks. Under these conditions, git-annex automatically operates in “adjusted unlocked mode”:

  • File contents are duplicated: One copy to edit, one to keep safe.
  • No (un)locking – trade-off: disk space use vs easy modification.
  • Annexed files are pointer files instead of symlinks: A text file with the contents “/annex/objects/” followed by the key.
  • git-annex uses adjusted branches, which present unlocked files on top of the locked counterpart branch. This keeps datasets compatible with other systems, but makes the Git log confusing and causes issues in some nesting use cases.
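
The pointer file format described above can be recognized with a few lines of code. This is a minimal sketch that relies only on the “/annex/objects/” prefix mentioned in the list; the helper name `key_from_pointer` is hypothetical and is not part of DataLad or git-annex.

```python
from typing import Optional

POINTER_PREFIX = "/annex/objects/"


def key_from_pointer(content: str) -> Optional[str]:
    """Return the annex key if `content` looks like a pointer file, else None."""
    lines = content.splitlines()
    first_line = lines[0] if lines else ""
    # A pointer starts with the prefix and has a non-empty key after it.
    if first_line.startswith(POINTER_PREFIX) and len(first_line) > len(POINTER_PREFIX):
        return first_line[len(POINTER_PREFIX):]
    return None
```

For example, passing the text of a file whose first line is the prefix followed by a key returns that key, while ordinary file content returns `None`.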

Hands-on: Creating Backups and Sharing DataLad Datasets