Exercise Review 1
Working with DataLad Datasets
Michał Szczepanik
Research Center Juelich
Ole Bialas
University of Bonn
August 19, 2025
Consuming an Existing DataLad Dataset
datalad clone
is fast because it only downloads lightweight symbolic links that stand in for the files
datalad get
downloads the actual file content
- Useful for working on very large datasets:
datalad clone <dataset>
datalad get <file>
python <script>
datalad drop <file>
- This works because DataLad manages file identifiers that uniquely associate every link with its corresponding file content (see the example below)
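A minimal sketch of that workflow, assuming a hypothetical dataset URL, file path, and analysis script:

```bash
# clone retrieves only metadata and symbolic links: fast and small
datalad clone https://example.com/some-dataset   # hypothetical URL
cd some-dataset

# fetch the content of just the file the analysis needs
datalad get data/sub-01.nii.gz                   # hypothetical path

# run the analysis on the now locally available content
python code/analysis.py                          # hypothetical script

# free the disk space again; the link and its identifiers remain tracked
datalad drop data/sub-01.nii.gz
```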
Provider and Manager of Identifiers
DataLad provides globally unique, persistent identifiers (without a central issuing service; offline and portable)
Concept identifiers
- for datasets: DataLad dataset ID
- for files in a dataset: DataLad dataset ID + path within a dataset
Content/version identifiers
- for datasets: Git commit SHA ID
- for files: Git blob SHA / Git-annex key
By tracking unique identifiers, DataLad can manage files across multiple sources
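These identifiers can be inspected in any dataset. A minimal sketch, assuming a recent DataLad version (with the `configuration` command) and a hypothetical annexed file path:

```bash
# concept identifier: the DataLad dataset ID (stored in .datalad/config)
datalad configuration get datalad.dataset.id

# version identifier of the dataset: the current Git commit SHA
git log -1 --format=%H

# content identifier of an annexed file: its git-annex key
git annex lookupkey data/sub-01.nii.gz           # hypothetical path
```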
What is a DataLad Dataset?
- a container for metadata on the evolution of a collection of files
- content identity (think: checksums)
- content availability (think: URLs)
- provenance of change (think: who did what when?)
- a regular, but managed, directory on the computer's file system
- provides a familiar look and feel of files and folders
- a Git repository
- compatible with anything that can handle Git repositories
- a git-annex repository for storing, tracking and transporting file content
- supports any storage service and transport protocol supported by git-annex
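All three layers are visible in a freshly created dataset. A minimal sketch, assuming a hypothetical dataset name:

```bash
# create a new dataset (a regular directory, managed by DataLad)
datalad create my-dataset                        # hypothetical name
cd my-dataset

# a familiar directory, with .git/ and .datalad/ alongside your files
ls -a

# a regular Git repository, with the usual history tooling
git log --oneline

# and a git-annex repository for managing file content
git annex info
```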
Hands-on: Tracking Changes
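A minimal sketch of the change-tracking cycle covered in the hands-on part, assuming a hypothetical file and commit message:

```bash
# inside a dataset: create or edit a file
echo "participant,response" > results.csv       # hypothetical file

# inspect what changed since the last saved state
datalad status

# record the change, with a message describing what was done and why
datalad save -m "Add initial results table"

# the change is now part of the dataset's history
git log -1
```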