Exercise Review 1

Working with DataLad Datasets

Michał Szczepanik

Research Center Juelich

Ole Bialas

University of Bonn

August 19, 2025

Consuming an Existing DataLad Dataset

  • datalad clone is fast because it downloads only the dataset's metadata and symbolic links, not the file content itself
  • datalad get downloads the actual file content
  • Useful for working on very large datasets:
    • datalad clone <dataset>
    • datalad get <file>
    • python <script>
    • datalad drop <file>
  • This works because DataLad manages the file identifiers that uniquely associate every link with its corresponding file content
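The clone/get/drop cycle above can be sketched as a toy model. All names here (the store, keys, paths) are hypothetical illustrations; real DataLad uses git-annex keys and symlinks on disk, not Python dictionaries.

```python
# Toy model: DataLad separates file *identifiers* from file *content*.

# A "remote" store holds the actual content, addressed by a content key.
remote_store = {"key-abc123": b"large imaging data ..."}

# A fresh clone holds only lightweight pointers (like symlinks) -- no content yet.
local_pointers = {"sub-01/anat.nii": "key-abc123"}
local_content = {}  # empty right after `datalad clone`

def get(path):
    """Like `datalad get`: resolve the pointer and fetch the content."""
    key = local_pointers[path]
    local_content[path] = remote_store[key]

def drop(path):
    """Like `datalad drop`: remove local content, keep the pointer."""
    del local_content[path]

get("sub-01/anat.nii")   # content is now present locally
drop("sub-01/anat.nii")  # content gone again; the identifier remains
```

Because the pointer survives a drop, the content can always be re-fetched later, which is what makes the clone–get–drop workflow safe on very large datasets.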

A Meta View

astronaut meme: "Wait, it's all identifiers?" "Always has been."

Provider and Manager of Identifiers

  • DataLad provides globally unique, persistent identifiers (without a central issuing service; offline and portable)

  • Concept identifiers

    • for datasets: DataLad dataset ID
    • for files in a dataset: DataLad dataset ID + path within a dataset
  • Content/version identifiers

    • for datasets: Git commit SHA ID
    • for files: Git blob SHA / Git-annex key
  • By tracking unique identifiers, DataLad can manage files across multiple sources
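One of these content identifiers, the Git blob SHA, is simple enough to compute by hand: Git hashes the header `blob <size>\0` followed by the raw content. A minimal sketch:

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    """Compute the Git blob SHA-1 for raw file content.

    Git hashes the header b"blob <size>\\0" followed by the content,
    so the identifier depends only on the bytes, not on the file name.
    """
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_sha(b"hello\n"))
# -> ce013625030ba8dba906f756967f9e9ca394464a
# the same value `git hash-object` prints for a file containing "hello\n"
```

Since the identifier is derived purely from content, any two sources holding the same bytes yield the same ID, which is what lets DataLad match files across multiple sources.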

What is a DataLad Dataset?

  • a container for metadata on the evolution of a collection of files
    • content identity (think: checksums)
    • content availability (think: URLs)
    • provenance of change (think: who did what when?)
  • a regular, but managed, directory on the computer's file system
    • provides a familiar look and feel of files and folders
  • a Git repository
    • compatible with anything that can handle Git repositories
  • a git-annex repository for storing, tracking and transporting file content
    • supports any storage service and transport protocol supported by git-annex
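The "content identity" bullet above can be made concrete with git-annex's default SHA256E backend, whose keys combine a checksum, the content size, and the file extension. This is a simplified sketch of the key layout, not git-annex's full logic (real git-annex has additional backends and extension-handling rules):

```python
import hashlib

def annex_key_sha256e(content: bytes, extension: str = "") -> str:
    """Sketch of a git-annex SHA256E-style key:
    SHA256E-s<size>--<sha256-of-content><extension>

    Simplified illustration only; not a full reimplementation.
    """
    digest = hashlib.sha256(content).hexdigest()
    return f"SHA256E-s{len(content)}--{digest}{extension}"

print(annex_key_sha256e(b"hello\n", ".txt"))
```

In an annexed dataset, the tracked file is a symlink pointing at such a key, and availability records (URLs, remotes) are attached to the key rather than to the path.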

Hands-on: Tracking Changes