Exercise Review 1

Working with DataLad Datasets

Michał Szczepanik

Research Center Juelich

Ole Bialas

University of Bonn

August 19, 2025

Consuming an Existing DataLad Dataset

  • datalad clone is fast because it downloads only the dataset's metadata and symbolic links, not the file content itself
  • datalad get downloads the actual file content
  • Useful for working on very large datasets:
    • datalad clone <dataset>
    • datalad get <file>
    • python <script>
    • datalad drop <file>
  • This works because DataLad manages the file identifiers that uniquely associate every link with its corresponding file content
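The clone/get/drop cycle above can be sketched as a toy model. All names here (the store, keys, paths) are hypothetical illustrations; real DataLad uses git-annex keys and symlinks on disk, not Python dictionaries.

```python
# Toy model: DataLad separates file *identifiers* from file *content*.

# A "remote" store holds the actual content, addressed by a content key.
remote_store = {"key-abc123": b"large imaging data ..."}

# A fresh clone holds only lightweight pointers (like symlinks) -- no content yet.
local_pointers = {"sub-01/anat.nii": "key-abc123"}
local_content = {}  # empty right after `datalad clone`

def get(path):
    """Like `datalad get`: resolve the pointer and fetch the content."""
    key = local_pointers[path]
    local_content[path] = remote_store[key]

def drop(path):
    """Like `datalad drop`: remove local content, keep the pointer."""
    del local_content[path]

get("sub-01/anat.nii")   # content is now present locally
drop("sub-01/anat.nii")  # content gone again; the identifier remains
```

Because the pointer survives a drop, the content can always be re-fetched later, which is what makes the clone–get–drop workflow safe on very large datasets.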

A Meta View

astronaut meme: "Wait, it's all identifiers?" "Always has been."

Provider and Manager of Identifiers

  • DataLad provides globally unique, persistent identifiers (without a central issuing service; offline and portable)

  • Concept identifiers

    • for datasets: DataLad dataset ID
    • for files in a dataset: DataLad dataset ID + path within a dataset
  • Content/version identifiers

    • for datasets: Git commit SHA ID
    • for files: Git blob SHA / Git-annex key
  • By tracking unique identifiers, DataLad can manage files across multiple sources
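One of these content identifiers, the Git blob SHA, is simple enough to compute by hand: Git hashes the header `blob <size>\0` followed by the raw content. A minimal sketch:

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    """Compute the Git blob SHA-1 for raw file content.

    Git hashes the header b"blob <size>\\0" followed by the content,
    so the identifier depends only on the bytes, not on the file name.
    """
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_sha(b"hello\n"))
# -> ce013625030ba8dba906f756967f9e9ca394464a
# the same value `git hash-object` prints for a file containing "hello\n"
```

Since the identifier is derived purely from content, any two sources holding the same bytes yield the same ID, which is what lets DataLad match files across multiple sources.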

What is a DataLad Dataset?

  • a container for metadata on the evolution of a collection of files
    • content identity (think: checksums)
    • content availability (think: URLs)
    • provenance of change (think: who did what when?)
  • a regular, but managed, directory on the computer's file system
    • provides a familiar look and feel of files and folders
  • a Git repository
    • compatible with anything that can handle Git repositories
  • a git-annex repository for storing, tracking and transporting file content
    • supports any storage service and transport protocol supported by git-annex
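The "content identity" bullet above can be made concrete with git-annex's default SHA256E backend, whose keys combine a checksum, the content size, and the file extension. This is a simplified sketch of the key layout, not git-annex's full logic (real git-annex has additional backends and extension-handling rules):

```python
import hashlib

def annex_key_sha256e(content: bytes, extension: str = "") -> str:
    """Sketch of a git-annex SHA256E-style key:
    SHA256E-s<size>--<sha256-of-content><extension>

    Simplified illustration only; not a full reimplementation.
    """
    digest = hashlib.sha256(content).hexdigest()
    return f"SHA256E-s{len(content)}--{digest}{extension}"

print(annex_key_sha256e(b"hello\n", ".txt"))
```

In an annexed dataset, the tracked file is a symlink pointing at such a key, and availability records (URLs, remotes) are attached to the key rather than to the path.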

Hands-on: Tracking Changes