Data Management for Open Science
August 19, 2025
Well-structured datasets (using community standards) and portable computational environments, together with their evolution over time, are the precondition for reproducibility.
# turn any directory into a dataset
# with version control
% datalad create <directory>
# save a new state of a dataset with
# file content of any size
% datalad save
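For illustration, a minimal sketch of this workflow (directory and file names are hypothetical):
# example: create a dataset for a study and
# record a first file
% datalad create my-study
% cd my-study
% cp /tmp/sub-01.tsv .
% datalad save -m "Add first participant's data"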
Which data were needed at which version, as input into which code, running with what parameterization in which computational environment, to generate an outcome?
# execute any command and capture its output
# while recording all input versions too
% datalad run --input ... --output ... <command>
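For instance, a provenance-capturing invocation could look like this (script and file names are hypothetical):
# example: run an analysis and record inputs,
# outputs, and the exact command line
% datalad run \
    --input data/raw.tsv \
    --output results/stats.json \
    "python code/analyze.py"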
Precise identification of data and computational environments, combined with provenance records, forms a comprehensive and portable data structure that captures all aspects of an investigation.
# transfer data and metadata to other sites and
# services, with fine-grained access control
# for dataset components
% datalad push --to <site-or-service>
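A transfer target (a "sibling") is typically registered once before pushing; a sketch, with a hypothetical sibling name and URL:
# example: register a sibling, then publish
# (sibling name and URL are hypothetical)
% datalad siblings add --name archive \
    --url ssh://server.example.org/studies/my-study
% datalad push --to archive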
Outcomes of computational transformations can be validated by authorized third parties. This enables audits, promotes accountability, and streamlines automated “upgrades” of outputs.
# obtain dataset (initially only identity,
# availability, and provenance metadata)
% datalad clone <url>
# immediately actionable provenance records
# full abstraction of input data retrieval
% datalad rerun <commit|tag|range>
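Taken together, a consumer could reproduce a recorded result without manual data handling (URL and tag are hypothetical):
# example: clone a dataset and recompute a
# tagged result; declared inputs are fetched
# automatically
% datalad clone https://example.org/my-study
% cd my-study
% datalad rerun paper-v1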
Verifiable, portable, self-contained data structures that exhaustively track all aspects of an investigation can be (re)used as modular components in larger contexts, propagating their traits.
# declare a dependency on another dataset and
# reuse it at a particular state in a new context
% datalad clone -d <superdataset> <url> <path-in-dataset>
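As a concrete sketch, linking an existing dataset into a new analysis dataset (names and URL are hypothetical):
# example: nest an existing dataset as a
# subdataset of a new analysis dataset
% datalad create my-analysis
% cd my-analysis
% datalad clone -d . \
    https://example.org/rawdata inputs/rawdata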