Part 3: Creating Backups and Sharing DataLad Datasets

Creating a Backup

Command Description
git init --bare ~/mydir Create a --bare repository called mydir in the home directory (on Linux/macOS)
git init --bare %USERPROFILE%\mydir Create a --bare repository called mydir in the home directory (on Windows / CMD)
git init --bare "$env:USERPROFILE\mydir" Create a --bare repository called mydir in the home directory (on Windows / PowerShell)
datalad siblings List all siblings of the current dataset
datalad siblings add --name new --url ~/mydir Add the repository at ~/mydir as a new sibling with the name new
datalad push --to new Push the dataset content to the sibling named new

Exercise 1 List all siblings of the current dataset.

datalad siblings

Exercise 2 Initialize a --bare git repository at a path outside of this dataset.

On Linux/macOS

git init --bare ~/penguins_backup

On Windows

git init --bare %USERPROFILE%\penguins_backup
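To confirm that the new repository really is bare, Git can be asked directly. The sketch below assumes the Linux/macOS path from this exercise; re-running git init on an existing repository is harmless.

```shell
# Assumption: the backup lives at ~/penguins_backup, as in the exercise above.
backup="$HOME/penguins_backup"

# Re-running init on an existing repository does not overwrite anything.
git init --quiet --bare "$backup"

# Reports "true" for a bare repository (one without a working tree).
is_bare=$(git -C "$backup" rev-parse --is-bare-repository)
echo "bare: $is_bare"
```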

Exercise 3 Add a new sibling to the dataset using the path to the newly created git repository as the --url. Then, list all siblings to confirm it was added.

On Linux/macOS

datalad siblings add --name backup --url ~/penguins_backup
datalad siblings

On Windows

datalad siblings add --name backup --url %USERPROFILE%\penguins_backup
datalad siblings
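DataLad records a sibling as an ordinary Git remote, so running git remote -v inside the dataset offers a second way to confirm the result. The sketch below illustrates the idea with plain Git (not DataLad) in a throwaway directory on a Linux/macOS shell; the names work and backup.git are made up for the illustration.

```shell
# Scratch demonstration with plain Git; every path here is temporary.
scratch=$(mktemp -d)

# A bare repository standing in for the backup location.
git init --quiet --bare "$scratch/backup.git"

# A regular repository standing in for the dataset.
git init --quiet "$scratch/work"

# Register the bare repository as a remote named "backup".
git -C "$scratch/work" remote add backup "$scratch/backup.git"

# List the configured remotes by name, then with their URLs.
remotes=$(git -C "$scratch/work" remote)
echo "$remotes"
git -C "$scratch/work" remote -v
```

In a real dataset, only the last command is needed; the sibling added with datalad siblings add should appear under the name given with --name.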

Exercise 4 Push the dataset to the new sibling twice.

We need to push twice because the first push initializes the repository’s annex ID, and the second (and each subsequent) push actually transfers the annexed files.

datalad push --to backup
datalad push --to backup

Exercise 5 Move to a directory outside of this dataset and clone the new sibling dataset.

On Linux/macOS

cd ..
datalad clone ~/penguins_backup

On Windows

cd ..
datalad clone %USERPROFILE%\penguins_backup

BONUS: Sharing your Dataset online

Command Description
ssh-keygen Generate a public and private authentication key pair
datalad siblings List all siblings of the current dataset
datalad siblings add --name gin --url git@gin.g-node.org:/user/repo.git Add the GIN repository at https://gin.g-node.org/user/repo as a new sibling with the name gin
datalad push --to gin Push the dataset content to the sibling named gin

Exercise 6 Use ssh-keygen to generate a public and private key pair (you don’t have to use a passphrase). Note the location where the public key is stored, e.g. .ssh/id_ed25519.pub. Open the .pub file and copy its whole content. It should look something like this: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBOYcoRKZZLWA4FWECpW2K/fTOvuRYXBnBA6gcea2bFq <user>@<computer>

ssh-keygen
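ssh-keygen can also run without prompts. The sketch below is one possible non-interactive variant, not a required step: it writes an ed25519 key pair with an empty passphrase into a temporary directory (a hypothetical location, chosen so that no existing key is overwritten) and prints the public half. For real use, the default location under ~/.ssh offered by the interactive prompt is the usual choice.

```shell
# Hypothetical scratch location; a real key normally lives under ~/.ssh.
keydir=$(mktemp -d)

# -t: key type, -N "": empty passphrase, -f: output file, -q: quiet.
ssh-keygen -q -t ed25519 -N "" -f "$keydir/id_ed25519"

# The .pub file holds the public key; this is the line to copy into GIN.
pubkey=$(cat "$keydir/id_ed25519.pub")
echo "$pubkey"
```

The private key (the file without the .pub extension) stays on your computer and should never be shared.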

Exercise 7 Log in to your GIN account, go to your user settings and add the copied SSH key. Now datalad should be able to connect to your GIN account!

Exercise 8 Create a new repository on GIN. Make sure NOT to initialize it with a README.

Exercise 9 Add a new sibling to the dataset using the --url of the newly created GIN repository and confirm the connection. Then, list all siblings to confirm it was added.

For the repository in the image above, the command would look like this:

datalad siblings add --name gin --url git@gin.g-node.org:/adswa/DataLad-101.git
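A GIN repository appears in this section under two address forms: the SSH form used for the sibling, which relies on the key added in Exercise 7, and the HTTPS form used later for cloning. A small sketch, assuming the user and repository names from the example, shows how both are assembled:

```shell
# Names taken from the example repository; replace them with your own.
user="adswa"
repo="DataLad-101"

# SSH form, as passed to: datalad siblings add --name gin --url ...
ssh_url="git@gin.g-node.org:/${user}/${repo}.git"

# HTTPS form, as passed to: datalad clone ...
https_url="https://gin.g-node.org/${user}/${repo}"

echo "sibling URL: $ssh_url"
echo "clone URL:   $https_url"
```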

Exercise 10 Push the dataset to the new GIN sibling. Then, open the repository in your browser to confirm the content was pushed.

datalad push --to gin

Exercise 11 Move to a directory outside of this dataset and clone the new GIN sibling.

For the repository in the image above, the command would look like this:

cd ..
datalad clone https://gin.g-node.org/adswa/DataLad-101

Further reading

In the examples above, the annex was published together with the Git repository. This is a special case, however: in many scenarios the two can be published separately. For an overview and examples of several different publishing scenarios, see the Beyond shared infrastructure chapter of the DataLad handbook.

Git-annex supports multiple options for publishing file contents; see the list of built-in special remotes. For the very special case in which git-annex places the Git repository itself on hosting that is not Git-aware, see git-remote-annex.

Finally, Forgejo is gaining popularity as a self-hosted software forge. Forgejo-aneksajo is a soft fork of Forgejo which adds git-annex capability. See also Collaborative infrastructure for a lab: Forgejo on the DataLad blog.