Part 1: Working with DataLad Datasets

Consuming Existing Datasets

In this section, you are going to clone an existing DataLad dataset and download its contents. While the DataLad commands are the same on every operating system, the terminal commands for navigating the dataset differ between operating systems (see the table below).

Terminal commands

Linux/macOS   | Windows       | Description
ls -a         | dir /a        | List the content of the current directory (including hidden files)
ls -a data    | dir /a data   | List the content of the data directory
du -sh        | dir /s        | Get the disk usage of the current directory
du -sh data   | dir /s data   | Get the disk usage of the data directory
cd data       | cd data       | Change the directory to data

DataLad commands

Command                             | Description
datalad clone https://example.com   | Clone the dataset from example.com
datalad get folder/                 | Get the file content of folder/
datalad get folder/image.png        | Get the file content of the file image.png
datalad drop folder/                | Drop the file content of folder/
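
The DataLad commands listed here combine into a clone, get, drop round trip. The following is a minimal shell sketch of that sequence for Linux/macOS, not a definitive workflow. The names `run` and `round_trip` and the `DRY_RUN` variable are illustrative helpers, not part of DataLad. By default the script only prints the commands it would execute; setting `DRY_RUN=0` would run them for real, which requires DataLad and network access.

```shell
# Sketch of the clone -> get -> drop round trip.
# DRY_RUN, run and round_trip are hypothetical helpers, not DataLad features.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: dataset URL, $2: directory to clone into, $3: path inside the dataset
round_trip() {
    run datalad clone "$1" "$2" || return 1
    (
        # Enter the clone only when commands are really executed.
        [ "$DRY_RUN" = "1" ] || cd "$2" || exit 1
        run datalad get "$3" &&
        run datalad drop "$3"
    )
}

round_trip https://gin.g-node.org/obi/penguins penguins examples
```

The subshell keeps the `cd` from changing the working directory of the calling terminal.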

Open a terminal to do the following exercises. For Windows users, we recommend CMD (the solutions will assume you are using CMD).

Exercise 1 Clone the dataset from https://gin.g-node.org/obi/penguins

datalad clone https://gin.g-node.org/obi/penguins

Exercise 2 Change the directory to penguins and list the directory’s content

On Linux/macOS:

cd penguins
ls -a

On Windows:

cd penguins
dir /a

Exercise 3 Check the disk usage of the penguins directory

On Linux/macOS:

du -sh

On Windows:

dir /s

Exercise 4 Get the content of the examples subdirectory

datalad get examples

Exercise 5 Check the disk usage of the penguins directory again

On Linux/macOS:

du -sh

On Windows:

dir /s

Exercise 6 Drop the content of examples/chinstrap.jpg and check the disk usage again

datalad drop examples/chinstrap.jpg

On Linux/macOS:

du -sh

On Windows:

dir /s
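
Exercises 3 to 6 compare disk usage before and after getting or dropping file content. On Linux/macOS, that comparison could be scripted roughly as follows. This is a sketch under assumptions: `dir_kb` and `size_delta_kb` are hypothetical helper names, and the DataLad part is left as a comment because it needs the cloned dataset.

```shell
# Sketch: measure how much disk space a DataLad command consumes or frees.
# dir_kb and size_delta_kb are hypothetical helpers, not DataLad commands.

# Print the disk usage of a directory in kilobytes.
dir_kb() {
    du -sk "$1" | cut -f1
}

# Print the difference between two measurements (after minus before).
size_delta_kb() {
    echo $(( $2 - $1 ))
}

# Possible use inside the penguins directory (commented out because it
# needs DataLad and network access):
#   before=$(dir_kb .)
#   datalad get examples
#   after=$(dir_kb .)
#   echo "examples added $(size_delta_kb "$before" "$after") KB"
```

A negative difference means the directory shrank, as expected after `datalad drop`.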

Checking File Identity and Location with git-annex

Since DataLad is built on top of git-annex, you can use git-annex commands in any DataLad dataset. In this section, you’ll use git-annex to get information on the dataset and locate its file contents.

Command                               | Description
git annex info                        | Show the git-annex information for the whole dataset
git annex info folder/image.png       | Show the git-annex information for the file image.png
git annex whereis folder/image.png    | List the repositories that have the file content for image.png

Exercise 7 Display the git annex info for the file examples/gentoo.jpg. What is the size of that file? Is it present on your machine?

git annex info examples/gentoo.jpg

The file is 4.81 megabytes, and it should be present because we previously downloaded the content of the examples folder.

Exercise 8 Display the git-annex info of the whole dataset. How many annexed files are there in the working tree?

git annex info

The number of annexed files is displayed in this line: annexed files in working tree: 6

Exercise 9 Use git annex whereis to list the repositories that have the file content for the image examples/gentoo.jpg.

git annex whereis examples/gentoo.jpg

Exercise 10 Use git annex whereis to list the repositories that have the file content for the table data/table_220.csv. How does this differ from the list of repositories that contain the content for gentoo.jpg?

git annex whereis data/table_220.csv

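The content of the table is not stored in the local repository, so the output has no line marked [here]. For gentoo.jpg, whose content you downloaded earlier, the local repository appears in the line marked [here].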
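
The [here] marker can also be checked automatically. The sketch below assumes only what the exercise relies on, namely that git annex whereis marks the local repository with [here]. The function name `content_is_local` is a hypothetical helper, and the git-annex call is left as a comment because it needs the cloned dataset.

```shell
# Sketch: decide from `git annex whereis` output whether content is local.
# content_is_local is a hypothetical helper, not a git-annex command.

# Read text from standard input and succeed (exit status 0) if any line
# contains the literal marker [here].
content_is_local() {
    grep -qF '[here]'
}

# Possible use inside the penguins directory:
#   if git annex whereis examples/gentoo.jpg | content_is_local; then
#       echo "content is present locally"
#   else
#       echo "content is only available in other repositories"
#   fi
```

Using `grep -F` treats the square brackets literally instead of as a character class.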