Part 1: Working with DataLad Datasets
Consuming Existing Datasets
In this section, you will clone an existing DataLad dataset and download its contents. While the DataLad API is the same on every platform, the commands for navigating the dataset differ between operating systems (see table below).
| Linux/macOS | Windows | Description |
|---|---|---|
| ls -a | dir /a | List the content of the current directory (including hidden files) |
| ls -a data | dir /a data | List the content of the data directory |
| du -sh | dir /s | Get the disk usage of the current directory |
| du -sh data | dir /s data | Get the disk usage of the data directory |
| cd data | cd data | Change the directory to data |
| Command | Description |
|---|---|
| datalad clone https://example.com | Clone the dataset from example.com |
| datalad get folder/ | Get the file content of the folder/ |
| datalad get folder/image.png | Get the file content of the file image.png |
| datalad drop folder/ | Drop the file content of the folder/ |
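The commands in the table can be chained into one small helper. The following is a minimal sketch for Linux/macOS shells, not part of DataLad itself: the function name `consume_dataset` and its two arguments are hypothetical, and it assumes DataLad is installed and the URL is reachable.

```shell
# Sketch: clone a dataset, fetch one subdirectory, then drop it again,
# printing the disk usage after each step.
# `consume_dataset` is a hypothetical helper, not a DataLad command.
consume_dataset() {
    url=$1                      # dataset URL to clone
    subdir=$2                   # path inside the dataset to get and drop
    name=$(basename "$url")     # directory name used for the clone

    datalad clone "$url" "$name" || return 1
    (
        # Run the remaining steps in a subshell so the caller's
        # working directory is left unchanged.
        cd "$name" || exit 1
        du -sh .                          # usage before any content is fetched
        datalad get "$subdir" || exit 1
        du -sh .                          # usage with the content present
        datalad drop "$subdir" || exit 1
        du -sh .                          # usage after the content is dropped
    )
}

# Usage (needs DataLad and network access):
# consume_dataset https://gin.g-node.org/obi/penguins examples
```

Comparing the three `du -sh` lines shows how `get` and `drop` change what is stored on disk while the directory layout stays the same.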
Open a terminal to do the following exercises. For Windows users, we recommend CMD (the solutions will assume you are using CMD).
Exercise 1 Clone the dataset from https://gin.g-node.org/obi/penguins
datalad clone https://gin.g-node.org/obi/penguins
Exercise 2 Change the directory to penguins and list the directory’s content
On Linux/macOS:
cd penguins
ls -a
On Windows:
cd penguins
dir /a
Exercise 3 Check the disk usage of the penguins directory
On Linux/macOS:
du -sh
On Windows:
dir /s
Exercise 4 Get the content of the examples subdirectory
datalad get examples
Exercise 5 Check the disk usage of the penguins directory again
On Linux/macOS:
du -sh
On Windows:
dir /s
Exercise 6 Drop the content of examples/chinstrap.jpg and check the disk usage again
datalad drop examples/chinstrap.jpg
On Linux/macOS:
du -sh
On Windows:
dir /s
Checking File Identity and Location with git-annex
Since DataLad is built on top of git-annex, you can use its commands on any DataLad dataset. In this section, you’ll use git-annex to get information on the dataset and locate its file contents.
| Command | Description |
|---|---|
| git annex info | Show the git-annex information for the whole dataset |
| git annex info folder/image.png | Show the git-annex information for the file image.png |
| git annex whereis folder/image.png | List the repositories that have the file content for image.png |
Exercise 7 Display the git annex info for the file examples/gentoo.jpg. What is the size of that file? Is it present on your machine?
git annex info examples/gentoo.jpg
The file is 4.81 megabytes, and it should be present because we previously downloaded the content of the examples folder.
Exercise 8 Display the git-annex info of the whole dataset. How many annexed files are there in the working tree?
git annex info
The number of annexed files is displayed in this line: annexed files in working tree: 6
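If you only want the number, you can filter the output. This is a small sketch that assumes the output contains the line exactly as quoted above; `count_annexed_files` is a hypothetical helper name, not a git-annex command.

```shell
# Hypothetical helper: read `git annex info` output on stdin and print the
# value from the "annexed files in working tree" line.
count_annexed_files() {
    # The ^ anchor skips the separate "size of annexed files in working tree"
    # line, which would otherwise match as well.
    awk -F': ' '/^annexed files in working tree:/ { print $2 }'
}

# Usage inside the dataset:
# git annex info | count_annexed_files
```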
Exercise 9 Use git annex whereis to list the repositories that have the file content for the image examples/gentoo.jpg.
git annex whereis examples/gentoo.jpg
Exercise 10 Use git annex whereis to list the repositories that have the file content for the table data/table_220.csv. How does this differ from the list of repositories that contain the content for gentoo.jpg?
git annex whereis data/table_220.csv
Unlike gentoo.jpg, the content of the table is not stored in the local repository. The local repository is the one marked [here] in the output, and that line is missing for the table.
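The same comparison can be scripted by looking for the [here] marker in the whereis output. This is a rough sketch that relies only on that marker as described above; `content_is_here` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the `git annex whereis` output on stdin
# lists a repository marked [here], i.e. the local repository.
content_is_here() {
    grep -q '\[here\]'
}

# Usage inside the dataset (Linux/macOS shell):
# for f in examples/gentoo.jpg data/table_220.csv; do
#     if git annex whereis "$f" | content_is_here; then
#         echo "$f: content is local"
#     else
#         echo "$f: content is not local"
#     fi
# done
```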