Part 1: Working with DataLad Datasets
Consuming Existing Datasets
In this section, you will clone an existing DataLad dataset and download its contents. While the DataLad
commands are the same on every platform, the commands for navigating the dataset differ between operating systems (see the table below).
Linux/macOS | Windows | Description |
---|---|---|
ls -a | dir /a | List the content of the current directory (including hidden files) |
ls -a data | dir /a data | List the content of the data directory |
du -sh | dir /s | Get the disk usage of the current directory |
du -sh data | dir /s data | Get the disk usage of the data directory |
cd data | cd data | Change the directory to data |
Command | Description |
---|---|
datalad clone https://example.com | Clone the dataset from example.com |
datalad get folder/ | Get the file content of folder/ |
datalad get folder/image.png | Get the file content of the file image.png |
datalad drop folder/ | Drop the file content of folder/ |
Open a terminal to do the following exercises. For Windows users, we recommend CMD (the solutions will assume you are using CMD).
Exercise 1 Clone the dataset from https://gin.g-node.org/obi/penguins
datalad clone https://gin.g-node.org/obi/penguins
Exercise 2 Change the directory to penguins and list the directory’s content
On Linux/macOS:
cd penguins
ls -a
On Windows:
cd penguins
dir /a
Exercise 3 Check the disk usage of the penguins directory
On Linux/macOS:
du -sh
On Windows:
dir /s
Exercise 4 Get the content of the examples subdirectory
datalad get examples
Exercise 5 Check the disk usage of the penguins directory again
On Linux/macOS:
du -sh
On Windows:
dir /s
Exercise 6 Drop the content of examples/chinstrap.jpg and check the disk usage again
datalad drop examples/chinstrap.jpg
On Linux/macOS:
du -sh
On Windows:
dir /s
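Comparing the du -sh output by eye across Exercises 3 to 6 works, but the change can also be computed. The following is a minimal sketch for Linux/macOS only; the helper names usage_kb and usage_delta are hypothetical and not part of DataLad. It measures the directory size in kilobytes before and after running a command and prints the difference.

```shell
#!/bin/sh
# Print the disk usage of a directory (default: current directory) in kilobytes.
usage_kb() {
    du -sk "${1:-.}" | cut -f1
}

# Run a command and print by how many kilobytes the current directory grew
# (a negative number means it shrank). The command's own output is sent to
# stderr so that only the number ends up on stdout.
usage_delta() {
    before=$(usage_kb .)
    "$@" >&2 || return 1
    after=$(usage_kb .)
    echo $((after - before))
}

# Example calls inside the cloned penguins dataset (kept as comments so the
# snippet can be sourced on a machine without DataLad):
#   usage_delta datalad get examples
#   usage_delta datalad drop examples/chinstrap.jpg
```

If everything works as in the exercises, the first call should print a positive number and the second a negative one, because get adds file content to the dataset and drop removes it again.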
Checking File Identity and Location with git-annex
Since DataLad is built on top of git-annex, you can use its commands on any DataLad dataset. In this section, you’ll use git-annex to get information on the dataset and locate its file contents.
Command | Description |
---|---|
git annex info |
Show the git-annex information for the whole dataset |
git annex info folder/image.png |
Show the git-annex information for the file image.png |
git annex whereis folder/image.png |
List the repositories that have the file content for image.png |
Exercise 7 Display the git annex info for the file examples/gentoo.jpg. What is the size of that file? Is it present on your machine?
git annex info examples/gentoo.jpg
The file is 4.81 megabytes, and it should be present since we previously downloaded the content of the examples folder.
Exercise 8 Display the git annex info of the whole dataset. How many annexed files are there in the working tree?
git annex info
The number of annexed files is displayed in this line: annexed files in working tree: 6
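To pick that line out of the longer git annex info report, the output can be piped through a small filter. This is a sketch for Linux/macOS that assumes the line is labeled exactly as shown above; the function name annexed_in_tree is hypothetical.

```shell
#!/bin/sh
# Read `git annex info` output on stdin and print only the number that
# follows the "annexed files in working tree:" label.
annexed_in_tree() {
    sed -n 's/^[[:space:]]*annexed files in working tree:[[:space:]]*//p'
}

# Example call inside a dataset (kept as a comment so the snippet can be
# sourced on a machine without git-annex):
#   git annex info | annexed_in_tree
```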
Exercise 9 Use git annex whereis to list the repositories that have the file content for the image examples/gentoo.jpg.
git annex whereis examples/gentoo.jpg
Exercise 10 Use git annex whereis to list the repositories that have the file content for the table data/table_220.csv. How does this differ from the list of repositories that contain the content for gentoo.jpg?
git annex whereis data/table_220.csv
Unlike gentoo.jpg, the table is not stored in the local repository, so the output contains no line marked [here].
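Whether content is available locally can also be checked without git annex whereis. The sketch below rests on an assumption: the dataset uses git-annex's default locked mode on Linux/macOS, where an annexed file is a symbolic link into .git/annex/objects and the link dangles while the content is absent. This usually does not hold on Windows, and a regular file that is not annexed is also reported as present. The function name content_status is hypothetical.

```shell
#!/bin/sh
# Report whether the content behind a path is available locally.
# Assumption: annexed files are symlinks (git-annex's default locked mode).
content_status() {
    if [ -e "$1" ]; then
        echo "present"   # the path resolves to real content
    elif [ -L "$1" ]; then
        echo "absent"    # dangling symlink: content dropped or never fetched
    else
        echo "missing"   # no such path in the working tree
    fi
}

# Example calls inside the penguins dataset:
#   content_status examples/gentoo.jpg
#   content_status data/table_220.csv
```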