Creating and managing data packages

Warning

🚧 This section is still in active development and is subject to change 🚧

At the core of Sprout is the data package, which is a standardized way of structuring and sharing data. This guide will show you how to create and manage data packages using Sprout.

Important

For both the Python library and the CLI, Sprout assumes you have full control over the folders and files of the system, or at least of your user’s home directory. This includes storage space on a server that is accessed mostly through a Terminal, as long as you have control over the directories you can write to.

A simple example is installing Sprout on your own computer to create data packages for research studies you are running. Another example is a research group with storage space on a server that you access through a Terminal and SSH.

Creating a data package

The first thing you’ll need to decide is where you want to store your data packages. By default, Sprout will create them in ~/sprout/packages/ on Linux (see Outputs for operating system-specific locations), but you can change this by setting the SPROUT_ROOT environment variable. For instance, maybe you want the location to be ~/Documents/data-packages/. You can set this in your Python script like so:

import sprout.core as sp
import os

os.environ["SPROUT_ROOT"] = "~/Documents/data-packages/"
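
Python does not expand ~ when a string is assigned to os.environ, and this guide does not say whether Sprout expands it on its own. If you would rather store an absolute path, the following is a minimal sketch using only the standard library. The helper name set_sprout_root() is hypothetical and is not part of Sprout.

```python
import os
from pathlib import Path


def set_sprout_root(location: str) -> Path:
    """Expand "~" in location and store the result in SPROUT_ROOT.

    Hypothetical helper for illustration; it is not part of Sprout.
    """
    root = Path(location).expanduser()
    os.environ["SPROUT_ROOT"] = str(root)
    return root


root = set_sprout_root("~/Documents/data-packages")
print(os.environ["SPROUT_ROOT"])
```

The printed value depends on your home directory, so it will differ between machines.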

Afterwards, you can create the structure for your first data package by using:

sp.create_package_structure(path=sp.path_packages())
[PosixPath('~/Documents/data-packages/1/datapackage.json'),
 PosixPath('~/Documents/data-packages/1/README.md')]

This creates the initial structure of your new package with the ID 1. The output above shows that the folder of your data package 1 has been created. This folder consists of two files: datapackage.json and README.md. The datapackage.json file initially contains fields with some default values in them, but it will eventually contain the metadata, a.k.a. the properties, of your data package. README.md is a prettified, human-readable version of the content of the datapackage.json.
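
If you want to confirm from Python that both files are in place and look at what datapackage.json currently holds, a sketch along these lines may help. It uses only the standard library. The function names are hypothetical and not part of Sprout, and the demonstration runs on a temporary folder with stand-in content rather than on real Sprout output.

```python
import json
import tempfile
from pathlib import Path

# The two files this guide says a new package folder contains.
EXPECTED_FILES = ("datapackage.json", "README.md")


def missing_package_files(package_dir: Path) -> list[str]:
    """Return the expected file names that are absent from package_dir."""
    return [name for name in EXPECTED_FILES if not (package_dir / name).is_file()]


def read_properties(package_dir: Path) -> dict:
    """Load datapackage.json from package_dir as a dictionary."""
    with open(package_dir / "datapackage.json", encoding="utf-8") as file:
        return json.load(file)


# Demonstration on a throwaway folder with stand-in content (not Sprout output).
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "1"
    package_dir.mkdir()
    (package_dir / "datapackage.json").write_text(
        '{"title": "Stand-in title"}', encoding="utf-8"
    )
    # README.md was deliberately not created in this demonstration folder.
    print(missing_package_files(package_dir))
    print(read_properties(package_dir))
```

For a real package, you would pass the folder of the package instead of the temporary one, for example the folder shown in the output above.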

While you can manually fill in the details in the datapackage.json file, we have several helper classes, such as PackageProperties, LicenseProperties, and ContributorProperties, to make it easier for you.

properties = sp.PackageProperties(
    title="Diabetes and Hypertension Study",
    description="Data from the 2021 study on diabetes and hypertension",
    contributors=[sp.ContributorProperties(
        title="Jamie Jones",
        email="jamie_jones@example.com",
        roles=["creator"]
    )],
    licenses=[sp.LicenseProperties("ODC-BY-1.0")]
)
print(properties)
# TODO: This will eventually show the actual output.
PackageProperties(...)

Then, to update the current datapackage.json file with these properties, you can use the update_package_properties() function:

package_properties = sp.update_package_properties(
  path=sp.path_properties(package_id=1),
  properties=properties
)
print(package_properties)
# TODO: add an example output of the above.
{...}

Important

The update_package_properties() function will give an error if the required fields are not filled in to create a valid datapackage.json file.

To save the updated package properties returned above to the datapackage.json file, run:

sp.write_package_properties(
  properties=package_properties,
  path=sp.path_properties(package_id=1)
)

If you need help with filling in the right properties, see the documentation for the PackageProperties class or run, for example, print(sp.PackageProperties()) to get a list of all the fields you can fill in for a package.

You now have the basic starting point for adding data resources to your data package.

The CLI is a bit more straightforward as long as you are comfortable using the Terminal. As with the Python library, you can set the SPROUT_ROOT environment variable to change the location of the data packages. For instance, maybe you want the location to be ~/Documents/data-packages/. You can set this in your terminal like so:

export SPROUT_ROOT=~/Documents/data-packages/

Then creating a new package would be as simple as:

sprout package create

This will prompt you for some required fields you need to fill in, like the title and description of the data package. If you want to skip the prompt, you can provide the information directly in the command:

sprout package create \
  --title "Diabetes and Hypertension Study" \
  --description "Data from the 2021 study on diabetes and hypertension"

This creates the initial structure of your new package with the ID 1: a folder that contains two files, datapackage.json and README.md. The datapackage.json file will contain the metadata, a.k.a. the properties, of your data package. README.md is a prettified, human-readable version of the content of the datapackage.json.

You now have the basic starting point for adding data resources to your data package in the location you specified:

~/Documents/data-packages/1/datapackage.json
~/Documents/data-packages/1/README.md
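
Since each package lives in its own numbered folder under the root, you can list the packages that exist by looking for folders that contain a datapackage.json file. The following is a minimal sketch using only the standard library. The function list_package_ids() is hypothetical and not part of Sprout, and the demonstration builds a temporary folder rather than reading a real SPROUT_ROOT.

```python
import tempfile
from pathlib import Path


def list_package_ids(root: Path) -> list[str]:
    """Return names of folders directly under root that hold a datapackage.json.

    Names are sorted as text, so "10" comes before "2".
    """
    return sorted(
        folder.name
        for folder in root.iterdir()
        if folder.is_dir() and (folder / "datapackage.json").is_file()
    )


# Demonstration on a throwaway root with stand-in packages (not Sprout output).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for package_id in ("1", "2"):
        (root / package_id).mkdir()
        (root / package_id / "datapackage.json").write_text("{}", encoding="utf-8")
    (root / "notes").mkdir()  # no datapackage.json, so it is skipped
    print(list_package_ids(root))
```

To run it against your own packages, pass the folder you set in SPROUT_ROOT as root.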

Warning

In development.