Creating and managing data resources

Warning

🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧

In each data package are data resources, which contain conceptually standalone sets of data. This page shows you how to create and manage data resources inside a data package using Sprout. You will need to have already created a data package.

Important

The Sprout Python package assumes you have full control over the folders and files of the system, or at least of your user’s home directory. This includes the case where you have been given space on a server that is mostly accessed through a terminal, as long as you have control over the directories you can write to.

Important

Data resources can only be created from tidy data. Before you can store your data, you need to process it into a tidy format, ideally using Python so that you have a record of the steps taken to clean and transform it.
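
As a rough illustration of what "tidy" means here, the sketch below reshapes a small, made-up wide table (one column per visit) into one row per observation, using only the standard library. The column names, the values, and the `tidy_rows()` helper are hypothetical and are not part of Sprout.

```python
import csv
import io

# Hypothetical untidy input: one column per visit instead of one row per observation.
wide_csv = """id,weight_visit1,weight_visit2
1,70.3,71.0
2,80.5,79.8
"""


def tidy_rows(text: str) -> list[dict[str, str]]:
    """Reshape the wide table into one row per (id, visit) pair."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column, value in record.items():
            if column.startswith("weight_visit"):
                rows.append(
                    {
                        "id": record["id"],
                        "visit": column.removeprefix("weight_visit"),
                        "weight": value,
                    }
                )
    return rows


tidy = tidy_rows(wide_csv)
```

Keeping a step like this in a script, rather than editing the file by hand, gives you the record of cleaning steps mentioned above.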

Putting your raw data into a data package makes it easier for yourself and others to use later on. The steps you’ll take to get your data into the structure used by Sprout are:

  1. Create the properties for the resource, using the original data as a starting point, and edit as needed.
  2. Create the folders for the new resource within the package and save the resource properties in the datapackage.json file.
  3. Add your data to the batches of data for the resource.
  4. Merge the data batches into a new resource data file.
  5. Re-build the data package’s README.md file from the updated datapackage.json file.
  6. If you need to update the properties at a later point, you can use update_resource_properties() and then write the result to the datapackage.json file.

Making a data resource requires that you have data that can be made into a resource in the first place. Usually, generated or collected data starts out in a “raw” shape that needs some work. This work needs to be done before adding the data to a data package, since Sprout assumes that the data is already tidy. For this guide, we use a (fake) data file that is already tidy and looks like this:

id,age,sex,height,weight,diabetes_type
1,54,F,167.5,70.3,Type 1
2,66,M,175,80.5,Type 2
3,64,F,165.8,94.2,Type 1
4,36,F,168.6,121.9,Type 2
5,47,F,176.4,77.5,Type 1
6,61,M,175.1,87.3,Type 2
7,41,M,193.7,103.6,Type 1
8,65,M,199.5,102.1,Type 2
9,30,M,183.8,137.4,Type 1
10,59,F,164.7,148,Type 2
11,67,M,186.3,113.8,Type 1
12,56,M,191.5,76.7,Type 2
13,40,F,170.9,127.4,Type 1
14,29,M,191.2,142.5,Type 2
15,53,M,189.5,84.9,Type 1
16,42,F,181.1,142.2,Type 2
17,68,F,177.7,135.5,Type 1
18,27,F,169.8,111.1,Type 2
19,39,M,187.9,106.7,Type 1
20,63,M,178.6,87.8,Type 2

If you want to follow this guide with the same data, you can find it here. The path to this data, which is stored in a variable called raw_data_path, is:

/tmp/tmps3u2rssv/patients.csv

Before we start, you need to import Sprout as well as other helper packages:

import seedcase_sprout.core as sp

# For pretty printing of output
from pprint import pprint

# TODO: This could be a wrapper helper function instead
# To be able to write multiline strings without indentation
from textwrap import dedent

Extracting resource properties from raw data

You’ll start by creating the resource’s properties. Before you can store data in your data package, you need to describe it using properties (i.e., metadata). The resource’s properties allow other people to understand what your data is about and to use it more easily. They also define what it means for data in the resource to be correct, since all data in the resource must match the properties.

While you can create a resource properties object manually using ResourceProperties, this can be time-consuming if, for example, your data has many columns. To make this process easier, extract_resource_properties() creates an initial resource properties object by extracting as much information as possible from the raw data. Afterwards, you can edit these properties as needed.

Now you are ready to use extract_resource_properties() to extract the resource properties from the data. This function tries to infer the data types from the data, but it might not get them right, so make sure to double-check the output. It cannot infer information that is not in the data itself, such as a description of what the data contains or the units of measurement.

resource_properties = sp.extract_resource_properties(
    data_path=raw_data_path
)
pprint(resource_properties)
ResourceProperties(name='patients',
                   path='/tmp/tmps3u2rssv/patients.csv',
                   type='table',
                   title=None,
                   description=None,
                   sources=None,
                   licenses=None,
                   format='csv',
                   mediatype='text/csv',
                   encoding='utf-8-sig',
                   bytes=None,
                   hash=None,
                   schema=TableSchemaProperties(fields=[FieldProperties(name='id',
                                                                        title=None,
                                                                        type='integer',
                                                                        format=None,
                                                                        description=None,
                                                                        example=None,
                                                                        constraints=None,
                                                                        categories=None,
                                                                        categories_ordered=None,
                                                                        missing_values=None),
                                                        FieldProperties(name='age',
                                                                        title=None,
                                                                        type='integer',
                                                                        format=None,
                                                                        description=None,
                                                                        example=None,
                                                                        constraints=None,
                                                                        categories=None,
                                                                        categories_ordered=None,
                                                                        missing_values=None),
                                                        FieldProperties(name='sex',
                                                                        title=None,
                                                                        type='string',
                                                                        format=None,
                                                                        description=None,
                                                                        example=None,
                                                                        constraints=None,
                                                                        categories=None,
                                                                        categories_ordered=None,
                                                                        missing_values=None),
                                                        FieldProperties(name='height',
                                                                        title=None,
                                                                        type='number',
                                                                        format=None,
                                                                        description=None,
                                                                        example=None,
                                                                        constraints=None,
                                                                        categories=None,
                                                                        categories_ordered=None,
                                                                        missing_values=None),
                                                        FieldProperties(name='weight',
                                                                        title=None,
                                                                        type='number',
                                                                        format=None,
                                                                        description=None,
                                                                        example=None,
                                                                        constraints=None,
                                                                        categories=None,
                                                                        categories_ordered=None,
                                                                        missing_values=None),
                                                        FieldProperties(name='diabetes_type',
                                                                        title=None,
                                                                        type='string',
                                                                        format=None,
                                                                        description=None,
                                                                        example=None,
                                                                        constraints=None,
                                                                        categories=None,
                                                                        categories_ordered=None,
                                                                        missing_values=None)],
                                                fields_match=None,
                                                primary_key=None,
                                                unique_keys=None,
                                                foreign_keys=None,
                                                missing_values=None))
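
To give a feel for how type inference of this kind can work, and why it is worth double-checking, here is a minimal sketch using only the standard library. It is not Sprout's implementation: the helpers `_parses_as()` and `infer_field_type()` are hypothetical, and the sample is just the first three rows of the example data.

```python
import csv
import io

# First rows of the example data used in this guide.
sample_csv = """id,age,sex,height,weight,diabetes_type
1,54,F,167.5,70.3,Type 1
2,66,M,175,80.5,Type 2
3,64,F,165.8,94.2,Type 1
"""


def _parses_as(cast, values):
    """Return True if every value can be converted with `cast`."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_field_type(values: list[str]) -> str:
    """Guess a field type by trying the strictest conversion first."""
    if _parses_as(int, values):
        return "integer"
    if _parses_as(float, values):
        return "number"
    return "string"


rows = list(csv.DictReader(io.StringIO(sample_csv)))
inferred = {
    name: infer_field_type([row[name] for row in rows]) for name in rows[0]
}
```

A rule like this only sees the values themselves. For example, a categorical column coded as 1 and 2 would be guessed as an integer, which is why the inferred types should be reviewed by someone who knows the data.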

You may notice that some things are missing. For instance, the individual columns (called fields) don’t have any descriptions. You will have to add these yourself. You can run a check on the properties to confirm what is missing:

print(sp.check_resource_properties(resource_properties))
  + Exception Group Traceback (most recent call last):
  |   File "/home/runner/work/seedcase-sprout/seedcase-sprout/.venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3667, in run_code
  |     exec(code_obj, self.user_global_ns, self.user_ns)
  |   File "/tmp/ipykernel_2202/4200374576.py", line 1, in <module>
  |     print(sp.check_resource_properties(resource_properties))
  |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/runner/work/seedcase-sprout/seedcase-sprout/src/seedcase_sprout/core/sprout_checks/check_properties.py", line 110, in check_resource_properties
  |     raise error_info
  |   File "/home/runner/work/seedcase-sprout/seedcase-sprout/src/seedcase_sprout/core/sprout_checks/check_properties.py", line 102, in check_resource_properties
  |     _generic_check_properties(
  |   File "/home/runner/work/seedcase-sprout/seedcase-sprout/src/seedcase_sprout/core/sprout_checks/check_properties.py", line 166, in _generic_check_properties
  |     raise ExceptionGroup(
  | ExceptionGroup: The following checks failed on the properties:
  | PackageProperties(name=None, id=None, title=None, description=None, homepage=None, version=None, created=None, contributors=None, keywords=None, image=None, licenses=None, resources=[ResourceProperties(name='patients', path='/tmp/tmps3u2rssv/patients.csv', type='table', title=None, description=None, sources=None, licenses=None, format='csv', mediatype='text/csv', encoding='utf-8-sig', bytes=None, hash=None, schema=TableSchemaProperties(fields=[FieldProperties(name='id', title=None, type='integer', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='age', title=None, type='integer', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='sex', title=None, type='string', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='height', title=None, type='number', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='weight', title=None, type='number', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='diabetes_type', title=None, type='string', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None)], fields_match=None, primary_key=None, unique_keys=None, foreign_keys=None, missing_values=None))], sources=None) (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | seedcase_sprout.core.check_datapackage.check_error.CheckError: Error at `$.description` caused by `required`: 'description' is a required property
    +---------------- 2 ----------------
    | seedcase_sprout.core.check_datapackage.check_error.CheckError: Error at `$.path` caused by `pattern`: '/tmp/tmps3u2rssv/patients.csv' does not match '^((?=[^./~])(?!file:)((?!\\/\\.\\.\\/)(?!\\\\)(?!:\\/\\/).)*|(http|ftp)s?:\\/\\/.*)$'
    +---------------- 3 ----------------
    | seedcase_sprout.core.check_datapackage.check_error.CheckError: Error at `$.path` caused by `pattern`: 'path' should contain the resource ID
    +---------------- 4 ----------------
    | seedcase_sprout.core.check_datapackage.check_error.CheckError: Error at `$.title` caused by `required`: 'title' is a required property
    +------------------------------------
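
The second error in the output above is a pattern mismatch on the path. The sketch below applies the same pattern with Python's `re` module to show why an absolute path is rejected while a path relative to the package is accepted. The pattern is copied from the error message, with the doubled backslashes of the printed representation reduced to single ones; the `is_valid_resource_path()` helper is hypothetical and not part of Sprout.

```python
import re

# Pattern copied from the check output above, with the doubled backslashes
# from the printed representation reduced to single ones.
PATH_PATTERN = re.compile(
    r"^((?=[^./~])(?!file:)((?!\/\.\.\/)(?!\\)(?!:\/\/).)*|(http|ftp)s?:\/\/.*)$"
)


def is_valid_resource_path(path: str) -> bool:
    """Return True if the path satisfies the pattern from the check."""
    return PATH_PATTERN.match(path) is not None


# The temporary, absolute path of the raw data starts with "/", so it fails.
absolute_ok = is_valid_resource_path("/tmp/tmps3u2rssv/patients.csv")
# A path relative to the package root passes.
relative_ok = is_valid_resource_path("resources/1/data.parquet")
```

In short, the pattern rejects paths that start with `/`, `.`, or `~`, so the path has to be given relative to the package folder (or as an http/ftp URL).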

Time to fill in the missing fields of the resource properties:

# TODO: Need to consider/design how editing can be done in an easier, user-friendly way.
# TODO: Add more detail when we know what can and can't be extracted.
resource_properties.title = "Patient Data"
resource_properties.description = "This data resource contains data about..."

Creating a data resource

Now that you have the properties for the resource, you can create the resource itself within your existing data package. As a data package can contain multiple resources, each resource is stored in a separate folder. So in this step you will make a folder for the new resource within the data package and save the resource properties to the datapackage.json file of the package.

We assume that you’ve already created a data package (for example, by following the package guide) and stored the path to it as a PackagePath object in the package_path variable. In this guide, the root folder of the package is in a temporary folder:

print(package_path.root())
/tmp/tmps3u2rssv/diabetes-study

Let’s take a look at the current files and folders in the data package:

[PosixPath('/tmp/tmps3u2rssv/diabetes-study/README.md'), PosixPath('/tmp/tmps3u2rssv/diabetes-study/datapackage.json')]

This shows that the data package already includes a datapackage.json file and a README.md file. Now, to create the resource structure in this package, use create_resource_structure(). By default, this function assumes that the data package is located in the current working directory. To point it to another location, pass it the path to your data package folder, as in the example below.

resource_path, _ = sp.create_resource_structure(path=package_path.root())
resource_properties = sp.create_resource_properties(
    path=resource_path,
    properties=resource_properties
)
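
To picture what a per-resource folder structure might look like on disk, here is a small sketch using `pathlib` in a temporary directory. The layout (`resources/<id>/` with a `batch/` subfolder) is an assumption inferred from the resource path and the batch/ folder mentioned in this guide, and `make_resource_folders()` is a hypothetical helper, not a Sprout function.

```python
import tempfile
from pathlib import Path


def make_resource_folders(package_root: Path, resource_id: int) -> Path:
    """Create `resources/<id>/batch/` under the package root (assumed layout)."""
    resource_dir = package_root / "resources" / str(resource_id)
    (resource_dir / "batch").mkdir(parents=True, exist_ok=True)
    return resource_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "diabetes-study"
    resource_dir = make_resource_folders(root, resource_id=1)
    # Paths relative to the package root, as they would appear in the properties.
    relative_dir = resource_dir.relative_to(root).as_posix()
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
```

Giving each resource its own numbered folder keeps the files of different resources from colliding when a package holds more than one resource.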

The next step is to write the resource properties to the datapackage.json file. Before they are added, they will be checked to confirm that they are correctly filled in and that no required fields are missing. You can use the PackagePath().properties() helper function to give you the location of the datapackage.json file based on the path to your package.

sp.write_resource_properties(
    path=package_path.properties(),
    resource_properties=resource_properties,
)
PosixPath('/tmp/tmps3u2rssv/diabetes-study/datapackage.json')

Let’s check the contents of the datapackage.json file to see that the resource properties have been added:

pprint(sp.read_properties(package_path.properties()))
PackageProperties(name='example-package',
                  id='fac8e8f5-e8f3-4bc4-a707-294504d2e998',
                  title='Example fake data package',
                  description='Data from a fake data package on something.',
                  homepage=None,
                  version='0.1.0',
                  created='2025-04-15T13:58:38+00:00',
                  contributors=[ContributorProperties(title='Jamie Jones',
                                                      path=None,
                                                      email='jamie_jones@example.com',
                                                      given_name=None,
                                                      family_name=None,
                                                      organization=None,
                                                      roles=['creator'])],
                  keywords=None,
                  image=None,
                  licenses=[LicenseProperties(name='ODC-BY-1.0',
                                              path='https://opendatacommons.org/licenses/by',
                                              title='Open Data Commons '
                                                    'Attribution License 1.0')],
                  resources=[ResourceProperties(name='patients',
                                                path='resources/1/data.parquet',
                                                type='table',
                                                title='Patient Data',
                                                description='This data '
                                                            'resource contains '
                                                            'data about...',
                                                sources=None,
                                                licenses=None,
                                                format='csv',
                                                mediatype='text/csv',
                                                encoding='utf-8-sig',
                                                bytes=None,
                                                hash=None,
                                                schema=TableSchemaProperties(fields=[FieldProperties(name='id',
                                                                                                     title=None,
                                                                                                     type='integer',
                                                                                                     format=None,
                                                                                                     description=None,
                                                                                                     example=None,
                                                                                                     constraints=None,
                                                                                                     categories=None,
                                                                                                     categories_ordered=None,
                                                                                                     missing_values=None),
                                                                                     FieldProperties(name='age',
                                                                                                     title=None,
                                                                                                     type='integer',
                                                                                                     format=None,
                                                                                                     description=None,
                                                                                                     example=None,
                                                                                                     constraints=None,
                                                                                                     categories=None,
                                                                                                     categories_ordered=None,
                                                                                                     missing_values=None),
                                                                                     FieldProperties(name='sex',
                                                                                                     title=None,
                                                                                                     type='string',
                                                                                                     format=None,
                                                                                                     description=None,
                                                                                                     example=None,
                                                                                                     constraints=None,
                                                                                                     categories=None,
                                                                                                     categories_ordered=None,
                                                                                                     missing_values=None),
                                                                                     FieldProperties(name='height',
                                                                                                     title=None,
                                                                                                     type='number',
                                                                                                     format=None,
                                                                                                     description=None,
                                                                                                     example=None,
                                                                                                     constraints=None,
                                                                                                     categories=None,
                                                                                                     categories_ordered=None,
                                                                                                     missing_values=None),
                                                                                     FieldProperties(name='weight',
                                                                                                     title=None,
                                                                                                     type='number',
                                                                                                     format=None,
                                                                                                     description=None,
                                                                                                     example=None,
                                                                                                     constraints=None,
                                                                                                     categories=None,
                                                                                                     categories_ordered=None,
                                                                                                     missing_values=None),
                                                                                     FieldProperties(name='diabetes_type',
                                                                                                     title=None,
                                                                                                     type='string',
                                                                                                     format=None,
                                                                                                     description=None,
                                                                                                     example=None,
                                                                                                     constraints=None,
                                                                                                     categories=None,
                                                                                                     categories_ordered=None,
                                                                                                     missing_values=None)],
                                                                             fields_match=None,
                                                                             primary_key=None,
                                                                             unique_keys=None,
                                                                             foreign_keys=None,
                                                                             missing_values=None))],
                  sources=None)
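
Because datapackage.json is a plain JSON file, you can also inspect it with Python's standard library when Sprout is not at hand. The sketch below parses a trimmed-down, hypothetical version of the file (only a few keys, with values taken from the output above) and lists each resource with its path.

```python
import json

# A trimmed-down, hypothetical datapackage.json with only a few keys.
datapackage_text = """
{
  "name": "example-package",
  "resources": [
    {
      "name": "patients",
      "path": "resources/1/data.parquet",
      "title": "Patient Data"
    }
  ]
}
"""

package = json.loads(datapackage_text)

# Map each resource name to its path within the package.
resource_paths = {
    resource["name"]: resource["path"] for resource in package["resources"]
}
```

Reading the file this way is fine for a quick look. For changes, the Sprout functions are the safer route, since they check the properties before writing.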

Storing a backup of the raw data as a batch file

Note

See the flow diagrams for a simplified flow of steps involved in adding batch files.

As shown above, the data is currently stored at the path held in the raw_data_path variable. Now store this data in the resource’s folder by using:

# TODO: eval when function implemented
sp.write_resource_batch(
    data=...,
    resource_properties=resource_properties
)

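This function uses the properties object to determine where to store the data as a batch file, namely in the batch/ folder of the resource’s folder. You can check for the newly added file with the code below. In this guide, the write step above has not been run yet, so the list printed here is empty: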

print(package_path.resource_batch_files(1))
[]

Building the resource data file

Now that you’ve stored the data as a batch file, you can build the Parquet file that will be used as the data resource. This Parquet file is built from all the data in the batch/ folder. Since there is only one batch data file stored in the resource’s folder, only this one will be used to build the data resource’s Parquet file:

# TODO: eval when function implemented
sp.join_resource_batches(
    data_list=...,
    resource_properties=resource_properties
)

Tip

If you add more raw data to the resource later on, you can update this Parquet file to include all data in the batch folder using the join_resource_batches() function as shown above.
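
The idea of building one data file from several batches can be sketched with plain Python. This is only an illustration, not how Sprout does it: it uses in-memory CSV text instead of Parquet files, the two batches are made up, and the rule that a later batch overrides an earlier row with the same `id` is an assumption that this guide does not confirm.

```python
import csv
import io

# Two hypothetical batches; the second one corrects the row with id 2.
batch_1 = "id,age\n1,54\n2,66\n"
batch_2 = "id,age\n2,67\n3,64\n"


def read_batch(text: str) -> list[dict[str, str]]:
    """Parse one CSV batch into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_batches(
    batches: list[list[dict[str, str]]], key: str
) -> list[dict[str, str]]:
    """Combine batches in order; for a repeated key the later batch wins."""
    merged: dict[str, dict[str, str]] = {}
    for batch in batches:
        for row in batch:
            merged[row[key]] = row
    return list(merged.values())


merged_rows = merge_batches([read_batch(batch_1), read_batch(batch_2)], key="id")
```

Keeping every batch and rebuilding the combined file from all of them means the combined file can always be regenerated from the stored batches.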

Re-building the README file

One of the last steps in adding a new data resource is to re-build the README.md file of the data package. To allow some flexibility in what gets added to the README text, the next function only builds the text and does not write it to the file. This lets you add information to the README text before writing it.

readme_text = sp.as_readme_text(
    properties=sp.read_properties(package_path.properties())
)

For this guide, you’ll use only the default text and not add anything to it. Next, write the text to the README.md file:

sp.write_file(
    string=readme_text,
    path=package_path.readme()
)
PosixPath('/tmp/tmps3u2rssv/diabetes-study/README.md')
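
The split between building the text and writing it can be sketched as follows. The `build_readme_text()` helper and its output format are hypothetical and do not reproduce Sprout's README template. The point is only that a function returning a string leaves room to append text before anything is written to disk.

```python
import tempfile
from pathlib import Path


def build_readme_text(properties: dict) -> str:
    """Build README text from a few package properties without writing it."""
    lines = [f"# {properties['title']}", "", properties["description"], ""]
    lines.append("## Resources")
    lines.append("")
    for resource in properties.get("resources", []):
        lines.append(f"- {resource['title']}: {resource['description']}")
    return "\n".join(lines) + "\n"


properties = {
    "title": "Example fake data package",
    "description": "Data from a fake data package on something.",
    "resources": [
        {
            "title": "Patient Data",
            "description": "This data resource contains data about...",
        }
    ],
}

text = build_readme_text(properties)
# Because nothing has been written yet, extra text can still be appended.
text += "\n## Notes\n\nHypothetical extra section added before writing.\n"

with tempfile.TemporaryDirectory() as tmp:
    readme_path = Path(tmp) / "README.md"
    readme_path.write_text(text, encoding="utf-8")
    written = readme_path.read_text(encoding="utf-8")
```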

Updating resource properties

After creating a resource, you may need to edit its properties. While you can technically do this by opening the datapackage.json file and editing it by hand, we strongly recommend that you use the update functions instead. These functions help ensure that the datapackage.json file remains valid JSON and has the correct fields filled in.

Call update_resource_properties() with two arguments: the current resource properties and a resource properties object representing the updates you want to make. The latter will very often be a partial resource properties object, with only the fields you want to update filled in. The function returns an updated resource properties object in which any field set in the updates overwrites the corresponding field in the current properties.

# TODO: eval when function implemented
resource_properties = sp.update_resource_properties(
    current_properties=resource_properties,
    update_properties=sp.ResourceProperties(
        title="Basic characteristics of patients"
    )
)
pprint(resource_properties)
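
The overwrite behaviour described above can be pictured with a small stand-in built on dataclasses. This is a sketch, not Sprout's code: `Props` and `apply_updates()` are hypothetical, and it assumes that a field left as `None` in the updates means "leave the current value alone".

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class Props:
    """Stand-in for a properties object; every field is optional."""

    name: Optional[str] = None
    title: Optional[str] = None
    description: Optional[str] = None


def apply_updates(current: Props, updates: Props) -> Props:
    """Return a copy of `current` where fields set in `updates` take precedence."""
    changes = {
        f.name: getattr(updates, f.name)
        for f in fields(updates)
        if getattr(updates, f.name) is not None
    }
    return replace(current, **changes)


current = Props(
    name="patients",
    title="Patient Data",
    description="This data resource contains data about...",
)
updated = apply_updates(current, Props(title="Basic characteristics of patients"))
```

Here only the title changes. The name and description are carried over from the current properties, and the original object is left untouched because a new object is returned.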

Finally, to write your changes back to datapackage.json, use the write_resource_properties() function:

sp.write_resource_properties(
    path=package_path.properties(),
    resource_properties=resource_properties,
)
PosixPath('/tmp/tmps3u2rssv/diabetes-study/datapackage.json')