Resource metadata

What is a data resource?

In the previous sections, we saw how to create and manage package metadata for our data package. In this section, we will explore how to add data files and manage their metadata (e.g., documenting the type of data in each column). In a Data Package, data files are referred to as data resources, each containing a conceptually distinct set of data. We refer to the metadata for a data resource as “resource metadata”.

Creating a data resource

Creating a data resource requires that your data is in the correct format. Usually, generated or collected data starts out in a “raw” shape that needs to be cleaned and organized into so-called “tidy data” before it can become a data resource. How to tidy data differs from dataset to dataset and is outside the scope of Sprout, so we will not cover the procedure in detail here. Ideally, you would use a Python package such as Polars to tidy your data, so that you have a record of the steps taken to clean and transform the data. After cleaning, your data should follow the specification outlined in our documentation; specifically, it needs to be a Polars DataFrame.

For this guide, you will use (fake) patient data that is already tidy. We’ve placed this data in a raw/ folder in the Data Package and called it patients.csv, so that we can keep the original “raw” data separate from the processed data.

At this point, your data package diabetes-study has the following structure:

📁 diabetes-study/
├─📁 raw/
│ └─📄 patients.csv
├─📁 scripts/
│ └─📄 package_properties.py
├─📄 .gitignore
├─📄 .python-version
├─📄 README.md
├─📄 datapackage.json
├─📄 main.py
└─📄 pyproject.toml

And the raw/patients.csv file includes data about patients with diabetes, which looks like this:

shape: (20, 6)
id age sex height weight diabetes_type
i64 i64 str f64 f64 str
1 54 "F" 167.5 70.3 "Type 1"
2 66 "M" 175.0 80.5 "Type 2"
3 64 "F" 165.8 94.2 "Type 1"
4 36 "F" 168.6 121.9 "Type 2"
5 47 "F" 176.4 77.5 "Type 1"
… … … … … …
16 42 "F" 181.1 142.2 "Type 2"
17 68 "F" 177.7 135.5 "Type 1"
18 27 "F" 169.8 111.1 "Type 2"
19 39 "M" 187.9 106.7 "Type 1"
20 63 "M" 178.6 87.8 "Type 2"

Managing resource metadata with Sprout

Before you can store a data file as a resource in your data package, you need to describe its metadata properties. The resource’s metadata are what allow other people to understand what your data is about and to use it more easily.

The resource’s metadata also define what it means for data in the resource to be correctly entered, as all data in the resource must match the metadata properties. Sprout checks that the metadata properties are correctly filled in and that no required metadata fields are missing. It also checks that the data in the data resources matches the metadata properties, so that you can be sure that the data actually contains what you expect it to contain. These checks can protect you from human errors introduced when adding new data or editing an existing data resource.

Creating a script to help manage resource metadata

As you did for the package metadata, you will create a script to manage the resource metadata. You can do this by using the create_resource_properties_script() function. This function needs the resource_name parameter to specify the name of the resource, which is used to identify the resource in the data package. The name is a required property and should not be changed later, since it is used in the file name of the data resource’s properties script. Because a data package can contain multiple resources, the resource’s name must also be unique.

Building upon the main.py file we created in the previous section, we will go ahead and add the create_resource_properties_script() step to the pipeline:

main.py
import seedcase_sprout as sp
from scripts.package_properties import package_properties

def main():
1    sp.create_properties_script()
2    sp.create_resource_properties_script(resource_name="patients")
3    ...
if __name__ == "__main__":
    main()
1
We already had this line in our script from the previous section of the guide.
2
This is the only new line of code. It creates the resource properties script. Note that the script is only created once; it will not be overwritten if it already exists.
3
We have removed saving the properties into datapackage.json by deleting write_properties() and the creation of the README to make it easier to see what is new. You will learn how to write the resource metadata and the README later in this guide.

Now go ahead and run the script like previously:

Terminal
uv run main.py

Your folders and files should now look like:

📁 diabetes-study/
├─📁 raw/
│ └─📄 patients.csv
├─📁 scripts/
│ ├─📄 package_properties.py
│ └─📄 resource_properties_patients.py
├─📄 .gitignore
├─📄 .python-version
├─📄 README.md
├─📄 datapackage.json
├─📄 main.py
└─📄 pyproject.toml

As you can see, the filename has been suffixed with your chosen resource_name, i.e. “patients”. If you view the content of scripts/resource_properties_patients.py, you will see that it looks similar to the package metadata script, but with different names for the metadata properties. Since the file content is somewhat lengthy, it’s hidden by default on this page. You can click the banner below to show it.

scripts/resource_properties_patients.py
import seedcase_sprout as sp

resource_properties_patients = sp.ResourceProperties(
    ## Required:
    name="patients",
    title="",
    description="",
    ## Optional:
    type="table",
    format="parquet",
    mediatype="application/parquet",
    schema=sp.TableSchemaProperties(
        ## Required
        fields=[
            sp.FieldProperties(
                ## Required
                name="",
                type="",
                ## Optional
                # title="",
                # format="",
                # description="",
                # example="",
                # categories=[],
                # categories_ordered=False,
            ),
        ],
        ## Optional
        # fields_match=["equal"],
        # primary_key=[""],
        # unique_keys=[[""]],
        # foreign_keys=[
        #     sp.TableSchemaForeignKeyProperties(
        #         ## Required
        #         fields=[""],
        #         reference=sp.ReferenceProperties(
        #             ## Required
        #             resource="",
        #             fields=[""],
        #         ),
        #     ),
        # ],
    ),
    # sources=[
    #     sp.SourceProperties(
    #         ## Required:
    #         title="",
    #         ## Optional:
    #         path="",
    #         email="",
    #         version="",
    #     ),
    # ],
)

As with the template for the package metadata, the comments in this file indicate which properties are required and which are optional. You can see that you would need to fill out a title and a description just as we did previously for the package metadata. However, you also need to fill out the name and the type inside FieldProperties (a “field” is the same as a “column” or “variable” in your dataset). Doing this manually can be tedious for datasets with many columns, so Sprout provides a way to extract this metadata directly from the data file.

Extracting column metadata directly from the data

To ease the process of adding fields to your resource metadata, Sprout provides a function called extract_field_properties(), which allows you to extract metadata from each column in your dataset. To use this function, we need to read in our data as a Polars DataFrame and clean it until we are happy with the name and data type for each column. In the code chunk below, we’ve added all the required steps to our main.py file.

main.py
import seedcase_sprout as sp
from scripts.package_properties import package_properties
import polars as pl

def main():
    sp.create_properties_script()

1    raw_data_patients = pl.read_csv(sp.PackagePath().root() / "raw" / "patients.csv")
2    ... # Data cleaning
3    field_properties = sp.extract_field_properties(data=raw_data_patients)
    sp.create_resource_properties_script(
        resource_name="patients",
4        fields=field_properties,
    )

if __name__ == "__main__":
    main()
1
Read the original data from the raw/ folder into a Polars DataFrame.
2
Clean the data until you have the name and data type of each column to what you want or need them to be. In our example, the raw data is already cleaned and in a tidy format, but in an actual project, you would likely have a separate cleaning script and call some of its functions here.
3
Extract field properties from the cleaned data, such as the column name and its data type.
4
Pass the extracted field properties to the function that saves the properties script, so that they are included in the created file.
Warning

extract_field_properties() extracts the field properties from the Polars DataFrame’s schema and maps the Polars data type to a Data Package field type. The mapping is not perfect, so you may need to edit the extracted properties to ensure that they are as you want them to be.

Before running the script, we need to manually delete scripts/resource_properties_patients.py, since the create_resource_properties_script() function will never overwrite an existing file (this is a precaution in case you have made manual edits to the file and accidentally trigger the function). Then run the main.py file in your Terminal, which will re-create the resource properties script:

Terminal
uv run main.py

Your folders and files should now look like:

📁 diabetes-study/
├─📁 raw/
│ └─📄 patients.csv
├─📁 scripts/
│ ├─📄 package_properties.py
│ └─📄 resource_properties_patients.py
├─📄 .gitignore
├─📄 .python-version
├─📄 README.md
├─📄 datapackage.json
├─📄 main.py
└─📄 pyproject.toml

If you now open scripts/resource_properties_patients.py, you will see that field properties are added for all the columns in the data. The fields that could be automatically extracted also have their metadata filled in already, in this case the name and type:

scripts/resource_properties_patients.py
import seedcase_sprout as sp

resource_properties_patients = sp.ResourceProperties(
    ## Required:
    name="patients",
    title="",
    description="",
    ## Optional:
    type="table",
    format="parquet",
    mediatype="application/parquet",
    schema=sp.TableSchemaProperties(
        ## Required
        fields=[
            sp.FieldProperties(
                ## Required
                name="id",
                type="integer",
                ## Optional
                # title="",
                # format="",
                # description="",
                # example="",
                # categories=[],
                # categories_ordered=False,
            ),
            sp.FieldProperties(
                ## Required
                name="age",
                type="integer",
                ## Optional
                # title="",
                # format="",
                # description="",
                # example="",
                # categories=[],
                # categories_ordered=False,
            ),
            sp.FieldProperties(
                ## Required
                name="sex",
                type="string",
                ## Optional
                # title="",
                # format="",
                # description="",
                # example="",
                # categories=[],
                # categories_ordered=False,
            ),
            sp.FieldProperties(
                ## Required
                name="height",
                type="number",
                ## Optional
                # title="",
                # format="",
                # description="",
                # example="",
                # categories=[],
                # categories_ordered=False,
            ),
            sp.FieldProperties(
                ## Required
                name="weight",
                type="number",
                ## Optional
                # title="",
                # format="",
                # description="",
                # example="",
                # categories=[],
                # categories_ordered=False,
            ),
            sp.FieldProperties(
                ## Required
                name="diabetes_type",
                type="string",
                ## Optional
                # title="",
                # format="",
                # description="",
                # example="",
                # categories=[],
                # categories_ordered=False,
            ),
        ],
        ## Optional
        # fields_match=["equal"],
        # primary_key=[""],
        # unique_keys=[[""]],
        # foreign_keys=[
        #     sp.TableSchemaForeignKeyProperties(
        #         ## Required
        #         fields=[""],
        #         reference=sp.ReferenceProperties(
        #             ## Required
        #             resource="",
        #             fields=[""],
        #         ),
        #     ),
        # ],
    ),
    # sources=[
    #     sp.SourceProperties(
    #         ## Required:
    #         title="",
    #         ## Optional:
    #         path="",
    #         email="",
    #         version="",
    #     ),
    # ],
)

Writing the resource metadata to datapackage.json

Before writing the metadata to file, we need to make sure that all the required properties are filled out. In the resource properties script, the name property is already set to patients. However, the two other required properties, title and description, are empty. You will need to fill these in yourself in the script, like so:

scripts/resource_properties_patients.py
resource_properties_patients = sp.ResourceProperties(
    ## Required:
    name="patients",
    title="Patients Data",
    description="This data resource contains data about patients in a diabetes study.",
    ...  # Additional metadata are omitted here to save space
)
Warning

If the title and description properties are not filled in, you’ll get a CheckError when you try to use write_properties() to save the resource’s properties to the datapackage.json file. You will understand these errors more deeply after reading the next section in the guide; for now, focus on making sure that the three required fields above are all filled out.

Our resource properties file includes the name, title, and description of the data resource, together with the name and type of each field in the data resource. To write this resource metadata to datapackage.json, you need to include the resource properties in the package_properties.py file of your data package. You can do this by adding the following lines to the scripts/package_properties.py file:

scripts/package_properties.py
# Import the resource properties object.
from scripts.resource_properties_patients import resource_properties_patients

package_properties = sp.PackageProperties(
    # Your existing package metadata goes here...
    resources=[
        resource_properties_patients,
    ],
)

You can click the banner below to view the full file at this point:

scripts/package_properties.py
import seedcase_sprout as sp
1    from scripts.resource_properties_patients import resource_properties_patients

package_properties = sp.PackageProperties(
    name="diabetes-study",
    title="A Study on Diabetes",
    # You can write Markdown below, with the helper `sp.dedent()`.
    description=sp.dedent("""
        # Data from a 2021 study on diabetes prevalence

        This data package contains data from a study conducted in 2021 on the
        *prevalence* of diabetes in various populations. The data includes:

        - demographic information
        - health metrics
        - survey responses about lifestyle
        """),
    contributors=[
        sp.ContributorProperties(
            title="Jamie Jones",
            email="jamie_jones@example.com",
            path="example.com/jamie_jones",
            roles=["creator"],
        ),
        sp.ContributorProperties(
            title="Zdena Ziri",
            email="zdena_ziri@example.com",
            path="example.com/zdena_ziri",
            roles=["creator"],
        )
    ],
    licenses=[
        sp.LicenseProperties(
            name="ODC-BY-1.0",
            path="https://opendatacommons.org/licenses/by",
            title="Open Data Commons Attribution License 1.0",
        )
    ],
2    resources=[
        resource_properties_patients,
    ],
    ## Autogenerated:
    id="8f301286-2327-45bf-bbc8-09696d059499",
    version="0.1.0",
    created="2025-11-07T11:12:56+01:00",
)
1
Import the resource metadata from the resource properties script.
2
Set the resources parameter of the package metadata to hold information about all the data resources. There would be one item in the list per data resource.

The next step is to write the resource properties to the datapackage.json file. Since we included the resource_properties_patients object directly in the PackageProperties class in the scripts/package_properties.py file, we can add the same steps as in the last section of the guide to write the datapackage.json file and the README.md file.

main.py
import seedcase_sprout as sp
from scripts.package_properties import package_properties
import polars as pl

def main():
    sp.create_properties_script()

    raw_data_patients = pl.read_csv(sp.PackagePath().root() / "raw" / "patients.csv")
    field_properties = sp.extract_field_properties(data=raw_data_patients)
    sp.create_resource_properties_script(
        resource_name="patients",
        fields=field_properties,
    )

1    sp.write_properties(properties=package_properties)
2    readme_text = sp.as_readme_text(package_properties)
3    sp.write_file(readme_text, sp.PackagePath().readme())

if __name__ == "__main__":
    main()
1
Write datapackage.json (including both the package metadata and the resource metadata).
2
Create the README text.
3
Write the README file.

Now run the main.py file from the terminal:

Terminal
uv run main.py

Let’s check the contents of the datapackage.json file to see that the resource properties have been added:

datapackage.json
{
  "name": "diabetes-study",
  "id": "9b5e1943-6748-4acc-93f0-027dcf391809",
  "title": "A Study on Diabetes",
  "description": "# Data from a 2021 study on diabetes prevalence\n\nThis data package contains data from a study conducted in 2021 on the\n*prevalence* of diabetes in various populations. The data includes:\n\n- demographic information\n- health metrics\n- survey responses about lifestyle\n",
  "version": "0.1.0",
  "created": "2026-02-23T20:52:21+00:00",
  "contributors": [
    {
      "title": "Jamie Jones",
      "path": "example.com/jamie_jones",
      "email": "jamie_jones@example.com",
      "roles": [
        "creator"
      ]
    },
    {
      "title": "Zdena Ziri",
      "path": "example.com/zdena_ziri",
      "email": "zdena_ziri@example.com",
      "roles": [
        "creator"
      ]
    }
  ],
  "licenses": [
    {
      "name": "ODC-BY-1.0",
      "path": "https://opendatacommons.org/licenses/by",
      "title": "Open Data Commons Attribution License 1.0"
    }
  ],
  "resources": [
    {
      "name": "patients",
      "path": "resources/patients/data.parquet",
      "type": "table",
      "title": "Patients Data",
      "description": "This data resource contains data about patients in a diabetes study.",
      "format": "parquet",
      "mediatype": "application/parquet",
      "schema": {
        "fields": [
          {
            "name": "id",
            "type": "integer"
          },
          {
            "name": "age",
            "type": "integer"
          },
          {
            "name": "sex",
            "type": "string"
          },
          {
            "name": "height",
            "type": "number"
          },
          {
            "name": "weight",
            "type": "number"
          },
          {
            "name": "diabetes_type",
            "type": "string"
          }
        ]
      }
    }
  ]
}

All the resource metadata is now also saved in this file! If you need to update the resource properties later on, you can simply edit the scripts/resource_properties_patients.py file and then re-run the main.py file to update the datapackage.json file and the README.md file.

Storing a backup of the data as a batch file

Note

See the flow diagrams for a simplified flow of steps involved in adding batch files. Also see the design docs for why we include these batch files in the resource’s folder.

Each time you add new or modified data to a resource, this data is stored in a batch file. These batch files are used to create the final data file that is actually used as the resource, at the path resources/<name>/data.parquet. The first time a batch file is saved, the folders necessary for the resource are created.

As shown above, the data is currently loaded as a Polars DataFrame called raw_data_patients. Now, it’s time to store this data in the resource’s folder by using the write_resource_batch() function in the main.py file.

main.py
# This code is shortened to only show what was changed.
# Add this to the imports.
from scripts.resource_properties_patients import resource_properties_patients

def main():
    # Previous code is excluded to keep it short.
    # New code inserted at the bottom -----
    # Save the batch data.
    sp.write_resource_batch(
        data=raw_data_patients,
        resource_properties=resource_properties_patients
    )

This function uses the properties object to determine where to store the data as a batch file, which is in the batch/ folder of the resource’s folder. If this is the first time adding a batch file, all the folders will be set up. So the file structure should look like this now:

📁 diabetes-study/
├─📁 raw/
│ └─📄 patients.csv
├─📁 resources/
│ └─📁 patients/
│   └─📁 batch/
│     └─📄 2026-02-23T205222Z-acc2798b-9fec-42bb-b88e-bd68dba3a9db.parquet
├─📁 scripts/
│ ├─📄 package_properties.py
│ └─📄 resource_properties_patients.py
├─📄 .gitignore
├─📄 .python-version
├─📄 README.md
├─📄 datapackage.json
├─📄 main.py
└─📄 pyproject.toml

Building the resource data file

Now that you’ve stored the data as a batch file, you can build the resources/<name>/data.parquet file that will be used as the data resource. This Parquet file is built from all the data in the batch/ folder. Since there is currently only one batch file stored in the resource’s folder, only this one will be used to build the data resource’s Parquet file. To create this main Parquet file, you need to read in all the batch files, join them together, and, optionally, do any additional processing on the data before writing it to the file. The functions to use are read_resource_batches(), join_resource_batches(), and write_resource_data(). Several of these functions internally run checks via check_data().

Let’s add these functions to the main.py file to make the data.parquet file:

main.py
# This code is shortened to only show what was changed.
def main():
    # Previous code is excluded to keep it short.
    # New code inserted at the bottom -----
    # Read in all the batch data files for the resource as a list.
    batch_data = sp.read_resource_batches(
        resource_properties=resource_properties_patients
    )
    # Join them all together into a single Polars DataFrame.
    joined_data = sp.join_resource_batches(
        data_list=batch_data,
        resource_properties=resource_properties_patients
    )
    sp.write_resource_data(
        data=joined_data,
        resource_properties=resource_properties_patients
    )
Tip

If you add more data to the resource later on as more batch files, you can update this main data.parquet file to include the updated data in the batch folder using this same workflow.

Now the file structure should look like this:

📁 diabetes-study/
├─📁 raw/
│ └─📄 patients.csv
├─📁 resources/
│ └─📁 patients/
│   ├─📁 batch/
│   │ └─📄 2026-02-23T205222Z-acc2798b-9fec-42bb-b88e-bd68dba3a9db.parquet
│   └─📄 data.parquet
├─📁 scripts/
│ ├─📄 package_properties.py
│ └─📄 resource_properties_patients.py
├─📄 .gitignore
├─📄 .python-version
├─📄 README.md
├─📄 datapackage.json
├─📄 main.py
└─📄 pyproject.toml