Creating and managing the properties and data for data resources
In each data package are data resources, which contain conceptually distinct sets of data. This page shows you how to use Sprout to create and manage data resources inside a data package. You will need to have created a data package already.
Sprout assumes you have control over your system’s files and folders, or at least your user’s home directory. This includes having access to a server through the Terminal where you can write to specific folders.
Data resources can only be created from tidy data in Polars DataFrames. Before you can store your data, you need to process it into a tidy format, ideally using Python so that you have a record of the steps taken to clean and transform the data. Either way, the tidy data needs to be read into a Polars DataFrame before you can use it in Sprout.
The steps you’ll take to get your data into the structure used by Sprout are:
- Create the properties for the resource, using the original data as a starting point, and then edit the resource properties as needed.
- Create the folders for the new resource within the package and save the resource properties in the datapackage.json file.
- Optionally store the raw data in a raw/ folder in the package, so that you can keep the original data separate from the tidied data.
- Tidy the data as needed and convert it into a Polars DataFrame (if you haven’t already) before adding it to the resource.
- Add your tidy data as a batch of data to the resource.
- Merge the data batches into a new resource data file.
- Re-build the data package’s README.md file from the updated datapackage.json file.
- If you need to update the properties at a later point, edit the resource properties script and run the main.py file to re-create the properties and write them to the datapackage.json file.
Many of these steps are taken care of by functions in Sprout, but some require writing your own code.
Making a data resource requires that you have data that can be made into a resource in the first place. Usually, generated or collected data starts out in a fairly “raw” shape that needs some work. This work needs to be done before adding the data to a data package, since Sprout assumes that the data is already tidy.
For this guide, you will use (fake) data that is already tidy. You can find the data here. We’ve placed this data in a raw/ folder in the data package and called it patients.csv, so that we can keep the original data separate from the processed data.
At this point, your data package diabetes-study has the following structure:

📁 diabetes-study/
├─📁 raw/
│ └─📄 patients.csv
├─📁 scripts/
│ └─📄 properties.py
├─📄 .gitignore
├─📄 .python-version
├─📄 README.md
├─📄 datapackage.json
├─📄 main.py
└─📄 pyproject.toml

The raw/patients.csv file contains data about patients with diabetes and looks like this:
id | age | sex | height | weight | diabetes_type |
---|---|---|---|---|---|
i64 | i64 | str | f64 | f64 | str |
1 | 54 | "F" | 167.5 | 70.3 | "Type 1" |
2 | 66 | "M" | 175.0 | 80.5 | "Type 2" |
3 | 64 | "F" | 165.8 | 94.2 | "Type 1" |
4 | 36 | "F" | 168.6 | 121.9 | "Type 2" |
5 | 47 | "F" | 176.4 | 77.5 | "Type 1" |
Extracting properties from the data
Before you can store resource data in your data package, you need to describe its properties. The resource’s properties are what allow other people to understand what your data is about and to use it more easily.
The resource’s properties also define what it means for data in the resource to be correct, as all data in the resource must match the properties. Sprout checks that the properties are correctly filled in and that no required fields are missing. It also checks that the data matches the properties, so that you can be sure the data actually contains what you expect it to contain.
Like with the package properties, you will create a properties script for the resource that allows you to easily edit the properties later on. You can do this by using the create_resource_properties_script() function. This function needs the resource_name and, optionally, the fields (i.e., columns or variables) that you want to include in the resource.

The resource_name is the name of the resource, which is used to identify the resource in the data package. It is required and should not be changed, since it’s used in the file name of the resource properties script. Since a data package can contain multiple resources, the resource’s name must also be unique.
To ease the process of adding fields to your resource properties, Sprout provides a function called extract_field_properties(), which allows you to extract information from the Polars DataFrame with your data. Use these extracted properties as a starting point and edit them as needed.

extract_field_properties() extracts the field properties from the Polars DataFrame’s schema and maps each Polars data type to a Data Package field type. The mapping is not perfect, so you may need to edit the extracted properties to ensure that they are as you want them to be.
Let’s add these steps to our main.py file. First, we need to load the original data from the raw/ folder into a Polars DataFrame; then we can extract the field properties from it and use these properties when creating our resource properties script:
main.py
import polars as pl

import seedcase_sprout as sp
from scripts.properties import properties


def main():
    # Helper for building paths to files in this data package.
    package_path = sp.PackagePath()

    # Create the properties script in the default location.
    sp.create_properties_script()

    # New code here -----
    # Load the "patients" raw, but tidy, data from the CSV file.
    raw_data_patients = pl.read_csv(package_path.root() / "raw" / "patients.csv")

    # Extract field properties from the raw, but tidy, data.
    field_properties = sp.extract_field_properties(
        data=raw_data_patients,
    )

    # Create the resource properties script. This is only created once; it will
    # not be overwritten if it already exists.
    sp.create_resource_properties_script(
        resource_name="patients",
        fields=field_properties,
    )
    # New code ends here -----

    # Save the properties to `datapackage.json`.
    sp.write_properties(properties=properties)

    # Create text for a README of the data package.
    readme_text = sp.as_readme_text(properties)

    # Write the README text to a `README.md` file.
    sp.write_file(readme_text, package_path.readme())


if __name__ == "__main__":
    main()
Then run the main.py file in your Terminal, which will create the resource properties script:
Terminal
uv run main.py
Your folders and files should now look like:
📁 diabetes-study/
├─📁 raw/
│ └─📄 patients.csv
├─📁 scripts/
│ ├─📄 properties.py
│ └─📄 resource_properties_patients.py
├─📄 .gitignore
├─📄 .python-version
├─📄 README.md
├─📄 datapackage.json
├─📄 main.py
└─📄 pyproject.toml
Writing the resource properties
Open the newly created scripts/resource_properties_patients.py file in order to make edits to it. See the full contents in the callout block below.

scripts/resource_properties_patients.py content
import seedcase_sprout as sp
resource_properties_patients = sp.ResourceProperties(
## Required:
name="patients",
title="",
description="",
## Optional:
type="table",
format="parquet",
mediatype="application/parquet",
schema=sp.TableSchemaProperties(
## Required
fields=[
sp.FieldProperties(
## Required
name="id",
type="integer",
## Optional
# title="",
# format="",
# description="",
# example="",
# categories=[],
# categories_ordered=False,
),
sp.FieldProperties(
## Required
name="age",
type="integer",
## Optional
# title="",
# format="",
# description="",
# example="",
# categories=[],
# categories_ordered=False,
),
sp.FieldProperties(
## Required
name="sex",
type="string",
## Optional
# title="",
# format="",
# description="",
# example="",
# categories=[],
# categories_ordered=False,
),
sp.FieldProperties(
## Required
name="height",
type="number",
## Optional
# title="",
# format="",
# description="",
# example="",
# categories=[],
# categories_ordered=False,
),
sp.FieldProperties(
## Required
name="weight",
type="number",
## Optional
# title="",
# format="",
# description="",
# example="",
# categories=[],
# categories_ordered=False,
),
sp.FieldProperties(
## Required
name="diabetes_type",
type="string",
## Optional
# title="",
# format="",
# description="",
# example="",
# categories=[],
# categories_ordered=False,
),
],
## Optional
# fields_match=["equal"],
# primary_key=[""],
# unique_keys=[[""]],
# foreign_keys=[
# sp.TableSchemaForeignKeyProperties(
# ## Required
# fields=[""],
# reference=sp.ReferenceProperties(
# ## Required
# resource="",
# fields=[""],
# ),
# ),
# ],
),
# sources=[
# sp.SourceProperties(
# ## Required:
# title="",
# ## Optional:
# path="",
# email="",
# version="",
# ),
# ],
)
In the resource properties script, the name property is set to patients. However, the two other required properties, title and description, are empty. You will need to fill these in yourself in the script, like so:
resource_properties_patients = sp.ResourceProperties(
    ## Required:
    name="patients",
    title="Patients Data",
    description="This data resource contains data about patients in a diabetes study.",
    ## Optional:
    # Rest of the properties that we don't show here, but that are above ...
)
If the title and description properties are not filled in, you’ll get a CheckError when you try to use write_properties() to save the resource’s properties to the datapackage.json file.

Below you can see what CheckErrors look like when you check the resource properties without filling in the required fields. When you use check_resource_properties() on the ResourceProperties object, it checks that everything is correctly filled in and that no required fields are missing. It’s used internally in write_properties() to ensure that the properties are always correct before writing them to the datapackage.json file.
sp.check_resource_properties(resource_properties_patients)
+-+---------------- 1 ----------------
| seedcase_sprout.check_datapackage.check_error.CheckError: Error at `$.description` caused by `blank`: The 'description' field is blank, please fill it in.
+---------------- 2 ----------------
| seedcase_sprout.check_datapackage.check_error.CheckError: Error at `$.title` caused by `blank`: The 'title' field is blank, please fill it in.
+------------------------------------
Now, the resource properties for patients include the following information (printed as a dictionary representation that omits empty fields):
sp.pprint(resource_properties_patients.compact_dict)
{'description': 'This data resource contains data about patients in a diabetes '
'study.',
'format': 'parquet',
'mediatype': 'application/parquet',
'name': 'patients',
'path': 'resources/patients/data.parquet',
'schema': {'fields': [{'name': 'id', 'type': 'integer'},
{'name': 'age', 'type': 'integer'},
{'name': 'sex', 'type': 'string'},
{'name': 'height', 'type': 'number'},
{'name': 'weight', 'type': 'number'},
{'name': 'diabetes_type', 'type': 'string'}]},
'title': 'Patients Data',
'type': 'table'}
To include these properties in your data package, you need to include the resource properties in the properties.py file of your data package. You can do this by adding the following lines to the properties.py file:
properties.py
# Import the resource properties object.
from .resource_properties_patients import resource_properties_patients
properties = sp.PackageProperties(
    # Your existing properties here...
    resources=[
        resource_properties_patients,
    ],
)
A data package can contain multiple resources, so their name property must be unique. This name property is what will be used later to create the folder structure for that resource.
The next step is to write the resource properties to the datapackage.json file. Since we included the resource_properties_patients object directly in the scripts/properties.py file, and since the scripts/properties.py file is already imported in main.py, we can simply re-run the main.py file and it will update both the datapackage.json file and the README.md file.
Terminal
uv run main.py
Let’s check the contents of the datapackage.json file to see that the resource properties have been added:
print(sp.read_properties(package_path.properties()))
PackageProperties(name='diabetes-study', id='5bec70c7-1465-44ff-ad58-a6db6b0ae311', title='A Study on Diabetes', description='\n# Data from a 2021 study on diabetes prevalence\n\nThis data package contains data from a study conducted in 2021 on the\n*prevalence* of diabetes in various populations. The data includes:\n\n- demographic information\n- health metrics\n- survey responses about lifestyle\n', homepage=None, version='0.1.0', created='2025-07-15T13:50:26+00:00', contributors=[ContributorProperties(title='Jamie Jones', path='example.com/jamie_jones', email='jamie_jones@example.com', given_name=None, family_name=None, organization=None, roles=['creator'])], keywords=None, image=None, licenses=[LicenseProperties(name='ODC-BY-1.0', path='https://opendatacommons.org/licenses/by', title='Open Data Commons Attribution License 1.0')], resources=[ResourceProperties(name='patients', path='resources/patients/data.parquet', type='table', title='Patients Data', description='This data resource contains data about patients in a diabetes study.', sources=None, licenses=None, format='parquet', mediatype='application/parquet', encoding=None, bytes=None, hash=None, schema=TableSchemaProperties(fields=[FieldProperties(name='id', title=None, type='integer', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None), FieldProperties(name='age', title=None, type='integer', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None), FieldProperties(name='sex', title=None, type='string', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None), FieldProperties(name='height', title=None, type='number', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None), FieldProperties(name='weight', title=None, type='number', format=None, description=None, example=None, constraints=None, categories=None, 
categories_ordered=None), FieldProperties(name='diabetes_type', title=None, type='string', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None)], fields_match=None, primary_key=None, unique_keys=None, foreign_keys=None))], sources=None)
If you need to update the resource properties later on, you can simply edit the scripts/resource_properties_patients.py file and then re-run the main.py file to update the datapackage.json file and the README.md file.
Storing a backup of the data as a batch file
See the flow diagrams for a simplified flow of steps involved in adding batch files. Also see the design docs for why we include these batch files in the resource’s folder.
Each time you add new or modified data to a resource, this data is stored in a batch file. These batch files will be used to create the final data file that is actually used as the resource, at the path resources/<name>/data.parquet. The first time a batch file is saved, the folders necessary for the resource will be created.
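The batch file names you’ll see below combine a UTC timestamp with a random UUID, which keeps every batch unique. Purely as an illustration (the actual naming is handled internally by write_resource_batch()), such a name could be generated like this:

```python
from datetime import datetime, timezone
from uuid import uuid4

# Illustrative sketch of a timestamp-plus-UUID batch file name.
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
batch_file_name = f"{timestamp}-{uuid4()}.parquet"
print(batch_file_name)
```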
As shown above, the data is currently loaded as a Polars DataFrame called raw_data_patients. Now it’s time to store this data in the resource’s folder by using the write_resource_batch() function in the main.py file.
main.py
# This code is shortened to only show what was changed.

# Add this to the imports.
from scripts.resource_properties_patients import resource_properties_patients


def main():
    # Previous code is excluded to keep it short.

    # New code inserted at the bottom -----
    # Save the batch data.
    sp.write_resource_batch(
        data=raw_data_patients,
        resource_properties=resource_properties_patients,
    )
This function uses the properties object to determine where to store the data as a batch file, which is in the batch/ folder of the resource’s folder. If this is the first time adding a batch file, all the folders will be set up. So the file structure should look like this now:
print(file_tree(package_path.root()))
📁 diabetes-study/
├─📁 raw/
│ └─📄 patients.csv
├─📁 resources/
│ └─📁 patients/
│ └─📁 batch/
│ └─📄 2025-07-15T135026Z-b657e595-2420-46e5-acd3-732a57484318.parquet
├─📁 scripts/
│ ├─📄 properties.py
│ └─📄 resource_properties_patients.py
├─📄 .gitignore
├─📄 .python-version
├─📄 README.md
├─📄 datapackage.json
├─📄 main.py
└─📄 pyproject.toml
Building the resource data file
Now that you’ve stored the data as a batch file, you can build the resources/<name>/data.parquet file that will be used as the data resource. This Parquet file is built from all the data in the batch/ folder. Since there is only one batch data file stored in the resource’s folder, only this one will be used to build the data resource’s Parquet file. To create this main Parquet file, you need to read in all the batch files, join them together, and, optionally, do any additional processing on the data before writing it to the file. The functions to use are read_resource_batches(), join_resource_batches(), and write_resource_data(). Several of these functions will internally run checks via check_data().
Let’s add these functions to the main.py file to make the data.parquet file:
main.py
# This code is shortened to only show what was changed.


def main():
    # Previous code is excluded to keep it short.

    # New code inserted at the bottom -----
    # Read in all the batch data files for the resource as a list.
    batch_data = sp.read_resource_batches(
        resource_properties=resource_properties_patients,
    )

    # Join them all together into a single Polars DataFrame.
    joined_data = sp.join_resource_batches(
        data_list=batch_data,
        resource_properties=resource_properties_patients,
    )

    # Write the joined data to the resource's data.parquet file.
    sp.write_resource_data(
        data=joined_data,
        resource_properties=resource_properties_patients,
    )
If you add more data to the resource later on as more batch files, you can update this main data.parquet file to include the updated data in the batch folder using this same workflow.
Now the file structure should look like this:
📁 diabetes-study/
├─📁 raw/
│ └─📄 patients.csv
├─📁 resources/
│ └─📁 patients/
│ ├─📁 batch/
│ │ └─📄 2025-07-15T135026Z-b657e595-2420-46e5-acd3-732a57484318.parquet
│ └─📄 data.parquet
├─📁 scripts/
│ ├─📄 properties.py
│ └─📄 resource_properties_patients.py
├─📄 .gitignore
├─📄 .python-version
├─📄 README.md
├─📄 datapackage.json
├─📄 main.py
└─📄 pyproject.toml