🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧
Each data package contains data resources, which are conceptually standalone sets of data. This page shows you how to create and manage data resources inside a data package using Sprout. You will need to have already created a data package.
Important
Sprout assumes you have control over your system's files and folders, or at least your user's home directory. This includes having access to a server through the Terminal where you can write to specific folders.
Important
Data resources can only be created from tidy data. Before you can store it, you need to process it into a tidy format, ideally using Python so that you have a record of the steps taken to clean and transform the data.
Putting your data into a data package makes it easier for yourself and others to use later on. The steps you'll take to get your data into the structure used by Sprout are:
Create the properties for the resource, using the original data as a starting point, and edit as needed.
Create the folders for the new resource within the package and save the resource properties in the datapackage.json file.
Add your data to the batches of data for the resource.
Merge the data batches into a new resource data file.
Re-build the data package's README.md file from the updated datapackage.json file.
If you need to update the properties at a later point, you can use update_resource_properties() and then write the result to the datapackage.json file.
Making a data resource requires that you have data that can be made into a resource in the first place. Usually, generated or collected data starts out in a bit of a "raw" shape that needs some work. This work needs to be done before adding the data to a data package, since Sprout assumes that the data is already tidy. For this guide, we use (fake) data that is already tidy and loaded into a Polars DataFrame:
data.head(5)
shape: (5, 6)
┌─────┬─────┬─────┬────────┬────────┬───────────────┐
│ id  ┆ age ┆ sex ┆ height ┆ weight ┆ diabetes_type │
│ --- ┆ --- ┆ --- ┆ ---    ┆ ---    ┆ ---           │
│ i64 ┆ i64 ┆ str ┆ f64    ┆ f64    ┆ str           │
╞═════╪═════╪═════╪════════╪════════╪═══════════════╡
│ 1   ┆ 54  ┆ F   ┆ 167.5  ┆ 70.3   ┆ Type 1        │
│ 2   ┆ 66  ┆ M   ┆ 175.0  ┆ 80.5   ┆ Type 2        │
│ 3   ┆ 64  ┆ F   ┆ 165.8  ┆ 94.2   ┆ Type 1        │
│ 4   ┆ 36  ┆ F   ┆ 168.6  ┆ 121.9  ┆ Type 2        │
│ 5   ┆ 47  ┆ F   ┆ 176.4  ┆ 77.5   ┆ Type 1        │
└─────┴─────┴─────┴────────┴────────┴───────────────┘
If you want to follow this guide with the same data, you can find it here.
Before we start, you need to import Sprout as well as other helper packages:
import seedcase_sprout as sp

# For pretty printing of output
from pprint import pprint
Since Sprout only works with Polars DataFrames, we will need to load in the data using Polars:
import polars as pl

original_data = pl.read_csv(raw_data_path)
print(original_data)
shape: (20, 6)
┌─────┬─────┬─────┬────────┬────────┬───────────────┐
│ id  ┆ age ┆ sex ┆ height ┆ weight ┆ diabetes_type │
│ --- ┆ --- ┆ --- ┆ ---    ┆ ---    ┆ ---           │
│ i64 ┆ i64 ┆ str ┆ f64    ┆ f64    ┆ str           │
╞═════╪═════╪═════╪════════╪════════╪═══════════════╡
│ 1   ┆ 54  ┆ F   ┆ 167.5  ┆ 70.3   ┆ Type 1        │
│ 2   ┆ 66  ┆ M   ┆ 175.0  ┆ 80.5   ┆ Type 2        │
│ 3   ┆ 64  ┆ F   ┆ 165.8  ┆ 94.2   ┆ Type 1        │
│ 4   ┆ 36  ┆ F   ┆ 168.6  ┆ 121.9  ┆ Type 2        │
│ 5   ┆ 47  ┆ F   ┆ 176.4  ┆ 77.5   ┆ Type 1        │
│ …   ┆ …   ┆ …   ┆ …      ┆ …      ┆ …             │
│ 16  ┆ 42  ┆ F   ┆ 181.1  ┆ 142.2  ┆ Type 2        │
│ 17  ┆ 68  ┆ F   ┆ 177.7  ┆ 135.5  ┆ Type 1        │
│ 18  ┆ 27  ┆ F   ┆ 169.8  ┆ 111.1  ┆ Type 2        │
│ 19  ┆ 39  ┆ M   ┆ 187.9  ┆ 106.7  ┆ Type 1        │
│ 20  ┆ 63  ┆ M   ┆ 178.6  ┆ 87.8   ┆ Type 2        │
└─────┴─────┴─────┴────────┴────────┴───────────────┘
Then we can continue with the next steps.
Extracting resource properties from the data
You'll start by creating the resource's properties. Before you can store data in your data package, you need to describe it using properties (i.e., metadata). The resource's properties are what allow other people to understand what your data is about and to use it more easily. These properties also define what it means for data in the resource to be correct, as all data in the resource must match the properties. While you can create a resource properties object manually using ResourceProperties, this can be quite labor-intensive and time-consuming if, for example, you have many columns in your data. To make this process easier, extract_resource_properties() allows you to create an initial resource properties object by extracting as much information as possible from the Polars DataFrame with your data. Afterwards, you can edit these properties as needed.
Now you are ready to use extract_resource_properties() to extract the resource properties from the data. This function tries to infer the data types from the data, but it might not get it right, so make sure to double check the output. It is not possible to infer information that is not included in the data, like a description of what the data contains or the unit of the data.
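As a minimal sketch of this step (the data= argument name is an assumption here, so double check the function signature in the Sprout reference):

# Extract an initial set of resource properties from the DataFrame.
# The `data=` argument name is an assumption for illustration.
resource_properties = sp.extract_resource_properties(data=data)
print(resource_properties)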
You may be able to see that some things are missing; for instance, the individual columns (called fields) don't have any descriptions. You will have to add these yourself. You can run a check on the properties to confirm what is missing:
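Running the check on the extracted properties looks like this:

print(sp.check_resource_properties(resource_properties))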
+ Exception Group Traceback (most recent call last):
| File "/home/runner/work/seedcase-sprout/seedcase-sprout/.venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3667, in run_code
| exec(code_obj, self.user_global_ns, self.user_ns)
| File "/tmp/ipykernel_2336/4200374576.py", line 1, in <module>
| print(sp.check_resource_properties(resource_properties))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/runner/work/seedcase-sprout/seedcase-sprout/src/seedcase_sprout/check_properties.py", line 110, in check_resource_properties
| raise error_info
| File "/home/runner/work/seedcase-sprout/seedcase-sprout/src/seedcase_sprout/check_properties.py", line 102, in check_resource_properties
| _generic_check_properties(
| File "/home/runner/work/seedcase-sprout/seedcase-sprout/src/seedcase_sprout/check_properties.py", line 166, in _generic_check_properties
| raise ExceptionGroup(
| ExceptionGroup: The following checks failed on the properties:
| PackageProperties(name=None, id=None, title=None, description=None, homepage=None, version=None, created=None, contributors=None, keywords=None, image=None, licenses=None, resources=[ResourceProperties(name=None, path=None, type='table', title=None, description=None, sources=None, licenses=None, format=None, mediatype=None, encoding=None, bytes=None, hash=None, schema=TableSchemaProperties(fields=[FieldProperties(name='id', title=None, type='integer', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='age', title=None, type='integer', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='sex', title=None, type='string', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='height', title=None, type='number', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='weight', title=None, type='number', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None), FieldProperties(name='diabetes_type', title=None, type='string', format=None, description=None, example=None, constraints=None, categories=None, categories_ordered=None, missing_values=None)], fields_match='equal', primary_key=None, unique_keys=None, foreign_keys=None, missing_values=None))], sources=None) (4 sub-exceptions)
+-+---------------- 1 ----------------
| seedcase_sprout.check_datapackage.check_error.CheckError: Error at `$.description` caused by `required`: 'description' is a required property
+---------------- 2 ----------------
| seedcase_sprout.check_datapackage.check_error.CheckError: Error at `$.name` caused by `required`: 'name' is a required property
+---------------- 3 ----------------
| seedcase_sprout.check_datapackage.check_error.CheckError: Error at `$.path` caused by `required`: 'path' is a required property
+---------------- 4 ----------------
| seedcase_sprout.check_datapackage.check_error.CheckError: Error at `$.title` caused by `required`: 'title' is a required property
+------------------------------------
Time to fill in the missing fields of the resource properties:
# TODO: Need to consider/design how editing can be done in an easier, user-friendly way.
# TODO: Add more detail when we know what can and can't be extracted.
resource_properties.name = "patient-data"
resource_properties.path = "resources/patient-data/data.parquet"
resource_properties.title = "Patient Data"
resource_properties.description = "This data resource contains data about..."
Creating a data resource
Now that you have the properties for the resource, you can create the resource within your existing data package. A data package can contain multiple resources, so each resource's name property must be unique. This name property is what will be used later to create the folder structure for that resource.
We assume that you've already created a data package (for example, by following the package guide) and stored the path to it as a PackagePath object in the package_path variable. In this guide, the root folder of the package is in a temporary folder:
print(package_path.root())
/tmp/tmpk41g67tx/diabetes-study
Let's take a look at the current files and folders in the data package:
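One way to do this is to list the contents of the package's root folder; the sketch below assumes package_path.root() returns a standard pathlib.Path:

# List the current contents of the package folder.
print(sorted(package_path.root().iterdir()))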
This shows that the data package already includes a datapackage.json file and a README.md file.
The next step is to write the resource properties to the datapackage.json file. Before they are added, they will be checked to confirm that they are correctly filled in and that no required fields are missing. You can use the PackagePath().properties() helper function to give you the location of the datapackage.json file based on the path to your package.
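A sketch of what this step could look like; the argument names and order passed to write_resource_properties() are assumptions here, so check the Sprout reference for the exact signature:

# Check and write the resource properties into datapackage.json.
# The argument order is an assumption for illustration.
sp.write_resource_properties(
    resource_properties,
    package_path.properties(),
)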
See the flow diagrams for a simplified flow of steps involved in adding batch files.
Batch files are used to store data in a data package each time you add data to a resource. These batch files will be used to create the data file that is actually used as the resource, at the path resources/<name>/data.parquet. The first time a batch file is saved, the folders necessary for the resource will be created.
As shown above, the data is currently loaded as a Polars DataFrame called data. Now, it's time to store this data in the resource's folder by using:
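The call below is a hypothetical sketch (the name write_resource_batch() and its arguments are assumptions, not the confirmed Sprout API), shown only to illustrate the shape of this step:

# Hypothetical sketch: save the DataFrame as a new batch file for this resource.
sp.write_resource_batch(
    data=data,
    resource_properties=resource_properties,
)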
This function uses the properties object to determine where to store the data as a batch file, which is in the batch/ folder of the resource's folder. If this is the first time adding a batch file, all the folders will be set up. You can check the newly added file by using:
print(package_path.resource_batch_files(1))
[]
Building the resource data file
Now that you've stored the data as a batch file, you can build the Parquet file that will be used as the data resource. This Parquet file is built from all the data in the batch/ folder. Since there is only one batch data file stored in the resource's folder, only this one will be used to build the data resource's Parquet file:
# TODO: eval when function implemented
sp.join_resource_batches(
    data_list=...,
    resource_properties=resource_properties,
)
Tip
If you add more data to the resource later on, you can update this Parquet file to include all data in the batch folder using the same function as shown above.
Re-building the README file
One of the last steps to adding a new data resource is to re-build the README.md file of the data package. To allow some flexibility with what gets added to the README text, this next function will only build the text, but not write it to the file. This allows you to add additional information to the README text before writing it to the file.
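As a rough sketch of this step, assuming a hypothetical as_readme_text() function and a package_properties object read from datapackage.json (neither is confirmed API, so check the Sprout reference):

# Hypothetical sketch: build the README text, add to it, then write it out yourself.
readme_text = sp.as_readme_text(package_properties)  # `package_properties` assumed loaded from datapackage.json
readme_text += "\n\nAny additional notes about the data package."
(package_path.root() / "README.md").write_text(readme_text)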
After having created a resource, you may need to make edits to the properties. While technically you can do this manually by opening up the datapackage.json file and editing it, we strongly recommend you use the update functions instead. These functions help ensure that the datapackage.json file remains in a correct JSON format and has the correct fields filled in. You can call the update_resource_properties() function with two arguments: the current resource properties and a resource properties object representing the updates you want to make. This latter object will very often be a partial resource properties object, in the sense that only the fields you want to update are filled in. The function returns an updated resource properties object; any field filled in on the updates object will overwrite the corresponding field in the current properties.
# TODO: eval when function implemented
resource_properties = sp.update_resource_properties(
    current_properties=resource_properties,
    update_properties=sp.ResourceProperties(
        title="Basic characteristics of patients"
    ),
)
pprint(resource_properties)
Finally, to write your changes back to datapackage.json, use the write_resource_properties() function:
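As a sketch, mirroring the earlier call (the argument order is again an assumption):

# Write the updated resource properties back into datapackage.json.
sp.write_resource_properties(
    resource_properties,
    package_path.properties(),
)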