Functions and classes
🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧
We created this document mainly as a way to help us as a team understand and agree on what we're making and what needs to be worked on. This means that the descriptions and explanations of these functions will likely change quite a bit and may even be deleted later when they are no longer needed.
Based on the naming scheme and the Frictionless Data Package standard, these are the external-facing functions in Sprout. See the Outputs section for an overview and explanation of the different outputs provided by Sprout.
Nearly all functions have a path argument. Depending on what the function does, the path object will be different; use the PackagePath.*() functions to get the correct path object for the specific function. It's designed this way to be more flexible about where individual packages and resources are stored and to make it a bit easier to write tests for the functions. For a similar reason, most of the functions output either a dict Python object, a custom Properties dataclass, or a path object to make them easier to test.
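As a rough sketch of how this is meant to work (the import name and the exact PackagePath methods are assumptions here, so check help(PackagePath) for what is actually available):

from pathlib import Path

import seedcase_sprout as sp

# Build path objects for a package stored in a "diabetes-study" folder.
# We assume PackagePath() takes the package's root folder and that
# .properties() points at the package's datapackage.json file.
package_path = sp.PackagePath(Path("diabetes-study"))
properties_path = package_path.properties()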
Several of the functions have an argument called properties. The properties argument is a set of key-value pairs (as a JSON-style dict object), built using the Properties dataclasses, that describes the package and the resource(s) within it. This metadata is stored in the datapackage.json file and follows the Frictionless Data specification.
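For example, a minimal, hypothetical sketch of building a properties object (the fields shown come from the Frictionless Data Package standard; see help(PackageProperties) for the full set):

import seedcase_sprout as sp

# Package-level metadata built with the PackageProperties dataclass; the
# values here are purely illustrative.
properties = sp.PackageProperties(
    name="diabetes-study",
    title="Diabetes study",
    description="Data collected during the 2024 diabetes study.",
)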
The write_*() functions will always overwrite the file and will always create any folders in the file's path that don't already exist.
Each function is marked with an icon showing whether or not it has been implemented yet.
For some reason, the diagrams below don’t display well on some browsers like Firefox. To see them, try using a different browser like Chrome or Edge.
Data package functions
as_readme_text(properties)
See the help documentation with help(as_readme_text) for more details.
write_package_properties(properties, path)
See the help documentation with help(write_package_properties) for more details.
Data resource functions
write_resource_properties(resource_properties, path)
See the help documentation with help(write_resource_properties) for more details.
write_resource_batch(data, resource_properties)
See the help documentation with help(write_resource_batch) for more details.
read_resource_batches(resource_properties, paths)
See the help documentation with help(read_resource_batches) for more details.
join_resource_batches(data_list, resource_properties)
See the help documentation with help(join_resource_batches) for more details.
write_resource_data(data, resource_properties)
See the help documentation with help(write_resource_data) for more details.
read_resource_data(resource_name, path)
This function takes the name of the resource you want to read and, optionally, the path to the datapackage.json file, and reads the data resource (stored as a Parquet file) into a Polars DataFrame. If the path is not given, it will look for the datapackage.json file in the current working directory.
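A hedged usage sketch (the resource name and keyword names are illustrative; the exact signature is in help(read_resource_data)):

from pathlib import Path

import seedcase_sprout as sp

# Read the resource named "blood-samples" into a Polars DataFrame, relying
# on datapackage.json being in the current working directory.
data = sp.read_resource_data("blood-samples")

# Or point explicitly at a package's datapackage.json file.
data = sp.read_resource_data(
    "blood-samples",
    path=sp.PackagePath(Path("diabetes-study")).properties(),
)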
update_resource_properties(current, updates)
Edit the properties of a resource in a package. The current and updates arguments must be ResourceProperties objects. Outputs a ResourceProperties object; use write_resource_properties() with PackagePath().properties() as the path to save the updated properties to the datapackage.json file. This function can also be used to delete all of a resource's properties.
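A sketch of an update-then-save workflow, assuming the package properties expose their resources as a list and that ResourceProperties accepts a title field (check help(ResourceProperties) for the actual fields):

import seedcase_sprout as sp

# Read the current properties, update one resource's title, and save the
# result back to datapackage.json.
package_properties = sp.read_properties(sp.PackagePath().properties())
current = package_properties.resources[0]
updates = sp.ResourceProperties(title="Blood samples (cleaned)")

updated = sp.update_resource_properties(current, updates)
sp.write_resource_properties(updated, sp.PackagePath().properties())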
extract_resource_properties(data)
See the help documentation with help(extract_resource_properties) for more details.
Path functions
See the help documentation with help(PackagePath) for more details.
Properties dataclasses
These dataclasses contain an explicit, structured set of official properties defined within a data package. The main purpose of these is to allow us to pass structured properties objects between functions. They also enable users to create valid properties objects more easily and get an overview of optional and required class fields.
PackageProperties
See the help documentation with help(PackageProperties()) for more details on the properties.
Properties functions
read_properties(path)
See the help documentation with help(read_properties) for more details. This function reads the datapackage.json file, checks that it is correct, and then outputs a PackageProperties object.
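For example, a minimal sketch (assuming PackagePath() defaults to the current working directory):

import seedcase_sprout as sp

# Read and check the package's metadata from datapackage.json.
properties = sp.read_properties(sp.PackagePath().properties())
print(properties.title)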
Check functions
All the check functions check the properties given in the datapackage.json file against the Frictionless Data Package standard. Even the checks on the data itself are based on the properties given in the datapackage.json file. The checks on the properties are mainly wrappers around the check_datapackage "sub-package" (that we will eventually split into its own package).
check_properties(properties)
See the help documentation with help(check_properties) for more details. This checks all the properties, both the package and the resource properties.
check_package_properties(properties)
See the help documentation with help(check_package_properties) for more details. This only checks the package properties.
check_resource_properties(resource_properties)
See the help documentation with help(check_resource_properties) for more details. This only checks the resource properties.
check_data(data, resource_properties)
See the help documentation with help(check_data) for more details. This function checks the data against the properties in the datapackage.json file. It includes checks on the data headers against the fields.name properties, the columns' data types against the fields.type properties, and the data itself against any constraints given in the fields.constraints properties. See the function flow for more details on the internal flow of this particular function.
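A hedged example of how this could look in practice (the column names and the way the resource properties are looked up are illustrative only):

import polars as pl

import seedcase_sprout as sp

# A small DataFrame to check against the first resource's properties.
data = pl.DataFrame({"person_id": ["1234"], "weight_kg": [70.5]})
resource_properties = sp.read_properties(
    sp.PackagePath().properties()
).resources[0]

# Flags mismatches between the data and the headers, types, and
# constraints defined in datapackage.json; see help(check_data) for the
# exact behaviour on failure.
sp.check_data(data, resource_properties)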
Observational unit functions
An observational unit is the level of detail on the entity (e.g. human, animal, event) that the data was collected on at a given point in time. An example would be a person in a research study who came to the clinic in May 2024 to have their blood collected and to fill out a survey.
delete_observational_unit(data_list, observational_unit)
Under the legal and privacy regulations of multiple countries, a person can request that any personally identifiable and sensitive data about them be deleted, in some variation of a "right to be forgotten" law. This function makes that process easier by searching for all places where their data is stored in the data package and deleting it. We cannot guarantee that the data is deleted from history (e.g. backups) or from projects that have already used the data for research purposes, but it will no longer exist in the current data package or in any subsequent uses of it. This has the potential to be highly destructive, but it doesn't yet write back to the files, so be cautious and check that everything is correct at this stage before writing back.
The data given must be a list of DataFrames, even if it is a list of only one DataFrame. The observational_unit is a dictionary that contains the information about the observational unit to delete. The dictionary must contain one or more primary keys that represent the observational unit in the data, as well as the value(s) to delete. For example:
# For one person with the ID "1234", even if there are multiple time points
# of data for that person.
observational_unit = {"person_id": ["1234"]}

# For multiple people with the IDs "1234" and "5678".
observational_unit = {"person_id": ["1234", "5678"]}

# For one person with the ID "1234" and dates of collection "2024-05-01"
# and "2024-05-02".
observational_unit = {
    "person_id": ["1234"],
    "date_of_collection": ["2024-05-01", "2024-05-02"],
}
The function will search for all instances of the keys and values in the data and delete them.
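A usage sketch with purely illustrative data (remember that this function does not write anything back to the files):

import polars as pl

import seedcase_sprout as sp

# Two DataFrames that both contain rows for the person with ID "1234".
data_list = [
    pl.DataFrame({"person_id": ["1234", "5678"], "weight_kg": [70.5, 65.2]}),
    pl.DataFrame({"person_id": ["1234"], "survey_score": [42]}),
]
observational_unit = {"person_id": ["1234"]}

# Each DataFrame in the output should only keep the rows that don't match
# the observational unit.
cleaned = sp.delete_observational_unit(data_list, observational_unit)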
Base functions
write_file(text, path)
See the help documentation with help(write_file) for more details.