Functions and classes
🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧
We created this document mainly as a way to help us as a team understand and agree on what we're making and what needs to be worked on. This means that the descriptions and explanations of these functions will likely change quite a bit and may even be deleted later when they are no longer needed.
Based on the naming scheme and the Frictionless Data Package standard, these are the external-facing functions in Sprout. See the Outputs section for an overview and explanation of the different outputs provided by Sprout.
Nearly all functions have a path argument. Depending on what the function does, the path object will be different; use the PackagePath.*() functions to get the correct path object for the specific function. It's designed this way to be more flexible about where individual packages and resources are stored and to make it a bit easier to write tests for the functions. For a similar reason, most of the functions output either a dict Python object, a custom Properties dataclass, or a path object to make them easier to test.
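As a rough sketch of how this is meant to work (the import name and the exact PackagePath methods are assumptions here, so check help(PackagePath) for what is actually available):

from pathlib import Path

import seedcase_sprout as sp

# Build path objects for a package stored in a "diabetes-study" folder.
# We assume PackagePath() takes the package's root folder and that
# .properties() points at the package's datapackage.json file.
package_path = sp.PackagePath(Path("diabetes-study"))
properties_path = package_path.properties()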
Several of the functions have an argument called properties. The properties argument is a set of key-value pairs (as a JSON-style dict object), built using the Properties dataclasses, that describes the package and the resource(s) within it. This metadata is stored in the datapackage.json file and follows the Frictionless Data specification.
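For example, a minimal, hypothetical sketch of building a properties object (the fields shown come from the Frictionless Data Package standard; see help(PackageProperties) for the full set):

import seedcase_sprout as sp

# Package-level metadata built with the PackageProperties dataclass; the
# values here are purely illustrative.
properties = sp.PackageProperties(
    name="diabetes-study",
    title="Diabetes study",
    description="Data collected during the 2024 diabetes study.",
)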
The write_*() functions will always overwrite the file and will always create any folders in the file's path that don't already exist.
Each function is marked with an icon showing whether or not it has been implemented yet.
For some reason, the diagrams below don’t display well on some browsers like Firefox. To see them, try using a different browser like Chrome or Edge.
Data package functions
as_readme_text(properties)
See the help documentation with help(as_readme_text) for more details.
write_package_properties(properties, path)
See the help documentation with help(write_package_properties) for more details.
Data resource functions
write_resource_properties(resource_properties, path)
See the help documentation with help(write_resource_properties) for more details.
write_resource_batch(data, resource_properties)
See the help documentation with help(write_resource_batch) for more details.
read_resource_batches(resource_properties, paths)
See the help documentation with help(read_resource_batches) for more details.
join_resource_batches(data_list, resource_properties)
See the help documentation with help(join_resource_batches) for more details.
write_resource_data(data, resource_properties)
See the help documentation with help(write_resource_data) for more details.
read_resource_data(resource_name, path)
This function takes the name of the resource you want to read and, optionally, the path to the datapackage.json file, and reads the data resource (stored as a Parquet file) into a Polars DataFrame. If the path is not given, it will look for the datapackage.json file in the current working directory.
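A hedged usage sketch (the resource name and keyword names are illustrative; the exact signature is in help(read_resource_data)):

from pathlib import Path

import seedcase_sprout as sp

# Read the resource named "blood-samples" into a Polars DataFrame, relying
# on datapackage.json being in the current working directory.
data = sp.read_resource_data("blood-samples")

# Or point explicitly at a package's datapackage.json file.
data = sp.read_resource_data(
    "blood-samples",
    path=sp.PackagePath(Path("diabetes-study")).properties(),
)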
update_resource_properties(current, updates)
Edit the properties of a resource in a package. The current and updates arguments must be ResourceProperties objects. Outputs a ResourceProperties object; use write_resource_properties() with PackagePath().properties() as the path to save the updated properties to the datapackage.json file. This function can also be used to delete all of a resource's properties.
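A sketch of an update-then-save workflow, assuming the package properties expose their resources as a list and that ResourceProperties accepts a title field (check help(ResourceProperties) for the actual fields):

import seedcase_sprout as sp

# Read the current properties, update one resource's title, and save the
# result back to datapackage.json.
package_properties = sp.read_properties(sp.PackagePath().properties())
current = package_properties.resources[0]
updates = sp.ResourceProperties(title="Blood samples (cleaned)")

updated = sp.update_resource_properties(current, updates)
sp.write_resource_properties(updated, sp.PackagePath().properties())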
extract_resource_properties(data)
See the help documentation with help(extract_resource_properties) for more details.
Path functions
See the help documentation with help(PackagePath) for more details.
Properties dataclasses
These dataclasses contain an explicit, structured set of official properties defined within a data package. The main purpose of these is to allow us to pass structured properties objects between functions. They also enable users to create valid properties objects more easily and get an overview of optional and required class fields.
PackageProperties
See the help documentation with help(PackageProperties()) for more details on the properties.
Properties functions
read_properties(path)
See the help documentation with help(read_properties) for more details. This function reads the datapackage.json file, checks that it is correct, and then outputs a PackageProperties object.
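For example, a minimal sketch (assuming PackagePath() defaults to the current working directory):

import seedcase_sprout as sp

# Read and check the package's metadata from datapackage.json.
properties = sp.read_properties(sp.PackagePath().properties())
print(properties.title)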
Check functions
All the check functions check the properties given in the datapackage.json file against the Frictionless Data Package standard. Even the checks on the data itself are based on the properties given in the datapackage.json file. The checks on the properties are mainly wrappers around the check_datapackage "sub-package" (that we will eventually split into its own package).
check_properties(properties)
See the help documentation with help(check_properties) for more details. This checks all the properties, both the package and the resource properties.
check_package_properties(properties)
See the help documentation with help(check_package_properties) for more details. This only checks the package properties.
check_resource_properties(resource_properties)
See the help documentation with help(check_resource_properties) for more details. This only checks the resource properties.
check_data(data, resource_properties)
See the help documentation with help(check_data) for more details. This function checks the data against the properties in the datapackage.json file. It includes checks on the data headers against the fields.name properties, the columns' data types against the fields.type properties, and the data itself against any constraints given in the fields.constraints properties. See the function flow for more details on the internal flow of this particular function.
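A hedged example of how this could look in practice (the column names and the way the resource properties are looked up are illustrative only):

import polars as pl

import seedcase_sprout as sp

# A small DataFrame to check against the first resource's properties.
data = pl.DataFrame({"person_id": ["1234"], "weight_kg": [70.5]})
resource_properties = sp.read_properties(
    sp.PackagePath().properties()
).resources[0]

# Flags mismatches between the data and the headers, types, and
# constraints defined in datapackage.json; see help(check_data) for the
# exact behaviour on failure.
sp.check_data(data, resource_properties)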
Observational unit functions
An observational unit is the level of detail on the entity (e.g. human, animal, event) that the data was collected on at a given point in time. An example would be a person in a research study who came to the clinic in May 2024 to have their blood collected and to fill out a survey.
delete_observational_unit(data_list, observational_unit)
Under the legal and privacy regulations of multiple countries, a person can request that any personally identifiable and sensitive data about them be deleted, in some variation of a "right to be forgotten" law. This function makes that process easier by searching for all places where their data is stored in the data package and deleting it. We cannot guarantee that the data is deleted from history (e.g. backups) or from projects that have already used the data for research purposes, but it will no longer exist in the current data package or in any subsequent uses of it. This has the potential to be highly destructive, but it doesn't yet write back to the files, so be cautious and check that everything is correct at this stage before writing back.
The data given must be a list of DataFrames, even if it is a list of only one DataFrame. The observational_unit is a dictionary that contains the information about the observational unit to delete. The dictionary must contain one or more primary keys that represent the observational unit in the data, as well as the value(s) to delete. For example:
# For one person with the ID "1234", even if there are multiple time points
# of data for that person.
observational_unit = {"person_id": ["1234"]}

# For multiple people with the IDs "1234" and "5678".
observational_unit = {"person_id": ["1234", "5678"]}

# For one person with the ID "1234" and dates of collection "2024-05-01"
# and "2024-05-02".
observational_unit = {
    "person_id": ["1234"],
    "date_of_collection": ["2024-05-01", "2024-05-02"],
}
The function will search for all instances of the keys and values in the data and delete them.
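A usage sketch with purely illustrative data (remember that this function does not write anything back to the files):

import polars as pl

import seedcase_sprout as sp

# Two DataFrames that both contain rows for the person with ID "1234".
data_list = [
    pl.DataFrame({"person_id": ["1234", "5678"], "weight_kg": [70.5, 65.2]}),
    pl.DataFrame({"person_id": ["1234"], "survey_score": [42]}),
]
observational_unit = {"person_id": ["1234"]}

# Each DataFrame in the output should only keep the rows that don't match
# the observational unit.
cleaned = sp.delete_observational_unit(data_list, observational_unit)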
Base functions
write_file(text, path)
See the help documentation with help(write_file) for more details.