Flows

Warning

🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧

Important

We created this document mainly as a way to help us as a team all understand and agree on what we’re making and what needs to be worked on. This means that the flows may change quite substantially until we’ve reached a stable full release at v1.0.0.

This document builds on the functions page, which lists and describes in detail the main functions and classes that make up the interface, including their inputs and outputs. Here we describe and show how these objects work together and flow into one another.

Each diagram uses specific shapes and lines to represent different things.

Caution

For some reason, the diagrams below don’t display well on some browsers like Firefox. To see them, try using a different browser like Chrome or Edge.

Creating or updating a package

This is the flow for making a new package. The write_properties() function will internally call check_properties(), but this can also be called separately to check the properties before writing them to the datapackage.json file. The PackageProperties input is imported from a Python script in the scripts/ folder of the data package. To update the properties in the datapackage.json file, you would edit the Python script with the properties directly and then rerun your main script to overwrite the datapackage.json file with the new properties.

[Diagram nodes: PackageProperties; PackagePath().properties(); check_properties(); write_properties()]

Figure 1: Diagram showing the flow of objects and functions to create a new package.
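
The check-then-write pattern in this flow can be sketched with only the standard library. This is a minimal illustration, not Sprout's implementation: `check_properties_sketch()`, `write_properties_sketch()`, and the list of required fields are hypothetical stand-ins, and a plain dict stands in for PackageProperties.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical required fields, chosen only for this illustration.
REQUIRED_FIELDS = ("name", "title", "description")


def check_properties_sketch(properties: dict) -> dict:
    """Raise if a required field is missing or empty, else return the input."""
    missing = [field for field in REQUIRED_FIELDS if not properties.get(field)]
    if missing:
        raise ValueError(f"Missing required properties: {', '.join(missing)}")
    return properties


def write_properties_sketch(properties: dict, path: Path) -> Path:
    """Check the properties, then overwrite the file at `path` with them."""
    check_properties_sketch(properties)
    path.write_text(json.dumps(properties, indent=2), encoding="utf-8")
    return path


# Stand-in for the properties defined in a script in the scripts/ folder.
properties = {
    "name": "example-package",
    "title": "Example package",
    "description": "A package used to illustrate the flow.",
}

with tempfile.TemporaryDirectory() as folder:
    written = write_properties_sketch(properties, Path(folder) / "datapackage.json")
    saved = json.loads(written.read_text(encoding="utf-8"))
```

Because the write step always runs the check first, editing the properties in the script and rerunning it either overwrites the file with valid properties or stops with an error before anything is written.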

Extracting resource properties from data

This is the flow for extracting resource properties from data. It is useful when the data is in a format that contains metadata about the data, such as a CSV file with a header row that contains the column names. The extract_resource_properties() function cannot extract all required properties from the data, so both TableSchemaProperties and ResourceProperties will need many of their fields filled in after extraction. The output of the extract_resource_properties() function can be used to generate a Python script that gives you a starting point for writing the resource properties. Afterwards, if you want to update the resource properties, you edit this Python script and then rerun your build process to generate the datapackage.json file with the updated properties.

[Diagram nodes: DataFrame (tidy); extract_resource_properties(); ResourceProperties (extracted); Python script (generated with extracted properties)]

Figure 2: Diagram showing the flow of objects and functions to extract resource properties from data.
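
To illustrate why extraction can only fill in part of the properties, here is a minimal sketch that infers field names and rough types from the data and leaves the rest blank. It does not use Sprout: `extract_fields_sketch()` is a hypothetical stand-in, a list of dicts stands in for the tidy DataFrame, and the type names are assumed to follow the Table Schema vocabulary.

```python
def extract_fields_sketch(rows: list[dict]) -> list[dict]:
    """Infer field names and a rough type from the first row of tidy data."""
    type_names = {bool: "boolean", int: "integer", float: "number", str: "string"}
    first_row = rows[0]
    return [
        {
            "name": column,
            "type": type_names.get(type(value), "any"),
            # Not recoverable from the data, so a person has to fill it in.
            "description": "",
        }
        for column, value in first_row.items()
    ]


rows = [
    {"id": 1, "height": 172.5, "city": "Aarhus"},
    {"id": 2, "height": 180.0, "city": "Odense"},
]
fields = extract_fields_sketch(rows)
```

The names and types come from the data, while the empty descriptions mark the fields that still need to be written by hand in the generated script.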

Updating README after changing package or resource properties

This is the flow for updating the README file after changing the package or resource properties. Since the README template text is generated from the properties in the datapackage.json file, any change to that file requires updating the README file. as_readme_text() and write_file() are separate steps so that you can test or programmatically modify the generated README text before writing it to the file.

[Diagram nodes: PackagePath().properties(); read_properties(); PackageProperties; as_readme_text(); PackagePath().readme(); write_file()]

Figure 3: Diagram showing the flow of objects and functions to update the README file after changing the package or resource properties.
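
The benefit of splitting text generation from file writing can be shown with a small standard-library sketch. The functions `as_readme_text_sketch()` and `write_file_sketch()` and the README layout are hypothetical stand-ins, not Sprout's functions or template, and a plain dict stands in for PackageProperties.

```python
import tempfile
from pathlib import Path


def as_readme_text_sketch(properties: dict) -> str:
    """Build README text from the properties without touching the disk."""
    lines = [f"# {properties['title']}", "", properties["description"], ""]
    for resource in properties.get("resources", []):
        lines.append(f"- {resource['name']}: {resource['description']}")
    return "\n".join(lines) + "\n"


def write_file_sketch(text: str, path: Path) -> Path:
    """Write the text to `path` and return the path."""
    path.write_text(text, encoding="utf-8")
    return path


properties = {
    "title": "Example package",
    "description": "Illustrates the README flow.",
    "resources": [{"name": "patients", "description": "One row per patient."}],
}

readme_text = as_readme_text_sketch(properties)
# Because the text is a plain string, it can be tested or modified here.
readme_text += "\nGenerated for illustration.\n"

with tempfile.TemporaryDirectory() as folder:
    readme_path = write_file_sketch(readme_text, Path(folder) / "README.md")
    written_text = readme_path.read_text(encoding="utf-8")
```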

Checking the properties of packages or resources

The flow to check the datapackage.json file’s properties is fairly simple. You check PackageProperties (with or without resources) using either the check_properties() function or the check_package_properties() function. You check a specific resource’s ResourceProperties with the check_resource_properties() function. All these functions are customised wrappers around a generic _check_properties() function, which uses arguments to decide which properties to check. The _check_properties() function is itself a wrapper around our check_datapackage “sub-package” (which we intend to split into its own package later).

[Diagram nodes: PackageProperties; check_properties(); check_package_properties(); ResourceProperties; check_resource_properties()]

Figure 4: Diagram showing the flow of objects and functions to check the properties of packages or resources.
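
The idea of several public wrappers around one generic checker can be sketched as follows. This is only an illustration of the wrapper pattern: the function names ending in `_sketch`, the two boolean arguments, the single "name is required" rule, and the choice to return a list of problems are all assumptions, not how Sprout or check_datapackage behave.

```python
def _check_sketch(
    properties: dict, check_package: bool, check_resources: bool
) -> list[str]:
    """Collect problems for the requested parts of the properties."""
    problems = []
    if check_package and not properties.get("name"):
        problems.append("package: 'name' is required")
    if check_resources:
        for index, resource in enumerate(properties.get("resources", [])):
            if not resource.get("name"):
                problems.append(f"resources[{index}]: 'name' is required")
    return problems


def check_properties_sketch(properties: dict) -> list[str]:
    """Check the package properties together with its resources."""
    return _check_sketch(properties, check_package=True, check_resources=True)


def check_package_properties_sketch(properties: dict) -> list[str]:
    """Check only the package-level properties."""
    return _check_sketch(properties, check_package=True, check_resources=False)


def check_resource_properties_sketch(resource: dict) -> list[str]:
    """Check a single resource by wrapping it in a minimal package."""
    return _check_sketch(
        {"resources": [resource]}, check_package=False, check_resources=True
    )


properties = {"name": "", "resources": [{"name": "patients"}, {}]}
all_problems = check_properties_sketch(properties)
package_problems = check_package_properties_sketch(properties)
resource_problems = check_resource_properties_sketch({})
```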

Saving new or modified data to batch

A data resource needs data, not just properties. You can add data to any data resource that has resource properties. Whenever data is added to a data resource, it is first saved in the batch/ folder to keep track of additions or changes. You add data when:

  • A new data resource is created and data is added to it.
  • Additional data is added to an existing data resource.
  • Existing data in the resource needs to be corrected, updated, or modified (e.g. fixing a data entry issue).

The data must be in a tidy format and must have already been loaded in as a Polars DataFrame.

[Diagram nodes: DataFrame (original data); ResourceProperties; write_resource_batch(); check_data()]

Figure 5: Diagram showing the flow of objects and functions to save new, added, or modified data to a batch.
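
A minimal sketch of the "check, then save as a new batch file" idea is shown below. It makes several assumptions that are not taken from Sprout: `write_batch_sketch()` is a hypothetical stand-in, a list of dicts stands in for the Polars DataFrame, the file format is CSV, and each batch file gets a timestamp plus a random suffix as its name.

```python
import csv
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def write_batch_sketch(rows: list[dict], columns: list[str], batch_folder: Path) -> Path:
    """Check the column names, then save the rows as a new file in batch/."""
    for row in rows:
        if list(row) != columns:
            raise ValueError(f"Expected columns {columns}, got {list(row)}")
    batch_folder.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: every call creates a new, uniquely named file.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = batch_folder / f"{timestamp}-{uuid.uuid4().hex}.csv"
    with path.open("w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return path


with tempfile.TemporaryDirectory() as folder:
    batch_folder = Path(folder) / "batch"
    columns = ["id", "city"]
    # A first addition, followed by a correction of the same row.
    first = write_batch_sketch([{"id": 1, "city": "Aarhus"}], columns, batch_folder)
    second = write_batch_sketch([{"id": 1, "city": "Odense"}], columns, batch_folder)
    batch_file_count = len(list(batch_folder.glob("*.csv")))
```

Since every addition or correction lands in its own file, earlier versions of the data are never overwritten.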

Checking data against the properties

The data must always match what is described in the properties. This means that the data must have the same column names, column types, and column constraints. The check_data() function will internally call several separate functions for these specific checks. Each of these functions outputs an error message describing what the problems are if the check fails. Otherwise, the input data frame is returned unchanged.

[Diagram nodes: DataFrame; ResourceProperties; check_data(); internal: _check_column_names(), _check_column_types(), _check_column_values_constraints(); DataFrame or error]

Figure 6: Diagram showing the flow of objects and functions to check data against the properties.
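
The three kinds of checks can be sketched as small functions that each either raise an error or return the data unchanged. This is a simplified illustration rather than Sprout's code: the `_sketch` functions are hypothetical, a list of dicts stands in for the DataFrame, a list of field dicts stands in for the properties, and `minimum` is the only constraint handled.

```python
# Assumed mapping from type names in the properties to Python types.
PYTHON_TYPES = {"integer": int, "number": (int, float), "string": str}


def _check_column_names_sketch(rows: list[dict], fields: list[dict]) -> list[dict]:
    expected = [field["name"] for field in fields]
    for row in rows:
        if list(row) != expected:
            raise ValueError(f"Expected columns {expected}, got {list(row)}")
    return rows


def _check_column_types_sketch(rows: list[dict], fields: list[dict]) -> list[dict]:
    for row in rows:
        for field in fields:
            value = row[field["name"]]
            if not isinstance(value, PYTHON_TYPES[field["type"]]):
                raise TypeError(f"{field['name']}={value!r} is not {field['type']}")
    return rows


def _check_column_constraints_sketch(rows: list[dict], fields: list[dict]) -> list[dict]:
    for row in rows:
        for field in fields:
            minimum = field.get("constraints", {}).get("minimum")
            if minimum is not None and row[field["name"]] < minimum:
                raise ValueError(f"{field['name']} is below the minimum {minimum}")
    return rows


def check_data_sketch(rows: list[dict], fields: list[dict]) -> list[dict]:
    """Run every check in turn and return the data unchanged if all pass."""
    for check in (
        _check_column_names_sketch,
        _check_column_types_sketch,
        _check_column_constraints_sketch,
    ):
        rows = check(rows, fields)
    return rows


fields = [
    {"name": "id", "type": "integer"},
    {"name": "age", "type": "number", "constraints": {"minimum": 0}},
]
rows = [{"id": 1, "age": 34.5}]
checked = check_data_sketch(rows, fields)
```

Returning the input on success lets the check sit in the middle of a pipeline, so a write step can call it and pass the result straight on.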

Creating or re-creating the resource data

The batch data files are used to keep track of changes to the data. The data that will be used is kept clean and ready for analysis, while original data is not deleted. This flow converts batch data into the final resource data file. The steps are split up so that, if needed or desired, you can make modifications to the data before it is written to the final resource data. While write_resource_data() will call check_data() internally, this can also be called separately to check the data before writing it to the final resource data.

[Diagram nodes: PackagePath().resource_batch_files(); read_resource_batches(); List[DataFrame]; join_resource_batches(); DataFrame; ResourceProperties; write_resource_data(); check_data()]

Figure 7: Diagram showing the flow of objects and functions to create or re-create the resource data.
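
One way to picture the join step is sketched below. The joining rule is an assumption made for illustration and is not taken from Sprout: batches are ordered oldest first, rows are identified by a hypothetical `id` column, and a later row replaces an earlier row with the same `id`. Lists of dicts stand in for the DataFrames.

```python
def join_batches_sketch(batches: list[list[dict]], key: str) -> list[dict]:
    """Combine batches into one table, keeping the latest row for each key."""
    latest = {}
    for batch in batches:  # assumed to be ordered oldest first
        for row in batch:
            latest[row[key]] = row
    return list(latest.values())


batches = [
    [{"id": 1, "city": "Aarhus"}, {"id": 2, "city": "Odense"}],
    [{"id": 2, "city": "Aalborg"}, {"id": 3, "city": "Esbjerg"}],
]

joined = join_batches_sketch(batches, key="id")
# Because joining and writing are separate steps, the joined data can still
# be modified here, for example sorted, before it is checked and written.
joined = sorted(joined, key=lambda row: row["id"])
```

The batch files themselves are left untouched by this step, so the final data can be re-created from them at any time.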

Modifying data types or table schema

Sometimes you may need to modify the data types or table schema of the data in a resource. Before using this flow, you first need to modify the resource properties. The update_resource_batches() function will update each batch DataFrame with the new data types or table schema. Then, you can use write_resource_batch() to save the updated data back to the batch files. The same file names should be used to overwrite old batch files.

[Diagram nodes: path_resource_batch_files(); read_resource_batches(); List[DataFrame]; ResourceProperties (updated); update_resource_batches(); List[DataFrame] (updated); write_resource_batch()]

Figure 8: Diagram showing the flow of objects and functions to modify data types or table schema.
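
As a rough sketch of what updating every batch to a new schema involves, the example below casts each column to the type named in the updated properties. `update_batches_sketch()` and the `CASTS` mapping are hypothetical stand-ins, and lists of dicts stand in for the batch DataFrames.

```python
# Assumed mapping from type names in the properties to Python casts.
CASTS = {"integer": int, "number": float, "string": str}


def update_batches_sketch(
    batches: list[list[dict]], fields: list[dict]
) -> list[list[dict]]:
    """Return new batches with every column cast to its updated type."""
    return [
        [
            {field["name"]: CASTS[field["type"]](row[field["name"]]) for field in fields}
            for row in batch
        ]
        for batch in batches
    ]


# Batches that were originally stored with every column as a string.
batches = [
    [{"id": "1", "weight": "70"}],
    [{"id": "2", "weight": "82.5"}],
]
updated_fields = [
    {"name": "id", "type": "integer"},
    {"name": "weight", "type": "number"},
]
updated = update_batches_sketch(batches, updated_fields)
```

The output has one updated batch per input batch, in the same order, which is what makes it possible to write each one back under its original file name.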

Deleting an observational unit

If you need to delete an observational unit from the data, you can use this flow. The delete_observational_unit() function will delete the observational unit from the data and output a list of DataFrames with the deletions. The write_resource_batch() function will save the updated data back to the batch files.

[Diagram nodes: path_resource_batch_files(); read_resource_batches(); List[DataFrame] (original); dict (observational unit); delete_observational_unit(); List[DataFrame] (deletions); write_resource_batch()]

Figure 9: Diagram showing the flow of objects and functions to delete an observational unit.
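
The deletion step can be sketched as a filter applied to every batch. This is an illustration under assumptions, not Sprout's implementation: `delete_unit_sketch()` is hypothetical, lists of dicts stand in for the DataFrames, and a row is treated as belonging to the observational unit when it matches every key and value in the dict.

```python
def delete_unit_sketch(batches: list[list[dict]], unit: dict) -> list[list[dict]]:
    """Return the batches without the rows that match every item in `unit`."""
    return [
        [
            row
            for row in batch
            # Keep the row if it differs from the unit in at least one key.
            if any(row.get(key) != value for key, value in unit.items())
        ]
        for batch in batches
    ]


batches = [
    [{"id": 1, "visit": 1}, {"id": 2, "visit": 1}],
    [{"id": 1, "visit": 2}],
]
after_deletion = delete_unit_sketch(batches, {"id": 1})
```

The unit has to be removed from every batch, not only the most recent one, because the final resource data is rebuilt from all of the batch files.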