Flows

Warning

đźš§ Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes đźš§

Important

We created this document mainly as a way to help us as a team all understand and agree on what we’re making and what needs to be worked on. This means that the flows may change quite substantially until we’ve reached a stable full release at v1.0.0.

Based on the functions that list and describe the main functions and classes that make up the interface, this document describes and shows how all these objects work together and flow into one another, but not necessarily the functions’ exact input and output. Those are already described in the functions document.

Each diagram uses specific shapes and lines to represent different things:

Caution

For some reason, the diagrams below don’t display well on some browsers like Firefox. To see them, try using a different browser like Chrome or Edge.

Creating or updating a package

This is the flow for making a new package. The write_package_properties() will internally call check_properties(), but it can also be called separately to check the properties before writing them to the datapackage.json file. The PackageProperties are written in a Python script in folder that has the datapackage.json and you’d run that script to generate the datapackage.json. To update the properties in the datapackage.json file, you would edit the Python script directly and then run it again to overwrite the datapackage.json file with the new properties.

flowchart
    properties[(PackageProperties)]
    path_properties("PackagePath().properties()")
    check_properties("check_properties()")
    write_package_properties("write_package_properties()")

    properties -.-> check_properties
    properties --> write_package_properties
    path_properties --> write_package_properties
Figure 1: Diagram showing the flow of objects and functions to create a new package.

Extract resource properties from data

The flow for extracting resource properties from data. This is useful when the data is in a format that contains metadata about the data, such as a CSV file with a header row that contains the column names. The extract_resource_properties() function cannot extract all required properties from the data, so they must be updated by the user. So both TableSchemaProperties and ResourceProperties will need many of their fields filled in after using extract_resource_properties(). The output of the function extract_resource_properties() can be used to generate a Python script from a template to give you a starting point for writing the resource properties. Afterwards, if you want to update the resource properties, you’ll edit the Python script with the properties and then re-run your build process to generate the datapackage.json file with the updated properties.

flowchart
    data[("DataFrame<br>(Tidy)")]
    properties_extracted[("ResourceProperties<br>(extracted)")]
    extract_resource_properties("extract_resource_properties()")
    python_script("Python script<br>(generated with<br>extracted properties)")

    data --> extract_resource_properties
    extract_resource_properties --> properties_extracted
    properties_extracted --> python_script
Figure 2: Diagram showing the flow of objects and functions to extract resource properties from data.

Update README after changing package or resource properties

The flow for updating the README file after changing the package or resource properties. Since the README template text is generated from the properties in the datapackage.json file, any change to that file will require updating the README file. The split between as_readme_text() and write_file() is to allow for testing or programmatically modifying the generated README text before writing it to the file.

flowchart
    properties[(PackageProperties)]
    path_properties("PackagePath().properties()")
    read_properties("read_properties()")
    as_readme_text("as_readme_text()")
    write_file("write_file()")
    path_readme("PackagePath().readme()")

    path_properties --> read_properties
    read_properties --> properties
    properties --> as_readme_text
    as_readme_text --> write_file
    path_readme --> write_file
Figure 3: Diagram showing the flow of objects and functions to update the README file after changing the package or resource properties.

Create resource properties manually

If the user doesn’t have a raw data file yet, but the user knows what the properties will be, they can use this workflow to add the resource properties manually. The write_resource_properties() function will internally call check_resource_properties(), but it can also be called separately to check the properties before writing them to the datapackage.json file.

flowchart
    properties[(ResourceProperties)]
    path_properties("PackagePath().properties()")
    check_resource_properties("check_resource_properties()")
    write_resource_properties("write_resource_properties()")

    properties -.-> check_resource_properties
    properties --> write_resource_properties
    path_properties --> write_resource_properties
Figure 4: Diagram showing the flow of objects and functions to create resource properties manually.

Checking the properties of packages or resources

The flow to check the datapackage.json file’s properties is fairly simple. For the PackageProperties (that includes or does not include resources), you give it to either the check_properties() function or the check_package_properties() function. For the specific resource’s ResourceProperties, you give it to the check_resource_properties() function. In all cases, these functions are mostly customised wrappers using a generic _check_properties() function, which includes arguments to subset which properties to check. The _check_properties() function itself is also mostly a wrapper around our “sub-package” check_datapackage (which we intend to split into its own package later).

flowchart
    properties[(PackageProperties)]
    check_properties("check_properties()")
    check_package_properties("check_package_properties()")
    resource_properties[(ResourceProperties)]
    check_resource_properties("check_resource_properties()")

    properties --> check_properties
    properties --> check_package_properties
    resource_properties --> check_resource_properties
Figure 5: Diagram showing the flow of objects and functions to check the properties of packages or resources.

Save new, added, or modified data to batch

A data resource needs data, not just properties. You can add data to any data resource that has resource properties. Whenever data is added to a data resource, it gets first saved in the batch/ folder to keep track of additions or changes. You add the data when:

  • A new data resource is created and data is added to it.
  • Additional data is added to an existing data resource.
  • You need to fix, update, or modify existing data in the resource by correcting the data (e.g. fixing a data entry issue).

The data must be in a tidy format and must have already been loaded in as a Polars DataFrame.

flowchart
    data[("DataFrame<br>(original data)")]
    resource_properties[(ResourceProperties)]
    write_resource_batch("write_resource_batch()")
    check_data("check_data()")

    data --> write_resource_batch
    resource_properties --> write_resource_batch
    data -.-> check_data
    resource_properties -.-> check_data
Figure 6: Diagram showing the flow of objects and functions to save new, added, or modified data to a batch.

Check data against the properties

The data must always match what is described in the properties. This means that the data must have the same column names, column types, type of columns’ individual values (in rows), and constraints of columns’ values. The check_data() function will internally call several separate functions for these specific checks. Each of these functions outputs a string with an error message describing what the problems are if the check fails, otherwise it outputs a data frame with the same data as the input. The check_data() then gathers together all the messages and gives an error using those messages if there are any.

flowchart
    data[(DataFrame)]
    resource_properties[(ResourceProperties)]
    check_data("check_data()")
    subgraph Internal
        check_column_names("_check_column_names()")
        check_column_types("_check_column_types()")
        check_column_values_constraints("_check_column_values_constraints()")
        output[("DataFrame<br>or error")]
    end

    data --> check_data
    resource_properties --> check_data
    check_data -.-> Internal
    check_column_names -.-> output
    check_column_types -.-> output
    check_column_values_constraints -.-> output
    style Internal fill:#ffffff
Figure 7: Diagram showing the flow of objects and functions to check data against the properties.

Create or re-create the resource data

The batch data files are used to keep track of changes to the data without deleting the original data, while also keeping the data that will be used clean and ready for analysis. With this flow, the batch data is converted into the final resource data. These steps are split up so that, if needed or desired, you can make modifications to the data before it is written to the final resource data. While write_resource_data() will call check_data() internally, it can also be called separately to check the data before writing it to the final resource data.

flowchart
    path_resource_batch_files("PackagePath().resource_batch_files()")
    read_resource_batches("read_resource_batches()")
    list_data[("List[DataFrame]")]
    join_resource_batches("join_resource_batches()")
    joined_data[(DataFrame)]
    write_resource_data("write_resource_data()")
    resource_properties[(ResourceProperties)]
    check_data("check_data()")

    path_resource_batch_files --> read_resource_batches
    read_resource_batches --> list_data
    list_data --> join_resource_batches
    join_resource_batches --> joined_data
    joined_data --> write_resource_data
    resource_properties --> write_resource_data
    joined_data -.-> check_data
    resource_properties -.-> check_data
Figure 8: Diagram showing the flow of objects and functions to create or re-create the resource data.

Modifying data types or table schema

Sometimes you may need to modify the data types or table schema of the data in a resource. So you’d first modify the resource properties before using this flow. The update_resource_batches() function will update each batch DataFrame with the new data types or table schema, then you can use write_resource_batch() to save the updated data back to the batch files. The same file name should be used, so it will completely overwrite the old batch files.

flowchart
    path_resource_batch_files("path_resource_batch_files()")
    read_resource_batches("read_resource_batches()")
    properties_updated[("ResourceProperties<br>(updated)")]
    list_data[("List[DataFrame]")]
    list_data_updated[("List[DataFrame]<br>(updated)")]
    update_resource_batches("update_resource_batches()")
    write_resource_batch("write_resource_batch()")

    path_resource_batch_files --> read_resource_batches
    read_resource_batches --> list_data
    list_data --> update_resource_batches
    properties_updated --> update_resource_batches
    update_resource_batches --> list_data_updated
    list_data_updated --> write_resource_batch
    properties_updated --> write_resource_batch
Figure 9: Diagram showing the flow of objects and functions to modify data types or table schema.

Deleting an observational unit

If you need to delete an observational unit from the data, you can use this flow. The delete_observational_unit() function will delete the observational unit from the data about output a list of DataFrames with the deletions. The write_resource_batch() function will save the updated data back to the batch files.

flowchart
    path_resource_batch_files("path_resource_batch_files()")
    read_resource_batches("read_resource_batches()")
    list_data_current[("List[DataFrame]<br>(original)")]
    list_data_deletions[("List[DataFrame]<br>(deletions)")]
    obs_unit[("dict<br>(observational unit)")]
    delete_observational_unit("delete_observational_unit()")
    write_resource_batch("write_resource_batch()")

    path_resource_batch_files --> read_resource_batches
    read_resource_batches --> list_data_current
    list_data_current --> delete_observational_unit
    obs_unit --> delete_observational_unit
    delete_observational_unit --> list_data_deletions
    list_data_deletions --> write_resource_batch
    path_resource_batch_files --> write_resource_batch
Figure 10: Diagram showing the flow of objects and functions to delete an observational unit.