Core Python functions

Warning

🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧

Important

We created this document, and especially the function diagrams, mainly as a way to help us as a team all understand and agree on what we’re making and what needs to be worked on. Which means that the descriptions and explanations of these functions, in particular the diagrams, will likely change quite a bit and may even be deleted later when they are no longer needed.

Based on the naming scheme and the Frictionless Data Package standard, these are the core external-facing functions in Sprout, which are stored in sprout/core/. See the Outputs section for an overview and explanation of the different outputs provided by Sprout.

There are some small differences between the naming scheme and the functions described here:

  1. Whenever view is used, it is always for one object (either a package or a resource). Whenever list is used, it is always to show basic details of all existing objects of a given type (packages or resources). Both view and list only ever show the information contained in the datapackage.json and never show actual data. We don’t provide the ability to view data directly since it is out of scope of Sprout and we want to minimise any security and privacy risks. The IT admin and data package admin users can always view the data directly (outside of Sprout) as they will have the appropriate IT and legal permissions.
  2. edit only ever edits the datapackage.json which contains the properties/metadata of the package or resource(s) but never edits the data itself. If there is a need for updates to the data itself, the user can update it by uploading a new file with the updates (as long as the primary keys like participant ID and time of collection are the same). We don’t provide access for users to directly edit the data because we want to limit security risks as well as to maintain privacy and legal compliance.
  3. add only ever adds more data to an existing data resource and does not add any new metadata to the datapackage.json. Since Sprout contains the full Frictionless Data specification, using edit is enough to make updates to the existing properties found in the datapackage.json file.

Nearly all functions have a path argument. Depending on what the function does, the path object will be different. Use the path_*() functions to get the correct path object for the specific function. It’s designed this way to make it more flexible to where individual packages and resources are stored and to make it a bit easier to write tests for the functions. For a similar reason, most of the functions output either a JSON file or path object to make them easier to test.

Several of the functions have an optional argument called properties. The properties argument is a list of key-value pairs (as a JSON object) that describe the package and resource(s) in the package. This metadata is stored in the datapackage.json file and follows the Frictionless Data specification.

Data package functions

This function lists some basic details contained in the datapackage.json files of all data packages found in the path location. This would be a list, showing very basic information that would be allowed by privacy and legal regulations (e.g. the package ID as well as the name, description, and contact persons for each package). Use path_packages() to provide the path to the place packages are stored by default. Outputs a JSON object by default.

This will show the information contained within the datapackage.json file of a package for only the fields relevant to the package itself, and only basic details of the resources within the package. Use path_properties() to provide the correct path location. Outputs a JSON object.

This is the first function to use to create a new data package. It assigns a package ID and then creates a package folder and all the necessary files for a package (excluding the resources), as described in the Outputs section. Creates the files and then outputs the file paths for the created files. Use path_packages() to provide the correct path location to create this structure.

A Plant UML schematic of the detailed code flow within the `create_package_structure()` function.
Figure 1: Diagram showing the internal function flow of the create_package_structure() function.

See the help documentation with help(edit_package_properties()) for more details.

A Plant UML schematic of the detailed code flow within the `edit_package_properties()` function.
Figure 2: Diagram showing the internal function flow of the edit_package_properties() function.

Completely delete a specific package and all it’s data resources. Because this action would be permanent, the confirm argument would default to false so that the user needs to explicitly provide true to the function argument as confirmation. This is done to prevent accidental deletion. Use path_package() function in the path to get the correct location. Outputs true if the deletion was successful.

Writes JSON object containing package properties (not including resource properties) back to the datapackage.json file. The path argument is the location of the datapackage.json file. Use the path_properties() function to provide this path to the correct location. Returns the same path object as given in the path argument.

A Plant UML schematic of the detailed code flow within the `write_package_properties()` function.
Figure 3: Diagram showing the internal function flow of the write_package_properties() function.

Data resource functions

This looks in one specific package’s datapackage.json file and lists a basic summary of all the data resources contained within the package. The list would be basic information like resource ID, name, and description. Use path_resources() to provide the correct path location. Outputs a JSON object.

Views the information on a package’s specific resource that is contained within the datapackage.json file. Use the path_resource() function to provide the correct location for the path and path_properties() for the properties_path argument. Outputs a JSON object.

This is the first function to use to set up the structure for a data resource. It creates the paths for a new data resource in a specific (existing) package by creating the folder setup described in the Outputs section. Use the path_resources() function to provide the correct path location. It creates two paths: the resources/<id>/ path and the resources/<id>/raw/ path. The output is a list of these two path objects.

A Plant UML schematic of the detailed code flow within the `create_resource_structure()` function.
Figure 4: Diagram showing the internal function flow of the create_resource_structure() function.

This function sets up and structures a new resource property by taking the fields given in the properties argument to fill them and prepare them to be added to the datapackage.json file. It must be given as a JSON object following the Data Package specification (use the ResourceProperties class to get an object that follows the Frictionless Data Package standard). The path argument provides the path to the resource id that the properties are for; use path_resource() to provide the correct path or use the output of create_resource_structure(). Outputs a JSON object; use write_resource_properties() to save the JSON object to the datapackage.json file.

A Plant UML schematic of the detailed code flow within the `create_resource_properties()` function.
Figure 5: Diagram showing the internal function flow of the create_resource_properties() function.

Copy the file from data_path over into the resource location given by path. This will compress the file and use a timestamped, unique file name to store it as a backup. See the explanation of this file in the Outputs section. Use path_resource_raw() to provide the correct path location. Copies and compresses the file, and outputs the path object of the created file.

A Plant UML schematic of the detailed code flow within the `write_resource_data_to_raw()` function.
Figure 6: Diagram showing the internal function flow of the write_resource_data_to_raw() function.

This function takes the files provided by raw_files and merges them into a data.parquet file provided by path. Use path_resource_data() to provide the correct path location for path and path_resource_raw_files() for the raw_files argument. Outputs the path object of the created file.

Edit the properties of a resource in a package. The properties argument must be a JSON object that follows the Frictionless Data specification. Use the path_properties() function to provide the correct path location. Outputs a JSON object only; use write_resource_properties() to save the JSON object to the datapackage.json file.

Use this to delete a raw file in the raw/ folder of a resource. This is useful if the file is no longer needed or if it is incorrect or had an issue during the time when a user first added the file to a resource. Use path_resource_raw_files() to select which file to delete from the list of files found. Will only delete one file. The confirm argument defaults to false so that the user needs to explicitly provide true to the function argument as confirmation. This is done to prevent accidental deletion. Outputs true if the deletion was successful.

Delete all data (Parquet and in the database) and raw data of a specific resource. Use path_resource_raw() to provide the correct path location. The confirm argument defaults to false so that the user needs to explicitly provide true to the function argument as confirmation. This is done to prevent accidental deletion. Outputs true if the deletion was successful. Use delete_resource_properties() to delete the the associated properties/metadata for the resource.

Deletes all properties for a resource within the datapackage.json file. Use path_properties() to provide the correct location for path. The confirm argument defaults to false so that the user needs to explicitly provide true to the function argument as confirmation. This is done to prevent accidental deletion. Outputs true if the deletion was successful.

Use to write the resource JSON object back to the datapackage.json file. This function validates whether the JSON object follows the Data Package resource specification. The properties argument must be given as a JSON object following the Data Package specification (use the ResourceProperties class to get an object that follows the Frictionless Data Package standard). The path argument is the location of the datapackage.json file. Use the path_properties() function to provide this path to the correct location. Returns the same path object as given in the path argument.

A Plant UML schematic of the detailed code flow within the `write_resource_properties()` function.
Figure 7: Diagram showing the internal function flow of the write_resource_properties() function.

Writes a data object to the path database location. Use path_resource_database() to provide the correct location for the path object. The properties argument, using view_resource(), ensures the correct table is created in the database. Outputs the same path object as well as the name of the newly created database table.

This takes the data found at the data_path location and creates a JSON object following the Frictionless Data standard. Internally it uses the frictionless Python package to guess and fill in some of the property fields required by the Frictionless Data resource specification. This function is often followed by edit_resource_properties() to fill in any remaining missing fields and then the verify_resource_properties() function. Usually, you use either this function or the create_resource_properties() function to create the resource properties for the specific data resource given. The path argument points to the resource id folder; use path_resource() to help give the correct path. Outputs a JSON object. Use write_resource_properties() to save the JSON object to the datapackage.json file.

A Plant UML schematic of the detailed code flow within the `extract_resource_properties()` function.
Figure 8: Diagram showing the internal function flow of the extract_resource_properties() function.

Path functions

  • Nearly all the below path_*() functions use path_sprout_global() internally to get the global path as well as package_id and/or resource_id to complete the correct path for the specific package/resource.
  • All of these functions output a path object.
  • The paths returned by all of these functions are paths that exist, so all include a verify_is_dir() or verify_is_file() check.
  • If the wrong package_id or resource_id is given, an error message will include a list of all the actual package_id at the path_sprout_global() location or all actual resource_id for a specific package.
A Plant UML schematic of the detailed code flow within a typical `path_*()` function.
Figure 9: Diagram showing the internal function flow of most of the path_*() functions.

Creates the absolute path to the package’s resources directory.

Example usage

print(path_resources(1))
[PosixPath('~/.sprout/packages/1/resources')]

Creates the absolute path to a specific resource directory within a package.

Example usage

print(path_resource(1, 2))
[PosixPath('~/.sprout/packages/1/resources/2')]

Creates the absolute path to a resource’s raw data directory within a package.

Example usage

print(path_resource_raw(4, 2))
[PosixPath('~/.sprout/packages/4/resources/2/raw')]

Creates a list of absolute paths to all the raw data files within a resource’s raw data directory within a package.

Example usage

print(path_resource_raw_files(4, 2))
[PosixPath('~/.sprout/packages/4/resources/2/raw/2024-05-01T12:00:00Z-***.csv.gz')]

Creates the absolute path to a resource’s Parquet data file in a package. Internally also has a verify_is_file() check.

Example usage

print(path_resource_data(4, 2))
[PosixPath('~/.sprout/packages/4/resources/2/data.parquet')]

Creates the absolute path to the package’s properties datapackage.json file. Internally also has a verify_is_file() check.

Example usage

print(path_properties(6))
[PosixPath('~/.sprout/packages/6/datapackage.json')]

Creates the absolute path to the package’s database.

Creates an absolute path to the directory that has or will have all the data packages.

Example usage

print(path_packages())
[PosixPath('~/.sprout/packages')]

Creates the absolute path to the specific package folder.

Example usage

print(path_package(10))
[PosixPath('~/.sprout/packages/10')]

If the SPROUT_GLOBAL environment variable isn’t provided, this function will return the default path to where data packages will be stored. The default locations are dependent on the operating system. This function also creates the necessary directory if it doesn’t exist.

A Plant UML schematic of the detailed code flow within the `path_sprout_global()` function.
Figure 10: Diagram showing the internal function flow of the path_sprout_global() function.

Example usage

print(path_sprout_global())
[PosixPath('~/.sprout')]

Properties dataclasses

These dataclasses contain an explicit, structured set of official properties defined within a data package. The main purpose of these is to allow us to pass structured properties objects between functions. They also enable users to create valid properties objects more easily and get an overview of optional and required class fields.

Creates a dataclass object with all the necessary properties for the top level metadata of a data package.

Example usage

print(PackageProperties())
PackageProperties(title=None, description=None, licenses=None, contributors=None, resources=None)
print(PackageProperties(title="Diabetes Cohort"))
PackageProperties(title="Diabetes Cohort", description=None, licenses=None, contributors=None, resources=None)
print(PackageProperties(licenses=[LicenseProperties(name="ODC-BY-1.0")]))
PackageProperties(title=None, description=None, licenses=[LicenseProperties(name="ODC-BY-1.0")], contributors=None, resources=None)

Creates a dataclass object with all the necessary properties for a resource, which would be given in the resources field of a PackageProperties object.

Example usage

print(ResourceProperties())
ResourceProperties(name=None, description=None, path=None, schema=None)
print(ResourceProperties(name="Blood Samples"))
ResourceProperties(name="Blood Samples", description=None, path=None, schema=None)

Creates a dataclass object with all the necessary properties for a contributor. This would be given in the contributors field of a PackageProperties object.

Example usage

print(ContributorProperties())
ContributorProperties(title=None, email=None, roles=None)

Creates a dataclass object with all the necessary properties for a license, so that it can be added to the licenses field of a PackageProperties object.

Example usage

print(LicenseProperties())
LicenseProperties(name=None, path=None, title=None)

Creates a dataclass object with all the necessary properties for a table schema, so that it can be added to the schema field of a ResourceProperties object.

Example usage

print(TableSchemaProperties())
TableSchemaProperties(fields=[], missingValues=[], primaryKey=[], foreignKeys=[])

Helper functions

List all files in a directory. Outputs a list of path objects.

Create a prettier, human readable version of a JSON object.

Observational unit level functions

Observational unit is the level of detail on the entity (e.g. human, animal, event) that the data was collected on. Example would be: A person in a research study who came to the clinic in May 2024 to have their blood collected and to fill out a survey.

TODO.