join_resource_batches

join_resource_batches(
    data_list: list[pl.DataFrame],
    resource_properties: ResourceProperties,
)

Join all the batch resource DataFrames into a single (Polars) DataFrame.

This function takes a list of DataFrames, joins them together, and drops any duplicate observational units based on the primary key from resource_properties. After the join, it checks the combined data against resource_properties.

The observational unit is identified by the primary key of the resource. For example, if a person is part of a research study and has multiple observations, the person’s ID together with the date of collection would identify the observational unit.

If there are any duplicate observational units in the data, only the most recent one is kept, based on the timestamp of the batch file. This way, if an error in an older batch file has been corrected in a later file, the error remains in the older batch file but is not included in the data.parquet file.
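The "latest batch wins" rule can be sketched in plain Python. This is only an illustration of the behaviour described above, not the function's actual implementation: the real function works on Polars DataFrames, and the helper `join_keep_latest`, the column names, and the values below are all hypothetical.

```python
from datetime import date

# Hypothetical batches, listed oldest first. Each row is a plain dict here;
# the real function works on Polars DataFrames.
older_batch = [
    {"person_id": 1, "collected": date(2025, 1, 10), "value": 5.0},
    {"person_id": 2, "collected": date(2025, 1, 10), "value": 7.5},
]
newer_batch = [
    # Same observational unit as the second row above, with a corrected value.
    {"person_id": 2, "collected": date(2025, 1, 10), "value": 7.0},
    {"person_id": 2, "collected": date(2025, 2, 14), "value": 8.1},
]


def join_keep_latest(batches, primary_key):
    """Stack batches and keep the last row seen for each primary key value."""
    latest = {}
    for batch in batches:  # iterate from oldest to newest
        for row in batch:
            unit = tuple(row[column] for column in primary_key)
            latest[unit] = row  # a row from a newer batch replaces the older one
    return list(latest.values())


joined = join_keep_latest([older_batch, newer_batch], ["person_id", "collected"])
for row in joined:
    print(row)
```

Three rows remain: the uncorrected value 7.5 still exists in `older_batch`, but only the corrected 7.0 appears in the joined result.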

Parameters

data_list : list[pl.DataFrame]: A list of Polars DataFrames for all the batch files. Use read_resource_batches() to get a list of DataFrames that have been checked against the properties individually.
resource_properties : ResourceProperties: The ResourceProperties object that contains the properties of the resource to check the data against.

Returns

pl.DataFrame: A single DataFrame containing all the batch data, with duplicate observational units removed.

Raises

ValueError: If an empty data_list is provided.
polars.exceptions.ShapeError: If the DataFrames in data_list have different shapes, such as mismatched column names or a different number of columns.
polars.exceptions.SchemaError: If the DataFrames in data_list have different schemas, e.g., their column data types don’t match.

Examples

import seedcase_sprout as sp

with sp.ExamplePackage():
    resource_properties = sp.example_resource_properties()
    sp.write_resource_batch(sp.example_data(), resource_properties)
    batches = sp.read_resource_batches(resource_properties=resource_properties)

    sp.join_resource_batches(batches, resource_properties)

PosixPath('/tmp/tmpca4pycr3/example-package/resources/example-resource/batch/2025-07-07T211607Z-c76bf58f-970e-4ad5-b33c-a751c9df42a2.parquet')

shape: (3, 3)

id	name	value
i64	str	f64
34	"Helly R"	123.123
99	"Mark S"	9988.0
100	"Ms Casey"	-76.0009