Outputs

Sprout is designed to structure, organize, and store data in a standardized, coherent, and consistent way. At the base of how data is structured and stored are files and folders. This document describes the file and folder structure created by Sprout.

We follow our naming scheme, as well as the Frictionless Data specification, to decide on the structure and name of the files and folders. All data stored in Sprout for a specific set of conditions is called a data package and each set of data within the data package is called a data resource (see Note below).

Note

When a user starts using Sprout to structure their data, that data will form a Data Package (from the Frictionless Data standard). A data package is called many things in other fields. For instance it could be a “cohort study data”, a “data resource”, a clinical trial dataset, a dataset, a study dataset, simply “data”, or any other combination of words that include the word data. To keep it consistent with the Frictionless Data standard, we name it a data package.

A data package consists of one or more files (called data resources) that contain data specific to a given set of conditions. For instance, a data package could be a research study that collects information from people older than 50 who live in Denmark and who have diabetes. In this example data package, there might be data collected on basic demographics like gender, ethnicity, or education, or blood sample data like blood glucose, cholesterol, or blood pressure. Here, the basic demographics could be one data resource, while the blood sample data could be another data resource within the data package.

Single data package in working directory

As per the overall design patterns, the default behaviour of Sprout is to create a data package in the working directory or in a new folder for the data package. Below, the package folder is taken to be the working directory, represented as ..

Folders

The folders that Sprout creates and uses are:

./resources/<name>/: Whenever a specific type of data, like demographics data or blood sample data, is stored in the data package, it will be stored as a data resource. In general, a data resource encompasses data collected in a specific way and for a specific set of measurements from a specific set of input sources for a specific purpose. The data resource(s) will always be within one specific package, with each data resource in its own folder having a unique resource name <name>, e.g. ./resources/blood-samples/.

Note

Using the demographics and blood sample example above, the reason why these sources of data would be considered two separate data resources, rather than a single, joint data resource, is because, while they both collect data on the same group of people, they collect the data in different ways. Demographic data is usually collected with a survey by a researcher or technician (or the participants themselves if online) in a relatively consistent way using some survey software. Meanwhile, blood sample data is usually taken at a specific time and analyzed by a specific machine by a specific laboratory in a specific way that is output in a specific format. The two types of data are collected in different ways and should be recognized as separate entities and thus separate data resources, even though they contain data about the same group of people.
./resources/<name>/batch/: Data for a specific data resource is often stored and structured in batches. These are bulk collections of data that are processed together. This batch/ folder is where the data is stored before it is processed and saved in the data.parquet file. This folder makes it possible to keep track of data additions and updates, serving as a version history for the data resource. This can help with troubleshooting issues that may arise during data entry or processing.

Files

Sprout creates and uses these files:

./datapackage.json: This machine-readable JSON file is the foundation of a data package and contains the full Frictionless Data specification. This file contains metadata on the package (like title, description, and contributors) as well as metadata on the data resources (like column names, data types, descriptions, and a path to the Parquet file with the data in it).
./README.md: The README.md file is a generated, more human readable summary of the information in the datapackage.json file. This file is used to a) adhere to interoperable principles and b) serve as the basis for displaying human-readable information about the data package.
./resources/<name>/batch/<timestamp>-<uuid>.parquet: These are Parquet files that store the data in a structured, columnar format. These files are created when data is added or updated in a specific data resource and are stored in the batch/ folder before being processed and stored in the data.parquet file. The files in the batch/ folder are not linked within the datapackage.json file. Every time a user adds or updates data in the data resource, the data file is checked for correctness against the datapackage.json properties, before being stored in the batch/ folder. All files within this folder are used to (re-)generate the data resource’s data.parquet file.
./resources/<name>/data.parquet: When a user creates a data resource and adds or updates data in the resource, all resource data is processed and stored in the Parquet data format (if it is tabular data).