Outputs
🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧
Sprout is designed to structure, organize, and store data in a standardized, coherent, and consistent way. At the base of how data is structured and stored are files and folders. This document describes the file and folder structure created by Sprout.
We follow our naming scheme, as well as the Frictionless Data specification, to decide on the structure and name of the files and folders. All data stored in Sprout for a specific set of conditions is called a data package and each set of data within the data package is called a data resource (see Note below).
When a user starts using Sprout to structure their data, that data will form a Data Package (from the Frictionless Data standard). A data package is called many things in other fields. For instance it could be a “cohort study data”, a “data resource”, a clinical trial dataset, a dataset, a study dataset, simply “data”, or any other combination of words that include the word data. To keep it consistent with the Frictionless Data standard, we name it a data package.
A data package consists of one or more files (called data resources) that contain data specific to a given set of conditions. For instance, a data package could be a research study that collects information from people older than 50 who live in Denmark and who have diabetes. In this example data package, there might be data collected on basic demographics like gender, ethnicity, or education, or blood sample data like blood glucose, cholesterol, or blood pressure. Here, the basic demographics could be one data resource, while the blood sample data could be another data resource within the data package.
Single data package in working directory
As per the overall design patterns, the default behaviour of Sprout is to create a data package in the working directory or in a new folder for the data package. Below, the package folder is taken to be the working directory, represented as .
.
Folders
The folders that Sprout creates and uses are:
./resources/<name>/
-
Whenever a specific type of data, like demographics data or blood sample data, is stored in the data package, it will be stored as a data resource. In general, a data resource encompasses data collected in a specific way and for a specific set of measurements from a specific set of input sources for a specific purpose. The data resource(s) will always be within one specific package, with each data resource in its own folder having a unique resource name
<name>
, e.g../resources/blood-samples/
.NoteUsing the demographics and blood sample example above, the reason why these sources of data would be considered two separate data resources, rather than a single, joint data resource, is because, while they both collect data on the same group of people, they collect the data in different ways. Demographic data is usually collected with a survey by a researcher or technician (or the participants themselves if online) in a relatively consistent way using some survey software. Meanwhile, blood sample data is usually taken at a specific time and analyzed by a specific machine by a specific laboratory in a specific way that is output in a specific format. The two types of data are collected in different ways and should be recognized as separate entities and thus separate data resources, even though they contain data about the same group of people.
./resources/<name>/batch/
-
Data for a specific data resource is often stored and structured in batches. These are bulk collections of data that are processed together. This
batch/
folder is where the data is stored before it is processed and saved in thedata.parquet
file. This folder makes it possible to keep track of data additions and updates, serving as a version history for the data resource. This can help with troubleshooting issues that may arise during data entry or processing.
Files
Sprout creates and uses these files:
./datapackage.json
-
This machine-readable JSON file is the foundation of a data package and contains the full Frictionless Data specification. This file contains metadata on the package (like title, description, and contributors) as well as metadata on the data resources (like column names, data types, descriptions, and a path to the Parquet file with the data in it).
./README.md
-
The
README.md
file is a generated, more human readable summary of the information in thedatapackage.json
file. This file is used to a) adhere to interoperable principles and b) serve as the basis for displaying human-readable information about the data package. ./resources/<name>/batch/<timestamp>-<uuid>.parquet
-
These are Parquet files that store the data in a structured, columnar format. These files are created when data is added or updated in a specific data resource and are stored in the
batch/
folder before being processed and stored in thedata.parquet
file. The files in thebatch/
folder are not linked within thedatapackage.json
file. Every time a user adds or updates data in the data resource, the data file is checked for correctness against thedatapackage.json
properties, before being stored in thebatch/
folder. All files within this folder are used to (re-)generate the data resource’sdata.parquet
file. ./resources/<name>/data.parquet
-
When a user creates a data resource and adds or updates data in the resource, all resource data is processed and stored in the Parquet data format (if it is tabular data).