Design
🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧
The purpose of these documents is to describe the design of Sprout in enough detail to help us develop it in a way that is sustainable (i.e., maintainable over the long term) and that ensures we as a team have a shared understanding of what Sprout is and is not.
Sprout’s design builds off of our overall Seedcase design principles and patterns, in that any design we create for Sprout should strongly follow the same design principles and patterns described there. Unless explicitly stated otherwise, Sprout’s design adds to but does not replace the overall design.
There are two main sections of this design documentation:
- Architecture: The architecture section describes the high-level requirements, components, use cases, expected users, and how the system will be organized and structured.
- Interface: The interface section gives a much more detailed description of the design of the software interface. The detail is at the level of exact Python functions, as well as their arguments, inputs, and outputs.
Purpose
The overall aim of Sprout is to take data and metadata and convert them into a standardized and organized storage structure that follows best practices for data engineering, particularly with a focus on research contexts. Specifically, Sprout aims to:
- Take generated data from various source locations (such as clinics or laboratories), which may be distributed geographically or organizationally, and store it in a standardized and efficient format.
- Ensure that metadata is included for the data and organized in a standardized format, explicitly and programmatically linking the metadata to the data.
Aligning with our modular design pattern to build software that has a focused and narrow scope as well as that can be easily extended or customized, Sprout core (this package) has an additional aim to:
- Build a flexible, extensible, and fine-grained set of functionalities that implement Sprout’s overall design, enabling users to configure and customize Sprout to their own needs. Specifically, to build functionality that supports us in building more opinionated interfaces and extensions of Sprout.
Requirements
Overall, Sprout must:
- Run on Windows, MacOS, and Linux: Our potential users work on any of these systems (including servers), so we need to ensure compatibility across the most commonly used operating systems.
- Integrate GDPR, privacy, and security compliance: Many of our users work with health or personally sensitive data.
- Run remotely on servers and locally on computers: The location where data are stored should be flexible based on the needs and restrictions of the user.
- Be able to handle a variety of data file sizes: While the size of research data usually does not compare to that found in industry, it can still become large enough that it requires special care and handling.
- Store data in a format that is open source, integrates with many tools, and is storage efficient: Sprout is first and foremost a data engineering tool for research data storage and distribution (or at least, easier sharing).
- Store, organize, and manage multiple distinct data sources per user or group of users (for example in a server setting): Researchers rarely collect and work on one data source at any given time. So, Sprout must be able to handle multiple distinct data sources.
- Upload and update data: Data can be added to Sprout in batches or on a more frequent basis. We anticipate that batch uploads will be the most common.
- Store, organize, and manage metadata connected to the data: Metadata are vital to understanding the data and its context, without which data can be near useless. Sprout needs to make managing and organizing metadata fundamental to its functionality.
- Track changes to the data in a changelog and versioning system: Data are not static and can change over time. Sprout must track these changes and provide a way to show, track, and manage versions of the data. This is also necessary for legal compliance, auditing and record-keeping.
In general, Sprout will not:
- Run any analytic computing or data science work: While some data processing and analysis will occur, it will be limited to running checks on the quality of the data and metadata.
Aside from the overall requirements for Sprout, the requirements for this core module of Sprout are:
- Individual pieces of functionality should be independent: To keep it flexible and extensible, each functionality should (ideally) be able to run on its own without needing to depend on other functionality.
- Assume as little as possible about the environment: For each functionality, ideally assume as little about the current state of the environment as possible. For instance, what the exact path is on the filesystem or where in the filesystem the software is running.
- Make as few assumptions and expectations as is reasonable: This package should not be designed to be (too) opinionated about the order steps are taken, what steps are taken, where they are taken, and other specific details.