Input data
🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧
The types of data we expect or anticipate to be input into Sprout are described in this section. We design Sprout with these types of data and formats in mind.
Domain-specific types of data
Currently, we only have experience with health data, so we have a bias towards that type of data.
Health research
Health research data tends to consist of these types of data:
- Clinical: This data is typically collected during patient visits to doctors. Depending on the country or administrative region, there will likely already be well-established data processing and storage pipelines in place.
- Register: This type of data is highly dependent on the country or region. Generally, this data is collected for national or regional administrative purposes, such as recording employment status, income, address, medication purchases, and diagnoses. Like for routine clinical data, the pipelines in place for processing and storing this data are usually very extensive and well established.
- Biological sample data: This type of data is generated from biological samples, like blood, saliva, semen, hair, or urine. Data generated from sample analytic techniques often produce large volumes of data per person. Samples may be generated in larger established laboratories or in smaller research groups, depending on what analytic technology is used and how new it is. The structure and format of the generated data also tend to be highly variable and depend heavily on the technology used, sometimes requiring specialized software to process and output.
- Survey or questionnaire: This type of data is often collected based on a given study’s aims and research questions. There are hundreds of different questionnaires that can have highly specific purposes and uses for their data. They are also highly variable in the volume of data collected based on the survey, and on the format of the data.
Flow or frequency of data collection
In research (and even in most industry settings), we rarely encounter truly real-time data collection. Most data collection is done in “batches”, with data being collected at irregular and inconsistent intervals and then stored to be processed later. This batch collection can be broken down into two categories based on its frequency:
- Routine or continuous collection, where data is collected on a more regular interval and in smaller batches of “observational units”1. Ingestion or processing of this type of data may happen on a more regular basis. Clinical data as well as survey or questionnaire data may likely fall under this category. For example, data collected on a few patients seen during the day at a clinic.
- Grouped collection, where data is collected from many observational units during a short period of time at very irregular intervals or potentially only once. Data ingesting or processing occurs some time after all the data has been collected. Biological sample data would fall under this category, since laboratories usually run several samples at once and input data after internal quality control checks and machine-specific data processing. While register-based and clinical data usually get collected continuously, direct access to them is only given on a batch and infrequent basis, so they may also fall under this category. Survey data may also come in batches, depending on the questionnaire and software used for its collection.
Regardless of the flow or frequency of data generation and collection, the ability to automatically ingest the data into Sprout will vary wildly based on the data source, the organization who generates the data, and their technical expertise. Some data sources may have well-established, but not always programmatic or automatic, workflows and processes. Others may not have any workflow and it may be an extremely manual process.
Footnotes
Observational unit is the “entity” that the data was collected from at a given point in time, such as a human participant in a cohort study or a rat in an animal study at a specific time point.↩︎