Input data

Warning

🚧 Sprout is still in active development and evolving quickly, so the documentation and functionality may not work as described and could undergo substantial changes 🚧

This section describes the types of input data Sprout is expected to handle. We design Sprout with these types of data and formats in mind.

Domain-specific types of data

Currently, we only have experience with health data, so we have a bias towards that type of data.

Health research

Health research data tends to fall into these types:

  • Clinical: This data is typically collected during patient visits to doctors. Depending on the country or administrative region, there will likely already be well-established data processing and storage pipelines in place.
  • Register: This type of data is highly dependent on the country or region. Generally, this data is collected for national or regional administrative purposes, such as recording employment status, income, address, medication purchases, and diagnoses. As with routine clinical data, the pipelines for processing and storing this data are usually extensive and well established.
  • Biological sample data: This type of data is generated from biological samples, like blood, saliva, semen, hair, or urine. Sample analytic techniques often produce large volumes of data per person. The data may be generated in larger, established laboratories or in smaller research groups, depending on what analytic technology is used and how new it is. The structure and format of the generated data also tend to be highly variable and depend heavily on the technology used, sometimes requiring specialized software to process and output the data.
  • Survey or questionnaire: This type of data is often collected based on a given study’s aims and research questions. There are hundreds of different questionnaires that can have highly specific purposes and uses for their data. The volume and format of the collected data also vary widely between surveys.

Flow or frequency of data collection

In research (and even in most industry settings), we rarely encounter truly real-time data collection. Most data collection is done in “batches”, with data being collected at irregular and inconsistent intervals and then stored to be processed later. This batch collection can be broken down into two categories based on its frequency:

  • Routine or continuous collection, where data is collected at more regular intervals and in smaller batches of “observational units” (see footnote 1). Ingestion or processing of this type of data may happen on a more regular basis. Clinical data, as well as survey or questionnaire data, tend to fall under this category. For example, a clinic may record data on the few patients seen during a day.
  • Grouped collection, where data is collected from many observational units during a short period of time, at very irregular intervals or potentially only once. Data ingestion or processing occurs some time after all the data has been collected. Biological sample data would fall under this category, since laboratories usually run several samples at once and input data after internal quality control checks and machine-specific data processing. While register-based and clinical data are usually collected continuously, direct access to them is only given in infrequent batches, so they may also fall under this category. Survey data may also come in batches, depending on the questionnaire and the software used for its collection.
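As a rough illustration of the two categories above, the following Python sketch models deliveries of data as batches and labels a series of batches as routine or grouped. None of these names are part of Sprout: `Batch`, `classify_collection`, and the two thresholds are hypothetical and chosen only to make the distinction concrete.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass(frozen=True)
class Batch:
    """One delivery of data (hypothetical type, not part of Sprout)."""

    received: date
    n_units: int  # number of observational units in the batch


def classify_collection(
    batches: list[Batch],
    max_gap_cv: float = 0.5,  # illustrative threshold for interval regularity
    max_units: int = 50,  # illustrative threshold for a "small" batch
) -> str:
    """Label a series of batches as 'routine' or 'grouped'.

    Routine: small batches arriving at fairly regular intervals.
    Grouped: a one-off delivery, large batches, or very irregular intervals.
    """
    if len(batches) < 3:
        # Too few deliveries to judge regularity; treat as grouped.
        return "grouped"
    ordered = sorted(batches, key=lambda b: b.received)
    gaps = [(b.received - a.received).days for a, b in zip(ordered, ordered[1:])]
    avg_gap = mean(gaps)
    # Coefficient of variation of the gaps: a low value means regular intervals.
    gap_cv = pstdev(gaps) / avg_gap if avg_gap > 0 else 0.0
    avg_units = mean(b.n_units for b in batches)
    if gap_cv <= max_gap_cv and avg_units <= max_units:
        return "routine"
    return "grouped"


# A clinic recording a few patients each day.
clinic = [Batch(date(2024, 1, d), 5) for d in (1, 2, 3, 4, 5)]
# A laboratory delivering large sample runs at irregular intervals.
lab = [
    Batch(date(2024, 1, 10), 400),
    Batch(date(2024, 6, 2), 380),
    Batch(date(2024, 6, 20), 410),
]
print(classify_collection(clinic))  # routine
print(classify_collection(lab))  # grouped
```

The sketch treats a single delivery as grouped, matching the "potentially only once" case described above. In practice, the boundary between the two categories is a judgment call rather than a fixed threshold.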

Regardless of the flow or frequency of data generation and collection, the ability to input the data into Sprout automatically will vary widely based on the data source, the organization that generates the data, and its technical expertise. Some data sources may have well-established, though not always programmatic or automatic, workflows and processes. Others may not have any workflows at all, and data processing or input may be an entirely manual process.

Footnotes

  1. An observational unit is the “entity” that the data was collected from at a given point in time, such as a human participant in a cohort study or a rat in an animal study at a specific time point.