Data management and analysis
This module provides a high-level overview of key data analysis techniques and strategies for managing data and ensuring data quality. By this stage, you should have defined your research outcome and study design. IR teams may find it helpful to work through this module with someone who has strong data analysis and management skills and can provide guidance where needed.
Many IR studies require a team to collect new, purpose-specific data (known as primary data) to address a specific research objective. As described in Module 3, this is typically done using survey tools such as questionnaires or interview guides. Before commencing data collection, several decisions must be made and steps taken. A key component of data collection is determining the most appropriate way to sample and recruit participants, which will depend on your choice of research methods. Your data collection strategy should describe your approach to identifying and recruiting participants and the process for collecting data. Ensuring the quality and appropriate management of collected data is also a key component; these issues are discussed below.
Sampling and recruitment
Sampling refers to how you will select participants for your research. Sampling methods can be either probability- or non-probability-based. Probability-based or random sampling gives every individual in a population an equal chance of being selected into a study. Probability sampling techniques are preferable for quantitative research, as this type of research typically requires a sample that accurately reflects the characteristics of the broader population from which it was selected. To do this, probability-based sampling requires a sampling frame, that is, a complete list of units (individuals, households, clinic facilities, etc.) from which your sample will be drawn.
Common examples of random sampling:
- Simple random sampling: This is the ideal sampling strategy because each element of the population has an equal probability of being selected. The procedure is to assign a number to each element in the sampling frame and then use randomly generated numbers to select elements from it. Most statistical packages can generate random numbers.
- Stratified sampling: Stratified sampling can be used in a population that consists of mutually exclusive sub-groups (e.g. school population with classes). A random sampling procedure is then used to select elements from each stratum. Sample size can be selected proportionately to the stratum size.
- Cluster sampling: Cluster sampling is commonly used when the population is large or dispersed across a large geographical area. The goal of cluster sampling is to increase sampling efficiency. However, cluster sampling reduces the variability represented in the sample, since individuals in the same geographical area tend to be more homogeneous, and the probability of each element being selected is not equal.
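As an illustration, the first two strategies above can be sketched with Python's standard `random` module. The sampling frame, stratum names and sample sizes below are hypothetical, purely for demonstration:

```python
import random

# Hypothetical sampling frame: IDs for 500 registered patients (illustrative only).
sampling_frame = [f"patient_{i:04d}" for i in range(1, 501)]

random.seed(42)  # fix the seed so the draw can be reproduced

# Simple random sampling: draw 50 patients, each with equal probability.
sample = random.sample(sampling_frame, k=50)

# Stratified sampling: draw from each mutually exclusive stratum,
# with sample size proportionate to the stratum size.
strata = {
    "clinic_A": [f"A_{i}" for i in range(300)],
    "clinic_B": [f"B_{i}" for i in range(200)],
}
total = sum(len(members) for members in strata.values())
stratified_sample = [
    person
    for members in strata.values()
    for person in random.sample(members, k=round(50 * len(members) / total))
]
```

In practice the sampling frame would be loaded from a real register rather than generated, but the selection logic is the same.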
However, in some situations, random sampling is not the preferred option due to factors such as a lack of specific resources (e.g. a list of the entire population to serve as a sampling frame), or time or cost constraints. Additionally, your research may be interested in data collected from a specific category of subject, or in a specific context; this is known as non-probability, or non-random, sampling, in which participants are deliberately selected for a specific purpose or based on specific criteria. Given that IR takes place in real-world settings and under real-world conditions, non-probability sampling methods are likely to be more appropriate and feasible for your study.
Common examples of non-random sampling:
- Convenience: Participants are recruited from those accessible during the time or context in which the research is conducted, such as patients currently enrolled in a specific TB clinic.
- Purposive: Participants are purposely recruited based on the researcher's judgment regarding the desired information being collected, such as staff working in a TB unit that is trialling a new electronic reporting system.
- Snowball: Participants are recruited through the personal networks of other participants. In snowball sampling, researchers identify a small number of relevant people to participate, and then ask them to identify other people in their networks who could also participate. Snowball sampling is frequently used for qualitative analysis and to gain access to hard-to-reach populations.
Qualitative research generally requires far fewer participants than quantitative research. A typical sample size for a focus group discussion (FGD), for example, is around 6–10 participants, while key informant interviews (KIIs) are typically conducted with around 10–20 participants in total, given the time they require and the richness of data they provide. FGDs and KIIs are usually stopped when no new information is being generated during data collection – this phenomenon is known as 'data saturation'.
In quantitative research, sample size is determined by statistical power parameters; that is, having enough 'power' or probability within your study to detect an effect under investigation. Sample size can also be guided by the level of expected precision when there is no comparison. Sample size calculation formulae and calculation procedures can be found in standard biostatistics reference materials.1 Early involvement and discussion with a statistician are strongly advised to calculate the appropriate sample size needed for various types of research methods and to deal with any adjustments that may be needed as the study evolves.
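As a simple illustration of a power-based calculation, the standard normal-approximation formula for comparing two independent proportions can be scripted as below. The proportions, significance level and power are illustrative assumptions only; a statistician should confirm the appropriate method for your actual design:

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two independent proportions,
    using the standard normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numerator / (p1 - p2) ** 2)

# Hypothetical example: treatment success expected to rise from 60% to 75%.
n_per_group = sample_size_two_proportions(0.60, 0.75)
print(n_per_group)  # 150 participants per group under these assumed values
```

Note that this sketch ignores design effects such as clustering or anticipated loss to follow-up, which typically increase the required sample size.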
Secondary data collection
Many IR studies also use existing data (known as secondary data) to address a specific research objective. By using existing, routinely collected data, research teams can reduce the costs and time associated with conducting a study. TB registers are an example of routinely collected data that could be utilised and analysed as a secondary data source.
Additionally, a feature unique to digital interventions is that many can automatically collect and store electronic data that can be useful to a study team. This type of 'system-generated' data is often under-utilised, yet when combined with purposefully collected primary data it can provide important information for monitoring implementation outcomes.
Examples of system-generated data that could be used include:
- for implementation: data on connectivity issues and functionality; data on irregularities in the time or place from which data are sent by each health worker (for example, as noted through timestamps and GPS codes); data on unusual patterns in data collection (for example, follow-up visits being recorded consecutively within a short time span); data on user errors; incomplete recording forms, etc.
- for adoption: data on frequency of data uploads, number of user registrations, number of individual users, average time spent on a platform etc.
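As one illustrative sketch of how such data might be used, timestamps in a system-generated log can be screened for visits recorded in quick succession. The record format, worker IDs and two-minute window below are entirely hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical system-generated log: (health worker ID, visit timestamp).
records = [
    ("hw01", "2024-03-01 09:02:11"),
    ("hw01", "2024-03-01 09:02:40"),  # 29 seconds after the previous entry
    ("hw01", "2024-03-01 14:30:05"),
    ("hw02", "2024-03-01 10:15:00"),
]

def flag_rapid_entries(records, window=timedelta(minutes=2)):
    """Flag consecutive entries by the same worker recorded within `window`,
    a possible sign of forms being filled in retrospectively."""
    parsed = sorted(
        (worker, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
        for worker, ts in records
    )
    flags = []
    for (w1, t1), (w2, t2) in zip(parsed, parsed[1:]):
        if w1 == w2 and (t2 - t1) <= window:
            flags.append((w2, t2))
    return flags

print(flag_rapid_entries(records))  # flags hw01's 09:02:40 entry
```

Flagged entries would then be reviewed with supervisors rather than treated as errors automatically.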
If your study involves the development and evaluation of a new digital intervention, the study team should consider which data points may be useful from an IR perspective and ensure they are built into the back-end process. These data points should be explicitly discussed with any developers or technologists responsible for building or designing the digital intervention, rather than assuming they will automatically be captured.
For studies using secondary data, either alone or as part of a mixed-methods approach, it is still necessary to describe your data collection process, including how you will identify which data to use (for example, will you take all records from an electronic TB register, or just a sample?) and the methods for extracting and preparing the data for analysis.
Chapter 5 (Assessing data sources and quality of M&E) in: Monitoring and evaluating digital health interventions: a practical guide for conducting research and assessment2 provides a helpful overview of how to review existing data sources and identify potential data that can be used for research purposes.
Data quality and management
Most research projects generate a significant amount of data, and high data quality is important because it underpins the quality of the study. Embedding quality management strategies within your proposal is essential to ensure that the research meets scientific, ethical and regulatory standards. Various strategies can be adopted to promote data quality, depending on your study procedures and methodology. Your proposal must outline the consistent, ongoing measures the research team will take to monitor and evaluate the quality and rigour of the research throughout its various stages. Consider the following questions:
- What is the data flow (from the source to the place where data is collected)?
- Which data are critical, i.e. where errors are not acceptable?
- If you are using data collectors, have they received adequate training on data collection processes, use of data collection tools etc?
- Are there supervision/quality control processes in place to ensure the consistency and quality of data collection is maintained throughout the data collection period?
- Are there standard operating procedures (SOPs) in place to guide the data collection, entry and safe storage?
- Are there systems in place to support data storage (including backups) and ensure data security?
- What are the processes for data cleaning?
- Are there SOPs in place to guide data validation to ensure quality and consistency?
Data analysis
Data analysis refers to the inspection and interpretation of collected data to test hypotheses, generate new knowledge and, ultimately, address research objectives. There are various ways to analyse data, and the optimal choice of methods depends on the type of data you have collected, your outcome of interest and the overall objective of your research. The sub-sections below reflect some of the main data analysis methods used in both qualitative and quantitative research.
Qualitative data analysis
Qualitative data analysis aims to identify patterns or themes in the data and the links that exist between them (Box 18). Qualitative data analysis typically involves either a deductive or inductive approach, in which the data is collected, analysed and grouped either in relation to the stated research question (deductive) or based on patterns and themes that are identified during the data review process (inductive).
There are various qualitative data analysis tools available to manage and sort qualitative data to make it easier to find and identify themes. Some commonly used tools include:
- ATLAS.ti (www.atlasti.com) handles large data sets and unstructured coding, and mimics paper-based code-and-sort functions.
- MAXQDA (www.maxqda.com) provides powerful tools for analysing interviews, reports, tables, online surveys, videos, audio files, images and bibliographical data sets.
- QSR NVivo (www.qsrinternational.com) supports unstructured coding and finding patterns/relationships in codes.
Quantitative data analysis
Quantitative data analysis involves the use of statistical methods to describe the data, assess differences between groups, quantify correlations or associations between variables, and control for factors that may influence the relationship between variables (known as confounders). Typically, the variables are grouped into outcomes and explanatory variables. Tables are used to document aggregate and disaggregated data, supplemented with information on sample sizes, effect estimates (for example, differences in means or odds ratios), confidence intervals and P values. The design features of your study will need to be considered when deciding which statistical techniques to use. For example, when comparing outcomes in two groups, are the data paired or from two independent samples? Are the data points independent observations, or is there clustering within the data? Advanced statistical approaches, such as regression, may be used to control for confounding in studies focused on associations between variables, or to construct a predictive model if the interest is in prediction. Table 11 provides a brief overview of commonly used quantitative analysis methods for IR. This list is not exhaustive, and advice from a local research institution and/or specialist should be sought if needed.
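For example, an effect estimate such as an odds ratio, with its approximate 95% confidence interval, can be computed directly from a 2×2 table. The counts below are invented purely for illustration:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% confidence interval from a 2x2 table
    (a, b = outcome yes/no in group 1; c, d = outcome yes/no in group 2)."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log odds ratio
    low = math.exp(math.log(or_) - z * se_log)
    high = math.exp(math.log(or_) + z * se_log)
    return or_, low, high

# Hypothetical counts: treatment success under a new vs standard strategy.
or_, low, high = odds_ratio_ci(a=40, b=10, c=30, d=20)
print(f"OR = {or_:.2f} (95% CI {low:.2f}-{high:.2f})")  # OR = 2.67 (95% CI 1.09-6.52)
```

A confidence interval that excludes 1 suggests an association between group and outcome, but this simple calculation does not adjust for confounders or clustering; for that, regression-based methods are needed.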
There are many software packages available for quantitative data analysis. Some of the most commonly used packages include Microsoft Excel, STATA, SAS, SPSS and R. These programmes vary in their ability to perform complex analyses and may require training and knowledge of specific programming languages and syntaxes for their effective use.
Your proposal must outline which summary statistics, such as means and standard deviations, will be calculated from data collected, and any statistical analyses that will be conducted, including the tools that will be used.
Proposal checklist: Data management
Exercise: With your team, finalise the research methods section of your proposal by focusing on the data analysis plan and the quality management plan. Work through the checklist to make sure this section includes all necessary information and is correctly formatted.
Describe the approach to sampling and recruiting participants
Describe the measures, data sources and/or process for data collection related to the research outcomes, including the time points at which they are measured in relation to the delivery of the implementation strategy or intervention.
Data management and analysis
Describe exactly how data you collect will be compiled and managed.
Describe the process for collecting, entering and cleaning data (as relevant).
Describe the chosen methods of analysis used to assess relationships between key outcomes of interest and other variables. Include details of any subgroup analyses that will be undertaken and of any analysis software that will be used.
Describe the systems that will be adopted to ensure the quality of data collection, storage and analysis, as well as the quality of the broader project.