The purpose of synthesizing data on measurement properties is to evaluate whether the measurement properties of a specific instrument are adequate for its intended use. Data for each measurement property of each instrument of interest should be synthesized and evaluated.
Homogeneity of the study characteristics
Results on measurement properties can only be generalized to populations that are similar to the study sample in which the measurement properties were evaluated. This implies that when a measurement property has been evaluated in different studies, we need to consider the (dis)similarities in populations and settings across those studies and use this to judge whether it is reasonable to combine their results. A further complexity is that we also need to consider the language version of the instrument that was used and the form of administration (for example, online versus paper based).
There are two options for data synthesis of each measurement property: meta-analysis or narrative synthesis.
Meta-analysis
Statistical methods exist for pooling parameters related to measurement property data, for example, Cronbach’s alpha coefficients, correlation coefficients (intra-class, Spearman, Pearson), standard errors of measurement (SEMs) and minimal important change (MIC) values. Correlations may be pooled using the correlation coefficients directly or using z-transformed coefficients (Shadish & Haddock, 1994). Pooling should only be performed if several studies are available that are sufficiently similar for their results to be combined.
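The z-transformed option can be illustrated with a minimal sketch. It assumes Pearson correlations from independent samples, Fisher's z transformation, and inverse-variance weights of n - 3; the function name and the 95% normal-approximation interval are illustrative choices, not a prescribed method. It is a fixed-effect calculation and does not yet account for heterogeneity between studies.

```python
import math


def pool_correlations_fisher_z(correlations, sample_sizes):
    """Pool Pearson correlations on Fisher's z scale (hypothetical helper).

    Assumes independent samples; each study is weighted by n - 3,
    the inverse of the sampling variance of Fisher's z.
    Returns the pooled correlation and an approximate 95% interval.
    """
    if len(correlations) != len(sample_sizes) or not correlations:
        raise ValueError("need one sample size per correlation")
    if any(n <= 3 for n in sample_sizes):
        raise ValueError("each study needs more than 3 participants")

    # Transform each correlation to z and weight by inverse variance.
    z_values = [math.atanh(r) for r in correlations]
    weights = [n - 3 for n in sample_sizes]

    z_pooled = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    se = 1.0 / math.sqrt(sum(weights))

    # Back-transform the estimate and its interval to the r scale.
    pooled_r = math.tanh(z_pooled)
    ci = (math.tanh(z_pooled - 1.96 * se), math.tanh(z_pooled + 1.96 * se))
    return pooled_r, ci
```

Calling `pool_correlations_fisher_z([0.62, 0.70, 0.55], [80, 120, 60])` would return a pooled correlation together with its interval. Other parameters, such as intra-class correlation coefficients, have different sampling variances and would need a different weighting scheme.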
Some heterogeneity between the study estimates should be expected due to differences in participants and study characteristics. Thus, a DerSimonian and Laird random-effects model should be used in the meta-analysis (DerSimonian & Laird, 1986). Heterogeneity between the studies should be quantified using the I² statistic, and reasons for heterogeneity should be explored using subgroup and/or sensitivity analyses. In particular, sensitivity analyses excluding studies of poor methodological quality should be performed to assess whether the pooled estimates are strongly influenced by the results of these studies.
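As a rough illustration of the random-effects approach, the sketch below computes the DerSimonian and Laird method-of-moments estimate of the between-study variance, the random-effects pooled estimate, and I². It assumes each study contributes an effect estimate and its within-study variance on a scale suitable for pooling (for example, Fisher's z for correlations); the function name and the returned dictionary keys are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird estimator (sketch).

    effects:   per-study estimates on a poolable scale
    variances: per-study within-study sampling variances (all > 0)
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and its degrees of freedom.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # Method-of-moments between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I² as the percentage of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau² to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))

    return {"pooled": pooled, "se": se, "tau2": tau2, "q": q, "i2": i2}
```

A sensitivity analysis of the kind described above could be approximated by calling the function a second time on the subset of studies not rated as poor quality and comparing the two pooled estimates.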
While meta-analysis of data is encouraged where appropriate, useful published examples of meta-analysis using measurement property data are limited and there is a lack of standardized statistical methods. More research is needed on the methodology of statistical pooling of data from studies on measurement properties. Example systematic reviews with meta-analysis that may be worth consulting include Anderson et al. (2019) (correlation coefficients for internal consistency, reliability, construct validity and criterion validity), Bai et al. (2018) (Cronbach’s alpha for internal consistency, ICC for test-retest reliability, Pearson correlation for hypotheses testing), Chamorro et al. (2017) (limits of agreement (LoA) for reliability, ICC for criterion validity), Chiarotto et al. (2016) (correlation coefficients for construct validity), and Collins et al. (2016) (standardized response mean (SRM) for responsiveness).
Narrative synthesis
Measurement property data that is not suitable for pooling in a meta-analysis should be combined using narrative synthesis. When reporting the findings of the studies, the narrative synthesis should take into consideration the methodological quality of the studies, the consistency of the results, and the homogeneity of the studies.
Evaluation of the measurement instrument(s)
Once the data has been statistically pooled or narratively synthesized, the evidence for each measurement property of each instrument of interest should be compared to accepted criteria for adequate measurement properties. It is recommended to use the ‘criteria for good measurement properties’ suggested by COSMIN (Prinsen et al., 2018, Table 1; Mokkink et al., 2018b, Table 4). Using these criteria, each measurement property can be rated as sufficient, insufficient, or indeterminate. This overall rating is important in determining whether a measurement instrument is adequate for use in particular populations and contexts.
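The rating step can be sketched for a single pooled coefficient. The example assumes a simple cut-off rule in which a value at or above a threshold is rated sufficient, a value below it is rated insufficient, and a missing value is rated indeterminate. The default threshold of 0.70 is an assumption used here for illustration; the actual criterion for each measurement property should be taken from the cited COSMIN tables, and several properties have criteria that are not a single cut-off.

```python
from typing import Optional


def rate_against_threshold(value: Optional[float], threshold: float = 0.70) -> str:
    """Rate one pooled coefficient against a cut-off (hypothetical helper).

    The 0.70 default is an illustrative assumption; confirm the criterion
    for the property being rated in the COSMIN tables before use.
    """
    if value is None:
        # No usable estimate was reported or pooled.
        return "indeterminate"
    if value >= threshold:
        return "sufficient"
    return "insufficient"


# Hypothetical pooled estimates for three instruments (not real data).
pooled_icc = {"Instrument A": 0.85, "Instrument B": 0.62, "Instrument C": None}
ratings = {name: rate_against_threshold(icc) for name, icc in pooled_icc.items()}
```

Keeping the threshold as a parameter makes it straightforward to apply a different criterion for each measurement property.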