Term | Definition |
--- | --- |
Area under the curve (AUC) | In a receiver operating characteristic (ROC) curve analysis, an index of the performance of a diagnostic or screening measure in relation to diagnostic accuracy, summarized in a single value that typically ranges from 0.50 (no better than random classification) to 1.0 (perfect classification) (Polit & Yang, 2016); a measure of criterion validity or responsiveness. |
Ceiling effect | The effect of having scores restricted at the upper end of a score continuum, which limits discrimination at the upper end of the measurement, constrains true variability and restricts the amount of upward change possible (Polit & Yang, 2016); a measure of content validity. |
Clinimetrics | The study of instruments where items may be major or minor; or present or absent (Gewitz et al., 2015). |
Comparative fit index (CFI) | A statistic used to evaluate the goodness of fit of a proposed model to the data (e.g. in a confirmatory factor analysis or item response theory analysis) involving the comparison of the proposed model with a null model; a value greater than 0.95 is often considered indicative of good fit (Polit & Yang, 2016); a measure of construct validity. |
Construct validity | The degree to which evidence about a measure’s scores in relation to other scores supports the inference that a construct has been appropriately represented; the degree to which a measure captures the focal construct (Polit & Yang, 2016). |
Content validity | The degree to which a multi-item instrument has an appropriate set of relevant items reflecting the full content of the construct domain being measured (Polit & Yang, 2016); incorporates face validity. |
Content validity index (CVI) | An index summarizing the degree to which a panel of experts agrees on an instrument’s content validity (i.e. the relevance, comprehensiveness and balance of items comprising a scale) (Polit & Yang, 2016). There are both item-level and scale-level CVIs. |
Criterion validity | The extent to which scores on a measure are an adequate reflection of (or predictor of) a criterion (i.e. ‘gold standard’ measure) (Polit & Yang, 2016). |
Cronbach’s alpha coefficients (Coefficient alpha) | An index of internal consistency that indicates the degree to which the items on a multi-item scale are measuring the same underlying construct (Polit & Yang, 2016); a measure of reliability. |
Cross-cultural validity | The degree to which the items on a translated or culturally adapted scale perform adequately and equivalently, individually and in the aggregate, in relation to their performance on the original instrument; an aspect of construct validity (Polit & Yang, 2016). |
Differential item functioning (DIF) | The extent to which an item functions differently for one group or culture than for another despite the groups being equivalent with respect to the underlying latent trait (Polit & Yang, 2016); a measure of cross-cultural validity. |
Face validity | The extent to which an instrument looks as though it is a measure of the target construct (Polit & Yang, 2016). An aspect of content validity. |
Factor analysis | A statistical procedure for disentangling complex interrelationships among items and identifying the items that ‘go together’ as a unified dimension; a measure of construct validity (Polit & Yang, 2016). |
Floor effect | The effect of having scores restricted at the lower end of a score continuum, which limits the ability of the measure to discriminate at the lower end of the measurement, constrains true variability and limits the amount of downward change possible (Polit & Yang, 2016); a measure of content validity. |
Goodness of fit index (GFI) | A statistic used to evaluate the goodness of fit of a proposed model to the data (e.g. in confirmatory factor analysis); a value greater than 0.90 is often considered an adequate fit (Polit & Yang, 2016); a measure of construct validity. |
Internal consistency | The degree to which the subparts of a composite scale (i.e. the items) are interrelated and are all measuring the same attribute or dimension; a measure of reliability (Polit & Yang, 2016). |
Inter-rater reliability | The variation between two or more raters who measure the same group of subjects. |
Intra-class correlation coefficients (ICC) | Estimates the proportion of total variance in a set of scores that is attributable to true differences among the people or objects being measured (e.g. the test-retest reliability); a measure of reliability (Polit & Yang, 2016). |
Intra-rater reliability | The variation of data measured by a single rater across two or more occasions. |
Kappa | A statistical index of chance-corrected agreement or consistency between two nominal (or ordinal) measurements; often used to assess interrater or intra-rater reliability (Polit & Yang, 2016). |
Limits of agreement (LOA) | An estimate of the range of differences in two sets of scores that could be considered random measurement error, typically with 95% confidence; graphically portrayed on Bland-Altman plots (Polit & Yang, 2016); a measure of reliability. |
Measurement error | The systematic and random error of a person’s score on a measure, reflecting factors other than the construct being measured and resulting in an observed score that is different from a hypothetical true score; a measurement property within the reliability domain (Polit & Yang, 2016). |
Measurement properties | The psychometric or clinimetric characteristics of an instrument (e.g. reliability, validity and responsiveness). |
Non-normed fit index (NNFI) | Also known as the Tucker-Lewis index (TLI); see below. |
Psychometrics | The study of instruments that consist of items of equal weighting. |
Reliability | The degree to which a measurement is free from measurement error; the extent to which scores for people who have not changed are the same for repeated measurements; statistically, the proportion of total variance in a set of scores that is attributable to true differences among those being measured (Polit & Yang, 2016). |
Responsiveness | The ability of a measure to detect change over time in a construct that has changed, commensurate with the amount of change that has occurred (Polit & Yang, 2016). |
Root mean square error of approximation (RMSEA) | An index used to evaluate how well a hypothesized model fits the data (e.g. in confirmatory factor analysis or item response theory modelling); an RMSEA of less than 0.06 is considered an indicator of adequate fit (Polit & Yang, 2016); a measure of construct validity. |
Sensitivity | The ability of a screening or diagnostic instrument to correctly identify a ‘case’ (i.e. to correctly diagnose a condition) (Polit & Yang, 2016); a measure of criterion validity or responsiveness. |
Smallest detectable change (SDC) | An index that estimates the threshold for a ‘real’ change in scores (i.e. a change that, with 95% confidence, is beyond measurement error); the SDC is a change score that falls outside the limits of agreement on a Bland-Altman plot (Polit & Yang, 2016); a measure of reliability. |
Specificity | The ability of a screening or diagnostic instrument to correctly identify non-cases for a condition (Polit & Yang, 2016); a measure of criterion validity or responsiveness. |
Standard error of measurement (SEM) | An index that quantifies the amount of ‘typical’ error on a measure and indicates the precision of individual scores (Polit & Yang, 2016); a measure of reliability. |
Standardized root mean square residual (SRMR) | An index used to evaluate how well a hypothesized model fits the data (e.g. in a confirmatory factor analysis); an SRMR of less than 0.08 is considered an indicator of adequate fit (Polit & Yang, 2016); a measure of construct validity. |
Structural validity | The extent to which an instrument captures the hypothesized dimensionality of the broad construct; an aspect of construct validity (Polit & Yang, 2016). |
Test-retest reliability | The variation in repeated measurements taken with the same instrument on the same subject under the same conditions. |
Tucker-Lewis index (TLI) | Also known as the non-normed fit index (NNFI). A statistic used to evaluate the goodness of fit of a proposed model to the data (e.g. in confirmatory factor analysis) involving the comparison of the proposed model with a null model; a value greater than 0.95 is often considered indicative of a good fit (Polit & Yang, 2016); a measure of construct validity. |
Validity | In a measurement context, the degree to which an instrument is measuring the construct it purports to measure (Polit & Yang, 2016). |
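Several of the indices above can be computed directly from their definitions. As a minimal sketch of Cronbach’s alpha, using a small invented dataset (a hypothetical 3-item scale answered by five respondents; the scores are illustrative, not from any study):

```python
# Hypothetical scores: rows = respondents, columns = items on a 3-item scale.
scores = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
]

def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(data):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(data[0])                          # number of items
    items = list(zip(*data))                  # transpose: one tuple per item
    item_vars = sum(variance(list(col)) for col in items)
    total_var = variance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - item_vars / total_var)

print(round(cronbach_alpha(scores), 3))  # → 0.918
```

A value this high would suggest the three items are measuring the same underlying construct, consistent with the internal-consistency interpretation given in the glossary.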
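The chance-corrected agreement described under kappa can likewise be illustrated. The sketch below, assuming two hypothetical raters classifying the same ten subjects as ‘case’ or ‘non-case’ (the ratings are invented), implements Cohen’s kappa:

```python
# Invented ratings by two hypothetical raters of the same 10 subjects.
rater_a = ['case', 'case', 'non', 'non', 'case', 'non', 'non', 'case', 'non', 'non']
rater_b = ['case', 'non', 'non', 'non', 'case', 'non', 'case', 'case', 'non', 'non']

def cohens_kappa(a, b):
    """kappa = (p_o - p_e) / (1 - p_e): p_o is observed agreement,
    p_e is agreement expected by chance from the marginal proportions."""
    n = len(a)
    categories = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

print(round(cohens_kappa(rater_a, rater_b), 3))  # → 0.583
```

Here the raw agreement is 80%, but once chance agreement is removed the kappa of about 0.58 indicates only moderate inter-rater reliability.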
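Sensitivity and specificity follow directly from a 2×2 classification table. As an illustration with hypothetical counts (not drawn from any real screening study):

```python
# Hypothetical 2x2 table: screening instrument vs. 'gold standard' diagnosis.
tp, fn = 45, 5    # gold-standard cases screened positive / negative
tn, fp = 80, 20   # gold-standard non-cases screened negative / positive

sensitivity = tp / (tp + fn)   # proportion of true cases correctly identified
specificity = tn / (tn + fp)   # proportion of non-cases correctly identified
print(sensitivity, specificity)  # → 0.9 0.8
```

In ROC curve analysis, sensitivity and (1 − specificity) pairs computed across every possible cut-off score are what is summarized by the AUC defined above.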
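The SEM and SDC entries are related by simple formulas. A minimal sketch, assuming a hypothetical scale with a standard deviation of 10 points and a test-retest reliability (ICC) of 0.91 (both values invented for illustration):

```python
import math

sd, icc = 10.0, 0.91   # hypothetical scale SD and test-retest reliability

# SEM quantifies 'typical' error: SEM = SD * sqrt(1 - reliability).
sem = sd * math.sqrt(1 - icc)

# SDC at 95% confidence: the smallest change beyond measurement error
# for repeated measurements, SDC = 1.96 * sqrt(2) * SEM.
sdc = 1.96 * math.sqrt(2) * sem

print(round(sem, 2), round(sdc, 2))  # → 3.0 8.32
```

On this hypothetical scale, an individual’s score would need to change by more than about 8 points before the change could be considered ‘real’ rather than measurement error.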
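The limits of agreement used on Bland-Altman plots can also be computed from their definition. A sketch using invented test-retest scores for five hypothetical subjects:

```python
import math

# Invented test and retest scores for five hypothetical subjects.
test = [10, 12, 9, 11, 13]
retest = [11, 12, 10, 10, 14]

diffs = [a - b for a, b in zip(test, retest)]
n = len(diffs)
mean_d = sum(diffs) / n
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))

# 95% limits of agreement: mean difference +/- 1.96 * SD of the differences.
loa = (mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d)
print(round(loa[0], 2), round(loa[1], 2))
```

Differences falling inside these limits would be treated as random measurement error; the SDC corresponds to a change score lying outside them.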