Validation and psychometric properties

 

Validation is an ongoing process of testing an instrument in different ways and in different contexts to determine whether or not the instrument measures what it purports to measure.

Unfortunately, the term validation is commonly misunderstood; many people take it to mean a once-and-for-all verdict on whether an instrument is or is not valid. In fact, an instrument may be valid in some circumstances but not in others. Three kinds of evidence bear on validity.

Content validity: This is how well an instrument's items can be considered a representative sample of the universe that the researcher is trying to measure. In HRQoL measurement this might be the extent to which the items in an instrument cover the full domain of HRQoL; that is, whether the instrument includes items that enquire about each of the dimensions of health included in the underlying concept of HRQoL. Where content validity is judged simply by inspecting the instrument, this is described as face validity; although popular, apparent face validity does not confer content validity.

Construct validity: This indicates whether instrument items truly represent the underlying construct of interest, i.e. whether scale scores can be used to infer certain concepts. This implies that the researcher has to define an underlying construct for whatever is being measured. If no adequate construct is defined, the content of the instrument defines the construct that is being measured. If the construct is correctly represented by the instrument, it is possible to draw a succession of inferences about instrument scores in different contexts. Construct validity therefore involves an ongoing process of testing the instrument in different contexts.

Criterion validity: This describes the relationship between an instrument's scores and either other independent measures (the criteria) or other specific measures (predictors), where the criteria or predictors are the gold standard for the measurement (e.g. in the case of breast cancer, the gold standard is the histopathological confirmation of cancer). In the absence of a gold standard, confidence in criterion validity increases if the instrument correlates highly with each of the accepted extant instruments.

 

Both the descriptive system and the utility values associated with health states require validation. More specifically, the descriptive system should be shown to have content, construct and criterion validity. Utility scores should correctly reflect the strength of people's preferences for health states. In addition, the model employed to combine the utility scores from different dimensions should be valid.

 

Yes. Using the time trade-off (TTO) instrument, a person might judge that the health state being evaluated represents very poor health and readily trade off years of life. Rather than live their full 10 remaining years in the poor health state, they would prefer to live for 1 year in full health, trading off 9 years.

At such a low number of years it is appropriate to ask whether the person would prefer death to living for any time in the health state. When the answer is affirmative, a worse-than-death TTO question is asked: if there still remained 10 years of life, what is the maximum time the person could live in the poor health state if they knew there would then be a cure that restored them to full health for the remainder of the 10 years?

Placing a numerical value on these states is difficult. If a person refused even one day in the health state followed by 10 years of full health, the implied numerical value of the health state approaches minus infinity. This problem is discussed at length in Richardson and Hawthorne (2001), where various options are considered and their numerical implications demonstrated. The final algorithm used to calculate utilities transforms negative scores so that the lower boundary is U = -0.25; that is, the maximum disutility is 1.25.

Richardson, J. & Hawthorne, G. (2001). Negative utility scores and evaluating AQoL all worst health state. Working Paper 113. Melbourne, Monash University.

Interview methodology is presented in detail in Iezzi and Richardson (2009). Measuring Quality of Life at the Centre for Health Economics. Melbourne, Monash University.
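
The time trade-off arithmetic described above can be illustrated with a short sketch. It assumes the conventional TTO scoring rule (value = years in full health / years in the health state) and, for the worse-than-death question, indifference between immediate death (value 0) and the offered profile. The function names are hypothetical, and the transformation that bounds negative scores at U = -0.25 is not reproduced here because its form is given only in the cited working paper.

```python
def tto_utility(years_full_health: float, horizon_years: float) -> float:
    """Conventional TTO value for a state judged better than death.

    The respondent is indifferent between `horizon_years` in the health
    state and `years_full_health` in full health, so U = x / t.
    """
    if horizon_years <= 0 or not 0 <= years_full_health <= horizon_years:
        raise ValueError("need 0 <= years_full_health <= horizon_years, horizon > 0")
    return years_full_health / horizon_years


def raw_worse_than_death_utility(max_years_in_state: float, horizon_years: float) -> float:
    """Untransformed value implied by the worse-than-death question.

    Assumes indifference between immediate death (value 0) and
    `max_years_in_state` in the poor state followed by full health for the
    rest of the horizon: U * x + (t - x) = 0, hence U = -(t - x) / x.
    """
    if not 0 < max_years_in_state <= horizon_years:
        raise ValueError("need 0 < max_years_in_state <= horizon_years")
    return -(horizon_years - max_years_in_state) / max_years_in_state


# 1 year of full health judged equal to 10 years in the state.
print(tto_utility(1, 10))                        # 0.1
# Willing to endure 5 of the 10 years before the cure.
print(raw_worse_than_death_utility(5, 10))       # -1.0
# Willing to endure only one day: the raw value is roughly -3649.
print(raw_worse_than_death_utility(1 / 365, 10))
```

As the tolerated time in the state shrinks towards zero, the raw value falls without limit, which is why some transformation of negative scores is needed.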

 

This involves two separate issues.

  • the validity of the TTO utility measurement technique; and
  • the validity of AQoL scores after dimension utilities have been combined using a multiplicative model.

Validating scaling techniques is problematic, as it is not possible to observe actual trade-offs between the quality and length of life that correspond to the trade-offs measured by the various scaling instruments. Consequently, validity has been determined primarily by face validity. Some have argued that the standard gamble should be regarded as the gold standard for utility measurement because its use assumes the axioms of von Neumann & Morgenstern; this appears to make standard gamble results consistent with mainstream economic theory. However, as the axioms have been shown to be empirically incorrect, we have not adopted this procedure (Pope CHE Working Paper). Rather, for the reasons outlined by Richardson (1994) and Dolan et al. (1996), we have accepted the time trade-off as having the greatest prima facie validity.

 

No. The AQoL is the only instrument whose descriptive system was constructed using correct psychometric principles for instrument construction. Most instruments in the literature claim validation on the basis of a correlation between their results and results from another instrument which has been validated (often in the same way!). This type of result is necessary but far from sufficient for confidence in an instrument. A valid instrument will produce valid utility scores. The criterion for achieving this is, in fact, exceedingly stringent. A given percentage increase in the numerical score of the utility index must indicate an increase in quality of life that is valued equally to an identical percentage increase in length of life. No instrument has been shown to have this property.

For a discussion of this so-called strong interval property, see Richardson (1990), Cost utility analysis: what should be measured, Working Paper 5. Importantly, an instrument may be valid in one context (disease area, intervention) but not in another.
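
A small numerical illustration may help make the strong interval property concrete. It assumes the conventional QALY calculation (utility multiplied by years of life), which is not spelled out in the text above; the function name and figures are hypothetical.

```python
from math import isclose


def qalys(utility: float, years: float) -> float:
    """Quality-adjusted life years under the conventional product rule (an assumption)."""
    return utility * years


# Hypothetical starting point: utility 0.5 for 10 years.
better_quality = qalys(0.5 * 1.2, 10)   # 20% higher utility, same length of life
longer_life = qalys(0.5, 10 * 1.2)      # same utility, 20% longer life

# Both changes yield the same QALY total, so the index implicitly claims
# that people value them equally.
print(isclose(better_quality, longer_life))  # True
```

The arithmetic always balances; the stringent part is the empirical claim that respondents really are indifferent between the two improvements.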

 

The table below provides intraclass correlation coefficients (ICCs) for the AQoL-8D instrument and its dimensions when the instrument was re-administered after two weeks and after one month.

 

Table 3. Test-retest reliability: intraclass correlation coefficients (ICC)

[Table image not reproduced: AQoL8D_test-retest_table]

 

Test-retest reliability coefficients are sourced from Richardson, J. & Iezzi, A. (2011). Psychometric validity and the AQoL 8D Multi attribute utility instrument. Research Paper 71, Table 3, p. 13.
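
For readers unfamiliar with the statistic, the following sketch shows one way an intraclass correlation coefficient can be computed from test-retest data. It implements the two-way random effects, absolute agreement, single measurement form, often written ICC(2,1). Which ICC variant was used for Table 3 is not stated here, so this choice is an assumption, and the function name and scores are hypothetical.

```python
def icc_two_way_random(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `scores` is a list of rows, one per respondent, each holding that
    respondent's score at every administration (e.g. [baseline, retest]).
    """
    n = len(scores)        # number of respondents
    k = len(scores[0])     # number of administrations
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete n x k table with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical utility scores: [baseline, two-week retest] per respondent.
example = [[0.82, 0.80], [0.45, 0.50], [0.91, 0.93], [0.60, 0.58], [0.73, 0.70]]
print(round(icc_two_way_random(example), 3))
```

Values close to 1 indicate that respondents' scores are stable between administrations; the sketch does not guard against the degenerate case where every score is identical.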

 

The AQoL-4D describes 1.07 billion health states, which is a very small subset of the number of health states defined by the AQoL-8D (many of which are a little improbable, e.g. being blind, deaf, bedridden, full of energy and in control of your life). These health states cannot all be measured individually and, like all other MAU instruments (except for the Rosser-Kind index), the AQoL models utility scores from a limited number of observations. To date, most instruments have adopted an additive model in which the disutility associated with each response to each item is measured independently, and the overall disutility is estimated or modelled as a weighted average of these disutilities, where the weights are also obtained empirically during the scaling survey. This additive model is probably invalid, and the multiplicative model employed by the HUI and AQoL instruments is superior (Richardson & Hawthorne 1998). However, there is no certainty that even this more flexible model avoids significant estimation bias.
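
The difference between the two model families can be sketched as follows. The multiplicative function uses the general form from multi-attribute utility theory, DU = (prod(1 + k * w_i * du_i) - 1) / k, with the scaling constant k chosen so that the all-worst state has a disutility of 1. Whether the published AQoL algorithm uses exactly this parameterisation is not stated here, and the weights and function names below are hypothetical.

```python
from math import prod


def additive_disutility(weights, disutilities):
    """Weighted average of dimension disutilities (weights normalised to sum to 1)."""
    return sum(w * d for w, d in zip(weights, disutilities)) / sum(weights)


def solve_scaling_constant(weights, tol=1e-12):
    """Find the non-zero k with 1 + k = prod(1 + k * w_i), by bisection.

    This sketch only handles weights in (0, 1) whose sum is noticeably
    greater than 1, in which case the root lies in (-1, 0).
    """
    if sum(weights) <= 1 or any(not 0 < w < 1 for w in weights):
        raise ValueError("sketch requires weights in (0, 1) summing to more than 1")

    def f(k):
        return prod(1 + k * w for w in weights) - (1 + k)

    lo, hi = -1.0, -1e-6          # f(lo) > 0 and f(hi) < 0 bracket the root
    if f(hi) >= 0:
        raise ValueError("weights sum too close to 1 for this sketch")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def multiplicative_disutility(weights, disutilities, k):
    """Combine dimension disutilities (each in [0, 1]) multiplicatively."""
    return (prod(1 + k * w * d for w, d in zip(weights, disutilities)) - 1) / k


weights = [0.5, 0.4, 0.3]          # hypothetical dimension weights
k = solve_scaling_constant(weights)
state = [0.5, 0.5, 0.5]            # moderate problems on every dimension
print(additive_disutility(weights, state))
print(multiplicative_disutility(weights, state, k))
```

Under the additive form the contribution of each dimension is fixed regardless of the others; under the multiplicative form the marginal effect of a problem on one dimension depends on the levels of the other dimensions, which is the added flexibility referred to above.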