RMT analyses use a mathematical model, the Rasch model, to evaluate the extent to which items from an instrument can be summed to build a proper measurement of the underlying abstract concept.13–15 RMT analyses explore the following properties:
Targeting: With the Rasch model, the parameters for items and participants are estimated on the same continuum, allowing a direct comparison of the distributions of items and participants. Targeting addresses how well participants and items match, which determines whether participant and item parameters can be estimated with sufficient precision. It is assessed by comparing the spread of person and item location estimates over the common continuum.
Fit: Items must work together to define a clinically and statistically meaningful score. Otherwise, it is inappropriate to sum item responses into a total score and to treat that score as an accurate measure of the target concept. When items do not work together in this way (ie, there is item misfit), the validity of the item set is questionable. Item fit is assessed based on the ordering of item response options (ie, ordering of item thresholds)16 and on the comparison of observed and expected responses, using statistical indices and graphical examination of the item characteristic curve (ICC).17 Statistical indices include standardized fit residuals, which are recommended to lie in the range −2.50 to +2.50,15 and chi-square tests.
Reliability: The principle of reliability is that applying the patient-reported outcome measure on different occasions or by different observers produces consistent results.18 It is assessed using the Person Separation Index (PSI),19 a reliability coefficient estimate. Reliability coefficients are commonly interpreted as follows: <0.70: unsatisfactory; 0.70–0.79: modest; 0.80–0.89: adequate; 0.90–1.00: good.20
Differential item functioning (DIF): A key criterion for strong measurement is invariance, meaning that items have the same meaning for all patients, regardless of their characteristics (demographic, clinical, etc). In these analyses, we used DIF to examine cross-cultural invariance: the expected response to an item was compared between patients who have the same level of the measured concept but belong to different global regions among those investigated in global clinical trials.
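The numerical criteria quoted above can be written down as a minimal Python sketch. It is illustrative only and is not part of the RUMM 2030 analyses: the function names are hypothetical, the probability function assumes the dichotomous form of the Rasch model (polytomous items would need additional threshold parameters), the fit residual range is assumed to include its end points, and the reliability bands are assumed to be half-open intervals so that values such as 0.795 fall in the lower band.

```python
import math

# Recommended range for standardized fit residuals, as quoted in the text.
FIT_RESIDUAL_LIMIT = 2.50


def rasch_probability(person_location: float, item_location: float) -> float:
    """Probability of endorsing a dichotomous item under the Rasch model.

    Both locations are expressed in logits on the common continuum.
    """
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


def fit_residual_in_range(residual: float, limit: float = FIT_RESIDUAL_LIMIT) -> bool:
    """True if a standardized fit residual lies within -limit to +limit (inclusive, by assumption)."""
    return -limit <= residual <= limit


def interpret_reliability(coefficient: float) -> str:
    """Map a reliability coefficient (eg, the PSI) to the label used in the text."""
    if coefficient > 1.0:
        raise ValueError("a reliability coefficient cannot exceed 1.00")
    if coefficient < 0.70:
        return "unsatisfactory"
    if coefficient < 0.80:
        return "modest"
    if coefficient < 0.90:
        return "adequate"
    return "good"
```

For example, a person located exactly at an item's location has a probability of 0.5 of endorsing it, and a hypothetical PSI of 0.85 would be labelled "adequate".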
RMT analyses were carried out using RUMM 2030 software (RUMM Laboratory, Perth, Australia) on the pooled baseline SGRQ data from the five trials (regardless of which treatment the individuals were assigned to). Separate analyses were performed for the SGRQ “Symptoms” and “Activity” domains. The item sets included in the Rasch model were iteratively modified based on the results, conceptual examination of the item content and previously published findings. The modification of the item sets included refinement of the item selection (eg, excluding items that did not fit the model) or recoding of the response options (eg, grouping response options). Given the large sample size and the sensitivity of the test used to sample size, differences that are not clinically meaningful could still be statistically significant. Hence, statistical significance was consistently interpreted with caution.
DIF analysis was performed for geographical regions on the “Activity” domain, in order to explore the cross-cultural validity of the SGRQ. To facilitate interpretation of the findings and to obtain more homogeneous groups, the 42 countries included in the trials were grouped into 13 geographical regions according to the United Nations statistics division categorization:21 Canada, Central and south America, China, Eastern Europe, India, Japan, North Africa and west Asia, Northern Europe, Oceania and south Africa, South and east Asia, Southern Europe, USA, Western Europe.
Both uniform and non-uniform DIF were tested. DIF is said to be uniform if the difference in the expected response to an item between two groups is the same across the full range of the concept being measured. DIF is non-uniform if the difference between groups depends on the level of the concept being measured (eg, patients with low activity limitations from a given global region tend to endorse an item less than patients from other global regions, while patients with high activity limitations from that same global region tend to endorse it more).
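The distinction between uniform and non-uniform DIF can be illustrated with a small descriptive sketch in Python. This is not the statistical test used in the analyses: it only applies the two definitions above to the mean item responses of two groups across class intervals of the measured concept. The function name, the input layout and the `tolerance` value are hypothetical choices made for illustration.

```python
from typing import Sequence


def classify_dif(
    group_a: Sequence[float],
    group_b: Sequence[float],
    tolerance: float = 0.1,  # hypothetical cut-off, not taken from the source
) -> str:
    """Describe the DIF pattern between two groups for a single item.

    group_a and group_b hold the mean item response of each group in the same
    class intervals, ordered from low to high level of the measured concept.
    """
    if len(group_a) != len(group_b) or len(group_a) == 0:
        raise ValueError("both groups need the same, non-zero number of class intervals")

    # Between-group difference in each class interval.
    differences = [a - b for a, b in zip(group_a, group_b)]

    # Differences that are negligible everywhere: no DIF.
    if all(abs(d) <= tolerance for d in differences):
        return "no DIF"

    # A roughly constant difference across the continuum: uniform DIF.
    if max(differences) - min(differences) <= tolerance:
        return "uniform DIF"

    # A difference that changes along the continuum: non-uniform DIF.
    return "non-uniform DIF"


# Hypothetical mean responses in three class intervals (low, medium, high).
print(classify_dif([0.2, 0.5, 0.8], [0.4, 0.7, 1.0]))
print(classify_dif([0.2, 0.5, 0.9], [0.4, 0.5, 0.7]))
```

In the first hypothetical call one group scores consistently lower in every class interval, which matches the definition of uniform DIF. In the second the direction of the difference reverses between low and high levels, which matches the non-uniform example given in the text.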