We report here a Classical Test Theory (CTT) analysis of the data. Although alternative methods have advantages – especially Rasch analysis [26] – the comparative simplicity and familiarity of CTT were considered desirable given the objective of maximising accessibility for the largest possible audience [27]. We analysed the data within both CTT and Rasch frameworks, but only the CTT values are reported here.
For each item, we calculated the overall mean (or facility) score (ranging from zero, indicating that no candidate answered the item correctly, to one, indicating that all candidates answered correctly), the standard deviation (SD) and the discrimination index (a measure of whether the item discriminated between candidates who performed well or poorly on the assessment as a whole [28]). Facility and discrimination values did not differ significantly between the two study years, indicating that the common content operated similarly in each year, so we repeated the same analysis on each cohort. Across items, mean facility was 0.74 (SD = 0.18) and mean discrimination was 0.20 (SD = 0.10). We then calculated mean item performance (and associated SDs) for each school, per year, and identified the proportion of items on which each school scored one or two SDs above, or one or two SDs below, the mean, as a measure of the school's overall performance against all medical schools.
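To illustrate these calculations, the sketch below (in Python with pandas) shows one way the item and school statistics could be derived. It is not the authors' code: it assumes a 0/1-scored response matrix `responses` (candidates × items) and a Series `school` giving each candidate's medical school, uses the corrected item-total correlation as the discrimination index, and takes the between-school SD of item means when flagging schools one or two SDs from the mean – any of these choices may differ from the original analysis.

```python
# Illustrative sketch only; data structures and statistic definitions are assumptions.
import pandas as pd

def item_statistics(responses: pd.DataFrame) -> pd.DataFrame:
    """CTT item statistics: facility, SD and discrimination.

    Discrimination is computed here as the corrected item-total (point-biserial)
    correlation; the discrimination index cited in the paper may be defined differently.
    """
    total = responses.sum(axis=1)
    stats = pd.DataFrame(index=responses.columns)
    stats["facility"] = responses.mean()   # 0 = no candidate correct, 1 = all correct
    stats["sd"] = responses.std()
    stats["discrimination"] = [
        responses[item].corr(total - responses[item]) for item in responses.columns
    ]
    return stats

def school_flags(responses: pd.DataFrame, school: pd.Series) -> pd.DataFrame:
    """Proportion of items on which each school sits 1 or 2 SDs above/below the mean.

    The SD used here is the between-school SD of item means; the paper does not
    state which SD was used, so this is one plausible reading.
    """
    by_school = responses.groupby(school).mean()      # schools x items
    item_mean, item_sd = by_school.mean(), by_school.std()
    z = (by_school - item_mean) / item_sd              # school-level z-score per item
    return pd.DataFrame({
        "prop_1sd_above": (z >= 1).mean(axis=1),
        "prop_2sd_above": (z >= 2).mean(axis=1),
        "prop_1sd_below": (z <= -1).mean(axis=1),
        "prop_2sd_below": (z <= -2).mean(axis=1),
    })
```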
To further explore this, we compared the total number of items on which each school scored two SDs below the mean, expressed as a percentage from zero (the school had no items 2 SDs below the mean) to 100 % (the school's cohort scored 2 SDs below the mean on every item). We calculated tertiles from each school's mean mark across all the items it used, and compared the bottom tertile (the ten lowest-performing medical schools on this assessment) against the top tertile (the ten highest-performing medical schools), running the analysis for each cohort.
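A hedged sketch of this tertile comparison is shown below. It re-uses the assumed data structures and the `school_flags` helper from the previous sketch, and uses `pd.qcut` to form tertiles from each school's mean mark; the authors may have defined tertiles or ranked schools differently.

```python
# Sketch only; depends on `school_flags` and the assumed inputs from the previous block.
def tertile_comparison(responses: pd.DataFrame, school: pd.Series) -> pd.DataFrame:
    """Percentage of items 2 SDs below the mean, with each school's tertile."""
    flags = school_flags(responses, school)
    pct_2sd_below = 100 * flags["prop_2sd_below"]        # 0-100 % per school
    # Rank schools by their mean mark across the items they used, then split into tertiles
    school_mean_mark = responses.groupby(school).mean().mean(axis=1)
    tertile = pd.qcut(school_mean_mark, 3, labels=["bottom", "middle", "top"])
    return pd.DataFrame({"pct_items_2sd_below": pct_2sd_below, "tertile": tertile})

# Example use: contrast the bottom and top tertiles of schools
# summary = tertile_comparison(responses, school)
# print(summary.groupby("tertile")["pct_items_2sd_below"].describe())
```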
The main goal of this analysis was not to provide a precise comparison – this was not possible because schools did not sit exactly the same items – but to explore whether differences between schools could be explained by some schools exhibiting much higher rates of incorrect responses across a range of domains. In addition, this relatively straightforward analysis can be reproduced by medical schools for internal evaluation and to address student queries, without requiring advanced statistical knowledge or significant researcher time. We chose 2 SDs as the cutoff because it generally indicated a notably lower score than that of the average school. The observed variance may then reflect differences in teaching approaches and curricula between medical schools, or genuine differences in student competence.