A descriptive, cross-sectional study was conducted. A different data analysis strategy was followed to address each research question.
Bangdiwala’s weighted statistic for ordinal data (BWN) and Bangdiwala’s agreement chart [23] were calculated for each indicator to study content validity, that is, to determine how well the indicators reflect age-related typical support needs. The BWN expresses the level of agreement among judges (by judges, we refer to the teachers who categorized each indicator using the rating scale) for each indicator. In other words, the focus was not on inter-judge consistency as such, but on the strength of the judges’ agreement about how each indicator should be categorized (e.g., judges may agree perfectly on a category other than agreement with the indicator’s description, which would constitute weak evidence of content validity for that indicator). This statistic expresses agreement strength on a scale from 0 to 1, with 0 indicating the absence of agreement and 1 the strongest agreement possible. Agreement strength can be poor (0.000 to 0.200), weak (0.201 to 0.400), moderate (0.401 to 0.600), good (0.601 to 0.800) or very good (0.801 to 1) [23].
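As a point of reference, the unweighted form of Bangdiwala’s statistic can be written from the contingency table of judges’ ratings, using the notation introduced for the agreement chart described below (n_ii for the diagonal cells, n_i+ and n_+i for the marginal totals, and k the number of rating-scale categories); the weighted version additionally credits cells close to the diagonal (partial agreement), following the weighting scheme in [23]:

\[ B = \frac{\sum_{i=1}^{k} n_{ii}^{2}}{\sum_{i=1}^{k} n_{i+}\, n_{+i}} . \]

Because the weights are non-negative, the weighted statistic is always at least as large as the unweighted one, and both equal 1 only when all observations fall on the diagonal.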
One advantage of the BWN is its graphical approach, which allows researchers to represent the distribution of agreement as a complement to the statistic itself. Bangdiwala’s agreement chart provides a representation of the agreement among judges based on a contingency table. The chart is built as an n × n square, where n is the total sample size. Black squares, each measuring n_ii × n_ii, show the observed agreement. The black squares lie within larger rectangles, each sized n_i+ × n_+i; these rectangles show the maximum possible agreement given the marginal totals. Partial agreement is determined by including a weighted contribution from the cells outside the diagonal and is represented in the chart with shaded rectangles whose sizes are proportional to the sum of the frequencies of those cells [23]. Analyses involving content validity were conducted using the software R v.3.4.2 [24].
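A minimal sketch of how such an analysis can be reproduced in R is given below. It relies on the agreementplot() function of the vcd package, which draws Bangdiwala’s agreement chart and returns the unweighted and weighted agreement statistics; the vcd package and the toy contingency table are illustrative assumptions, not necessarily the exact routine used in the original analysis.

```r
## Illustrative sketch only: the vcd package and the toy table below are
## assumptions, not necessarily the routine used in the original analysis.
library(vcd)

## Toy 5 x 5 contingency table for one indicator: rows and columns are the
## rating-scale categories assigned by two sets of judges.
ratings <- matrix(
  c(12,  3,  0,  0,  0,
     2, 15,  4,  0,  0,
     0,  3, 18,  2,  0,
     0,  0,  1, 10,  3,
     0,  0,  0,  2,  9),
  nrow = 5, byrow = TRUE,
  dimnames = list(JudgesA = 1:5, JudgesB = 1:5)
)

## agreementplot() draws the chart (black squares = observed agreement,
## shaded rectangles = weighted partial agreement) and invisibly returns
## the agreement statistics.
stats <- agreementplot(as.table(ratings))
stats$Bangdiwala            # unweighted statistic
stats$Bangdiwala_Weighted   # weighted statistic (BWN), between 0 and 1
```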
The many-facet Rasch measurement (MFRM) model was used to assess the appropriateness of the rating scale used by teachers to express their agreement with the indicators’ descriptions. The MFRM model is commonly used for performances evaluated with subjective ratings (e.g., speaking assessments), and it permits researchers to obtain, on a common logit scale, estimates of the parameters of the components of every facet involved in the construct evaluation [25]. In construct assessments based on judges’ evaluations, such as those used in this study, the influence of judges’ severity or leniency on the evaluation scores, as well as the difficulty of the tasks evaluated, has been highlighted, with the judges and tasks being treated as facets of the construct assessment [26].
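In the formulation assumed here, a rating-scale parameterization with items and judges as facets (the original analysis in Facets may include additional specifications), the model expresses the probability that judge j assigns category k rather than k − 1 of the rating scale to item i as

\[ \log\!\left(\frac{P_{ijk}}{P_{ij(k-1)}}\right) = \beta_i - \alpha_j - \tau_k , \]

where β_i is the location of item i on the common logit scale, α_j is the severity of judge j, and τ_k is the Rasch–Andrich threshold between categories k − 1 and k.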
The indicators of the list and the teachers were considered facets of the construct evaluation along a logit scale representing the “age-related typical support needs” construct (for the rating scale-related analyses and results, the terms “judge” and “item” are used instead of “teachers” and “indicators”, given that this terminology is more common in the MFRM literature). The analysis of the rating scale focused on how useful the scale, developed so that judges could rate each item’s accuracy in describing age-related typical support needs, was for the Spanish context. The aim was to determine whether the 5-category rating scale worked properly, using a strong logistic model suited to assessing the quality of instruments that collect subjective ratings, as the list of indicators does. Nevertheless, before drawing any conclusions about the rating scale’s functioning, it was necessary to ascertain the facets’ adjustment to the MFRM model (based on the estimates of their parameters on the common scale). To consider the facets adjusted to the MFRM model, four estimates need to be examined: SD, separation, strata, and reliability. Items’ adjustment is indicated by a high SD, separation > 1, strata > 2 and reliability > 0.80, whereas judges’ adjustment to the model requires a low SD, low separation, strata < 2 and low reliability [25]. Evidence of the facets’ misfit would add noise, and no interpretation of the rating scale should then be undertaken [27]. Hence, to assess evidence of the rating scale’s functioning, it was first necessary to analyze the facets’ adjustment to the model and only then assess whether the rating scale was working. To analyze the rating scale’s adjustment, the Rasch–Andrich thresholds (τ) were calculated. For a polytomous rating scale (as in the teachers’ survey used in this study), the τ values are understood as local dichotomies between adjacent Likert-scale steps [26]. The rating scale fits the MFRM model only if the τ values exhibit a rising, monotonic progression [25].
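The separation-related criteria listed above are linked by standard Rasch relationships (the cut-offs applied are those reported in [25]): the separation index G is the ratio of the adjusted (true) standard deviation of a facet’s parameter estimates to the root mean square of their standard errors, and the number of statistically distinct strata and the separation reliability both follow from G,

\[ G = \frac{SD_{\text{true}}}{RMSE}, \qquad \text{strata} = \frac{4G + 1}{3}, \qquad \text{reliability} = \frac{G^{2}}{1 + G^{2}} . \]

For example, a reliability of 0.80 corresponds to G = 2 and therefore to 3 strata, which is why the item criteria of reliability > 0.80 and strata > 2 tend to be met together.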
The MFRM modeling process is iterative. Thus, if the data (facets and/or rating scale) show poor adjustment to the model, researchers can test where the problem may lie (e.g., if the problem involves the judges’ facet, extreme cases can be removed) and conduct additional estimations to test whether the data can be adjusted to the model [25]. The facets’ and rating scale’s adjustment to the MFRM model, the facets’ distributions along the common logit scale and the probability curves of the rating-scale categories were analyzed and are reported in the results section (see Section 3.2). Facets software v.3.71.3 [27] was used to answer this research question.