Test equating: Methodology (31)
Chair: Michelle Langer, Thursday 23rd July, 14.55 - 16.15, Uppercroft, School of Pythagoras.
Thomas Benton, National Foundation for Educational Research, Slough, UK. A practical criterion for choosing the most appropriate equating method in the single group design. (6)
Kyung (Chris) Han, Graduate Management Admission Council, Craig S. Wells and Ronald K. Hambleton, University of Massachusetts Amherst, USA. Impact of item parameter drift on pseudo-guessing parameter estimates and test equating. (18)
Brad Ching-Chao Wu, Pearson VUE, USA. Evaluating equating error for random group design with unequal sample sizes. (16)
Peter van Rijn and A. A. Béguin, Arnhem, The Netherlands. Test equating using prior information on populations. (95)
ABSTRACTS
A practical criterion for choosing the most appropriate equating method in the single group design. (6)
Thomas Benton
A vast array of potential methods has been developed over the years to enable scores on any two tests to be equated. However, there is little practical guidance on how to decide which method is best in a given situation. In particular, there are no generally agreed-upon statistical criteria for calculating the overall accuracy of different methods. The aim of this paper is to show that it is possible to develop criteria that enable us to identify which of the plethora of potential methods is most appropriate for equating any pair of tests. Previous research in this area has tended to focus on situations where the true equating is known and so has not provided a useful criterion for practical situations. Two possible criteria are developed and demonstrated that take account of both the systematic and random error of each possible method. Simulation studies are performed demonstrating that using these criteria allows the identification of the most appropriate equating function in the majority of cases.
Impact of item parameter drift on pseudo-guessing parameter estimates and test equating. (18)
Kyung (Chris) Han, Craig S. Wells and Ronald K. Hambleton
In item response theory test equating with the mean-sigma method, any changes in the c-parameter estimates of anchor test items estimated separately in the two groups of examinees are not taken into account directly. Even with the TCC equating method, two problems remain unsolved: (a) a change in the lower asymptotes of TCCs due to the c-parameter estimates may be difficult to capture, since the scaling coefficients are based on a limited range of scores on the theta scale, and (b) the computed scaling coefficients have no impact on the c-parameter estimates of the test items. The main research questions in this study concerned how serious the consequences would be if c-parameter estimates were not transformed in the equating procedure when item parameter drift (IPD) was present. The results from a series of Monte Carlo simulation studies under 96 conditions (4 IPD amounts × 3 sample sizes × 4 estimation strategies × 2 scaling methods) showed that the newly proposed calibration strategies, in which the c-parameters were placed on the common scale across the test forms, resulted in more robust performance against IPD. The practical effectiveness and theoretical importance of the newly proposed approaches to equating are discussed in the paper.
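As a minimal sketch of the point made at the start of this abstract, the standard mean-sigma transformation can be written as follows. The function names and the anchor-item values are hypothetical and are not taken from the paper; the sketch only shows that the scaling coefficients A and B are computed from the b-parameters of the anchor items and that the c-parameter passes through unchanged.

```python
from statistics import mean, pstdev

def mean_sigma(b_from, b_to):
    """Mean-sigma scaling coefficients (A, B) such that
    theta_to = A * theta_from + B, using anchor-item difficulties only."""
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B

def transform_item(a, b, c, A, B):
    """Place 3PL item parameters on the target scale.
    The lower asymptote c is not rescaled by A or B."""
    return a / A, A * b + B, c

# Hypothetical anchor-item difficulties from two separate calibrations.
b_new_form = [-1.0, 0.0, 1.0]
b_old_form = [-1.5, 0.5, 2.5]

A, B = mean_sigma(b_new_form, b_old_form)
a_star, b_star, c_star = transform_item(1.2, 0.0, 0.2, A, B)
print(A, B, (a_star, b_star, c_star))
```

Because `c` is returned as it went in, any drift in the c-parameter estimates between the two calibrations is invisible to this transformation, which is the limitation the abstract describes.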
Evaluating equating error for random group design with unequal sample sizes. (16)
Brad Ching-Chao Wu
The effect of sample size on the quality of equating has been researched extensively in the psychometric literature, with the common conclusion that small samples are subject to larger equating error. Most studies evaluated equating errors of real or hypothetical tests with various sample sizes. Sample sizes across the forms equated were, however, controlled to be roughly the same. This study examines a scenario where this condition does not hold. In practice, the two forms in the equating process may have quite different sample sizes. Several unequal sample scenarios (from a ratio of 10:9 to 10:1) are simulated and standard errors of equating are evaluated using the bootstrap method. A real case based on a random group design with a sample ratio of approximately 40:1 is also investigated. Standard errors of raw scores increase as the sample size ratio widens, and the increase becomes greater the wider the ratio. This study shows that equating error is affected not only by the sample size but also by the sample size ratio of the two forms. The pattern of equating error across the various sample ratio scenarios can be used as a reference when designing an equating plan in which equal samples from the two test forms cannot be attained.
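The bootstrap evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: linear equating is assumed as the equating method, the scores are simulated from a normal distribution with hypothetical parameters, and the sample sizes (1000 against 900, 500 and 100) are chosen only to mimic ratios from 10:9 to 10:1. Note that in this sketch the total sample size also shrinks as the ratio widens.

```python
import random
from statistics import mean, stdev

def linear_equate(x, scores_x, scores_y):
    """Linear equating of a form X score x onto the form Y scale."""
    slope = stdev(scores_y) / stdev(scores_x)
    return mean(scores_y) + slope * (x - mean(scores_x))

def bootstrap_se(x, scores_x, scores_y, reps=200, seed=0):
    """Bootstrap standard error of the equated score at x.
    Each form is resampled independently, as in a random groups design."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        boot_x = rng.choices(scores_x, k=len(scores_x))
        boot_y = rng.choices(scores_y, k=len(scores_y))
        estimates.append(linear_equate(x, boot_x, boot_y))
    return stdev(estimates)

def simulate_scores(n, rng, mu=25.0, sigma=6.0):
    """Hypothetical raw scores; mu and sigma are illustrative only."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(1)
form_x = simulate_scores(1000, rng)

se_by_n_y = {}
for n_y in (900, 500, 100):  # ratios 10:9, 10:5 and 10:1
    form_y = simulate_scores(n_y, rng)
    se_by_n_y[n_y] = bootstrap_se(25.0, form_x, form_y)

for n_y, se in se_by_n_y.items():
    print(f"n_x=1000, n_y={n_y}: bootstrap SE = {se:.3f}")
```

Resampling each form separately keeps the two groups independent, which matches the random group design named in the abstract.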
Test equating using prior information on populations. (95)
Peter van Rijn and A. A. Béguin
Test equating in educational measurement consists of statistical procedures with which the test scores of a group of students on one test form can be related to the scores of another group of students on another form by correcting for differences in the difficulty of the test forms and the ability of the populations (see Kolen & Brennan, 2004).
In this paper, a method for equating is devised for the case of recurrent high-stakes tests, such as central examinations and admission tests. Such tests are constructed periodically and are intended to measure the same ability. They can be equated over time so as to maintain the same standard by using a connected design in which a link is created, e.g. through pretesting. However, for security reasons, it can sometimes be difficult to obtain a well-connected design. A solution then is to assume that adjacent populations do not differ much in ability. However, imposing a fixed quantification of this assumption can make it difficult to expose trends in the ability of populations over time. With Bayesian estimation methods, prior information can be used for the estimation of the ability distribution of a certain population of examinees. Existing trends then remain visible, because newly gathered information can outweigh the stochastic assumption. The equating method is discussed and illustrated by an application to central examinations in secondary education in The Netherlands.
References:
Kolen, M.J. & Brennan, R.L. (2004). Test equating, scaling, and linking (2nd Ed.). New York: Springer.
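The remark in the abstract by van Rijn and Béguin that newly gathered information can outweigh a prior assumption can be illustrated with a textbook normal-normal conjugate update. This is a generic sketch and not the authors' model: the function name, the assumption of a known sampling standard deviation, and all numeric values are hypothetical.

```python
def posterior_mean_ability(prior_mean, prior_sd, sample_mean, sample_sd, n):
    """Normal-normal conjugate update for a population mean ability.
    Assumes (for illustration) that the sampling SD is known."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sample_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return post_mean, post_var ** 0.5

# Prior: this year's population resembles last year's (mean 0.0, tight SD).
# Hypothetical data suggest a higher mean ability of 0.3.
small_n_mean, _ = posterior_mean_ability(0.0, 0.1, 0.3, 1.0, n=10)
large_n_mean, _ = posterior_mean_ability(0.0, 0.1, 0.3, 1.0, n=1000)
print(small_n_mean, large_n_mean)
```

With little data the posterior mean stays close to the prior; with more data it moves towards the observed mean, so a real trend in ability is not masked by the assumption that adjacent populations are similar.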