Test equating: Cross-validation and concurrent calibration (32)
Chair: Ralph Carlson, Tuesday 21st July, 15.25 - 16.45, Lowercroft, School of Pythagoras.
Alina von Davier and Tie Liang, University of Massachusetts, Amherst, USA. The cross validation method: an alternative to the kernel method in test equating. (061)
Yih-Shan Shih, Bor-Chen Kuo, Tian-Wei Sheu and Ching-Lin Shih, Graduate Institute of Educational Measurement and Statistics, National Taichung University, Taiwan. Scale linking procedure with concurrent calibration under the three-parameter logistic testlet model. (178)
Kei Miyazaki, Takahiro Hoshino and Kazuo Shigemasu, University of Tokyo, Japan. A new concurrent calibration method for non-equivalent group design under non-random assignment and real data analysis. (138)
Koichi Osawa, Center for Japanese-Language Testing, The Japan Foundation, Tokyo, Japan. Concurrent item calibration and logistic ability scaling of 3 large-scale Japanese language assessments of CSAT, JLPT and EJU in a population of Japanese language learners in Korea. (246)
ABSTRACTS
The cross validation method: an alternative to the kernel method in test equating. (061)
Alina von Davier and Tie Liang
The development of the kernel equating (KE) method enhanced the theory of observed-score equating (Kolen, 2006). Inspired by the KE method of von Davier et al. (2004), we propose an alternative within the kernel equating framework: the cross validation (CV) method. The two methods differ in two essential ways. First, KE adjusts the Gaussian kernel density function, whereas CV does not. Second, they estimate the optimal bandwidth differently: CV maximizes a Poisson likelihood function established through cross validation between two subsamples of the raw data, whereas KE applies a penalty function. The equating functions resulting from CV and from KE without presmoothing will be compared and evaluated for different sample sizes, score ranges and score distributions of the two test forms. For simplicity, the paper considers only the random groups design. Bias, standard error of equating and equating difference will be the three criteria used to judge the performance of the two methods. Preliminary results indicate that CV outperformed KE under the studied conditions: it provided smoother and less biased equating functions and exhibited a smaller standard error of equating. To generalize the conclusions of the study, the traditional equipercentile equating method will be added for comparison with the two kernel-based methods. References: Kolen, M. J. (2006). Review of The kernel method of test equating. Psychometrika, 71(1), 211-214; von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. New York: Springer-Verlag.
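As a rough illustration of the kernel step discussed in this abstract, the sketch below continuizes two discrete score distributions with a Gaussian kernel and equates them in a random groups design. It is not the authors' implementation. The score probabilities and bandwidths are hypothetical, the function names are invented for this example, and the bandwidth is simply taken as given: neither the KE penalty function nor the CV Poisson likelihood is implemented. The `adjust` flag switches between the mean- and variance-preserving kernel of von Davier et al. (2004) and an unadjusted Gaussian kernel.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def continuized_cdf(x, scores, probs, h, adjust=True):
    """Gaussian-kernel continuization of a discrete score distribution.

    adjust=True rescales the kernel so the continuized distribution keeps
    the mean and variance of the discrete one; adjust=False uses a plain
    Gaussian kernel centred on each score point.
    """
    mean = sum(p * s for s, p in zip(scores, probs))
    var = sum(p * (s - mean) ** 2 for s, p in zip(scores, probs))
    a = math.sqrt(var / (var + h * h)) if adjust else 1.0
    return sum(
        p * normal_cdf((x - a * s - (1.0 - a) * mean) / (a * h))
        for s, p in zip(scores, probs)
    )


def equate_x_to_y(x, x_scores, x_probs, y_scores, y_probs, h_x, h_y, adjust=True):
    """Equipercentile-type equating: find y with G(y) = F(x) by bisection."""
    target = continuized_cdf(x, x_scores, x_probs, h_x, adjust)
    span = (max(y_scores) - min(y_scores)) + 10.0 * h_y
    lo, hi = min(y_scores) - span, max(y_scores) + span
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if continuized_cdf(mid, y_scores, y_probs, h_y, adjust) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical score probabilities for two five-point test forms.
x_scores = [0, 1, 2, 3, 4]
x_probs = [0.10, 0.20, 0.40, 0.20, 0.10]
y_scores = [0, 1, 2, 3, 4]
y_probs = [0.05, 0.15, 0.30, 0.30, 0.20]

for x in x_scores:
    y = equate_x_to_y(x, x_scores, x_probs, y_scores, y_probs, h_x=0.6, h_y=0.6)
    print(x, round(y, 3))
```

A bandwidth selector, whether penalty-based or cross-validated, would sit in front of `equate_x_to_y` and supply `h_x` and `h_y`.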
Scale linking procedure with concurrent calibration under the three-parameter logistic testlet model. (178)
Yih-Shan Shih, Bor-Chen Kuo, Tian-Wei Sheu and Ching-Lin Shih
One of the main goals of large-scale assessment programs, such as TOEFL and PIRLS, is to compare subjects’ competences measured with different test forms across years. Since testlet-based items are used frequently in these programs, a scale linking procedure that takes the testlet effect into account should be developed and carried out. This study proposes a scale linking procedure with concurrent calibration under the three-parameter logistic testlet model. An external anchor test design is used to control the effects of the anchor test. Through Monte Carlo simulation studies, the performance of this procedure is compared with that of a standard IRT scale linking procedure, which ignores any testlet effects contained in the data. The results show that both the latent trait and the item parameters are recovered better under the proposed procedure than under the standard IRT scale linking procedure carried out with BILOG-MG. The difference in performance between the two procedures increases as the testlet effect in the data increases. In addition, the RMSE and bias of the latent trait estimates increase as the testlet effect of the anchor test increases.
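The abstract does not state the model equation. As a minimal sketch, the code below assumes a commonly used parameterization of the three-parameter logistic testlet model, in which a person-specific testlet effect gamma is subtracted inside the logit alongside the item difficulty. The function name and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def p_correct_3pl_testlet(theta, a, b, c, gamma=0.0):
    """Probability of a correct response under an assumed 3PL testlet model.

    theta: latent trait, a: discrimination, b: difficulty, c: guessing,
    gamma: the person's effect for the testlet containing the item.
    Setting gamma=0 gives the standard 3PL model, which ignores testlets.
    """
    logit = a * (theta - b - gamma)
    return c + (1.0 - c) / (1.0 + math.exp(-logit))


# Hypothetical item: the same person with and without a testlet effect.
print(p_correct_3pl_testlet(theta=0.0, a=1.2, b=0.0, c=0.2))
print(p_correct_3pl_testlet(theta=0.0, a=1.2, b=0.0, c=0.2, gamma=0.5))
```

Under this sign convention, a positive gamma acts like extra difficulty for that person on every item in the testlet, which is the local dependence a standard IRT calibration leaves unmodeled.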
A new concurrent calibration method for non-equivalent group design under non-random assignment and real data analysis. (138)
Kei Miyazaki, Takahiro Hoshino and Kazuo Shigemasu
We propose a new item parameter linking method for the common-item nonequivalent groups design in item response theory (IRT). The scores on the tests that examinees in each group did not take can be regarded as missing data. Previous studies have assumed that examinees are randomly assigned to one of the test forms. However, examinees are frequently not randomly assigned: they may select their own test forms, or the tests they take may differ according to their abilities. In such cases, merely applying concurrent calibration or multiple-group IRT modeling without modeling test form selection behavior can yield severely biased results, as illustrated by simulation studies. To resolve this problem, we propose a model in which test form selection behavior depends on the test scores, and we provide a Monte Carlo expectation-maximization algorithm for the model. In the simulation study, the proposed method provided adequate parameter estimates, whereas the traditional method provided biased estimates of both item parameters and ability parameters. The proposed method has also been applied to real data, and meaningful results have been obtained.
Concurrent item calibration and logistic ability scaling of 3 large-scale Japanese language assessments of CSAT, JLPT and EJU in a population of Japanese language learners in Korea. (246)
Koichi Osawa
To compare the statistical properties of the Japanese section of the College Scholastic Ability Test (CSAT), the Japanese Language Proficiency Test (JLPT) and the Examination for Japanese University Admission for International Students (EJU), a monitor examination was administered to more than 2,000 learners of Japanese in Korea. Its 757 test items had previously been administered in successive main sessions of those tests. Item response data were collected through a variation of the common-item design, which made it possible to calibrate all the test items simultaneously. A two-parameter logistic model of classical item response theory was fitted to the data to estimate the item parameters and the abilities of the subjects. The results showed a bimodal total test information curve, which separated the test difficulties of CSAT, JLPT (4 grades) and EJU into roughly two levels. The mean item difficulties of CSAT, JLPT (lower 2 grades), JLPT (upper 2 grades) and EJU were -0.64, -0.08, 0.77 and 0.98, respectively. It was found that (1) CSAT was very much an entry-level test and is presumably easier than JLPT grade 3, (2) there might be a non-ignorable discrepancy in test difficulty between the lower and upper 2 grades of JLPT, and (3) EJU might be somewhat more difficult than the upper JLPT grades.