Models and designs for tests with explanatory rules for their item difficulties.

Symposium organised by Wim van der Linden, CTB/McGraw-Hill, Monterey, USA.

Chair: Wim van der Linden, Thursday 23rd July, 9.40 - 11.00, Palmerston Lecture Theatre, Fisher Building.

Caius and Gonville College, Cambridge

Johannes Hartig, Department of Educational Research Methodology, University of Erfurt, Germany; Claudia Harsch, Centre for Applied Linguistics, University of Warwick, UK; Jana Höhler, Center for Educational Quality and Evaluation, German Institute for International Educational Research. Explanatory models for item difficulties in reading and listening comprehension.

Cees Glas, Department of Research Methodology, Measurement, and Data Analysis, University of Twente, The Netherlands. MML estimation and Lagrange multiplier tests for item-cloning models.

Hanneke Geerlings, Department of Research Methodology, Measurement, and Data Analysis, University of Twente, The Netherlands; Wim van der Linden. Optimal design of tests with rule-based item generation.

Andreas Frey and Nicki-Nils Seitz, Leibniz Institute for Science Education (IPN), Kiel, Germany. Classification of individuals using multidimensional adaptive testing with feedforward.

ABSTRACTS

Introduction: Models and designs for tests with explanatory rules for their item difficulties
Wim J van der Linden
A recent development in educational and psychological testing is to specify explanatory rules for the item difficulties and to base the design, administration, and scoring of the tests directly on empirical estimates of the effects of these rules rather than on calibration of the individual items. The first paper presents results from a study in which the effects of the explanatory rules were modeled as fixed effects in a two-dimensional linear logistic model. The model was successfully applied to tests of reading and listening comprehension in English as a second language. The second paper presents a hierarchical model for item cloning with families of items based on the same explanatory rules but with additional variation in their surface characteristics. It is shown how the model can be estimated and validated using marginal maximum likelihood estimation and Lagrange multiplier tests. The last two papers address the design of rule-based tests. The first of these two papers introduces a linear structure in the hierarchical model for item cloning to explain the differences between item families and shows how optimal design principles can be used to automatically generate a fixed test from a pool of item families. The last paper addresses the case of a multidimensional ability structure and focuses on how an adaptive test design can be used to optimize classification decisions about test takers.

Explanatory Models for Item Difficulties in Reading and Listening Comprehension
Johannes Hartig, Claudia Harsch and Jana Höhler
Effects of task characteristics on item difficulties can be examined using explanatory item response models, e.g., the linear logistic test model (LLTM). Empirical effects of task characteristics can validate assumptions about the tested construct and serve to describe test scores with reference to task characteristics instead of individual tasks. This paper examines data from an assessment of reading and listening comprehension in English as a foreign language (N = 10,054). Task characteristics were defined with reference to characteristics of the reading and listening texts as well as of the questions asked about these texts. The goals of the project were to determine the strongest predictors of the item difficulties for the reading and listening tests and to compare the effects of the characteristics of their items. The data were analyzed with two-dimensional LLTMs incorporating the characteristics of the texts and questions. The characteristics proved to be strong predictors of the item difficulties for both the reading and the listening items. While the strongest predictor of the difficulties of the reading items was the linguistic complexity of their text, the difficulty of the listening items could be explained best by the cognitive demands of the questions they asked. Limitations of the modeling approach are discussed. Keywords: Explanatory IRT models; construct validity
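
As an illustration of the LLTM idea described in this abstract, the following minimal sketch computes each item difficulty as a weighted sum of task-characteristic effects and plugs it into a Rasch-type response function. The effect values, the design matrix, and all function names are hypothetical and are not taken from the study; the sketch also treats a single ability dimension, whereas the study used two-dimensional models.

```python
import math

def lltm_difficulty(q_row, eta):
    """Item difficulty as a weighted sum of task-characteristic effects."""
    return sum(q * e for q, e in zip(q_row, eta))

def p_correct(theta, difficulty):
    """Rasch-type probability of a correct response for ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical effects for two characteristics:
# [linguistic complexity of the text, cognitive demand of the question]
eta = [0.8, 0.5]

# Hypothetical design matrix: one row per item, 1 = characteristic present
Q = [[0, 0], [1, 0], [0, 1], [1, 1]]

# Difficulties follow from the characteristics, not from per-item calibration
difficulties = [lltm_difficulty(row, eta) for row in Q]
probabilities = [p_correct(0.0, b) for b in difficulties]
```

Under these assumptions, an item that combines both characteristics is harder than an item with only one of them, which is how the estimated effects can be used to describe difficulty without reference to individual tasks.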

MML Estimation and Lagrange Multiplier Tests for Item-Cloning Models
Cees A.W. Glas
In some areas of measurement, item parameters should not be modelled as fixed but as random. Examples are: item sampling, computerized item generation, surveys with substantial variability of item parameters over subgroups of respondents, measurement with substantial estimation error in the item parameter estimates, and grouping of items under a common stimulus or in a common context. Glas and van der Linden (2001) and Sinharay, Johnson and Williamson (2003) presented a hierarchical version of the three-parameter normal-ogive model to represent item parameter variability in such cases. These authors also presented Bayesian estimation procedures with Markov chain Monte Carlo computation for the hierarchical model. In the present paper, we consider a marginal maximum likelihood (MML) estimation procedure and show how the MML framework enables us to rigorously test the assumptions underlying the model using Lagrange multiplier test statistics. Tests of the assumptions of subpopulation invariance of the item parameters (i.e., no differential item functioning), the shape of the response functions, and three different types of conditional independence were derived. Simulation studies were used to show the feasibility of the estimation procedure and to estimate the power and Type I error rate of the model fit tests. In addition, the procedures were applied to an empirical data set. Keywords: Item cloning; MML; Lagrange multiplier test
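
The notion of random item parameters within a family can be sketched as follows. This is a simplified illustration, not the model of the paper: only the difficulty is drawn from a normal family distribution, the discrimination and guessing parameters are held fixed, and the parameterization a(theta - b) is one common convention. All names and values are hypothetical.

```python
import math
import random

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_3pno(theta, a, b, c):
    """Three-parameter normal-ogive response probability."""
    return c + (1.0 - c) * normal_cdf(a * (theta - b))

def draw_clone_difficulty(family_mean_b, family_sd_b, rng):
    """Draw one clone's difficulty from its family distribution (assumed normal)."""
    return rng.gauss(family_mean_b, family_sd_b)

# Hypothetical family: mean difficulty 0.5, within-family SD 0.3
rng = random.Random(0)
clone_bs = [draw_clone_difficulty(0.5, 0.3, rng) for _ in range(5)]

# Response probabilities of a test taker with theta = 0 on each clone
clone_ps = [p_3pno(0.0, 1.0, b, 0.2) for b in clone_bs]
```

Each clone shares the family's rules but receives its own difficulty, so response probabilities vary from clone to clone even for the same test taker.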

Optimal Design of Tests with Rule-Based Item Generation
Hanneke Geerlings and Wim J. van der Linden
The possibilities of optimal test design with automatic item generation were examined. Two different methods of item generation were addressed. The first method assumed rule-based generation of the items. Statistically, the rules were assumed to have a fixed effect on the difficulty of the items. A well-known response model for such items is the linear logistic test model. The second method was item cloning, which leads to families of items that differ only in surface features. Statistically, the items within these families are assumed to vary only randomly in difficulty. A hierarchical response model was developed that accounts for the fact that items are grouped in families created through the joint application of the two types of item generation. The main goal of the presentation is to demonstrate the use of the model in optimal test assembly. In particular, the effect of random instead of fixed item parameters on the optimization model and its solution was investigated. Keywords: Optimal design; optimal test assembly; item response theory; automated item generation.
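
To illustrate why random item parameters can change an assembly solution, the sketch below ranks item families by their expected Fisher information at a target ability, averaging over an assumed normal within-family difficulty distribution. It uses a simple greedy ranking under a Rasch-type model rather than the optimization model of the paper, and the family pool, names, and values are all hypothetical.

```python
import math

def rasch_info(theta, b):
    """Fisher information of a Rasch-type item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def expected_family_info(theta, mean_b, sd_b, n_points=41):
    """Information averaged over a normal within-family difficulty
    distribution, approximated on an equally spaced grid."""
    if sd_b == 0:
        return rasch_info(theta, mean_b)
    zs = [-4.0 + 8.0 * k / (n_points - 1) for k in range(n_points)]
    weights = [math.exp(-0.5 * z * z) for z in zs]
    total = sum(weights)
    return sum(w * rasch_info(theta, mean_b + sd_b * z)
               for w, z in zip(weights, zs)) / total

def assemble(families, theta, length):
    """Greedy pick: the families with the highest expected information."""
    ranked = sorted(
        families,
        key=lambda f: expected_family_info(theta, f["mean_b"], f["sd_b"]),
        reverse=True,
    )
    return [f["name"] for f in ranked[:length]]

# Hypothetical pool of item families
FAMILIES = [
    {"name": "A", "mean_b": 0.0, "sd_b": 1.0},
    {"name": "B", "mean_b": 0.2, "sd_b": 0.0},
    {"name": "C", "mean_b": 1.5, "sd_b": 0.0},
]
selected = assemble(FAMILIES, theta=0.0, length=2)
```

In this toy pool, family A would rank first if its mean difficulty were treated as a fixed item parameter, but its within-family variation lowers its expected information at the target ability below that of family B.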


Classification of Individuals using Multidimensional Adaptive Testing with Feedforward
Andreas Frey, Nicki-Nils Seitz
Multidimensional adaptive testing (MAT) can be used to classify individuals into two or more categories (like pass/fail) on multiple dimensions. One possible stopping criterion is to present items until the probability of an incorrect classification falls below a predefined level (e.g., 5%). However, for ability estimates near a cut-off point, the maximum number of allowed items may be presented without reaching the desired classification certainty. In order to avoid such unnecessarily long tests, a feedforward strategy can be applied, which checks whether the target level of classification certainty would be reached if the following answers were all correct or all incorrect and stops the test early if this would not be the case. In a simulation study, the measurement efficiency of MAT with and without feedforward was compared both to one-dimensional adaptive testing (CAT) and to sequential testing with a fixed item set (FIT). The efficiency was calculated as the ratio of the percentage of correct classifications to the number of items presented. The lowest efficiency was obtained for FIT (2.29). For CAT and especially for MAT, the efficiency was substantially higher. Use of MAT with the feedforward strategy led to an additional gain in efficiency. The practical consequences of incorporating MAT with a feedforward strategy relative to those for CAT and FIT will be discussed. Keywords: Multidimensional adaptive testing; item response theory; computerized adaptive testing; adaptive classification testing.
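
The feedforward check and the efficiency ratio described in this abstract can be sketched as follows. This is a simplified, one-dimensional illustration with a Rasch-type model and a grid posterior, not the multidimensional procedure of the study; the grid, the prior, and all function names are assumptions made for the example.

```python
import math

# Ability grid from -4 to 4 in steps of 0.1 (assumed for this sketch)
GRID = [-4.0 + k / 10.0 for k in range(81)]

def normal_prior():
    """Standard normal prior, normalized on the grid."""
    weights = [math.exp(-0.5 * t * t) for t in GRID]
    total = sum(weights)
    return [w / total for w in weights]

def p_correct(theta, b):
    """Rasch-type probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def update(posterior, b, correct):
    """Bayes update of the grid posterior after one response."""
    like = [p_correct(t, b) if correct else 1.0 - p_correct(t, b) for t in GRID]
    new = [w * l for w, l in zip(posterior, like)]
    total = sum(new)
    return [w / total for w in new]

def misclassification_prob(posterior, cutoff):
    """Posterior mass on the less likely side of the cut-off."""
    above = sum(w for t, w in zip(GRID, posterior) if t >= cutoff)
    return min(above, 1.0 - above)

def feedforward_stop(posterior, remaining_b, cutoff, target=0.05):
    """True if neither an all-correct nor an all-incorrect run on the
    remaining items could bring the misclassification probability
    below the target, so that continuing the test is pointless."""
    for outcome in (True, False):
        hypothetical = posterior
        for b in remaining_b:
            hypothetical = update(hypothetical, b, outcome)
        if misclassification_prob(hypothetical, cutoff) < target:
            return False
    return True

def efficiency(percent_correct, items_presented):
    """Ratio of the percentage of correct classifications to test length."""
    return percent_correct / items_presented
```

For example, with the prior centred on the cut-off and only one item left, `feedforward_stop(normal_prior(), [0.0], 0.0)` signals that the test can stop, whereas with thirty items left the target certainty is still reachable and the test continues.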