Goodness-of-fit: Item Response Theory and discrete data (42)
Chair: Alberto Maydeu-Olivares, Friday 24th July, 9.30 - 10.50, Uppercroft, School of Pythagoras.
Muhammad Khalid and Cees A. W. Glas, Department of Research Methodology, Measurement, and Data Analysis, Faculty of Behavioural Science, University of Twente, The Netherlands. Assessing model fit: A comparative study of Frequentist and Bayesian frameworks. (031)
Katrin Kraus, Department of Statistics, Uppsala University, Sweden. A new goodness-of-fit test for categorical data analysis. (216)
Darrell Bock, Center for Health Statistics, University of Illinois at Chicago, USA, and Shelby Haberman, ETS, Princeton, NJ, USA. Confidence bands for examining goodness-of-fit of estimated item response functions. (106)
Alberto Maydeu-Olivares, Carlos García-Forero, Faculty of Psychology, University of Barcelona, Spain, Harry Joe, Department of Statistics, University of British Columbia, Canada. Testing for approximate fit in IRT modelling. (152)
ABSTRACTS
Assessing model fit: A comparative study of Frequentist and Bayesian frameworks. (031)
Muhammad Khalid and Cees A. W. Glas
Item response theory (IRT) models are used to describe response behavior on psychological tests, educational assessments, and various other measurement situations in the social sciences. However, IRT models rest on strong mathematical and statistical assumptions, and the promises and potential of IRT can be realized only when these assumptions are met. The most important assumptions underlying these models are subpopulation invariance, the form of the item response function, and local stochastic independence. The fit of IRT models can be evaluated by computing residuals and the associated test statistics. Most research on IRT model-fit procedures has been developed in a frequentist framework. For instance, Lagrange multiplier (LM) tests for model fit based on residuals have been developed in the conditional maximum likelihood (CML) and marginal maximum likelihood (MML) frameworks. In the framework of LM tests, the alternative hypothesis makes explicit which assumptions the residuals target. The alternative to the frequentist paradigm is the Bayesian paradigm. The posterior predictive model-checking (PPMC) method is a widely used Bayesian model-checking tool because it has intuitive appeal, is simple to apply, has a strong theoretical basis, and can provide graphical or numerical information about model misfit. This paper examines the performance of both frequentist and Bayesian frameworks for a number of discrepancy measures that assess the above assumptions of the common IRT models, and makes specific recommendations about which measures are most useful in assessing model fit and which procedure is preferable.
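To make the PPMC idea concrete, here is a minimal sketch of a posterior predictive p-value. It assumes a Rasch (1PL) model, a hypothetical item-total discrepancy measure, and a list of posterior draws `(thetas, bs)` obtained elsewhere (e.g. from an MCMC sampler); none of these choices come from the paper, whose discrepancy measures are not specified in the abstract. For each draw, a replicated data set is simulated and its discrepancy is compared with the realized one; the p-value is the share of draws in which the replicated discrepancy is at least as large.

```python
import math
import random


def rasch_prob(theta, b):
    """Rasch (1PL) probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_score_discrepancy(data, thetas, bs):
    """Hypothetical discrepancy: squared standardized differences between
    observed and expected item totals, summed over items."""
    total = 0.0
    for i, b in enumerate(bs):
        probs = [rasch_prob(t, b) for t in thetas]
        observed = sum(row[i] for row in data)
        expected = sum(probs)
        variance = sum(p * (1.0 - p) for p in probs)
        total += (observed - expected) ** 2 / variance
    return total


def ppmc_p_value(data, posterior_draws, discrepancy, seed=0):
    """Posterior predictive p-value over the supplied posterior draws."""
    rng = random.Random(seed)
    exceed = 0
    for thetas, bs in posterior_draws:
        # Simulate a replicated data set (persons x items) under this draw.
        replicated = [[1 if rng.random() < rasch_prob(t, b) else 0 for b in bs]
                      for t in thetas]
        d_obs = discrepancy(data, thetas, bs)        # realized discrepancy
        d_rep = discrepancy(replicated, thetas, bs)  # replicated discrepancy
        exceed += d_rep >= d_obs
    return exceed / len(posterior_draws)
```

A p-value near 0 or 1 signals that the model does not reproduce the aspect of the data captured by the discrepancy; because the discrepancy may depend on the parameters, it is recomputed for the observed data under every draw.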
A new goodness-of-fit test for categorical data analysis. (216)
Katrin Kraus
We propose a new goodness-of-fit test for multinomially distributed random variables. The test is suitable in situations with large numbers of response patterns, a situation that is common in ordinal and categorical data analysis. The test statistic is asymptotically chi-square distributed. In contrast to Pearson's goodness-of-fit statistic, the proposed statistic is not adversely affected by individual observed response patterns that have very low model-implied frequencies. Hence, the statistic is not inflated for small sample sizes, and empirical Type-I error rates are not excessive. Simulation studies indicate that the proposed test performs relatively well in situations with high dimensionality and small sample sizes. The use of the fit statistic is illustrated with an example from factor analysis of ordinal data.
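The weakness of Pearson's statistic described above can be seen in a small sketch. This computes Pearson's X² = Σ (O − E)² / E over response patterns; it is not the proposed test, whose form the abstract does not give. The counts below are invented purely for illustration.

```python
def pearson_contributions(observed, expected):
    """Per-pattern contributions (O - E)^2 / E to Pearson's X^2."""
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]


def pearson_x2(observed, expected):
    """Pearson's goodness-of-fit statistic summed over response patterns."""
    return sum(pearson_contributions(observed, expected))


# Illustrative counts (N = 100): one pattern is observed once although the
# model implies an expected frequency of only 0.01.
observed = [50, 30, 19, 1]
expected = [50.0, 30.0, 19.99, 0.01]
contributions = pearson_contributions(observed, expected)
```

Here the single rare pattern contributes about 98.01 to a total of about 98.06, so one response pattern dominates the statistic. With many possible patterns and a small sample, such low-frequency cells are unavoidable, which is the situation the proposed test is designed for.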
Confidence bands for examining goodness-of-fit of estimated item response functions. (106)
Darrell Bock and Shelby Haberman
Tests of fit presently in use for estimated item response functions are based on chi-square or likelihood-ratio statistics that depend on assigning the respondents' estimated IRT scale scores to successive intervals over the sample range. These procedures are not entirely satisfactory, however, because they are influenced by the arbitrary choice of intervals and by the fallibility of the scores. A method of examining fit that avoids these difficulties, and also reveals the location of any statistically significant deviation from fit, can be derived from the principles of EM maximum marginal likelihood estimation of item parameters. Essentially, it constructs an empirical response function from the complete-data statistics for the number of correct responses and the number of respondents at given points on the latent continuum. Confidence intervals at these points are placed on the residual differences between the empirical function and the expected function, provided the latter has been estimated previously in a sample large enough for it to be considered fully specified. Standard errors of the residuals are easily computed and yield the confidence intervals at the points. The procedure is computationally efficient and leads to readily interpretable graphical displays. Results for binary scored items are especially tractable, but extension to polytomously scored items and their response functions is possible.
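A minimal sketch of the residual-band idea for a binary item follows. It assumes a two-parameter logistic (2PL) response function with known (previously estimated) parameters, E-step quantities `n_k` (expected number of respondents at quadrature point k) and `r_k` (expected number of correct responses there), and a binomial standard error sqrt(P(1 − P) / n_k). The 2PL form, the standard-error formula, and all names are assumptions for illustration, not the authors' stated formulas.

```python
import math
from dataclasses import dataclass


@dataclass
class BandPoint:
    theta: float      # quadrature point on the latent continuum
    empirical: float  # r_k / n_k
    expected: float   # fitted response function at theta
    residual: float   # empirical - expected
    lower: float      # lower confidence limit of the residual
    upper: float      # upper confidence limit of the residual

    @property
    def misfit(self) -> bool:
        # The interval excludes zero: a significant local deviation from fit.
        return self.lower > 0.0 or self.upper < 0.0


def two_pl(theta, a, b):
    """Assumed 2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def residual_bands(points, n_k, r_k, a, b, z=1.96):
    """Residual confidence intervals at each quadrature point."""
    bands = []
    for theta, n, r in zip(points, n_k, r_k):
        p = two_pl(theta, a, b)
        empirical = r / n
        residual = empirical - p
        se = math.sqrt(p * (1.0 - p) / n)  # assumed binomial standard error
        bands.append(BandPoint(theta, empirical, p, residual,
                               residual - z * se, residual + z * se))
    return bands
```

Plotting `lower` and `upper` against `theta` gives the graphical display; points flagged by `misfit` show where along the continuum the fitted function departs from the data.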
Testing for approximate fit in IRT modelling. (152)
Alberto Maydeu-Olivares, Carlos García-Forero and Harry Joe
Given the large number of degrees of freedom involved in IRT applications, almost any model fitted in a realistic application is likely to be rejected by a test of exact fit. Following in the footsteps of Browne and Cudeck (1993), we propose a Root Mean Squared Error of Approximation (RMSEA) for multivariate multinomial data. Although the approach presented here is completely general, we focus on its application to IRT models. Asymptotic methods can be used to obtain confidence intervals for the RMSEA, and hence tests of close fit. We show that the asymptotic approximation works well in practice.
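As a rough illustration, the sketch below computes an RMSEA point estimate from an asymptotically chi-square fit statistic using the familiar Browne and Cudeck style form sqrt(max((X − df) / (N · df), 0)). Which statistic the authors use, and whether N or N − 1 appears in the denominator, is not stated in the abstract; both are assumptions here.

```python
import math


def rmsea(statistic: float, df: int, n: int) -> float:
    """RMSEA point estimate from a chi-square-type fit statistic.

    Assumes sqrt(max((X - df) / (N * df), 0)); some authors use N - 1.
    """
    if df <= 0 or n <= 0:
        raise ValueError("df and n must be positive")
    # Subtracting df removes the statistic's expectation under exact fit;
    # dividing by N * df expresses misfit per degree of freedom per respondent.
    return math.sqrt(max((statistic - df) / (n * df), 0.0))


estimate = rmsea(150.0, 100, 1000)
```

For a statistic of 150 on 100 degrees of freedom with N = 1000, the estimate is about 0.022. A confidence interval would additionally require inverting the noncentral chi-square distribution of the statistic, which this sketch omits.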