Classification and clustering (14)
Chair: Victor Wilson, Tuesday 21st July, 15.25 - 16.45, Castlereagh Room, Fisher Building.
Tomoki Tokuda, Iven Van Mechelen and Francis Tuerlinckx, Department of Psychology, University of Leuven, Belgium. Bayesian mixture modeling with variable selection. (133) ♥
Hye Won Suk and Heungsun Hwang, Department of Psychology, McGill University, Canada. Regularized fuzzy clusterwise ridge regression. (176) ♥
David Kaplan and Bryan Keller, Department of Educational Psychology, University of Wisconsin, USA. Cluster effects in the Latent Class Model. (247)
Ming Lei and Won-Chan Lee, College Board, New York, USA. A comparison of methods for estimating classification accuracy indices. (245)
ABSTRACTS
Bayesian mixture modeling with variable selection. (133)
Tomoki Tokuda, Iven Van Mechelen and Francis Tuerlinckx
A general problem in clustering high-dimensional data is that the presence of irrelevant variables can mask the "true" group structure; for effective clustering of observations, some form of variable selection is then essential. As a solution to this problem, Tadesse, Sha and Vannucci (2005) proposed a fully Bayesian multivariate normal mixture method that includes a procedure for variable selection. This method, however, appears to suffer from two drawbacks: first, it is not scale-invariant (i.e., changing the unit of one or more variables may influence the results); second, its results are sensitive to the number of irrelevant variables. These drawbacks may considerably hamper the use of the method in practice. In this talk, we propose some modifications of the method to deal with these drawbacks. The main idea is to make the method hierarchical by introducing hyperpriors for some parameters and to apply it to a suitably preprocessed form of the data. In an intensive simulation study, our modified version will be shown to outperform both the original Tadesse method and the method that performed best in an extensive comparative simulation study of clustering methods with variable selection by Steinley & Brusco (2008).
Regularized fuzzy clusterwise ridge regression. (176)
Hye Won Suk and Heungsun Hwang
Fuzzy clusterwise regression has been a useful method for investigating cluster-level heterogeneity of individuals based on linear regression. This method integrates fuzzy clustering and ordinary least-squares regression, thereby enabling the simultaneous estimation of regression coefficients for each cluster and fuzzy cluster memberships of individuals. In practice, however, fuzzy clusterwise regression may suffer from multicollinearity because it builds on ordinary least-squares regression. To deal with this problem, a new method, called regularized fuzzy clusterwise ridge regression, is proposed that combines ridge regression with regularized fuzzy clustering in a unified framework. In the proposed method, ridge regression is adopted to estimate cluster-wise regression coefficients while handling potential multicollinearity within each cluster. In addition, regularized fuzzy clustering based on maximizing entropy is used to systematically determine an optimal degree of fuzziness in memberships. The usefulness of the proposed method is illustrated by an application concerning the relationship between the characteristics of used cars.
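The alternation between membership estimation and cluster-wise ridge regression can be illustrated with a minimal sketch. The abstract does not give the objective function or the update rules, so everything below is an assumption made for illustration: a single predictor without intercept, an entropy-regularized membership update of softmax form with a hypothetical fuzziness parameter `gamma`, and a ridge penalty `lam`. It is not the authors' algorithm.

```python
import math

def fuzzy_clusterwise_ridge(x, y, init_slopes, lam=0.1, gamma=1.0, n_iter=50):
    """Alternate fuzzy membership and weighted ridge updates (illustrative only).

    Assumed model: y_i = b_k * x_i within cluster k (one predictor, no intercept).
    lam   : ridge penalty on each slope (hypothetical name)
    gamma : weight of the entropy term; larger values give fuzzier memberships
    """
    slopes = list(init_slopes)
    memberships = []
    for _ in range(n_iter):
        # Membership step: entropy regularization leads to a softmax over
        # the negative squared residuals of each cluster's regression line.
        memberships = []
        for xi, yi in zip(x, y):
            sq = [(yi - b * xi) ** 2 for b in slopes]
            smallest = min(sq)  # subtracted for numerical stability
            w = [math.exp(-(s - smallest) / gamma) for s in sq]
            total = sum(w)
            memberships.append([wk / total for wk in w])
        # Regression step: membership-weighted ridge estimate of each slope.
        for k in range(len(slopes)):
            num = sum(u[k] * xi * yi for u, xi, yi in zip(memberships, x, y))
            den = sum(u[k] * xi * xi for u, xi in zip(memberships, x)) + lam
            slopes[k] = num / den
    return slopes, memberships
```

With several predictors the regression step would become a weighted ridge solve per cluster; the one-predictor form is used here only to keep the sketch free of external libraries.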
Cluster effects in the Latent Class Model. (247)
David Kaplan and Bryan Keller
A simulation study was conducted to determine the effect of ignoring multilevel data structure in a three-class latent class analysis on commonly used statistical fit indices. Design conditions included variation of sample size, intraclass correlations, and latent class membership size. For each experimental condition, two analyses were carried out: one for the population model that was used to generate the data and another for the misspecified model, which was identical except that it ignored the multilevel nature of the data. The misspecified model was created by removing the continuous group-level factor from the population model. This resulted in the elimination of the two parameters that control intraclass correlations. The Latent GOLD software program (Vermunt and Magidson 2005) was used to generate data for each condition and estimate parameters for the corresponding models. The outcomes of interest in this study were the differences between the population model and the misspecified model on each of the following: the Akaike Information Criterion (AIC), the Akaike Information Criterion 3 (AIC3), the Bayesian Information Criterion (BIC), and the entropy R-squared. Results indicate a clear effect of ignoring the multilevel structure of the data. Guidelines for practice are provided.
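For reference, the three information criteria compared in the study can be computed from a model's log-likelihood as sketched below. The formulas are the standard textbook definitions (the abstract does not state them), and the function and argument names are hypothetical. The entropy R-squared is omitted because its exact definition in the study is not given.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, AIC3 and BIC under their standard definitions.

    log_likelihood : maximized log-likelihood of the fitted model
    n_params       : number of freely estimated parameters
    n_obs          : sample size entering the BIC penalty
    """
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2 * n_params,                 # penalty of 2 per parameter
        "AIC3": deviance + 3 * n_params,                # penalty of 3 per parameter
        "BIC": deviance + n_params * math.log(n_obs),   # penalty of ln(n) per parameter
    }
```

Because the misspecified model drops two parameters, its penalty terms are smaller; it is preferred by a given criterion only if the accompanying loss in log-likelihood is small enough to be offset by that reduction.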
A comparison of methods for estimating classification accuracy indices. (245)
Ming Lei and Won-Chan Lee
Classification accuracy measures the extent of agreement between the classifications using observed cut scores and the true classifications using known true cut scores. Different combinations of how the scores are obtained and how the score distributions are estimated yield different methods for estimating classification accuracy. The purpose of the study is to compare eight estimation methods using real data and simulations with the 3PL model. The real data come from examinees’ responses to a large-scale test with three cut scores. The simulation study will use the same instrument and item parameters. Only one cut score, 13 different cut points, and three true score distributions will be considered (13 x 3 conditions). The population classification accuracy indices can be obtained using the c.d.f. probabilities from the population distribution and the proportions of examinees from item responses of 100,000 random samples. Sampling of item response sets of size 1000 will be replicated 100 times. These samples will be used to estimate classification accuracy using the eight methods. Summary statistics will be computed for each simulation condition and estimation method and compared with the population classification accuracy indices.
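The notion of classification accuracy under a 3PL model can be illustrated with a small simulation sketch. This is not one of the eight estimation methods compared in the study. It assumes, purely for illustration, standard normal abilities, the logistic 3PL form without a scaling constant, a single cut applied to both the expected (true) and the observed number-correct score, and hypothetical item parameters supplied by the caller.

```python
import math
import random

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic form, no scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def simulate_accuracy(items, cut, n_examinees=1000, seed=0):
    """Proportion of simulated examinees whose observed and true classifications agree.

    items : list of (a, b, c) item parameter tuples (hypothetical values)
    cut   : cut score on the number-correct scale
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # assumed ability distribution
        probs = [p_3pl(theta, a, b, c) for a, b, c in items]
        true_score = sum(probs)  # expected number-correct score
        observed = sum(1 for p in probs if rng.random() < p)
        if (true_score >= cut) == (observed >= cut):
            agree += 1
    return agree / n_examinees
```

Calling `simulate_accuracy` repeatedly with different seeds would give a rough sense of the sampling variability of the index under these assumptions.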