Finding and visualizing hierarchical cluster structure as an alternative to cognitive diagnosis models
Invited talk by Rebecca Nugent, Department of Statistics, Carnegie Mellon University, USA.
Chair: David Kaplan, Wednesday 22nd July, 14.30 - 15.15, Uppercroft, School of Pythagoras
The goal of clustering is to identify distinct groups in a population and assign a group label to each observation. To cast clustering as a statistical problem, we regard the data as a sample from an unknown density p(x). To generate clusters, we estimate features of p(x) with either parametric (model-based) or nonparametric methods. In model-based clustering, we assume that groups in the population correspond to mixture components of the density estimate; in nonparametric clustering, they correspond to the density estimate's modes. In contrast, the algorithmic approach to clustering (e.g. linkage methods, spectral clustering) applies an algorithm, often based on a distance measure, to data in m-dimensional space, and clusters are extracted heuristically. We propose to combine the strengths of these approaches to visualize the (possibly hierarchical) cluster structure, offering educational data mining a faster alternative to commonly used cognitive diagnosis models (e.g. DINA).
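As a rough illustration of the nonparametric view, the sketch below clusters one-dimensional data by the mode each observation climbs to on a kernel density estimate. It is a minimal example under stated assumptions, not the method presented in the talk: the Gaussian kernel, the bandwidth h, the grid-based hill climb, and the toy scores are all choices made here for illustration.

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h (assumed kernel)."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

def mode_labels(data, h, grid_size=200):
    """Label each observation by the density mode it climbs to on a grid."""
    lo, hi = min(data) - 3 * h, max(data) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    dens = [kde(g, data, h) for g in grid]

    def climb(i):
        # Step to the highest neighbouring grid cell until none is higher.
        while True:
            best = i
            for j in (i - 1, i + 1):
                if 0 <= j < grid_size and dens[j] > dens[best]:
                    best = j
            if best == i:
                return i
            i = best

    modes = {}   # grid index of a mode -> cluster label
    labels = []
    for xi in data:
        start = min(range(grid_size), key=lambda k: abs(grid[k] - xi))
        peak = climb(start)
        labels.append(modes.setdefault(peak, len(modes)))
    return labels

# Hypothetical skill scores forming two well-separated groups.
scores = [0.10, 0.15, 0.20, 0.25, 0.80, 0.85, 0.90, 0.95]
labels = mode_labels(scores, h=0.1)
print(labels)
```

A model-based analysis of the same data would instead fit a mixture density and label each observation by its most probable component; the number of modes found here depends on the chosen bandwidth.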
After an overview of common clustering methods, we propose using a linkage algorithm with a minimum density similarity measure to visualize the hierarchical structure of the modes (or components) of the density estimate. The resulting dendrogram can then be used as a tool to subjectively prune or merge clusters. Time permitting, we will introduce "component trees", a tool designed to augment model-based clustering by visualizing the hierarchical structure of the components of the mixture model. Component trees provide evidence for whether the number of components overestimates the number of groups in the population and, if so, which components should be merged to estimate the groups. In our motivating examples, we use data from an online mathematics tutor to estimate skill set profiles (i.e. cluster students by skill ability) over time.
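The linkage idea can be sketched as follows. The abstract does not define the similarity measure, so this example assumes one plausible reading: the similarity of two candidate cluster centres is the lowest value of a kernel density estimate on the segment between them, and clusters are merged single-linkage style, most similar pair first. The kernel, the bandwidth, the function names, and the toy data are hypothetical.

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h (assumed kernel)."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

def min_density(a, b, data, h, steps=50):
    """Lowest estimated density sampled on the segment from a to b."""
    return min(kde(a + (b - a) * t / steps, data, h) for t in range(steps + 1))

def density_linkage(points, data, h):
    """Merge clusters in order of decreasing similarity; return the merge list."""
    n = len(points)
    sim = {(i, j): min_density(points[i], points[j], data, h)
           for i in range(n) for j in range(i + 1, n)}
    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster similarity is the best pairwise value.
                s = max(sim[min(i, j), max(i, j)]
                        for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), s))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Hypothetical data: two nearby groups and one distant group.
data = [0.15, 0.20, 0.25, 0.35, 0.40, 0.45, 0.85, 0.90, 0.95]
centres = [0.20, 0.40, 0.90]   # assumed modes, indexed 0, 1, 2
merges = density_linkage(centres, data, h=0.05)
for left, right, level in merges:
    print(left, right, round(level, 4))
```

Each merge level is the height at which two branches of the dendrogram join: a high level means the clusters are separated only by a shallow valley in the density and are candidates for merging, while a low level suggests distinct groups.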
Joint work with Nema Dean, Department of Statistics, University of Glasgow.