Exploratory data analysis (46)
Chair: Henk Kiers, Wednesday 22nd July, 15.25 - 16.45, Boys Smith Room, Fisher Building.
Michel van de Velden, Econometric Institute, Erasmus University, Rotterdam, The Netherlands, and Yoshio Takane, Department of Psychology, McGill University, Montreal, Canada. Generalized canonical correlation analysis with missing values. (132)
P.J.F. Groenen, Econometric Institute, Erasmus University, Rotterdam, The Netherlands, J.C. Gower, The Open University, Milton Keynes, UK, and M. van de Velden, Econometric Institute, Erasmus University, Rotterdam, The Netherlands. Area biplots. (191)
Carolin Strobl, Department of Statistics, Ludwig-Maximilians-Universität, Munich, Germany. Unbiased measures of variable importance for large data sets. (7) ♥
Ali Uenlue and Waqas Ahmed Malik, Institute of Mathematics, University of Augsburg, Germany. Interactive graphical exploration of psychometric multivariate data using glyph representations. (147)
ABSTRACTS
Generalized canonical correlation analysis with missing values. (132)
Michel van de Velden and Yoshio Takane
We propose two new algorithms that allow generalized canonical correlation analysis of data matrices with missing values. The first approach, which does not require iterations, is a generalization of the Test Equating method available for principal component analysis. In the second approach, missing values are imputed in such a way that the generalized canonical correlation analysis objective function does not increase in subsequent steps. Convergence is achieved when the value of the objective function remains constant. By means of a simulation study, we assess the performance of the new methods. We compare the results with those of two available methods, the missing passive method of OVERALS and the GENCOM algorithm developed by Green and Carroll.
Area biplots. (191)
P.J.F. Groenen, J.C. Gower and M. van de Velden
Classical multivariate analysis techniques such as principal components analysis and correspondence analysis use inner products to estimate data values. Results of these techniques may be visualized by presenting row and column points jointly in a biplot, where the projection of a row point onto a column point vector, followed by multiplication by the length of the column point vector, gives the inner product that approximates the corresponding data element. In this paper, we propose a new visualization: after a 90° rotation of the row points, the area of the triangle spanned by a rotated row point, a column point and the origin approximates the corresponding data value. In contrast to the projection biplot, the areas spanned by different row and column points can be compared directly. This property makes the area biplot unique. Therefore, the area biplot is particularly useful for the analysis of a data matrix where all elements can be compared. The area biplot makes it easy to see that, similarly to inner products, higher dimensional area solutions can be represented by summing areas over subsequent pairs of dimensions. Here, the area biplot is developed for principal components analysis, correspondence analysis, and for interaction biplots, but it has general applicability.
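The link between areas and inner products described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a two-dimensional principal components biplot obtained from an SVD of random centred data, a clockwise 90° rotation, and signed triangle areas. The function names and the scaling of the row and column coordinates are illustrative choices. Under these assumptions, twice the signed area equals the inner product of the original row and column points.

```python
import numpy as np

def rotate_clockwise_90(points):
    """Rotate 2-D points (one per row) by 90 degrees clockwise: (x, y) -> (y, -x)."""
    points = np.asarray(points, dtype=float)
    return np.column_stack([points[:, 1], -points[:, 0]])

def signed_triangle_areas(rows, cols):
    """Signed area of the triangle (origin, rotated row point i, column point j)."""
    r = rotate_clockwise_90(rows)
    c = np.asarray(cols, dtype=float)
    # Half the 2-D cross product of the rotated row point and the column point.
    return 0.5 * (np.outer(r[:, 0], c[:, 1]) - np.outer(r[:, 1], c[:, 0]))

# Illustrative data: a rank-2 PCA biplot of a small column-centred matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X -= X.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
rows = U[:, :2] * s[:2]   # row points (assumed scaling: singular values on the rows)
cols = Vt[:2].T           # column points

inner = rows @ cols.T                       # projection-biplot approximation of X
areas = signed_triangle_areas(rows, cols)   # area-biplot quantities
print(np.allclose(2.0 * areas, inner))
```

The factor of two is a constant and can be absorbed into the scaling of either set of points, so areas and inner products carry the same information in this two-dimensional setting.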
Unbiased measures of variable importance for large data sets. (7)
Carolin Strobl
Random forests (RF, Breiman, 2001) are a nonparametric regression approach based on ensembles of classification and regression trees (CART, Breiman et al., 1984). The main reasons for the popularity of RF in genetics and related fields are (i) that they can be applied to data sets with large numbers of predictor variables including potentially complex interactions and (ii) that they provide measures of variable importance capturing those interactions (see, e.g., Lunetta et al., 2004). While the original RF variable importance measures have been shown to be biased, recent improvements can be used to reliably compare, e.g., variables of different scales of measurement (Hothorn et al., 2006; Strobl et al., 2007) as well as correlated predictor variables (Strobl et al., 2008), which may prove particularly important in the growing number of applications of RF in psychology. After a short introduction and comparison to more well-established approaches such as dominance analysis (Azen and Budescu, 2003), the talk gives hands-on advice on how to use and interpret RF variable importance measures sensibly and safely. The described methods are freely available in the R system for statistical computing (R Development Core Team, 2009).
References: Azen, R. and D. V. Budescu (2003). The dominance analysis approach for comparing predictors in multiple regression. Psychological Methods 8 (2), 129–48; Breiman, L. (2001). Random forests. Machine Learning 45 (1), 5–32; Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression Trees. New York: Chapman and Hall; Hothorn, T., K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15 (3), 651–674; Lunetta, K. L., L. B. Hayward, J. Segal, and P. V. Eerdewegh (2004). Screening large-scale association study data: Exploiting interactions using random forests. BMC Genetics 5:32; R Development Core Team (2009). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics 9:307; Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8:25.
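As a rough illustration of the permutation idea behind importance measures of this kind, the sketch below shuffles one predictor at a time and records how much the prediction error grows. It is not the conditional importance of Strobl et al. (2008), nor the talk's R code: a least-squares fit stands in for the forest to keep the example dependency-free, and the function name, simulated data and number of repeats are all assumptions made for illustration.

```python
import numpy as np

def permutation_importance(predict, X, y, rng, n_repeats=20):
    """Mean increase in squared error when each column of X is shuffled in turn."""
    base_error = np.mean((y - predict(X)) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between predictor j and the response.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((y - predict(X_perm)) ** 2) - base_error)
        importances[j] = np.mean(increases)
    return importances

# Simulated data: predictor 0 matters most, predictor 2 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Stand-in model (assumption): ordinary least squares instead of a random forest.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda Z: Z @ beta

imp = permutation_importance(predict, X, y, rng)
print(imp)
```

Any fitted model exposing a prediction function could be passed in place of the least-squares stand-in. The biases discussed in the abstract arise from how trees select split variables and from correlated predictors, neither of which this simplified example reproduces.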
Interactive graphical exploration of psychometric multivariate data using glyph representations. (147)
Ali Uenlue and Waqas Ahmed Malik
Gauguin (Grouping And Using Glyphs Uncovering Individual Nuances) is statistical data visualization software for the interactive graphical exploration of multivariate data using glyph representations. Glyphs are defined as geometric shapes scaled by the values of multivariate data. Each glyph represents one high-dimensional data point (or the average of a group or cluster of data points). In this talk, we will present the software package Gauguin and illustrate its key features using part of the Programme for International Student Assessment (PISA) data.