Categorical and survey data: Methodology (30)

Chair: Peter van der Heijden, Friday 24th July, 9.30 - 10.50, Castlereagh Room, Fisher Building.

Maria Kateri, Department of Statistics and Insurance Science, University of Piraeus, Greece. Phi-divergence classes of models for categorical data. (011)

Peter van der Heijden, Utrecht University, The Netherlands, Ardo D.L. van den Hout, MRC Biostatistics Unit, University of Cambridge, UK, Ulf Bockenholt, McGill University, Montreal, Canada. Estimating prevalence and cheating in a double sampling scheme with direct questioning and randomized response. (105)

Joakim Ekström, Department of Statistics, Uppsala University, Sweden. A generalized definition of the tetrachoric correlation coefficient. (213)

Jay Verkuilen, PhD Program in Educational Psychology, CUNY Graduate Center, New York, USA, Christopher Siefert, Sandia National Laboratories, USA. Statistical theory of the list experiment to measure socially sensitive attitudes. (076) (Slides)

ABSTRACTS

Phi-divergence classes of models for categorical data. (011)
Maria Kateri
Modelling categorical data is viewed through an information theoretic perspective. The models are characterized by their distance from the most parsimonious model in the direction of their qualitative substance, which serves as a reference model. This way, apparently different (and often competing) models are unified in families sharing common properties. It can be proved that all of them measure the distance from the same reference model under the same conditions. Their difference lies in the measure applied to express this distance and is thus a scale difference. Consequently, if the distances are expressed in terms of a generalized measure, then a family of models is developed, having well-known models as its members. Using the φ-divergence as such a generalized measure, the corresponding classes of models are built: the φ-divergence association model for modelling departures from independence in a contingency table, and the generalized quasi-symmetry QS[φ] and ordinal quasi-symmetry OQS[φ] models for modelling departures from complete symmetry in square contingency tables. Logistic regression is another model that is generalized through φ-divergence to a class of models, unifying alternative approaches.
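
For orientation, the standard (Csiszár) definition of the φ-divergence between two discrete distributions can be written as below. The notation (p, q, the index i) is an assumption made here for illustration; the abstract does not fix any symbols, and the specific model classes it mentions are not reproduced.

```latex
% phi-divergence between probability vectors p = (p_i) and q = (q_i),
% for a convex function \varphi on (0, \infty) with \varphi(1) = 0
D_{\varphi}(p \,\|\, q) \;=\; \sum_{i} q_i \,\varphi\!\left(\frac{p_i}{q_i}\right)

% Two familiar special cases:
%   \varphi(x) = x \log x   gives the Kullback-Leibler divergence
%                           \sum_i p_i \log (p_i / q_i)
%   \varphi(x) = (x - 1)^2  gives the Pearson chi-square distance
%                           \sum_i (p_i - q_i)^2 / q_i
```

Different choices of φ thus measure the distance from the same reference distribution q on different scales, which is the sense in which the abstract speaks of a scale difference.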

Estimating prevalence and cheating in a double sampling scheme with direct questioning and randomized response. (105)
Peter van der Heijden, Ardo D.L. van den Hout and Ulf Bockenholt
Randomized response is used in surveys with sensitive questions. To protect the privacy of respondents, misclassification is induced where parameters are fixed by design. The aim of randomized response is to estimate prevalence of sensitive behaviour. Respondents not following the instructions of randomized response are considered to be cheating. A mixture model is proposed to estimate prevalence and cheating in the case of a double sampling scheme with direct questioning and randomized response. The model uses design specific cheating parameters. The research is motivated by randomized response data concerning violations of regulations for social benefit.
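
As a rough illustration of how prevalence can be recovered when the misclassification probabilities are fixed by design, the sketch below implements the basic moment estimator for a single binary randomized response question. It is not the mixture model proposed in the abstract and it ignores cheating. The function name, the parameter names, and the example design probabilities are all assumptions made for illustration.

```python
from math import sqrt


def rr_prevalence(n_yes, n, p_yes_if_true_yes, p_yes_if_true_no):
    """Moment estimate of prevalence from randomized response data.

    Assumes every respondent follows the instructions (no cheating).
    p_yes_if_true_yes and p_yes_if_true_no are the design-fixed
    probabilities of an observed "yes" given a true "yes" / true "no".
    (Hypothetical names; not taken from the abstract.)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    denom = p_yes_if_true_yes - p_yes_if_true_no
    if denom == 0:
        raise ValueError("design carries no information about the true answer")
    observed = n_yes / n  # observed proportion of "yes" answers
    # Invert: observed = p_yes_if_true_yes * pi + p_yes_if_true_no * (1 - pi)
    pi_hat = (observed - p_yes_if_true_no) / denom
    # Binomial standard error, inflated by the design's misclassification
    se = sqrt(observed * (1.0 - observed) / n) / abs(denom)
    return pi_hat, se


# Illustrative design: observed "yes" with probability 0.8 if the true
# answer is yes and 0.2 if it is no; 350 of 1000 respondents say "yes".
estimate, std_err = rr_prevalence(350, 1000, 0.8, 0.2)
```

The estimate can fall outside [0, 1] in small samples. Cheating of the kind the abstract describes would bias this simple estimator, which is one motivation for a model that estimates it explicitly.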

A generalized definition of the tetrachoric correlation coefficient. (213)
Joakim Ekström
We generalize the tetrachoric correlation coefficient to a large class of parametric families of bivariate distributions. The generalized definition agrees with the conventional definition on the family of bivariate normal distributions. Furthermore, we provide a necessary and sufficient condition for the generalized tetrachoric correlation coefficient to be well defined for a given family of distributions, and some sufficient criteria which can be useful for practical purposes. Moreover, we illustrate with examples how the distributional assumption can have a profound impact on the conclusions of the association analysis. Using S&P 100 stock data, we exemplify the fact that a correct distributional assumption is vitally important for the analysis. Consequently, it is concluded that the tetrachoric correlation coefficient is not robust to changes of the distributional assumption.
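
For reference, the conventional tetrachoric correlation (the bivariate normal case that the generalized definition agrees with) can be sketched as follows. This is a minimal illustration, not the author's generalized definition: it assumes a 2×2 table with all cells positive, and the function names and the cell labelling n00, n01, n10, n11 are assumptions introduced here. It uses the identity that the derivative of the bivariate normal distribution function with respect to ρ equals the bivariate normal density.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def _bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    s = 1.0 - r * r
    return exp(-(h * h - 2.0 * r * h * k + k * k) / (2.0 * s)) / (2.0 * pi * sqrt(s))


def _bvn_cdf(h, k, rho, steps=200):
    """P(X <= h, Y <= k) = Phi(h) * Phi(k) + integral_0^rho density dr (Simpson)."""
    base = _STD.cdf(h) * _STD.cdf(k)
    if rho == 0.0:
        return base
    width = rho / steps  # signed step; steps must be even
    total = _bvn_density(h, k, 0.0) + _bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * _bvn_density(h, k, i * width)
    return base + total * width / 3.0


def tetrachoric(n00, n01, n10, n11, tol=1e-8):
    """Conventional tetrachoric correlation for a 2x2 table of counts.

    Rows and columns are indexed 0/1; all cells are assumed positive.
    """
    n = n00 + n01 + n10 + n11
    h = _STD.inv_cdf((n00 + n01) / n)  # threshold of the row variable
    k = _STD.inv_cdf((n00 + n10) / n)  # threshold of the column variable
    target = n00 / n                   # P(both variables below threshold)
    lo, hi = -0.999, 0.999
    while hi - lo > tol:               # bisection; the cdf increases in rho
        mid = (lo + hi) / 2.0
        if _bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


r = tetrachoric(40, 10, 10, 40)
```

Replacing the bivariate normal family in `_bvn_cdf` with another parametric family would generally change the resulting coefficient, which is the sensitivity the abstract points to.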

Statistical theory of the list experiment to measure socially sensitive attitudes. (076)
Jay Verkuilen and Christopher Siefert
The list experiment (also known as the item count technique) was proposed as a way to reduce social desirability bias in a telephone or internet survey of sensitive topics. It gives respondents a degree of anonymity while still providing a way to estimate aggregate outcomes of interest (Gilens, Sniderman & Kuklinski, 1997; Janus, 2008; Kuklinski, Cobb & Gilens, 1997; Kuklinski et al., 1997; Streb et al., 2008; Tsuchiya, 2007). It represents a more practical design than the famous "randomized response" (Warner, 1965) because it requires no randomization or complicated instructions to respondents. While it has been used in a number of empirical studies, the list experiment has not been examined carefully by way of a formal model that leads to the list experiment estimator. This article provides such a formalization based on a latent response profile for each subject, using Teugels’ (1991) multivariate binomial parameterization of the multinomial distribution and a matrix transformation that characterizes all possible designs. A number of results follow: (a) a simple mathematical test of the consistency of a proposed design, (b) designs that have a higher efficiency than the existing ones, (c) guidelines for writing better items, and (d) the appropriate use of individual-level covariates. (slides)
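
The usual list experiment estimator is a difference in mean item counts between two groups: a control group that sees only the non-sensitive items and a treatment group that sees the same items plus the sensitive one. The sketch below shows that basic estimator only. It does not implement the latent response profile model or the design results described in the abstract, and the function name and the toy counts are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def list_experiment_estimate(control_counts, treatment_counts):
    """Difference-in-means estimate of the sensitive item's prevalence.

    control_counts:   item counts from respondents shown J baseline items
    treatment_counts: item counts from respondents shown J + 1 items
                      (baseline items plus the sensitive item)
    Returns the estimate and its standard error for independent groups.
    """
    if len(control_counts) < 2 or len(treatment_counts) < 2:
        raise ValueError("each group needs at least two respondents")
    estimate = mean(treatment_counts) - mean(control_counts)
    std_err = sqrt(
        variance(treatment_counts) / len(treatment_counts)
        + variance(control_counts) / len(control_counts)
    )
    return estimate, std_err


# Toy data (made up for illustration): four respondents per group.
control = [1, 2, 2, 3]
treatment = [2, 2, 3, 3]
prevalence, se = list_experiment_estimate(control, treatment)
```

Because only the count is reported, no individual answer to the sensitive item is revealed, unless a treatment respondent endorses all or none of the items.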