Social and educational data analysis potpourri (47)

Chair: Lianghua Shu, Wednesday 22nd July, 9.55 - 11.15, Dirac Room, Fisher Building. 

Michela Gnaldi and Maria Giovanna Ranalli, Department of Economics, Finance and Statistics, University of Perugia, Italy. The robustness of university rankings: A sensitivity analysis of the Italian scientific research indicators. (148)

Brian Clauser, Melissa J. Margolis and Janet Mee, National Board of Medical Examiners, Philadelphia, USA. An experimental study of the use of performance data by judges in an Angoff standard setting exercise. (220)

Dmitry Belov, Law School Admission Council, Newtown PA, USA and Ronald D. Armstrong, Rutgers University, New Jersey, USA. A new approach to assess unusual agreement between the incorrect answers of two examinees. (078)

Rolf Steyer, Department of Psychology, University of Jena, Germany. The theory of causal effects and the principle of atomic stratification. (260)

ABSTRACTS

The robustness of university rankings: A sensitivity analysis of the Italian scientific research indicators. (148)
Michela Gnaldi and Maria Giovanna Ranalli
Composite indicators that compare the performance of different institutions (e.g. schools, universities) are increasingly recognised as a useful tool in policy analysis and public communication. It often seems easier for the general public to interpret composite indicators than to identify common trends across many separate indicators. However, composite indicators can send misleading policy messages if they are poorly constructed or misinterpreted. In this work we analyse some individual indicators related to the evaluation of research products of Universities put forward by the Italian Steering Committee for Research Evaluation (CIVR). The underlying structure of the data has been examined to explore the relationships among individual indicators and the existence of common dimensions of academic research. A set of individual indicators has been selected to construct composite indicators of scientific research for the Universities considered. To this end, five normalisation methods, a weighting scheme, and two aggregation schemes have been computed and combined, resulting in 135 composite indicators. The variation in the rankings assigned by the composite indicators to the Universities has been explored to gauge the robustness of the composite indicator rankings and to assess the contribution of the individual sources of uncertainty to the output variance. The width of the 5th – 95th percentile bounds and the ordering of the medians show that the group of university laggards (and, to some extent, the group of university leaders) is less sensitive to variations. A regression analysis has been employed to analyse the contribution of the different factors to the variance in the composite indicators.
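
The kind of sensitivity analysis described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the four universities, their scores, the two normalisation methods (min-max and ratio to the maximum), the equal weights and the two aggregation schemes (arithmetic and geometric mean) are hypothetical and are not the CIVR indicators or the authors' actual choices.

```python
from itertools import product
from math import prod

# Hypothetical scores: 4 universities x 3 individual indicators (made-up numbers).
scores = {
    "U1": [9.0, 8.0, 7.0],
    "U2": [6.0, 9.0, 3.2],
    "U3": [5.0, 5.0, 6.0],
    "U4": [2.0, 1.0, 3.0],
}

def min_max(col):
    """Rescale one indicator to [0, 1] using its observed minimum and maximum."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) for x in col]

def ratio_to_max(col):
    """Express each value as a fraction of the best observed value."""
    hi = max(col)
    return [x / hi for x in col]

def arithmetic(row):
    """Equal-weight arithmetic mean (fully compensatory aggregation)."""
    return sum(row) / len(row)

def geometric(row):
    """Equal-weight geometric mean (penalises very low single indicators)."""
    return prod(row) ** (1.0 / len(row))

def rank_universities(normalise, aggregate):
    """Return {university: rank} for one normalisation/aggregation combination."""
    names = list(scores)
    columns = zip(*scores.values())                  # one tuple per indicator
    norm_cols = [normalise(list(c)) for c in columns]
    rows = zip(*norm_cols)                           # back to one tuple per university
    composite = {n: aggregate(r) for n, r in zip(names, rows)}
    ordered = sorted(names, key=lambda n: composite[n], reverse=True)
    return {n: i + 1 for i, n in enumerate(ordered)}

# One ranking per combination of construction choices.
rankings = [
    rank_universities(n, a)
    for n, a in product([min_max, ratio_to_max], [arithmetic, geometric])
]

# Best and worst rank each university receives across all combinations.
rank_range = {
    name: (min(r[name] for r in rankings), max(r[name] for r in rankings))
    for name in scores
}
print(rank_range)
```

In this toy data set the top and bottom universities keep their rank under every combination, while the two middle universities swap places depending on the normalisation method, which mirrors the qualitative pattern the abstract reports for leaders and laggards.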

An experimental study of the use of performance data by judges in an Angoff standard setting exercise. (220)
Brian Clauser, Melissa J. Margolis and Janet Mee
Research suggests that content experts have difficulty making the judgments required in Angoff standard-setting exercises in the absence of performance data. One concern is that judges may rely too heavily on the data and essentially ignore the item content; most authors agree that excessive reliance on the data could call the process into question. This study examined the extent to which judges adjust their ratings when performance data are unrelated to actual examinee item performance. Actual performance data were presented with half of the items; data for the other items were systematically manipulated. Raters were told that some data would be inaccurate and that they should carefully consider their use of the data. Two rounds of Angoff judgments (without and with performance data) were made for 75 items. Ratings were compared to IRT-based conditional p-values for each item. Results varied across three replications of the procedure and across individuals within panels. Raters generally made small changes to their initial ratings; these changes were similar for items with unmanipulated and manipulated data. Results suggest that raters use the data to make adjustments without further consideration of item content. This may result in essentially norm-referenced rather than content-based judgments.

A new approach to assess unusual agreement between the incorrect answers of two examinees. (078)
Dmitry Belov and Ronald D. Armstrong
A common approach to assessing unusual agreement between the incorrect answers of two examinees is the K-Index probability (Holland, 1996). The K-Index has been used at the Educational Testing Service (ETS) for a decade and its properties were further explored in the literature (Lewis & Thayer, 1998; Sotaridona & Meijer, 2002). The major disadvantage of the K-Index is a high Type II error (Lewis & Thayer, 1998). Consider a pair of examinees called subject and source with number of incorrect responses U and V, respectively. We assume U is greater than or equal to V. Given values for U and V, the number of matches between the subject's and the source's incorrect answer choices, M(U, V), is asymptotically normal. For a test of length T, there is a lower triangular T+1 by T+1 matrix of distributions of M(U, V), where U and V range from 0 to T; therefore, we have (T+1)(T+2)/2 distributions. To calibrate the parameters of these distributions, we developed a Monte Carlo method, applied to 12 administrations of the Law School Admission Test (LSAT) from 2005 to 2007. We called our approach the M-Index. A computational study comparing the M-Index with the K-Index demonstrated a significant decrease in Type II error at the same level of Type I error.
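
The count of (T+1)(T+2)/2 distributions can be checked by enumerating the pairs (U, V) with 0 ≤ V ≤ U ≤ T. The sketch below does that and then draws Monte Carlo samples of M(U, V) under a toy null model. The null model is purely an assumption for illustration (incorrectly answered items chosen uniformly at random, four equally likely wrong options per item, independent examinees); it is not the authors' calibration procedure, and the function names are hypothetical.

```python
import random

def n_distributions(T):
    """Count the cells of the lower triangular matrix: pairs with 0 <= V <= U <= T."""
    return sum(1 for U in range(T + 1) for V in range(U + 1))

def simulate_matches(T, U, V, n_wrong_options=4, n_draws=20000, seed=0):
    """Draw M(U, V) under a toy independence model (an assumption, not the paper's).

    Each examinee answers a uniformly random set of items incorrectly and picks
    a uniformly random wrong option on each; a match is an item both answered
    incorrectly with the same wrong option.
    """
    rng = random.Random(seed)
    items = range(T)
    draws = []
    for _ in range(n_draws):
        subject = {i: rng.randrange(n_wrong_options) for i in rng.sample(items, U)}
        source = {i: rng.randrange(n_wrong_options) for i in rng.sample(items, V)}
        draws.append(sum(1 for i, c in source.items() if subject.get(i) == c))
    return draws

T = 20
count = n_distributions(T)
print(count, (T + 1) * (T + 2) // 2)

draws = simulate_matches(T=T, U=10, V=5)
mean_matches = sum(draws) / len(draws)
# Under this toy model the expected number of matches is U * V / (T * n_wrong_options).
print(mean_matches, 10 * 5 / (T * 4))
```

An observed number of matches far in the upper tail of such a simulated distribution would be flagged as unusual; how the tail is calibrated in practice is exactly what the abstract's Monte Carlo method addresses, and this sketch does not reproduce it.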

The theory of causal effects and the principle of atomic stratification. (260)
Rolf Steyer
Beyond the randomized experiment, mean differences between treatment conditions are plagued by confounding, which sometimes manifests itself by the reversal of the alleged treatment effects when considering subpopulations instead of the total population. Therefore, Neyman and Rubin built their theories on the causal effects in the smallest possible subpopulations: the observational units. However, there are examples showing that confounding does not necessarily vanish at the level of the observational units. Hence, it is proposed to base the theory of causal effects on the more general notion of atomic strata, which are obtained by conditioning on all potential confounders. Those random variables whose values are the treatment-conditional expected values of the outcome variable Y in the atomic strata are called the true outcome variables. They replace Rubin's "potential outcome variables". Once we have defined the true outcome variables and the causal effects on the level of the atomic strata, we can define average causal effects by the difference between the expected values over the distribution of the strata. Conditioning on the atomic strata and then taking the expected values over the distribution of the strata is what we call the principle of atomic stratification in the theory of causal effects.
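
The contrast between a naive mean difference and an average of stratum-level effects can be made concrete with a small numerical sketch. All numbers are made up, and the two coarse strata merely stand in for atomic strata (which would condition on all potential confounders); the variable names are hypothetical.

```python
# Each stratum: its probability, the probability of treatment within it, and the
# treatment-conditional expected outcomes E[Y | X=1, stratum], E[Y | X=0, stratum].
strata = [
    {"p": 0.5, "p_treat": 0.8, "ey_treat": 0.3, "ey_control": 0.2},
    {"p": 0.5, "p_treat": 0.2, "ey_treat": 0.8, "ey_control": 0.7},
]

# Naive comparison: E[Y | X=1] - E[Y | X=0] in the total population.
p_x1 = sum(s["p"] * s["p_treat"] for s in strata)
p_x0 = sum(s["p"] * (1 - s["p_treat"]) for s in strata)
ey_x1 = sum(s["p"] * s["p_treat"] * s["ey_treat"] for s in strata) / p_x1
ey_x0 = sum(s["p"] * (1 - s["p_treat"]) * s["ey_control"] for s in strata) / p_x0
naive_difference = ey_x1 - ey_x0

# Stratification: take the effect within each stratum, then average over the
# distribution of the strata (not over the treatment-specific distributions).
average_effect = sum(s["p"] * (s["ey_treat"] - s["ey_control"]) for s in strata)

print(round(naive_difference, 3), round(average_effect, 3))
```

With these numbers the treatment raises the expected outcome by 0.1 in both strata, yet the naive difference is negative, because treated units are concentrated in the stratum with the lower baseline outcome. This is the reversal the abstract refers to.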