September JASA Addresses Topics in Text Analysis and Machine Learning
Joseph Ibrahim, JASA Applications and Case Studies Editor and Coordinating Editor
The September issue of the Journal of the American Statistical Association covers application topics ranging from models for text analysis to imaging genetics. Theory and Methods contributions include new statistical methods for longitudinal data, transformation models, mixture models, partial differential equation models, machine learning, and much more.
Applications and Case Studies
AC&S features a discussion article by Matt Taddy, titled “Multinomial Inverse Regression for Text Analysis.” The discussants of the paper are David Blei and Justin Grimmer. Text data—including speeches, stories, and other document forms—are often connected to sentiment variables that are of interest for research in marketing, economics, and elsewhere. Such data are also very high dimensional and difficult to incorporate into statistical analyses. This article introduces a straightforward framework of sentiment-sufficient dimension reduction for text data. Multinomial inverse regression is introduced as a general tool for simplifying predictor sets that can be represented as draws from a multinomial distribution, and the author shows that logistic regression of phrase counts onto document annotations can be used to obtain low-dimensional document representations that are rich in sentiment information. To facilitate this modeling, a novel estimation technique is developed for multinomial logistic regression with a very high-dimensional response. In particular, independent Laplace priors with unknown variance are assigned to each regression coefficient, and an efficient routine maximizes the joint posterior over the coefficients and their prior scales. This “gamma-lasso” scheme yields stable and effective estimation for general high-dimensional logistic regression, and the author argues it will be superior to current methods in many settings. Guidelines for prior specification are provided, algorithm convergence is detailed, and estimator properties are outlined from the perspective of the literature on nonconcave likelihood penalization.
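The core idea of the sufficient reduction, projecting normalized phrase counts onto a vector of sentiment loadings to obtain a low-dimensional document score, can be sketched with a toy simulation. Everything below is illustrative rather than the paper's code: the loadings `phi` are taken as known, whereas multinomial inverse regression would estimate them by penalized multinomial logistic regression of counts on the sentiment annotations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words = 200, 50

# Toy data: sentiment y shifts the log-odds of a few "loaded" words.
y = rng.choice([-1.0, 1.0], size=n_docs)
alpha = rng.normal(0.0, 0.5, n_words)      # baseline word log-odds
phi = np.zeros(n_words)
phi[:5] = 1.0                              # sentiment-loaded words

logits = alpha + np.outer(y, phi)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
m = 100                                    # words per document
X = np.array([rng.multinomial(m, p) for p in probs])

# Sufficient-reduction score: project normalized counts onto phi.
z = (X / m) @ phi

# Documents with positive sentiment should score higher on average.
print(z[y > 0].mean() > z[y < 0].mean())
```

The one-dimensional score `z` can then stand in for the full count matrix in a downstream (forward) regression on sentiment, which is the dimension-reduction payoff the article describes.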
Several other AC&S papers also appear in the September issue, with wide-ranging applications in clinical trials, imaging, spatial statistics, missing data, proteomics, and survey sampling.
Theory and Methods
Machine learning methodology continues to play a prominent role in statistical problems involving classification and regression. In the paper titled “Latent Supervised Learning” by Susan Wei and Michael Kosorok, the authors propose a new machine learning task called latent supervised learning, where the goal is to learn a binary classifier from continuous training labels that serve as surrogates for the unobserved class labels. A specific model is investigated where the surrogate variable arises from a two-component Gaussian mixture with unknown means and variances, and the component membership is determined by a hyperplane in the covariate space. The estimation of the separating hyperplane and the Gaussian mixture parameters forms what they call the “change-line classification problem.” A data-driven sieve maximum likelihood estimator for the hyperplane is proposed, which in turn can be used to estimate the parameters of the Gaussian mixture. The estimator is shown to be consistent. Simulation studies and real-data examples show that the estimator achieves high classification accuracy.
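The change-line setup can be made concrete with a toy sketch. All choices below are assumptions for illustration, not the authors' method: a two-dimensional covariate, a hyperplane through the origin, mixture means of ±2 with unit variance, and a crude grid search over hyperplane directions standing in for their sieve maximum likelihood estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))

# True class = sign(w.x); the continuous surrogate Y is Gaussian
# with a class-dependent mean (the two-component mixture).
w_true = np.array([1.0, -1.0]) / np.sqrt(2)
cls = (X @ w_true > 0).astype(int)
Y = rng.normal(np.where(cls == 1, 2.0, -2.0), 1.0)

def profile_loglik(theta):
    """Gaussian log-likelihood of the two-group split induced by the
    hyperplane with normal direction (cos theta, sin theta)."""
    w = np.array([np.cos(theta), np.sin(theta)])
    side = X @ w > 0
    ll = 0.0
    for mask in (side, ~side):
        if mask.sum() < 2:
            return -np.inf
        mu, sd = Y[mask].mean(), Y[mask].std() + 1e-6
        ll += -mask.sum() * np.log(sd) - ((Y[mask] - mu) ** 2).sum() / (2 * sd**2)
    return ll

# Grid search over directions (a stand-in for the sieve MLE).
thetas = np.linspace(0, np.pi, 360)
best = thetas[np.argmax([profile_loglik(t) for t in thetas])]
w_hat = np.array([np.cos(best), np.sin(best)])

# The recovered direction should align with w_true (up to sign).
print(abs(w_hat @ w_true))
```

Once the hyperplane is estimated, the mixture means and variances follow from the two induced groups, mirroring the two-stage structure described in the paper.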
Statistical graphics play a crucial role in exploratory data analysis, model checking, and diagnosis. The lineup protocol enables statistical significance testing of visual findings, bridging the gulf between exploratory and inferential statistics. In the paper titled “Validation of Visual Statistical Inference, Applied to Linear Models,” Mahbubul Majumder, Heike Hofmann, and Dianne Cook develop inferential methods for statistical graphics by further refining the terminology of visual inference and framing the lineup protocol in a context that allows direct comparison with conventional tests in scenarios where a conventional test exists. This framework is used to compare the performance of the lineup protocol against conventional statistical testing in the scenario of fitting linear models. A human subjects experiment is conducted using simulated data to provide controlled conditions. Results suggest that the lineup protocol performs comparably with the conventional tests and, as expected, outperforms them when the data are contaminated, a scenario in which the assumptions required for a conventional test are violated. Surprisingly, the visual tests have higher power than the conventional tests when the effect size is large. And, interestingly, there may be some “super-visual” individuals who yield better performance and power than the conventional test, even in the most difficult tasks.
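One common way to score a lineup, sketched here as an assumption rather than a quotation of the paper's formulas, treats each of several independent observers as picking one plot from a lineup of m panels; under the null hypothesis, each observer picks the true-data plot by chance with probability 1/m, so the visual p-value is a binomial tail probability.

```python
from math import comb

def lineup_pvalue(k_correct, n_observers, m_plots=20):
    """P-value for a lineup test: probability that at least k_correct
    of n_observers pick the true-data plot by chance alone."""
    p = 1 / m_plots
    return sum(
        comb(n_observers, x) * p**x * (1 - p) ** (n_observers - x)
        for x in range(k_correct, n_observers + 1)
    )

# If 4 of 10 observers pick the real plot out of 20 panels, chance
# alone is a very unlikely explanation.
print(round(lineup_pvalue(4, 10), 5))  # → 0.00103
```

This is the sense in which the protocol yields a significance test directly comparable with a conventional one: the observers' picks play the role of the test statistic.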
Other T&M articles appearing in the September issue include new statistical methods for mixture models, transformation models, longitudinal data, factor analysis, partial differential equation models, wavelets, regularization, quantile regression, and noncompliance in randomized studies.