JASA Highlights: Studies of Immune Response and False Discovery Rate Featured in September Issue
Hal Stern, JASA Applications and Case Studies and Coordinating Editor
New technologies producing large data sets are a major force in modern statistical science, motivating the development of new theory and methods and the application of these methods to important scientific problems. The increased attention to methodology for large data sets is illustrated in the two JASA invited papers that were presented during the 2012 Joint Statistical Meetings in San Diego, California. The articles and the ensuing discussions make for interesting reading in the September issue of the Journal of the American Statistical Association.
Applications and Case Studies
The immune system in humans and other vertebrates provides an adaptive and remarkably effective response to infections or vaccines. The response is determined largely by the spatio-temporal motion of lymphocyte cells. These cells move in response to unobservable gradient fields; learning about these fields is critical to understanding the basic biology behind infection and vaccine response. New technology measuring single-cell motion in real time provides an opportunity for investigators to infer the underlying gradient field by carefully modeling the motion of lymphocyte cells. New types of data require new statistical approaches.
One approach to the spatio-temporal lymphocyte motion data is described in “Bayesian Spatio-Dynamic Modeling in Cell Motility Studies: Learning Nonlinear Taxic Fields Guiding the Immune Response” by Ioanna Manolopoulou, Melanie Matheu, Michael Cahalan, Mike West, and Thomas Kepler. Manolopoulou and colleagues develop a flexible statistical modeling framework by building on a continuous-time stochastic differential equation model for cell motion under a gradient field. Markov chain Monte Carlo computational techniques are used to learn about the parameters that govern individual cell motion and to infer the underlying gradient field. The approach they develop works extremely well on a simulated data set and provides insight for experimental data from the lymph nodes of mice. Invited discussions by Edward Ionides, Samuel Kou, and John Fricks (with colleagues Le Bao and Murali Haran) provide additional insight into the modeling and computational choices made by the authors.
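The flavor of such a model can be conveyed with a toy simulation. The sketch below is not the authors' model: it uses a hypothetical quadratic attractant potential centred at the origin as a stand-in for the unknown taxic field, and an Euler–Maruyama discretisation of a simple damped-velocity stochastic differential equation for one cell. All parameter values are illustrative.

```python
import numpy as np

def taxic_force(x, k=0.5):
    """Force from a hypothetical quadratic attractant potential centred
    at the origin (a stand-in for the unknown gradient field)."""
    return -k * x

def simulate_cell(T=2000, dt=0.05, beta=1.0, sigma=0.1, seed=0):
    """Euler-Maruyama discretisation of a damped-velocity SDE:
    dv = (-beta * v + F(x)) dt + sigma dW,   dx = v dt."""
    rng = np.random.default_rng(seed)
    x = np.array([2.0, -1.5])   # initial position
    v = np.zeros(2)             # initial velocity
    path = [x.copy()]
    for _ in range(T):
        v = v + (-beta * v + taxic_force(x)) * dt \
            + sigma * np.sqrt(dt) * rng.standard_normal(2)
        x = x + v * dt
        path.append(x.copy())
    return np.array(path)

path = simulate_cell()   # the simulated cell drifts toward the attractant
```

In the actual inference problem this generative direction is reversed: cell tracks like `path` are observed, and the field generating them is unknown and must be learned.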
A second application paper in the September issue presents another example of high-dimensional data motivating new methodology. In this case, network data that characterize the interactions of a large number of individual units (e.g., predator-prey relations among animal species) present the challenge. Network data exhibit a number of complex phenomena that are not easily accommodated by standard models, including a latent hierarchical organization of the species, different types of interactions, and different network topologies (e.g., varying tendencies for within-subcommunity and between-subcommunity interactions).
Qirong Ho, Ankur Parikh, and Eric Xing propose “A Multiscale Community Blockmodel for Network Exploration” that allows investigators to infer these phenomena from a set of observed network interactions. Ho et al. develop a stochastic model for partitioning the units in the network, say species, in a hierarchically organized tree. Each species’ interactions are governed by a multiscale membership vector that describes that species’ likelihood of interacting with species at different levels of the hierarchical tree.
Finally, a probability model that links the hierarchical tree and the membership vectors to observed network connections can be used to infer the parameters of the model. The authors demonstrate the approach on a network describing the predator-prey relationships among a collection of 75 species of grass-feeding wasps and their parasites.
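The core blockmodel idea, stripped of the hierarchy, is that edge probabilities depend only on the communities of the two endpoints. The sketch below samples from a plain single-level stochastic blockmodel; it is a much-simplified stand-in for the multiscale model, with hypothetical community sizes and edge probabilities.

```python
import numpy as np

def sample_blockmodel(sizes, p_within=0.6, p_between=0.05, seed=1):
    """Sample a directed network from a single-level stochastic
    blockmodel: the probability of an edge i -> j depends only on
    the communities of i and j (here: within vs. between)."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_within, p_between)
    adj = (rng.random((n, n)) < probs).astype(int)
    np.fill_diagonal(adj, 0)      # no self-interactions
    return adj, labels

adj, labels = sample_blockmodel([10, 10, 5])
```

Fitting such a model reverses the sampling: given `adj`, one infers `labels` (and, in the multiscale case, the whole tree) from the observed interaction pattern.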
Theory and Methods
Multiple hypothesis testing is a fundamental problem in high-dimensional statistical problems. For example, in genome-wide association studies, tens (or even hundreds) of thousands of tests are performed simultaneously to determine which, if any, genetic markers are associated with a given disease or trait. Researchers in such settings increasingly rely on procedures that control the false discovery rate (FDR), the expected proportion of rejected null hypotheses for which the null hypothesis of no effect is actually true. Procedures have been developed that control the FDR in large problems with independent test statistics. When test statistics are correlated, false discovery control becomes challenging, especially if we wish to allow for arbitrary forms of dependence.
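The best-known procedure for the independent case is the Benjamini–Hochberg step-up rule, which is short enough to state in a few lines; the p-values below are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: reject the k smallest
    p-values, where k is the largest i with p_(i) <= alpha * i / m."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

reject = benjamini_hochberg([0.001, 0.011, 0.019, 0.024, 0.2, 0.5, 0.7, 0.9])
# rejects the 4 smallest p-values; note the step-up character: 0.019 is
# rejected even though it exceeds its own threshold of 0.05 * 3/8
```

The procedure's FDR guarantee is proved under independence (and certain positive-dependence conditions), which is exactly the limitation the featured paper addresses.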
The featured Theory and Methods paper—“Estimating False Discovery Proportion Under Arbitrary Covariance Dependence” by Jianqing Fan, Xu Han, and Weijie Gu—proposes a novel method for controlling the false discovery rate based on a principal factor approximation of the covariance matrix of the test statistics. The approximation subtracts out the primary sources of dependence, substantially weakening the remaining correlation structure and allowing an approximate expression for the false discovery proportion to be derived. Discussants Larry Wasserman, Peter Hall, Armin Schwartzman, and Jiashun Jin provide additional insights and raise challenging questions about the proposed approach.
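A rough illustration of the idea, not the authors' estimator: the sketch below simulates null test statistics driven by a single common factor, estimates the factor loadings from the top eigenvector of the sample covariance, and subtracts the fitted factor. The loadings, noise level, and dimensions are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 200, 1                       # number of tests, number of factors
b = rng.normal(0.8, 0.1, size=m)    # hypothetical factor loadings

# Correlated null statistics over 1000 replications: Z_i = b_i * W + eps_i
W = rng.standard_normal(1000)       # shared latent factor
Z = b[:, None] * W + 0.5 * rng.standard_normal((m, 1000))

# Principal factor step (sketch): estimate top-k loadings from the sample
# covariance, fit factor scores by least squares, subtract the fitted part.
cov = np.cov(Z)
eigvals, eigvecs = np.linalg.eigh(cov)
B = eigvecs[:, -k:] * np.sqrt(eigvals[-k:])    # estimated loadings
W_hat = np.linalg.lstsq(B, Z, rcond=None)[0]   # factor scores
Z_adj = Z - B @ W_hat

def mean_abs_corr(X):
    """Average absolute off-diagonal pairwise correlation."""
    c = np.corrcoef(X)
    return np.abs(c[np.triu_indices_from(c, 1)]).mean()

print(mean_abs_corr(Z), mean_abs_corr(Z_adj))  # adjusted statistics are
                                               # far less correlated
```

With the dominant dependence removed, the adjusted statistics behave much more like the independent case for which an approximate false discovery proportion can be computed.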
The potential for personalized medicine is explored in the article “Estimating Individualized Treatment Rules Using Outcome Weighted Learning” by Yingqi Zhao, Donglin Zeng, A. John Rush, and Michael Kosorok. Physicians note heterogeneous responses to treatment in many diseases; a drug that works well for some individuals may not work at all for others. Zhao et al. propose an approach that uses randomized trial results and individual prognostic factors (which may include genetic and other factors) to develop optimal rules for assigning individuals to treatments. Standard approaches to this challenging problem first use the data to estimate the expected response for an individual for each treatment and then propose to assign patients to the treatment that yields the highest expected response. This can work poorly if the first-stage models are overfit to the data.
The authors show that estimating an optimal treatment rule is equivalent to a classification problem (a patient with a bad outcome on the assigned treatment is considered a misclassification). Some misclassifications are costlier errors than others; the authors introduce differential weighting based on the patient outcome to address this issue. A machine learning approach, support vector machines, is used to find the optimal decision rule that minimizes the expected weighted misclassification rate without estimating expected responses separately for each treatment. The resulting estimator for the optimal treatment rule has good statistical properties and performs well in simulation studies and in an analysis of chronic depression data.
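The weighting idea can be sketched in a few lines. The simulation below is hypothetical (treatment helps exactly when the first covariate is positive), and for brevity a logistic surrogate loss fitted by plain gradient descent stands in for the weighted support vector machine of the paper; each patient's received treatment is classified with weight proportional to the observed outcome divided by the randomization probability.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
X = rng.standard_normal((n, 2))              # prognostic covariates
A = rng.choice([-1.0, 1.0], size=n)          # randomized treatment, pi = 0.5
# Hypothetical trial: treatment helps iff the first covariate is positive
R = 1.0 + A * np.sign(X[:, 0]) + 0.3 * rng.standard_normal(n)

# Outcome weighted learning (sketch): classify the treatment received,
# weighting each patient by R / pi so good outcomes dominate the fit.
# Logistic loss replaces the hinge loss of the original SVM formulation.
w = np.clip(R, 0.0, None) / 0.5              # nonnegative weights
Xb = np.hstack([X, np.ones((n, 1))])         # add intercept column
theta = np.zeros(3)
for _ in range(500):                         # plain gradient descent
    margin = np.clip(A * (Xb @ theta), -30.0, 30.0)
    grad = -(Xb * (w * A / (1.0 + np.exp(margin)))[:, None]).mean(axis=0)
    theta -= 0.5 * grad

rule = np.sign(Xb @ theta)                   # estimated treatment rule
print(np.mean(rule == np.sign(X[:, 0])))     # agreement with the true rule
```

Because patients who fared well under their assigned treatment carry large weights, the fitted classifier recovers a rule close to the true one without ever modeling the expected response under each treatment separately.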
There are many other informative articles in both sections of the September issue, as well as a set of book reviews. The full list of articles and a list of the books under review can be viewed at the ASA website. ASA members can log in through the Members Only link for free online access to JASA.