March JASA Features ASA President’s Invited Address

1 May 2011
Hal Stern, JASA Applications and Case Studies Editor


Sastry Pantula, 2010 ASA president, talked about the critically important role of statistics in an increasingly data-rich world during the presidential address at the Joint Statistical Meetings in Vancouver, British Columbia, Canada, last August, and his remarks appear in the March 2011 issue of the Journal of the American Statistical Association. Pantula gives many examples of how statisticians are rising to the challenge in various fields, including biology, meteorology, marketing, and manufacturing. For example, statisticians are playing critical roles in the analysis of genomic data in studies of disease and in the study of large-scale climate models and databases. Pantula notes that collaboration is critical to the study of such large data sets and encourages statisticians to partner with scientists in other fields. He also notes the important part statisticians play in ensuring that uncertainty and variation are well understood. The remainder of the March issue includes numerous articles that illustrate many of the topics Pantula discusses.

Theory and Methods

A key Theory and Methods paper addresses fundamental statistical problems that arise in the supposedly simple act of storing the vast amounts of data being generated. Data centers that house multiple servers to handle the increasing flow of data are a critical part of the modern business world and present numerous economic challenges to companies. Data centers generate enormous amounts of heat, which must be removed efficiently to keep the machines inside them cool. Indeed, a significant share of a data center's power consumption goes to heat removal.

To perform heat removal well, it is critical to efficiently learn about the distribution of heat throughout the often irregularly shaped data centers. Ying Hung, in “Adaptive Probability-Based Latin Hypercube Designs,” describes statistical procedures that can be used to design optimal placement of sensors in data centers to study the thermal distribution. The use of adaptive designs that change as data are collected can introduce bias into conventional estimators. Hung develops several design-unbiased estimators and studies their performance through simulation and in a real application.
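For readers unfamiliar with the building block named in the paper's title, here is a minimal sketch of a plain, non-adaptive Latin hypercube sample. It is not Hung's adaptive probability-based design; the function name `latin_hypercube` and its parameters are illustrative assumptions. The idea is that each dimension is cut into n equal-width strata and every stratum receives exactly one point, so the sample spreads evenly across the region.

```python
import random


def latin_hypercube(n_points, n_dims, seed=None):
    """Return n_points samples in [0, 1)^n_dims, one per stratum in each dimension.

    Illustrative helper (not from the article): a basic Latin hypercube,
    without the adaptive or probability-based features of Hung's designs.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # Place one point at a random offset inside each of the n strata.
        strata = [(i + rng.random()) / n_points for i in range(n_points)]
        # Shuffle so strata are paired at random across dimensions.
        rng.shuffle(strata)
        columns.append(strata)
    # Transpose the per-dimension columns into a list of points.
    return [tuple(col[i] for col in columns) for i in range(n_points)]


# Five hypothetical sensor locations on a unit square.
design = latin_hypercube(5, 2, seed=0)
for point in design:
    print(tuple(round(x, 3) for x in point))
```

In a real, irregularly shaped room the unit square would have to be mapped onto the feasible sensor locations, which is part of what makes the design problem in the paper harder than this sketch.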

Another “large data” problem arises when investigators try to combine data from a multitude of studies that address the same or similar problems. In “Confidence Distributions and a Unifying Framework for Meta-Analysis,” Minge Xie, Kesar Singh, and William E. Strawderman develop novel and robust approaches for meta-analysis based on the emerging methodology of confidence distributions.

A confidence distribution (CD) is a probability distribution function that can provide confidence intervals of all levels for a parameter of interest. The authors note that although most people think of the CD as a purely frequentist concept, the CD in fact links to Bayesian inference concepts and to the fiducial arguments of R. A. Fisher. The authors propose robust CD methods that are not sensitive to a small number of outlying studies and study the robust methods under two complementary asymptotic frameworks.
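To make the idea of "confidence intervals of all levels" concrete, here is a small sketch for the textbook case of a normal mean with known standard deviation, where the CD is the function H(mu) = Phi((mu - xbar) / (sigma / sqrt(n))). The data values, the known sigma, and the helper names `normal_mean_cd` and `cd_interval` are assumptions made for illustration and do not come from the paper.

```python
import math
from statistics import NormalDist, mean


def normal_mean_cd(data, sigma):
    """Confidence distribution for a normal mean when sigma is known.

    H(mu) = Phi((mu - xbar) / (sigma / sqrt(n))), i.e. the CDF of
    N(xbar, sigma^2 / n) read as a function of the parameter mu.
    """
    n = len(data)
    return NormalDist(mu=mean(data), sigma=sigma / math.sqrt(n))


def cd_interval(cd, level):
    """Equal-tailed confidence interval read off the CD's quantiles."""
    alpha = 1.0 - level
    return cd.inv_cdf(alpha / 2), cd.inv_cdf(1 - alpha / 2)


# Hypothetical measurements, made up for illustration.
data = [4.8, 5.1, 5.4, 4.9, 5.3]
cd = normal_mean_cd(data, sigma=0.5)

# One distribution yields an interval at any requested level.
for level in (0.50, 0.90, 0.95, 0.99):
    lower, upper = cd_interval(cd, level)
    print(f"{level:.0%} interval: ({lower:.3f}, {upper:.3f})")
```

The single object `cd` stands in for the whole family of intervals, which is the sense in which a CD summarizes inference about the parameter.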

The first asymptotic framework covers the case in which the size of each component study increases without bound; the second covers the case in which study-specific information is fixed but the number of studies increases. For both cases, the authors derive asymptotic efficiency results for the robust procedures. The authors use two meta-analysis studies (one on prophylactic use of lidocaine after a heart attack; the second on a surgical treatment for stomach ulcers) to compare the robust meta-analysis approaches with conventional model-based meta-analysis approaches.
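As a rough illustration of how study-level CDs might be pooled, the sketch below uses one commonly described combining recipe: transform each study's CD value with the standard normal quantile function, take a weighted sum, and rescale. This recipe, the normal approximation for each study, and the weights 1/s_i are assumptions here, not details taken from the article, and the sketch does not implement the authors' robust methods. The (estimate, standard error) pairs are made up. With these choices the combined CD should be centered at the familiar inverse-variance weighted average.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def study_cd(estimate, std_error):
    """Normal-approximation CD for one study: H_i(theta) = Phi((theta - y_i) / s_i)."""
    return lambda theta: STD_NORMAL.cdf((theta - estimate) / std_error)


def combine_cds(cds, weights):
    """Combined CD: Phi(sum_i w_i * Phi^{-1}(H_i(theta)) / sqrt(sum_i w_i^2)).

    Only valid where every H_i(theta) lies strictly between 0 and 1.
    """
    norm = math.sqrt(sum(w * w for w in weights))

    def combined(theta):
        z = sum(w * STD_NORMAL.inv_cdf(h(theta)) for w, h in zip(weights, cds))
        return STD_NORMAL.cdf(z / norm)

    return combined


# Hypothetical (estimate, standard error) pairs for three studies.
studies = [(0.30, 0.10), (0.10, 0.20), (0.25, 0.15)]
cds = [study_cd(y, s) for y, s in studies]
weights = [1.0 / s for _, s in studies]
combined = combine_cds(cds, weights)

# Inverse-variance weighted average, computed directly for comparison.
precision = sum(1.0 / s**2 for _, s in studies)
pooled = sum(y / s**2 for y, s in studies) / precision

# Should be close to 0.5: the pooled estimate is the combined CD's median.
print(combined(pooled))
```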

Applications and Case Studies

One can argue that biology has been the science most dramatically revolutionized by the large amounts of data emerging from new technologies. Gene sequence data allow scientists to identify individual nucleotide-level variation associated with disease, and gene expression data allow scientists to identify genes whose products may be implicated in a disease pathway. Many recent studies show interesting patterns of correlation among the expression of genes on a chromosome: genes that are not contiguous along the genome may be highly correlated, most likely because of the three-dimensional chromosome folding that occurs to pack our DNA into the cell.

Guanghua Xiao, Xinlei Wang, and Arkady Khodursky, in “Modeling Three-Dimensional Chromosome Structures Using Gene Expression Data,” develop a hierarchical model that links gene expression to key parameters describing the helical structure of the folded genome. They are able to quantify and infer structure (i.e., they can learn about the way the DNA appears to be organized within the cell) by using data from gene expression microarrays. Simulation studies demonstrate the practicality of the approach. Applications show how genes that are not near each other on the genome can be functionally associated because they are brought into close physical proximity by chromosome folding. This statistical approach furthers our insight into the relationship between chromosome structure and function.

A final feature article in the March issue concerns methods for sampling difficult populations. In studies of HIV prevalence, it can be difficult to obtain representative samples because the at-risk population is hard for investigators to reach. At the same time, the population is itself highly interconnected via social networks. This has led to the development of “respondent-driven” sampling, a method whereby an initial sample is selected and subsequent sample members are then selected based on their relationships with earlier sampled units. Of course, when the initial sample is not a probability-based sample, the subsequent samples are not probability samples either.

Unfortunately, there are few alternatives in such settings, so statistical innovations have focused on ways to improve estimation from these samples. Krista Gile’s article, “Improved Inference for Respondent-Driven Sampling Data with Application to HIV Prevalence Estimation,” continues to develop this important approach. She notes that current popular approaches to obtaining inferences from respondent-driven samples assume each round of sampling is carried out “with replacement” from the population and shows that this can lead to bias in various situations. The article presents an alternative approach that respects the “without replacement” aspect of the sampling process. The method is studied in simulations that vary the size of the hidden population and the prevalence of the characteristic of interest. The approach is illustrated with HIV data collected in two countries with varying characteristics and appears to provide new insight into the data.
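A toy simulation can suggest why the with-replacement assumption matters. A common simplification in with-replacement treatments, assumed here rather than quoted from the article, is that a person's chance of being sampled is proportional to their number of network contacts (degree). The sketch below draws units one at a time without replacement, with probability proportional to degree among those not yet sampled, and estimates inclusion rates. The population, the degrees, and the function names are hypothetical, and this is not Gile's estimator.

```python
import random


def successive_sample(degrees, sample_size, rng):
    """Draw units one at a time, without replacement, with probability
    proportional to degree among the units not yet sampled."""
    remaining = list(range(len(degrees)))
    chosen = []
    for _ in range(sample_size):
        weights = [degrees[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


def inclusion_rates(degrees, sample_size, n_reps=2000, seed=0):
    """Estimate each unit's inclusion probability by repeated sampling."""
    rng = random.Random(seed)
    counts = [0] * len(degrees)
    for _ in range(n_reps):
        for unit in successive_sample(degrees, sample_size, rng):
            counts[unit] += 1
    return [c / n_reps for c in counts]


# Hypothetical hidden population: 10 units with degree 1, 10 with degree 5.
degrees = [1] * 10 + [5] * 10
rates = inclusion_rates(degrees, sample_size=15)

low = sum(rates[:10]) / 10    # average inclusion rate, degree-1 units
high = sum(rates[10:]) / 10   # average inclusion rate, degree-5 units
ratio = high / low
print(f"degree ratio: 5.0, inclusion-rate ratio: {ratio:.2f}")
```

Because 15 of the 20 units are sampled, the high-degree units cannot be five times as likely to appear as the low-degree ones, so weights based on degree alone would be off. The gap shrinks as the sampling fraction gets smaller.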

Of course, the above articles are just a sample of March’s offerings. The full list of articles, with downloadable abstracts, can be obtained from the JASA website.
