Big Data Featured in August Special Issue
Peihua Qiu, Technometrics Editor
The August 2016 issue of Technometrics is a special issue on Big Data analysis. A total of 11 papers are included, covering a wide range of topics in describing, analyzing, and computing Big Data.
The first five papers propose numerical algorithms for analyzing Big Data quickly. In “Orthogonalizing EM: A Design-Based Least Squares Algorithm,” Shifeng Xiong, Bin Dai, Jared Huling, and Peter Z. G. Qian propose an efficient iterative algorithm for various least squares problems, motivated by a design of experiments perspective. The algorithm, called orthogonalizing EM (OEM), works for ordinary least squares and can be extended easily to penalized least squares. The main idea of the procedure is to orthogonalize a design matrix by adding new rows and then solve the original problem by embedding the augmented design in a missing data framework.
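The orthogonalize-then-impute idea can be illustrated for ordinary least squares with a minimal sketch. It assumes, as one reading of the summary above rather than the authors' implementation, that rows Δ are added so that Δ'Δ = dI − X'X, with d the largest eigenvalue of X'X, and that the responses for the added rows are treated as missing. Imputing them as Δβ and refitting on the orthogonal augmented design gives the update β ← (X'y + (dI − X'X)β)/d. The function name `oem_least_squares` and the simulated data are hypothetical.

```python
import numpy as np

def oem_least_squares(X, y, n_iter=1000, tol=1e-12):
    """Illustrative OEM-style iteration for ordinary least squares.

    Assumption: the added rows Delta satisfy Delta'Delta = d*I - X'X, so the
    augmented design has orthogonal columns and Delta is never formed explicitly.
    """
    XtX = X.T @ X
    Xty = X.T @ y
    d = np.linalg.eigvalsh(XtX)[-1]        # largest eigenvalue of X'X
    A = d * np.eye(X.shape[1]) - XtX       # plays the role of Delta'Delta
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # E-step: impute the missing responses as Delta @ beta.
        # M-step: least squares on the orthogonal augmented design.
        beta_new = (Xty + A @ beta) / d
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Simulated example (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=200)

beta_oem = oem_least_squares(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

In this full-rank example the iterates approach the usual least squares solution, so `beta_oem` and `beta_ols` should agree to within numerical tolerance.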
In “Speeding Up Neighborhood Search in Local Gaussian Process Prediction,” Robert B. Gramacy and Benjamin Haaland suggest an algorithm for speeding up neighborhood search in local Gaussian process prediction, a technique commonly used in nonlinear and nonparametric prediction problems, particularly as an emulator for computer experiments.
“A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data” by Faming Liang, Jinsu Kim, and Qifan Song proposes a bootstrap Metropolis-Hastings (BMH) algorithm that provides a general framework for adapting powerful MCMC methods to Big Data analysis. The main idea of the algorithm is to replace the full data log-likelihood with a Monte Carlo average of the log-likelihoods calculated in parallel from multiple bootstrap samples.
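A rough sketch of this idea for a toy problem, estimating the mean of normal data with known unit variance, might look as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the bootstrap samples are smaller than the full data set and drawn once, each sample's log-likelihood is rescaled by n/m to match the full-data scale, the prior is flat, and the per-sample terms are computed in one vectorized call rather than in parallel. The names `bootstrap_loglik` and `bmh_sketch` are hypothetical.

```python
import numpy as np

def bootstrap_loglik(theta, boot, n):
    """Average of rescaled N(theta, 1) log-likelihoods over bootstrap samples.

    boot has shape (k, m): k bootstrap samples of size m. The n/m rescaling is
    an assumption of this sketch; additive constants are dropped.
    """
    m = boot.shape[1]
    per_sample = -0.5 * np.sum((boot - theta) ** 2, axis=1)
    return (n / m) * per_sample.mean()

def bmh_sketch(x, k=20, m=500, n_iter=4000, step=0.02, seed=0):
    """Random-walk Metropolis-Hastings on the bootstrap-averaged log-likelihood."""
    rng = np.random.default_rng(seed)
    n = x.size
    boot = rng.choice(x, size=(k, m), replace=True)   # bootstrap samples, drawn once
    theta = 0.0
    ll = bootstrap_loglik(theta, boot, n)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        ll_prop = bootstrap_loglik(prop, boot, n)
        # Flat prior and symmetric proposal: accept on the log-likelihood ratio.
        if np.log(rng.uniform()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        draws[t] = theta
    return draws

# Simulated example (hypothetical data, not from the paper).
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)
draws = bmh_sketch(x)
post_mean = draws[1000:].mean()   # discard the first 1000 draws as burn-in
```

Each Metropolis-Hastings step touches only the k × m bootstrap values rather than all n observations, which is where the computational saving would come from when n is very large.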
“Compressing an Ensemble with Statistical Models: An Algorithm for Global 3D Spatio-Temporal Temperature” by Stefano Castruccio and Marc G. Genton suggests an algorithm for compressing 3D spatio-temporal temperature data using a statistics-based approach that explicitly accounts for the space-time dependence of the data.
“Partitioning a Large Simulation as It Runs” by Kary Myers, Earl Lawrence, Michael Fugate, Claire McKay Bowen, Lawrence Ticknor, Jon Woodring, Joanne Wendelberger, and Jim Ahrens addresses the analysis of data streams, in which data are generated sequentially and storage, transfer, and analysis are all challenging. The authors suggest an online in situ method that identifies a reduced set of time steps and analysis results to save, which significantly reduces data transfer and storage requirements.
The next two papers concern machine learning methods for handling Big Data. “High-Performance Kernel Machines with Implicit Distributed Optimization and Randomization” by Vikas Sindhwani and Haim Avron proposes a framework for massive-scale training of kernel-based statistical models, based on combining distributed convex optimization with randomization techniques.
“Statistical Learning of Neuronal Functional Connectivity” by Chunming Zhang, Yi Chai, Xiao Guo, Muhong Gao, David Devilbiss, and Zhengjun Zhang looks at identifying the network structure of a neuron ensemble beyond the standard measure of pairwise correlations, which is critical for understanding how information is transferred within such a neural population. Spike train data pose a significant challenge to conventional statistical methods because of their complexity, massive size, and high dimensionality. The authors propose a novel structural information enhanced (SIE) regularization method for estimating the conditional intensities under the generalized linear model (GLM) framework to better capture the functional connectivity among neurons.
The last four papers cover specific Big Data problems. “Measuring Influence of Users in Twitter Ecosystems Using a Counting Process Modeling Framework” by Donggeng Xia, Shawn Mankad, and George Michailidis focuses on analyzing data extracted from social media platforms such as Twitter that are both large in scale and complex in nature, since they contain both unstructured text and structured data such as time stamps and interactions between users. The authors develop a modeling framework using multivariate interacting counting processes to capture the detailed actions users undertake on such platforms, namely posting original content and reposting and/or mentioning other users’ postings.
Profile monitoring is an important problem in manufacturing industries. “Discovering the Nature of Variation in Nonlinear Profile Data” by Zhenyu Shi, Daniel W. Apley, and George C. Runger proposes a method for exploratory analysis of a sample of profiles to discover the nature of any profile-to-profile variation present over the sample.
“Variable Selection in a Log-Linear Birnbaum-Saunders Regression Model for High-Dimensional Survival Data via the Elastic-Net and Stochastic EM” by Yukun Zhang, Xuewen Lu, and Anthony F. Desmond proposes a simultaneous parameter estimation and variable selection procedure in a log-linear Birnbaum-Saunders regression model for analyzing high-dimensional survival data.
“Online Updating of Statistical Inference in the Big Data Setting” by Elizabeth D. Schifano, Jing Wu, Chun Wang, Jun Yan, and Ming-Hui Chen develops iterative estimating algorithms and statistical inferences for linear models and estimating equations. The methods target Big Data arising from online analytical processing, where large amounts of data arrive in streams and require fast analysis without storage of, or access to, the historical data.
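For the linear model case, the flavor of updating without access to historical data can be shown with a standard device: keep only the running sums X'X and X'y and refresh the estimate as each block arrives. This is a minimal sketch of that general idea, not the estimators developed in the paper; the class name `OnlineLeastSquares` and the simulated stream are hypothetical.

```python
import numpy as np

class OnlineLeastSquares:
    """Least squares fit that stores only p-by-p and p-vector summaries."""

    def __init__(self, p):
        self.xtx = np.zeros((p, p))   # running sum of X'X over all blocks
        self.xty = np.zeros(p)        # running sum of X'y over all blocks
        self.n = 0                    # number of observations seen so far

    def update(self, X_block, y_block):
        """Fold in a new block of data; the block can then be discarded."""
        self.xtx += X_block.T @ X_block
        self.xty += X_block.T @ y_block
        self.n += X_block.shape[0]

    def coef(self):
        """Current coefficient estimate from the accumulated summaries."""
        return np.linalg.solve(self.xtx, self.xty)

# Simulated stream (hypothetical data): 10 blocks of 100 observations each.
rng = np.random.default_rng(0)
beta_true = np.array([0.5, -1.0, 2.0])
model = OnlineLeastSquares(p=3)
blocks = []
for _ in range(10):
    X_block = rng.normal(size=(100, 3))
    y_block = X_block @ beta_true + rng.normal(size=100)
    model.update(X_block, y_block)
    blocks.append((X_block, y_block))   # kept here only to check the result

X_all = np.vstack([b[0] for b in blocks])
y_all = np.concatenate([b[1] for b in blocks])
beta_full = np.linalg.lstsq(X_all, y_all, rcond=None)[0]
beta_online = model.coef()
```

Because the normal equations depend on the data only through X'X and X'y, the streamed estimate matches the fit on the pooled data while storing just a p × p matrix and a length-p vector.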