Special Issue Focuses on Computer Modeling
David M. Steinberg, Technometrics Editor
Important Terms
Simulator: The computer model itself, which represents the physical phenomenon of interest.
Emulator: An empirical (statistical) approximation to (or surrogate for) a simulator. An emulator is estimated by running the simulator at a variety of input settings and then modeling the resulting output data. Gaussian process models, one of the most popular approaches for generating emulators, treat deterministic simulator output as observed values from a random process, in which output at nearby locations in the input space is highly correlated. The resulting emulator ‘predicts’ the value of the simulator at an untested input setting using standard theory for conditional distributions of a multivariate normal distribution.
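To make the definition concrete, here is a minimal sketch of a Gaussian process emulator in one input dimension. It is an illustration under stated assumptions, not the method of any article in the issue: the `simulator` function is a hypothetical stand-in for an expensive deterministic code, the prior mean is taken to be zero, and the squared-exponential correlation with a fixed `length_scale` is an arbitrary choice that would normally be estimated from the runs.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential covariance: outputs at nearby inputs are highly correlated."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def emulate(x_train, y_train, x_new, nugget=1e-10):
    """Conditional mean and variance of a zero-mean GP at x_new, given simulator runs.

    This is the standard multivariate normal conditioning formula; the small
    nugget is added only for numerical stability.
    """
    K = sq_exp_kernel(x_train, x_train) + nugget * np.eye(len(x_train))
    K_s = sq_exp_kernel(x_new, x_train)
    K_ss = sq_exp_kernel(x_new, x_new)
    mean = K_s @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.clip(np.diag(cov), 0.0, None)

def simulator(x):
    # Hypothetical stand-in for an expensive deterministic computer model.
    return np.sin(2 * np.pi * x)

# Run the simulator at a handful of input settings, then emulate elsewhere.
x_train = np.linspace(0.0, 1.0, 8)
y_train = simulator(x_train)
x_new = np.array([0.33, 0.71])
mean, var = emulate(x_train, y_train, x_new)
```

Because the simulator is deterministic, the emulator reproduces the training runs almost exactly, and its predictive variance shrinks toward zero at inputs that have already been run.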
Computer simulation has become a standard tool for attacking many scientific and engineering problems. Processes are studied using software that simulates nature via a mathematical model; a workstation or computer cluster replaces the test bench. Such computer models generate data (often large amounts) that must be analyzed, and care is needed at the design stage to determine informative simulation settings. Thus, these studies have much in common with conventional laboratory or field experiments. However, they also have some unique features that have triggered a substantial body of statistical research over the last 20 years.
The November 2009 issue of Technometrics is a special collection of articles on statistical problems that arise in computer modeling. The stimulus for this issue was a focus year on the topic held in 2006–2007 at the Statistical and Applied Mathematical Sciences Institute (SAMSI). Participants were invited to submit papers to the special issue; most of the articles are the invited submissions that were accepted after Technometrics’ regular review process.
The lead article, “Design and Analysis of Computer Experiments with Branching and Nested Factors,” is by Ying Hung, V. Roshan Joseph, and Shreyes N. Melkote. It was motivated by a computer experiment in a machining process. Two cutting-edge shapes were considered. One was characterized by the angle and length of the shape. The cutting-edge shape is thus a branching factor and the angle and length are nested factors, with meaning only for the one shape. Other process factors are relevant for both types of cutting edge. Challenging problems arise in the design and analysis of experiments with branching and nested factors. The article develops optimal Latin hypercube designs and kriging methods that can accommodate branching and nested factors. Application of the proposed methods led the team to optimal machining conditions and tool edge geometry, which resulted in a remarkable improvement in the machining process.
Jason Loeppky, Jerome Sacks, and William J. Welch also study a problem in the design of computer experiments in their paper, “Choosing the Sample Size of a Computer Experiment: A Practical Guide.” This paper presents reasons and evidence supporting the informal rule that the number of runs for an effective initial computer experiment should be about 10 times the input dimension. The arguments quantify two key characteristics of computer codes that affect the sample size required for a desired level of accuracy when approximating the code via a Gaussian process. The first characteristic is the total sensitivity of a code output variable to all input variables. The second corresponds to the way this total sensitivity is distributed across the input variables, specifically the possible presence of a few prominent input factors and many impotent ones (effect sparsity). The evidence supporting these properties stems primarily from a simulation study and from specific codes modeling climate and ligand activation of G-protein.
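As a rough illustration of the rule, the sketch below sizes an initial experiment at 10 runs per input and spreads those runs with a random Latin hypercube. The Latin hypercube is a common space-filling choice swapped in here for illustration; the paper itself is about the sample size, and the helper names, the input dimension `d = 4`, and the seed are all hypothetical.

```python
import numpy as np

def initial_sample_size(input_dim, runs_per_input=10):
    """Informal rule of thumb: about 10 runs per input variable."""
    return runs_per_input * input_dim

def latin_hypercube(n_runs, input_dim, rng):
    """Random Latin hypercube on [0, 1)^d.

    Each input's range is cut into n_runs equal-width bins, and every bin
    contains exactly one design point for that input.
    """
    jitter = rng.random((n_runs, input_dim))  # position within each bin
    design = np.empty((n_runs, input_dim))
    for j in range(input_dim):
        bins = rng.permutation(n_runs)  # independent bin order per input
        design[:, j] = (bins + jitter[:, j]) / n_runs
    return design

rng = np.random.default_rng(0)
d = 4                        # hypothetical input dimension
n = initial_sample_size(d)   # 10 * d runs
design = latin_hypercube(n, d, rng)
```

Each row of `design` is one input setting in the unit cube; in practice the columns would be rescaled to the simulator's actual input ranges before the code is run.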
Complex high-dimensional computer models can sometimes be evaluated at different levels of accuracy. Accurate representation of a slow but high-accuracy model may be improved by adding information from a cheap, approximate version of the model. Moreover, results from the latter version may lead to a more informed design for the accurate simulator. These are the issues studied by Jonathan Cumming and Michael Goldstein in their article, “Small Sample Bayesian Designs for Complex High-Dimensional Models Based on Information Gained Using Fast Approximations.” They describe an approach that combines the information from both models into a single multi-scale emulator for the computer model. They then propose a design strategy for the selection of a small number of evaluations of the accurate computer model based on the multi-scale emulator and a decomposition of the input parameter space. The methodology is illustrated with an example concerning a computer simulation of a hydrocarbon reservoir.
Computer models are often used for optimization of complex systems in engineering. In “Bayesian Guided Pattern Search for Robust Local Optimization,” Matthew Taddy, Herbert K. H. Lee, Genetha A. Gray, and Joshua D. Griffin develop a novel approach. By combining statistical emulation using treed Gaussian processes with pattern search optimization, they are able to perform robust local optimization more efficiently and effectively than using either method alone. The approach is based on the augmentation of local search patterns with location sets generated through improvement prediction over the input space. They further develop a computational framework for asynchronous parallel implementation of the optimization algorithm. The methods are demonstrated on two standard test problems and a motivating example of calibrating a circuit device simulator.
Assessment of risk from natural hazards also exploits computer models. “Using Statistical and Computer Models to Quantify Volcanic Hazards,” by M. J. Bayarri, James O. Berger, Eliza S. Calder, Keith Dalbey, Simon Lunagomez, Abani K. Patra, E. Bruce Pitman, Elaine T. Spiller, and Robert L. Wolpert, involves a combination of computer modeling, statistical modeling, and extreme-event probability computation. A computer model of the natural hazard is used to provide the needed extrapolation to unseen parts of the hazard space. Statistical modeling of the available data is needed to determine the initializing distribution for exercising the computer model. In dealing with rare events, direct simulations involving the computer model are prohibitively expensive. The solution instead requires a combination of adaptive design of computer model approximations (emulators) and rare event simulation. The techniques developed for risk assessment are illustrated on a test-bed example involving volcanic flow.