## NSF Funding Opportunities of the CDS&E-MSS Program

*Jia Li, NSF CDS&E-MSS Program Director*

The Division of Mathematical Sciences (DMS) at the National Science Foundation (NSF) launched a new program called Computational and Data-Enabled Science and Engineering in Mathematical and Statistical Sciences (CDS&E-MSS) in October 2011. The goal of the program is to promote the creation and development of the next generation of mathematical and statistical theories and methodologies that will be essential for addressing computational or big-data challenges in various sciences, engineering, and education. Awards have been made for the first round of competition. The submission window for the current fiscal year is November 25 to December 10.

The awarded projects tackle the focused challenges of CDS&E-MSS from diverse subject areas in mathematics and statistics; collectively, they are blended efforts on theory, methodology, application, and software. Abstracts of the funded proposals in this program can be found by searching with Element Code “8069” and NSF Organization “DMS.”

The program funded projects with approaches from pure and computational mathematics, as well as from statistical theory and methods. A sample of funded projects is summarized below for illustrative purposes.

Many conventional statistical methods scale poorly, in both computational complexity and computer memory management, when the data size becomes very large. Several projects address the scalability issue from different angles. Other projects propose novel statistical methods for a diverse collection of highly complex data types, for instance, image, text, genomic, medical, financial, geospatial, and network data.

#### Collaborative Research: Leverage Subsampling for Regression and Dimension Reduction

This project aims to deepen the statistical theory of an innovative sampling methodology and develop high-quality numerical implementations on large real-world data. Subsampling of rows and/or columns of a data matrix has traditionally been employed as a heuristic to reduce the size of large data sets. Recently, a new sampling methodology that uses the empirical statistical leverage scores of the data matrix as a nonuniform importance sampling distribution has been proposed. Understanding the statistical properties of these algorithms is of interest for both fundamental and practical reasons.
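To make the idea of leverage-based sampling concrete, the following is a minimal sketch, not the project's implementation. It assumes the simplest setting, a straight-line fit y = a + b·x, where each leverage score (the diagonal of the hat matrix) has a closed form. Rows are drawn with probability proportional to their leverage and then reweighted by the inverse sampling probability before a weighted least-squares fit. The function names `leverage_scores` and `leverage_subsample_fit` are hypothetical and chosen only for illustration.

```python
import random


def leverage_scores(x):
    """Leverage of each observation for the model y = a + b*x.

    For simple linear regression the i-th diagonal entry of the hat
    matrix is 1/n + (x_i - mean)^2 / Sxx, and the scores sum to 2.
    """
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def leverage_subsample_fit(x, y, r, seed=0):
    """Fit y = a + b*x on r rows sampled with leverage-based probabilities.

    Illustrative assumption: rows are sampled with replacement and each
    sampled row gets weight 1 / (r * p_i) in a weighted least-squares fit.
    """
    rng = random.Random(seed)
    h = leverage_scores(x)
    total = sum(h)
    probs = [hi / total for hi in h]  # nonuniform importance distribution

    idx = rng.choices(range(len(x)), weights=probs, k=r)
    w = [1.0 / (r * probs[i]) for i in idx]
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]

    # Closed-form weighted least squares for intercept and slope.
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    my = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope
```

Points far from the mean of x have high leverage and are therefore sampled more often, which is what distinguishes this scheme from uniform subsampling. For a general data matrix the scores would instead come from a matrix decomposition, which this sketch does not cover.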

#### Statistical Theory and Methods for D&R Analysis of Large Complex Data

This project proposes the reduction of data by exploiting parallel computing. The main idea, called divide and recombine (D&R), is to divide large complex data in some optimized way into subsets and then to apply statistical and visualization methods to each of the subsets separately. The results of each method are recombined across subsets. Compared with parallelizing individual algorithms, this new analysis framework for large complex data can more easily exploit current distributed computational environments.
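The D&R pattern can be sketched in a few lines. The example below is an illustration under stated assumptions, not the project's method: the division is a simple random split (the project refers to an optimized division that is not specified here), the per-subset method is an ordinary least-squares line fit, and recombination is a plain average of the subset estimates. All function names are hypothetical.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = a + b*x to a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope


def divide(points, n_subsets, seed=0):
    """Divide step: shuffle, then deal the points into n_subsets groups.

    Assumption: a random split stands in for the optimized division.
    """
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def recombine(estimates):
    """Recombine step: average the (intercept, slope) pairs across subsets."""
    k = len(estimates)
    return tuple(sum(e[j] for e in estimates) / k for j in range(2))


def divide_and_recombine(points, n_subsets, seed=0):
    subsets = divide(points, n_subsets, seed)
    # Each subset is analyzed independently, so this loop could be handed
    # to a parallel or distributed map without changing the method itself.
    estimates = [fit_line(subset) for subset in subsets]
    return recombine(estimates)
```

Because the per-subset computations share no state, the analysis method itself needs no parallel rewrite, which is the practical advantage the paragraph above describes.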

#### Collaborative Research: Statistical and Computational Models and Methods for Extracting Knowledge from Massive Disparate Data for Quantifying Uncertain Hazards

This project aims to improve scalability by multiple orders of magnitude in several areas. The first is statistical emulation of the output of computer simulation models. The second is multiscale stochastic models, which exploit infinitely divisible distributions for some model features to permit coupled parallel analyses at a range of scales; the coarser scales require less computational effort and run faster, which helps the finer scales reach equilibrium sooner. The third is dynamic evolution models, in which computational effort is focused on those aspects that change most rapidly, while other aspects are treated as slowly varying or piecewise constant. All methods are applied to the same important application area, the quantitative assessment of geophysical hazard for volcanic events.

#### Coarse-to-Fine Discovery for Genetic Association

In this project, scientists in statistics and computer science join forces with biologists and medical researchers to tackle challenges in genetics. The investigators propose a new coarse-to-fine statistical framework motivated by the biomedical hypothesis that mutations contributing to a specific disease cluster in specific pathways, and in genes within these pathways. The researchers convert these heuristics into mathematics and provide a comprehensive analysis, both empirical and theoretical, of the trade-offs resulting from the introduction of carefully chosen biases about the distribution of active variants within genes and pathways. The new methods are applied to data from real genome-wide association studies (GWAS) with large cohorts to validate their utility.

#### Statistical Representations and Algorithms for Brain Connectivity

In this project, the investigators develop advanced statistical methods for brain imaging data to quantify and compare recurring patterns of connectivity of different parts of the brain for individuals and across populations. Such data are routinely collected for many individuals in functional magnetic resonance imaging and are large and complex. A key aspect is that the investigators view each brain as a sampling unit and develop statistical methods that use the entire sample of available brain images to infer common structures and variation in connectivity. These methods are generally applicable for the assessment of dependency structures for spatial processes.

#### Statistical Analysis for Partially Observed Markov Processes with Marked Point Process Observations

Electronic trading in all major world financial markets routinely generates streams of ultra-high-frequency (UHF) data. UHF data have spurred interest in empirical market microstructure and present new and interesting challenges; meeting them is essential for understanding market microstructure, monitoring and regulating markets, and conducting risk management. In this project, theoretical, computational, and implementation issues in the estimation and application of partially observed Markov process models with marked point process observations will be investigated.

#### Statistical Modeling and Computations for Data with Network Structure

In many applications across diverse science and engineering disciplines, researchers deal with multiple networks, arising either from temporal evolution of the data-generating process or from a mixture in the data-generating process. This project aims to develop effective solutions to novel problems arising in the analysis of multiple and time-evolving network structures. Particular emphasis is placed on new theoretical techniques and computational tools for network problems. The research targets open problems in many fields, including biomedical and social science research, where network modeling and analysis play an exceedingly important role.