
New Volume Begins with Invited Perspectives

1 February 2011
Joseph S. Verducci, SAM Editor-in-Chief

One of the special features of Statistical Analysis and Data Mining (SAM) is the occasional invited perspective. To start Volume 4, we are fortunate to have two. The first is a vision of emergent trends that Arnold Goodman has distilled from key presentations made in 2010; the second is James Goodnight’s view of the global rise of analytics in companies, government, and foundations. Combined, these two contributors bring nearly a century of experience using data processing and statistical analysis to solve real-world problems.

One of the founding editors of SAM, Goodman also co-founded, in 1967, the Annual Symposia on the Interface of Computing Science and Statistics, which continue to attract the best and brightest. His contributions span aerospace, petroleum, government, and university consulting. Currently, Goodman is writing a book about collaboration and value creation to maximize success in solving problems, supplying products and services, or managing projects amid evolving complexity.

Goodnight is the chief executive officer of SAS, a leader in business analytics software and services. At the helm since the company’s incorporation in 1976, Goodnight has overseen an unbroken chain of revenue growth—a feat almost unheard of in the software industry. According to Forbes, he is the 35th-richest person in the United States and the 105th-richest in the world.

Five of the eight research articles in this issue involve cluster analysis. These concern clustering when the number of clusters is large, the dimension is large, several clusterings need to be resolved, items and variables need to be jointly clustered, and special graphical constraints are imposed on the clusters.

In “A General Framework for Efficient Clustering of Large Data Sets Based on Activity Detection,” Xin Jin et al. consider the case in which the number K of desired clusters is large and the dimension of the patterns may also be large. Starting with an iterative partitioning-based method similar to the K-means algorithm, they skip patterns that are not close to “active” centers, meaning centers that changed in the previous iteration. This general activity-detection method works with both metric and nonmetric distances and has both approximate and exact formulations.
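The activity-detection idea can be sketched in a few lines. The following is a minimal illustration, not the authors' algorithm: after each K-means iteration, a point is fully re-examined only if its own center moved; otherwise it need only be compared against the centers that moved.

```python
import numpy as np

def kmeans_activity(X, k, iters=100, seed=0):
    """K-means with a simple activity-detection shortcut.  A point is
    re-checked against all centers only if its own center moved; a
    point on a stationary center can only be "stolen" by an active
    (moved) center, since distances to stationary centers are unchanged."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    for _ in range(iters):
        new_centers = centers.copy()
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                new_centers[j] = pts.mean(0)
        active = np.linalg.norm(new_centers - centers, axis=1) > 1e-12
        centers = new_centers
        if not active.any():                  # no center moved: converged
            break
        moved = active[assign]                # points whose own center moved
        new_assign = assign.copy()
        if moved.any():                       # full re-check for these points
            d_all = ((X[moved][:, None, :] - centers[None]) ** 2).sum(-1)
            new_assign[moved] = d_all.argmin(1)
        rest = ~moved                         # compare only against active centers
        if rest.any():
            cur_d = ((X[rest] - centers[assign[rest]]) ** 2).sum(-1)
            d_act = ((X[rest][:, None, :] - centers[active][None]) ** 2).sum(-1)
            better = d_act.min(1) < cur_d
            thief = np.flatnonzero(active)[d_act.argmin(1)]
            idx = np.flatnonzero(rest)
            new_assign[idx[better]] = thief[better]
        assign = new_assign
    return centers, assign
```

On large data sets with many clusters, most centers stop moving after a few iterations, so the second branch compares each remaining point against only the few active centers rather than all K of them.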

“On the Limits of Clustering in High Dimensions via Cost Functions” establishes a negative result for clustering in high dimensions: “[A]bove a certain ratio of random noise to nonrandom information, it is impossible for a large class of cost functions to distinguish between two partitions of a data set.” In particular, when the cost of any partition of the rows of a data matrix is the sum of squared distances from the centroids, Hoyt Koepke and Bertrand Clarke prove that if the signal-to-noise ratio is o(D^(1/2)) as the dimension D increases, then clustering results are meaningless in the sense that all partitions incur asymptotically equivalent cost.
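In symbols, paraphrasing the summary above (the notation here is illustrative, not the authors'):

```latex
% Cost of a partition P = {C_1, ..., C_K} of rows x_i in R^D:
W(P) \;=\; \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_{C_k} \rVert^{2}.
% If the signal-to-noise ratio is o(D^{1/2}) as D -> infinity, then for any
% two partitions P, P' of the same data,
\frac{W(P)}{W(P')} \;\longrightarrow\; 1,
% so this class of cost functions cannot distinguish between them.
```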

In “Bayesian Cluster Ensembles,” Hongjun Wang et al. consider consolidating several clusterings into one robust consensus clustering. The problem is formulated as a fully discrete hierarchical/graphical model, but inference is accomplished by approximating the posterior distribution of the cluster indicator using variational methods. This method may be generalized to include information about the original features of the patterns, not just their base clusterings.

In “Improving the Performance of the Iterative Signature Algorithm for the Identification of Relevant Patterns,” Adelaide Freitas et al. consider biclustering a matrix of data (e.g., a gene × condition microarray of expression values) to identify blocks of genes that are coexpressed under certain sets of conditions. The iterative signature algorithm (ISA) can find possibly overlapping blocks of this type based on normalized means of each row and column of the matrix. The present work makes ISA more robust by using medians of absolute scores, and this modification outperforms ISA under certain conditions, as illustrated with real and simulated data.
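A toy version of the ISA fixed-point loop conveys the structure. This sketch uses medians in place of normalized means, in the spirit of the robustification discussed in the article; the thresholds and defaults are illustrative, not the authors' settings.

```python
import numpy as np

def isa_robust(E, seed_genes, t_g=2.0, t_c=2.0, iters=30):
    """ISA-style biclustering sketch on a genes x conditions matrix E.
    Alternately scores conditions from the current gene set and genes
    from the current condition set, using medians for robustness.
    seed_genes: boolean mask of starting genes."""
    Zc = (E - E.mean(0)) / E.std(0)                       # condition-standardized
    Zr = (E - E.mean(1, keepdims=True)) / E.std(1, keepdims=True)
    rows = seed_genes.copy()
    cols = np.zeros(E.shape[1], bool)
    for _ in range(iters):
        c_score = np.median(Zc[rows], axis=0)             # score each condition
        cols = np.abs(c_score) > t_c * np.median(np.abs(c_score))
        if not cols.any():
            break
        g_score = np.median(Zr[:, cols], axis=1)          # score each gene
        new_rows = np.abs(g_score) > t_g * np.median(np.abs(g_score))
        if not new_rows.any() or (new_rows == rows).all():
            break                                         # fixed point reached
        rows = new_rows
    return rows, cols
```

Because medians ignore a few extreme entries, a handful of outlying expression values cannot pull a whole row or column score the way a mean-based signature can.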

“Agglomerative Connectivity Constrained Clustering for Image Segmentation” considers clustering under the constraint that data points in the same cluster must be connected according to a pre-existing graph. Such is the case in image segmentation in which points are clustered by color and a connectivity constraint is imposed to guarantee the segments are spatially connected. Jia Li formulates a new method (A3C—from the first four words of the title) that she combines with K-means clustering to achieve fast and accurate segmentation.
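The constraint is easy to see in a stripped-down agglomerative loop. The sketch below is not Li's A3C method (which couples the constraint with K-means); it simply merges nearest centroids while allowing a merge only across a pre-existing graph edge, so every final cluster is connected in the graph.

```python
import numpy as np

def connected_agglomerative(X, edges, k):
    """Agglomerative clustering in which two clusters may merge only if
    a graph edge joins them.  Merge criterion: smallest centroid distance."""
    clusters = {i: [i] for i in range(len(X))}
    adj = {i: set() for i in range(len(X))}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    while len(clusters) > k:
        best = None
        for a in clusters:                      # only adjacent pairs compete
            for b in adj[a]:
                if b in clusters and b > a:
                    d = np.linalg.norm(X[clusters[a]].mean(0) -
                                       X[clusters[b]].mean(0))
                    if best is None or d < best[0]:
                        best = (d, a, b)
        if best is None:                        # fewer than k graph components
            break
        _, a, b = best
        clusters[a] += clusters[b]              # absorb b into a
        adj[a] |= adj[b]
        adj[a].discard(a); adj[a].discard(b)
        for c in adj[b]:                        # redirect b's neighbors to a
            adj[c].discard(b)
            if c != a:
                adj[c].add(a)
        del clusters[b], adj[b]
    labels = np.empty(len(X), int)
    for lab, (a, members) in enumerate(clusters.items()):
        labels[members] = lab
    return labels
```

In the image-segmentation setting, the graph would be the pixel grid, so the constraint guarantees spatially connected segments.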

The last three articles deal with classification, prediction of survival, and graphic displays.

In “Exploiting Associations Between Word Clusters and Document Classes for Cross-Domain Text Categorization,” Fuzhen Zhuang et al. consider the classification problem in which items in the test set are drawn from a different distribution than those in the training set. In this case, the associations between word clusters (conceptual features) and document classes may remain stable across the different domains. These associations are detected via the tri-factorization FSGᵀ, which combines word-cluster information F, document-cluster information G, and a stable association matrix S between the two types of clusters.
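The shape of the factorization is simple to illustrate. In the sketch below (not the authors' joint optimization, which learns F, S, and G together), the cluster memberships F and G are taken as given, and only the association matrix S is fit by least squares.

```python
import numpy as np

# X ≈ F S Gᵀ: rows of X are words, columns are documents; F (words x
# word clusters) and G (documents x classes) are membership indicators,
# and S records which word clusters associate with which classes.
def association_matrix(X, F, G):
    Fp = np.linalg.pinv(F)       # equals (FᵀF)⁻¹Fᵀ when F has full column rank
    Gp = np.linalg.pinv(G)
    return Fp @ X @ Gp.T         # least-squares S for X ≈ F S Gᵀ
```

The cross-domain point of the article is that S, unlike raw word-class statistics, tends to remain stable when moving from the training domain to the test domain.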

In their original formulation, random forests are an ensemble classification method using bootstrapped binary decision trees as base learners. In “Random Survival Forests for High-Dimensional Data,” Ishwaran et al. extend the method to predict survival times by averaging the Nelson-Aalen cumulative hazard functions estimated from each tree. In high dimensions, the procedure is made feasible by using the concept of minimal depth to regularize the forests. Interaction plots then display conditional dependence based on the output of the fitted models applied to experimentally obtained data.
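The building block being averaged is the Nelson-Aalen estimator, H(t) = Σ_{tᵢ ≤ t} dᵢ/nᵢ, summing events over the number still at risk. The toy sketch below (not the authors' code) computes that estimator and averages it over bootstrap samples, which here stand in for the forest's trees.

```python
import numpy as np

def nelson_aalen(times, events, grid):
    """Nelson-Aalen cumulative hazard H(t) evaluated on a grid of times.
    events[i] = 1 for an observed event, 0 for a censored observation."""
    order = np.argsort(times)
    t = np.asarray(times, float)[order]
    e = np.asarray(events)[order]
    H = []
    for g in grid:
        # each event at time ti contributes 1 / (number at risk at ti)
        h = sum(1.0 / (len(t) - np.searchsorted(t, ti, side="left"))
                for ti in t[(t <= g) & (e == 1)])
        H.append(h)
    return np.array(H)

def ensemble_hazard(times, events, grid, B=25, seed=0):
    """Average Nelson-Aalen curves over B bootstrap samples, a stand-in
    for averaging the curves from a forest's terminal nodes."""
    rng = np.random.default_rng(seed)
    times, events = np.asarray(times), np.asarray(events)
    curves = []
    for _ in range(B):
        idx = rng.integers(0, len(times), len(times))
        curves.append(nelson_aalen(times[idx], events[idx], grid))
    return np.mean(curves, axis=0)
```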

In “Trellis Display for Modeling Data from Designed Experiments,” Montserrat Fuentes et al. describe a framework for the visualization of dependencies among variables based directly on the raw data. Such plots remain effective, even when the data are rather sparse, as long as the experiment has been designed well enough to keep the error variability small.
