Statistical Analysis and Data Mining Highlights

Featured Papers Cover Application, Methods, Theory

1 November 2010
Joe Verducci, SAM Editor-in-Chief


The five papers in this issue range from application to methods to theory, all moving with the current of the computational and statistical sciences. The first three concern classification: situations where novel classes may arise, where the data are in the form of graphs, and where guided reduction to a subspace of the feature space substantially improves the percentage of correct classifications on test data. The last two papers concern techniques for discovering monotone relationships in subpopulations and a new model selection criterion to aid prediction.

There are thousands of serological variants (serovars) of bacteria—traditionally classified by their surface antigens—with new serovars continually evolving. In “A Machine-Learning Approach to Detecting Unmatched Bacterial Serovars,” Ferit Akova, Murat Dundar, V. Jo Davisson, E. Daniel Hirleman, Arun K. Bhunia, J. Paul Robinson, and Bartek Rajwa quickly classify colonies into known and novel serovars, based on 50 features of colony images from new light-scattering sensors.

Classification is based on a training set of known types, but also allows for novel detection. The underlying model assumes the features have class-specific multivariate normal distributions with conjugate normal and inverted Wishart priors for the mean and covariance matrix of a randomly evolved serovar. New observations are classified into one of the known classes if the maximum posterior probability exceeds a threshold parameter γ, and into a novel class otherwise. The novel class then extends the number of possible classifications. As more samples from a particular novel serovar appear, the posterior covariance of this novel class converges from a highly uncertain, nearly spherical distribution to one with more ellipsoidal contours characteristic of the serovar class.
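
As a concrete illustration of this decision rule, here is a minimal Python sketch. It is not the authors' algorithm: it plugs in ordinary Gaussian class densities rather than the full normal/inverted Wishart posterior predictive, stands in for a not-yet-seen serovar with a single broad, nearly spherical Gaussian, and uses illustrative names and values (classify_or_flag, NOVEL_SCALE, equal prior weights).

```python
import numpy as np
from scipy.stats import multivariate_normal

NOVEL_SCALE = 25.0  # hypothetical variance for the diffuse "novel" density


def classify_or_flag(x, means, covs, gamma=0.9):
    """Return the index of a known class, or -1 to flag x as a novel candidate."""
    d = x.shape[0]
    grand_mean = np.mean(means, axis=0)
    dens = [multivariate_normal.pdf(x, m, c) for m, c in zip(means, covs)]
    # Diffuse, nearly spherical density standing in for a not-yet-seen serovar.
    dens.append(multivariate_normal.pdf(x, grand_mean, NOVEL_SCALE * np.eye(d)))
    post = np.asarray(dens) / np.sum(dens)  # equal prior weights for simplicity
    best = int(np.argmax(post[:-1]))        # best *known* class
    return best if post[best] >= gamma else -1


# Toy usage: two tight known classes; a distant point is flagged as novel.
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [0.25 * np.eye(2), 0.25 * np.eye(2)]
print(classify_or_flag(np.array([0.1, -0.1]), means, covs))   # -> 0
print(classify_or_flag(np.array([10.0, -8.0]), means, covs))  # -> -1
```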

The Bayesian novel detection algorithm that implements the approach clearly outperforms support vector domain description and maximum likelihood procedures in terms of greater area under the receiver operating characteristic curve determined by varying the threshold parameter γ in experiments in which different subsets of five of 28 known classes are treated as novel. Similarly good performance is observed for a classic “letter recognition” data set, suggesting the efficacy of the method extends beyond its use in classifying bacterial serovars.

Key features for classifying graphs take the form of discriminative frequent subgraphs. In “Discriminative Frequent Subgraph Mining with Optimality Guarantees,” Marisa Thoma, Hong Cheng, Arthur Gretton, Jiawei Han, Hans-Peter Kriegel, Alex Smola, Le Song, Philip S. Yu, Xifeng Yan, and Karsten M. Borgwardt address the problem of finding a “near optimal” set S of such features for any fixed size s of the set. They build on the recursive gSpan algorithm, which first identifies frequent subgraphs across all classes. In the basic case of two classes of graphs, A and B, they grade a set S negatively according to the number of pairs (a, b) of graphs, drawn from A and B respectively, that contain exactly the same substructures from S and so cannot be distinguished by it.

This criterion, called CORK (correspondence-based quality criterion), is submodular, which means that appending a new feature X to a set S produces at least as large an improvement as adding that same feature to any superset S′ of S that does not contain X. The submodular property implies that greedy forward feature selection always achieves at least 1 − 1/e ≈ 63% of the optimal quality for any set size s. The authors extend their method to incorporate CORK into the original frequent subgraph finder (gSpanCORK) and to multiclass problems (mcCork). The method is illustrated on 11 real-world data sets.
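
The greedy step that this guarantee licenses is easy to sketch. The toy Python below assumes each graph has already been encoded as a binary vector of frequent-subgraph indicators (the gSpan mining itself is not shown); the function and variable names (cork_quality, greedy_cork, Xa, Xb) are illustrative, not the authors'.

```python
import numpy as np


def cork_quality(Xa, Xb, S):
    """Quality = minus the number of cross-class pairs with identical values on S."""
    if len(S) == 0:
        return -(Xa.shape[0] * Xb.shape[0])   # no feature separates anything yet
    cols = list(S)
    A = Xa[:, cols][:, None, :]               # shape (|A|, 1, |S|)
    B = Xb[:, cols][None, :, :]               # shape (1, |B|, |S|)
    unresolved = (A == B).all(axis=2)         # pairs that S cannot tell apart
    return -int(unresolved.sum())


def greedy_cork(Xa, Xb, s):
    """Greedy forward selection: add the feature with the largest quality gain."""
    selected, remaining = [], set(range(Xa.shape[1]))
    q = cork_quality(Xa, Xb, selected)
    for _ in range(s):
        best_f, best_q = None, q
        for f in remaining:
            q_f = cork_quality(Xa, Xb, selected + [f])
            if q_f > best_q:
                best_f, best_q = f, q_f
        if best_f is None:                    # no remaining feature helps
            break
        selected.append(best_f)
        remaining.discard(best_f)
        q = best_q
    return selected


# Toy usage: four graphs per class, three candidate subgraph indicators.
Xa = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 1]])
Xb = np.array([[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 1]])
print(greedy_cork(Xa, Xb, s=2))               # feature 0 is chosen first
```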

Subspace methods, such as principal components analysis (PCA), are a popular way to reduce dimensionality. They may remove noise from certain classification problems, but they do not incorporate information from class labels and may be sensitive to non-normal data. In “Data Reduction in Classification: A Simulated Annealing-Based Projection Method,” Tian Siva Tian, Rand R. Wilcox, and Gareth M. James use a simulated annealing (SA) stochastic search to find a subspace projection that minimizes cross-validated test error for any chosen classifier. In addition to the typical tuning parameters, the SA can also tune the number of near-zero coefficients, producing either sparse or dense solutions for the spanning vectors of the subspace.
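
A bare-bones version of such a search might look like the sketch below. The classifier (k-nearest neighbors), the Gaussian proposal step, the geometric cooling schedule, and the hard threshold that zeroes out small loadings are illustrative choices, not the authors' settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def cv_error(X, y, W):
    """Cross-validated error of a fixed classifier on the projected data X @ W."""
    clf = KNeighborsClassifier(n_neighbors=5)
    return 1.0 - cross_val_score(clf, X @ W, y, cv=5).mean()


def sa_projection(X, y, k=2, iters=100, step=0.1, temp=1.0, cool=0.97,
                  sparsity_threshold=0.0, seed=0):
    """Anneal a p-by-k projection matrix toward lower cross-validated error."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], k))
    err = cv_error(X, y, W)
    for _ in range(iters):
        W_new = W + step * rng.standard_normal(W.shape)
        W_new[np.abs(W_new) < sparsity_threshold] = 0.0   # optional sparse solution
        err_new = cv_error(X, y, W_new)
        # Accept improvements always; accept worse moves with a cooling probability.
        if err_new < err or rng.random() < np.exp((err - err_new) / temp):
            W, err = W_new, err_new
        temp *= cool
    return W, err


# Toy usage on synthetic data (takes a few seconds to run).
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=150, n_features=20, n_informative=5,
                           random_state=0)
W, err = sa_projection(X, y, k=3, iters=50)
print(f"cross-validated error of the annealed projection: {err:.3f}")
```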

The sparse and dense SA algorithms are compared with classical (CPCA) and robust (ROBPCA) PCA subspace methods, as well as with no data reduction, each used in conjunction with seven types of classifiers. Comparisons were made in simulated experiments and on actual data from microarray and fMRI studies. The SA methods show potential for substantial improvement.

An often-important goal in data mining is the identification of local nonlinear correlation. Nonlinear paths in high-dimensional space that minimize squared distance to the data points are called principal curves. Chandan K. Reddy and Mohammad S. Aziz find local nonlinear correlation by first limiting principal curves to two-dimensional subspaces formed by pairs of features, and then using a k-segments algorithm to further restrict these curves to regions in which they display “soft” monotonicity; that is, regions where the proportion of data points allowed to be anti-monotonic with the curve is limited.

Soft monotonicity in higher-dimensional feature spaces requires soft monotonicity for all pairs of features in that space. The extension of softly monotone curves to higher dimensions rests on the property that the projections of any such curve onto lower-dimensional feature spaces must coincide with the curves already discovered in those spaces. “Modeling Local Nonlinear Correlations Using Subspace Principal Curves” provides a complete development, including theory, algorithm, and examples.
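
The soft-monotonicity test itself is simple to illustrate. The sketch below checks an ordered sequence of points in a single two-feature subspace; the k-segments fitting of the principal curve is not shown, and the tolerance parameter tol is an illustrative name for the allowed fraction of anti-monotone steps.

```python
import numpy as np


def is_softly_monotone(ordered_points, tol=0.1):
    """True if at most a `tol` fraction of consecutive steps along the curve
    move the two coordinates in opposite directions (anti-monotone steps)."""
    steps = np.diff(np.asarray(ordered_points, dtype=float), axis=0)
    anti = np.sum(steps[:, 0] * steps[:, 1] < 0)
    return anti / max(len(steps), 1) <= tol


# Toy usage: a mildly noisy increasing relationship passes the check,
# while an inverted-U relationship (half of its steps reverse) does not.
t = np.linspace(0, 1, 100)
increasing = np.column_stack([t, t**2 + 0.005 * np.sin(40 * t)])
inverted_u = np.column_stack([t, t * (1 - t)])
print(is_softly_monotone(increasing))   # True
print(is_softly_monotone(inverted_u))   # False
```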

In classic multivariate normal regression, the list of information criteria proposed to select a unique, minimal, true set of variables—which determine the model—has exhausted the alphabet used to name them. Some of these criteria behave badly when the number p_n of variables is large relative to the sample size n. In “Model Selection for High-Dimensional Data,” Yongli Zhang and Xiaotong Shen look at information criteria of the form (sum of squared residuals) + λ|M|σ², where λ is a penalty, |M| is the number of variables in the model M, and σ² is the inherent variance. They derive a lower bound on the probability that the minimal true model M0 is selected by the criterion and use it to guide the choice of λ.

The optimal value produces a new information criterion, RICc, similar to the risk inflation criterion but with an additional correction term that becomes important only when p_n is large. Because an exhaustive search to minimize this criterion is not feasible, they suggest combining the Lasso with RICc, with an additional adjustment for the bias of the Lasso. The resulting procedure generates the Lasso sequence of models and stops at the model that minimizes the bias-adjusted RICc. In simulations and on actual data, the method performs well compared with other criteria applied to either the Lasso or backward stepwise sequences of models.
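
The overall recipe, walking the Lasso path and keeping the model that minimizes a penalized criterion of the form above, can be sketched as follows. The penalty used here (λ = 2 log p, the original risk inflation criterion) and the bias adjustment (refitting least squares on each active set) are stand-ins; the paper's RICc correction term and its specific Lasso bias adjustment are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, lasso_path


def select_along_lasso_path(X, y, sigma2):
    """Walk the Lasso path and keep the model minimizing SSR + lam * |M| * sigma2."""
    n, p = X.shape
    lam = 2.0 * np.log(p)                     # RIC-style penalty, a placeholder
    _, coefs, _ = lasso_path(X, y)            # coefs has shape (p, number of path points)
    best_crit, best_support = np.inf, np.array([], dtype=int)
    for j in range(coefs.shape[1]):
        support = np.flatnonzero(coefs[:, j])
        if support.size == 0:
            ssr = np.sum((y - y.mean()) ** 2)
        else:
            # Refit least squares on the active set to undo the Lasso shrinkage.
            fit = LinearRegression().fit(X[:, support], y)
            ssr = np.sum((y - fit.predict(X[:, support])) ** 2)
        crit = ssr + lam * support.size * sigma2
        if crit < best_crit:
            best_crit, best_support = crit, support
    return best_support


# Toy usage: 3 of 50 variables are truly active.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
beta = np.zeros(50)
beta[[3, 17, 40]] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(100)
print(select_along_lasso_path(X, y, sigma2=1.0))   # expected: [ 3 17 40]
```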
