Three Interrelated Papers Spotlighted
Joe Verducci, Editor, Statistical Analysis and Data Mining
Volume 3, issue 3 features three interrelated papers. The first proposes a new automatic criterion for selecting the bandwidth to be used in Gaussian kernel support vector machines (SVMs). The second proposes a sequential version of SVMs, called twin prototype SVMs (TVMs), which efficiently updates a fixed number of support vectors when training data arrive sequentially and storage capacity is limited. The third paper also addresses data streams, but there the summarization is in terms of hidden factors that link multivariate inputs to responses.
In “A Stable Hyperparameter Selection for the Gaussian RBF Kernel for Discrimination,” Jeongyoun Ahn provides a geometrical interpretation of the smoothing parameter in terms of the feature mapping implied by the radial basis function of a Gaussian kernel with bandwidth h. For small values of h, points get mapped to near uniformity on a hypersphere, whereas large values preserve the original distances between data points. Since the SVM is essentially a linear discriminator in the feature space, a natural (geometry-based) criterion is GB(h), the between-class sum of squares minus the within-class sum of squares in the feature space, and h is chosen to maximize GB(h). This choice applies to any linear discriminator in the feature space, is computationally very efficient, has low variability under many underlying models, and tends to achieve better tuning than other methods in terms of minimizing the misclassification rate. This last property is illustrated using nine benchmark data sets.
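The kernel trick makes a criterion of this form cheap to evaluate: squared feature-space distances reduce to 2 - 2K(x_i, x_j), so the between- and within-class sums of squares follow from block sums of the kernel matrix. The sketch below illustrates the idea only; the bandwidth convention exp(-||x_i - x_j||^2 / h^2), the candidate grid, and the exact normalization of the criterion are assumptions, not necessarily the paper's definitions.

```python
import numpy as np

def gaussian_kernel(X, h):
    # K[i, j] = exp(-||x_i - x_j||^2 / h^2); this bandwidth convention is an assumption
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / h**2)

def gb_criterion(X, y, h):
    """Between-class minus within-class sum of squares in the feature space,
    computed via the kernel trick (K[i, i] = 1 for the Gaussian kernel)."""
    K = gaussian_kernel(X, h)
    n = len(y)
    total = n - K.sum() / n                  # total SS around the grand feature mean
    within = 0.0
    for c in np.unique(y):
        idx = (y == c)
        nc = idx.sum()
        within += nc - K[np.ix_(idx, idx)].sum() / nc
    between = total - within
    return between - within                  # GB(h): maximize over a grid of h

# pick h on a small grid for a toy two-class problem
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((20, 2)), rng.standard_normal((20, 2)) + 3.0])
y = np.array([0] * 20 + [1] * 20)
h_best = max([0.1, 0.5, 1.0, 3.0, 10.0], key=lambda h: gb_criterion(X, y, h))
```

Because the criterion needs only kernel block sums, no SVM is trained per candidate h, which is the source of the computational efficiency noted above.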
However, SVMs have some potential shortcomings. They can be overly sensitive to outliers, and the number of support vectors needed to determine the classification boundary grows linearly with sample size. The latter property is particularly troublesome when large amounts of training data are streaming in and only a fixed, budgeted amount of storage is available. Sensitivity to outliers can be fixed by replacing the SVM's hinge loss function with a ramp loss that ignores all large deviations from the boundary, but this comes at a high computational expense.
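The trade-off is easy to see from the losses themselves: the hinge penalty grows without bound as a point moves further onto the wrong side of the boundary, while the ramp loss caps it. A minimal sketch (the cap parameter s below is an illustrative choice):

```python
def hinge_loss(margin):
    # standard SVM hinge: penalty grows linearly with the violation,
    # so a single gross outlier can dominate the fit
    return max(0.0, 1.0 - margin)

def ramp_loss(margin, s=-1.0):
    # ramp loss: identical to the hinge near the boundary, but capped at 1 - s,
    # so points far on the wrong side stop pulling on the solution
    return min(1.0 - s, hinge_loss(margin))
```

For a badly misclassified point with margin -10, the hinge penalty is 11 while the ramp penalty stays at 2; the price is that the ramp loss is non-convex, which is where the extra computational expense comes from.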
In “Online Training on a Budget of Support Vector Machines Using Twin Prototypes,” Zhuang Wang and Slobodan Vucetic propose using a fixed number of prototypes in place of support vectors. To accommodate a new example arriving near the current boundary, either the prototype farthest from the boundary is removed or two nearby prototypes are merged, and the boundary is then updated. In addition to being computationally efficient, this TVM attains accuracy comparable to that of the unconstrained SVM, as reported for 12 large benchmark data sets.
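The merge step can be caricatured in a few lines. This is only a toy sketch of the bookkeeping: the prototype weights, the nearest-pair rule, and the same-class restriction are simplifying assumptions, and the actual TVM algorithm also updates the decision boundary after every removal or merge.

```python
import numpy as np

def merge_two_nearest(prototypes, weights, labels):
    """Merge the two nearest same-class prototypes into their weighted mean,
    freeing one budget slot for a newly arrived example."""
    m = len(prototypes)
    best = None
    for i in range(m):
        for j in range(i + 1, m):
            if labels[i] == labels[j]:
                d = np.linalg.norm(prototypes[i] - prototypes[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
    if best is None:                      # no same-class pair to merge
        return prototypes, weights, labels
    _, i, j = best
    wi, wj = weights[i], weights[j]
    merged = (wi * prototypes[i] + wj * prototypes[j]) / (wi + wj)
    keep = [k for k in range(m) if k not in (i, j)]
    prototypes = np.vstack([prototypes[keep], merged])
    weights = np.append(weights[keep], wi + wj)
    labels = np.append(labels[keep], labels[i])
    return prototypes, weights, labels
```

Because the merged "twin" carries the combined weight of its parents, the prototype set keeps summarizing all the examples it has absorbed while its size stays within the budget.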
An interesting extension of learning from streaming data occurs when the response is not just a classification, but a real-valued vector Y, and the objective is to learn a linear regression linking Y to the input vector X. Giovanni Montana and Brian McWilliams tackle this problem in “Sparse Partial Least Squares Regression for Online Variable Selection with Multivariate Data Streams.”
A motivating problem is tracking multiple financial indexes, such as the S&P 100 and the Nikkei, using only a minimal number of distinct stocks. Novel techniques include regularizing the cross-covariance matrix M of X and Y to simplify partial least squares (PLS) estimation to ordinary least squares (OLS), which allows for sparse estimation by penalizing the L1 norm of the coefficients. The incremental Sparse PLS (iS-PLS) algorithm is the first to combine tracking of latent factors with variable selection in an adaptive fashion for data streams. The iS-PLS procedure allows the number of important latent factors and their weights to evolve over time; the important variables retained within each latent factor also evolve over time, but their number does not. This method is validated on both simulated and real data, including enhanced index tracking in which individual stocks are selected to outperform the indexes being tracked by a fixed percentage.
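The L1 mechanics can be illustrated on the first weight vector in a batch setting. This sketch is not the iS-PLS algorithm itself, which updates M incrementally as data arrive; the thresholding rule shown, the proximal operator of the L1 norm, is one standard way to impose sparsity and is an assumption about the penalized form.

```python
import numpy as np

def soft_threshold(v, lam):
    # elementwise soft-thresholding: the proximal operator of the L1 norm
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pls_direction(X, Y, lam):
    """First sparse PLS weight vector: the leading left singular vector of the
    cross-covariance M = X'Y, soft-thresholded so that unimportant input
    variables get exactly zero weight (a toy batch analogue of iS-PLS)."""
    M = X.T @ Y
    u, _, _ = np.linalg.svd(M, full_matrices=False)
    w = soft_threshold(u[:, 0], lam)
    nrm = np.linalg.norm(w)
    return w / nrm if nrm > 0 else w
```

In the index-tracking application, the zeroed coordinates correspond to stocks excluded from the tracking portfolio, so the penalty directly controls how few distinct stocks are held.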
As a whole, these papers provide a snapshot of current research in classification and regression, making the procedures more self-adaptive and extending them to streaming data, either from a stable distribution or one subject to local trends, such as a market factor.