
Prize-Winning Articles Featured

1 August 2010
Joe Verducci, Statistical Analysis and Data Mining Editor

    This issue contains an application on drug safety, two papers on clustering, and three prize-winning works.

    The application features a method to account for covariates when searching for binary risk factors. The first clustering paper compares 40 criteria for clustering, both in terms of their relative performance over 1,080 designed data sets and their agreement with an external clustering considered a gold standard. The second offers a novel method, called the “snake,” to provide visual diagnostics along a near-minimal path.

    The three prize-winning works come from two competitions: the Institute for Operations Research and Management Sciences (INFORMS) 2009 Data Mining Contest and the ASA’s Statistical Learning and Data Mining (SLDM) prizes for best student papers.

    In the opening paper, Ola Caster, G. Niklas Norén, David Madigan, and Andrew Bate propose shrinkage logistic regression as a supplement to contingency tables for discovering binary transaction patterns that may be camouflaged by covariates. When applied to adverse drug reaction data collected by the World Health Organization, the method discovers combinations of risk factors faster than methods based on frequent sets, but may fail to identify established drug safety concerns. The key is that logistic regression can distinguish direct associations from indirect associations caused by confounding.
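    To give a feel for the mechanism, here is a minimal sketch using L1-penalized (lasso) logistic regression, one common form of shrinkage; the simulated drugs, co-prescription rates, and tuning constant below are illustrative assumptions, not taken from the WHO data or the paper.

```python
# Sketch: shrinkage logistic regression vs. a marginal 2x2-table analysis.
# Simulated example (not the WHO data): drug_b is mostly co-prescribed
# with drug_a, but only drug_a actually raises the reaction risk.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
drug_a = rng.random(n) < 0.10
drug_b = ((rng.random(n) < 0.70) & drug_a) | (rng.random(n) < 0.02)
p = 0.01 + 0.10 * drug_a            # only drug_a is causal
reaction = rng.random(n) < p

# A marginal analysis flags drug_b through confounding ...
rr = lambda d: reaction[d].mean() / reaction[~d].mean()
print(f"relative risk -- drug_a: {rr(drug_a):.1f}, drug_b: {rr(drug_b):.1f}")

# ... while the shrinkage regression attributes the risk largely to drug_a.
X = np.column_stack([drug_a, drug_b]).astype(float)
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, reaction)
print("shrunken coefficients (drug_a, drug_b):", fit.coef_.round(2))
```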

    Lucas Vendramin, Ricardo J. G. B. Campello, and Eduardo R. Hruschka distinguish two types of cluster validity criteria: optimization-like criteria, which assign a real value to any partitioning of the objects, and difference-like criteria, which assess relative performance along a nested sequence of clusterings. A novel transformation of a difference-like criterion into an optimization-like criterion enables fair comparison. Five methods are used to cluster each data set. For each clustering, the Jaccard coefficient measures agreement with the external standard, and each criterion is scored by the Pearson correlation between its values and the Jaccard coefficients over the five clusterings. The criteria are then judged by their average correlation over groups of data sets. Performance varies somewhat with the number of clusters and the dimension; overall, versions of the silhouette width criterion perform best over most scenarios investigated here.
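    The evaluation protocol itself is easy to sketch. The toy version below uses k-means over a range of k in place of the paper's five clustering methods, and the silhouette width as the internal criterion; the data set and settings are illustrative only.

```python
# Sketch of the protocol: score candidate clusterings with an internal
# criterion (silhouette width) and correlate those scores with the
# pair-counting Jaccard agreement against the known labels.
import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two partitions."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    iu = np.triu_indices(len(labels_a), k=1)     # each pair counted once
    both = np.sum(same_a[iu] & same_b[iu])
    either = np.sum(same_a[iu] | same_b[iu])
    return both / either

X, truth = make_blobs(n_samples=300, centers=4, random_state=1)
sil, jac = [], []
for k in range(2, 7):                            # five candidate clusterings
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    sil.append(silhouette_score(X, labels))
    jac.append(pair_jaccard(labels, truth))
r, _ = pearsonr(sil, jac)
print(f"Pearson correlation of silhouette with Jaccard agreement: {r:.2f}")
```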

    Adam Petrie and Thomas Willemain employ techniques from the traveling salesman problem to construct a near-minimal path that traverses all data points in Euclidean space. By plotting individual segment lengths versus their position along the snake path, an analyst can visually detect features such as the relative density of regions and the number of modes. The technique is illustrated on a variety of artificial and real-world data sets.
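    A rough sense of the diagnostic can be had with a greedy nearest-neighbor path, a simple stand-in (my assumption, not the paper's construction) for the more careful near-minimal TSP path the authors build:

```python
# Sketch of a "snake"-style plot: order the points along a greedy
# nearest-neighbor path and plot segment lengths against path position.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Two clusters of different density: the dense cluster shows up as a run
# of short segments, the gap between clusters as one long segment.
X = np.vstack([rng.normal(0, 0.3, (150, 2)), rng.normal(5, 1.0, (50, 2))])

unvisited = list(range(len(X)))
path = [unvisited.pop(0)]
while unvisited:
    last = X[path[-1]]
    nearest = min(unvisited, key=lambda i: np.linalg.norm(X[i] - last))
    unvisited.remove(nearest)
    path.append(nearest)

segments = np.linalg.norm(np.diff(X[path], axis=0), axis=1)
plt.plot(segments)                    # the snake plot itself
plt.xlabel("position along path")
plt.ylabel("segment length")
plt.savefig("snake.png")
```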

    The INFORMS 2009 Data Mining Contest posed two problems based on hospital patient information: identify future transfers to tertiary hospitals and predict in-hospital patient mortality. Jianjun Xie and Stephen Coggeshall use stochastic gradient-boosted decision trees to identify key variables and arrive at their predictions, whose accuracies substantially exceed those achieved by logistic regression, both here and as reported in similar studies. Their paper describes the practical difficulties posed by the data and the decisions they made when implementing the procedure.
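    The basic recipe is readily reproduced with standard tooling. The sketch below uses scikit-learn's GradientBoostingClassifier (subsample < 1 supplies the "stochastic" part) on synthetic data standing in for the contest's hospital records; all parameter choices here are illustrative, not the winners' settings.

```python
# Sketch: stochastic gradient boosting vs. a logistic regression baseline,
# plus the variable importances used to identify key predictors.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, n_informative=8,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

gbm = GradientBoostingClassifier(n_estimators=300, subsample=0.7,
                                 max_depth=3, random_state=3).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for name, model in [("boosting", gbm), ("logistic", lr)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name:8s} AUC = {auc:.3f}")
print("top 5 variables:", gbm.feature_importances_.argsort()[::-1][:5])
```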

    The SLDM best student papers involve new procedures for classification. Mu Qiao and Jia Li propose a “two-way” mixture model in each class, so that observations are partitioned into components and variables are partitioned into clusters. Clustering of variables may be specific to each component or common to a class. Each cluster of variables is assumed to have a Gaussian distribution with the same mean for each variable, and perhaps a structured (e.g., diagonal) covariance. On three real data sets, classification based on the Gaussian two-way model performed as well as or better than a support vector machine.
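    For orientation, here is the plain one-way baseline: a Gaussian mixture fit per class, with classification by Bayes' rule. The paper's two-way extension additionally ties means within clusters of variables; that tying step is omitted here, so this sketch shows only the scaffolding the method builds on.

```python
# Sketch: mixture-model classification via Bayes' rule, one Gaussian
# mixture per class. (The two-way variable-clustering step is omitted.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

models, priors = {}, {}
for c in np.unique(y_tr):
    Xc = X_tr[y_tr == c]
    models[c] = GaussianMixture(n_components=3, covariance_type="diag",
                                random_state=4).fit(Xc)
    priors[c] = len(Xc) / len(X_tr)

# Posterior-maximizing class: log prior + mixture log-likelihood.
scores = np.column_stack([np.log(priors[c]) + models[c].score_samples(X_te)
                          for c in models])
accuracy = (scores.argmax(axis=1) == y_te).mean()
print(f"test accuracy: {accuracy:.3f}")
```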

    The last paper, by Seo Young Park, Yufeng Liu, Dacheng Liu, and Paul Scholl, focuses on computationally efficient multicategory classification when the number of classes is large. The proposed composite least squares (CLS) method uses a convex combination of two types of squared loss functions to improve on the proximal SVM while keeping computational complexity linear in the number of classes. The CLS method also yields closed-form formulas for predicting class probabilities.
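    The computational point, though not the paper's exact loss, can be illustrated with a generic one-vs-rest squared-loss classifier: because the loss is quadratic, each class's discriminant has a closed form, and a single matrix factorization is reused across all classes, so the cost grows linearly in the number of classes. The data and ridge constant below are placeholders.

```python
# Illustration of the linear-in-classes cost of squared-loss methods:
# one Cholesky factorization of X^T X + lam*I serves every class.
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from sklearn.datasets import make_classification

k = 30                                                   # many classes
X, y = make_classification(n_samples=3000, n_features=40, n_informative=20,
                           n_classes=k, random_state=5)
lam = 1.0
factor = cho_factor(X.T @ X + lam * np.eye(X.shape[1]))  # factor once
Y = np.where(y[:, None] == np.arange(k), 1.0, -1.0)      # +/-1 target per class
W = cho_solve(factor, X.T @ Y)        # all k class solves reuse the factor

accuracy = ((X @ W).argmax(axis=1) == y).mean()
print(f"training accuracy over {k} classes: {accuracy:.2f}")
```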

    As a whole, these works give a good perspective on current research at various levels of application and methodology. Computer scientists and statisticians are finding much common ground, and it is informative to trace the roots of each line of work through the references provided.

