
Special Issue Features Symbolic Data Analysis

1 June 2011
Lynne Billard, Special Issue Editor

    This issue of Statistical Analysis and Data Mining (SAM) is dedicated to papers from the field of symbolic data analysis (SDA), one of the emerging fields featured in the February issue of SAM in Arnold Goodman’s article, “Emerging Topics and Challenges for Statistical Analysis and Data Mining.” This field will come to dominate our thinking and our approach to statistical analyses as contemporary computational capabilities expand.

    While there are many data sets (small or large) that are naturally symbolically valued, it is in handling massively large data sets that we will see a major role for symbolic data. Such data sets typically arise from aggregation directed by the underlying scientific question(s), and the aggregated data perforce contain symbolic values. In its most basic description, the familiar classical realization of a random variable is represented by a single point in p-dimensional space, whereas a symbolically valued realization is a hypercube or distribution in p-dimensional space.
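As a minimal sketch of the aggregation idea described above (the data, group names, and variable names are invented for illustration), classical point readings collapsed by group yield interval-valued symbolic realizations rather than single points:

```python
# Hypothetical illustration: aggregating classical point data into
# interval-valued symbolic data. All names and values are invented.
from collections import defaultdict

# Classical data: one (city, temperature) point per observation.
readings = [
    ("Athens", 31.0), ("Athens", 34.5), ("Athens", 29.8),
    ("Bergen", 12.1), ("Bergen", 15.4), ("Bergen", 10.9),
]

groups = defaultdict(list)
for city, temp in readings:
    groups[city].append(temp)

# Each aggregated realization is an interval [min, max], not a point;
# the interval retains internal variation that a group mean would lose.
symbolic = {city: (min(vals), max(vals)) for city, vals in groups.items()}
print(symbolic)
```

Here each city's observation is a one-dimensional hypercube (an interval); with p variables per city, the aggregated realization would be a hypercube in p-dimensional space.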

    The concept of symbolic data was first introduced in 1987 by Edwin Diday, who at the time was engaged in research on clustering methodologies. He recognized early that summarizing the data values inside a cluster retained only limited information about that cluster (e.g., cluster means), while considerable information (e.g., internal cluster variations) was lost entirely. This led to the general concept of symbolic data. It was a radical concept at the time; yet it is an expansively creative approach to thinking about data in new ways that retain more completely the knowledge and information contained therein. It also quickly became apparent that new methodologies for symbolic data realizations were essential.

    Diday’s influence pervades all the contributions in this issue. The first two papers are review papers. “Brief Overview of Symbolic Data and Analytic Issues” is what an editorial board member called a “gentle introduction” to the field, restricting its content to illustrating the major types of symbolic data (modal multivalued, intervals, and histograms) and highlighting issues that are not usually present in classical data analyses. The paper also illustrates how other forms of complex data, such as fuzzy data, are quite a different domain, requiring different types of analytic methodologies. It concludes with a perspective on future research directions.

    “Far Beyond the Classical Data Models: Symbolic Data Analysis” begins with a more technical introduction to symbolic data, how they arise, and their structures with numerous examples. Then, the authors provide an extensive review of the current state of the art, most especially available methodologies to handle a range of statistical analyses for various situations.

    The third and fourth papers deal with aspects of principal component analysis. “Principal Component Analysis with Interval Imputed Missing Values” estimates missing values as interval values in a way that allows for the degree of uncertainty in the missingness. Then, the so-called vertices method for principal component analysis is applied. Also, new theoretical results are derived to support the methodology.
    Much of the current methodology in SDA has dealt with interval-valued observations. “The Quantile Method for Symbolic Principal Component Analysis” breaks new ground by subdividing modal multivalued data and histogram data into quantiles (more general than, but reminiscent of, Tukey’s five-number summaries). The author develops monotone structures characterized by nested joint regions, so that traditional principal component analyses can proceed.
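A small sketch of the quantile reduction mentioned above (the sample values are invented; the paper's actual construction of nested joint regions is more elaborate): a sample is summarized by Tukey-style five-number quantiles using only the standard library.

```python
# Hypothetical sketch: reducing a sample to a five-number quantile
# summary (min, Q1, median, Q3, max), reminiscent of Tukey's summaries.
# The sample data are invented for illustration.
import statistics

sample = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, med, q3 = statistics.quantiles(sample, n=4, method="inclusive")
five_number = (min(sample), q1, med, q3, max(sample))
print(five_number)
```

Subdividing a histogram-valued observation into such quantile cut points gives a monotone (nondecreasing) summary for each observation, which is the kind of structure that lets classical principal component machinery be applied.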

    The fifth paper, “Clustering Large Data Sets Described with Discrete Distributions and Its Application on TIMSS Data Set,” presents two clustering methods—the adapted leaders method and the adapted agglomerative hierarchical clustering Ward’s method—for data for which each realization is a discrete distribution. The new methodology is applied to a TIMSS data set.

    Time series in the symbolic setting is a difficult problem and has received little attention. “Smoothing Methods for Histogram-Valued Time Series: An Application to Value-at-Risk” tackles both time series and histogram data. Using the notion of a barycenter histogram, the authors develop a one-step-ahead histogram forecast applied to financial value-at-risk data.

    The last paper, “Principal Component Analysis for Interval-Valued Observations,” expands upon the vertices method and introduces new ways to visualize and interpret the resulting principal component hypercubes by placing bounds on the contributions of each vertex; it also shows how results obtained by using classical surrogates are inadequate.

    These articles provide a small window into SDA’s dynamic future.
