Iowa State Team Is Top U.S. Team in 2013 Data Mining Cup
A data analytics team from the Department of Statistics at Iowa State University placed fifth overall, and first among all teams from U.S. institutions, in the 2013 Data Mining Cup (DMC), an international competition hosted by the German data analytics company Prudsys AG.
The Iowa State team included six statistics PhD students: Cory Lanker, Fangfang Liu, Jia Liu, Ian Mouzon, Wei Zhang, and (team leader) Wen Zhou. They entered the 2013 DMC to apply material from a doctoral-level course in machine learning to a large, real-world predictive analytics problem. The course, taught by Steve Vardeman, covers such topics as linear methods of prediction and classification, basis expansions and regularization, kernel smoothing methods, variance-bias trade-offs, inference and model averaging, additive models and trees, boosting, neural nets, support vector machines, prototype methods, unsupervised learning, random forests, and ensemble learning.
This year, the competition required teams to develop an algorithm to predict whether a visitor to a retail website will place an order. Customers who visit online shops carry out various “transactions” during any given session or visit. Transactions may include clicking on specific products to read more about them, adding or removing products from the shopping cart, etc. At the end of a session, the visitor may place an order for one or more products or end the session without any purchases. The goal of the 2013 DMC was to develop a method to predict whether the visitor will place an order on the basis of the transaction data collected during the session.
So that teams could develop their methods, Prudsys AG made available a large dataset with historical information about transactions and outcomes (purchased/did not purchase) from a German retailer. The training data corresponded to 50,000 customer sessions and included almost a half million transactions. The test dataset included 5,111 sessions for which contestants were asked to predict whether a purchase had or had not resulted. To win the competition, a team needed to make the fewest classification mistakes on the 5,111 test sessions.
The strategy adopted by the Iowa State team combined a variety of methodologies from what is currently known as data analytics, machine learning, or statistical learning. Data analytics is not a new field, but it has grown in importance as the amount of data collected by the private sector, government, and universities has ballooned. The goal of statistical learning is to find the intrinsic patterns hidden in such data and enable researchers and practitioners to make reliable predictions and accurate forecasts. Modern statistical methods for Big Data have become critical to designing management strategies and making decisions, and to areas as diverse as drug discovery, climate modeling, and finance.
The winning team, from the Technical University of Dortmund, made 144 classification errors in the 5,111 test sessions (2.82% error rate). The Iowa State team made only 10 more errors than the winning team and ended the competition with an error rate of 3.01%. That is, using the transaction information from the test sessions and algorithms developed from the training data, the ISU team was able to correctly predict whether a visit to the online shop would result in a purchase almost 97% of the time.
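The reported rates follow directly from the error counts and the size of the test set. The short Python sketch below reproduces that arithmetic; the function and variable names are illustrative assumptions introduced here, not part of the competition materials or the team's method.

```python
# Figures taken from the article: 5,111 test sessions, 144 errors for the
# winning team, and 10 more errors than that for the Iowa State team.
TEST_SESSIONS = 5111


def error_rate(errors: int, sessions: int = TEST_SESSIONS) -> float:
    """Fraction of test sessions that were misclassified."""
    return errors / sessions


winner_errors = 144
isu_errors = winner_errors + 10  # "only 10 more errors than the winning team"

winner_rate = error_rate(winner_errors)
isu_rate = error_rate(isu_errors)

print(f"Winning team error rate: {winner_rate:.2%}")   # 2.82%
print(f"Iowa State error rate:   {isu_rate:.2%}")      # 3.01%
print(f"Iowa State accuracy:     {1 - isu_rate:.2%}")  # 96.99%
```

An accuracy of 96.99% is the "almost 97%" cited above.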
Winners of the competition were announced in Berlin during Prudsys AG’s User Days. The top 10 teams were invited to attend the awards presentation in person and present their work, and two members of the ISU team were on hand to represent the team and be recognized as up-and-coming data miners.
The competition has been held annually since 2002, and participation is limited to teams from educational institutions. In 2013, the competition included 99 teams from 77 educational institutions and 24 countries. Other teams from the United States included those from the University of California at Los Angeles, University of Southern California, and Northwestern University.