Big Data Goes to College
Robert Gould, Benjamin Baumer, Mine Çetinkaya-Rundel, and Andrew Bray
A new statistics challenge is taking college campuses by storm and engaging students in solving real-life Big Data problems more complex than those they can tackle in class. In just three years, the competition, called DataFest, has grown from a single event with 25 students to multiple events across 16 colleges and universities with more than 400 student-participants.
Organizers, led by DataFest founder Robert Gould, and the ASA are launching a campaign to establish DataFest events at other schools. The ASA Board recently approved a recommendation for the association to be the lead sponsor and headquarters for DataFest.
What Is DataFest?
DataFest is an annual competition held early each spring. During the event, teams of up to five students work to extract insights from a large and rich data set. This unique program extends data-analysis learning beyond the time constraints typically encountered in a classroom setting. It naturally attracts statistics students, but it also draws majors in engineering, math, computer science, social science, and other fields of study.
Gould, who is also the director of the Center for Teaching Statistics at the University of California at Los Angeles (UCLA), said he created DataFest at UCLA in 2011 for two reasons: to get around the time constraints that limit work with large data in the classroom and to engage bright students beyond the constraints imposed by finals or final projects. DataFest expanded to Duke University the following year. Last year, UCLA collaborated with several nearby schools (Pomona College, the University of Southern California, Cal State Long Beach, and the University of California at Riverside), while Duke students competed with their counterparts from two nearby Tobacco Road schools.
During the 48-hour event that begins on a Friday evening and concludes the following Sunday afternoon, teams compete head-to-head for prizes in categories that include “Best Insight,” “Best Visualization,” and “Best Use of External Data.” Student-teams work intensely during the weekend and are allowed a limited number of slides and a few minutes to present their findings to the judges—graduate students, professors, statisticians from businesses, and representatives of the organization that provides the data set.
DataFest emphasizes the art of storytelling with data. For this reason, DataFest competitors have complete autonomy regarding how they approach the analysis problem, which can be both exciting and intimidating. Because the stakes are low (e.g., no grades) and the rewards are high (e.g., prizes are awarded and DataFest participation is great on a résumé), students generate risky yet creative ideas to solve the problem.
DataFest is a friendly competition; in fact, students are encouraged to share ideas. The competition aspect gives the students a goal and generates camaraderie among team members. Competitors also engage with professionals with statistics and data-analysis expertise. Most importantly, the students have fun.
The Big Data
Each year, the data and challenge are different, but the theme of making sense of Big Data is carried over. The data set, which is real-world data of interest to the providing organization, is not unveiled until the competition.
For the first DataFest at UCLA, the data consisted of 10 million arrest records spanning a six-year period provided by the Los Angeles Police Department. In 2012, the data set came from micro-lending site Kiva.org, and online dating service eHarmony.com provided the data last year.
This year, the data set came from GridPoint, a company that offers data-driven energy management systems (EMS) that enable customers to increase energy savings, optimize facility efficiency, and promote sustainability agendas. The data consisted of a sample of 110 U.S. businesses and included hourly energy consumption values reported by multiple onsite sensors for the period covering 2011 to 2013. It also included information about the environmental factors and energy consumption at these businesses prior to installation of the EMS. The data challenged DataFest competitors to find patterns that would help a business decide to implement energy-saving steps.
A Big Benefit
During the competition, many students strive to catch the attention of industry representatives who attend the event to offer advice and recruit students with the best analytical skills.
“DataFest is more than just a competition to students nearing graduation and the industry representatives who are seeking new statistical talent,” said Gould. “Employers come to recruit the next generation of data professionals. In DataFest’s relatively short history, numerous students have showcased their statistical skills, developed contacts with employers, and even accepted employment offers.”
Long after DataFest, student-competitors who note the event on their résumés have found potential employers keenly interested in learning about their participation and how the experience translates to the job for which they are interviewing, added Gould.
This year, DataFest encompassed five competitions. Brief summaries of each follow:
• Duke—112 students divided into 21 teams competed March 21–23. Students came from Duke University, The University of North Carolina, North Carolina State University, and Dartmouth College in Hanover, New Hampshire. Mentoring the teams were consultants from IBM, JMP, MetLife, Duke Energy and Carbon Offsets Initiatives, and faculty and graduate students from the participating schools.
“DataFest is an amazing opportunity for students to tackle a substantial real-world problem while honing their computational and statistical skills. Each year, the students surprise the visitors, the judges, and themselves with the variety and the quality of their analyses,” said Mine Çetinkaya-Rundel, assistant professor of statistics at Duke. For more information, visit the Duke DataFest website.
• UCLA—170 participants divided into 40 teams competed May 2–4. The students represented California Polytechnic State University San Luis Obispo, Pomona College, UCLA, the University of California at Riverside, and the University of Southern California. Consultants from Google, Hot Topic, JPL, Digital Trend Analytics, Southern California Edison, Cedars-Sinai Medical Center, eHarmony, and Summit Consulting provided counsel and scouted for talented analysts.
“Everyone—the students, the faculty, and our VIP consultants—had a great time. While DataFest is fun, the students worked intensely and produced some amazing findings,” said Gould. For details, visit the UCLA DataFest website.
• Five College DataFest—60 participants divided into nine teams competed March 28–30. Students came from the University of Massachusetts-Amherst (UMass) and Amherst, Hampshire, Mount Holyoke, and Smith colleges. Consultants came from MassMutual, IBM, Athena Health, and each school.
“There was a wide range of skills among the participants, but the open-ended nature of the problem provided enough flexibility for every group to contribute a different analysis,” said Benjamin Baumer, Smith College visiting assistant professor of mathematics and statistics.
UMass postdoctoral fellow Andrew Bray added, “DataFest strikes a good balance between the energy and excitement of competition with the support and esprit de corps of a broad collaboration.” For more information, visit the Five College DataFest website.
• Emory University—27 participants divided into eight teams competed April 4–6. Emory went solo in its first year, but may invite other local schools next year. Serving as consultants were Emory graduate students and faculty. The school hopes to use outside consultants in 2015.
“The energy level and unbridled enthusiasm of the undergraduates was irresistible. The students were highly motivated, eager to learn, and made tremendous gains in statistical and programming knowledge,” said Shannon McClintock, statistics lecturer. The best team name was ANOVA One Bites the Dust.
• Princeton University—46 students divided into seven teams competed March 28–30. Consultants from IBM, GridPoint, Google, and New Jersey-based Public Service Electric and Gas Company counseled the student-teams during the event.
“DataFest was a great success. We had several very creative presentations,” said Philippe Rigollet, assistant professor of statistics. “The participants were fun to be around. Some students put in 50 hours of work, but it was still pretty laid back.” For details, visit the Princeton DataFest website.
ASA and DataFest
As the competition’s new headquarters and lead sponsor, the ASA will help schools set up DataFest events, secure data providers, promote the competition, enlist the support of relevant ASA sections, and recruit judges and national sponsors. To aid the creation of new events, the ASA and DataFest organizers will develop a “how-to” kit that will include data, advice, and a discussion forum.
If you are interested in hosting a DataFest event next year, you can learn how at the 2014 Joint Statistical Meetings. A contributed panel session on DataFest, organized and moderated by Çetinkaya-Rundel and with participation by other DataFest organizers, will be held August 6 at 2 p.m. Drop by to learn more about this growing program.