A Festival of Data: Student Perspectives
A June Amstat News article by Robert Gould, Benjamin Baumer, Mine Çetinkaya-Rundel, and Andrew Bray described DataFest, an annual Big Data analysis competition for college students; see also papers by Çetinkaya-Rundel and Dalene Stangl (2013, CHANCE) and Gould (2014, ICOTS9 Proceedings). The ASA board, at its April meeting, approved a proposal from the DataFest organizers to make the ASA the national headquarters for DataFest. For this month’s President’s Corner, I invited Andrew Bray, postdoctoral research associate at the University of Massachusetts, Amherst, to write about student perspectives on DataFest. As a UCLA graduate student, Andrew helped Rob Gould, DataFest’s founder, organize the first few UCLA events. This year, he and Ben Baumer organized the inaugural Five College DataFest.
On a Friday night in March, Dana Udwin and four of her friends shifted impatiently as the elevator brought them up to the penthouse of Lederle Tower in the college town of Amherst, Massachusetts. They’d been waiting for this event for weeks and were expecting to stay out late. They knew students from other local colleges would be attending as well, so they came outfitted for success: Each young woman carried a laptop with R installed, ready to tackle any data analysis that came their way.
The event was the inaugural Five College DataFest, a weekend-long data analysis competition held at the University of Massachusetts, Amherst. More than 60 students showed up that Friday night and formed teams of up to five. Each team carved out a corner of the penthouse, dragged tables together, reconfigured mobile white boards, and laid claim to power outlets to make a home base for the next 40 hours. Later that evening, the data set that would motivate the competition was revealed: detailed energy efficiency data from a company called GridPoint.
DataFest challenges students to wrestle with a data set that is far larger and richer than any data set they encounter in the classroom. What students also find challenging is the freedom they’re given in their analysis. Dana said, “I was surprised by how open-ended DataFest ended up being. We were given this data set and released to do whatever we wanted to do with it. So there was the immediate challenge of answering questions that we hadn’t even yet developed.” A cadre of onsite faculty and data professionals guides the teams, keeping them sensitive to the insights the data can and cannot support.
The teams’ primary directive was to prepare five slides and a short presentation for a panel of judges on Sunday afternoon, when they would compete for three prizes: “Best Visualization,” “Best Use of External Data,” and an overall “Best in Show.” With this goal in sight, the penthouse buzzed day and night as teams sketched out strategies; delegated tasks; and wrestled with the challenge, excitement, and frustration that come with getting your hands into a rich data set.
Dana’s team, The p-Valuables, had a strong background in statistics and computing with data, but the event draws students with a wide range of experience. One team consisted of physics students who had never taken a statistics class. Dana remarked, “They looked at the data from an angle that my group didn’t even consider. Part of what made the weekend so exciting was seeing the diversity of approaches taken by all of the different groups.”
On Sunday afternoon, after the teams presented their findings and the judges had a chance to deliberate, The p-Valuables were awarded “Best in Show” for their clear and compelling presentation of the benefit energy monitoring can have for energy efficiency.
In an 11th-hour twist, the judges decided to replace the prize for “Best Use of External Data” with “Best Garbage Detection.” Some of the data fields from GridPoint, it turned out, had been aggregated over time, though there was no indication of this in the code book. One group picked up on this through careful data cleaning and was able to adjust their analysis accordingly. Their recognition by the judges was an important lesson in the challenges of working with real data, a lesson that is too rarely captured with the tidy data sets so often used in the classroom.
DataFest as Path Shaper
The DataFest in Amherst was one of five such events held across the country this spring. Emory and Princeton also hosted their first DataFests, while Duke hosted its third and UCLA its fourth.
The students who participated in the first UCLA DataFest in 2011 have all since graduated and moved on to careers in industry and academia. I wondered what memories they had of the event and how it had shaped their paths in statistics, so I got in touch with some of them to find out.
DataFest has both a collaborative and competitive side. What are your thoughts about how those are balanced?
Jennifer Chuu (UCLA ’12): DataFest struck a great balance between collaboration and competition. It gave us a glimpse of what industry stats is like in terms of bouncing ideas off of one another and catching one another’s mistakes, pointing out subtleties, etc. Competing with each other also gave us the drive to really be as insightful as we could.
What was the greatest challenge your group faced?
Jennifer Chuu (UCLA ’12): The hardest part was stepping back and looking at the big picture, especially when we were caught up in the details of a certain model or method. Sometimes we had to really ask ourselves why we were doing something to find out that we had veered off track a bit and had to center ourselves again.
Max Schneider (UCLA ’12): This was now three years ago, but I remember that my group had overly ambitious plans for what we could accomplish in one weekend. We were presented with a data set from the Los Angeles Police Department concerning the spatial distribution of crime and we wanted to incorporate a lot of external data. We ran into the inevitable problem that the outside data existed on levels that did not match the LAPD data and we burned a lot of time trying to merge them.
With the event now several years in the past, what have been your main takeaways from DataFest?
Mallory Wang (UCLA ’12): I think the best thing about DataFest is just allowing students to be exposed to data in its natural form. Often, in classrooms, we get toy examples and leave school thinking 1,000 rows of data with cleaned variables is Big Data when in reality we’re seeing things closer to a website that scrapes into millions of observations with no labels for variables. The exposure to that kind of data and actually creating something to present at the end of the competition, no matter what level of success, is a great accomplishment and encouragement for a future in statistical study or profession.
Elizabeth Frank (UCLA ’12): DataFest really set me up to do well professionally. It was a great chance to network, and I gained skills in analysis. I ended up with an amazing internship at the LA Times after presenting with my group. After that, when interviewing for my current position, DataFest was brought up and discussed at length. I also gained immense confidence in my statistical knowledge through talking with the professional consultants and presenting.
Ashley Chiu (UCLA ’14): I definitely made a lot of really great friends, and we’re still laughing about some of the fun, crazy things that happened at DataFest. Without a doubt, I learned a lot about data visualization and how different applied statistics is from theoretical/classroom statistics. As for interviews, I think I actually got my current job because I impressed my interviewer with the project my team had produced using K-means clustering. Everyone I’ve talked to in the statistics community has been so excited to hear about my experience and the different projects that are produced in such a short period of time.
Most importantly, I think the event has just truly amplified the rigor, opportunities, and overall awesomeness of statistics. The competition is analogous to the traditional case competitions done in other fields, and in that way, DataFest has really added to statistics’ rising presence in academia and the real world.
Good for Job Interviews
I checked in with Dana to see how things had been going since her graduation from Smith College this past May. When I asked her what her strongest takeaways were from DataFest, she talked about the value of persisting with the full data analysis process—from asking questions to communicating results—while working in a collaborative environment. As a recent graduate, she also related a more concrete takeaway from DataFest: “It has come up in every job interview I have had since. I’ve been asked, ‘Tell me about a time when you had to work on a project on a tight deadline and you didn’t know what the answer would be,’ or, ‘Tell me about a time when you had to convince an audience of something that you, yourself, were not sure of.’ These are questions that just beg to have DataFest as an answer.”