## Teaching Intro Stats Students to Think with Data

The *ASA Curriculum Guidelines for Undergraduate Programs in Statistical Science* states, “Institutions need to ensure students entering the work force or heading to graduate school have the appropriate capacity to ‘think with data’ and to pose and answer statistical questions.” The guidelines also note the increasing importance of data science.

While the guidelines were explicitly silent about the first course, they do state the following:

- “A focus on data [should] be a major component of introductory and advanced statistics courses and that students work with authentic data throughout the curriculum.”
- “A number of data science topics need to be considered for inclusion into introductory, second, and advanced courses in statistics to ensure that students develop the ability to frame and answer statistical questions with rich supporting data early in their programs, and move towards dexterous ability to compute with data in later courses.”

With a year having passed since the release of the guidelines and data science programs and courses burgeoning across the United States, *Amstat News* asked several educators about how they were changing their intro stats courses so students have “appropriate capacity to ‘think with data’.”

Ben Baumer is an assistant professor in the program in statistical and data sciences at Smith College, serving as the program’s director. Baumer spent nine seasons as the New York Mets’ statistical analyst for baseball operations and is a co-author of *The Sabermetric Revolution: Assessing the Growth of Analytics in Baseball*.

**Describe the introductory statistics course(s) you teach.**

At Smith, introductory statistics courses are offered through four departments in addition to ours (psychology, economics, government, and sociology). I teach our introductory statistics course, which is for students with prior exposure to calculus. Most of my students are majoring in the sciences, with a large majority coming from engineering, biology, neuroscience, and environmental science. All those majors require this course. Students get five credits for the course, since there is a required lab meeting in addition to the three regular lecture meetings. The use of R is integrated throughout the course, but the labs provide a comfortable environment for coding that offers one-on-one and peer-to-peer instruction. We use the OpenIntro textbook *Introductory Statistics with Randomization and Simulation*. Most of the traditional topics are covered, but the curriculum emphasizes randomization and simulation, regression modeling, and statistical computation.

**What do you see as your biggest challenge as an instructor of an introductory statistics course?**

Creating a student who is capable of performing coherent statistical analysis in a single semester course is challenging. We spend a fair amount of time discussing topics that may not be as useful as they once were (e.g., *t*-tests, inference for a single proportion, chi-squared tests) and not enough time building skills students are likely to use in their future research (e.g., a deeper understanding of regression, logistic regression, data visualization, and data wrangling skills).

**How are you adapting your introductory course(s) in light of the new ASA guidelines and the emergence of data science?**

In the short term, I have pushed more data science elements into the course. In addition to the “mosaic” package for R, we are now using Hadley Wickham’s “dplyr” package in our labs. The idea behind this is that students will hopefully develop some familiarity and comfort with basic data wrangling skills in R such that they can build on these skills in future courses and research. I try to always use real data and emphasize modeling.

In the medium term, I plan to explore the possibility of modernizing the curriculum.

**What technology do you use in the classroom?**

R. We have an RStudio server set up for the students, and I use R pretty much every day in the lectures and, obviously, in the labs. I also use Google’s office suite (but never Excel).

**What is your favorite classroom activity for helping students “think with data”?**

Andrew Bray and I adapted an exercise from Nick Horton for the first day of class that builds intuition about inference for categorical data through randomization. There is an episode of *MythBusters* that explores the question of whether yawning is contagious. The MythBusters conduct an experiment in which 50 participants are placed alone in a closed room and asked to wait. The response variable is whether they yawned. However, they were first randomly divided into two groups. In the experimental group, the experimenter yawns as they close the door to the room, and in the control group, they don’t. The MythBusters conclude—without any inferential statistics—that because 29% of the participants in the experimental group yawned, but only 25% of those in the control group yawned, yawning is contagious.

What we then do is have the students use playing cards to simulate a randomization distribution for the number of participants in the experimental group who yawned. Of course, it’s the first day of class and they don’t know what any of this means, so we don’t use all these words. But I hope that going through the exercise helps build intuition about statistical inference—both in terms of how it works but also why it’s useful.
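The card exercise can also be mimicked on a computer. The sketch below is written in Python for illustration (the course itself uses R) and is not the authors' classroom code. It assumes group sizes of 34 (experimental) and 16 (control) with 10 and 4 yawners, respectively. These counts are consistent with the 29% and 25% quoted above, but they are an assumption, since only the percentages and the total of 50 are given here. Each "card" is one participant, and shuffling the deck stands in for re-randomizing who lands in which group.

```python
import random


def simulate_yawners_in_treatment(n_treatment=34, n_control=16, n_yawners=14,
                                  n_shuffles=10_000, seed=1):
    """Shuffle yawn/no-yawn cards and count yawners dealt to the treatment group.

    The default group sizes and yawner count are assumptions (see lead-in).
    """
    rng = random.Random(seed)
    # One card per participant: 1 = yawned, 0 = did not yawn.
    deck = [1] * n_yawners + [0] * (n_treatment + n_control - n_yawners)
    counts = []
    for _ in range(n_shuffles):
        rng.shuffle(deck)
        # The first n_treatment cards play the role of the experimental group.
        counts.append(sum(deck[:n_treatment]))
    return counts


counts = simulate_yawners_in_treatment()
observed = 10  # assumed observed count: 10 of 34 is about 29%

# Share of shuffles that put at least as many yawners in the
# experimental group as were observed.
p_value = sum(c >= observed for c in counts) / len(counts)
print(f"Proportion of shuffles with >= {observed} yawners: {p_value:.2f}")
```

If a large share of the shuffles produce a result at least as extreme as the observed one, the difference between 29% and 25% is the kind of gap that chance alone produces routinely.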

**Are you assigning students real data in your course? If so, where do you get it? Do you prep it? What unique difficulties do the data pose, and how do you deal with them?**

Students always use real data in their end-of-semester projects. Some students get data from other professors with whom they are working, but most find it on the Internet. Often, this involves various data wrangling challenges—almost always involving data cleaning and missing data, and often involving reshaping or merging. I always provide help, but try not to write code for them. The biggest challenges that come up in these projects that we don’t cover in the course include data wrangling, what to do about missing data, how to model time series (or panel) data in regression, and logistic regression.

Johanna Hardin is professor and chair of the department of mathematics at Pomona College. She participated in creating the 2014 *ASA Curriculum Guidelines for Undergraduate Programs in Statistical Science* and recently co-edited an issue of *The American Statistician* focusing on the undergraduate curriculum (Vol. 69, No. 4).

**Describe the introductory statistics course(s) you teach.**

Introductory statistics at Pomona is taught using R with a focus on simulations to understand the mechanics behind traditional inference. Permutation tests serve as a way to understand ideas of sampling distributions and variability and as a mechanism for discussing when tests are more powerful and when assumptions are violated.

**What do you see as your biggest challenge as an instructor of an introductory statistics course?**

The biggest challenge for me is teaching about the importance of technical conditions (e.g., normality assumption, sample size) while still giving the students skills to make conclusions from data they will encounter in the future. A study that does not perfectly conform to ideal conditions does not necessarily warrant being thrown out. There is often still information hiding inside most data we encounter!

**How are you adapting your introductory course(s) in light of the new ASA guidelines and the emergence of data science?**

One of the biggest things I’ve tried to change in my classroom is incorporating dynamic data. That is, I want the students exposed to data that changes (i.e., is not static). Some examples of such data come from weather, sports, GDP, CDC, etc. Ideally, students will learn to download (often scrape) data sets from websites kept up to date by relevant institutions. In R, Hadley Wickham’s “readr” package and Jenny Bryan’s “googlesheets” package help students easily download many new types of data.

**What technology do you use in the classroom?**

The main technology I can’t live without is RStudio. That means, of course, that we use R. With RStudio, students can produce assignments that combine code, results, and write-up in a reproducible and readable way. The mosaic package goes a long way toward easily navigating introductory statistics ideas within R. And even in introductory statistics, Shiny has been made user friendly enough to let students produce fantastic interactive graphics.

Additionally, I love using applets in class. Primarily, I use the Chance & Rossman applets that go along with *Investigating Statistical Concepts, Applications, and Methods*. However, there has been some recent work done to recreate many of those applets using Shiny in RStudio.

**What is your favorite classroom activity for helping students “think with data”?**

At the end of my introductory course, I have often used Shonda Kuiper’s TigerSAMPLING activity, which uses a video game framework to have students sample data from hypothetical tigers in a reserve. The students are able to think about sampling bias and generalizability while running a multiple regression on the observations.

**Are you assigning students real data in your course? If so, where do you get it? Do you prep it? What unique difficulties do the data pose and how do you deal with them?**

I use a lot of the data within *Investigating Statistical Concepts, Applications, and Methods*; most of the data sets come from actual studies, and the studies are well documented within the textbook. I also use data scraped from the Internet. For example, students download data from Wikipedia and Gapminder. The biggest challenge is to get the students to think carefully about what constitutes a random sample or a randomized experiment and, most importantly, what to do when, as is most often the case, the data do not come from any type of random process.

Mine Çetinkaya-Rundel is an assistant professor of the practice in the department of statistical science at Duke University. Her work focuses on innovation in statistics pedagogy, with an emphasis on student-centered learning, computation, reproducible research, and open-source education.

**Describe the introductory statistics course(s) you teach.**

I regularly teach two introductory statistics courses. One of them is STA 101, a large course (120 students) comprised mostly of social science majors and students who have not yet decided on a major. This is a non-calculus-based course that introduces students to statistics as a science of understanding and analyzing data. Students meet with me twice a week in a lecture setting (although lecturing is minimal in this team-based flipped course) and meet with the teaching assistants once a week for computational labs using R.

The other course I teach is an introductory data science course for a small group of students who self-select into a cluster of quantitative courses to take in their first semester at Duke. This course has a much heavier computational component than my other course, as well as a strong emphasis on data wrangling, visualization, modeling, and effective communication of results.

Both courses introduce statistical inference and modeling and use the R statistical computing language, but differ in focus and depth. For example, in STA 101, students are given custom functions for creating bootstrap intervals, while students in the data science course instead learn to write for loops and construct the bootstrap intervals themselves. Similarly, in STA 101, students are provided cleaned data sets to work with, while students in the data science course are asked to scrape data directly from the web and then clean it before performing any statistical analyses on it.
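To make the "write the for loop yourself" version concrete, here is a minimal sketch of a percentile bootstrap interval for a mean. It is written in Python for illustration, whereas the courses described use R, and it is not the course's actual code. The function name, the made-up sample values, and the choice of the percentile method are all assumptions.

```python
import random
import statistics


def bootstrap_percentile_interval(data, n_resamples=5_000, level=0.95, seed=42):
    """Percentile bootstrap interval for the mean, built with an explicit loop."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Resample n observations from the data, with replacement.
        resample = [rng.choice(data) for _ in range(n)]
        means.append(statistics.mean(resample))
    means.sort()
    # Cut off an equal share of resampled means in each tail.
    tail = (1 - level) / 2
    lower = means[round(tail * n_resamples)]
    upper = means[round((1 - tail) * n_resamples) - 1]
    return lower, upper


sample = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]  # made-up values for illustration
lower, upper = bootstrap_percentile_interval(sample)
print(f"95% bootstrap interval for the mean: ({lower:.2f}, {upper:.2f})")
```

A custom function like the one STA 101 students receive would hide the loop behind a single call, while the data science students would write the body themselves.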

**What do you see as your biggest challenge as an instructor of an introductory statistics course?**

The biggest challenge in STA 101 is that the course has an ambitious curriculum for an audience that is primarily composed of students who enroll to meet a quantitative studies or major requirement, as opposed to a pre-existing interest in statistics. Motivating these students to develop an interest in statistics can be difficult, particularly in a passive learning environment using traditional lectures. For the last few years, I have been teaching this course flipped and team-based, and while almost all students like the interactive nature of the course, the views on having to prepare outside of class and work with teammates on graded work are varied among the students.

The biggest challenge I face in my data science course is striking the right balance between the amount of class time spent on statistical and computational topics. I would like students coming out of this course to be well prepared for the next (regression) course in the major, which means they need to have a good grasp of foundational statistical concepts. Meanwhile, we also need to spend a substantial amount of class time introducing computational skills like merging and cleaning data sets, working with non-flat data, interactive visualizations, gathering data off the web, etc. I’m still tweaking the distribution of class time dedicated to these complementary, but separate, components of the course, and I likely will continue to do so as newer tools become available.

**How are you adapting your introductory course(s) in light of the new ASA guidelines and the emergence of data science?**

My data science course is only two years old, and it was designed with the new ASA guidelines in mind. I make a point of using only authentic data sets in both of my introductory courses. This helps immensely with student motivation, as they can immediately see real applications of methods they are learning. To align my STA 101 course better with the new ASA guidelines and the emergence of data science, I have been updating my computational labs to have a heavier emphasis on data wrangling and visualization. I am a huge fan of R packages like “dplyr” and “ggplot2” for accomplishing this ambitious goal with minimal lines of code and with syntax that reads more like plain English.

**What technology do you use in the classroom?**

For computation, I use R via RStudio. Instead of downloading and setting up software, students access RStudio server instances maintained by the university. This means I have complete control over software and package versioning. This approach has been an incredible time saver for getting started with computation and has definitely reduced student (and instructor) frustration. Reproducibility is a central theme for the computational labs, and hence students complete all data analysis using RMarkdown. While this might initially sound like one more thing they have to learn, it actually streamlines the data analysis process and makes it a lot easier for students to organize their work by keeping everything (code, output, and narrative) in one place.

Another technology I use and love is clickers, which keep my large STA 101 class actively engaged during lectures. Clickers have the added benefit of immediate two-way feedback on the students’ understanding of specific concepts, which allows me to adjust my lesson plans based on the skills and needs of the students and allows them to gauge their own understanding. Times when a large proportion of students incorrectly answer a question provide the opportunity for peer instruction, in which students explain their thought process to their neighbors and often discover the source of their original error. This allows for students to be more engaged with the material and each other, and they continually assess their mastery of concepts and re-evaluate their understanding while still in class.

Building on the students’ enjoyment of the interactive nature of this course, I have implemented team-based approaches that rely on a student-centric flipped-classroom structure. The course is “flipped” in the sense that content delivery happens outside of class via online videos. Each learning unit starts with a readiness assessment that students take individually (using clickers) and in teams (using scratch-off sheets). These assessments hold students accountable for the videos they are expected to watch before each unit. This frees up class time for higher-level learning and mastery of the material via problem-solving and deliberate practice. These activities encourage the students to work together and explain concepts to one another, which sparks thoughtful and passionate discussions and makes a dramatic improvement in both their attitudes and their engagement.

Since teamwork is a substantial component of my courses, I also think it is important to give students the opportunity to evaluate each other and provide constructive feedback so they are more effective teammates. I have recently started using an app called Teammates for the peer evaluations, and I love it. It is a bit tedious to set up at the beginning of the semester; however, it’s pretty simple to copy the information going forward and run another evaluation session. The best feature is you can release anonymized feedback to students with the click of a button.

**What is your favorite classroom activity for helping students “think with data”?**

In both courses I described, I do a light introduction to Bayesian inference with a dice game. This activity is designed to get students to think about a prior belief, collecting data, putting these two pieces together to calculate a posterior probability, and then updating their prior in the next round with their posterior. Here is how the game works: I have two dice, one six-sided and the other 12-sided. A “win” means getting an outcome greater than or equal to four. Since the probability of “winning” (rolling ≥ 4) is higher with the 12-sided die, this is the “good die.” I hold the six-sided die in one hand and the 12-sided die in the other, but the students don’t know which is which.

We start with assigning prior probabilities to the two competing hypotheses:

H1: Good die is in my right hand

H2: Good die is in my left hand

Students quickly decide on P(H1) = 0.5 and P(H2) = 0.5, since they have no reason to assign non-equal probabilities to the two hypotheses. We also discuss that if they had additional information about me, like that I tend to favor my left over my right, perhaps it would be wiser to assign a higher probability to H2.

Then we move on to data collection. Students take turns asking me to roll the die on the right or the left. I roll and only tell them whether they won or lost, but I don’t tell them the number rolled. We also record their choice (right or left) and the outcome (win or lose) for each round on the board. The ultimate goal is to come to a class consensus about whether the “good die” is in my right or my left hand. They can choose to play as long as they want before they make a call; however, there is also a cost associated with playing too long (i.e., collecting too much data). I have a bag of candy and each time they “lose” (roll < 4), I take away one piece of candy. If they make the right decision at the end, we pass around the bag of candy. If not, they lose all the candy. Also, since not all students are motivated by candy, I tell them that if we take too long playing the game, we may not finish the material and that they will need to learn it on their own … Usually the class is ready to make a call after about 10 rounds of the game, and the students have arrived at the correct answer each time we have played the game.

Once the game is over and the candy bag is going around, we discuss how they made the decision. I ask them about how their belief changed after each round, and how that affected their decision to ask me to roll the die in my right hand or the left hand in the next round.

We then formalize this discussion with probability calculations. Using a probability tree, we calculate the posterior probabilities associated with the two hypotheses at the end of round one, and then show how we can update our prior beliefs for round two using the calculated posteriors from the first round. Once the students are comfortable with the probability calculations, I enter the data we collected into a data frame in R and use some pre-written code to visualize how the posterior probabilities changed at each round of the game.
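The round-by-round updating can be sketched in a few lines. The example below is in Python for illustration, not the pre-written R code mentioned above. It takes the win probabilities directly from the rules of the game (3/6 for the six-sided die, 9/12 for the 12-sided die), while the sequence of rounds is hypothetical.

```python
from fractions import Fraction

P_WIN_GOOD = Fraction(9, 12)  # 12-sided die: outcomes 4 through 12 win
P_WIN_BAD = Fraction(3, 6)    # six-sided die: outcomes 4 through 6 win


def update(prior_h1, hand, won):
    """Return P(H1 | data) after one round, where H1 = good die is in the right hand."""
    # Probability of a win on the chosen hand under each hypothesis.
    if hand == "right":
        p_win_h1, p_win_h2 = P_WIN_GOOD, P_WIN_BAD
    else:
        p_win_h1, p_win_h2 = P_WIN_BAD, P_WIN_GOOD
    # Likelihood of the observed outcome (win or loss) under each hypothesis.
    like_h1 = p_win_h1 if won else 1 - p_win_h1
    like_h2 = p_win_h2 if won else 1 - p_win_h2
    # Bayes' theorem with two competing hypotheses.
    numerator = prior_h1 * like_h1
    return numerator / (numerator + (1 - prior_h1) * like_h2)


# Hypothetical rounds: (hand the students asked for, whether they won).
rounds = [("right", True), ("left", False), ("right", True)]

belief = Fraction(1, 2)  # the class's starting prior for H1
history = [belief]
for hand, won in rounds:
    belief = update(belief, hand, won)  # this round's posterior is the next prior
    history.append(belief)

print([str(p) for p in history])  # prints ['1/2', '3/5', '3/4', '9/11']
```

After a win on the right hand, the posterior for H1 is (0.5 × 0.75) / (0.5 × 0.75 + 0.5 × 0.5) = 0.6, which matches what the probability tree gives for round one.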

We also discuss the cost of data collection (candy and time), how each student might value these differently, and how that would inform whether they would prefer to keep playing the game to collect more data and win the bag of candy or cut the game short to make sure there is sufficient class time to cover all the course material.

I like this activity because it shows how we naturally use conditional probabilities when making decisions. I also like that it allows for introducing Bayes’ theorem in a real decision-making context, instead of just calculating conditional probabilities for the sake of calculating them.

**Are you assigning students real data in your course? If so, where do you get it? Do you prep it? What unique difficulties do the data pose, and how do you deal with them?**

Absolutely all data sets students encounter in my courses are real. Some of them are quite simple—bivariate categorical data reconstructed from Gallup, Pew Research, Public Policy Polling, etc. Others are publicly available data sets like the General Social Survey, World Values Survey, Behavioral Risk Factors Surveillance System, etc. I also scrape and gather some data from the web.

For example, I have been using a movie data set for the final project in my STA 101 course for the last few semesters. To construct the data set, I start with a list of (almost) all movies released in the United States. I then take a random sample of about 600 movies and obtain data on them such as runtime, release date, critic and audience scores, etc. from the IMDb and Rotten Tomatoes APIs. Additionally, I match the movies to historical Academy Awards and box office data.

I do only a little bit of data prep before releasing the data to the students: remove observations for which no data are available and observations for which information from IMDb and Rotten Tomatoes on the same variables don’t match. This leaves about 450 movies in the sample, and this is the sample released to the students for their final project.

Students work on this project in teams, and share their results with me and the rest of the class in a poster session. The project is open ended, but they all have to build a multiple regression model for predicting audience scores and pick a new movie and do a prediction for this movie. This means they need to obtain information about a movie of their choosing for variables in their regression model. They also need to think about how to code the award and box office variables for this movie, since that information is not yet available. What I really like about this is that this task allows us to talk about uncertainty around explanatory variables and how that affects the uncertainty around their prediction. Even if the students may not have thought about this while working on their project, the poster session format allows me to ask in-depth questions about such considerations.

One reason I really like this data set is that everyone knows a little bit about movies so they can reason about the validity of their findings, and they usually have a favorite movie they want to do a prediction for. Another reason is that the response variable, audience score, has a pretty symmetric distribution, which means the conditions for techniques we introduce in the course are met.

Obviously, recognizing when the conditions are not met is an important skill. An even more important skill is understanding the repercussions of the conditions not being met on the validity and scope of the conclusions. However, I think it is important to present a new technique in a situation where it is most appropriate to use it, and then provide examples of scenarios where it might not be. Unfortunately, it can be difficult to find data sets—especially where the response variable is numerical—where conditions for traditional inference methods are met.

Another big challenge is data wrangling, especially in a course like STA 101, where students’ computational experience is limited. They tend to have difficulty with tasks like releveling a categorical variable by combining some of the levels or creating a new variable based on existing variables in the data set. Creating multivariate data visualizations is also a challenge. Teaching R with “dplyr” and “ggplot2” has certainly helped in these regards, but I think data wrangling needs to be a central focus of the course to make real strides with these challenges, which means something will need to come out of the curriculum. It’s been challenging to decide what that should be.

I’ve been trying to handle this challenge by offering lots of help in office hours and on the course discussion forum, but a better solution would be to teach students the necessary skills they need to feel comfortable accomplishing these tasks (or looking for help on the web) themselves. The real challenge is figuring out how much to teach. For example, I had a team of students who wanted to add a variable to the movies data set identifying whether the movie was based on a book. Should we be teaching STA 101 students how to scrape data from the web to be able to do this automatically, or should we be telling them “that’s a neat idea, but it’s beyond the scope of this course”? Two years ago, I would have said the latter is the right answer, but I could be convinced of the former today, since modern tools that make such tasks pretty simple are readily available.

Joseph said: It’s great to hear of these creative approaches towards teaching statistical thinking. I’ve encountered a lot of students who end up frustrated and perplexed in “traditional” introductory stats courses that mostly focus on rote formula memorization. I hope more curriculums adopt these practices that truly fit in with the modern computer age…

Allan Stewart-Oaten said: Very interesting and helpful. I hope these attitudes and ideas will spread and grow. A minor grumble is the mantra about “real data”. Of course we want it. But sometimes fake data will illustrate a point more easily. Regression to the mean suggests tall people have children shorter than themselves, and short people have the reverse. So we’ll all end up the same height. Of course, if you run it backward, tall people had parents shorter than themselves, and …, so we all used to be the same height. But suppose generation 1 has 3 short people, 4 middle and 3 tall; of the short people, 2 have short children but 1 has a middle child; of the middle people, 1 has a short child, 2 have middles and 1 has a tall; and of the tall people 1 has a middle child and 2 have talls. Thus generation 2 looks the same as generation 1, despite regression to the mean.
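The arithmetic in this toy example can be checked with a short tally. The sketch below is in Python and uses only the counts given in the comment. Two things are assumptions made for illustration: each person has exactly one child, and heights are scored as -1, 0, and 1 so that an average can be taken.

```python
from fractions import Fraction

# transitions[parent_height][child_height] = number of parent/child pairs,
# taken from the counts in the comment above.
transitions = {
    "short":  {"short": 2, "middle": 1, "tall": 0},
    "middle": {"short": 1, "middle": 2, "tall": 1},
    "tall":   {"short": 0, "middle": 1, "tall": 2},
}

# Generation 1 counts are the row totals; generation 2 counts are the column totals.
generation_1 = {parent: sum(children.values())
                for parent, children in transitions.items()}
generation_2 = {height: sum(transitions[parent][height] for parent in transitions)
                for height in transitions}

# Assumed numeric scores, used only to average the children's heights.
score = {"short": -1, "middle": 0, "tall": 1}
mean_child_score = {
    parent: Fraction(sum(score[child] * count for child, count in children.items()),
                     sum(children.values()))
    for parent, children in transitions.items()
}

print(generation_1)  # prints {'short': 3, 'middle': 4, 'tall': 3}
print(generation_2)  # prints {'short': 3, 'middle': 4, 'tall': 3}
print({parent: str(mean) for parent, mean in mean_child_score.items()})
```

Under the assumed scoring, the children of tall parents average 2/3, which is below their parents' score of 1, and the children of short parents average -2/3. The averages are pulled toward the middle even though the two generations have identical distributions.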