## Teaching Intro Stats Students to Think with Data

The ASA *Curriculum Guidelines for Undergraduate Programs in Statistical Science* states, “Institutions need to ensure students entering the work force or heading to graduate school have the appropriate capacity to ‘think with data’ and to pose and answer statistical questions.” The guidelines also note the increasing importance of data science. While the guidelines were explicitly silent about the first course, they do state the following:

- “A focus on data [should] be a major component of introductory and advanced statistics courses and that students work with authentic data throughout the curriculum.”
- “A number of data science topics need to be considered for inclusion into introductory, second, and advanced courses in statistics to ensure that students develop the ability to frame and answer statistical questions with rich supporting data early in their programs, and move towards dexterous ability to compute with data in later courses.”

With a year having passed since the release of the guidelines, the publication of a special issue of *The American Statistician* on undergraduate education, and the spread of data science programs and courses across the United States, *Amstat News* asked several people for input about how they were changing their intro stats courses so students have “appropriate capacity to ‘think with data’.”

Nicholas Horton, professor of statistics at Amherst College, has research interests in missing data methods and statistical education and has co-authored more than 150 papers and a series of books on statistical computing. He is an ASA fellow and chair of the Statistical Education Section.

**Describe the introductory statistics course(s) you teach**

We teach several flavors of introductory statistics: One is a general course with no prerequisite, while the second has a calculus prerequisite for those with more extensive quantitative background. We see an increasing number of students who’ve completed AP Statistics in high school and have adapted our intermediate statistics course (regression and design) to allow them to dive in as early as their first semester. All courses incorporate computation early and often, feature the use of modeling as a way to make sense of data, and introduce aspects of multivariate thinking.

**What do you see as your biggest challenge as an instructor of an introductory statistics course?**

My biggest challenge is to help students see the potential for statistics to help “make decisions in the face of uncertainty” (as ASA President Jessica Utts has so eloquently stated), rather than develop a collection of methods that they apply cookbook fashion.

Students often have trouble seeing the big picture. As an example, we want them to be able to interpret a confidence interval, rather than just mechanically perform a test. This challenge has led us to prune some of the topics that have traditionally been at the core of the intro course (such as probability, derivation of different tests for different situations, and use of tables).

**How are you adapting your introductory course(s) in light of the new ASA guidelines and the emergence of data science?**

At Amherst, we’ve continued to work to find more ways to introduce students to the excitement of statistics as the science of learning from data. We’ve brought multivariate thinking into the heart of the intro course, exposed students to data wrangling skills (through an end-of-semester group project that involves fitting and interpreting a multiple regression model), and focused on developing the capacity to communicate results and findings.

**What technology do you use in the classroom?**

We have students using a cloud-based version of RStudio Server Pro beginning on the first day of class. This is free software for academic use that provides a simplified interface to R. All this requires from the students is a web browser. We use the “mosaic” package and its modeling language to calculate summary statistics, display graphical visualizations, and estimate and assess models. R Markdown is used to help structure their analyses. This is an attractive and workable environment since all the commands we use in the course fit on a single piece of paper (Randy Pruim calls this “Less Volume, More Creativity”). I can’t imagine teaching statistics without access to an RStudio server, as it dramatically reduces the friction of introducing new technology to students.

**What is your favorite classroom activity for helping students “think with data”?**

My colleague Susan Wang has developed a great activity, titled “Visualization as the Gateway Drug to Statistics in Week One.” After a short background lecture that introduces several univariate, bivariate, and multivariate displays, students are turned loose in groups on a data set. With some assistance from the instructor, they create and share (via the free RPubs service) their graphical displays and interpretation. This is a wonderful way to get students “thinking with data” and beginning to develop statistical judgment and language. They also quickly realize that this isn’t like most math courses!

**Are you assigning students to use real data in your course? If so, where do you get it? Do you prep it? What unique difficulties do the data pose, and how do you deal with them?**

The *Journal of Statistics Education* Datasets and Stories Department is a wonderful source of data sets. (I particularly like Albert Kim’s set of profiles from 59,946 San Francisco OkCupid users.) Hadley Wickham’s data packages for R also provide great fodder for the classroom. I’m particularly fond of the “nycflights13” package, with data on all flights from NYC airports in 2013 (n = 336,776 rows), and “fueleconomy,” with data for all cars sold in the United States from 1984 to 2015 (n = 33,442 rows).

None of these are big (or even “medium”) data sets, but they get students thinking about bigger issues and serve as precursors to future exploration. We let students pick their own data sets for projects, which gets them thinking about how to answer statistical questions of interest to them. It also demonstrates that real-world problems don’t generally present themselves as neat and well-characterized rectangular arrays with no missing data.

For all these data sets, some data preparation is needed. Our focus in the first course is to provide students experience with statistical practice (where the instructors and a group of peer tutors assist with the technology). In later courses, we take a backseat, with the students taking the lead on data management.

Peter Bruce founded the Institute for Statistics Education in 2004 with courses on introductory statistics, data mining, and resampling; it now offers 100+ courses. Bruce authored *Introductory Statistics and Analytics: A Resampling Perspective* and co-authored *Data Mining for Business Analytics*.

**Describe the introductory statistics course(s) you teach**

The course I designed and help teach is called “Introductory Statistics for Credit,” a fully online course that starts every month at the Institute for Statistics Education at Statistics.com. The institute fields about 100 online courses in statistics and data science, most aimed at working professionals, but this introductory course attracts many students seeking to satisfy a requirement, hence “for credit” in the title. The course is based on my own book (*Introductory Statistics and Analytics: A Resampling Perspective*), which we provide online for our students. A team of instructors, led by Michelle Everson at Ohio State, is supported by online teaching assistants.

As this course has evolved, it has taken on a data science perspective gained, in part, from my other work guiding the expansion of Statistics.com into analytics and programming courses (R, Python, Hadoop, SQL, SAS) and data and text mining courses. I am also a co-author of *Data Mining for Business Analytics*.

**What do you see as your biggest challenge as an instructor of an introductory statistics course?**

Right now, it is how to incorporate the use of R for a larger number of students, without having to turn the course into a “learn how to program with R” course. At the same time, given our student corpus and their needs, it is not appropriate to require all students to use R. We maintain our students’ ability to choose software to use in the course and support them when they have questions or difficulties.

**How are you adapting your introductory course(s) in light of the new ASA guidelines and the emergence of data science?**

We have had ASA and GAISE guidelines in mind for about a decade as we have developed and modified our courses. Michelle Everson, our lead instructor, is very involved in the statistics education community, as a previous editor for CAUSEweb, MERLOT, and the *Journal of Statistics Education*. I would sum up our approach as asking, at every stage of text and curriculum development and at the most-detailed level, “Is this scenario, issue, problem, approach, etc. something about which a statistically innocent but otherwise educated professional would say ‘yes, I can see how this relates to my professional world.’”

With the rapid growth in importance of data science, all our materials now place the methods being taught in the context of the two key communities in statistics: researchers and data scientists. As methods are illustrated, we show how the method fits into the needs of each community.

We also rely heavily on resampling and bootstrapping for the inference components of the courses. It is better understood this way and fits into the algorithmic orientation of data science.
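To make the resampling idea concrete, here is a minimal sketch of a percentile bootstrap interval for a mean. It is an illustration only, not the course's own material: the function name `bootstrap_ci`, the purchase amounts, and the choice of 5,000 resamples are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for a statistic (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    # Keep the middle `level` share of the resampled estimates.
    tail = (1 - level) / 2
    lower = estimates[round(tail * n_resamples)]
    upper = estimates[round((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical data: purchase amounts (in dollars) for 12 customers.
purchases = [12.0, 15.5, 9.0, 22.0, 18.5, 11.0, 30.0, 14.0, 16.5, 13.0, 25.0, 10.5]

low, high = bootstrap_ci(purchases)
print(f"sample mean: {statistics.mean(purchases):.2f}")
print(f"95% bootstrap interval: ({low:.2f}, {high:.2f})")
```

The whole procedure is a loop over resamples rather than a formula, which is one way to read the "algorithmic orientation" mentioned above.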

Finally, we use realistic data and scenarios that fit the data science world (see below).

**What technology do you use in the classroom?**

We allow students the choice of using R, StatCrunch (web-based), Resampling Stats, or Box Sampler (the latter two are Excel add-ins). Our online platform offers a mix of videos, discussion forums, auto-graded quick quizzes, and human-graded exercises and projects.

**What is your favorite classroom activity for helping students “think with data”?**

It’s actually a coin-flipping exercise. We ask students to mentally “invent” 50 coin flips, and then actually flip a coin 50 times and report the results on a shared Google spreadsheet. The actual coin flips invariably have longer runs of heads or tails than the invented ones, which sparks a discussion about how the human mind over-interprets randomness. It sets the foundation for the whole machinery of inference.
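For readers who want to try the same idea without a coin, the following sketch simulates many sets of 50 fair flips and records the longest run in each. The function names, the seed, and the number of trials are assumptions chosen for illustration; the classroom exercise itself uses real coins and a shared spreadsheet.

```python
import random


def longest_run(flips):
    """Length of the longest streak of identical consecutive outcomes."""
    if not flips:
        return 0
    best = current = 1
    for prev, nxt in zip(flips, flips[1:]):
        current = current + 1 if nxt == prev else 1
        best = max(best, current)
    return best


def simulate_longest_runs(n_flips=50, n_trials=10000, seed=1):
    """Longest run in each of n_trials simulated sequences of fair coin flips."""
    rng = random.Random(seed)
    return [
        longest_run([rng.choice("HT") for _ in range(n_flips)])
        for _ in range(n_trials)
    ]


runs = simulate_longest_runs()
median_run = sorted(runs)[len(runs) // 2]
share_5_plus = sum(r >= 5 for r in runs) / len(runs)
print(f"median longest run in 50 flips: {median_run}")
print(f"share of sequences with a run of 5 or more: {share_5_plus:.2f}")
```

Comparing these simulated runs with the longest run in a student's invented sequence is one simple way to open the discussion.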

**Are you assigning students to use real data in your course? If so, where do you get it? Do you prep it? What unique difficulties do the data pose, and how do you deal with them?**

We use an anonymized and modified set of real customer purchase data from a software company. We were able to arrange for its use because of my connection to the software company, which I recognize is an unusual circumstance. Other data we use is binary outcome data based on realistic A-B tests that an eCommerce firm would do (e.g., click ratios for two web headlines). We also use some data from published studies (e.g., the relationship between cotton dust exposure and lung disease).

Real data is messy, and we do most of the data prep. You face a choice: you spend a lot of class time on data handling issues or you teach the analytics. At Statistics.com, we have other short courses that focus on the data munging, and those classes focus on a single tool, since programming facility is the key. So we have individual introductory courses in R, Python, SQL, and SAS, where data munging is more the focus than the statistics.

Stacey Hancock, assistant teaching professor at the University of California at Irvine, focuses her research primarily on statistics education, with additional interests in time series analysis and environmental statistics.

**Describe the introductory statistics course(s) you teach**

The introductory statistics course at UCI, Stats 7, covers a fairly traditional set of topics: descriptive statistics and plots, sampling and experimental design, some probability, sampling distributions, and one- and two-sample inference for means and proportions. We use *Mind on Statistics* by Jessica Utts and Robert Heckard, which introduces statistical inference early in the textbook through the chi-squared test for 2×2 tables and confidence intervals for one proportion.

Though we have some statistics minors, the primary audience is a wide variety of undergraduate majors ranging from biology and psychology to dance and international studies. We have 220 students in each lecture that meets 50 minutes three times per week. Statistics graduate student teaching assistants lead discussion activities with 55 students in each discussion section. Activities include randomization tests and simulations, often preceded by tactile simulation using cards, tickets, or plastic pigs. UCI is on the quarter system, so our course only lasts 10 weeks.

**What do you see as your biggest challenge as an instructor of an introductory statistics course?**

Getting students excited about statistics! The majority of students take introductory statistics because it is a required course, so the biggest challenge is to convince them that statistics is important and applicable to their daily life. Too often, students leave an introductory statistics course thinking statistics consists only of the normal distribution and t-tests. My goal is to teach students that, in this data-driven age, statistical literacy and statistical thinking are vital skills and statistics is applicable wherever we use data to make decisions.

**How are you adapting your introductory course(s) in light of the new ASA guidelines and the emergence of data science?**

I have been introducing more multivariable thinking in our introductory statistics course. We do not cover multiple regression, but I continuously challenge them to think about what other variables, confounders, or effect modifiers may be present in the study.

Two of my favorite data sets for multivariable thinking are the 1973 Berkeley graduate admissions data and average SAT score and expenditure data for each state. Both data sets exhibit Simpson’s Paradox. For example, when we plot average SAT score against pupil expenditures, we see a negative relationship. Does this mean we should take money away from the schools? Well, if we investigate further and stratify by the percent taking the SAT in each state, we see a positive relationship between average SAT score and pupil expenditures within each group. We still can’t conclude cause and effect since it is an observational study, but students are exposed to a scenario in which a marginal association does not match the direction of a conditional association.
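The reversal described above can be reproduced with a handful of numbers. The sketch below uses made-up values, not the real state data, and a hypothetical helper `slope` that computes a least-squares slope; it is meant only to show how a negative marginal slope can coexist with positive slopes within each group.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Made-up values for illustration: (expenditure per pupil in $1000s, average SAT score).
groups = {
    "low percent taking SAT": [(4, 1030), (5, 1050), (6, 1070)],
    "high percent taking SAT": [(7, 890), (8, 910), (9, 930)],
}

# Marginal association: pool every state and ignore the grouping variable.
all_states = [point for points in groups.values() for point in points]
marginal = slope(all_states)

# Conditional association: fit a separate slope within each group.
conditional = {name: slope(points) for name, points in groups.items()}

print(f"marginal slope: {marginal:.1f} SAT points per $1000")
for name, value in conditional.items():
    print(f"slope within '{name}': {value:.1f} SAT points per $1000")
```

In these invented numbers, the group with higher spending also has the higher share of test takers and lower average scores, which is what drives the pooled slope negative.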

There is now less focus on formal inference and more focus on exploratory data analysis and the scientific process as a whole in our course. On the first day of class, I have a “data discussion” using real-time data from gasbuddy.com and the U.S. Energy Information Administration Gasoline and Diesel Fuel Update. (I took this idea from Rob Gould’s October 2014 “Data Discussion” webinar.) This data discussion takes most of our 50-minute “lecture,” and students guide the exploration, comparing the pros and cons of different graphical displays and different sources of data. The goal is to instill curiosity. As Rob Gould says, “Data are begging to be questioned!”

**What technology do you use in the classroom?**

For statistical analyses, we use R Commander, a graphical user interface for R. We use a variety of online applets to build conceptual understanding of sampling variability, *p*-values, and confidence intervals through simulation. Additionally, we use clickers to encourage participation and discussion and to reinforce important concepts in the classroom.

**What is your favorite classroom activity for helping students “think with data”?**

One of my favorite classroom activities is a fairly simple activity I adapted from an activity Jessica Utts used in her introductory statistics course. The class is divided into teams of 3–4 students. Each team is given a team sheet, one sheet of colored paper, and either an overhead transparency sheet (if the classroom has an overhead projector) or another sheet of paper (if the classroom has a document camera). Each team comes up with a hypothesis about two binary variables of their choice. Then, the colored sheets of paper are used as tally sheets and are circulated around the room. Students then summarize and graph their data, assess evidence of their hypothesis, and present it to the class.

We use this activity in the first week of class, before they have seen sampling variability and hypothesis testing. It provides an opportunity to ask the question, “Could this have happened by chance?” and lead them through some informal inference ideas.
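One informal way to approach “Could this have happened by chance?” for two binary variables is a randomization test: shuffle one variable many times and see how often the shuffled difference in proportions is at least as large as the observed one. The sketch below is an assumption-laden illustration, not the activity's actual worksheet; the tally, the function names, and the number of shuffles are all hypothetical.

```python
import random


def diff_in_proportions(group, outcome):
    """Proportion of 1s in `outcome` where group == 1, minus that where group == 0."""
    in_group = [o for g, o in zip(group, outcome) if g == 1]
    out_group = [o for g, o in zip(group, outcome) if g == 0]
    return sum(in_group) / len(in_group) - sum(out_group) / len(out_group)


def randomization_p_value(group, outcome, n_shuffles=5000, seed=2):
    """Share of shuffles with a difference at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = diff_in_proportions(group, outcome)
    shuffled = list(outcome)
    extreme = 0
    for _ in range(n_shuffles):
        # Shuffling breaks any link between the two variables.
        rng.shuffle(shuffled)
        # The small tolerance guards against floating-point ties.
        if abs(diff_in_proportions(group, shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_shuffles


# Hypothetical tally for 40 students and two yes/no questions:
# 14 of 20 in the first group answered yes, versus 8 of 20 in the second.
group = [1] * 20 + [0] * 20
outcome = [1] * 14 + [0] * 6 + [1] * 8 + [0] * 12

observed, p_value = randomization_p_value(group, outcome)
print(f"observed difference in proportions: {observed:.2f}")
print(f"share of shuffles at least as extreme: {p_value:.3f}")
```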

**Are you assigning students to use real data in your course? If so, where do you get it? Do you prep it? What unique difficulties do the data pose, and how do you deal with them?**

Almost all data sets students encounter in our introductory statistics course are real, but all of them can be opened easily in a spreadsheet. Many data sets are taken from our textbook or other data repositories. I regularly use data that are making headlines, such as when the World Health Organization deemed that bacon causes cancer, and that are hopefully also of interest to the students. In particular, news stories that also offer graphics are fantastic classroom discussion material.

Though we use real data from real studies, due to the large class size, we do not assign projects where students develop their own scientific research question; collect or find data to address the question; visualize, summarize, and analyze the data; and then write a scientific report. We also do not present students with messy data, such as data that involve text or geographic coordinates, or missing data. I believe both projects and exposure to messy data are valuable, and I am working on ways we can incorporate these into large courses.

Jennifer Bryan, associate professor at the University of British Columbia, Vancouver, is jointly appointed in the statistics department and the Michael Smith Laboratories. She’s a biostatistician specializing in genomics and takes a special interest and delight in data analysis and statistical computing. She teaches STAT545 and is academic director for a master’s of data science program.

**Describe the introductory statistics course(s) you teach**

I teach a data science course aimed at graduate students from all across campus, STAT545. I presume some prior statistical coursework, but the goal is not to teach them new statistical methodology. Rather, I aim to help the students become effective at applying the stats they know (or will soon learn in other courses) to wild-caught data sets, such as their own thesis data.

**What do you see as your biggest challenge as an instructor of an introductory statistics course?**

Minimizing boredom and frustration. It reminds me of my kids’ struggle to enjoy reading. You have to push through an awkward phase where the kind of stories you find compelling also happen to be way beyond your reading level. I try hard to find data sets and analytical goals that are tractable, but still reveal interesting stories.

**How are you adapting your introductory course(s) in light of the new ASA guidelines and the emergence of data science?**

The changing educational climate has emboldened me to teach what actually takes most of my time and psychic energy as a data analyst: data cleaning, visualization, and wrangling. Ten years ago, I was much more sheepish about this; I worried it wasn’t “statistical” enough. Now I have no shame.

**What technology do you use in the classroom?**

We spend most of our class time live-coding R together in RStudio. I also make heavy use of Git for version control and the hosting site GitHub. All the course material is available there, and all student work is kept in GitHub-hosted repositories.

**What is your favorite classroom activity for helping students “think with data”?**

We work a lot with the Gapminder data set, which contains life expectancy and GDP per capita for hundreds of countries over time (among many other variables). I like to provide concrete instruction on an interesting-but-doable analysis for one country, and then leave it to the students to scale that up to many countries and do high-level inspection of those results to identify countries with interesting data. Then, we drill back down to take a closer look at these countries, visually and numerically. I like them to discover how iterative and nonlinear real data analysis can be.

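**Are you assigning students to use real data in your course? If so, where do you get it? Do you prep it? What unique difficulties do the data pose, and how do you deal with them?**

As mentioned above, we use data from Gapminder. This year, I also had them do some cleaning and analysis of a colleague’s survey on Halloween candy preferences.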

I have thoroughly prepared the Gapminder data and made it into a proper R data package. As the course progresses, we travel back in time—interacting with the Gapminder data in dirtier and dirtier forms until finally we arrive at the Excel spreadsheets it came from. Data cleaning requires more sophisticated knowledge of R data structures and looping patterns than exploring and plotting a clean data set. So we work our way back to the raw data from the clean.

The candy survey data was very dirty, and I didn’t have a beautiful version tucked away to reveal to them. And we still don’t! But we made some progress. It was a good complement to the more cut-and-dried Gapminder data.

In the past, I have made the mistake of trying to use freshly caught data each year, and that is the path to madness. It’s too hard to anticipate all the cleaning challenges and predict whether the data holds interesting stories.

FURTHER READING:

See what other educators had to say about their intro to stats courses in February’s *Amstat News*.