A Statistician's View

Like College Students, ChatGPT Finds Statistics Difficult

1 April 2023

Norman Matloff is a computer science professor at the University of California, Davis. He was one of the founding members of the department of statistics there and works in both fields. His current research interests are fairness in machine learning and statistical disclosure control.

The startling success of OpenAI's large language model ChatGPT has caused considerable hand-wringing among academics. They fear, for instance, that students could use ChatGPT to do their homework, rendering homework assignments meaningless.

As a statistician, I wondered whether the app was really up to the task of doing statistics at the level of insight we demand of our students. I put a number of statistical questions to it and found that though it is indeed capable of doing impressive work, it will not replace homework in statistics courses any time soon.
I gave it both specific homework problems and open-ended questions. Let’s take a look at the latter first.

General Questions for ChatGPT

I began by asking a personal favorite: How can public opinion polls be accurate if they are based on only a small fraction of the population? How can that be representative?

Statisticians are often asked this during election years. What did ChatGPT say about it?

The app generally does an excellent job of parsing questions and writes its answers in well-flowing, grammatically correct English. And it did make some good points in this case. It notes that “it’s important to be aware of the margin of error associated with the results and to consider other factors that may impact the accuracy of the poll, such as response bias and non-response bias.” But it largely evades the question put to it.

ChatGPT states, for example, “A larger sample size will generally result in a smaller margin of error, but even with a large sample, there is still some uncertainty associated with the results.” But the (hypothetical) questioner was worried about small samples.

The app, noting the word “representative” in the question, offers stratified sampling as the explanation. It points out that polling organizations take various measures “… to ensure that the sample is representative of the population in terms of demographic characteristics such as age, gender, race, and education level.”

But this would be unlikely to satisfy the questioner. Say 1,200 people are sampled, as is common, and we stratify by race. That would mean designing the sample to include about 160 African Americans. But anyone who wondered about the small sample overall would be even more puzzled as to why 160 African Americans could be "representative" of the tens of millions of Black people in the US.
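For context, the textbook answer to the questioner rests on the margin of error of a sample proportion, which depends on the sample size n but not on the population size. A minimal sketch, assuming a 95% confidence level and the worst-case proportion p = 0.5:

```python
import math

# Margin of error for a sample proportion at 95% confidence:
# z * sqrt(p(1-p)/n). The population size N appears nowhere in the formula.
def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

# For the common poll size of n = 1,200, the margin of error is
# roughly 0.028, i.e., about plus or minus 3 percentage points.
print(round(margin_of_error(1200), 3))
```

Of course, as the column argues, reciting this formula is exactly the kind of answer that leaves the lay questioner's intuition unaddressed.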

I also asked, “In a linear regression analysis, how can one determine which predictor variables are important?” Its answer began with the following:

Statistical Significance: One can assess the importance of each predictor by examining the p-values of the estimated regression coefficients. If the p-value for a predictor is below a certain significance level (e.g., 0.05), it indicates that the predictor is significantly associated with the response variable and can be considered important.

So, the app is ignorant of a fundamental concept in statistics: the distinction between statistical and practical significance.
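To illustrate the distinction ChatGPT misses: with a large enough sample, even a practically negligible effect becomes "statistically significant." A simulated sketch, with hypothetical data and a deliberately tiny true slope:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)  # true slope: a practically negligible 0.01

# OLS slope, its standard error, and the t statistic, computed directly
xc, yc = x - x.mean(), y - y.mean()
beta = xc @ yc / (xc @ xc)
resid = yc - beta * xc
se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
t = beta / se

# The estimated slope is tiny, yet |t| is large, so the p-value is
# far below 0.05: statistically significant, practically meaningless.
print(beta, t)
```

A p-value below 0.05 here says only that the slope is distinguishable from zero, not that the predictor matters in any practical sense.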

Another question I put to ChatGPT was, "What is the meaning of interaction terms in linear regression models?" Here, ChatGPT gave a circular response, using the term being queried as the answer, in effect saying "interaction is interaction":

In linear regression models, interaction terms represent the effect of the interaction between two or more predictor variables on the outcome variable. … For example, in a study examining the relationship between education and income, you might include an interaction term to capture the effect of the interaction between education and experience on income.

The answer does note that one can define an interaction effect as the product of the two predictor variables. This might have been a good start, but it never came close to addressing the interpretability issue: how the impact of Xi on Y might differ at different levels of Xj.
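What a non-circular answer looks like: in a model with mean response b0 + b1·X1 + b2·X2 + b3·X1·X2, the effect of a one-unit increase in X1 is b1 + b3·X2, so it changes with the level of X2. A sketch with hypothetical coefficient values:

```python
# Hypothetical coefficients for E[Y] = b0 + b1*x1 + b2*x2 + b3*x1*x2
def mean_y(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=-0.3):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Marginal effect of a one-unit increase in x1, at two levels of x2:
eff_at_0 = mean_y(1, 0) - mean_y(0, 0)  # b1 + b3*0 = 2.0
eff_at_5 = mean_y(1, 5) - mean_y(0, 5)  # b1 + b3*5 = 0.5
print(eff_at_0, eff_at_5)
```

The whole point of the interaction term is that the answer to "what is the effect of X1?" depends on where you sit on X2.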

And again, the answer failed to distinguish between statistical and practical significance:

However, it’s important to carefully consider the interpretation of interaction terms and to ensure that they are meaningful and statistically significant before including them in your model.

ChatGPT Does Statistics Homework

I teach a calculus-based course in probability and statistics for computer science majors. I gave ChatGPT actual homework problems from the class.

Say X, Y, and Z are indicator random variables, with success probabilities p, q, and r, and they are independent. Derive Var(XYZ) in terms of p, q, and r.

As noted, ChatGPT parses questions well, and all that remained here was to apply various known algebraic relations. Indeed, that is how it started, but at some point it applied them incorrectly.
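For the record, the correct derivation is short: since X, Y, and Z are independent indicators, the product W = XYZ is itself an indicator variable with success probability pqr, so Var(XYZ) = pqr(1 − pqr). A quick simulation check, with hypothetical values of p, q, and r:

```python
import numpy as np

p, q, r = 0.3, 0.6, 0.8  # hypothetical success probabilities
rng = np.random.default_rng(42)
n = 1_000_000

# W = XYZ is 1 exactly when all three independent indicators are 1
w = (rng.random(n) < p) & (rng.random(n) < q) & (rng.random(n) < r)

exact = p * q * r * (1 - p * q * r)  # Var(XYZ) = pqr(1 - pqr)
print(w.var(), exact)  # the two should nearly agree
```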

ChatGPT had similar, and more complex, problems with a homework question on what I called a "double geometric" distribution, extending over all integers, positive and negative. Again, the question asked students to derive the variance. I'll omit the details of the problem specification, but once again ChatGPT misapplied known probabilistic relations.

I also asked ChatGPT to write some simulation code: Buses arrive at a certain stop at random times, with the interarrival times having a U(0,b) distribution. Successive interarrivals are independent. … Write R simulation code sim(nbuses,b,p,w) to find the long-run average wait times for passengers who arrive to find i passengers already present, i = 0,1,2,3,…,w…

ChatGPT’s solution was impressive in many ways. The tool clearly understood the question, in spite of its length and complexity, and a quick skim of the code shows ChatGPT apparently learned quite a bit from similar problems it saw somewhere on the web. But the code was severely wrong. For instance, it ended by returning a certain quotient whose numerator and denominator were both conceptually incorrect.
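The full problem statement is elided above, but the core of the setup can be sketched in a few lines. This simplified illustration (my own sketch in Python, not ChatGPT's answer, and ignoring the i-passengers-present and p, w details) simulates U(0, b) interarrivals and the wait of a passenger arriving at a uniformly random time. By renewal theory the long-run mean wait is E[A²]/(2E[A]) = b/3, not the naive b/4, an instance of the inspection paradox:

```python
import numpy as np

def mean_wait(nbuses, b, npassengers, seed=1):
    rng = np.random.default_rng(seed)
    buses = np.cumsum(rng.uniform(0, b, nbuses))  # bus arrival times
    t = rng.uniform(0, buses[-1], npassengers)    # passenger arrival times
    nxt = np.searchsorted(buses, t)               # each passenger's next bus
    return (buses[nxt] - t).mean()

# For b = 10, the long-run mean wait should be close to 10/3, not 10/4,
# because a random arrival is more likely to land in a long gap.
print(mean_wait(100_000, 10.0, 100_000))
```

A simulation that returns a mean near b/4 here is a quick signal that the code, like ChatGPT's, has gone conceptually wrong.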

Discussion

Perhaps we critics are being overly harsh on ChatGPT. After all, it learns from materials on the web, and if, for instance, it does not understand the difference between statistical and practical significance, that likely reflects a failure by statistics instructors to draw the distinction in the first place. ChatGPT is merely the messenger.

And one novel use of the tool might be to ask students questions of the form, "Where is ChatGPT's answer wrong?" That might be fun for students, though probably challenging in many cases.

At any rate, statistics instructors are still safe in assigning homework, albeit possibly of a more sophisticated nature.
