
Bigger Isn’t Always Better When It Comes to Data

1 May 2017

Barry D. Nussbaum

Life is more than just a sample size. I think in my 40-year career with the U.S. Environmental Protection Agency, the single question I was asked most was, “What sample size do I need?” Frequently, the question was asked without any further explanation of the project, evoking my response of, “Sit down, let’s talk.” I was fascinated, and admittedly disappointed, that the premier reason people consulted with me was merely to ascertain the size of the sample. In fact, usually the sample size was the least of the problem. Let me give an example.

I was involved in a case that is quite old now, but demonstrates the premise so well. In the late 1970s, EPA ordered Chrysler Motors to recall a class of vehicles for excessive emissions of carbon monoxide. These were rather large vehicles with 360- and 400-cubic-inch displacements. Even then, they were considered “muscle cars.” But here comes the problem: how to measure the emissions from these cars, since they were already “in use.” That is, instead of being on the assembly line, motorists like you and me were driving these. So how would you acquire a sample of them? That is just the beginning of the questions involving this sampling. In fact, let’s start at the beginning and try to specify the population. Thinking back to college, this is not a simple case of picking balls out of an urn.

The population is: “Carbon monoxide emissions from well-maintained and used 1975 model Chrysler vehicles under 100,000 miles not sold in California or Denver as measured on the Federal Test Procedure.” That is a mouthful. Let’s examine it a bit. The actual data we want are the carbon monoxide emissions as measured on the Federal Test Procedure (FTP). What is the FTP? It is a test in which the car is on a dynamometer (think of it as a treadmill for cars) under controlled temperature and humidity conditions and following a vehicle trace that mimics the average morning commute in Los Angeles. (Just discussing the FTP would consume enough President’s Corner articles for the next six months. I will spare you.)

What is a well-maintained and used vehicle? One whose owner has dutifully followed the basic required maintenance, such as changing oil, and has not used the vehicle as a race car, on off-road terrain, to tow a boat, etc. Why stop at 100,000 miles? That was considered the “useful life” back then. What’s wrong with California and Denver? California had its own set of emission standards, and Denver vehicles have different settings to adapt to high-altitude driving.

With these concerns in mind, we must figure out how to obtain a random, representative set of vehicles. EPA had a testing lab in Livonia, Michigan, so we used the Wayne County vehicle registration list to randomly select vehicles. If your vehicle was selected, we phoned several times at different hours of the day, sent registered mail, etc. to reach you. No substitution allowed for a convenience sample. But even if you were reached, we didn’t know if the car was well maintained and used until we administered a rigorous questionnaire concerning your driving and maintenance habits. Assuming that was satisfied, why would you volunteer your car for government testing? We provided the following three-part incentive: a fully insured loaner vehicle, your car returned with our mechanics setting it to factory specifications, and a $50 U.S. Savings Bond (remember, this was the 1970s). So, after going through all this, the vehicle was submitted to testing.

Notice a subtle nuance here. Normally, you have a population and you sample elements from the population. Here, we really didn’t know if the vehicle’s emissions belonged to the population, due to the maintenance and use restrictions, until we administered the questionnaire after the vehicle had been randomly selected.

Now, with all these considerations satisfied, we finally come to the problem of sample size. Here the issue took an odd twist. Carbon monoxide emissions are clearly a continuous measurement, and the most applicable probability distribution is the lognormal. However, this was a legal enforcement case, and the Clean Air Act considers only whether a vehicle meets or exceeds the applicable standard. So, perhaps counterintuitively, we used a binomial model! And how many did we sample? 10. Yes, just 10.

Prior ancillary data had indicated these vehicles were likely to exceed standards, and in fact all 10 did exactly that. Think about your binomial model. If the vehicles really did meet standards, so that each had at most an even chance of failing the test, the probability of observing all 10 failing is at most (1/2)^10 = 1/1024, less than 1/1000. I suspect readers of this column can understand that low probability, but this case went to an administrative court hearing. The biggest concern was conveying it succinctly to a judge, who was going to decide whether we won or lost the case. (Talk about a binomial outcome!)
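The arithmetic behind that figure can be checked in a few lines. The even-chance failure probability under the null is my reading of the argument, not a value stated in the case record:

```python
from math import comb

# Null hypothesis: the vehicles meet standards, so a randomly selected
# vehicle fails the test with probability at most 1/2 (an assumed,
# conservative reading of "the vehicles really did meet standards").
p_fail = 0.5
n = 10

# Binomial probability that all 10 sampled vehicles fail under the null:
# C(10, 10) * (1/2)^10 * (1/2)^0 = 1/1024.
p_all_fail = comb(n, n) * p_fail**n * (1 - p_fail) ** 0
print(p_all_fail)  # 0.0009765625, just under 1/1000
```

Any smaller null failure probability only makes the observed outcome less likely, so 1/1024 is an upper bound on the chance of seeing all 10 fail by accident.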

Many of you have heard my mantra and the subject of my March President’s Corner: “It’s not what we said, it’s not what they heard, it’s what they say they heard.” In this situation, it was crucial that the judge understood us and said what he heard correctly. I am happy to report Judge Edward Finch did precisely that. We won the case.

The point of all this is that administering the proper sampling was more difficult than merely specifying a sample size. And if the sampling were defective, a larger sample size would not remedy the situation. Sure, larger sample sizes are typically better than smaller ones, but the real problem is in the sampling.

By the way, why are larger sample sizes only “typically” better? A paper I wrote with Nagaraj Neerchal and Pepi Lacayo (“Is a Larger Sample Size Always Better?”, American Journal of Mathematical and Management Science, Vol. 28, Nos. 3–4, 2008) shows this is not necessarily the case for some discrete distributions.
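The paper’s examples are its own; one widely known illustration of the same phenomenon (my choice here, not taken from the paper) is the actual coverage of the nominal 95% Wald confidence interval for a binomial proportion, which does not improve monotonically as the sample size grows:

```python
from math import comb, sqrt

def wald_coverage(n, p, z=1.96):
    """Actual coverage probability of the nominal 95% Wald interval,
    phat +/- z*sqrt(phat*(1-phat)/n), when the true proportion is p."""
    cover = 0.0
    for k in range(n + 1):
        phat = k / n
        half = z * sqrt(phat * (1 - phat) / n)
        if phat - half <= p <= phat + half:
            cover += comb(n, k) * p**k * (1 - p) ** (n - k)
    return cover

# Coverage oscillates as n increases: adding one more observation can
# make the interval's actual coverage worse, not better.
covs = {n: wald_coverage(n, 0.5) for n in range(10, 51)}
drops = [n for n in range(10, 50) if covs[n + 1] < covs[n]]
print(len(drops) > 0)  # True: coverage sometimes falls when n grows by one
```

Because the binomial distribution is discrete, the set of outcomes whose interval happens to cover p changes in jumps as n changes, which is exactly the kind of behavior that keeps “bigger is better” from being a theorem.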

Of course, there is a flip side to this problem. Once I specified a sample size of 10 in the automobile emissions recall example, most of the engineers assumed 10 was always the answer. Heck, if it worked for Chrysler, why not for the others? Of course, the right size depends on what we know from prior ancillary data. This reminded me of a YouTube clip many of you may have seen in which a scientist and a statistician try to collaborate. The scientist is hung up on a sample of size three since that is what was always used. The clip is humorous and, on reflection, sad as well.

While I am thinking of humorous determinations of sample size: I sometimes suggest we should stop at samples of size one, since otherwise variance starts to get in the way. I always say it with a smile, but sometimes think the audience is taking me seriously. (Note to Barry: Be careful here!)

Why my concern with a rather old case involving a small sample? There are two reasons. First, because we currently have a fascination with Big Data (large volume, velocity, variety, and, hopefully, veracity), we sometimes forget the beautiful basic utility of inferential statistics: getting a lot of information from small, but well-constructed, samples. Second, again with Big Data in mind, I wanted to underscore the importance of the raw data, datum by datum. That importance does not go away when we are deluged with data. I will write about my thoughts on Big Data in a future column.

So, when all is said and done, I am a bigger believer in the quality of data than in the quantity.

Significantly forward,

