The Paradox of Choice: Statistical Software Packages

1 March 2016

Kevin Putschko is a statistical consultant at Experis Manpower in Portage, Michigan. He graduated in 2015 from the professional science master’s program with an emphasis in biostatistics at Grand Valley State University.

I borrow the title of this article from Barry Schwartz’s bestselling book, The Paradox of Choice: Why More Is Less, because it is apt in our modern era of statistical computing. While this book was written with regard to everyday anxieties that arise when considering the seemingly limitless number of options we have for just about everything, I think the idea translates well into the field of statistical computing.

The 2015 O’Reilly Data Science Salary Survey covered 116 software tools that data scientists use. The programs in the survey vary widely in functionality, sometimes with little to no overlap between them. Yet a great number of them offer highly similar functions and options. With such a vast array of programs available, it is no wonder so many students in my graduating cohort felt uneasy when thinking about which programs they would be asked to use in a professional setting.

At Grand Valley State University, we primarily used base SAS throughout both my undergraduate and graduate studies. We also worked with, to a lesser degree, programs such as SPSS, JMP, R, SAS Enterprise Miner, and SAS Enterprise Guide. The purpose of exposing students to this wide array of statistical programs was not to instill the expectation of proficiency, but to convey the availability of such programs and some understanding of the capabilities of each.

When considering commercial products, just a handful of tools have remained dominant for decades. SAS has long been considered the gold standard in business, pharmaceutical, and industry settings, while SPSS and Stata have been the preferred tools among the social sciences. There is a host of reasons why these programs have persisted with such high regard, whether the ease of use within the GUI of SPSS or SAS Enterprise Guide, the sheer magnitude of scope and depth when using the syntax of base SAS, the power and simple drag-and-drop tools of SAS Enterprise Miner, or the customer support and documentation of every procedure included in these programs. These benefits have been well worth the cost of subscription in an established corporate environment.

In recent years, the popularity of open-source programs has greatly increased. According to the 2015 Annual Software Survey by KDnuggets, “This year, 91% of voters used commercial software and 73% used free software. About 27% used only commercial software, and only 9% used only free software. For the first time a majority of 64% used both free and commercial software, up from 49% in 2014.”
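
The percentages in the quote can be cross-checked with a little arithmetic. The sketch below is only an illustration, not part of the survey: the function name `usage_breakdown` is hypothetical, and it assumes every voter used at least one of the two kinds of software, so the overlap follows from inclusion-exclusion.

```python
def usage_breakdown(commercial_pct, free_pct):
    """Split survey voters into both / commercial-only / free-only shares.

    Assumption (not stated in the survey quote): every voter used at
    least one of the two kinds of software, so the two groups together
    cover 100% of voters.
    """
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|, with |A or B| = 100
    both = commercial_pct + free_pct - 100
    return {
        "both": both,
        "commercial_only": commercial_pct - both,
        "free_only": free_pct - both,
    }


# Figures quoted from the 2015 survey: 91% commercial, 73% free
breakdown = usage_breakdown(91, 73)
print(breakdown)
```

Under that assumption, the 91% and 73% totals are consistent with the 64% who used both, the 27% who used only commercial software, and the 9% who used only free software.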

The results of this survey show that more people are finding a balance between commercial and open-source products. The recent upswing could be attributed to the growing number of programs that address long-cited issues such as documentation, syntax readability, support, and speed. Two major recent developments in the open-source world are the acquisition of Revolution Analytics by Microsoft and the formation of the R Consortium, an organization supported by companies including Microsoft, Google, and Hewlett-Packard. These developments signal a shift of open-source statistics out of the realm of “pure academia” and into the “heavyweight” arena of industry application.

As expected, there are learning curves with any new language or program, and some curves are steeper than others. It is easy to understand how a student could become frustrated while learning an additional new language, especially after engaging in and working with a language as rich as SAS. The hope, though, is to encourage and develop curiosity and competence when faced with a new program and language.

Regardless of a program’s popularity, understanding its functionality and overcoming the learning curve take time and effort that many students may not be able to devote until they are on the job. When I was asked to use R exclusively for my internship, I had no prior experience with the software and had to turn to the Internet to begin my quest for understanding. It took a long time to feel comfortable with the language, but I am now thankful for the opportunity to familiarize myself with another statistical program.

There are countless online resources for learning a new language, including DataCamp, the SAS Institute, Kaggle, Coursera, Code School, and the Stack Overflow forums. Stack Overflow in particular has been incredibly beneficial, with a thriving community ready to assist with questions of varying complexity and answers tailored to your unique inquiries. Of course, nothing can beat learning in a classroom with personalized instruction, which is why there are numerous opportunities for fee-based, in-class training hosted by the software companies themselves, at conferences, or at university seminars. With all these options, independently learning something new is not as daunting a task as it may have once been.

Although a master’s program is expected to educate a student in many ways, it simply cannot teach a student all they will need to know in life post-graduation. Instead, students should graduate feeling confident in their ability to follow their curiosities and in their ability to take the initiative to understand whatever new ideas or skills are required of them.

Further Reading

Adams, William C., D. Lind Infeld, and C. M. Wulff. 2013. Statistical software for curriculum and careers. Journal of Public Affairs Education 19(1).

King, John, and Roger Magoulas. 2015. 2015 data science salary survey: Tools, trends, what pays (and what doesn’t) for the data professionals. http://oreil.ly/20BRm7e.

Piatetsky, Gregory. 2015. R leads RapidMiner, Python catches up, big data tools grow, Spark ignites. KDnuggets. http://bit.ly/1RnSjgQ.

Schwartz, Barry. 2004. The paradox of choice: Why more is less. Harper Collins: New York.

One Comment »

  • Sachit Ganapathy said:

    One of the best articles about statistical software that I have read. It was not only informative but also made me look at this whole system from a new perspective. Thank you!