Statisticians Have Large Role to Play in Web Analytics
Vincent Granville is chief scientist at a publicly traded company and the founder of AnalyticBridge. He has consulted on projects involving fraud detection, user experience, core KPIs, metric selection, change point detection, multivariate testing, competitive intelligence, keyword bidding optimization, taxonomy creation, scoring technology, and web crawling.
Web and business analytics are two areas that are becoming increasingly popular. While these areas have benefited from significant computer science advances such as cloud computing, programmable APIs, SaaS, and modern programming languages (Python) and architectures (Map/Reduce), the true revolution has yet to come.
We will reach limits in terms of hardware and architecture scalability. Also, cloud computing can be applied efficiently only to problems that can be partitioned easily, such as search (web crawling). Soon, a new type of statistician will be critical to optimize “big data” business applications. They might be called data mining statisticians, statistical engineers, business analytics statisticians, data or modeling scientists, but, essentially, they will have a strong background in the following:
- Design of experiments; multivariate testing is critical in web analytics
- Fast, efficient, unsupervised clustering and algorithmic techniques to solve taxonomy and text clustering problems involving billions of search queries
- Advanced scoring technology for fraud detection and credit or transaction scoring, or to assess whether a click or Internet traffic conversion is real or botnet generated; models could involve sophisticated versions of constrained or penalized logistic regression and unusual, robust decision trees (e.g., hidden decision trees) in addition to providing confidence intervals for individual scores
- Robust cross-validation, model selection, and fitting without over-fitting, as opposed to traditional back-testing
- Integration of time series cross correlations with time lags, spatial data, and event categorization and weighting (e.g., to better predict stock prices)
- Monte Carlo; bootstrap; and data-driven, model-free, robust statistical techniques used in high-dimensional spaces
- Fuzzy merging to integrate corporate data with data gathered on social networks and other external data sources
- Six Sigma concepts and Pareto analyses to accelerate the software development lifecycle
- Models that detect causes, rather than correlations
- Statistical metrics to measure lift, yield, and other critical key performance indicators
- Visualization skills, including the ability to present data summaries in videos as well as charts
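As a minimal illustration of the bootstrap techniques listed above, the sketch below computes a percentile-bootstrap confidence interval for a conversion rate; it is model-free and data-driven in exactly the sense described. The click data and sample size are invented for illustration.

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data with replacement n_boot times, recomputing the statistic each time.
    reps = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical click outcomes: 1 = converted, 0 = did not convert.
clicks = [1] * 30 + [0] * 270
mean = lambda xs: sum(xs) / len(xs)
low, high = bootstrap_ci(clicks, mean)
print(f"conversion rate: {mean(clicks):.3f}, 95% CI: ({low:.3f}, {high:.3f})")
```

The same function works for any statistic (median, lift, a score), which is what makes the bootstrap attractive in high-dimensional, distribution-free settings.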
Further reading:

- Handbook of Natural Language Processing by Nitin Indurkhya and Fred J. Damerau
- Collective Intelligence by Toby Segaran
- Handbook of Fitting Statistical Distributions with R by Zaven A. Karian and Edward J. Dudewicz
- Statistics for Spatial Data by Noel Cressie
- Computer Science Handbook by Allen B. Tucker
- Data Mining and Knowledge Discovery Handbook by Oded Maimon and Lior Rokach
- Handbook of Computational Statistics by James E. Gentle, Wolfgang Härdle, and Yuichi Mori
- Handbook of Statistical Analysis and Data Mining Applications by Robert Nisbet, John Elder, and Gary Miner
- International Encyclopedia of Statistical Science by Miodrag Lovric
- The Princeton Companion to Mathematics by Timothy Gowers
- Encyclopedia of Machine Learning by Claude Sammut and Geoffrey Webb
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Numerical Recipes: The Art of Scientific Computing by William Press, Saul Teukolsky, William Vetterling, and Brian Flannery
An example of a web analytics application that will benefit from statistical technology is estimating the value (CPC, or cost-per-click) and volume of a search keyword depending on market, position, and match type—a critical problem for Google and Bing advertisers, as well as publishers. Currently, if you use the Google API to get CPC estimates, Google will return no value more than 50% of the time. This is a classic example of a problem that was addressed by smart engineers and computer scientists, but truly lacks a statistical component—even as simple as naïve Bayes—to provide a CPC estimate for any keyword, even those that are brand new. Statisticians with experience in imputation methods should solve this problem easily and help their companies sell CPC and volume estimates (with confidence intervals, which Google does not offer) for all keywords.
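One simple imputation strategy of the kind described is to back off from the full keyword to its tokens: estimate the CPC of a brand-new keyword by averaging token-level CPCs learned from keywords with known values. The keyword table, prices, and fallback value below are all invented for illustration; a production system would train on billions of observed bids.

```python
from collections import defaultdict

# Hypothetical observed CPCs (in dollars) for known keywords.
known_cpc = {
    "car insurance": 12.40,
    "cheap insurance": 8.10,
    "cheap car rental": 1.90,
    "car rental": 2.10,
}

# Build token-level average CPCs from the known keywords.
token_totals = defaultdict(lambda: [0.0, 0])
for phrase, cpc in known_cpc.items():
    for tok in phrase.split():
        token_totals[tok][0] += cpc
        token_totals[tok][1] += 1
token_cpc = {t: s / n for t, (s, n) in token_totals.items()}

def estimate_cpc(keyword, fallback=0.50):
    """Impute a CPC for an unseen keyword by averaging its tokens' CPCs."""
    if keyword in known_cpc:
        return known_cpc[keyword]
    vals = [token_cpc[t] for t in keyword.split() if t in token_cpc]
    return sum(vals) / len(vals) if vals else fallback

# A brand-new keyword still gets an estimate instead of "no value."
print(estimate_cpc("cheap car insurance"))
```

Pairing each such estimate with a bootstrap confidence interval would give advertisers exactly what the raw API does not: a value, plus a measure of how much to trust it.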
Another example is spam detection in social networks. The most profitable networks will be those in which content—be it messages posted by users or commercial ads—will be highly relevant to users, without invading privacy. Those familiar with Facebook know how much progress still needs to be made. Improvements will rely on better statistical models.
Spam detection is still largely addressed using naïve Bayes techniques, which are notoriously flawed due to their inability to take into account rule interactions. It is like running a regression model in which all independent variables are highly dependent on each other.
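To make that flaw concrete, here is a toy naïve Bayes spam scorer (the training messages and vocabulary are invented). The per-word product at its core treats correlated words such as “free” and “offer” as independent pieces of evidence, double-counting them, which is exactly the rule-interaction problem described above.

```python
import math
from collections import Counter

# Toy labeled training data; real systems train on millions of messages.
spam = ["free offer now", "free free offer", "win money now"]
ham = ["meeting agenda attached", "lunch tomorrow", "project status update"]

def word_counts(msgs):
    c = Counter()
    for m in msgs:
        c.update(m.split())
    return c

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood(msg, counts):
    total = sum(counts.values())
    # The naive independence assumption: each word contributes separately,
    # so correlated words like "free" and "offer" are double-counted.
    return sum(
        math.log((counts[w] + 1) / (total + len(vocab)))  # Laplace smoothing
        for w in msg.split()
    )

def spam_score(msg):
    ls = log_likelihood(msg, spam_counts)
    lh = log_likelihood(msg, ham_counts)
    return 1 / (1 + math.exp(lh - ls))  # P(spam | msg), assuming equal priors

print(spam_score("free offer"), spam_score("meeting tomorrow"))
```

A model that captured interactions (e.g., a logistic regression with interaction terms, or decision trees) would score the pair “free offer” differently from the sum of its parts.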
Finally, in the context of online advertising ROI optimization, one big challenge is assigning attribution. If you buy a product two months after seeing a television ad twice, one month after checking organic search results on Google for the product in question, one week after clicking on a Google paid ad, and three days after clicking on a Bing paid ad, how do you determine the cause of your purchase?
It could be 25% due to the television ad, 20% due to the Bing ad, etc. This is a rather complicated advertising mix optimization problem, and being able to accurately track users over several months helps solve the statistical challenge. Yet, with stricter user-tracking regulations preventing the use of IP addresses in databases for targeting purposes, the problem will become more complicated and more advanced statistics will be required. Companies working with the best statisticians will be able to provide great targeting and high ROI without “stalking” users in corporate databases and data warehouses.
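One simple heuristic for the scenario above (a time-decay model, not the only defensible choice; the touchpoints and half-life are invented) gives each touchpoint a weight that decays exponentially with the time elapsed before the purchase, then normalizes the weights to percentages:

```python
# Hypothetical touchpoints from the purchase story above:
# (channel, days before purchase)
touchpoints = [
    ("tv ad", 60), ("tv ad", 60),
    ("organic search", 30),
    ("google paid ad", 7),
    ("bing paid ad", 3),
]

def time_decay_attribution(points, half_life_days=14.0):
    """Weight each touchpoint by exponential decay in its age; normalize to 100%."""
    raw = [(ch, 0.5 ** (days / half_life_days)) for ch, days in points]
    total = sum(w for _, w in raw)
    credit = {}
    for ch, w in raw:
        credit[ch] = credit.get(ch, 0.0) + 100 * w / total
    return credit

for channel, pct in time_decay_attribution(touchpoints).items():
    print(f"{channel}: {pct:.1f}%")
```

The half-life is exactly the kind of parameter a statistician would estimate from held-out conversion data rather than pick by hand, and a full solution would also model interactions between channels.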