President's Corner

Making a Statistical Impact with Text Data

1 November 2019

One of the great pleasures—and benefits—of belonging to the ASA is the opportunity to hear about problems on which we can make an impact. In this column and the next, I’ll mention some of these, which will include progress on the 2019 initiatives (impact, disinformation, diversity), pointing us to new challenges to pursue.

One big challenge to which statistics can contribute solutions concerns text data. Surveys and questionnaires favor fixed-choice questions (e.g., “On a scale of 1 to 10, rate …”) and avoid open-ended responses (e.g., “Other (specify) ….”). But in many areas, text responses may be preferable or even essential. For example, a physician may ask the patient to describe the level and location of discomfort or the times when pain is greatest. The data from such responses can be important for evaluating a treatment under consideration (e.g., under what patient conditions is the treatment likely to be “successful,” where “success” itself must be quantified and may depend on the patient).

In more complicated situations, a clinician must mark on a schematic of the body the location and size of bruises, lacerations, and other physical aberrations, which are translated into text (e.g., “upper right arm”). How do we use these data as features for classifying patients into categories of severity of physical condition?

This problem recently came to my attention from a professor in our school of nursing who often is asked to testify about the features associated with attempted strangulation of an assault victim. The forensic nurse’s testimony is important because, in most states, nonlethal strangulation can result in a felony conviction. If the nurse testifies that the features found on physical examination of the victim are consistent with nonlethal strangulation, the judge may order the person charged to remain in jail until trial. So, these features can be crucial evidence to support a conviction (otherwise, the alleged perpetrator could be released, free to strike again). Electronic health records raise similar issues, with such data recorded in multiple formats.

The challenges of text data arise with personal health and recommender systems also. Increasingly, mobile devices are collecting personal fitness data, and users want to know how the data can be associated with how they feel emotionally and physically.

Combinations of machine learning and natural language processing algorithms have been applied to such data, and both offer opportunities for statisticians to evaluate their accuracies and uncertainties.
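To make the idea concrete, here is a minimal sketch of one common approach, a naive Bayes text classifier built only from Python’s standard library. The symptom phrases, severity labels, and add-one smoothing choice are all illustrative assumptions for this sketch, not a method described in the column.

```python
import math
from collections import Counter

def tokenize(text):
    """Crude whitespace tokenizer; real NLP pipelines do far more."""
    return text.lower().split()

def train(docs):
    """docs: list of (text, label). Returns per-label word counts and document counts."""
    word_counts, doc_counts = {}, Counter()
    for text, label in docs:
        word_counts.setdefault(label, Counter()).update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts

def predict_proba(word_counts, doc_counts, text):
    """Posterior probability of each label, with Laplace (add-one) smoothing."""
    vocab = {w for c in word_counts.values() for w in c}
    n_docs = sum(doc_counts.values())
    log_post = {}
    for label, counts in word_counts.items():
        lp = math.log(doc_counts[label] / n_docs)   # log prior
        denom = sum(counts.values()) + len(vocab)   # smoothed denominator
        for w in tokenize(text):
            lp += math.log((counts[w] + 1) / denom)  # smoothed word likelihood
        log_post[label] = lp
    # Normalize in log space for numerical stability
    m = max(log_post.values())
    unnorm = {k: math.exp(v - m) for k, v in log_post.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

# Hypothetical training data: clinicians' free-text injury notes
docs = [
    ("sharp pain swelling upper arm", "severe"),
    ("deep bruise severe pain neck", "severe"),
    ("mild ache wrist", "mild"),
    ("slight soreness no bruising", "mild"),
]
wc, dc = train(docs)
proba = predict_proba(wc, dc, "severe pain and swelling")
```

The predicted probabilities `proba` are exactly the kinds of quantities whose calibration and uncertainty statisticians are well positioned to evaluate.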

Another example arises in experiments to assess the accuracy of eyewitness identification. In such experiments, a “mock eyewitness” (lab participant) views a video of a crime, and then sees a lineup of, say, six photos of possible perpetrators. Experimenters believe strongly that confidence in the identification is related to accuracy, so they ask their participants, “How confident are you in your identification?” Participants’ choices are “0% confident,” “20%,” “40%,” “60%,” “80%,” or “100% confident.” One person’s answer of “100%” may be someone else’s (like a statistician’s) “60%” (statisticians are not accustomed to claiming “100% certain” of anything).

Moreover, such experiments are often conducted with university psychology students, who may have experience quantifying their confidence on such a scale, but police officers know the typical victim or eyewitness asked to identify a perpetrator from a lineup will have no clue what “60% confident” means. So, in real life, the officer can only ask the eyewitness to describe in words, “How confident are you?” The response may depend not only on factors specific to the crime and eyewitness (e.g., time between the crime and identification process, differences in age and race between eyewitness and perpetrator, the eyewitness’s tendency toward conservative versus confident responses), but also on the officer’s interpretation of the response. One officer might interpret “I’m pretty sure” as “80% confident,” while another might interpret it as “90%” or “100% confident.” Further, the quantitative translation of that response may depend on other comments from the eyewitness during the identification process.

How do we extract useful information from such responses so we will know whether to advise police officers to trust or dismiss such responses about “confidence”? We need methods to extract the relevant information quickly, process text data mathematically, summarize the data, and ultimately interpret it in the context of other features. And we will need to translate our findings into language others can understand!
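One way to make the officer-interpretation problem explicit is to map verbal statements to numeric *ranges* rather than single numbers. Every phrase and interval in this toy sketch is an invented assumption; a defensible mapping would have to be estimated from data, not asserted.

```python
# Toy mapping from verbal confidence statements to numeric ranges.
# The phrases and intervals are illustrative assumptions only.
CONFIDENCE_RANGES = {
    "positive": (0.90, 1.00),
    "pretty sure": (0.70, 0.95),
    "think so": (0.45, 0.75),
    "not sure": (0.15, 0.45),
}

def interpret(response):
    """Return (low, high) for the first matching phrase, else None."""
    text = response.lower()
    for phrase, interval in CONFIDENCE_RANGES.items():
        if phrase in text:
            return interval
    return None
```

Returning an interval such as (0.70, 0.95) for “I’m pretty sure” makes explicit the spread over which different officers’ point translations (“80%” versus “100%”) might plausibly fall, which is precisely the uncertainty a statistical model of such data should carry forward.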

Some of the same issues related to processing text data arise in classifying news articles as genuinely informative, mostly informative apart from unintentional reporting errors, or intentionally inaccurate. Such is the task of the “disinformation” initiative co-chaired by former ASA President Jessica Utts and Duke computer science professor and associate chair Jun Yang. How “robust” are news stories to misinterpretation? In fact, how do we even “measure” degrees of misinterpretation so we can minimize the errors when a news story is read? What “loss functions” or “penalty terms” apply to such misinterpretations of text data?

Members of this task force are collecting resources related to documented instances of “disinformation” and open problems related to the classification of news items from multiple sources. They will be turning that information into a white paper with a list of possible research directions ripe for collaboration between computer scientists and statisticians. They also are creating a website of resources pertaining to studies on “fake news,” which will be made available to ASA members. The translation of findings for nonstatisticians is even more critical here, as it affects not just our research collaborators, but also the general public. A subset of the task force is investigating mechanisms for effectively educating the public about how to recognize attempts at fake news.

Social media data are another source of text data, and the analysis of such data raises additional complications. In a 2019 American Journal of Public Health paper, Quynh Nguyen and her colleagues used geotagged Twitter data to study associations between sentiments and health behaviors expressed in the tweets and aggregate county-level health outcomes such as rates of mortality, obesity, and substance abuse.
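A stripped-down version of tweet sentiment scoring shows how quickly modeling choices enter. The word lists below are tiny illustrative assumptions; studies like the one described rely on large, validated sentiment lexicons and machine-learned classifiers, not hand-picked sets.

```python
# Minimal lexicon-based sentiment score in [-1, 1].
# POSITIVE/NEGATIVE word lists are illustrative assumptions only.
POSITIVE = {"happy", "great", "healthy", "active", "love"}
NEGATIVE = {"sad", "sick", "tired", "pain", "stressed"}

def sentiment(tweet):
    """Fraction of sentiment-bearing words that are positive, rescaled to [-1, 1]."""
    words = tweet.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words found
    return (pos - neg) / (pos + neg)
```

Even this caricature surfaces the statistical questions the column raises: how scores aggregated to the county level relate to individual outcomes, and how the scorer’s errors propagate into those aggregates.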

Analyses of such data raise issues we need to consider. First, Steven Piantadosi and his colleagues warned us in their 1988 American Journal of Epidemiology paper, “The Ecological Fallacy,” about drawing inferences from aggregate data for individual outcomes. Second, users of social media in a given county are not likely to be representative of the county’s population (the authors acknowledge “only 23% of all internet users and 20% of the US adult population use Twitter”). A further limitation was the need to know the spatial origin of the tweets: the authors note from previous studies that only 1–2 percent of tweets may contain GPS information and that “users who enable geotagging of their tweets differ demographically from those who do not.”

Nonetheless, such analyses may provide ideas for designing and analyzing future studies. (Incidentally, lack of representativeness may undermine some experiments in eyewitness identification as well. Often, they rely on online platforms to recruit voluntary participants as their “mock eyewitnesses.” How many of you have signed up to participate in an online experiment for modest compensation for your time?)

On the enormous quantities of data in the life sciences, Dave Dunson said in an article titled “Boston University to Hold Symposium on Statistics and Life Sciences,” published in the October 2019 issue of Amstat News, “Statistics has had a fundamental impact on this paradigm shift in the way life science is being conducted; there is no use in collecting such data unless we have reliable and reproducible methods for analysis and interpretation. The development of ‘big data’ statistics has freed up scientists to be creative in developing and exploring new sources of data.” With those developments come the challenges of characterizing their accuracy in translation and minimizing the costs associated with misinterpreting them.

What other broad areas are ripe for statistical analysis? David Williamson is chairing the Impact Initiative, which includes a “challenge” to our community to propose areas in which statistics is well positioned to make a big impact. For example, 30 years ago, one might have proposed genomics and proteomics; 10 years ago, perhaps it might have been the application of statistical methods and design to validating forensic evidence. Today, perhaps “disinformation” or processing text data generally might make the list. If you offer your proposal to Impact Initiative committee members, you may see a community of statisticians making an impact in this world by working on it in the years ahead. Keep those ideas coming—we look forward to seeing them!

EDITOR’S NOTE: Barry Graubard, David Hoaglin, and Jessica Utts contributed to this column.

