TAIG Contest Winners Tell of Experience

1 June 2021
Qiuyi Wu, Enshuo Hsu, and the Text Analysis Interest Group

    During JSM 2020, ASA Text Analysis Interest Group (TAIG) award committee members systematically evaluated a large body of research in the growing field of text analysis (e.g., text mining, natural language processing, computational linguistics, web scraping, sentiment analysis, topic modeling, GAN text generation, automated translation). Subsequently, awards were presented to Qiuyi Wu and Enshuo Hsu.

    Wu and Hsu were invited to elaborate and follow up on their prize-winning research this year at the Data Science DC (DSDC) Meetup, which has more than 13,000 registered members. In their own words, they tell about their research here.

    Qiuyi (“Queenie”) Wu

    University of Rochester
    “Naive Dictionary on Musical Corpora: From Knowledge Representation to Pattern Recognition”

    This presentation was based on [my] master’s degree thesis back in 2018, supervised by my adviser, Dr. Fokoué, who is a passionate, enthusiastic, and brilliant scholar. Inspired by clearly identified strong analogies between the building blocks of music and literature, I sought to utilize statistical machine learning concepts, methods, and tools for the analysis of these two human experiences. The statistical analysis of literary documents is already well developed through text mining methods such as topic modeling.

    I transformed the musical notes into matrices for statistical analysis and data mining. Specifically, each song was regarded and treated as a text document consisting of a bag of “musical words.” One way to represent these musical words is to segment the song into several parts based on the duration of each measure. Each word in a song is then the series of notes in one measure. I employed the resulting matrices in topic modeling to detect the potential connections between musicians and latent topics.
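
    The representation described above can be illustrated with a minimal sketch. It is not the code from the thesis: the toy songs, the note names, and the helper names `measure_to_word` and `build_document_term_matrix` are hypothetical, and joining the notes of a measure with hyphens is only one assumed way to form a "musical word."

```python
from collections import Counter

# Hypothetical toy songs (made up for illustration): each song is a list
# of measures, and each measure is a list of note names.
songs = {
    "song_a": [["C4", "E4", "G4"], ["F4", "A4", "C5"], ["C4", "E4", "G4"]],
    "song_b": [["F4", "A4", "C5"], ["G4", "B4", "D5"]],
}


def measure_to_word(measure):
    """Join the notes of one measure into a single 'musical word'."""
    return "-".join(measure)


def build_document_term_matrix(songs):
    """Return (song_names, vocabulary, matrix) of musical-word counts."""
    # Count how often each musical word occurs in each song.
    counts = {
        name: Counter(measure_to_word(m) for m in measures)
        for name, measures in songs.items()
    }
    # The vocabulary is every distinct musical word, in sorted order.
    vocabulary = sorted({word for c in counts.values() for word in c})
    names = list(songs)
    # One row per song, one column per musical word.
    matrix = [[counts[name][word] for word in vocabulary] for name in names]
    return names, vocabulary, matrix


names, vocabulary, matrix = build_document_term_matrix(songs)
for name, row in zip(names, matrix):
    print(name, row)
```

    A song-by-word count matrix of this kind is the usual input to a topic model, with songs playing the role of documents.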

    I have presented this work at many conferences, most recently at JSM 2020 and at the follow-up virtual meetup in February 2021 organized by Data Science DC and ASA TAIG. Every time, the audience was intrigued and fascinated by the underlying thought-provoking idea of a homomorphism between music and literature. Along this journey of music, I am fortunate to have made many friends and met talented people in both the statistical and musical fields, who are generous enough to offer me their ideas and [comments] that can push this research forward.

    Enshuo (“David”) Hsu

    The University of Texas Medical Branch
    “Combination of Optical Character Recognition and Natural Language Processing to Identify Patients with Sleep Apnea in Electronic Health Record (EHR) Data”

    I initiated this research project, “Deep Learning–Based Natural Language Processing (NLP) Data Pipeline for EHR Scanned Document Information Extraction” (originally “Combination of Optical Character Recognition and Natural Language Processing to Identify Patients with Sleep Apnea in EHR Data”), at The University of Texas Medical Branch in 2019. The motivation was to design an artificial intelligence (AI)–powered data pipeline for extracting laboratory result values from scanned sleep study reports.

    Using open-source tools and internet resources, I put together an image preprocessing module, an OCR engine (for processing text in images), and a deep learning–based text classifier to build a functional system. I was fortunate to be able to present the preliminary work [at] JSM 2020. Afterward, I continued to develop the data pipeline by examining different NLP models, including the state-of-the-art BERT model. I improved the image preprocessing and model evaluation processes [and] also collaborated with faculty at The University of Texas Health Science Center at Houston School of Biomedical Informatics for methodological insights. It was a pleasure to participate in the April 2021 virtual meetup and share my updates.
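
    As a rough illustration of how three such stages can be chained, here is a minimal sketch. It is not Hsu's implementation: every name in it is hypothetical, the OCR step is a placeholder that returns a fixed made-up string instead of calling a real engine, and a simple keyword rule stands in for the deep learning classifier.

```python
from typing import Callable, List

# A grayscale image as rows of pixel intensities (0-255). Assumed format.
Image = List[List[int]]


def binarize(image: Image, threshold: int = 128) -> Image:
    """Preprocessing stand-in: map each pixel to black (0) or white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]


def placeholder_ocr(image: Image) -> str:
    """Placeholder for a real OCR engine; returns a fixed, made-up string."""
    return "Sleep study report: apnea events noted"


def keyword_classifier(text: str) -> str:
    """Keyword rule standing in for a trained text classifier."""
    return "relevant" if "apnea" in text.lower() else "not_relevant"


def run_pipeline(
    image: Image,
    preprocess: Callable[[Image], Image],
    ocr: Callable[[Image], str],
    classify: Callable[[str], str],
) -> str:
    """Chain the stages: image -> cleaned image -> text -> label."""
    return classify(ocr(preprocess(image)))


toy_image = [[10, 200], [130, 90]]  # made-up 2x2 pixel grid
label = run_pipeline(toy_image, binarize, placeholder_ocr, keyword_classifier)
print(label)
```

    Passing each stage in as a function keeps the stages swappable, so a different OCR engine or language model can be tried without changing the rest of the pipeline.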

    I presented the latest results in publication-ready tables and figures. It was a nice experience having face-to-face discussions within a community that is interested in and has some degree of familiarity with NLP. I collected helpful feedback, including suggestions for alternative optical character recognition (OCR) engines, comments on data pipeline management, and advice on language model training. I look forward to future events from TAIG!

    TAIG members are looking forward to the upcoming JSM 2021 contest, as well as new members, volunteers, and initiatives. Contact the TAIG Executive Committee with questions.

    Editor’s Note: The views expressed are the authors’ and do not necessarily represent those of their organizations.
