Yin Aphinyanaphongs1, Armine Lulejian1, Duncan Penfold Brown2, Richard Bonneau3, Paul Krebs1
1NYU Langone Medical Center
2New York University Social Media and Political Participation lab
3Simons Center for Data Analysis
Email: firstname.lastname@example.org, Armine.Lulejian@nyumc.org, email@example.com, firstname.lastname@example.org, email@example.com
Pacific Symposium on Biocomputing 21:480-491(2016)
© 2016 World Scientific
Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License.
Rapid increases in e-cigarette use and potential exposure to harmful byproducts have shifted public health focus to e-cigarettes as a possible drug of abuse. Effective surveillance of use and prevalence would allow appropriate regulatory responses. An ideal surveillance system would collect usage data in real time, focus on populations of interest, include populations unable to take surveys, support a broad range of questions, and enable geo-location analysis. Social media streams may provide this ideal system. To realize this use case, a foundational question is whether we can detect e-cigarette use at all. This work reports two pilot tasks that use text classification to automatically identify Tweets that indicate e-cigarette use and/or e-cigarette use for smoking cessation. We define and build a dataset for each task and compare the performance of four state-of-the-art classifiers and a keyword search on each. Our results demonstrate excellent classifier performance, with area under the curve of up to 0.90 and 0.94 for the two tasks. These promising initial results form the foundation for further studies to realize the ideal surveillance solution.
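As an illustration of the evaluation described above, the following minimal sketch scores Tweets with a keyword search and computes the area under the ROC curve against binary labels. It is not the authors' implementation: the keyword list, the example Tweets, and the function names `keyword_score` and `auc` are all hypothetical, chosen only to show how a keyword baseline can be compared on the same metric as a trained classifier.

```python
from itertools import product

# Hypothetical keyword list; the search terms used in the paper are not given here.
KEYWORDS = ("e-cig", "ecig", "vape", "vaping")


def keyword_score(text):
    """Return 1.0 if any keyword occurs in the text (case-insensitive), else 0.0."""
    lowered = text.lower()
    return 1.0 if any(k in lowered for k in KEYWORDS) else 0.0


def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Invented toy examples (label 1 = indicates e-cigarette use), not data from the study.
TOY = [
    ("Switched to vaping last month and love it", 1),
    ("My ecig battery died again", 1),
    ("Been puffing on my new mod all day", 1),               # use, but no keyword
    ("New study on vaping regulation published today", 0),  # keyword, but no use
    ("Great coffee this morning", 0),
    ("Traffic is terrible", 0),
]

scores = [keyword_score(text) for text, _ in TOY]
labels = [label for _, label in TOY]
print(f"keyword baseline AUC: {auc(scores, labels):.3f}")
```

The two commented Tweets show why a keyword search alone can fall short: one mentions use without any keyword, and one contains a keyword without indicating use. A trained classifier would supply real-valued scores to the same `auc` function, which allows a direct comparison on one metric.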