Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature

Arjun Magge1,2, Davy Weissenbacher3, Abeed Sarker3, Matthew Scotch1,2,*, Graciela Gonzalez-Hernandez3


1College of Health Solutions, Arizona State University
2Biodesign Center for Environmental Health Engineering, Arizona State University
3Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania
*Corresponding author
Email: Matthew.Scotch@asu.edu

Pacific Symposium on Biocomputing 24:100-111(2019)

© 2019 World Scientific
Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License.


Abstract

Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The geographic information in nucleotide sequence repositories such as GenBank is often insufficient, which motivates the use of natural language processing methods to extract geographic location names (toponyms) from the scientific article associated with a sequence and to disambiguate those locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations, and of their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, a disambiguation accuracy of 91%, and an overall resolution F1 score of 0.88. These results are significantly higher than those of previously developed methods, improving our capability to find the location of infected hosts and to enrich metadata information.
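
The abstract does not spell out how the population heuristic works. A common formulation, assumed here for illustration, is to look up every gazetteer entry that matches an extracted toponym and keep the one with the largest population. The following is a minimal sketch under that assumption, not the authors' implementation; the `Candidate` class, the `disambiguate` function, and the toy gazetteer with its rough placeholder values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    """One gazetteer entry that a toponym could refer to (hypothetical schema)."""
    name: str
    country: str
    latitude: float
    longitude: float
    population: int


def disambiguate(toponym: str,
                 gazetteer: Dict[str, List[Candidate]]) -> Optional[Candidate]:
    """Assumed population heuristic: return the most populous matching entry."""
    candidates = gazetteer.get(toponym.lower(), [])
    if not candidates:
        return None  # toponym is not in the gazetteer
    return max(candidates, key=lambda c: c.population)


# Toy gazetteer with rough, illustrative values only (not real gazetteer data).
TOY_GAZETTEER: Dict[str, List[Candidate]] = {
    "paris": [
        Candidate("Paris", "US", 33.66, -95.56, 25_000),
        Candidate("Paris", "FR", 48.86, 2.35, 2_100_000),
    ],
}

resolved = disambiguate("Paris", TOY_GAZETTEER)
if resolved is not None:
    print(resolved.country, resolved.latitude, resolved.longitude)
```

A heuristic of this kind needs no training data, but it always prefers the larger place, so a mention of a small town that shares its name with a major city would be resolved incorrectly.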

