Mining terminological knowledge in large biomedical corpora

Liu H, Friedman C

Department of Medical Informatics, Columbia University, New York, NY 10032, USA.

Pac Symp Biocomput. 2003:415-26.


Abstract

Terminological knowledge of the biomedical domain is important for natural language processing (NLP) and information retrieval (IR) applications, and a number of terminological knowledge sources, such as LocusLink, GenBank, and the UMLS, already exist. However, because of the tremendous amount of research activity in the field, new terms and symbols are continually being created, many of which are published in the literature but are not available in any of these resources. Therefore, effective mining of the literature for new terminology is critical for furthering NLP and IR applications. Abbreviations are widely used in the biomedical domain, and understanding them requires a terminological knowledge base that consists of abbreviations with their associated senses. In previous work, several methods have been developed for automatic construction of abbreviation knowledge bases from parenthetical expressions. However, these methods pair abbreviations and their expansions based on manually crafted patterns or rules. In this paper, we propose an automatic method that relies on collocations rather than patterns or rules to extract sets of related terms from parenthetical expressions, including abbreviations paired with their expansions and other types of related terms such as synonyms and hyponyms. Our method is based on the observation that terms associated with parenthetical expressions (i) are usually related, and (ii) are often collocations because they tend to co-occur more often than expected by chance. Our method was applied to the collection of MEDLINE abstracts. The method and the results were evaluated using two collections: Berman's handcrafted abbreviation list and the LocusLink collection.
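
The collocation idea can be illustrated with a minimal sketch. The abstract does not say which association measure is used, so the code below assumes pointwise mutual information (PMI) as a stand-in; the regular expression, the three-word limit on candidate phrases, and the function names `candidate_pairs` and `pmi_scores` are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Assumption: a candidate outer term is the last 1 to 3 words before "(".
PAREN = re.compile(r"((?:[\w-]+ ){0,2}[\w-]+) \(([^()]+)\)")

def candidate_pairs(sentence):
    """Return (outer phrase, parenthetical text) candidates, lowercased."""
    pairs = []
    for match in PAREN.finditer(sentence):
        words = match.group(1).lower().split()
        inner = match.group(2).strip().lower()
        # Pair the parenthetical text with each suffix of the preceding words.
        for k in range(1, len(words) + 1):
            pairs.append((" ".join(words[-k:]), inner))
    return pairs

def pmi_scores(sentences):
    """Score each candidate pair with PMI computed over all candidate pairs.

    PMI is an assumed association measure, not necessarily the one used
    in the paper.
    """
    pair_counts = Counter()
    outer_counts = Counter()
    inner_counts = Counter()
    for sentence in sentences:
        # Count each distinct pair once per sentence.
        for outer, inner in set(candidate_pairs(sentence)):
            pair_counts[(outer, inner)] += 1
            outer_counts[outer] += 1
            inner_counts[inner] += 1
    total = sum(pair_counts.values())
    return {
        (outer, inner): math.log2(
            count * total / (outer_counts[outer] * inner_counts[inner])
        )
        for (outer, inner), count in pair_counts.items()
    }

if __name__ == "__main__":
    # Toy sentences invented for this illustration, not MEDLINE data.
    sentences = [
        "We measured tumor necrosis factor (TNF) levels.",
        "Serum tumor necrosis factor (TNF) was elevated.",
        "The growth factor (GF) was added.",
    ]
    for pair, score in sorted(pmi_scores(sentences).items(), key=lambda x: -x[1]):
        print(pair, round(score, 3))
```

In this toy input, "factor" precedes two different parenthetical terms, so its association with "tnf" is no stronger than chance, whereas "tumor necrosis factor" occurs only with "tnf" and scores higher. PMI tends to overrate rare pairs, so a practical system would likely add a frequency threshold or use a different statistic.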

