Mining Gene-Disease Relationships from Biomedical Literature: Weighting Protein-Protein Interactions and Connectivity
Gonzalez G, Uribe JC, Tari L, Brophy C, Baral C
Department of Biomedical Informatics, Ira A. Fulton School of Engineering,
Computer Science and Engineering Department, Ira A. Fulton School of Engineering,
Center for Metabolic Biology, Department of Kinesiology,
Arizona State University
Tempe, Arizona 85281, USA
Pac Symp Biocomput. 2007;:28-39.
Abstract
Motivation: The promise of disease-related discoveries and advances in the post-genome
era has yet to be fully realized, with many opportunities for discovery hidden in
the millions of biomedical papers published since. Public databases give access to data
extracted from the literature by teams of experts, but their coverage is often limited and
lags behind recent discoveries. We present a computational method that combines data
extracted from the literature with data from curated sources in order to uncover possible
gene-disease relationships that are not directly stated or were missed by the initial mining.
Method: An initial set of genes and proteins is obtained from gene-disease relationships
extracted from PubMed abstracts using natural language processing. Interactions
involving the corresponding proteins are similarly extracted and integrated with
interactions from curated databases (such as BIND and DIP), assigning a confidence
measure to each interaction depending on its source. The augmented list of genes and
gene products is then ranked by combining two scores: one that reflects the strength of the
relationship with the initial set of genes and incorporates user-defined weights, and
another that reflects the importance of the gene in maintaining the connectivity of the
network. We applied the method to atherosclerosis to assess its effectiveness.
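The ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the confidence values per source, the function names, the use of a confidence-weighted sum over links to seed genes as the relationship score, and the use of "extra connected components created by removing a protein" as the connectivity score are all assumptions made for illustration. The scores are also left unnormalized for simplicity.

```python
from collections import defaultdict

# Assumed confidence per interaction source; the values are illustrative only.
SOURCE_CONFIDENCE = {"curated": 1.0, "text_mined": 0.5}


def build_network(interactions):
    """Build {protein: {neighbor: confidence}} from (a, b, source) triples."""
    graph = defaultdict(dict)
    for a, b, source in interactions:
        conf = SOURCE_CONFIDENCE[source]
        # Keep the highest confidence when a pair is reported by several sources.
        if conf > graph[a].get(b, 0.0):
            graph[a][b] = conf
            graph[b][a] = conf
    return dict(graph)


def count_components(graph, removed=None):
    """Count connected components, optionally pretending one node is deleted."""
    seen = {removed} if removed is not None else set()
    components = 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            current = stack.pop()
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


def rank_candidates(graph, seeds, w_relation=0.5, w_connectivity=0.5):
    """Rank proteins by a weighted sum of two hypothetical scores."""
    baseline = count_components(graph)
    scored = []
    for node in graph:
        # Score 1 (assumed form): summed confidence of links to the seed genes.
        relation = sum(c for nbr, c in graph[node].items() if nbr in seeds)
        # Score 2 (assumed form): extra components created by removing the node.
        connectivity = max(0, count_components(graph, removed=node) - baseline)
        scored.append((node, w_relation * relation + w_connectivity * connectivity))
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy data with placeholder protein names, not taken from the paper.
toy_interactions = [
    ("A", "B", "curated"),
    ("B", "C", "text_mined"),
    ("B", "D", "curated"),
    ("D", "E", "text_mined"),
]
toy_ranking = rank_candidates(build_network(toy_interactions), seeds={"A", "C"})
```

In this toy network, protein B ranks first because it links to both seed genes and its removal splits the network into three parts. The two weights play the role of the user-defined weights mentioned in the abstract, although the paper may apply them differently.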
Results: Top-ranked proteins from the method are related to atherosclerosis with
accuracy between 0.85 and 1.00 for the top 20 and between 0.64 and 0.80 for the top 90 if duplicates
are ignored, with 45% of the top 20 and 75% of the top 90 derived by the method, not
extracted from text. Thus, though the initial gene set and interactions were automatically
extracted from text (and subject to the imprecision of automatic extraction), their use
for further hypothesis generation is valuable given adequate computational analysis.
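The two kinds of figures reported in the Results can be computed as in the sketch below. It assumes that "accuracy" for the top k means the fraction of the top-k proteins judged related to the disease (precision at k), and that "derived by the method" means a top-k protein was absent from the initial text-mined set. The function names and the toy data are hypothetical and do not reproduce the paper's numbers.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked proteins that are judged disease-related."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for protein in top if protein in relevant) / len(top)


def derived_fraction(ranked, initial_set, k):
    """Fraction of the top-k ranked proteins that were not in the initial set."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for protein in top if protein not in initial_set) / len(top)


# Toy data with placeholder protein names.
toy_ranked = ["B", "D", "A", "C", "E"]
toy_relevant = {"A", "B", "C"}   # proteins a reviewer judged disease-related
toy_initial = {"A", "C"}         # proteins extracted directly from text

toy_precision = precision_at_k(toy_ranked, toy_relevant, k=4)
toy_derived = derived_fraction(toy_ranked, toy_initial, k=4)
```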