Large-Scale Testing of Bibliome Informatics Using Pfam Protein Families

Maguitman AG, Rechtsteiner A, Verspoor K, Strauss CE, Rocha LM

School of Informatics, Indiana University, 1900 East Tenth Street, Bloomington, IN 47408
E-mail: anmaguit@indiana.edu, rocha@indiana.edu


Pac Symp Biocomput. 2006:76-87.


Abstract

Literature mining is expected to help not only with automatically sifting through the huge biomedical literature and annotation databases, but also with linking biochemical entities to appropriate functional hypotheses. However, there has been very limited success in testing literature mining methods, owing to the lack of large, objectively validated test sets or “gold standards”. To improve this situation, we created a large-scale test of literature mining methods and resources. We report on a specific implementation of this test: how well can the Pfam protein family classification be replicated by independently mining different literature and annotation resources? We test and compare different keyterm sets as well as different algorithms for issuing protein family predictions. We find that protein families can indeed be automatically predicted from the literature. Using words from PubMed abstracts, over 75% of the 3663 proteins tested were correctly assigned to one of 618 Pfam families. For 90% of proteins, the correct Pfam family was among the top 5 ranked families. We found that protein family prediction is far better with keywords extracted from PubMed abstracts than with GO annotations or MeSH keyterms, suggesting that the text itself (in combination with the vector space model) is superior to GO and MeSH as a literature mining resource, at least for detecting protein family membership. Finally, we show that Shannon’s entropy can be exploited to improve prediction by facilitating the integration of the different literature sources tested.
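
The abstract does not spell out the term weighting, the similarity measure, or how Shannon's entropy enters the integration step, so the following is only a minimal sketch under stated assumptions: raw term counts as vectors, cosine similarity for ranking families, and a rule that trusts the source whose score distribution has the lowest entropy. All function names, family labels, and toy terms are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_families(protein_terms, family_profiles):
    """Rank families by similarity of their term profile to the protein's terms."""
    query = Counter(protein_terms)
    scores = {fam: cosine(query, prof) for fam, prof in family_profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def shannon_entropy(scores):
    """Entropy (bits) of the scores after normalising them to sum to 1."""
    total = sum(scores)
    if total <= 0:
        return float("inf")  # assumption: an all-zero source is uninformative
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log2(p) for p in probs)

def pick_source(rankings_by_source):
    """Assumed integration rule: use the source with the most peaked ranking."""
    best = min(rankings_by_source,
               key=lambda src: shannon_entropy([s for _, s in rankings_by_source[src]]))
    return best, rankings_by_source[best][0][0]

# Hypothetical toy profiles; real profiles would be built from training proteins.
family_profiles = {
    "FAM_A": Counter({"kinase": 3, "phosphorylation": 2, "atp": 1}),
    "FAM_B": Counter({"transport": 3, "membrane": 2, "ion": 1}),
}
abstract_terms = ["kinase", "atp", "kinase", "binding"]  # words from abstracts
mesh_terms = ["membrane", "kinase"]                      # coarser keyterms

rankings = {
    "abstracts": rank_families(abstract_terms, family_profiles),
    "mesh": rank_families(mesh_terms, family_profiles),
}
source, family = pick_source(rankings)
print(source, family)
```

In this toy case the abstract-derived terms match only one family, so that ranking has zero entropy and is preferred over the more ambiguous keyterm ranking. Other ways of using entropy, such as weighting sources rather than selecting one, would fit the abstract's description equally well.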

