Mining Patents Using Molecular Similarity Search

Rhodes J, Boyer S, Kreulen J, Chen Y, Ordonez P

IBM, Almaden Services Research, San Jose, CA 95120, USA
E-mail: jjrhodes, sboyer, kreulen, yingchen @ us.ibm.com; ordopa1 @ umbc.edu


Pac Symp Biocomput. 2007;:304-315.


Abstract

Text analytics is becoming an increasingly important tool in biomedical research. While advances continue to be made in the core algorithms for entity identification and relation extraction, there is a growing need for practical applications of these technologies. We developed a system that allows users to explore the US Patent corpus using molecular information. The core of our system contains three main technologies: a high-performing chemical annotator that identifies chemical terms and converts them to structures, a similarity search engine based on the emerging IUPAC International Chemical Identifier (InChI) standard, and a set of on-demand data mining tools. By leveraging this technology we were able to rapidly identify and index 3,623,248 unique chemical structures from 4,375,036 US Patents and Patent Applications. Using this system, a user may go to a web page, draw a molecule, search for related Intellectual Property (IP), and analyze the results. Our results prove that this is a far more effective way of identifying IP than traditional keyword-based approaches.
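
To make the idea of a molecular similarity search concrete, the following is a minimal sketch and not the authors' implementation. It assumes each indexed document has already been reduced to a set of structural features for one molecule, and it ranks documents against a query molecule with the Tanimoto coefficient, a widely used similarity measure in cheminformatics. The abstract does not state which measure the system uses. The feature strings, document identifiers, and function names are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto coefficient: shared features divided by total distinct features."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty feature sets
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(
    query: FrozenSet[str],
    index: Dict[str, FrozenSet[str]],
    threshold: float = 0.5,
) -> List[Tuple[float, str]]:
    """Return (score, doc_id) pairs at or above threshold, best match first."""
    hits = [(tanimoto(query, feats), doc_id) for doc_id, feats in index.items()]
    hits = [hit for hit in hits if hit[0] >= threshold]
    # Sort by descending score, breaking ties by document id.
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))


# Hypothetical toy index: document id -> structural features of one molecule.
INDEX: Dict[str, FrozenSet[str]] = {
    "doc-1": frozenset({"ring:benzene", "group:carboxyl", "group:ester"}),
    "doc-2": frozenset({"ring:benzene", "group:hydroxyl"}),
    "doc-3": frozenset({"ring:pyridine", "group:amide"}),
}

# Hypothetical query molecule, as drawn by a user and reduced to features.
QUERY: FrozenSet[str] = frozenset(
    {"ring:benzene", "group:carboxyl", "group:hydroxyl"}
)

if __name__ == "__main__":
    for score, doc_id in rank_by_similarity(QUERY, INDEX):
        print(f"{doc_id}\t{score:.3f}")
```

Unlike a keyword match, this kind of ranking can surface a document whose molecule is close to the query but not identical to it, which is the behavior the abstract contrasts with keyword-based approaches.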

