An Introduction to Information Extraction

Jun-ichi Tsujii
University of Tokyo, Tokyo, Japan

Sophia Ananiadou
University of Salford, Salford, UK

Tutorial description

The goal of this tutorial is to provide an overview of Information Extraction (IE) for biology applications. IE systems analyse unrestricted, unstructured collections of text and extract specific types of information, which they return in the form of filled templates.
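To make the notion of a template concrete, the following is a minimal sketch in Python. The slot names (agent, relation, target), the verb list and the regular expression are assumptions chosen purely for illustration; they are not taken from MUC or from any system covered in the tutorial, and a real IE system would rely on far richer linguistic analysis than a single surface pattern.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class InteractionTemplate:
    """Hypothetical template with three slots (names are illustrative)."""
    agent: str
    relation: str
    target: str


# Illustrative surface pattern: "<X> activates|inhibits|binds <Y>".
# This stands in for the much deeper analysis a real system performs.
PATTERN = re.compile(
    r"\b([A-Za-z0-9-]+) (activates|inhibits|binds) ([A-Za-z0-9-]+)\b"
)


def extract_templates(text: str) -> List[InteractionTemplate]:
    """Fill one template per pattern match found in the text."""
    return [InteractionTemplate(a, r, t) for a, r, t in PATTERN.findall(text)]


sentence = "Our results show that RAF1 activates MEK1 in these cells."
templates = extract_templates(sentence)
for template in templates:
    print(template)
```

Each match yields one filled template, so the free-text sentence becomes a structured record that can be stored or queried.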

IE has been studied intensively in the Message Understanding Conferences (MUC) and has mostly been applied to news texts. Applying IE to biology requires systems that can adapt to a new domain and that draw on complex domain knowledge.

In this tutorial we provide an overview of the IE technology developed to date. We describe the basic IE tasks (named entity recognition, coreference resolution, event recognition, template building and template filling) and how each task is affected by the specific domain.

We also examine a topic related to IE, automatic term extraction, and how this technique can contribute to domain-adaptable IE, namely by tuning lexicons. We examine the basic approaches to automatic term extraction, statistical and rule-based, and evaluate their results. We draw our examples from the field of molecular biology.
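As a rough illustration of how a statistical criterion and a rule-based filter can be combined in term extraction, the sketch below counts multi-word candidates and keeps those that recur. The stopword list, the frequency threshold and the toy corpus are all assumptions made for this example; the stopword filter is only a crude stand-in for the part-of-speech patterns a rule-based extractor would normally use, and raw frequency is the simplest of many possible statistical measures.

```python
import re
from collections import Counter
from typing import List, Tuple

# Illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "of", "a", "an", "in", "is", "and", "by", "to", "that"}


def candidate_terms(text: str, max_len: int = 3) -> Counter:
    """Count word n-grams (2..max_len words) that contain no stopword."""
    counts: Counter = Counter()
    for sentence in re.split(r"[.!?]", text.lower()):
        tokens = re.findall(r"[a-z0-9-]+", sentence)
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Rule-based filter: reject candidates containing a stopword.
                if any(tok in STOPWORDS for tok in gram):
                    continue
                counts[" ".join(gram)] += 1
    return counts


def top_terms(text: str, min_freq: int = 2) -> List[Tuple[str, int]]:
    """Statistical filter: keep candidates seen at least min_freq times."""
    ranked = candidate_terms(text).most_common()
    return [(term, freq) for term, freq in ranked if freq >= min_freq]


corpus = (
    "The nuclear receptor binds DNA. "
    "Activation of the nuclear receptor is ligand dependent. "
    "A nuclear receptor regulates transcription."
)
print(top_terms(corpus))
```

On this toy corpus only "nuclear receptor" passes the frequency threshold; terms extracted this way are the kind of entries that could be added to a domain lexicon.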

Target Audience

This tutorial aims to inform participants about the state of the art in IE: how this NLP task can be used for biology applications, what benefits come from exploiting textual information, and what limitations follow from the current state of NLP tools and techniques. There are no prerequisites for this tutorial.


Jun-ichi Tsujii is Professor of Natural Language Processing at the Department of Information Science of the University of Tokyo, Japan.

Positions Held:

Professor, Graduate School of Sciences, University of Tokyo, Japan
Research Professor, Department of Language Engineering, UMIST, England
Member of the International Committee on Computational Linguistics (ICCL)
President of Asian Association of Machine Translation
President of Association of Natural Language Processing

Research Activities:

Prof. Tsujii has been involved in a project supported by JSPS (Japan Society for the Promotion of Science) that aims to develop NLP tools for IE. His group is particularly interested in applying these tools to IE from biochemical texts (the GENIA project). He is also interested in using linguistically well-founded grammar formalisms for IE. He has given invited talks at many international conferences, such as ACL (Association for Computational Linguistics) and COLING.

Sophia Ananiadou is Senior Lecturer in the Department of Computer Science at the University of Salford, UK.

Research Activities: Dr. S. Ananiadou has been engaged in research on computational terminology and is currently involved in a EUREKA project that aims to develop knowledge acquisition tools for biochemical texts. She has organized many international workshops, such as the Cross-lingual IR workshop (1999) in Singapore and Computational Terminology for Medical and Biological Applications (2000) in Patras. She was vice president of the European Association of European Terminology.
