Information extraction in knowledge management
The ability to quickly scan reports looking for key bits of information is one of those skills you pick up somehow through necessity. Unfortunately, scanning reports is not systematic - many important bits of information can be missed - and it can become tedious when undertaken on large numbers of reports.
Information extraction aims to help in this kind of activity by providing technology for "reading" reports and picking out the bits of information that are needed. See Box 1 for "What is information extraction?"
Within focused domains, information extraction can perform with high accuracy.
This is possible if the report being handled conforms to some identifiable pattern. For example, looking at a number of articles on mergers and acquisitions, it is possible to see a pattern emerging in the kinds of bits of information that would normally be extracted by a user. This can be used to define a "template" - a table with slots that can be instantiated with the bits of information that can be extracted from a given article. The aim of an information extraction system is then to fill in such templates.
So for acquisitions, we could have a template with slots for buyer, seller, acquisition, value of bid, state of bid, business area of acquisition, and so on. The template therefore lists the things we are interested in, though a given article does not necessarily instantiate every slot in the template.
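To make the idea concrete, a template of this kind can be thought of as little more than a record with optional slots. The sketch below is only illustrative - the slot names follow the acquisitions example above, and the Python representation is an assumption rather than a description of any particular system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AcquisitionTemplate:
    """One instantiated template per relevant article; unfilled slots stay None."""
    buyer: Optional[str] = None
    seller: Optional[str] = None
    acquisition: Optional[str] = None
    bid_value: Optional[str] = None
    bid_state: Optional[str] = None       # e.g. "rumoured", "agreed", "completed"
    business_area: Optional[str] = None
```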
Much business news can be reduced to instantiated templates. Consider a copy of the Financial Times where many of the articles are on mergers or acquisitions. Further templates could be defined for announcements of annual results, investment plans, appointments of key personnel, and announcements of new sales. In each case an article, particularly a shorter one, can be summarised by an instantiated template.
With the intention of delivering this functionality, there has been a substantial research effort over the past ten years in North America and Europe, building and publicly testing prototype systems aimed at focused domains. A key driver in this research effort has been the US Department of Defense, which has set goals and performance criteria for information extraction systems and funded the development of large test corpora in focused domains for comparing systems.
Some of the early trials of information extraction systems were on corpora of news reports with domains including mergers and acquisitions, and terrorism in South America (remember the defence influence!). Each corpus had about 1300 articles. To compare the systems, the notions of precision and recall were adapted for templates. More recent trials have worked with more complex reports, and the templates call for much more difficult kinds of information to be sought in the articles.
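Adapted to templates, precision and recall are computed over slot fillers rather than over retrieved documents. The following is a minimal sketch of such scoring, assuming a hand-prepared gold-standard template for each article; the actual trials used more elaborate scoring schemes.

```python
def slot_scores(system: dict, gold: dict):
    """Precision and recall over filled template slots (simplified sketch).

    system, gold map slot names to extracted values, or None if unfilled.
    """
    filled = {k: v for k, v in system.items() if v is not None}
    expected = {k: v for k, v in gold.items() if v is not None}
    correct = sum(1 for k, v in filled.items() if expected.get(k) == v)
    precision = correct / len(filled) if filled else 0.0    # how much of what we extracted is right
    recall = correct / len(expected) if expected else 0.0   # how much of what is there did we find
    return precision, recall
```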
Obviously, information extraction systems do not produce perfect results, though they can be quite impressive in relatively focused domains. Also, missing and erroneous results tend not to arise in all slots for an article, and so failure can be described as partial rather than absolute.
Diverse application areas now being explored for information extraction systems include supporting financial trading, where it can be critically important to quickly handle enormous numbers of market news reports from news feeds, and strategic management, where market and competitor intelligence has to be undertaken using much text-based information from heterogeneous sources.
Other potential application areas include support for scientific research and development projects by monitoring technical literature, and health care delivery, where information extraction has been investigated for auditing and processing insurance claims by extracting diagnoses, symptoms, test results, and therapies from patient records.
So how does information extraction work? Well, it is not based on a single underlying technological breakthrough. Rather, a number of underlying technologies in artificial intelligence and computational linguistics have been brought to the stage where they can be integrated into viable information extraction systems. See Box 2 for "How does information extraction work?"
There is also no unanimity on what combination of underlying technologies should be used in a given information extraction system. However, we can see some common features occurring in these systems - in particular the use of shallow parsing in many of them. Shallow parsing does not attempt to fully parse and "understand" the text, but rather looks for information that is useful and can be easily obtained.
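As a rough illustration of what "shallow" means here, the sketch below merely picks out runs of capitalised words as candidate names, rather than building a full parse tree; a real system would use a proper part-of-speech tagger and chunker, so treat this as an assumption-laden toy.

```python
import re

# Crude noun-group chunker: runs of capitalised words are treated as
# candidate names, with no attempt at a full syntactic analysis.
NOUN_GROUP = re.compile(r"(?:[A-Z][\w&-]*\s?)+")

def shallow_chunks(sentence: str):
    return [m.group().strip() for m in NOUN_GROUP.finditer(sentence)]

print(shallow_chunks("Terra Networks agreed a deal to acquire Lycos last night."))
# ['Terra Networks', 'Lycos'] - rough, but enough to feed the later stages
```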
With companies now marketing information extraction systems, there is increased confidence in the technology's potential in various application domains. See Box 3 for "Orchestrating information extraction products". The emphasis in these systems is on the use of rules for identifying the information to be extracted.
Statistical techniques are also increasingly common in information extraction systems. These include tagging words in a sentence according to their part of speech. This helps in resolving many ambiguities that arise in parsing, since many words have multiple roles. For example, in the text "To bank the cheque, go to the bank", the word bank takes on two roles, and these two different roles can be identified from the statistics of the surrounding words. So from sufficient data, we may identify that whenever an ambiguous word is preceded by to and followed by the, then that word is a verb.
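A minimal sketch of that kind of contextual rule, assuming nothing more than a single learned pattern, might look like this; statistical taggers generalise the same idea by weighting many such contexts estimated from training data.

```python
def guess_tag(prev: str, word: str, nxt: str) -> str:
    """Toy contextual rule: 'to <word> the' suggests the verb reading."""
    if prev.lower() == "to" and nxt.lower() == "the":
        return "VERB"
    return "NOUN"  # fall back to the more common reading

tokens = "To bank the cheque , go to the bank .".split()
for i, word in enumerate(tokens):
    if word.lower() == "bank":
        prev = tokens[i - 1] if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        print(word, "->", guess_tag(prev, word, nxt))
# bank -> VERB   (in "To bank the cheque")
# bank -> NOUN   (in "go to the bank")
```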
Statistical techniques are now permeating a number of further areas of natural language processing. Key advantages are the increased robustness of the resulting systems and the vast amounts of available data that can be harnessed in building them. Analysing large corpora is recognised as critically important, given that a word you might believe to be well known may occur only once in, say, ten thousand news articles.
In terms of commercial exploitation, information extraction is only just starting. See Box 4 for "What is the value for knowledge management?" Given the potential of the technology and more importantly the range of business tasks that can be substantially improved by this technology, it seems set for very substantial growth over the next ten years.
Anthony Hunter is a lecturer in computer science at University College London. He can be contacted at: a.hunter@cs.ucl.ac.uk
Box 1: What is information extraction?
Information extraction is a technology for "reading" reports and picking out the bits of information that are needed by users. Consider the following newspaper article:
A new powerhouse in the internet market was created last night when Spanish service provider Terra Networks agreed a $12.5bn (£8.4bn) deal to acquire Lycos, the American web company.
Here prose, not all of it particularly useful, is used to present some information about a takeover bid. In the article the key bits of information include the buyer (Terra Networks), the acquisition (Lycos), the current state of the bid (agreement between buyer and seller), and the value ($12.5bn). Further bits of information include the business area of the acquisition (a web company), the location of the acquisition (USA), the location of the buyer (Spain), and the date of the report (last night).
If you have a number of articles on mergers and acquisitions, then you can see a pattern emerging in the kinds of bits of information that would normally be extracted. This can be used to define a "template" - a table with slots that can be instantiated with the bits of information that can be extracted from a given article.
So for acquisitions, we could have a template with slots for buyer, seller, acquisition, value of bid, state of bid, business area of acquisition, and so on. The template therefore lists the things we are interested in, though a given article does not necessarily instantiate every slot in the template.
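Filling the acquisitions template from the Terra Networks article quoted above might then yield something like the following; the slot names are illustrative rather than taken from any particular system.

```python
terra_lycos = {
    "buyer": "Terra Networks",
    "seller": None,                      # not stated in the article
    "acquisition": "Lycos",
    "bid_value": "$12.5bn",
    "bid_state": "agreed",
    "business_area": "web company",
    "acquisition_location": "USA",
    "buyer_location": "Spain",
}
```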
Box 2: How does information extraction work?
Suppose we are handling news reports on mergers and acquisitions. One of the obvious starting points for processing a report is to go through it looking for proper nouns. By pattern matching against an appropriate lexicon, people's names, geographical names and, most importantly, company names can be identified. Similarly, dates and financial values can be easily picked out.
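A hedged sketch of that first pass, assuming nothing more than a small hand-built lexicon and a couple of regular expressions, might look as follows; production systems rely on much larger gazetteers and finite-state patterns.

```python
import re

COMPANY_LEXICON = {"Terra Networks", "Lycos", "Acme Inc", "Bloggs Ltd"}  # assumed, tiny lexicon
MONEY = re.compile(r"[$£€]\d+(?:\.\d+)?(?:bn|m)?")
DATE = re.compile(r"last night|yesterday|on (?:Mon|Tues|Wednes|Thurs|Fri)day", re.IGNORECASE)

def first_pass(text: str) -> dict:
    """Pick out company names, financial values and date expressions."""
    return {
        "companies": [c for c in COMPANY_LEXICON if c in text],
        "money": MONEY.findall(text),
        "dates": DATE.findall(text),
    }

print(first_pass("Terra Networks agreed a $12.5bn deal to acquire Lycos last night."))
```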
Once this information has been picked out, some structuring is called for to help determine the overall meaning of the text. In contrast to, say, information retrieval, the use of very common words such as "the", "of", and "from" can be very important in determining the meaning of phrases. Consider, for example, how the following phrases have the same keywords (after removing suffixes and stop words) but probably different interpretations: Harvey the dog; The dog from Harvey; and Harvey's dog.
Much can be achieved with relatively simple grammatical knowledge. Consider the sentences Acme Inc has bought Bloggs Ltd and Acme Inc has been bought by Bloggs Ltd. Both involve the same companies, and the same verb relates them, but in the first example we have the active form of the verb whereas in the second we have the passive. Identification of this kind of grammatical information is important in instantiating a template.
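Under the simplifying assumption that company names have already been recognised, a pair of rules along the following lines is enough to capture the active/passive distinction for this one verb; the patterns are purely illustrative.

```python
import re

ACTIVE = re.compile(r"(?P<buyer>[A-Z][\w ]+?) has bought (?P<target>[A-Z][\w ]+)")
PASSIVE = re.compile(r"(?P<target>[A-Z][\w ]+?) has been bought by (?P<buyer>[A-Z][\w ]+)")

def buyer_and_target(sentence: str):
    """Return (buyer, target) whichever voice the sentence uses."""
    match = ACTIVE.search(sentence) or PASSIVE.search(sentence)
    return (match.group("buyer"), match.group("target")) if match else (None, None)

print(buyer_and_target("Acme Inc has bought Bloggs Ltd"))          # ('Acme Inc', 'Bloggs Ltd')
print(buyer_and_target("Acme Inc has been bought by Bloggs Ltd"))  # ('Bloggs Ltd', 'Acme Inc')
```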
To address this, phrase-structure rules are used that break down phrases into smaller phrases, such as noun phrases and prepositional phrases, and ultimately into words. However, given that most words are open to multiple interpretations, it can be computationally expensive to decide the role of each word and phrase. Often recourse is made to rule-based techniques to decide.
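A tiny fragment of such a phrase-structure grammar, written out as Python data purely for illustration, might look like this:

```python
# S = sentence, NP = noun phrase, VP = verb phrase, PP = prepositional phrase.
# Each rule lists the ways a phrase can be broken into smaller phrases or words.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["ProperNoun"], ["Det", "Noun"], ["NP", "PP"]],
    "VP": [["Verb", "NP"], ["VP", "PP"]],
    "PP": [["Prep", "NP"]],
}
# "Acme Inc has bought Bloggs Ltd" would break down roughly as
#   S -> NP("Acme Inc") VP(Verb("has bought") NP("Bloggs Ltd"))
```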
Other kinds of ambiguity surround issues of co-ordination within a sentence and between sentences. For example, and can connect a wide variety of phrases, and deciding what it is connecting can be difficult. Similarly, resolving what each pronoun refers to can be difficult, particularly across sentences. Again, rule-based techniques can be used to resolve these.
However, syntactic and simple semantic rules are not always sufficient to resolve ambiguities, and deeper domain knowledge is required. For example, in trials on a Latin American terrorism corpus, some systems incorporated a database containing statistics on how many terrorist kidnaps had occurred in which countries. When the syntactic structure left it unclear in which of two countries an abduction had occurred, the system would consult the database and make a decision, based on the available statistics, on which was the more likely place of kidnap.
Use of semantic knowledge becomes even more important when taking the parsed text and trying to complete the template. Consider the verb closed in the following two sentences: On Friday, Acme Bank closed early for the holidays; and On Friday, Acme Bank closed a new flexible working deal with the key banking unions. The senses of the two occurrences of the verb are similar but should result in quite different instantiated templates. Part of the process of deciding which templates to complete depends on resolving the ambiguity surrounding words such as the main verb. Again, rule-based techniques can be used to reason with the context, to decide what information to extract and how.
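Very schematically, a rule of that kind might look at the object of the verb to pick a sense and hence a template; the categories and slot names below are invented for illustration only.

```python
from typing import Optional

# Words in the object that suggest the "concluded an agreement" sense of "closed".
DEAL_WORDS = {"deal", "agreement", "contract", "merger", "acquisition"}

def template_for_closed(subject: str, obj: Optional[str]):
    """Decide which template, if any, a sentence with the verb 'closed' fills."""
    if obj and any(w in obj.lower() for w in DEAL_WORDS):
        return {"template": "deal_closed", "party": subject, "deal": obj}
    return None  # "closed early for the holidays" fills no deal template

print(template_for_closed("Acme Bank", "a new flexible working deal"))
print(template_for_closed("Acme Bank", None))
```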
Box 3: Orchestrating information extraction products
Whilst much of the underlying technology for information extraction has emanated from university research, commercial R&D projects have produced substantial prototype systems. One of the commercial ventures that has been spun out of this work is Cymfony (www.cymfony.com), based in Buffalo, in upstate New York, which is developing tools for implementing information extraction systems for clients. Other US-based companies with similar technology in development include GTE Labs (www.gte.com), AT&T Labs (www.research.att.com), and MITRE (www.mitre.org).
Box 4: What is the value for knowledge management?
Information extraction is a new technology waiting to be exploited. The potential for fruitful applications in knowledge management seems enormous. Vast amounts of textual information in emails, text databases, and word-processed documents are potentially amenable to this approach.
Consider how numerous organizations are setting up topic-based email groups for exchanging information and experiences. For example, one of the major photocopier manufacturers set up one for use by its service engineers. In this company there are many different models containing complex and expensive components, and these models come to market with short lead-times, often without service engineers getting the opportunity to see them before they are installed on client sites. Furthermore, there is significant pressure on the service engineers to minimize downtime. This creates a situation where service engineers are constantly having to catch up with developments, and it is an application where the email exchanges have proved very useful. However, with information extraction techniques the repository of emails could become even more useful, allowing better querying and consolidation of the knowledge.
The key characteristic of an application likely to benefit from information extraction technology is that there is a substantial number of relatively short text documents or messages, written in a relatively narrow domain with a relatively limited vocabulary and a relatively constrained phraseology. A typical example is news reports on mergers and acquisitions obtained from news wires. Other applications range from handling medical records to handling political news reports.