NL Database Interfaces

Natural Language Interfaces to Databases

Using a foreign language is challenging for most of us. Clearly there are many benefits of knowing languages for when we travel, but finding the time to learn them seems elusive. The situation with relational databases is seemingly similar. We know that so much information is now stored in databases in an organization, that it is tempting to think that we could do our jobs better if only we had the computer skills to work more directly with them. Unfortunately, getting better database querying skills can involve a significant investment of time.

Most databases now are based on relational database technology (for an explantion see Box 1). The idea behind a relational database is very simple: It is just composed of tables of information, and so absorbing information from a small database is often easy. Each row is an example on which we have information, and the information is given in the entry for each column. In this way, a table in a relational database is no different to a table we might draw on a piece of A4 paper.

Whilst using simple relational databases is straightforward, problems arise when we have larger tables – ie. tables that do not fit on a manageable number of pages. Consider a telephone directory. We might consider this as a table. Each subscriber is a row in the table, and the columns are for name, address, area, and telephone number. In a sense such a big table is manageable because the rows are sorted according to the alphabetical order of the names. Having such an order allows us to quickly find the number of someone in the books. But consider trying to find a particular number in a telephone book, or trying to find all the people who live in particular road. And these are relatively simple queries you could ask.

The formal language for querying relational database is usually SQL (for an explanation see Box 2). There are simpler querying facilities for querying and manipulating data in many databases packages such as "query by example" and the ability to select a column and then order all the items in the table according to the ordering in that column. That would solve our problem of finding a person by their telephone number in the previous paragraph. But it is only part of the answer. SQL provides the most expressive way of querying and manipulating data in a database.

Obviously English (or indeed any natural langauge) is more expressive than SQL. But the problem with English is that it is too expressive. There are many ways of saying the same thing using different words, abbreviations, phraseology, and grammatical constructions. Furthermore, in using English there are many ways of saying things in indirect ways. For example, using metaphor, analogy, euphemisms, and humour.

Yet it seems very appealing to be able to query a relational database using natural language. If we accept that we need to use relatively restricted subset of English, then it is possible to develop useful natural language interfaces with some degree of flexibility and robustness (See Box 3).

A natural language interface for a database needs to incorporate both syntactic knowledge and semantic knowledge. The syntactic knowledge is relatively well-established for this kind of task and includes grammar of the language, conjugations, declensions, rules of agreement and rules of word order. It is normally knowledge that is common to all database interfaces, and so is relatively stable knowledge. It can be built into the software tools by the software developers, and normally doesn’t need amending.

In contrast, most semantic knowledge is dependent on the application. Simple examples are the ability to deal with synonyms in a context sensitive way. For example, in a database on names and addresses, we don’t want to treat the uses of bank in Barclays Bank and Kew Gardens Seed Bank as symonymous in questions like "Give me the names of banks and building societies that have branches in the Borough of Richmond".

Whilst handling semantic knowledge is a complex and open-ended problem. There have been plenty of promising experimental natural langauge database interfaces developed that combine syntactic and semantic knowledge. Unfortunately these developments have not led to overwhelming commercial success so far.

As long ago as 1985, Ovum ( an independent IT consultancy) published a report predicting that "By 1987 a natural language interface should be a standard option for users of database management systems and ‘Information Centre’ type software, and there will be a reasonable choice of alternatives". With hindsight this prediction appears incorrect because the advantages of graphical and form-based query techniques were under-estimated. However, as the limits of these other approaches are reached, it is likely there will be increased interest in natural language querying.

Currently, there are products on the market for building natural language interfaces for pc-based relational databases. Though it is still very much a niche market with suppliers being very small companies such as Elf (www.elfsoft.com) with its Access ELF product for Microsoft.

The Access ELF software incorporates syntactic knowledge together with a general-purpose lexicon. When configuring the software for a particular database, the lexical knowledge is updated. Some is by the system analysing the entries in the tables and analysing queries to the database, and some is by the users directly updating the lexical knowledge. Using this knowledge, the software is able to understand and process quite complex queries. For example, for a typical customer orders database it can handle the following types of query.

"Which customers have placed more orders than average?"

"Which customers have placed fewer than 10 orders?"

"Which customers have ordered tofu but not Longlife tofu?"

"Make a list of the clients that use Federal Shipping exclusively"

"Which customers ordered tofu in the week of 5 April?"

In general getting software to handle such queries can be very useful. But it is not difficult to find queries that can confound an interface. As a result natural langauge interfaces for databases are still regarded as a niche technology. But it is likely that this will change rapidly over the next few years with developments in artificial intelligence and computational linguistics providing more intelligent and more robust solutions.

In particular, there has been much research into developing semantic knowledge using techniques such as semantic networks and automated learnimg technology. Currently, much of this new technology is being directed at natural language search engines such as AskJeeves (www.askjeeves.co.uk) , and given the current economics of the web, it is not surprising that some technology start ups, such as EasyAsk (www.easyask.com), have shifted their focus from selling natural langauge interfaces to selling natural language search engines.

But as the web drives information technology forward, it likely natural language interfaces to databases will become better and more widely used.

Anthony Hunter is a lecturer in computer science at University College London and can be contacted on a.hunter@cs.ucl.ac.uk.

Box 1: What is a relational database?

Relational databases are the most commonly used kind of database. In a relational database, information is stored in tables. Each table has one or more columns, and one or more rows. We can consider each row as an example and each entry in a column in the row gives details about the example. For example, we can have a customer database where each row represents an individual customer, and each entry in the row gives details about that customer such as name, address, and telephone number. The basic idea is simple and yet so much business information easily falls into this format. This largely explains the ubiquity of relational databases in commercial information systems.

Box 2: What is SQL?

SQL stands for Structured Query Language. To illustrate SQL, let us consider a table of customers that has columns Name, House number, Road, Town, County, Postcode, Telephone Number, Fax number, and Email. We call this table Customers. Now consider the following SQL. This will retrieve all rows from the Customers table.

SELECT *

FROM Customers

Now consider the following SQL query. This will retrieve all rows where the entry for the County column is East Sussex.

SELECT *

FROM Customers

WHERE County = East Sussex

So far so good. But the language does quickly get much more complicated. This is not a criticism of SQL. Indeed the language is very clear if you are prepared to invest the time to learn it. However it is by necessity very precise so as to avoid the ambiguities that arise in natural language. Furthermore, it does not directly handle synonyms and common-sense abbreviations. For example, consider the following query which many would regard as equivalent to the query above.

SELECT *

FROM Customers

WHERE County = E Sussex

Here, common-sense would dictate that East Sussex and E Sussex are equivalent. But in SQL this equivalence needs to be made explicit. Making all the desirable semantic knoweldge explicit can be challenging.

Box 3: What kinds of natural language query can be handled?

Within the focussed domain of an SQL database, natural language interfaces can be constructed that can understand and respond correctly to a wide range of queries. To illustrate, let us consider the Customer table that we introduced in Box 2. We start with the simplest query:

SELECT *

FROM Customers

Even trying to handle a simple query such as this could be posed in many different ways such as:

What customers do we have?

List all customers?

Please give me all customers?

Who are our customers?

I want a list of all our customers?

Can I have list of all cutomers?

This list could easily grow to a list of a few hundred different ways of asking the same question! But by using appropriate knowledge of the recursive nature of grammar, it is possible to have a system that can "understand" the vast majority of possible versions of this query. In the same way, this can be scaled up to handle more complex queries such as:

List all customers who live in Brighton or Hove

I want a list of customers who live in London and have a fax number.

I need to know who lives in London but doesn’t have an email address.

Furthermore, synonyms and abbreviations can be handled so for example clients can be as a synonym for customers, and BT as an abbreviation for British Telecommunications Plc.