Dealing with text
Nowadays, instead of querying a given database with structured constructs, you often first need to find out where the data in question is stored and then deduce the relevant facts from it.
When searching for documents, you can hardly foresee where and in what form the information will turn up, so you need some flexibility in posing your questions. Mostly you are interested in texts concerning a specific topic, so you can classify all documents in scope according to criteria such as word occurrence. This is the task of information retrieval, text categorization, and data extraction applications.
From a given set of documents, information retrieval strives to find
a subset that corresponds best to the query. In one way or another,
such a program searches all documents for occurrences of the
user-provided keywords and returns the relevant ones.
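To make this concrete, here is a minimal sketch of such a free-text search, assuming a toy in-memory document set and simple whitespace tokenization (both are illustrative assumptions, not a description of any particular system): it scans every document for the user's keywords and returns the matching documents, best first.

    def free_text_search(documents, keywords):
        """Return document ids ranked by how many keywords they contain."""
        hits = []
        for doc_id, text in documents.items():
            words = set(text.lower().split())
            score = sum(1 for kw in keywords if kw.lower() in words)
            if score > 0:
                hits.append((score, doc_id))
        # Best-matching documents first.
        return [doc_id for score, doc_id in sorted(hits, reverse=True)]

    docs = {
        "a": "information retrieval finds relevant documents",
        "b": "text categorization sorts texts into fixed categories",
    }
    print(free_text_search(docs, ["retrieval", "documents"]))  # ['a']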
In the beginning of IR history, documents that existed on paper were stored in machine-readable form and manually indexed. Keeping the index in the computer's memory allows for much larger and more precise indexes, but the indexing itself remains time-consuming and therefore expensive. So it would be preferable to skip indexing altogether and operate not on pre-made indexes but on the full texts themselves. This is the idea behind free-text searching, which soon became popular.
Nevertheless, critics argued that free-text search might turn out not to be as successful as indexed searching. Indexing is not merely taking words from the document and putting them into a structured database; it also involves judging the quality of the selected keywords. Not all words play equally important roles in a text: conjunctions such as "and" and "or" or articles like "a" are so ubiquitous that using them as search keywords would be pointless, since nearly all texts contain them.
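The sketch below illustrates this point about keyword quality: it builds a simple inverted index while filtering out such ubiquitous stop words. The tiny stop-word list and whitespace tokenization are deliberate simplifications for the example.

    from collections import defaultdict

    # A deliberately tiny stop-word list; real systems use far larger ones.
    STOP_WORDS = {"and", "or", "a", "an", "the"}

    def build_index(documents):
        """Map each non-stop word to the set of documents containing it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in text.lower().split():
                if word not in STOP_WORDS:
                    index[word].add(doc_id)
        return index

    docs = {"a": "the cat and the dog", "b": "a dog or a fox"}
    index = build_index(docs)
    print(sorted(index["dog"]))   # ['a', 'b']
    print(index.get("the"))       # None: stop words are never indexed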
Eventually it turned out that automatic free-text search performs no worse than manual indexing, and consequently this information retrieval technique became the subject of intensified research.
Such models plainly leave out any syntactic or semantic aspects; they work "almost entirely at word level". The artificial intelligence community tried to prove that analyzing the text with more sophisticated natural language processing methods can give better results.
In the news service market, several providers perform automatic text categorization ("sorting texts into fixed categories"). This task traditionally involved a large amount of human labor; NLP systems can reliably take it over, minimizing both costs and inconsistencies while maximizing speed.
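As a word-level illustration of such categorization, the sketch below assigns a text to whichever category shares the most keywords with it; the categories and their keyword lists are invented for this example.

    # Hypothetical keyword lists per category, invented for illustration.
    CATEGORIES = {
        "sports": {"match", "team", "goal", "season"},
        "finance": {"stock", "market", "shares", "profit"},
    }

    def categorize(text):
        """Assign the text to the category with the largest keyword overlap."""
        words = set(text.lower().split())
        return max(CATEGORIES, key=lambda cat: len(CATEGORIES[cat] & words))

    print(categorize("the team scored a late goal"))        # sports
    print(categorize("shares rallied as the market rose"))  # finance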
A similar task is data extraction. Here, you are given a text; from it you try to extract whatever information it contains and store that in a database for later querying.
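A minimal sketch of such an extractor, assuming the information of interest is dates and e-mail addresses (the patterns and record layout are assumptions made for the example): it pulls matches out of the text with regular expressions and stores them as a record that can be queried later.

    import re

    # Simple illustrative patterns; real extractors are far more elaborate.
    DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
    EMAIL_RE = re.compile(r"\b[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,}\b")

    def extract(text):
        """Pull structured fields out of free text for later querying."""
        return {
            "dates": DATE_RE.findall(text),
            "emails": EMAIL_RE.findall(text),
        }

    record = extract("Meet on 2024-05-01; write to alice@example.org.")
    print(record["emails"])  # ['alice@example.org']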