Chapter 1

Overview of Research Project

Spatial information can be stored in computers, but it is difficult to search for required information based purely on the visual information. It is necessary to describe this spatial information in order to facilitate searching. One aspect of this project is to become familiar with the problems involved in the storing and representation of this type of data.
In everyday usage, items are described using subjective wording. For example, a person's face might be described by saying that they have a big nose, a light complexion and a wide mouth. The terms "big", "light" and "wide" are not strictly defined, but there is community agreement on their meaning. The second aspect of this project is to use fuzzy logic and/or modal logic to create a system where pictures are described by community-agreed subjective terms. The hope is that this will allow better searching for spatial information given a query that allows the use of subjective wording.
Queries that use such subjective wording are not easily stated in SQL. The third aspect of the project is to automate the translation of natural language queries using this subjective language into SQL statements. This is the portion of the project addressed by this report.

The complete system should allow a user to ask a question such as "Find me a man with a big nose, long hair, dark complexion and blue eyes" and have the system return all the pictures that meet those criteria.

Chapter 2

Natural Language to SQL basics

Introduction

The purpose of a natural language interface for a database system is to accept requests in English and attempt to "understand" them. A natural language interface usually has its own dictionary. This dictionary contains words related to a database and its relationships. In addition to this, the interface also maintains a standard dictionary (e.g. Webster's dictionary). A natural language interface refers to words in its own dictionary, as well as to the words in the standard dictionary, in order to interpret a query. If the interpretation is successful, the interface generates a SQL query corresponding to the natural language request and submits it to the DBMS for processing; otherwise, a dialogue is started with the user to clarify the request.

The purpose of this paper is to present an overview of natural language interfaces. The PRECISE and MASQUE interfaces are discussed. The ELF (English Language Front End) interface is used to translate the natural language question into SQL for the project described in this paper. The project makes this SQL available to the fuzzy database research group, which enables them to modify the SQL and use it for fuzzy queries.

The area of NLP research is still very experimental, and systems so far have been limited to small domains, where only certain types of sentences can be used. When the systems are scaled up to cover larger domains, NLP becomes difficult due to the vast amount of information that needs to be incorporated in order to parse sentences. For example, the sentence "The woman saw the man on the hill with a telescope" could have many different meanings. To understand what the intended meaning is, we have to take into account the current context, such as the woman being a witness, and any background information, such as there being a hill nearby with a telescope on it. Alternatively, the man could be on the hill, and the woman may be looking through the telescope. All this information is difficult to represent, so restricting the domain of an NLP system is a practical way to get a manageable subset of English to work with.

The standard approach to database NLP systems is well established. This approach creates a 'semantic grammar' for each database, and uses this to parse the English question. The semantic grammar creates a representation of the semantics of a sentence. After some analysis of the semantic representation, a database query can be generated in SQL or any other database language.

The drawback of this approach is that the grammar must be tailor-made for each database. Some systems allow automatic generation of an NLP system for each database, but in almost all cases there is insufficient information in the database to create a reliable NLP system. Many databases cover a small domain, so an English question about the data within them can easily be analyzed by an NLP system. The database can then be consulted and an appropriate response generated.

The need for a Natural Language Interface (NLI) to databases has become increasingly important as more and more people access information through web browsers, PDAs and cell phones. These are casual users, and it is necessary to provide a way for them to make queries in their own natural language rather than having to first learn and then write SQL queries. The important point, however, is that NLIs are only usable if they map natural language questions to SQL queries correctly.

2.1 Fundamental steps involved in the conversion

The transformation of a given English query to an equivalent SQL form requires some basic steps. All natural language to SQL software packages deal with these basic steps in some manner.

First there is a dictionary, where all the words that are expected to be used in any English question are declared. These words consist of all the elements (relations, attributes and values) of the database and their synonyms. Then these words are mapped to the database system. This implies that the meaning of each word needs to be defined.

They may be called by different names in different systems, but these two things (i.e. the definition and mapping of the words) form the basis of the conversion. These modules are domain dependent and must be present in every system.

Figure 2.1 shows the three basic steps of NL to SQL conversion.

Figure 2.1 [Ref 5]: Architecture of the transformation process

The architecture of the transformation process is shown in Figure 2.1. Note that the domain dependent modules (the lexical dictionary, the semantic dictionary and the interface with data) depend on the data contained in the database. Below is a detailed explanation of each of these modules.

Lexical dictionary: This holds the definition of all the words that may occur in a question. The first step of any system is to parse an English question and identify all the words that are found in the lexical dictionary. The lexical dictionary also contains the synonyms of root words.
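To make the role of the lexical dictionary concrete, a minimal Python sketch is given below. The words, synonyms and the dictionary layout are illustrative assumptions; they do not reflect the actual format used by MASQUE/SQL, PRECISE or ELF.

# Minimal sketch of a lexical dictionary lookup (hypothetical entries and
# layout; not the actual format used by any of the systems discussed here).

LEXICAL_DICTIONARY = {
    # root word: synonyms and inflected forms
    "salary":  {"salary", "salaries", "pay", "wage"},
    "manager": {"manager", "managers", "boss"},
    "age":     {"age", "ages"},
}

def lookup(word):
    """Return the root word for a known word or synonym, or None."""
    w = word.lower().strip("?.,")
    for root, synonyms in LEXICAL_DICTIONARY.items():
        if w in synonyms:
            return root
    return None

def extract_known_words(question):
    """Keep only the words of the question that occur in the dictionary."""
    roots = (lookup(w) for w in question.split())
    return [r for r in roots if r is not None]

print(extract_known_words("What is the salary of each manager?"))
# -> ['salary', 'manager']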

Semantic dictionary: Once the words are extracted from the English question using the lexical dictionary, they are mapped to the database. The semantic dictionary contains these mappings. This process transforms the English question to an internal language (LQL for the architecture shown in Figure 2.1) which is then converted to SQL.

During the mapping process, words are attached to each other, to entities, or to relations, so the output of this step is a function. For example, consider the question "What is the salary of each manager?" Here the attributes salary and manager are attached, and so the output is the function has_salary(salary, manager).
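A minimal sketch of this attachment step is shown below. The predicate names follow the has_salary(salary, manager) example above, but the classification of words into entities and attributes is a simplifying assumption made only for illustration.

# Sketch of the semantic dictionary step: known words are classified and
# attached to form a predicate (the internal-language function of the text).
# The classification below is a simplified assumption.

SEMANTIC_DICTIONARY = {
    "manager": ("entity", "is_manager"),
    "salary":  ("attribute", "has_salary"),
}

def to_predicates(words):
    """Attach each attribute word to an entity word and emit predicates."""
    entities   = [w for w in words if SEMANTIC_DICTIONARY[w][0] == "entity"]
    attributes = [w for w in words if SEMANTIC_DICTIONARY[w][0] == "attribute"]
    predicates = []
    for e in entities:
        predicates.append("%s(%s)" % (SEMANTIC_DICTIONARY[e][1], e))
        for a in attributes:
            predicates.append("%s(%s, %s)" % (SEMANTIC_DICTIONARY[a][1], a, e))
    return predicates

print(to_predicates(["salary", "manager"]))
# -> ['is_manager(manager)', 'has_salary(salary, manager)']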

Interface with data: The next step is the conversion of the internal query developed above into an equivalent SQL statement. This is done somewhat differently by different systems, and this step may be combined with the step above to produce a SQL statement directly. There are predefined rules (depending on the interface) that change the internal language statement into SQL, and the interface with data contains all of those rules.
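As a rough illustration of such rules, the sketch below turns the predicates from the previous example into a single SQL statement for "What is the salary of each manager?". The table and column names (staff, salary, position) are assumptions invented for the example; a real interface would derive them from the database schema.

# Sketch of the "interface with data" step: a rule table rewrites internal
# predicates into SQL fragments, which are then merged into one statement.
# Table and column names are assumptions.

RULES = {
    # predicate   -> SQL fragment (table, optional column, optional condition)
    "is_manager": {"table": "staff", "where": "position = 'manager'"},
    "has_salary": {"table": "staff", "select": "salary"},
}

def predicates_to_sql(predicates):
    """Merge the fragments of all predicates into one SQL statement."""
    selects, tables, wheres = [], [], []
    for p in predicates:
        rule = RULES[p]
        if "select" in rule:
            selects.append(rule["select"])
        if rule["table"] not in tables:
            tables.append(rule["table"])
        if "where" in rule:
            wheres.append(rule["where"])
    sql = "SELECT " + ", ".join(selects) + " FROM " + ", ".join(tables)
    if wheres:
        sql += " WHERE " + " AND ".join(wheres)
    return sql + ";"

print(predicates_to_sql(["is_manager", "has_salary"]))
# -> SELECT salary FROM staff WHERE position = 'manager';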

2.2 MASQUE/SQL

Introduction:

There are many commercial natural language query interfaces available; MASQUE/SQL is one of them. MASQUE/SQL is a modification of the earlier product MASQUE, which answered questions by generating Prolog queries [5]. MASQUE (Modular Answering System for Queries in English) was developed at the Artificial Intelligence Applications Institute and the Department of Artificial Intelligence of the University of Edinburgh. Complex queries were answered by MASQUE in a couple of seconds, but since it answered queries in Prolog, existing commercial databases were unable to use it, and so MASQUE/SQL was developed.

MASQUE/SQL can answer users' questions almost as quickly as the original MASQUE: it answered most test questions in fewer than six seconds of CPU time, including the time taken by the DBMS to retrieve data from the sample database [5]. One important point about MASQUE/SQL is that its performance is not significantly affected when using larger databases. This is because it transforms each user question into a single SQL query, and the relational DBMS is left to find answers to the query using its own optimization and planning techniques. Thus the full power of a relational DBMS is available during question answering, and the system is therefore expected to scale up easily to larger databases. The drawback of larger databases is that the dictionary becomes larger and more complicated, and the parsing time also increases.

How MASQUE/SQL works:

MASQUE/SQL follows the same basic architecture as MASQUE. It starts with the building of the lexicon, which is done using a built-in domain editor that helps the user declare the words expected to appear in a question.

The second step, as per the general architecture, is building the semantic dictionary. This is also done by the built-in domain editor, which allows the user to define the meaning of each declared word in terms of a logic predicate. The logic predicate defines the mapping of a word to the database. An example is given in Figure 2.2.

Figure 2.2 (is-hierarchies): Person (customer; Staff: salesperson, manager, technician) and Feature (numeric: age, salary; non-numeric: name, address)

Figure 2.2 shows two is-hierarchies. The one on the left covers a set of entities, and the one on the right describes the properties of entities. In the database there is a relationship defined between these properties and entities. For instance, if salary is related to staff and age to person, then there are logic predicates such as has_salary(Salaries, Staff) and has_age(Age, Person). There are also predicates such as is_manager(M) that tell whether M is a manager or not. When the noun "manager" (as in the query "Is PG a manager?") is encountered, it is mapped to the predicate is_manager(M), and similarly the meaning of the noun "salary" (as in "What is the salary of each manager?") is expressed as the predicate has_salary(Sal, Stf). Here the predicate has two arguments: the first is the salary, and since only staff can have a salary (from the relationship in the database), 'Stf' (signifying staff) is the second argument. Anything that is not a staff entity cannot fill this argument, so 'customer' cannot be used.

There are two types of predicates used in MASQUE/SQL.

  1. Type A - Predicates that show relationship (mapping) between individual entities. For example has_salary links Staff and Salary.
  2. Type B - Predicates that express relations linking entity-sets to entities. For example the meaning of “average” as in “What is the average age of the managers?” is av(Aver, Set), where set stands for a set of numeric entities, and aver stands for a numeric entity corresponding to the average of Set.

These logic predicates form the subparts of a Prolog-like LQL (Logical Query Language). This LQL is the internal language which will ultimately be converted to SQL. For example, "What is the salary of each manager?" is translated as follows in LQL:

Answer([S, M]) :- is_manager(M), has_salary(S, M)

The process is as follows. When an English query is entered, the parser extracts the words that occur in the lexical dictionary (in this case "salary" and "manager"). Next the semantic dictionary gives the logic predicates for these two nouns: is_manager(M) for "manager" and has_salary(S, M) for "salary".

The last step is to convert the LQL into a SQL query that can be executed by the DBMS. This conversion process is explained with the help of the example given in Figure 2.3.

Given the English query "What is the average age of the managers?", the following LQL is generated:

Figure 2.3

The translation algorithm uses a binding structure, which maps LQL variables to SQL code fragments. Whenever a new LQL variable is encountered, a suitable SQL fragment is stored as the binding of that variable. When the variable is re-encountered, its binding is used to create a WHERE condition. Returning to the example, the type-A predicate is_manager(Mng) is translated into the SQL shown in Figure 2.4.

Figure 2.4

and rel1.arg1 becomes the binding of Mng. Then has_age(Age, Mng) is processed. The binding of Mng (i.e. rel1.arg1) is used to create a WHERE condition, and rel2.arg1 becomes the binding of Age (see Figure 2.5).

Figure 2.5

To obtain the final SQL query, both of these subqueries need to be combined. To do this, a conjunction list is formed. To translate this conjunction list, the FROM and WHERE parts of the translations of the conjuncts are merged. In the example, the second argument of setof (Figure 2.3) is as shown in Figure 2.6.

Figure 2.6

Processing the overall setof instance gives the binding of the Ages list shown in Figure 2.7.

Figure 2.7

The SELECT part in Figure 2.7 is generated by observing the first argument of setof (Age:Mng) and by using the bindings of Age (rel2.arg1) and Mng (rel1.arg1). The type-B predicate avg is linked to the pseudo-SQL query given in Figure 2.8.

Figure 2.8

Consulting the SELECT part of the binding of the Ages list, the first can be associated with rel2.arg1, and the second with rel1.arg1. Processing the avg instance causes the binding of Aver to become as shown in Figure 2.9,

Figure 2.9

where the FROM and WHERE parts are the same as in the binding of the Ages list. The translation of the full LQL query is the binding of Aver.
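Since the SQL text of Figures 2.3 through 2.9 is not reproduced here, the following Python sketch illustrates the binding-structure idea for the is_manager/has_age conjunction. The relation names (is_manager, has_age) and the arg1/arg2 column names are assumptions chosen only to match the rel1.arg1 and rel2.arg1 aliases mentioned above; the outer setof/avg wrapping is omitted.

# Sketch of the binding structure used in the translation (an illustration,
# not MASQUE/SQL's code).  Relation and column names are assumptions that
# mirror the rel1.arg1 / rel2.arg1 aliases used in the walkthrough above.

def translate(conjuncts):
    """Translate a conjunction list of LQL literals, e.g.
    [("is_manager", ["Mng"]), ("has_age", ["Age", "Mng"])],
    by merging the FROM and WHERE parts of the conjuncts."""
    bindings = {}                    # LQL variable -> SQL column reference
    from_parts, where_parts = [], []

    for i, (predicate, variables) in enumerate(conjuncts, start=1):
        alias = "rel%d" % i
        from_parts.append("%s %s" % (predicate, alias))
        for j, var in enumerate(variables, start=1):
            column = "%s.arg%d" % (alias, j)
            if var in bindings:
                # Variable seen before: its binding yields a WHERE condition.
                where_parts.append("%s = %s" % (column, bindings[var]))
            else:
                # New variable: store the SQL fragment as its binding.
                bindings[var] = column

    sql = "SELECT " + ", ".join(bindings.values())
    sql += " FROM " + ", ".join(from_parts)
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    return sql, bindings

sql, bindings = translate([("is_manager", ["Mng"]), ("has_age", ["Age", "Mng"])])
print(bindings)   # {'Mng': 'rel1.arg1', 'Age': 'rel2.arg1'}
print(sql)
# SELECT rel1.arg1, rel2.arg1 FROM is_manager rel1, has_age rel2 WHERE rel2.arg2 = rel1.arg1

Under the same assumptions, the type-B predicate avg would then wrap the Age binding in an aggregate, roughly SELECT AVG(rel2.arg1) over the same FROM and WHERE parts, in the spirit of the pseudo-SQL of Figure 2.8.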

2.3 PRECISE

PRECISE is a natural language interface developed at the University of Washington, Seattle, WA. The PRECISE natural language interface to databases is designed on the principle that it should guarantee the correctness of its output, or else indicate that it does not understand the input question. PRECISE is the first system with formal guarantees on the soundness and completeness of an NLI [2]. A demo of PRECISE is available at the University of Washington website [1]. The database used for this demo is the Airline Travel Information Service (ATIS) database, which contains information on flight schedules, fares, ground transportation, airports, and airplanes. For example, asking "Show me all flights from Seattle to Boston" returns the result along with the generated SQL. However, the PRECISE interface is not available for commercial use and so it was not tested for this project.

A database is made up of three types of elements: relations, attributes and values. An attribute is a particular column in a particular relation, and each value element is a value of a particular attribute. The words that usually appear in questions (what, which, where, who, when) are known as wh-words. These words help in identifying attributes in questions. A set of word stems that matches a database element is called a token.

For example, {require, experience} and {need, experience} could both refer to the attribute Required Experience. There are two types of tokens: value tokens and attribute tokens. If a token corresponds to a value in the database it is a value token; similarly, if a token corresponds to an attribute in the database it is an attribute token. This token system helps match the words in a question to database elements.

A set of tokens such that every word in the question appears in exactly one token is known as a complete tokenization of the question. PRECISE uses an attachment function, which maps pairs of tokens to TRUE or FALSE, indicating whether two given tokens are attached. For example, consider the question "What French restaurants are located downtown?" In this example the tokens located and downtown are attached, while the tokens what and downtown are not.
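As an illustration, the sketch below represents an attachment function for the restaurant question. Only the located/downtown and what/downtown facts come from the text; the remaining pairs are assumptions, since PRECISE actually derives attachments from a syntactic parse of the question.

# Sketch of an attachment function over one complete tokenization of the
# question.  The attachment pairs beyond located/downtown are illustrative
# assumptions, not PRECISE output.

TOKENS = ["what", "french", "restaurants", "located", "downtown"]

ATTACHED_PAIRS = {
    frozenset({"located", "downtown"}),
    frozenset({"french", "restaurants"}),
    frozenset({"what", "restaurants"}),
}

def attachment(token_a, token_b):
    """Map a pair of tokens to TRUE/FALSE: are the two tokens attached?"""
    return frozenset({token_a, token_b}) in ATTACHED_PAIRS

print(attachment("located", "downtown"))   # True
print(attachment("what", "downtown"))      # False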

A valid mapping from a complete sentence tokenization to a set of database elements has the following characteristics:

- each token matches a unique database element
- each attribute token is attached to a unique value token
- a relation token is attached to either an attribute token or a value token

A question is semantically tractable if it has at least one complete sentence tokenization consisting of distinct tokens, at least one value token matches a wh-word, and the tokenization results in a valid mapping. Examples are given in Figures 2.10 and 2.11.

Figure 2.10

Figure 2.11
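The three valid-mapping conditions listed above can be checked mechanically once the tokens and their matches are known. The sketch below does this for a simplified, hypothetical tokenization of the restaurant question; the element names and attachments are assumptions, not output from PRECISE.

# Sketch of a valid-mapping check implementing the three conditions listed
# above.  The token records (kind, matched element, attachments) are
# hypothetical and simplified.

def is_valid_mapping(tokens):
    # 1. each token matches a unique database element
    elements = [t["element"] for t in tokens]
    if len(elements) != len(set(elements)):
        return False

    for t in tokens:
        kinds = [other["kind"] for other in t["attached"]]
        # 2. each attribute token is attached to a unique value token
        if t["kind"] == "attribute" and kinds.count("value") != 1:
            return False
        # 3. a relation token is attached to an attribute or a value token
        if t["kind"] == "relation" and not ("attribute" in kinds or "value" in kinds):
            return False
    return True

# Hypothetical tokens for "What French restaurants are located downtown?"
what       = {"kind": "value",    "element": "Restaurant.Name (wh-value)",      "attached": []}
french     = {"kind": "value",    "element": "Restaurant.Cuisine = French",     "attached": []}
downtown   = {"kind": "value",    "element": "Restaurant.Location = downtown",  "attached": []}
restaurant = {"kind": "relation", "element": "Restaurant", "attached": [what, french, downtown]}

print(is_valid_mapping([what, french, downtown, restaurant]))   # True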

How PRECISE works

Given an English question, PRECISE determines whether it is semantically tractable, i.e. whether it has at least one complete tokenization with a valid mapping. If it is, PRECISE generates the corresponding SQL query or queries.

PRECISE uses a max-flow algorithm for graph matching problems to find a valid mapping from the complete sentence tokenization to database elements. For example, consider "What are the HP jobs on a UNIX system?" Given a Job relation with attributes Description, Platform and Company, PRECISE produces a complete tokenization of the question as shown in Figure 2.12. The syntactic markers (which have no impact on the interpretation of the question) are: are, the, on, a. The value tokens are: what, HP, UNIX. The only attribute token is system, and the only relation token is job.

Figure 2.12 [Ref 4]
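Since Figure 2.12 is only referenced above, the sketch below reproduces the same classification in code. The word lists follow the classification given in the text; the dictionary-driven classifier itself is a simplification, not PRECISE's implementation.

# Sketch of the tokenization of "What are the HP jobs on a UNIX system?".
# The classification follows the text; the classifier is a simplification.

SYNTACTIC_MARKERS = {"are", "the", "on", "a"}   # no impact on interpretation
VALUE_WORDS       = {"what", "hp", "unix"}
ATTRIBUTE_WORDS   = {"system"}                  # matches the Platform attribute
RELATION_WORDS    = {"jobs"}                    # matches the Job relation

def tokenize(question):
    tokens = {"value": [], "attribute": [], "relation": [], "marker": []}
    for word in question.lower().replace("?", "").split():
        if word in SYNTACTIC_MARKERS:
            tokens["marker"].append(word)
        elif word in VALUE_WORDS:
            tokens["value"].append(word)
        elif word in ATTRIBUTE_WORDS:
            tokens["attribute"].append(word)
        elif word in RELATION_WORDS:
            tokens["relation"].append(word)
    return tokens

print(tokenize("What are the HP jobs on a UNIX system?"))
# {'value': ['what', 'hp', 'unix'], 'attribute': ['system'],
#  'relation': ['jobs'], 'marker': ['are', 'the', 'on', 'a']}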

An attribute-value graph is constructed as shown in Figure 2.13.
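Figure 2.13 is not reproduced here, so the sketch below shows how a maximum matching over such an attribute-value graph might be computed. The candidate edges are hypothetical, and the augmenting-path matching is a generic textbook routine standing in for PRECISE's max-flow formulation.

# Sketch of matching value tokens to database values over a hypothetical
# attribute-value graph for the HP/UNIX example.  A simple augmenting-path
# bipartite matching stands in for the max-flow computation.

GRAPH = {
    # value token -> candidate database values (hypothetical edges)
    "what": ["Job.Description"],
    "hp":   ["Company.Name = HP", "Platform.Name = HP-UX"],
    "unix": ["Platform.Name = UNIX"],
}

def max_matching(graph):
    match = {}   # database value -> token currently assigned to it

    def try_assign(token, seen):
        for value in graph[token]:
            if value in seen:
                continue
            seen.add(value)
            # Use the value if it is free, or if its current token can move.
            if value not in match or try_assign(match[value], seen):
                match[value] = token
                return True
        return False

    for token in graph:
        try_assign(token, set())
    return {token: value for value, token in match.items()}

print(max_matching(GRAPH))
# {'what': 'Job.Description', 'hp': 'Company.Name = HP',
#  'unix': 'Platform.Name = UNIX'}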