JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN

COMPUTER ENGINEERING

PROBABILISTIC CONTEXT FREE GRAMMAR: AN APPROACH TO GENERIC INTERACTIVE NATURAL LANGUAGE INTERFACE TO DATABASES

1 P. R. DEVALE, 2ARATI DESHPANDE

1, 2 Department of Information Technology,Bharati Vidyapeeth Deemed University

College Of Engineering, Pune-46

ABSTRACT: Information is playing a crucial role and one of the major sources of information is databases. Almost all IT applications are storing and retrieving information from databases. The Structured Query Language (SQL) is used in almost all languages for relational database systems. The idea of using Natural Language instead of SQL has prompted the development of new type of processing called Natural language Interface to Database. The natural language interface takes natural language questions as input and gives textual response. Shadow parsing used in natural language processing can analyze a wide range of questions with high accuracy and produce reasonable textual responses. The unawareness of structure of database in non expert users led to the development of Intelligent Database System ( IDBS). Many systems such as LUNAR and LADDER have been developed using the concept of natural language processing . The drawback of these systems is that grammar must be tailor-made for each given database and they cover only a small domain of the English language questions. Generic Interactive Natural Language Interface to Databases (GINLIDB) system is generic in nature given the appropriate database and knowledge base It is designed by the use of UML and developed using Visual Basic.NET-2005. An overview of NLIDB subcomponents is given, NLIDB architectures and various approaches are also covered.

Keywords: Natural Language Database Interface, Probabilistic Context Free Grammars, Natural Language Interface To Databases (NLIDB), Structured Query Language (SQL), Unified Modeling Language (UML), ShallowParsing, Intelligent Database Systems (IDBS), LUNAR, LADDER, Databases, Database Management System (DBMS)

ISSN: 0975 –6760| NOV 10 TO OCT 11 | VOLUME – 01, ISSUE - 02 Page 1

JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN

COMPUTER ENGINEERING

1. INTRODUCTION:-

Information is playing a very important role in our lives today and one of the major sources of information is databases. Databases are gaining prime importance in a huge variety of application areas both private and public information systems. Databases are built with the objective of facilitating the activities of data storage, processing, and retrieval etc. all of which are associated with data management in information systems. In relational databases, to retrieve information from a database, one needs to formulate a query in such way that the computer will understand and produce the desired output [4]. Retrieval of information from the databases requires the knowledge of databases language like the Structured Query Language (SQL). The SQL norms are based on a Boolean interpretation of the queries. But not all users are aware of the Query language and not all the queries of user’s seeking information from the database can be answered explicitly by the querying system. It is due to the fact that not all the requirement’s characteristics can be expressed by regular query languages. To simplify the complexity of SQL and to manipulate of data in databases for common people (not SQL professionals), many researches have turned out to use natural language instead of SQL. The idea of using natural language instead of SQL has led to the development of new type of processing method called Natural Language Interface to Database systems (NLIDB) [12, 16]. The NLIDB system is actually a branch of more comprehensive method called Natural Language Processing (NLP). In general, the main objective of NLP research is to create an easy and friendly environment to interact with computers in the sense that computer usage does not require any programming language skills to access the data; only natural language (i.e. English) is required. One drawback of previous systems is that the grammar must be tailor-made for each given database. Another drawback is that many NLP systems cover only a small domain of the English language questions. The system is generic in nature given the appropriate database and knowledge base. This feature makes our system distinguishable. Natural language computing (NLC) is becoming one of the most active areas in Human-Computer Interaction. The goal of NLC is to enable communication between people and computers without resorting to memorization of complex commands and procedures [19]. Via computers around the globe the people accumulate and manipulate huge amount of data every second of the day. These huge amounts of data are located in private personal computers or remote location (i.e. the internet). Mostly, the data is stored in some kind of databases or data warehouses. Data in database are usually managed by DBMS. SQL is used to access the database. The NLIDB system is actually a branch of more comprehensive method called Natural Language Processing (NLP). The main objective of NLP research is to create an easy and friendly environment to interact with computers in natural language (i.e. English) [16]. In recent times, there is a rising demands for non-expert users to query relational databases in a more natural language encompassing linguistic variables and terms, instead of operating on the values of the attributes. Therefore the idea of using natural language instead of SQL has prompted the development of new type of processing method called Natural Language Interface to Database systems (NLIDB). NLIDB is a step towards the development of intelligent database systems (IDBS) to enhance the users in performing flexible querying in databases.

2. EARLY SYSTEMS

The early efforts in the NL interfaces area started back in fifties. Prototype systems were introduced in the late sixties and early seventies. Many of these systems relied on pattern matching to directly mapping the user input to the database . Formal LIst Processor (FLIP) is an early language for pattern-matching based on LISP structure works on the bases that if the input matches one of the patterns then the system is able to build a query for the database. In the pattern matching based systems, the database details were inter-mixed into the code, limited to specific databases and to the number and complexity of the patterns. Natural language processing was one of the approach, where the user interactively is allowed to interrogate the stored data.

2.1 LUNAR system

The system LUNAR is a system that answers questions about samples of rocks brought back from the moon. The meaning of systems’ name is that is in relation to the moon. The system was informally introduced in 1971. The LUNAR system uses two databases; one for the chemical analysis and the other for literature references which helps it to achieve its functions. The LUNAR system uses an Augmented Transition Network (ATN) parser and Woods' Procedural Semantics. According to [13], the LUNAR system performance was quite impressive; it managed to handle 78% of requests without any errors and this ratio rose to 90% when dictionary errors were corrected. The system was not put to intensive use because of its linguistic limitations.

2.2 LADDER

The LADDER system was designed as a natural language interface to a database of information about US Navy ships. According to, the LADDER system uses semantic grammar to parse questions to query a distributed database. The system uses semantic grammars technique that interleaves syntactic and semantic processing. The question answering is done via parsing the input and mapping the parse tree to a database query. The system LADDER is based on a three-layered architecture [14]. The first component of the system is for Informal Natural Language Access to Navy Data (INLAND). This component accepts questions in a natural language andgenerates a query to the database. The queries from the INLAND are directed to the Intelligent Data Access (IDA). This component which is the second component of LADDER builds a fragment of a query to IDA for each lower level syntactic unit in the English Language. Input query and these fragments are then combined to higher level syntactic units to be recognized.The combined fragments are sent as a command to IDA at the sentence level. IDA would compose an answer that is relevant to the user’s original query in addition to planning the correct sequence of file queries. The third component of the LADDER system is for File Access Manager (FAM). The FAM finds the location of the generic files and handles the access to them in the distributed database. LISP was used to implement the LADDER system.

2.3 CHAT-80

The system CHAT-80 is one of the most referenced NLP systems in the 80’s. Prolog was used to implement this system. The CHAT-80 was an impressive, efficient and sophisticated system. The database of CHAT-80 consists of facts (i. e. oceans, major seas, major rivers and major cities) about 150 of the countries world and a small set of English language vocabulary that are enough for querying the database. An English language question is processed in three stages [15]. The system translates the English language question by the creation of a logical form as processes of three serial and complementary functions where:

  1. Logical constants are used to represent words.
  2. Predicates are used to represent Verbs, nouns, and adjectives with their associated prepositions. The predicates can have one or more arguments.
  3. Conjunctions of predicates are used to represent complex phrases or sentences.

Figure 1 Chat-80 Processing Scheme

These functions are being; parsing, interpretation and scoping. Grammatical structure of a sentence is determined by the parsing module and the interpretation andscoping consist of various translation rules, expressed directly as Prolog clauses. The basic method followed by Chat-80 is to attach some extra control information to the logical form of a query in order to make it an efficient piece of Prolog program that can be executed directly to produce the answer. The generated control information comes into two forms:

  1. The order in which Prolog will attempt to satisfy the query is determined by the Order of the predications for that query.
  2. Separates the over all program into a number of independent sub problems to limit the amount of backtracking performed by Prolog.

In this way, Prolog is led to answer the queries in an obviously sensible way and the Prolog compiler can compile the transformed query into code that is as efficient as iterative loops in a conventional language.

3. VARIOUS APPROACHES TO NATURAL LANGUAGE PROCESSING INTERFACE

Natural language is the topic of interest from computational viewpoint due to the implicit ambiguity that language possesses [1]. Several researchers applied different techniques to deal with language. Next few sub-sections describe diverse strategies that are used to process language for various purposes.

3.1Symbolic Approach (Rule Based Approach)

Natural Language Processing appears to be a strongly symbolic activity. Words are symbols that stand for objects and concepts in real worlds, and they are put together into sentences that follow the well specified grammar rules. Knowledge about language is explicitly encoded in rules or other forms of representation. Language is analyzed at various levels to obtain information. On this obtained information certain rules are applied to achieve linguistic functionality [1]. As Human Language capabilities include rule-base reasoning, it is supported well by symbolic processing. In symbolic processing rules are formed for every level of linguistic analysis.

3.2 Empirical Approach (Corpus Based Approach)

Empirical approaches are based on statistical analysis as well as other data driven analysis, of raw data which is in the form of text corpora i.e. a collection of machine readable text. The approach has been around since NLP began in the early 1950s. The empirical NLP has emerged as a replacement to rule-based NLP only in the last decade. Corpora are primarily used as a source of information about language and a number of techniques have emerged to enable the analysis of corpus data. Syntactic analysis can be achieved on the basis of statistical probabilities estimated from a training corpus. Lexical ambiguities can be resolved by considering the likelihood of one or another interpretation on the basis of context. Given the positive results shown by the Empirical approach several different symbolic and statistical methods have been employed, but most of them are used to generate one part of a larger information extraction system.

3.3 Connectionist Approach (Using Neural Network)

Since human language capabilities are based on neural network in the brain, Artificial Neural Networks (also called as connectionist network) provides on essential starting point for modeling language processing [2, 18]. Instead of symbols, the approach is based on distributed representations that correspond to statistical regularities in language. Though there has been significant research on the use of applying neural networks for language processing little has been done in the field of using the sub-symbolic learning in language processing. SHRUTI system developed by [3] is a neurally inspired system for event modeling and temporal processing at a connectionist level.

4. NATURAL LANGUAGE DATABASE INTERFACE BASED ON PROBABILISTIC CONTEXT FREE GRAMMAR:

All the earlier tools translates a natural language query into an intermediate language similar to first-order logic and then into SQL using a set of definitive rules. The intermediate language expresses the meaning of the query in terms of high-level concepts that are independent of database structures. It is lack of robust in the translating process, since premises of rules must be matched with phrases of the query exactly or else the rule will be rejected. The purpose of CFG is to overcome this problem by constructing the most probable grammar tree and analyzing the non-terminals (phrases) in the grammar tree to collect the parameters which will be used in SQL.

4.1 Overview of Natural Language Database Interface

The overall system architecture of our natural language database interface is given in Figure 2. The original input to the system is a query expressed in natural language. To process a query, the first step is part of speech tagging; after this step each word of the query is tagged [6, 7]. The second step is parsing the tagged sentence by a PCFG. The PCFG parser analyzes the query sentence according to the tag of each word, and calculates the probability of all possible grammar trees. The result of the analysis forms a grammar tree with the maximum probability. Finally, the SQL translator processes the grammar tree to obtain the SQL query by a series of dependency rules.

Figure 2 NLDBI Architecture

4.2 Probabilistic Context Free Grammar

CFG has a formalization capability in describing most sentence structures. Also, CFG is so well formed that efficient sentence parser could be built on top of it [19]. Probabilistic context free grammar (PCFG) inherits most advantages of CFG, and it is the simplest statistical model to analyze natural language. Usually natural language sentences are transformed into a tree structure through PCFG, and the grammar tree is analyzed according to user’s requirements. A probabilistic context free grammar consists of the following:

A terminal set: {wk}, where wk is a word, corresponding to a leaf in the grammar tree;

A non-terminal set: {Ni}, Ni which is a sign used to generate terminals, corresponding to a non-leaf node in the grammar tree;

The grammar is weakly equivalent to the Chomsky Norm Form grammar (CNFG) which consists of rules of only two forms which is the most concise grammar. With n non-terminals and v terminals, the number of parameters for each of the four forms is shown in table 1.

s

Table 1

Consider a sentence w1m that is a sequence of words w1 w2 w3……wm (ignoring punctuations), and each string wi in the sequence stands for a word in the sentence. The grammar tree of w1m can be generated by a set of pre-defined grammar rules. As wi may be generated by different non-terminals, and this situation also appears when generating a non-terminal (a non-terminal may be generated by different sets of several non-terminals), usually more than one grammar tree may be generated. Take sentence “ate dinner on the table with a fork” for example, there are 2 grammar tress corresponding to the sentence as shown in figure 3.

Figure 3

Two possible grammar trees that may be generated for the sentence “ate dinner on the table with a fork”

4.3 SQL Translator

SQL translator is used translating the leaves of the tree to the corresponding SQL. Actually the process is collecting information from the parsed tree. Two techniques may be used to collect theinformation: dependency structure and verb sub categorization [8, 9, 10]. These techniques are also used in disambiguation, since a PCFG is context-free and ancestor-free, any information in the context of a node in the parsed tree needs not be taken into account when constructing the parsed tree. A dependency structure is used to capture the inherent relations occurs in the corpus texts that may be critical in real-world applications, and it is usually described by typed dependencies. A typed dependency represents grammatical relations between individual words with dependency labels, such as subject or indirect object. Verb sub categorization is the other technique [11]. If we know the sub categorization frame of the verb, we can find the objects of this verb easily, and the target of the query can be found easily. Usually this process can be done after scanning the parsed tree.

5. THE GENERIC INTERACTIVE NATURAL LANGUAGE INTERFACE TO DATABASE SYSTEM ARCHITECTURE:

The architecture of the GINLIDB system consists of two major components:

1. Linguistic handling component

2. SQL constructing component.

The natural language query correctness as far as the grammatical structure and the possibility of successful transformation to SQL statement is concerned, is controlled by the first component.