GI retrieval based on Natural Language Processing
Adrian Zafiu1,2, Tiberiu Boros1
1Research Institute for Artificial Intelligence, Romanian Academy,
Bucuresti, Romania
2University of Pitesti, Faculty of Electronics, Communications and Computers,
Pitesti, Romania
Natural language processing (NLP) has recently benefited from technological advances and from growing interest among major research groups and large companies. One reason is that NLP analysis can convert unstructured information (e.g. text) into structured information. Simple queries on search engines show that the Internet contains a large amount of unstructured Geographic Information (GI). To illustrate the idea, we examine the data that can be extracted from a short passage of an article downloaded from Wikitravel:
Bucharest is the primary entry point into Romania. (…) Known in the past as "The Little Paris" Bucharest has changed a lot lately and (…) Finding a 300 year old church near a steel-and-glass building that both sit next to a communist style building is commonplace in Bucharest. (…)
Extracted from http://wikitravel.org/en/Bucharest#Understand.
Because of the abstract size restriction, our analysis covers only the first sentence of the passage. After running a typical NLP analysis (part-of-speech tagging, chunking, parsing, named entity recognition, etc.), we find the following relations between entities (figure 1):
Figure 1 – Rule-based analysis of the first sentence
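The kind of output shown in figure 1 can be imitated with a toy example. The sketch below is not the analysis pipeline used in this work: a single regular expression stands in for tagging, chunking and parsing, and the names PATTERN and analyse are hypothetical. It only handles sentences shaped like "<Entity> is <role> into <Entity>."

```python
import re

# Toy stand-in for a full NLP pipeline (assumption): one hand-written rule
# for sentences of the form "<Entity> is <role> into <Entity>."
PATTERN = re.compile(
    r"^(?P<subject>[A-Z]\w+) (?P<rel1>is) (?P<role>.+?) "
    r"(?P<rel2>into) (?P<target>[A-Z]\w+)\.$"
)


def analyse(sentence):
    """Return entities, locations and relations, or None if the rule does not apply."""
    match = PATTERN.match(sentence)
    if match is None:
        return None
    subject, role, target = match["subject"], match["role"], match["target"]
    return {
        "named_entities": [subject, target],
        "locations": [subject, role, target],
        "relations": [
            (subject, match["rel1"], role),
            (role, match["rel2"], target),
        ],
    }


result = analyse("Bucharest is the primary entry point into Romania.")
```

Sentences that do not match the rule return None, which is where statistical methods or further rules would have to take over.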
“Bucharest” and “Romania” are named entities; “the primary entry point”, “Bucharest” and “Romania” are also locations; and the words “is” and “into” express relations between these elements. From a GIS point of view this information may not seem useful, but when properly stored and structured it can provide answers to questions such as “what is the primary entry point of Romania?” or “what is Bucharest?”. Also, if we have the exact shape and location of the areas represented by Romania and Bucharest, we can extrapolate an answer to “where is the primary entry point into Romania?”. This answer may still be ambiguous from a spatial perspective if we try to pinpoint the entry point on the perimeter that defines Romania. However, other sources may contain information that is more accurate from the GIS point of view, so the task also becomes one of choosing the correct source of information. We consider this to be a hybrid NLP – GI problem.
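To make the question-answering use concrete, the following minimal sketch stores the two relations from the first sentence as triples and answers the two example questions by lookup. The flat triple layout and the names Relation, objects_of, subjects_of and primary_entry_point_of are illustrative assumptions, not the storage model of our system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str


# Relations read off the first sentence; the triple layout is an assumption.
RELATIONS = [
    Relation("Bucharest", "is", "the primary entry point"),
    Relation("the primary entry point", "into", "Romania"),
]


def objects_of(subject, predicate, relations=RELATIONS):
    """Objects linked to `subject` by `predicate` (e.g. 'what is Bucharest?')."""
    return [r.obj for r in relations
            if r.subject == subject and r.predicate == predicate]


def subjects_of(predicate, obj, relations=RELATIONS):
    """Subjects linked to `obj` by `predicate`."""
    return [r.subject for r in relations
            if r.predicate == predicate and r.obj == obj]


def primary_entry_point_of(country, relations=RELATIONS):
    """Chain 'X is <role>' with '<role> into <country>' to find X."""
    answers = []
    for role in subjects_of("into", country, relations):
        answers.extend(subjects_of("is", role, relations))
    return answers


what_is_bucharest = objects_of("Bucharest", "is")
entry_point = primary_entry_point_of("Romania")
```

Answering the "where" question would additionally require the geometry of the matched entity from a GIS source, which this sketch does not model.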
Figure 2 – Domain entities
Figure 3 – Domain entities
This paper focuses on using natural language processing (NLP) techniques to retrieve geographic information (GI) from text. We also introduce and evaluate an ontology (figure 2) that defines the basic entities and relations used to store the information retrieved from plain text. We are currently developing a system for GI extraction from text sources that combines statistical and rule-based NLP methods with available GIS information. We split our GI entities into referenced (known data, e.g. locations stored in a database, such as “Bucharest”) and unreferenced (unknown data, e.g. a 300-year-old church). NLP analysis enables us to create references for some unreferenced data and to determine the usability of the information retrieved from texts (e.g. whether the data can be used for spatial representation of objects or has a different level of usability, such as question answering).
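The referenced/unreferenced split can be sketched as a lookup against a store of known locations. Everything below is an assumption made for illustration: GAZETTEER is a hypothetical stand-in for a GIS database with approximate coordinates, the two usability labels are a simplification of the levels mentioned above, and anchor_to shows only one possible way of creating a coarse reference for an unreferenced entity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-in for a GIS database: name -> approximate (lat, lon).
GAZETTEER = {
    "Bucharest": (44.43, 26.10),
    "Romania": (45.94, 24.97),
}


@dataclass
class GIEntity:
    text: str
    location: Optional[Tuple[float, float]] = None

    @property
    def referenced(self) -> bool:
        # An entity counts as referenced once it carries a known location.
        return self.location is not None


def classify(mention, gazetteer=GAZETTEER):
    """Build an entity; it is referenced only if the gazetteer knows it."""
    return GIEntity(mention, gazetteer.get(mention))


def usability(entity):
    """Simplified two-level usability (assumption)."""
    return "spatial representation" if entity.referenced else "question answering"


def anchor_to(entity, container):
    """Give an unreferenced entity the coarse location of a referenced container."""
    if entity.referenced or not container.referenced:
        return entity
    return GIEntity(entity.text, container.location)


city = classify("Bucharest")
church = classify("a 300 year old church")
anchored_church = anchor_to(church, city)
```

Here the church inherits the location of Bucharest only because the passage places it there; a finer reference would need more text evidence or another data source.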