Different approaches of Word Sense Disambiguation in Hindi Language

Dr. Pankaj Kumar1, Atul Vishwakarma2 and Ashwani Kr. Verma3

1Assistant Professor, Computer Science dept., Shri Ramswaroop Memorial Group of Professional Colleges,

Lucknow, UP, India

2Graduate Scholar, Computer Science dept., Shri Ramswaroop Memorial Group of Professional Colleges,

Lucknow, UP, India

3Graduate Scholar, Computer Science dept., Shri Ramswaroop Memorial Group of Professional Colleges,

Lucknow, UP, India


Abstract

Hindi is an official language of India, spoken by about 500 million people and ranking 4th among the most widely spoken languages in the world. However, the ambiguities present in this language hinder the use of information technology by native users. There is therefore a need for effective natural language processing measures that allow native users to utilize these technologies to the fullest. A language translator is an important tool for resolving this problem. Word sense disambiguation is an important component of machine translation, and a tool is needed to perform it. In this paper, different approaches to word sense disambiguation in the Hindi language are discussed and compared. The main idea is to compare the context of the word in a sentence with the contexts constructed from the Wordnet and to choose the winner. The output of the system is a particular synset number designating the sense of the word. The Wordnet contexts are built from the semantic relations and glosses, using the Application Programming Interface created around the lexical data.

Keywords

Word Sense Disambiguation (WSD), Natural language processing (NLP), Machine Translation (MT), Hindi Wordnet.

  1. Introduction

1.1 Word Sense Disambiguation-

The task of selecting the correct sense for a word is called word sense disambiguation, or WSD. A word can have a number of senses, a property termed ambiguity: something is ambiguous when it can be understood in two or more possible ways. Word sense disambiguation is an intermediate task; it is not an end in itself, but is necessary at one level to accomplish most natural language processing tasks. Put differently, word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities, where the sense inventory comes from a dictionary or thesaurus. Many words have more than one possible meaning, e.g.

हर वर्ग के लोग कीमतों में वृद्धि से पीड़ित है| (People of every class are suffering from the rise in prices.)

Here ‘वर्ग’ is interpreted as ‘class’.

सात का वर्ग उनचास है| (The square of seven is forty-nine.)

Here ‘वर्ग’ is interpreted as ‘square of the number’.

यह 5 cm का एक वर्ग है| (This is a square of 5 cm.)

Here ‘वर्ग’ is interpreted as ‘square-shaped figure’.

So in this way there is ambiguity for ‘वर्ग’.

1.2 Machine translation-

Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. Machine translation is also known as “Automatic Translation” or “Mechanical Translation”. MT is a multidisciplinary field of research. It uses ideas from linguistics, computer science, artificial intelligence, statistics, mathematics, philosophy and many other fields. There are at least two stages:

1) Understanding the source language and

2) Generating sentences in the target language.

WSD is required in both stages since a word in the source language may have more than one possible translation in the target language. For example, the English word “drug” can be translated into Turkish as “ilaç” for its sense of “medicine” or as “uyuşturucu” for its sense of “dope”, depending on the context. In order to translate a text correctly, we need to know which sense is intended in the text.

1.3 Role of Word Sense Disambiguation in Machine Translation-

Sense disambiguation is essential for the proper translation of words such as the Hindi ‘सोना’, which, depending on the context, can be translated as ‘gold’, ‘sleep’, ‘Sona’ (a name), etc. [1]

Example-

सोना सोना चाहता है|

It can be translated as-

Sona wants gold.

or

Sona wants to sleep.

or

Gold wants to sleep.

or

Gold wants Sona, etc.

So, in this way there is ambiguity for the word ‘सोना’ because it has different senses.

2. Related work

Some of the methods and their approaches for word sense disambiguation will be discussed. We will discuss works done by various researchers in this particular area and problem.

“Unsupervised word sense disambiguation rivaling supervised methods”, Yarowsky, D. (1995): this paper presents an unsupervised learning algorithm for sense disambiguation. The algorithm is based on two powerful constraints, one sense per discourse and one sense per collocation, exploited in an iterative bootstrapping procedure. Tested accuracy exceeds 96%.

Dekang Lin (1997): the intuition that “two different words are likely to have similar meanings if they occur in identical local contexts” is adopted in this paper. Disambiguation is done based on syntactic dependency and sense similarity.

Rigau et al. (1997) correctly state that most WSD algorithms have been developed as stand-alone systems, and investigate the possibility of combining them. The methods in the study include those used by Pedersen et al. and some baseline methods such as using the most frequent sense. Test results indicate an increase of approximately 8% in precision for the combination of disambiguation methods.

“Word Sense Disambiguation by Web Mining”: Peter D. Turney has developed the NRC (National Research Council) Word Sense Disambiguation (WSD) system, which is applied to the English Lexical Sample (ELS) task. The system treats WSD as a supervised machine learning problem. Familiar tools are used, such as the Weka machine learning software and Brill’s rule-based part-of-speech tagger. Words are represented by semantic features and syntactic features. The main idea of the system is the method for generating the semantic features, based on word co-occurrence probabilities.

“Word Sense Disambiguation for Vocabulary Learning”: Anagha Kulkarni, Michael Heilman, Maxine Eskenazi and Jamie Callan (2006) have developed word sense disambiguation for vocabulary learning. It is designed to assist English as a Second Language (ESL) students in improving their English vocabulary, and operates at the level of the word-meaning pairs being learned, not just the words being learned. Both supervised and unsupervised approaches were used. The supervised approaches were consistently more accurate than the unsupervised approaches.

3. Wordnet Principle

Wordnet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory [2]. Each word meaning can be represented by a set of word-forms known as synonym sets or synsets. Synsets are created for content words, i.e., for Noun, Verb, Adjective and Adverb.

3.1 Lexical Matrix-

The following table, called the Lexical Matrix, is an abstract representation of the organization of lexical information [3]. Word forms are imagined to be listed as headings for the columns and word meanings as headings for the rows. Rows express synonymy while columns express polysemy.

Table 1: Illustrating the concept of Lexical Matrix

S.No. / Word Meaning / Word Forms: F1 / F2 / F3 / ..... / Fn
1. / M1 / E1,1 / E1,2 / - / ..... / -
2. / M2 / - / E2,2 / - / ..... / -
3. / M3 / - / - / E3,3 / ..... / -
...... / ...... / .....
m. / Mm / - / - / - / ..... / Em,n

For example, the synset {कलम, पेन, क़लम, लेखनी} gives the meaning उपकरण जिसकी सहायता से कागज़ आदि पर लिखते हैं (an instrument with whose help one writes on paper, etc.). कलम belongs to a synset whose members form a row in the lexical matrix, and the row number gives a unique id to the synset. कलम has another meaning, पेड़ की वह टहनी जो दूसरी जगह बैठाने या दूसरे पेड़ में पैबंद लगाने के लिए काटी जाए (a twig of a tree that is cut to be planted elsewhere or grafted onto another tree), which comes in the column headed by this word.
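The row and column view of the lexical matrix can be illustrated with a minimal Python sketch. It stores each meaning as a row (a synset) and derives the columns by inverting that mapping. The meaning ids `M1` and `M2`, the helper names, and the single-member second synset are assumptions made for this illustration; they are not taken from the Hindi Wordnet.

```python
from collections import defaultdict

# Rows of a toy lexical matrix: meaning id -> word forms expressing it.
# The ids M1 and M2 are hypothetical. The other members of the second
# synset are not listed in the text, so only one form is shown for it.
ROWS = {
    "M1": {"कलम", "पेन", "क़लम", "लेखनी"},  # writing instrument
    "M2": {"कलम"},                          # twig cut for grafting
}


def build_columns(rows):
    """Invert the rows so that each word form maps to its meanings."""
    columns = defaultdict(set)
    for meaning, forms in rows.items():
        for form in forms:
            columns[form].add(meaning)
    return dict(columns)


COLUMNS = build_columns(ROWS)


def synonyms(meaning):
    """A row expresses synonymy: all word forms sharing one meaning."""
    return ROWS[meaning]


def senses(form):
    """A column expresses polysemy: all meanings of one word form."""
    return COLUMNS.get(form, set())


print(sorted(senses("कलम")))    # meanings listed in the column of कलम
print(sorted(synonyms("M1")))   # word forms listed in the row M1
```

A word form is polysemous exactly when its column holds more than one meaning, which is the case WSD has to resolve.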

3.2 Semantic Relations in Wordnet-

Hindi Wordnet design is inspired by the famous English Wordnet [4]. The basic semantic relations are as follows:

Table 2: Illustrating the nature of the relations in Wordnet

s.no / Relation / Meaning
1. / Hypernym/Hyponym / Is-A (kind-of)
2. / Entailment/Troponymy / Manner-of (for verbs)
3. / Meronymy/Holonymy / Has-A

For instance, we have the synset {घर, गृह}. Its hypernymy relation (Is-A) links to {आवास, निवास}. Its meronymy relation (Has-A) links to {आँगन}, {बरामदा} and {अध्ययनकक्ष}, and its hyponymy relation links to {बाड़ी}, {सराय} and {झोपड़ी}.

Figure 1: A small part of the Hindi Wordnet

3.3 Wordnet Application Programming Interface-

The WSD task needs various information from the Wordnet, which in turn calls for the availability of an Application Programming Interface to the Wordnet. Figure 1 shows the organization of the API. To take a particular example, the findtheinfo() function receives as input arguments the word form, the syntactic category, the search type (e.g., hypernymy) and the sense number. It returns the output for that search type (i.e., hypernymy) in a buffered form.

These APIs serve the following purposes: (1) morphological processing, (2) database searching and (3) utilities. Morphological processing routines extract the stem from the word. Database search functions are used to retrieve information from the Wordnet. Utilities cover other operations that are useful for processing words.

3. Methodology: Our Approach to WSD

We describe a statistical technique for assigning senses to words in Hindi [5]. A word is assigned a sense with the use of (i) the context in which it has been mentioned (ii) the information in the Hindi Wordnet and (iii) the overlap between these two pieces of information. The sense with the maximum overlap is the winner sense.

WSD Algorithm: Finding the word’s Correct Sense

1. For a polysemous word w needing disambiguation, a set of context words in its surrounding window is collected. Let this collection be C, the context bag.

2. For each sense s of w, do the following

(a) Let B be the bag of words obtained from the

(I) Synonyms

(II) Glosses

(III) Example Sentences

(IV) Hypernyms

(V) Glosses of Hypernyms

(VI) Example Sentences of Hypernyms

(VII) Hyponyms

(VIII) Glosses of Hyponyms

(IX) Example Sentences of Hyponyms

(X) Meronyms

(XI) Glosses of Meronyms

(XII) Example Sentences of Meronyms

(b) Measure the overlap between C and B using the intersection similarity measure.

3. Output the sense s that has the maximum overlap as the most probable sense.
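The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the system: the sense labels and the contents of the semantic bags are toy data invented for the example, and a real run would fill each bag B from the Hindi Wordnet.

```python
def overlap(context_bag, semantic_bag):
    """Intersection similarity: number of words shared by the two bags."""
    return len(context_bag & semantic_bag)


def rank_senses(context_bag, semantic_bags):
    """Return (sense, overlap) pairs sorted from highest to lowest overlap."""
    scores = [(sense, overlap(context_bag, bag))
              for sense, bag in semantic_bags.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def disambiguate(context_bag, semantic_bags):
    """Pick the winner sense, i.e. the one with the maximum overlap."""
    return rank_senses(context_bag, semantic_bags)[0][0]


# Toy semantic bags B for three senses of the word वर्ग.
# Invented for illustration only; not real Hindi Wordnet content.
SEMANTIC_BAGS = {
    "class": {"समाज", "लोग", "समूह"},
    "square_of_number": {"संख्या", "गुणा", "गुणनफल"},
    "square_shape": {"आकृति", "भुजा", "कोण"},
}

# Context bag C for वर्ग in: हर वर्ग के लोग कीमतों में वृद्धि से पीड़ित है
CONTEXT = {"हर", "के", "लोग", "कीमतों", "में", "वृद्धि", "से", "पीड़ित", "है"}

print(rank_senses(CONTEXT, SEMANTIC_BAGS))
print(disambiguate(CONTEXT, SEMANTIC_BAGS))
```

In this sketch, ties are broken by the order in which the senses are listed, since Python's sort is stable; the text does not specify a tie-breaking rule.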

Figure 2: Extracting semantic relations from Wordnet and building context from the text for WSD

4. Components in the System-

4.1 Parameters

• Wordnet relations: We have used the hypernymy, hyponymy and meronymy relations. Since these relations are semantic in nature, we obtain the synsets, their glosses and example sentences. We call the collection of words from the Wordnet the Semantic Bag.

• Word Context Size: The current sentence in which w occurs forms the most important context. We add to this the previous and the following sentences too. We call the collection of context words the Context Bag.
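The three-sentence context window can be sketched as follows. This is a hedged illustration: the function names, the sentence delimiters and the whitespace tokenization are assumptions, and the target word is matched as an exact token, so morphological variants are not detected.

```python
import re


def split_sentences(text):
    """Split text on the danda (written as । or |), ? and !."""
    parts = re.split(r"[।|?!]", text)
    return [part.strip() for part in parts if part.strip()]


def build_context_bag(text, target):
    """Collect words from each sentence containing the target word,
    plus the sentence before it and the sentence after it."""
    sentences = split_sentences(text)
    bag = set()
    for i, sentence in enumerate(sentences):
        if target in sentence.split():
            # Window: previous, current and following sentence.
            for neighbour in sentences[max(0, i - 1): i + 2]:
                bag.update(neighbour.split())
    bag.discard(target)  # the target itself is not a context word
    return bag


TEXT = "कक्षा में गणित पढ़ाया गया| सात का वर्ग उनचास है| बच्चे खुश थे|"
print(sorted(build_context_bag(TEXT, "वर्ग")))
```

Using a set means repeated words count once, which matches the description of the Tokenizer module as finding unique tokens.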

4.2 Implementation Modules-

  • BuildContext: This module builds the context bag from the input document.
  • NounSemanticExtractor: This module builds the semantic bag by exploiting the semantic relations in the Wordnet. Input to this module is the polysemous word.
  • Tokenizer: This module finds the unique tokens from the input document. This is an intermediate module required by BuildContext and NounSemanticExtractor.
  • Intersection: This computes the overlap between the two input bags.
  • Rank: This ranks the senses according to the amount of intersection.
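As a rough illustration of what a module like NounSemanticExtractor might do, the sketch below gathers a semantic bag for one sense from a small in-memory structure. The dictionary layout, the sense ids, and the glosses and example sentences are all invented for this example; the actual module reads this information through the Wordnet API.

```python
# Toy lexical data keyed by hypothetical sense ids. The glosses and
# example sentences are invented and are not Hindi Wordnet entries.
TOY_WORDNET = {
    "house": {
        "synonyms": ["घर", "गृह"],
        "gloss": "रहने का स्थान",
        "example": "यह मेरा घर है",
        "hypernyms": ["dwelling"],
        "hyponyms": ["inn"],
        "meronyms": ["courtyard"],
    },
    "dwelling": {"synonyms": ["आवास", "निवास"], "gloss": "बसने की जगह"},
    "inn": {"synonyms": ["सराय"], "gloss": "यात्रियों के ठहरने का स्थान"},
    "courtyard": {"synonyms": ["आँगन"], "gloss": "घर के बीच की खुली जगह"},
}

RELATIONS = ("hypernyms", "hyponyms", "meronyms")


def words_of(synset):
    """Synonyms plus the tokens of the gloss and the example sentence."""
    words = set(synset.get("synonyms", []))
    words.update(synset.get("gloss", "").split())
    words.update(synset.get("example", "").split())
    return words


def semantic_bag(wordnet, sense_id):
    """Collect words from a sense and from its directly related synsets."""
    synset = wordnet[sense_id]
    bag = words_of(synset)
    for relation in RELATIONS:
        for related_id in synset.get(relation, []):
            bag |= words_of(wordnet[related_id])
    return bag


print(sorted(semantic_bag(TOY_WORDNET, "house")))
```

Only directly related synsets are visited here; whether the system follows relations further up or down the hierarchy is not stated in the text.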

5. Applications

Word sense disambiguation, the task of removing the ambiguity of a word in context, is important for many NLP applications, such as:

  • Information retrieval
  • Machine translation
  • Speech processing and part of speech tagging
  • Text processing

5.1 Information retrieval-

WSD helps in improving term indexing in information retrieval: it has been shown that word senses improve retrieval performance if the senses are included as index terms [6]. Thus, documents should not be ranked based on words alone; they should be ranked based on word senses, or on a combination of word senses and words.

For example: Using different indexes for the keyword “Java” as “programming language”, as “type of coffee”, and as “location” will improve the accuracy of an IR system. Apart from indexing, WSD also helps in query expansion. Short queries are expanded using words that belong to the same sets. Retrieval using expanded queries gives better results than retrieval using the original queries. Thus, WSD is crucial for improving the accuracy of IR as it eliminates irrelevant hits.

5.2 Speech Processing and Part of Speech Tagging-

WSD helps in speech recognition, i.e., when processing homophones, words which are spelled differently but pronounced the same way. For example: “base” and “bass” or “sealing” and “ceiling”.

5.3 Machine Translation-

WSD is important for machine translation. It helps in better understanding of the source language and generation of sentences in the target language. It also affects lexical choice depending upon the usage context.

5.4 Text Processing-

WSD helps in text-to-speech translation, i.e., when words are pronounced in more than one way depending on their meaning. For example: “lead” can be “in front of” or “type of metal”.

6. Conclusion

In this paper we have used the Hindi Wordnet for a fundamental NLP task, viz., disambiguation of Hindi words. To our knowledge, this is the first attempt at automatic WSD for an Indian language and is a significant step towards Indian language processing.

As can be seen, our accuracy values range from about 40% to about 70% [7]. The performance can surely be improved if morphology is handled exhaustively. The system currently does not detect the underlying similarity in presence of morphological variations. Since Indian languages are rich in morphology, exhaustive pre-processing for morphology is crucial in the whole WSD process.

Our system currently deals with only nouns. Work is in progress to include words of other parts of speech. The obstacle there is the shallowness of the lexical network for non-noun words. With the enrichment of, for example, the verb hierarchy [3], the system performance is expected to improve.

References

[1] Mark Stevenson, “Word Sense Disambiguation: Natural Language Processing Group”, University of Sheffield, UK.

[2] Mitesh M. Khapra, “Word Sense Disambiguation”.

[3] Rada Mihalcea and Ted Pedersen, “Slides from AAAI tutorial: Advances in Word Sense Disambiguation”.

[4] Sin-Jae Kang, “Corpus based Ontology for Word Sense Disambiguation”.

[5] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa, “Knowledge-Based WSD on Specific Domains: Performing Better than Generic Supervised WSD”, Informatika Fakultatea, University of the Basque Country, 20018 Donostia, Basque Country.

[6] Avneet Kaur, “Development of an Approach for Disambiguating Ambiguous Hindi postposition”, Department of Computer Science, Punjab University, India.

[7] Parul Rastogi and Dr. S.K. Dwivedi, “Performance comparison of Word Sense Disambiguation (WSD) Algorithm on Hindi Language Supporting Search Engines”, BabaSaheb BhimRao Ambedkar University, Lucknow, UP, India.

Author Biographies

Dr. Pankaj Kumar has received his Doctor of Philosophy (PhD) degree in Computer Application from Integral University, Lucknow. His area of expertise is Parallel Computing and Memory Architecture of Parallel Computers. Many of his research papers in the area of Parallel Computing have been published in various national and international journals and IEEE proceedings. He is a reviewer for six different international journals and a member of the editorial board of two different international journals.

Atul Vishwakarma is pursuing his Bachelor of Technology in Computer Science and Engineering. Currently he is working on final year project “Word Sense Disambiguation in Hindi Language using Web Mining.” His areas of interest are Data Base Management Systems (DBMS), Operating Systems and Data Structures.

Ashwani Kr. Verma is pursuing Bachelor of Technology in Computer Science and Engineering. For his final year project he is working on “Word Sense Disambiguation in Hindi Language using Web Mining” with his project partner Atul Vishwakarma under the Guidance of Dr. Pankaj Kumar. His areas of interest are Operating Systems and Data Base Management Systems (DBMS).
