Building Vowel Sandhi Viccheda System for Sanskrit

Rupali Deshmukh, Varunakshi Bhojane

Dept. of Computer Engineering,
Mumbai University,
Maharashtra, India.

ABSTRACT

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. The conjunction, or union, of two adjacent sounds is called sandhi formation. Sandhi splitting is the reverse process, in which the merged sound is split to recover the two original words. Sandhi splitting is one subtask in the complete analysis of input text in a natural language. The proposed system is an enhanced version that recognizes sandhi words in Sanskrit input text, splits each sandhi word into its original form, and returns the type of sandhi.

Keywords

NLP, sandhi, sandhi-splitter, sandhi-viccheda, Sanskrit

INTRODUCTION

A natural language is any language that humans can understand, such as English, Marathi, Punjabi, Tamil or Hindi, whereas a computer understands only machine language. If we want a computer to understand human language, we have to convert natural language into machine language. Natural Language Processing (NLP) supports this interaction between humans and computers. NLP is a very important topic in today's world of the internet. A large amount of information is now available on the web in the form of text, and obtaining information from this text and using it for various purposes is an important concern. This is the motivation for understanding NLP. NLP requires many preprocessing stages to analyze, understand and generate natural language. Each of these may form a subtask, which can itself be used in various applications. A natural language system can provide interfaces for travel agent navigation, and can even provide user-friendly interfaces with speech recognition for blind people.

The roots of all languages of the Indo-European family in India can be traced to Sanskrit. Sanskrit is the ancient language of India, and most Indian languages are derived from it. It is considered the mother of all languages. The language is also known for its clarity and beauty. Sanskrit literature is the richest literature in the history of humankind. The famous Vedas in Sanskrit are the Rig-Veda, Yajur-Veda, Sama-Veda and Atharva-Veda. The origins of Ayurveda are found in the Atharva-Veda, and great knowledge of medical science is encoded in the Sanskrit language. The two great epics are the Ramayana and the Mahabharata. As Sanskrit is a heritage language, there is a need to digitize, transfer and preserve ancient texts. Sanskrit is the classical language of India, and the Indian grammarian Panini developed a grammar for it. His well-known Sanskrit grammar is the Ashtadhyayi. Panini described around 4000 sutras, the construction of sentences, and classes of basic elements such as noun, verb, vowel and consonant. In Paninian grammar, the word “samhita” is used instead of the word “sandhi”.

Sandhi makes pronunciation easier. In sandhi formation, sounds at word boundaries may change: the last sound of the first word and the first sound of the second word are replaced by a particular sound according to the rules. Sandhi splitting is the exact opposite of sandhi formation. A sandhi splitter is important because it simplifies the text. Example of sandhi splitting: तथास्तु = तथा + अस्तु

EXISTING SYSTEMS

Some existing systems can split sandhi words, and some software is available for generating sandhi words.

  1. SANSKRIT ANALYSIS SYSTEM (SAS) [1]

This project was developed by Girish N. Jha, Sudhir Mishra, R. Chandrashekar, Priti Bhowmik, Subash, Sachin Mendiratta and Muktanand at Jawaharlal Nehru University, New Delhi. The system takes input from the keyboard or from a Devanagari Unicode converter, and gives output along with the sandhi type. It works as a sandhi splitter as well as a sandhi generator. The authors used Paninian rules to derive the reverse computation of sandhi rules. The system often gives multiple results at a time.

  2. TDIL: SANDHI SPLITTER [10]

This system was developed by a consortium of 7 institutes: University of Hyderabad; Jawaharlal Nehru University; IIIT-Hyderabad; Sanskrit Academy, Hyderabad; Poornaprajna Vidyapeetha, Bangalore; Rashtriya Sanskrit Vidyapeetha, Tirupati; JRR Sanskrit University, Jaipur. No document is provided stating which approach or algorithm was used. The system can take input in multiple encodings, such as Unicode Devanagari, Itrans-5.3, Velthuis, WX-alphabetic, KH, and SLP. It cannot take a whole input file, or even a single sentence; it accepts only one word as input. It does not state the type of sandhi.

SANDHI SPLITTER

Sanskrit is a prime example of a language that joins words together to form compound words. Every morphological analyzer requires words as input, but there is no word boundary marker in a continuous string. Before morphological analysis, the sandhi in the text must be split and each distinct word identified. The sandhi analyzer therefore needs to recognize all meaningful words.

In Sanskrit, there are three important classes of ‘sandhi’: 1) ‘Ach sandhi’, 2) ‘Hal sandhi’ and 3) ‘Visarga sandhi’. ‘Ach sandhi’ and ‘Visarga sandhi’ work with vowels, and ‘Hal sandhi’ works with consonants.

Fig. 1 Classification of sandhi in Sanskrit

The proposed system takes input text in Sanskrit, recognizes vowel sandhi words, splits those words according to the specified rules, and generates the output. The proposed system is developed for vowel sandhi only. An overview of the system is as follows.

  • Input Text:

The system takes an input text file and passes it to the next block for further processing.

Input: तथापिकृष्णचन्द्रःसन्तुष्टःनआसीत्|

Output: तथापिकृष्णचन्द्रःसन्तुष्टःनआसीत्|

  • Viccheda Eligibility Tests (Pre-Processing): The sandhi pre-processor marks and normalizes punctuation (‘|’, ‘,’ etc.) in the input text before the text is processed for sandhi segmentation.

Input: तथापिकृष्णचन्द्रःसन्तुष्टःनआसीत्|

Output: तथापिकृष्णचन्द्रःसन्तुष्टःनआसीत्
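The pre-processing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `preprocess` and the exact punctuation set are assumptions, and the sketch only strips punctuation and normalizes whitespace rather than marking it.

```python
import re

# Assumed punctuation set: "|" as used in the example above, the Devanagari
# danda and double danda, and a few common marks.
PUNCTUATION = "|।॥,;?!"

def preprocess(text: str) -> str:
    """Strip sentence punctuation and collapse runs of whitespace."""
    cleaned = "".join(ch for ch in text if ch not in PUNCTUATION)
    return re.sub(r"\s+", " ", cleaned).strip()

print(preprocess("न आसीत्|"))
```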

  • Categorization:

In this block, the system tags each word as a noun, an avyaya or a verb. For this purpose it uses a word list.

Input: तथापिकृष्णचन्द्रःसन्तुष्टःनआसीत्

Output: तथापि कृष्णचन्द्रः(noun) सन्तुष्टः(verb) न(avyaya) आसीत्(verb)
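A word-list lookup of this kind might look like the sketch below. The `WORD_LIST` dictionary and the `categorize` function are hypothetical; the tags are copied from the example above, and the sketch assumes the words arrive separated by whitespace.

```python
# Hypothetical word list; a real system would load far larger lists from files.
WORD_LIST = {
    "कृष्णचन्द्रः": "noun",
    "सन्तुष्टः": "verb",   # tag as given in the example output above
    "न": "avyaya",
    "आसीत्": "verb",
}

def categorize(text):
    """Tag each whitespace-separated word; words not in the list get None."""
    return [(word, WORD_LIST.get(word)) for word in text.split()]

for word, tag in categorize("तथापि कृष्णचन्द्रः सन्तुष्टः न आसीत्"):
    print(word, tag)
```

Words that come back untagged (here तथापि) are the ones left over for the sandhi recognizer.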

  • Sandhi Recognizer: After all avyaya, verbs and nouns are removed, the remaining words are processed further. The exception dataset is then used to filter out exception words.

Input: तथापि

Output: तथापि (not exception)
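The recognizer can be pictured as a simple filter. In this sketch the function name `sandhi_candidates` and the sample exception entry are assumptions made for illustration; the paper does not list the contents of its exception dataset.

```python
def sandhi_candidates(tagged_words, exceptions):
    """Keep words that carry no tag and are not listed as exceptions."""
    return [word for word, tag in tagged_words
            if tag is None and word not in exceptions]

# Hypothetical exception entry: a word containing a marker sound that
# should nevertheless not be split.
exceptions = {"माता"}
tagged = [("तथापि", None), ("न", "avyaya"), ("माता", None)]
print(sandhi_candidates(tagged, exceptions))
```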

  • Sandhi Splitter (Sandhi Viccheda): Reverse computational rules of Panini’s rules are created and stored in the sandhi rule base. This module searches for the markers where a split is possible and lists all of them.

Input: तथापि

Output: markers ा, ि
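The marker search for this example can be sketched as below. The marker list is a two-entry subset chosen to match the example, and `find_markers` is a hypothetical name. Python strings index by Unicode code point, so dependent vowel signs such as ा and ि are separate elements that can be matched directly.

```python
# Assumed subset of markers, just enough for the example word.
MARKERS = ["ा", "ि"]

def find_markers(word, markers=MARKERS):
    """Return (index, marker) pairs found while scanning from the left."""
    hits = []
    for i in range(len(word)):
        for marker in markers:
            if word.startswith(marker, i):
                hits.append((i, marker))
    return hits

print(find_markers("तथापि"))
```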

  • Sandhi Rule Base: After pre-processing, the system checks the sandhi rule base in the database to mark the resultant sandhi sounds (markers) for potential splitting and to identify the sandhi patterns for viccheda corresponding to each marked sound.

The rule base consists of three things: the marker, its corresponding pattern, and the respective sandhi name. Here the ‘marker’ is the sound at which a sandhi split is possible, and the ‘pattern’ is the resulting sandhi rule, that is, which sounds are added or substituted at the last position of the first word and the first position of the second word. Example: ा=ा+आ is a reverse rule of dirgha sandhi.

Here (ा) is the marker and (ा+आ) is the corresponding pattern. The markers and patterns in the rule base are based on the Paninian grammar of generative sandhi, but they are not exactly the reverse of the forward sandhi formalism.

For example, in forward yan sandhi, /इ/ or /ई/ is changed to /य्/ followed by the next vowel, but in the reverse rule base /्य/ is stored as a marker, which is replaced by its pattern /ि/+/अ/.
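One way to hold the three parts of a rule (marker, pattern, sandhi name) is a small record type. The sketch below is an assumption about representation only: the class `SandhiRule`, its field names, and the in-memory list stand in for whatever database schema the system actually uses. The two sample rules are the ones quoted in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SandhiRule:
    marker: str        # sound at which a split is possible
    left_suffix: str   # what the first word ends with after the split
    right_prefix: str  # what the second word begins with after the split
    sandhi_name: str

# Hypothetical in-memory rule base with the two rules quoted in the text.
RULE_BASE = [
    SandhiRule("ा", "ा", "आ", "dirgha"),  # ा = ा + आ
    SandhiRule("्य", "ि", "अ", "yan"),    # ्य = ि + अ
]

def rules_for(marker, rule_base=RULE_BASE):
    """Return every rule stored for the given marker."""
    return [rule for rule in rule_base if rule.marker == marker]

print(rules_for("ा"))
```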

Fig. 2 Design of sandhi splitter for Sanskrit

The rule base has been built up in the following format:

A left search looks for a marker starting from the left side of the word. If a marker is found, it is replaced by its corresponding pattern according to the rule.

[Rule format diagram; labels: left_search, marker_found, viccheda pattern, add suffix, add prefix]

Illustration: input word तथापि

Step 1: Search for markers from the left side: ा, ि

Step 2: Search for the pattern corresponding to each marker, one by one.

Step 3: At each stage of pattern replacement, the segmented words are validated by a lexical check.

Step 3.1: If both words are found in the dictionary, the system returns them as output.

Otherwise, the program looks for the next marker (go to Step 1) until a segmentation of the word is validated.

Step 4: If no segmentation is validated, the input is returned as it is.
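The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: `RULES` holds only four hypothetical reverse dirgha rules for the marker ा (an empty left suffix stands for the inherent अ of the preceding consonant), and `LEXICON` is a two-word stand-in for the dictionary.

```python
# (marker, left_suffix, right_prefix, sandhi_name); hypothetical subset.
RULES = [
    ("ा", "", "अ", "dirgha"),
    ("ा", "", "आ", "dirgha"),
    ("ा", "ा", "अ", "dirgha"),
    ("ा", "ा", "आ", "dirgha"),
]
LEXICON = {"तथा", "अपि"}  # stand-in for the dictionary

def split_sandhi(word, rules=RULES, lexicon=LEXICON):
    """Return ([left, right], name) for the first validated split,
    or ([word], None) if no segmentation passes the lexical check."""
    for i in range(len(word)):                      # Step 1: left search
        for marker, left_suffix, right_prefix, name in rules:
            if not word.startswith(marker, i):
                continue
            # Step 2: replace the marker by its pattern.
            left = word[:i] + left_suffix
            right = right_prefix + word[i + len(marker):]
            # Step 3: lexical check on both segments.
            if left in lexicon and right in lexicon:
                return [left, right], name
    return [word], None                             # Step 4

print(split_sandhi("तथापि"))
```

Returning the first validated split mirrors Step 3.1; a variant that collects every validated split would be needed to report multiple analyses.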

Table 1: Outline of reverse yan sandhi (यणसन्धि)

Marker / Pattern
्य / ि+अ
्या / ि+आ
्यु / ि+उ
्यू / ि+ऊ
्ये / ि+ए
्यै / ि+ऐ
्यो / ि+ओ
्यौ / ि+औ
्यृ / ि+ॠ
्यु / ी+उ
्य / ी+अ
्या / ी+आ
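To show how rows of Table 1 could be applied, the sketch below encodes a few of them and generates candidate splits. The word इत्यादि (इति + आदि) is an illustrative example that does not appear in the paper, and `YAN_RULES` and `yan_candidates` are hypothetical names. Because the short marker ्य also matches inside ्या, some candidates are ill-formed; the sketch assumes the later dictionary check discards them.

```python
# A few rows of Table 1 as (marker, left_suffix, right_prefix).
YAN_RULES = [
    ("्य", "ि", "अ"),
    ("्या", "ि", "आ"),
    ("्यु", "ि", "उ"),
    ("्य", "ी", "अ"),
    ("्या", "ी", "आ"),
]

def yan_candidates(word, rules=YAN_RULES):
    """List every (left, right) pair produced by reverse yan rules."""
    candidates = []
    for i in range(len(word)):
        for marker, left_suffix, right_prefix in rules:
            if word.startswith(marker, i):
                left = word[:i] + left_suffix
                right = right_prefix + word[i + len(marker):]
                candidates.append((left, right))
    return candidates

for left, right in yan_candidates("इत्यादि"):
    print(left, "+", right)
```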

Similarly, there are many rules for the other kinds of sandhi.

Input: तथापि

Output: तथ+अपि, तथ+आपि, तथा+अपि, तथा+आपि

  • Search in Dictionary: The system checks whether the obtained words are present in the dictionary. If they are present, the words are valid.
  • Output: The system gives the generated output.

Output: तथा+अपि (दीर्घ सन्धि)

The proposed system gives its output in split form along with the type of sandhi.

RESULTS AND DISCUSSION

The proposed sandhi splitter system was tested on multiple documents in Devanagari script. Figure 3 gives a general idea of the result analysis for three documents.

Fig. 3 Result analysis of different documents

The result analysis of the proposed system in terms of precision, recall and accuracy is shown in Figure 4. The possible classification cases are as follows:

  • True Positives (TP): number of positive examples, labeled as such.
  • False Positives (FP): number of negative examples, labeled as positive.
  • True Negatives (TN): number of negative examples, labeled as such.
  • False Negatives (FN): number of positive examples, labeled as negative.

Fig. 4 Result analysis of Sandhi splitter system

Precision is the fraction of the results returned by the system that are correct.

Precision = TP / (TP+FP)

Recall is the fraction of the actual positives that the system returns.

Recall = TP / (TP+FN)

Accuracy = (TP+TN) / (TP+FP+FN+TN)
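The three formulas translate directly into code. In the sketch below the function name `evaluate` is an assumption, and the counts in the usage line are made-up numbers for illustration only; they are not results from the paper.

```python
def evaluate(tp, fp, tn, fn):
    """Compute precision, recall and accuracy from the four counts.
    A ratio with a zero denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy

# Illustrative counts only, not measured values.
precision, recall, accuracy = evaluate(tp=8, fp=2, tn=5, fn=5)
print(precision, recall, accuracy)
```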

In general there is a trade-off between the two: as recall increases, precision tends to decrease, and vice versa.

CONCLUSION

Sanskrit is the carrier of an unbroken knowledge tradition from Vedic times to the present. Modern Indian languages can benefit from the profound knowledge in the texts of the Indian intellectual tradition, so automatic translation from Sanskrit to Indian languages is highly desirable. No automatic translation from Sanskrit is possible without building analysis tools of this kind. Unfortunately, such tools for Sanskrit either do not exist or are still at a developing stage. Developing a large tool for the Sanskrit language requires smaller ones first. The sandhi splitter therefore plays a very important role, because it is a pre-processing step in Natural Language Processing applications for Indian languages.

One of the key tasks in sandhi splitting is identifying the correct split position. This system uses reverse sandhi rules for sandhi splitting. The rule-based approach mostly relies on hand-written transfer rules for the analysis of sandhi. For developing a sandhi analysis and splitting system, the rule-based approach is the most effective because of the ease of building the system. It has some limitations, however: it requires correct linguistic rules formulated by experts.

The accuracy of this system depends entirely on the reverse sandhi rules and the datasets used. The accuracy can therefore be improved by extending the datasets and adding effective rules.

REFERENCES

[1] Girish N. Jha, Sudhir K. Mishra, R. Chandrashekar, Priti Bhowmik, Subash, Sachin Mendiratta, Muktanand, “Developing a Sanskrit Analysis System for Machine Translation”, Special Center for Sanskrit Studies, Jawaharlal Nehru University, New Delhi-110067

[2] Joshi Shripad S., “Sandhi Splitting of Marathi Compound Words”, International Journal on Advance Computer Theory and Engg, Vol. 2, Issue 2, 2012

[3] Akshar Bharati, Amba Kulkarni, V Sheeba, “Building a Wide Coverage Sanskrit Morphological Analyzer: A Practical Approach”, Rashtriya Sanskrit Vidyapeetha (Deemed University), Tirupati, 2006

[4] Raj Dabre, Archana Amberkar, Pushpak Bhattacharyya, “A Way to Break Them All: A Compound Word Analyzer for Marathi”, ICON, Noida, India, 18-20 December, 2013

[5] Ravi Pal, Dr. U. C. Jaiswal, “Design & Analysis of an Exhaustive Algorithm for Sandhi Processing In Sanskrit”, International Journal of Engineering Research and Development, Volume 4, Issue 8, November 2012

[6] Gerard Huet, “Towards Computational Processing of Sanskrit”,

[7] Girish N. Jha, Priti Bhowmik, Sudhir K. Mishra, R. Chandrashekar, Subash, Sachin Mendiratta, Muktanand, “Towards a Computational Analysis System for Sanskrit”, Special Center for Sanskrit Studies, Jawaharlal Nehru University, New Delhi-110067

[8] Prof. Deepak Mane, Aniket Hirve, “Study of Various Approaches in Machine Translation for Sanskrit Language”, International Journal of Advancements in Research & Technology, Volume 2, Issue 4, April 2013

Websites

[9]

[10]

[11]