JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
COMPUTER ENGINEERING
TOWARDS IMPROVEMENT IN GUJARATI TEXT INFORMATION RETRIEVAL BY USING EFFECTIVE GUJARATI STEMMER
1 MR. K. A. CHAUHAN, 2 MR. R. S. PATEL, 3 PROF. H. J. JOSHI
1 M.Tech. [Web Technology] Student, Department of Computer Science, Gujarat University, Ahmedabad, Gujarat
2 M.Tech. [Web Technology] Student, Department of Computer Science, Gujarat University, Ahmedabad, Gujarat
3 Asst. Professor, Department of Computer Science, Gujarat University, Ahmedabad, Gujarat
ISSN: 0975 –6760| NOV 12 TO OCT 13 | VOLUME – 02, ISSUE – 02 Page 222
ABSTRACT: Stemming is a well-known Information Retrieval (IR) technique used to improve the effectiveness of text retrieval systems. In this paper we carry out a stemming experiment for the Gujarati ad hoc monolingual information retrieval task. Our method identifies classes of words that share a common suffix and replaces each word with its root. We analysed a Gujarati news corpus in depth to identify the possible word suffixes, and we adopted a rule-based approach grounded in Gujarati morphology. We evaluate the results using Mean Average Precision (MAP). The experimental results show that our Gujarati stemmer improves retrieval results.
Keywords: Automatic Indexing, Corpus, Gujarati Information Retrieval, Mean Average Precision, Gujarati Stemmer.
I INTRODUCTION
Stemming is the technique of reducing inflected or derived words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
Our aim is to build a stemmer for Gujarati nouns and proper names: we derive rules for noun and proper-name stemming and apply them to obtain the stem of a Gujarati word. With the stemmed words we then evaluate the Mean Average Precision (MAP) for the ad hoc monolingual Gujarati information retrieval task [7].
II STEMMER BACKGROUND & RELATED WORK
Julie Beth Lovins developed an English stemmer in 1968 [13]. The Porter stemming algorithm (Martin Porter, 1980) is the most widely used, and is considered among the most effective, stemming algorithms for English [19]. Both algorithms are used mainly for English. Many other stemmers for English exist as well [10][5].
For Indian languages, the earliest reported work is by Ramanathan and Rao (2003), who used a hand-crafted suffix list and performed longest-match stripping to build a Hindi stemmer [21]. Majumder et al. (2007) developed a statistical approach, YASS (Yet Another Suffix Stripper), which uses clustering based on string distance measures and requires no linguistic knowledge of any particular language. Their conclusion was that stemming improves the recall of IR systems for Indian languages such as Bengali [17]. Dasgupta and Ng (2007) worked on morphological parsing for Bengali [6]. These are a few of the works that motivated stemmer development for Indian languages.
III INFORMATION RETRIEVAL
Information retrieval (IR) is concerned with representing and searching large collections of electronic text data in unstructured form. It is the discipline that deals with the retrieval of unstructured or partially structured data, generally textual documents, in response to a set of queries (also known as topic statements), which may themselves be unstructured [9]. The interaction between a user and an IR system can be modelled as the user submitting a query to the system; the effectiveness of the system is evaluated from the number of relevant documents it returns. An effective IR system retrieves as many relevant documents, and as few non-relevant documents, as possible.
Effective techniques for automated IR are needed because of the rapid growth in the volume of text documents and the growing value of document sources on the Internet. Over the last few years, the amount of text in Indian languages has grown significantly. Researchers have been performing IR tasks in English and European languages for many years [2], [25], and efforts are being made to encourage IR tasks for Indian languages [7], [24].
The IR research community uses resources (data sets) known as test collections [22]. Since the 1990s, TREC [25] has been conducting evaluation exercises using test collections. The classic components of a test collection are:
a) Documents: a collection of documents, each uniquely identified by a docid.
b) Queries: a set of queries, also referred to as topics, each uniquely identified by a qid.
c) Qrels: a set of relevance judgments, also referred to as qrels, consisting of a list of (qid, docid) pairs detailing the relevance of documents to topics.
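The relevance judgments described in item (c) can be held in memory as a mapping from each qid to its set of relevant docids. The sketch below assumes the common four-column TREC-style line layout (qid, iteration, docid, relevance); whether the FIRE judgments used here follow exactly this layout is an assumption, and the identifiers in the sample are made up.

```python
from collections import defaultdict

def load_qrels(lines):
    """Build {qid: set of relevant docids} from TREC-style judgment lines."""
    qrels = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docid, relevance = parts
        if int(relevance) > 0:  # keep only documents judged relevant
            qrels[qid].add(docid)
    return dict(qrels)

# Hypothetical judgment lines, for illustration only.
sample_lines = [
    "126 0 doc-001 1",
    "126 0 doc-002 0",
    "127 0 doc-003 1",
]
qrels = load_qrels(sample_lines)
```

Storing relevant docids as sets keeps the membership checks needed by precision-style measures cheap.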
In this paper, we describe the ad hoc monolingual IR task performed on a Gujarati language collection using an effective stemmer. In ad hoc querying, the user formulates any number of arbitrary queries but applies them to a fixed collection [7]. Gujarati is spoken by nearly 60 million people worldwide and is the official language of the state of Gujarat [15].
IV EXPERIMENTAL DATA
1) Corpus (Database)
The test collection used for the experiment is the one made available by FIRE in 2011 [7]. Details of the Gujarati collection are given in Table 1. The collection was created from the daily newspaper “Gujarat Samachar” from 2001 to 2010. Each document represents a news article from “Gujarat Samachar”. The average number of tokens per document is 445.
The corpus is encoded in UTF-8 and each article is marked up using the following tags:
<DOC> : Starting of the document
<DOCNO> </DOCNO> : Unique identifier of the document
<TEXT> </TEXT> : Contains the document text
</DOC> : Ending tag of the document
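As a rough illustration of how articles in this markup could be read, the sketch below extracts the identifier and text of each document with a regular expression. It assumes the tags appear in exactly the order listed above with no nested markup; the function name and the sample identifier are hypothetical, and real corpus files would first be opened with UTF-8 encoding.

```python
import re

# One article: <DOC>, then <DOCNO>...</DOCNO>, then <TEXT>...</TEXT>, then </DOC>.
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>(.*?)</DOCNO>\s*<TEXT>(.*?)</TEXT>\s*</DOC>",
    re.DOTALL,
)

def parse_documents(raw):
    """Map each DOCNO to its TEXT content, with surrounding whitespace removed."""
    return {m.group(1).strip(): m.group(2).strip() for m in DOC_RE.finditer(raw)}

# Made-up sample in the corpus markup.
sample = """<DOC>
<DOCNO>doc-001</DOCNO>
<TEXT>
ગુજરાત યુનીવર્સીટી
</TEXT>
</DOC>"""
docs = parse_documents(sample)
```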
Particulars / Quantity
Size of Collection / 2.7 GB
Number of Text Documents / 3,13,163
Size of Vocabulary / 20,92,619
Number of Tokens / 13,92,72,906
Table 1 Statistics of Gujarati Corpus
2) Queries
IR models were tested against 50 different queries in the Gujarati language. Following the TREC model [12], every query is divided into three sections: the title (T), the description (D), which gives a one- or two-sentence description, and the narrative (N), which briefly specifies the relevance assessment criteria. The structure of a single query in the collection is shown below.
<top>
<num> </num> : Unique identifier of the Query
<title> </title> : Title of the Query
<desc> </desc> : Short description of the Query
<narr> </narr> : Detailed Query in narrative form
</top>
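The topic layout above maps naturally onto the T, TD and TDN query variants evaluated later in the paper. The following sketch, with hypothetical function names and a made-up English topic, shows one way the three query strings might be assembled; how the actual runs combined the fields is not stated in the paper, so plain concatenation is an assumption.

```python
import re

def parse_topic(raw):
    """Extract the num, title, desc and narr fields of one <top> block."""
    fields = {}
    for tag in ("num", "title", "desc", "narr"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", raw, re.DOTALL)
        fields[tag] = match.group(1).strip() if match else ""
    return fields

def build_queries(topic):
    """Assemble the T, TD and TDN variants by simple concatenation (assumption)."""
    t = topic["title"]
    td = f"{t} {topic['desc']}".strip()
    tdn = f"{td} {topic['narr']}".strip()
    return {"T": t, "TD": td, "TDN": tdn}

# Made-up topic, for illustration only.
sample_topic = """<top>
<num>126</num>
<title>kite festival</title>
<desc>Find reports about the kite festival.</desc>
<narr>Relevant documents describe festival events.</narr>
</top>"""
queries = build_queries(parse_topic(sample_topic))
```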
V OUR APPROACH FOR STEMMER
The Gujarati phoneme set consists of eight vowels and twenty-four consonants. Gujarati is morphologically rich; that is, grammatical information is encoded by affixation rather than by independent free-standing morphemes.
For our experiment we created a list of Gujarati suffixes containing postpositions and inflectional suffixes for verbs, nouns, adjectives and adverbs. The stemmer performs suffix stripping using this list.
We first obtained results without the stemmer and then with the stemmer. In both experiments we removed 280 stopwords. Stopwords are words that occur frequently in documents but carry little meaning. Stopword removal is itself a technique that improves MAP for Gujarati information retrieval [13].
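Stopword removal of this kind amounts to a set-membership filter applied to the token stream before indexing. The sketch below uses whitespace tokenization and a two-word stand-in list; both are assumptions, since the paper's 280-word list and its tokenizer are not reproduced here.

```python
# Illustrative stopwords only ("and", "is"); not the paper's 280-word list.
STOPWORDS = {"અને", "છે"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [token for token in tokens if token not in stopwords]

tokens = "ગુજરાત અને ભારત".split()  # whitespace tokenization (assumption)
filtered = remove_stopwords(tokens)
```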
Table 2 shows examples of suffix removal. In total we strip about 30 suffixes, which affect a large number of words. After removing these suffixes, the index size is also reduced [18] and the IR system becomes more effective at retrieving relevant documents.
Suffix to Remove / Gujarati Word (Example) / Stem Word
O ઓ / Fal-o ફળો / Fal ફળ
Thee થી / Gujarat-thi ગુજરાત-થી / Gujarat ગુજરાત
E એ / Aakash-e આકાશ-એ / Aakash આકાશ
Ne ને / Bharat-ne ભારત-ને / Bharat ભારત
Ee ઈ / Amadavad-ee અમદાવાદી / Amadavad અમદાવાદ
Vana વાના / Atakav-vana અટકાવવાના / Atakav અટકાવ
Ni (nee) ની / University-nee યુનીવર્સીટી-ની / University યુનીવર્સીટી
Na ના / Vidhyarthi-na વિદ્યાર્થી-ના / Vidhyarthi વિદ્યાર્થી
Ona ઓના / Karmachari-ona કર્મચારીઓના / Karmachari કર્મચારી
Ma મા / Utrayan-ma ઉતરાયણ-મા / Utrayan ઉતરાયણ
Na ના / Rel-na રેલ-ના / Rel રેલ
Othee ઓથી / Suvidha-othee સુવિધા-ઓથી / Suvidha સુવિધા
Table 2 Examples of Suffix Removal
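The suffix removal illustrated in Table 2 can be sketched as a single-pass, longest-match suffix stripper. This is a minimal illustration, not the authors' implementation: it uses only the suffixes shown in Table 2 rather than the full list of about 30, it adds the dependent vowel signs for o, e and ii on the assumption that these vowels attach to a preceding consonant in running text, and the minimum stem length is a hypothetical guard.

```python
# Suffixes taken from Table 2 only; the paper's full list (about 30) is not given.
TABLE2_SUFFIXES = ["ઓ", "થી", "એ", "ને", "ઈ", "વાના", "ની", "ના", "ઓના", "મા", "ઓથી"]

# Assumption: after a consonant the vowels o, e and ii are written as dependent
# vowel signs (U+0ACB, U+0AC7, U+0AC0), so those forms are matched as well.
VOWEL_SIGN_SUFFIXES = ["\u0acb", "\u0ac7", "\u0ac0"]

# Longest suffixes first, so that a three-character suffix is tried before
# the shorter suffix it ends with.
SUFFIXES = sorted(TABLE2_SUFFIXES + VOWEL_SIGN_SUFFIXES, key=len, reverse=True)

MIN_STEM_LENGTH = 2  # hypothetical guard, counted in Unicode code points

def stem(word):
    """Strip the longest matching suffix once; otherwise return the word unchanged."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

examples = ["ગુજરાતથી", "કર્મચારીઓના", "અટકાવવાના", "ફળો"]
stems = [stem(word) for word in examples]
```

A purely suffix-driven rule set like this will over-stem some words; for example, a root that itself ends in થી would lose its final syllable. A practical stemmer would need exception handling beyond what is sketched here.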
VI IR MODELS
In this experiment, we compared various models that are commonly used in FIRE test collection evaluation work. We considered classical models such as the Term Frequency-Inverse Document Frequency (TF-IDF) model; language models such as the Hiemstra Language Model (Hiemstra_LM) [11]; probabilistic models such as Okapi BM25 [1]; and the Divergence from Randomness (DFR) group of models. The DFR group includes the Bose-Einstein model for randomness, which uses the ratio of two Bernoulli processes for first normalization and Normalization 2 for term frequency normalization (BB2); the DLH hyper-geometric DFR model (DLH) and its improvement (DLH13); the Divergence From Independence model (DFI0); a different, parameter-free hyper-geometric DFR model using Popper's normalization (DPH); a DFR-based hyper-geometric model that takes the average of two information measures (DFRee); the Inverse Term Frequency model with Bernoulli after-effect (IFB2); the Inverse Expected Document Frequency model with Bernoulli after-effect (In_expC2); the Inverse Document Frequency model with Laplace after-effect (InL2); the Poisson model with Laplace after-effect (PL2); a log-logistic DFR model (LGD) [4]; and an unsupervised DFR model that computes the inner product of Pearson's X^2. The experiment used 21 different models to perform information retrieval on Gujarati text documents. A few of the models require parameter values; we used the default values that are generally applied in similar tests.
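To make one of the listed models concrete, the sketch below scores tokenized documents against a query with a common formulation of Okapi BM25. The parameter values k1 = 1.2 and b = 0.75 and the non-negative IDF variant are assumptions chosen for illustration; they are not claimed to be the settings used in these experiments, and the toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (assumption).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturated by k1 and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy collection with placeholder tokens.
docs = [
    ["gujarat", "news"],
    ["cricket", "match"],
    ["gujarat", "gujarat", "election"],
]
scores = bm25_scores(["gujarat"], docs)
```

Because stemming maps inflected forms to a shared stem, it changes the term and document frequencies that feed such a scoring function, which is how it can affect the ranking.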
VII EVALUATION METRICS
IR systems are evaluated using measures such as Precision, Recall, R-Precision, E-Measure and Fallout [22]. Precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved, and fallout is the fraction of non-relevant documents that are retrieved. Mean Average Precision (MAP) is considered to give the best judgment in the presence of multiple queries [18].
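The three set-based measures defined above can be computed directly from the retrieved set, the relevant set and the collection size. The sketch below is illustrative only; the function name and the sample document identifiers are hypothetical.

```python
def precision_recall_fallout(retrieved, relevant, collection_size):
    """Return (precision, recall, fallout) for one query."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    non_relevant = collection_size - len(relevant)
    # Fallout: share of all non-relevant documents that were retrieved.
    fallout = (len(retrieved) - hits) / non_relevant if non_relevant else 0.0
    return precision, recall, fallout

# Made-up example: 4 documents retrieved, 3 relevant, collection of 10.
p, r, f = precision_recall_fallout(
    ["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"}, collection_size=10
)
```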
In our experiments, retrieval performance was evaluated using MAP. We evaluated the results separately for the title (T), the combination of title and description (TD), and the combination of title, description and narrative (TDN). Average Precision is the average of the precision values obtained for the set of top k documents after each relevant document is retrieved; this value is then averaged over multiple queries to obtain MAP. If the set of relevant documents for a query q_j in Q is {d_1, …, d_{m_j}} and R_{jk} is the set of ranked retrieval results from the top result down to document d_k, then MAP is calculated as in (1).
$$\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk}) \qquad (1)$$
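Equation (1) can be sketched in code as follows. The function names and sample rankings are hypothetical; relevant documents that never appear in a ranking contribute a precision of zero, which is the usual convention and is assumed here.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            total += hits / rank  # precision of the top `rank` results
    # Dividing by the number of relevant documents gives unretrieved ones a zero.
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision over all judged queries."""
    values = [average_precision(runs.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(values) / len(values) if values else 0.0

# Made-up rankings and judgments for two queries.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
judged = {"q1": {"d1", "d3"}, "q2": {"d6"}}
map_value = mean_average_precision(runs, judged)
```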
VIII RESULTS
A baseline of MAP values was obtained from Gujarati IR tasks using the different IR models [12][7]. The baseline MAP values are listed in Table 3 and the values obtained after stemming are listed in Table 4. The improvement in MAP after stemming is summarised in Table 5.
IR Model / MAP (T) / MAP (TD) / MAP (TDN)
BB2 / 0.2370 / 0.2687 / 0.2063
BM25 / 0.2251 / 0.2598 / 0.1964
DFI0 / 0.2236 / 0.2432 / 0.1641
DFR_BM25 / 0.2268 / 0.2613 / 0.1974
DFRee / 0.2214 / 0.2369 / 0.1499
DirichletLM / 0.2350 / 0.2477 / 0.1867
DLH / 0.2204 / 0.2465 / 0.1816
DLH13 / 0.2225 / 0.2419 / 0.1657
DPH / 0.2277 / 0.2500 / 0.1724
Hiemstra_LM / 0.2060 / 0.2405 / 0.1883
IFB2 / 0.2363 / 0.2696 / 0.2200
In_expB2 / 0.2366 / 0.2690 / 0.2144
In_expC2 / 0.2289 / 0.2655 / 0.2152
InB2 / 0.2351 / 0.2648 / 0.2025
InL2 / 0.2279 / 0.2595 / 0.1919
Js_KLs / 0.2236 / 0.2401 / 0.1560
LemurTF_IDF / 0.2096 / 0.2441 / 0.2104
LGD / 0.2236 / 0.2349 / 0.1590
PL2 / 0.2065 / 0.2486 / 0.1851
TF_IDF / 0.2247 / 0.2579 / 0.1918
XSqrA_M / 0.2262 / 0.2424 / 0.1591
TABLE 3 MAP VALUES WITHOUT USING STEMMER FOR GUJARATI TEXT DOCUMENTS
IR Model / MAP (T) / MAP (TD) / MAP (TDN)
BB2 / 0.2334 / 0.2750 / 0.2278
BM25 / 0.2219 / 0.2671 / 0.2184
DFI0 / 0.2247 / 0.2559 / 0.1964
DFR_BM25 / 0.2245 / 0.2691 / 0.2195
DFRee / 0.2232 / 0.2471 / 0.1907
DirichletLM / 0.2316 / 0.2508 / 0.2063
DLH / 0.2181 / 0.2543 / 0.2077
DLH13 / 0.2223 / 0.2543 / 0.2000
DPH / 0.2262 / 0.2595 / 0.2045
Hiemstra_LM / 0.2024 / 0.2434 / 0.2052
IFB2 / 0.2323 / 0.2748 / 0.2381
In_expB2 / 0.2329 / 0.2746 / 0.2338
In_expC2 / 0.2251 / 0.2712 / 0.2305
InB2 / 0.2326 / 0.2720 / 0.2249
InL2 / 0.2256 / 0.2671 / 0.2186
Js_KLs / 0.2249 / 0.2500 / 0.1938
LemurTF_IDF / 0.2058 / 0.2435 / 0.2222
LGD / 0.2223 / 0.2451 / 0.1971
PL2 / 0.2075 / 0.2531 / 0.2068
TF_IDF / 0.2222 / 0.2661 / 0.2162
XSqrA_M / 0.2270 / 0.2552 / 0.1968
TABLE 4 MAP VALUES USING STEMMER FOR GUJARATI TEXT DOCUMENTS
IX CONCLUSION
Our experiments show that using a stemmer on Gujarati text documents contributes to an increase in precision values in information retrieval tasks. The benefit of stemming grows as the query is enlarged from T to TDN: there is no improvement for T queries, about 5% for TD queries and nearly 13% for TDN queries (Table 5).
Query Type / T / TD / TDN
Improvement in MAP / No improvement / 5% / 13%
Table 5 Conclusion Table for MAP Improvement
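The per-model effect of stemming can be checked against Tables 3 and 4 by computing the relative change in MAP. The sketch below does this for three of the 21 models, using values copied from those tables. How the aggregate figures in Table 5 were derived is not stated in the paper, so the relative-change formula here is an assumption.

```python
QUERY_TYPES = ("T", "TD", "TDN")

# MAP values for three models, copied from Table 3 (baseline) and Table 4 (stemmed).
BASELINE = {
    "BB2": (0.2370, 0.2687, 0.2063),
    "BM25": (0.2251, 0.2598, 0.1964),
    "TF_IDF": (0.2247, 0.2579, 0.1918),
}
STEMMED = {
    "BB2": (0.2334, 0.2750, 0.2278),
    "BM25": (0.2219, 0.2671, 0.2184),
    "TF_IDF": (0.2222, 0.2661, 0.2162),
}

def relative_change_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

changes = {
    model: {
        query_type: relative_change_percent(base, stemmed)
        for query_type, base, stemmed in zip(QUERY_TYPES, BASELINE[model], STEMMED[model])
    }
    for model in BASELINE
}
```

For these three models the change is negative for T queries and positive for TD and TDN queries, which matches the direction reported in Table 5.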
X ACKNOWLEDGMENT
This work was supported by the University Grants Commission (UGC) Minor Research Project, grant number [F. No: 41-1360/2012 (SR)]. We would like to thank the FIRE forum for providing the data/corpus free of cost and for providing the relevance judgments.
REFERENCES
[1] Amati, G., van Rijsbergen, C.J. : "Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness", ACM - Transactions on Information Systems, 20, 357-389, 2002
[2] CLEF, “Conference and Labs of the Evaluation Forum”, http://www.clef-initiative.eu/
[3] Cleverdon, C.W., Keen, M.: "Factors Affecting the Performance of Indexing Systems", Vol 2., ASLIB, UK, pp. 37-59, 1966
[4] Clinchant, S., Gaussier, E. : "Bridging Language Modeling and Divergence From Randomness Approaches: A Log-logistic Model for IR", Proceedings of ICTIR, London, UK, 2009