Department of Information Technology Department of Information Technology

Multi-label Text Categorization Based on Feature Optimization using Ant Colony Optimization and Relevance Clustering Technique

Puneet Nema
Department of Information Technology, SATI Vidisha, India

Vivek Sharma
Department of Information Technology, SATI Vidisha

Abstract

Feature optimization and feature selection play an important role in multi-label text categorization. In multi-label text categorization, multiple features share a common class, and classification suffers from the problem of selecting the features relevant to each class. This paper proposes a multi-label text categorization method based on feature optimization. Feature optimization is performed by ant colony optimization (ACO), which gathers the relevant common features that map a document to a class. Classification is performed by a cluster mapping technique. The feature optimization process reduces the loss of data during the feature mapping transformation in classification. The proposed algorithm is validated on standard datasets such as web page data, medical search data, and the RCV1 dataset. Our empirical evaluation shows that the proposed algorithm performs better than the fuzzy relevance technique and other classification techniques.

Keywords: multi-label classification, feature extraction, clustering, ACO

I Introduction

Multi-label text categorization improves the performance of categorizing internet data. Text retrieval and search optimization of text content both require text categorization techniques. Nowadays, the growing volume of text and multimedia data demands multi-label text classification. In multi-label text classification, the features of the text content play an important role. Feature extraction requires transforming the text data into another format, and the extracted features are then processed by various feature optimization techniques [1,2]. This paper uses the ant colony optimization technique to select text features and to optimize the selected features. The optimized features improve the mapping to classes [3]. The classification process builds a classification model, which has two parts: model generation and the processing of unseen data. In text categorization, the number of features involved is usually huge, which may cause the curse of dimensionality [4]. High-dimensional documents prevent common data organization strategies from being efficient. Besides, a category is not necessarily a convex region; it can be a non-convex region that is a union of several overlapping or disjoint subregions. An automatic classification system may therefore suffer from large memory requirements or poor performance [5]. In the current decade, machine learning has become the most popular approach to classification and categorization. Machine learning is concerned with constructing computer programs that can adapt and learn from past experience to solve a given problem [6]. These programs are usually based on a learning algorithm.

In machine learning terminology, the process that deals with classification is called supervised learning, whereas the process that deals with clustering is called unsupervised learning. Most work on text categorization focuses on supervised learning. Within this framework, a set of data examples is first manually classified and labeled with predefined categories by human experts [7]. A learning algorithm is then applied to learn the characteristics of each category, and finally a classification model is automatically built to decide the categories of future unknown data. Usually the sample dataset is divided into two parts: a training set, which is used to build classifiers by learning the characteristics of the categories, and a test set, which is used to test the performance of the classifiers. Nowadays, classifiers built automatically with machine learning techniques achieve a high level of effectiveness and are dominant in the area of text categorization [8]. Clustering also plays an important role in multi-label text categorization. Clustering is performed by partitioning the given data according to information entropy and gain values. Ant colony optimization is used to select and map the better clusters. Ant colony optimization is a heuristic technique that yields optimized results in the selection of feature points for the classification of text data.

Section II describes feature extraction. Section III discusses ant colony optimization. Section IV presents the proposed algorithm. Section V discusses the experiments, and Section VI presents the conclusion and future work.

II Feature Extraction

Text feature extraction plays an important role in multi-label text categorization. Feature extraction first requires preprocessing of the text data, which removes all unwanted tags, colons, and syntax from the text. After that, various techniques are used to extract features: some are based on transform methods, some on entropy, and some on word frequency [9]. Traditionally, feature extraction uses the basic features supplied with the training instances to construct more sophisticated features. In text processing, however, applying this approach to the bag of words loses important information about word ordering. Therefore, we argue that feature extraction becomes much more powerful when it operates on the raw document text [10]. But should the extractor always analyze the whole document as a single unit, as regular text classifiers do? Considering the entire document may often be misleading, as its text can be too diverse to be readily mapped to the right set of concepts, while notions mentioned only briefly may be overlooked. Instead, we partition the document into a series of non-overlapping segments (contexts) and generate features at this finer level. Each context is classified into a number of concepts in the knowledge base, and pooling these concepts together results in a multi-faceted classification for the document. This way, the resulting set of concepts represents the various aspects or sub-topics covered by the document. Potential candidates for such contexts are simple sequences of words, or more linguistically motivated chunks such as sentences or paragraphs. The optimal resolution for document segmentation can be determined automatically using a validation set [11].
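The preprocessing and segment-level extraction described above can be illustrated with a minimal Python sketch. It assumes word-frequency features and fixed-size word windows as the contexts; the function names `preprocess` and `segment_features` and the `segment_size` parameter are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

def preprocess(text):
    """Strip markup tags and punctuation, lowercase, and split into word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML-like tags
    return re.findall(r"[a-z0-9]+", text.lower())   # keep alphanumeric tokens only

def segment_features(text, segment_size=5):
    """Split the tokens into non-overlapping segments and count the terms in each."""
    tokens = preprocess(text)
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens), segment_size)]
    return [Counter(segment) for segment in segments]

# Toy document (illustrative only): ten words, so two segments of five words.
doc = "<p>Ant colony optimization: ants select features; ants drop redundant features.</p>"
features = segment_features(doc, segment_size=5)
for index, counts in enumerate(features):
    print(index, dict(counts))
```

Sentences or paragraphs could be used as contexts instead of fixed windows; the window size here stands in for the segmentation resolution that the text says can be tuned on a validation set.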

Figure 1 shows the process of feature extraction from raw text data.

III Feature Optimization

The extracted text features pass through ant colony optimization (ACO). The ACO process finds the continuity of word features and is described in terms of artificial ants. The ants find the dissimilar and redundant groups of text. The feature optimization process is described here. In this method, we introduce continuity of ants for similar feature points, while dissimilar feature points are collected into the next node. In this process, ACO finds an optimal feature point subset. Suppose the ants find similar feature points along a continuous route [21]. Every ant compares the property values of its feature points against the initial feature point set. When deciding whether a feature is a dissimilar word, we consider two factors: the importance degree and the easiness degree of the dissimilar words. While walking, ants deposit pheromone on the ground according to the importance of the outlier and follow, with some probability, the pheromone previously laid by other ants and the easiness degree of the noise.

Let $D$ be the feature set and $m$ be the number of ants, and let the importance degrees of $a_1, a_2, \ldots, a_n$ be $c_1, c_2, \ldots, c_n$. The appetency of the solutions searched by two ants is defined as

App(i, j) = … (1)

where $c_i$ and $c_j$ are the importance degrees of the dissimilar word paths. The concentration of the solution is defined as

Con(i+j) = … (2)

where $\delta_i$ and $\delta_j$ are the numbers of ants whose appetency with other ants is greater than $\alpha$; $\alpha$ can be defined as $m/10$. The incremental pheromone deposited by the ants is then

$$\Delta\tau_i = \frac{Q\,\beta_i}{\mathrm{Con}(i+j)} \tag{3}$$

where $Q$ is a constant.

The pheromone levels are modeled by a matrix $\tau$, where $\tau_{ij}(t)$ is the level of pheromone deposited on the edge between nodes $i$ and $j$ at time $t$. Ant $k$ at node $i$ selects the next node $j$ to visit with probability

… (4)

where $\eta_{ij}$ represents heuristic information about the problem, which can be defined as the easiness of the path. The heuristic desirability of traversal and the edge pheromone levels are combined to form the so-called probabilistic transition rule given in equation (4), which denotes the probability that an ant at feature point $i$ chooses to travel to feature point $j$ at time $t$.
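For reference, the probabilistic transition rule commonly used in the ACO literature combines pheromone and heuristic information as shown below. Whether equation (4) takes exactly this form is an assumption. The exponents $a$ and $b$ and the set $J_i^k$ of feature points that ant $k$ may still visit from $i$ are names introduced here for illustration.

```latex
p_{ij}^{k}(t) =
  \frac{[\tau_{ij}(t)]^{a}\,[\eta_{ij}]^{b}}
       {\sum_{l \in J_i^{k}} [\tau_{il}(t)]^{a}\,[\eta_{il}]^{b}},
  \qquad j \in J_i^{k}
```

In this form, $a$ weights the pheromone trail and $b$ weights the easiness of the path, so a larger $b$ makes the ants greedier with respect to the heuristic.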

To direct the search toward the best solution, a global update rule is applied:

$$\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \rho\,\Delta\tau_{ij} \tag{5}$$

where $\rho$ is the parameter that controls pheromone evaporation.
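The transition rule and the global update can be sketched in Python as follows. This is a minimal illustration that assumes the standard ACO form of the transition rule (pheromone and heuristic values raised to weighting exponents and normalized over the allowed nodes); the function names and the exponents `a` and `b` are hypothetical, and only `global_update` corresponds directly to equation (5).

```python
def transition_probabilities(tau_row, eta_row, allowed, a=1.0, b=1.0):
    """Probability of moving from the current feature point to each allowed one.

    tau_row[j] is the pheromone on edge (i, j) and eta_row[j] is the heuristic
    easiness of that edge. The exponents a and b are assumed weighting parameters.
    """
    weights = {j: (tau_row[j] ** a) * (eta_row[j] ** b) for j in allowed}
    total = sum(weights.values())
    if total == 0:
        # No pheromone or heuristic signal: fall back to a uniform choice.
        return {j: 1.0 / len(allowed) for j in allowed}
    return {j: w / total for j, w in weights.items()}

def global_update(tau, delta_tau, rho=0.1):
    """Equation (5): tau_ij(t+1) = (1 - rho) * tau_ij(t) + rho * delta_tau_ij."""
    return [[(1.0 - rho) * t + rho * d for t, d in zip(tau_row, delta_row)]
            for tau_row, delta_row in zip(tau, delta_tau)]

# Toy values (illustrative only): three feature points, uniform initial pheromone.
tau = [[1.0, 1.0, 1.0] for _ in range(3)]
eta = [[0.0, 1.0, 3.0], [1.0, 0.0, 1.0], [3.0, 1.0, 0.0]]
probs = transition_probabilities(tau[0], eta[0], allowed=[1, 2])
print(probs)  # {1: 0.25, 2: 0.75}

delta = [[0.0, 0.0, 2.0], [0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
tau = global_update(tau, delta, rho=0.5)
print(tau[0])  # [0.5, 0.5, 1.5]
```

A larger $\rho$ makes old pheromone evaporate faster, so the search forgets earlier paths more quickly and follows the most recent deposits.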

The steps of the proposed ACO-based feature preprocessing procedure for cluster-based classification are given in Section IV.

IV Proposed Algorithm

This section discusses the proposed algorithm for multi-label text categorization, which is based on clustering and ant colony optimization. The ant colony algorithm optimizes the features of the text data and passes them on to class mapping. The proposed algorithm has two parts: the classifier model and the text model. The classifier model consists of four phases. In the first phase, the data are transformed into the ant feature space. In the second phase, the transformed features are processed by the cluster mapping phase, which generates the predefined classes and finds the matching value of the features [12,13,14]. The same process is followed to prepare the test data. The algorithm is given below.

Step 1. The raw text data first passes through the feature extractor.

Step 2. The extracted features are processed by ACO as follows:

1) Initialize each ant's value.

2) Randomly select the feature vectors for the head and trail ant matrices.

3) Examine every ant to find the best-matching feature.

4) The similarity of the feature vectors decreases, and the optimal features move on to the tailor phase.

5) The weight values of the vectors are then adjusted and passed to the feature space of the cluster map.
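The ACO steps above are stated at a high level, so the following Python sketch shows only one possible reading of them as a feature subset search. It is not the authors' implementation. The per-feature `relevance` score (assumed strictly positive), the subset score as a sum of relevances, and the function name `aco_select_features` are all assumptions made for illustration.

```python
import random

def aco_select_features(relevance, n_select, n_ants=10, n_iter=20, rho=0.1, seed=0):
    """Pick n_select feature indices with a simple ant colony search.

    relevance[j] is an assumed, strictly positive per-feature score that plays
    the role of the heuristic information; higher means more useful.
    """
    rng = random.Random(seed)
    n = len(relevance)
    tau = [1.0] * n                       # initialize the pheromone of each feature
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iter):
        for _ in range(n_ants):
            candidates = list(range(n))
            subset = []
            for _ in range(n_select):     # each ant builds a subset step by step
                weights = [tau[j] * relevance[j] for j in candidates]
                j = rng.choices(candidates, weights=weights, k=1)[0]
                subset.append(j)
                candidates.remove(j)
            score = sum(relevance[j] for j in subset)   # examine the ant
            if score > best_score:
                best_subset, best_score = sorted(subset), score
        # Evaporate all pheromone, then reinforce the features of the best subset.
        tau = [(1.0 - rho) * t for t in tau]
        for j in best_subset:
            tau[j] += rho * best_score
    return best_subset, best_score

# Toy relevance scores (illustrative only).
relevance = [0.9, 0.1, 0.8, 0.2, 0.7]
subset, score = aco_select_features(relevance, n_select=2)
print(subset, round(score, 2))
```

In practice the subset score would come from the classifier or from a redundancy measure over the selected features, not from a plain sum; the sum is used here only to keep the sketch self-contained.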

The feature mapping function creates the forward classes for classification:

Input the feature space of the class-generated feature matrix.

Estimate the feature correlation attribute of the feature vectors a and b of the feature matrix.

The estimated correlation coefficient of the features is passed to the class.

Create the relative feature difference value.

After this processing, the feature data create the classes.
Generate the feature mapping of each class according to the unseen data.
The classifier measures the similarity and returns the equivalent class of the data.
If no relevant class is found, the process is repeated in the feature space.
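The correlation-based mapping can be sketched as follows. The exact correlation formula is not given in the text, so this sketch assumes the Pearson correlation coefficient between two feature vectors, and it assumes that each class is represented by a single center vector. The names `correlation`, `classify`, `class_centers`, and the `threshold` parameter are hypothetical.

```python
import math

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length feature vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # constant vector: correlation undefined, treated as no match
    return cov / (norm_a * norm_b)

def classify(sample, class_centers, threshold=0.5):
    """Return every class whose center correlates with the sample above threshold."""
    scores = {label: correlation(sample, center)
              for label, center in class_centers.items()}
    return sorted(label for label, s in scores.items() if s >= threshold)

# Toy class centers (illustrative only, not taken from any dataset).
centers = {
    "course": [3.0, 1.0, 0.0, 0.0],
    "faculty": [0.0, 0.0, 2.0, 3.0],
    "project": [2.0, 2.0, 2.0, 1.0],
}
labels = classify([2.0, 1.0, 0.0, 0.0], centers)
print(labels)
```

Returning all classes above the threshold, not only the best one, is what makes the assignment multi-label. An empty result corresponds to the case where no relevant class is found and the sample goes back to the feature space.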

Figure 2 shows the classification model based on ant colony optimization and clustering.

V Experimental Result Analysis

The performance of the proposed algorithm was evaluated using MATLAB, a well-known computational software package. The algorithm was validated on three datasets: WebKB, Yahoo, and RCV1 [17]. Two other algorithms, ML-FRC and rank SVM, were evaluated alongside the proposed algorithm [16]. In WebKB, each document belongs to only one category. The documents in WebKB [15] are web pages collected by the World Wide Knowledge Base project of the CMU text learning group and were downloaded from the Four Universities Data Set homepage. The Yahoo web page dataset was collected from the "yahoo.com" domain. The RCV1 (Reuters Corpus Volume 1) dataset is made available by Reuters. The percentage of documents that belong to more than one category in Medical, Yahoo web page [18], and RCV1 is low, medium, and high, respectively. To evaluate the performance of each method, we adopt the micro-averaged breakeven point (BEP), micro-averaged F1 (F1), and Hamming loss (Loss) as performance measures, which are defined as follows [19].

where p is the number of categories, n is the number of test patterns, and MicroP and MicroR are the micro-averaged precision and recall, respectively.
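As an illustration of these measures, the sketch below computes micro-averaged precision, recall, F1, and Hamming loss using their standard definitions: F1 is the harmonic mean of MicroP and MicroR, and Hamming loss is the fraction of the n × p label decisions that are wrong. That the paper uses exactly these forms is an assumption, and the function name `micro_metrics` is hypothetical. BEP is not computed here because it requires ranked scores, not hard decisions.

```python
def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, F1 and Hamming loss for 0/1 label matrices.

    y_true and y_pred are lists of n rows, each with p binary category indicators.
    """
    n, p = len(y_true), len(y_true[0])
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, y in zip(true_row, pred_row):
            tp += int(t == 1 and y == 1)   # correctly assigned category
            fp += int(t == 0 and y == 1)   # wrongly assigned category
            fn += int(t == 1 and y == 0)   # missed category
    micro_p = tp / (tp + fp) if tp + fp else 0.0
    micro_r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    hamming = (fp + fn) / (n * p)
    return micro_p, micro_r, f1, hamming

# Toy labels (illustrative only): n = 2 test patterns, p = 3 categories.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
micro_p, micro_r, f1, hamming = micro_metrics(y_true, y_pred)
print(micro_p, micro_r, f1, hamming)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 are all 2/3 and the Hamming loss is 2/6.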

Table 1 shows the performance of the three methods on all three datasets in terms of F1, BEP, and HLoss.

Dataset / Method / F1 / BEP / HLoss
WebKB / RSVM / 89.7999 / 86.2731 / 13.81
WebKB / ML-FRC / 92.3094 / 88.7004 / 14.85
WebKB / PROPOSED / 96.3094 / 89.8004 / 8.06
Yahoo / RSVM / 91.4417 / 87.9149 / 18.45
Yahoo / ML-FRC / 96.3422 / 90.3422 / 12.49
Yahoo / PROPOSED / 97.9512 / 91.4422 / 9.70
RCV1 / RSVM / 89.7999 / 86.2731 / 11.46
RCV1 / ML-FRC / 95.3095 / 88.7004 / 13.85
RCV1 / PROPOSED / 96.3094 / 89.8004 / 10.065

Figure 3 shows the performance of all three methods on the WebKB dataset. With the proposed method, the F1 and BEP values increase and the HLoss value decreases.

Figure 4 shows the performance of all three methods on the Yahoo dataset. With the proposed method, the F1 and BEP values increase and the HLoss value decreases.

Figure 5 shows the performance of all three methods on the RCV1 dataset. With the proposed method, the F1 and BEP values increase and the HLoss value decreases.

VI Conclusion & Future Scope

This paper proposed a multi-label text categorization algorithm based on ant colony optimization and a cluster mapping technique. In the proposed algorithm, ant colony optimization performs feature optimization of the text data for class mapping. The optimization process also reduces the problems of data dimensionality and data loss. Better feature selection gives better classification and prediction results in text categorization. The experiments used three well-known datasets: WebKB, Yahoo, and RCV1. Two other algorithms, RSVM and ML-FRC, were evaluated alongside the proposed algorithm. Our empirical results show that the proposed method performs better than these two algorithms. In future work, the proposed classification algorithm will be applied to real-world datasets.

References

[1] Shie-Jue Lee, Jung-Yi Jiang, “Multilabel Text Categorization Based on Fuzzy Relevance Clustering,” IEEE Transactions on Fuzzy Systems, vol. 22, 2014, pp. 1457-1471.

[2] Francois Queyroi, Maylis Delest, Jean-Marc Fedou, Guy Melancon, “Assessing the Quality of Multilevel Graph Clustering,” Springer, 2014, pp. 1-20.

[3] Jose Antonio Sanz, Alberto Fernandez, Humberto Bustince, Francisco Herrera, “IVTURS: A Linguistic Fuzzy Rule-Based Classification System Based On a New Interval-Valued Fuzzy Reasoning Method With Tuning and Rule Selection,” IEEE Transactions on Fuzzy Systems, vol. 21, 2013, pp. 399-412.

[4] Min-Ling Zhang, Zhi-Hua Zhou, “A Review on Multi-Label Learning Algorithms,” IEEE, 2013, pp. 1-13.

[5] Grigorios Tsoumakas, Ioannis Katakis, Ioannis Vlahavas, “Random k-Labelsets for Multi-Label Classification,” IEEE Transactions on Knowledge and Data Engineering, 2010, pp. 1-12.

[6] Andrea Esuli, Tiziano Fagni, Fabrizio Sebastiani, “Boosting multi-label hierarchical text categorization,” Springer, 2008, pp. 1-27.

[7] Xiangnan Kong, Michael K. Ng, Zhi-Hua Zhou, “Transductive Multilabel Learning via Label Set Propagation,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, 2013, pp. 704-719.

[8] S. Arul Murugan, Dr. P. Suresh, “Hybridization of EM and SVM Clusters for Efficient Text Categorization,” International Journal of Innovative Research in Advanced Engineering, 2014, pp. 163-171.

[9] Ying Yu, Witold Pedrycz, Duoqian Miao, “Multi-label classification by exploiting label correlations,” Elsevier, 2014, pp. 2989-3004.

[10] Grigorios Tsoumakas, Ioannis Katakis, “Multi-Label Classification: An overview,” 2007, pp. 1-13.

[11] Min-Ling Zhang, Zhi-Hua Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Elsevier, 2007, pp. 2038-2048.

[12] Krishnakumar Balasubramanian, Guy Lebanon, “The Landmark Selection Method for Multiple Output Prediction,” International Conference on Machine Learning, 2012, pp. 1-8.

[13] F. Tai, H. T. Lin, “Multi-label classification with principal label space transformation,” Neural Computation, 2012.

[14] X. Kong, M. K. Ng, and Z. H. Zhou, “Transductive multi-label learning via label set propagation,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 3, pp. 704–719, 2013.

[15] M. L. Zhang and Z. H. Zhou, “Multilabel neural networks with applications to functional genomics and text categorization,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1338–1351, Oct. 2006.

[16] M. L. Zhang and Z. H. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern Recognit., vol. 40, no. 7, pp. 2038–2048, 2007.

[17] M. L. Zhang, “ML-RBF: RBF neural networks for multi-label learning,” Neural Process. Lett., vol. 29, no. 2, pp. 61–74, 2009.

[18] M. L. Zhang, J. M. Peña, and V. Robles, “Feature selection for multi-label Naive Bayes classification,” Inf. Sci., vol. 179, no. 19, pp. 3218–3229, 2009.

[19] J.-Y. Jiang, R.-J. Liou, and S.-J. Lee, “A fuzzy self-constructing feature clustering algorithm for text classification,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 3, pp. 335–349, Mar. 2011.

[20] H. Abdi and L. J. Williams, “Principal component analysis,” in Wiley Interdisciplinary Reviews Computational Statistics. New York, NY, USA: Wiley, vol. 2, pp. 433–459, 2010.

[21] Rahul Karthik Sivagaminathan, Sreeram Ramakrishnan, “A hybrid approach for feature subset selection using neural networks and ant colony optimization,” Expert Systems with Applications, 2007.