A Review of Multi-Label Text Categorization Using Classification and Clustering Technique

Puneet Nema, Vivek Sharma

Department of Information Technology

SATI Vidisha, M.P., India

Abstract

The growing diversity of data over the current decade has made data categorization a pressing problem. Data categorization relies on classification techniques such as k-nearest neighbour (KNN), decision trees, and support vector machines. Classification depends on the similarity of features, and this dependence limits the accuracy of the classifier. The classification process maps data to labels, and the labels are grouped into predefined classes. This paper presents a review of classification and clustering techniques for multi-label text categorization.

Keywords: Data mining, text categorization, classification, clustering

Introduction

Multi-label text categorization plays an important role in searching large document collections across different domains. Classification accuracy remains a major issue for multi-label text categorization in machine learning [1], and various clustering and classification techniques have been used to improve it [2]. A central difficulty is measuring the similarity between classes and attribute values, which various feature-selection methods attempt to address. Machine learning algorithms such as KNN, support vector machines, and logistic regression have been proposed for such classification problems and have achieved a satisfactory level of accuracy [3,4]. In some real-world problems, however, each instance can be associated with multiple classes simultaneously.

The classification process is divided into three phases: a learning phase, a testing phase, and an application phase. The classifier is built during the learning phase [5,6]. It may take the form of classification rules, a decision tree, or a mathematical formula such as a regression model. Some authors combine clustering and classification for multi-label text categorization. The clustering step uses techniques such as k-means, expectation-maximization (EM), and fuzzy c-means (FCM), while the classification step uses various classifier models. To validate clustered data for classification, a fuzzy transform function maps the class data to classifier data. Figure 1 illustrates multi-label text categorization [7].

Figure 1: Multi-label classification.
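The three phases of classification described in the introduction (learning, testing, application) can be illustrated with a minimal sketch. It assumes a k-nearest-neighbour classifier with Jaccard similarity over word sets and a majority vote among neighbours; the class name `KnnMultiLabel`, the toy documents, and the voting rule are illustrative assumptions, not a method taken from any of the reviewed papers.

```python
from collections import Counter

def tokens(text):
    """Lower-case whitespace tokenisation (a deliberately simple assumption)."""
    return set(text.lower().split())

def jaccard(a, b):
    """Similarity of two token sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class KnnMultiLabel:
    """Hypothetical multi-label kNN classifier used only for illustration."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []

    def fit(self, docs, label_sets):
        # Learning phase: kNN simply memorises the labelled documents.
        self.examples = [(tokens(d), set(ls)) for d, ls in zip(docs, label_sets)]
        return self

    def predict(self, doc):
        # Rank stored documents by similarity and let the k nearest vote.
        query = tokens(doc)
        ranked = sorted(self.examples, key=lambda ex: jaccard(query, ex[0]), reverse=True)
        nearest = ranked[:self.k]
        votes = Counter(label for _, labels in nearest for label in labels)
        # Keep every label supported by more than half of the neighbours.
        return {label for label, n in votes.items() if n > len(nearest) / 2}

# Learning phase on a toy training set.
train_docs = [
    "stock market shares fall",
    "football team wins cup",
    "club shares rise after cup win",
    "bank raises interest rates",
    "tennis player wins match",
]
train_labels = [{"finance"}, {"sport"}, {"finance", "sport"}, {"finance"}, {"sport"}]
clf = KnnMultiLabel(k=3).fit(train_docs, train_labels)

# Testing phase: fraction of held-out documents whose label set is predicted exactly.
test_docs = ["football club shares fall", "bank shares fall"]
test_labels = [{"finance", "sport"}, {"finance"}]
exact = sum(clf.predict(d) == y for d, y in zip(test_docs, test_labels))
test_accuracy = exact / len(test_docs)

# Application phase: label a previously unseen document.
new_labels = clf.predict("tennis team wins cup")
print(test_accuracy, sorted(new_labels))
```

Note that the first test document receives two labels at once, which is the situation that distinguishes multi-label from single-label classification.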

Multi-label classification refers to the task of learning a function that maps each instance to a set of labels rather than to a single label. This makes multi-label data particularly interesting from a learning perspective: in contrast to binary or multi-class classification, there are label dependencies and interconnections in the data that can be detected and exploited to obtain additional useful information or simply better classification performance.

This paper is divided into four sections. Section I introduces multi-label text categorization and classification techniques [8]. Section II reviews related work in the field. Section III discusses the problem formulation, and Section IV presents the conclusion and future work.
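To make the idea of label sets and label dependencies concrete, the following sketch encodes label sets as binary indicator vectors and counts how often pairs of labels occur together. The label names and the five example label sets are made up for illustration, and pairwise co-occurrence counting is only one simple way to expose dependencies between labels.

```python
from collections import Counter
from itertools import combinations

# Hypothetical label sets for five documents.
label_sets = [
    {"politics", "economy"},
    {"sport"},
    {"economy"},
    {"politics", "economy"},
    {"sport", "economy"},
]

# Fixed label order so every document maps to a vector of the same length.
labels = sorted(set().union(*label_sets))

def to_indicator(label_set):
    """Return a 0/1 vector with one position per known label."""
    return [1 if label in label_set else 0 for label in labels]

Y = [to_indicator(s) for s in label_sets]

# Count how often each pair of labels is assigned to the same document.
cooccurrence = Counter()
for s in label_sets:
    for pair in combinations(sorted(s), 2):
        cooccurrence[pair] += 1

print(labels)
print(Y)
print(dict(cooccurrence))
```

In this toy data, "politics" never appears without "economy", which is the kind of dependency a multi-label learner may exploit.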

Related work

This section discusses related work in the field of multi-label text categorization. Various machine learning techniques, including clustering and classification, have been used to improve multi-label text categorization. Some of this work is discussed here.

[1] The authors propose a fuzzy-based method for multi-label text classification in which a document can belong to one or more categories. In text categorization, the number of features involved is usually huge, causing the curse of dimensionality. In addition, a category can be a nonconvex region that is a union of several overlapping or disjoint sub-regions. An automatic classification system may therefore suffer from large memory requirements or poor performance. By incorporating fuzzy techniques, the proposed method can overcome these issues.

[2] The authors contribute to a theory of multilevel clustering by designing and studying a multilevel modularity measure for hierarchically clustered graphs that explicitly takes the nesting structure of clusters into account. The proposed multilevel modularity generalizes a modularity measure used in the context of software reverse engineering. The measure recursively traverses the hierarchy of clusters and computes a one-variable polynomial encoding the intra- and inter-cluster densities at all levels of the hierarchical clustering. The resulting polynomial reflects how the graph combines with the hierarchy of clusters and can be used to assess the quality of a hierarchical clustering.

[3] The authors present IVTURS, a new linguistic fuzzy rule-based classification method based on a new, completely interval-valued fuzzy reasoning method. The inference process uses interval-valued restricted equivalence functions to increase the relevance of rules for which the equivalence between the interval membership degrees of the patterns and the ideal membership degrees is greater, which is a desirable behavior. Furthermore, the parameterized construction of these functions allows the optimal function to be computed for each variable, which could potentially improve the system's behavior.

[4] The authors provide a timely review of this area, with emphasis on state-of-the-art multi-label learning algorithms. First, the fundamentals of multi-label learning, including the formal definition and evaluation metrics, are given. Second, and primarily, twelve representative multi-label learning algorithms are scrutinized under common notation, with corresponding analyses and discussions. Third, several extended topics in multi-label learning are briefly summarized. In conclusion, online resources and open research problems in multi-label learning are outlined for reference.
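As an illustration of what evaluation metrics for multi-label learning look like, the sketch below computes two commonly used ones, Hamming loss and subset accuracy, on label sets. The choice of these two metrics and the toy predictions are assumptions made for illustration; the review in [4] should be consulted for the full set of metrics and their formal definitions.

```python
def hamming_loss(true_sets, pred_sets, all_labels):
    """Fraction of (instance, label) pairs that are predicted wrongly."""
    # The symmetric difference holds labels that are missed or wrongly added.
    errors = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return errors / (len(true_sets) * len(all_labels))

def subset_accuracy(true_sets, pred_sets):
    """Fraction of instances whose predicted label set matches exactly."""
    exact = sum(1 for t, p in zip(true_sets, pred_sets) if t == p)
    return exact / len(true_sets)

# Hypothetical ground truth and predictions over four labels.
all_labels = {"a", "b", "c", "d"}
true_sets = [{"a", "b"}, {"c"}, {"a", "d"}, {"b"}]
pred_sets = [{"a", "b"}, {"c", "d"}, {"a"}, {"b"}]

loss = hamming_loss(true_sets, pred_sets, all_labels)
accuracy = subset_accuracy(true_sets, pred_sets)
print(loss, accuracy)  # prints 0.125 0.5
```

Subset accuracy is the stricter of the two: the second and third predictions each get one label wrong and therefore count as complete misses, while Hamming loss penalises only the individual wrong labels.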

[5] The authors present a new multi-label classification method, RAkEL, which learns an ensemble of label powerset (LP) classifiers, each targeting a different small random subset of the set of labels. The motivation was the computational efficiency and predictive performance problems of the simple and effective standard LP method when faced with domains with a large number of labels and training examples.
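The data handling behind this idea can be sketched as follows. The code covers only the label-side steps: drawing random k-labelsets, turning label sets into LP classes restricted to one labelset, and combining the members' outputs by per-label voting. The function names, the 0.5 voting threshold, and the hard-coded member outputs are assumptions for illustration; training the individual LP classifiers is left out, and the sketch should not be read as the reference implementation from [5].

```python
import math
import random

def random_k_labelsets(all_labels, k, m, seed=0):
    """Draw m distinct random subsets of k labels each."""
    ordered = sorted(all_labels)
    if m > math.comb(len(ordered), k):
        raise ValueError("not enough distinct k-labelsets")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < m:
        chosen.add(frozenset(rng.sample(ordered, k)))
    return sorted(chosen, key=sorted)

def label_powerset_targets(label_sets, subset):
    """Restrict each label set to `subset`; each distinct result is one LP class."""
    return [frozenset(ls & subset) for ls in label_sets]

def combine_votes(member_outputs, all_labels, threshold=0.5):
    """Average the yes/no votes per label over the members that cover it."""
    result = set()
    for label in all_labels:
        votes = [label in predicted
                 for subset, predicted in member_outputs if label in subset]
        if votes and sum(votes) / len(votes) > threshold:
            result.add(label)
    return result

all_labels = {"a", "b", "c", "d"}
subsets = random_k_labelsets(all_labels, k=2, m=3, seed=1)

# LP targets for three training label sets, restricted to the labelset {a, b}.
targets = label_powerset_targets([{"a", "b"}, {"b", "c"}, {"a"}], {"a", "b"})

# Hypothetical outputs of four trained members as (labelset, predicted labels).
member_outputs = [
    ({"a", "b"}, {"a"}),
    ({"a", "c"}, {"a", "c"}),
    ({"b", "c"}, set()),
    ({"c", "d"}, {"c", "d"}),
]
final_labels = combine_votes(member_outputs, all_labels)
print(sorted(final_labels))
```

Because each member sees only k labels, the number of LP classes per member is at most 2^k, which is what keeps the ensemble tractable when the full label set is large.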

[6] The authors propose TREEBOOST.MH, a multi-label hierarchical text categorization (HTC) algorithm that is a hierarchical variant of ADABOOST.MH, a well-known member of the family of "boosting" learning algorithms. TREEBOOST.MH embodies several intuitions that had arisen earlier within HTC, for example that both feature selection and the selection of negative training examples should be performed "locally", i.e. by paying attention to the topology of the classification scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated "locally".

[7] The authors study the problem of transductive multilabel learning and propose a novel solution, called TRAnsductive Multilabel Classification (TRAM), to effectively assign a set of multiple labels to each instance. Unlike supervised multilabel learning methods, TRAM estimates the label sets of the unlabeled instances by using information from both labeled and unlabeled data. The authors first formulate transductive multilabel learning as an optimization problem of estimating label concept compositions. They then derive a closed-form solution to this optimization problem and propose an effective algorithm to assign label sets to the unlabeled instances.

[8] The authors show that hybridizing the EM algorithm with SVM clustering combines their classification power to produce multi-label categorization results while removing noise effectively. Initially, the EM algorithm extracts potentially noisy articles from the data set using the descending porthole technique, a sliding-window technique applied from the top to the bottom of the article for preprocessing. Subsequently, the SVM cluster establishes the content holdup method, which generates a more efficient multi-label representation of the articles. The hybridization of the EM algorithm and SVM cluster outperforms the Fuzzy Self-Constructing Feature Clustering Algorithm in terms of lexica inclusion and multi-label categorization of text results.

[9] The authors present two novel multi-label classification algorithms based on variable precision neighborhood rough sets, called multi-label classification using rough sets (MLRS) and MLRS using local correlation (MLRS-LC). The proposed algorithms consider two important factors that affect prediction accuracy, namely the correlation among the labels and the uncertainty in the mapping between the feature space and the label space. MLRS takes a global view of label correlation, while MLRS-LC deals with label correlation at the local level. Given a new instance, MLRS determines its location and then computes the probabilities of the labels according to that location. MLRS-LC first finds the topic of the instance and then calculates the probability of the instance belonging to each class within the related topic.

[10] This work addresses the task of multi-label classification. It introduces the problem, gives an organized presentation of the methods that exist in the literature, and provides comparative experimental results for some of these methods. As future work, the authors intend to perform a finer-grained categorization of the different multi-label classification methods and to run more extensive experiments with more data sets and methods. They also intend to perform a comparative experimental study of problem adaptation methods.

Problem Formulation

A major challenge of the machine learning approach to text classification is how to translate textual information into features that a machine learning algorithm can use. This is what we refer to as feature generation. In an ideal world, the true semantics of the text would be understood and only the relevant concepts would be used as features. In practice, simply using each word as a separate feature already works quite well. However, most approaches generate an enormous number of features, which not all machine learning algorithms handle well. For them to work, only the most promising features are selected and fed to the algorithm. In the course of this review we found several problems that affect multi-label classification. These problems reduce the performance and accuracy of the multi-label classifier and produce an unclassified region; as the unclassified region grows, the accuracy and performance of the classifier decrease. Some of these problems are listed here [4, 6, 9, 10].

  1. Infinite population of data
  2. Feature selection of data
  3. Voting of class
  4. New class generation
  5. Imbalanced data problem
  6. Dependence of labels

Conclusion and future work

This paper presents a review of multi-label text categorization using classification and clustering techniques. Various techniques have been used for the clustering and classification steps. The major issue in text categorization is feature generation for the clustering and classification process. In future work, we plan to improve multi-label classification using the teaching-learning-based optimization (TLBO) algorithm. We expect TLBO to improve the classifier's accuracy on minority classes and to reduce the unclassified region in multi-label classification; enlarging the classified region should in turn improve the accuracy and performance of the classifier. The goals are to increase classifier accuracy, remove label dependency, reduce the size of the data, decrease feature dissimilarity, and use real-time data for classification.

References

[1] Shie-Jue Lee, Jung-Yi Jiang, “Multilabel Text Categorization Based on Fuzzy Relevance Clustering”, IEEE Transactions on Fuzzy Systems, Vol. 22, 2014, pp. 1457-1471.

[2] Francois Queyroi, Maylis Delest, Jean-Marc Fedou, Guy Melancon, “Assessing the Quality of Multilevel Graph Clustering”, Springer, 2014, pp. 1-20.

[3] Jose Antonio Sanz, Alberto Fernandez, Humberto Bustince, Francisco Herrera, “IVTURS: A Linguistic Fuzzy Rule-Based Classification System Based On a New Interval-Valued Fuzzy Reasoning Method With Tuning and Rule Selection”, IEEE Transactions on Fuzzy Systems, Vol. 21, 2013, pp. 399-412.

[4] Min-Ling Zhang, Zhi-Hua Zhou, “A Review on Multi-Label Learning Algorithms”, IEEE, 2013, pp. 1-13.

[5] Grigorios Tsoumakas, Ioannis Katakis, Ioannis Vlahavas, “Random k-Labelsets for Multi-Label Classification”, IEEE Transactions on Knowledge and Data Engineering, IEEE, 2010, pp. 1-12.

[6] Andrea Esuli, Tiziano Fagni, Fabrizio Sebastiani, “Boosting multi-label hierarchical text categorization”, Springer, 2008, pp. 1-27.

[7] Xiangnan Kong, Michael K. Ng, Zhi-Hua Zhou, “Transductive Multilabel Learning via Label Set Propagation”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, 2013, pp. 704-719.

[8] S. Arul Murugan, Dr. P. Suresh, “Hybridization Of Em And Svm Clusters For Efficient Text Categorization”, International Journal of Innovative Research in Advanced Engineering, 2014, pp. 163-171.

[9] Ying Yu, Witold Pedrycz, Duoqian Miao, “Multi-label classification by exploiting label correlations”, Elsevier Ltd., 2014, pp. 2989-3004.

[10] Grigorios Tsoumakas, Ioannis Katakis, “Multi-Label Classification: An overview”, 2007, pp. 1-13.

[11] Min-Ling Zhang, Zhi-Hua Zhou, “ML-KNN: A lazy learning approach to multi-label learning”, Elsevier Ltd., 2007, pp. 2038-2048.

[12] Krishnakumar Balasubramanian, Guy Lebanon, “The Landmark Selection Method for Multiple Output Prediction”, International Conference on Machine Learning, 2012, pp. 1-8.

[13] ai, F, Lin, H. T, “Multi-label classification with principle label space transformation”, Neural Computation, 2012.