Aggregated Probabilistic Fuzzy Relational Sentence Level Expectation Maximization
Clustering Algorithm For Efficient
Text Categorization
LakshmiKartheek V1, Dr.ChandraSekhar V2
II M.Tech, Department of CSE, S.R.K.R Engineering College, Bhimavaram, AP, India1.
Associate Professor, Department of CSE, S.R.K.R Engineering College, Bhimavaram, AP, India2
Abstract-Text clustering has found an important application to organize the data and to extract useful information from the available corpus. We proposed EM algorithm extracts the potentially noisy article from the data set using the descending porthole technique. Many existing clustering techniques have difficulties in handling extreme outliers but fuzzy clustering algorithms tend to give them very small membership degree in surrounding clusters. In this paper a generalized Expectation–maximization Probabilistic fuzzy c-means (FCM) algorithm is proposed and applied to clustering fuzzy sets. This technique leads to a fuzzy partition of the fuzzy rules, one for each cluster, which corresponds to a new set of fuzzy sub-systems. When applied to the clustering of a flat fuzzy system results a set of decomposed sub-systems that will be conveniently linked into a Parallel Collaborative Structures. This algorithm is particularly used in finding maximum likelihood estimates of parameters.
Keywords: Fuzzy Clustering, probabilistic Fuzzy c-means, Fuzzy Sets, Parallel Collaborative Structures.
1
introduction
Cluster analysis is primarily a tool for discovering previously hidden structure in the set of unordered objects, where we assume that a natural grouping exists in the data. Cluster analysis is a technique for classifying data [1], i.e., to divide a given set of objects into a set of classes or clusters based on similarity. The goal is to divide the data set in such a way that cases assigned to the same cluster should be as similar as possible whereas two objects from different clusters should be as dissimilar as possible[2]. It is an approach towards unsupervised learning as well as one of the major techniques in pattern recognition. The conventional (hard) clustering methods restrict each point of the data set to exactly one cluster [1]. These methods yield exhaustive partitions of the example set into non-empty and pair wise disjoint subsets. Fuzzy cluster analysis, [3] therefore allows gradual memberships of data points to clusters in [0, 1]. This gives the flexibility to express that data points belong to more than one cluster at the same time. Furthermore, these membership degrees offer a much finer degree of detail of the data model. One of the most popular object data clustering algorithms is the FCM algorithm, proposed by Dunn.Clustering is the process of grouping or aggregating of data items.Sentence clustering mainly used in variety of Applications such as classify and categorization of documents, automatic summary generation, organizing the documents, etc [4].
In text processing, sentence clustering plays a vital role this is used in text mining activities. Size of the clusters may change from one cluster to another. The traditional clustering algorithms have some problems in clustering the input dataset [3]. The problems such as, instability of clusters, complexity and sensitivity. To overcome the drawbacks of these clustering algorithms, Later an algorithm called Hierarchical Fuzzy Relational Eigenvector Centrality-based Clustering Algorithm (HFRECCA) [5] is extension of FRECCA which is used for the clustering of sentences. Contents present in text documents contain hierarchical structure and there are many terms present in the documents which are related to more than one theme hence HFRECCA will be useful algorithm for natural language documents. In this algorithm single object may belong to more than one cluster [6],[7].In 1973 and Later it is extended by Bezdek (1981), which can be applied if the objects of interest are represented as points in a multi-dimensional space. FCM [8] relates the concept of object similarity to spatial closeness and finds cluster centres as prototypes. Several examples of application of FCM to real clustering problems have proved the good characteristics of this algorithm with respect to stability and partition quality [8]. Further, its convergence has been formally demonstrated (Bezdek, 1987; Hathaway et. al. , 1988). From this method a large variety of clustering techniques was derived with more complex prototypes, which are mainly interesting in data analysis applications [9]. However, the generalization of these techniques to clustering imprecisely or uncertainly data or objects is not yet explored. Moreover, in the real-world applications, transaction data are usually composed of quantitative values.
Designing a sophisticated data-mining algorithm to deal with different types of data turns a challenge in this research topic. Recently, fuzzy set theory is more and more frequently used in intelligent systems, because of its simplicity and similarity to human reasoning. The theory has been successfully applied to many fields such
as manufacturing, engineering, diagnosis, economics, and others (Höppner, 1999).
In this context, a generalization of the previously methods in order to be used in clustering of fuzzy data (or fuzzy numbers) [9] would be a meritorious research. In this work, a new fuzzy relational clustering algorithm, based on the fuzzy c-means [8] algorithm is proposed to clusters fuzzy data, which is used in the antecedent and the consequents parts of the fuzzy rules. This clustering process divides the fuzzy rules of a Fuzzy System into a set of classes or clusters of fuzzy rules based on similarity. From this new strategy, a flat fuzzy system f(x) can be organized into a hierarchical structure of fuzzy systems (Salgado, 2005a and 2007b). Hierarchical fuzzy modelling is a promising method to identify fuzzy models of target systems with many input variables or/and with different complexity interrelation. Partitioning a fuzzy system reduces its complexity, which simplifies the identification problem, improves the computation times and saves resources, such as memory space. Moreover, with the organization of the fuzzy system into a new hierarchical structure, the model readability and transparency can be improved. In this context, we propose a new technique, the Probabilistic Fuzzy Clustering of Fuzzy Rules (FCFR), based on cluster methodology, to decompose a flat fuzzy system f(x) into a set of n fuzzy sub-systems f1(x), f2(x), ..., fn(x), organized in a collaborative structure. Each of these clusters may contain information related with particular aspects of the system f(x). The proposed algorithm allows grouping a set of rules into c subgroups (clusters) of similar rules. It is a generalization of the Probabilistic Clustering Algorithm (FCM)[8] , here applied to rules instead of points. With this algorithm, the system obtained from The data is transformed into a new system, organized into several subsystems, in PCS structures (Salgado, 2005b and 2007a).
The paper is organized as follows: firstly, a brief introduction to fuzzy systems is presented. The concept of relevance of a set of rules and of fuzzy system is reviewed. The PCS structure is described in section 3. In section 4 the FCFR strategy is proposed. An example is presented in section 5. Finally, the main conclusions are outlined in section 6. The development of information technology in the last two decades paved the way for a world full of data. But much of these data are of potentially not useful. In order to make it useful one, we need to extract the large amount of information or knowledge underlying the data. Data mining is a process of extraction the valuable information inside the huge amount of data. Clustering techniques can help in this data discovery and data analysis. Clustering the sentences is mainly useful in Information Retrieval (IR) Process. Clustering text [7] at the sentence level and document level has many differences. Document clustering partitions the documents into several parts and cluster those parts based on the overall theme. It doesn’t give much importance to the semantics of each sentence in the document. So there may be content overlap or bad coverage of theme will happen in the case of multi document summarization. Each data element in hard clustering method belongs to exactly one cluster.
A.Cluster Analysis
There are several algorithms available for clustering [10]. Each algorithm will cluster or group similar data objects in a useful way. This task involves dividing the data into various groups called clusters. The application of clustering includes Bioinformatics, Business modeling, image processing etc. In general, the text mining [10] process focuses on the statistical study of terms or phrases which helps us to understand the significance of a word within a document. Even if the two words didn’t have similar meanings, clustering will takes place. Clustering can be considered the most important unsupervised[11] learning framework, a cluster is declared as a group of data items, which are “similar” between them and are “dissimilar” to the objects belonging to other clusters. Sentence Clustering mainly used in variety of text mining applications. Output of clustering should be related to the query, which is specified by the user.
B.Similarity Measure
Similarity between the sentences is measured in terms of some distance function; such functions are Euclidean distance or Manhattan distance [1]. The choice of the measure is based on our requirement that induces the cluster size and formulates the success of a clustering algorithm on the specific application domain [4,7]. Current sentence clustering methods usually represent sentences as a term document matrix and perform clustering algorithm on it. Although these clustering methods [12] can group the documents satisfactorily, it is still hard for people to capture the meanings of the documents since there is no satisfactory interpretation for each document cluster. Based on the similarity or dissimilarity values clustering will takes place [1].
1
1
BACK GROUND AND RELATEDWORK
- Efficient Conceptual Rule Mining on Text Clusters
Text mining is a modern and computational approach, which attempts to determine new, formerly unidentified information by applying various techniques from normal language processing and data mining domains. Clustering is one of the conventional data mining techniques[13] that is used us an unsubstantiated learning pattern, where clustering techniques attempt to groupings of the text documents with different data items. Clusters have high intra-cluster similarity and low inter-cluster similarity. Most of the current document clustering methods and algorithms are fully based on the Vector Space Model (VSM), which is a widely used as a data representation for text classification and clustering. Weighting these feature words accurately affects the result of the clustering algorithm and improves its efficiency. In this paper, we are going to present a Conceptual rule mining which is generated for the sentence meaning and related sentences in the document. Weights are appropriated for the sentences having higher contribution to the topic of the entire document.
- Semantic Similarity Based on CorpusStatistics and Lexical Taxonomy
This proposal is a new approach for measuring the semantic similarity between the words and concepts. The characteristics of polysemy and synonymy that exist in words of natural language have always been a challenge in the fields of Natural Language Processing (NLP) and Information Retrieval (IR) techniques [13]. There are certain advantages in the work of semantic association discovery by combining a taxonomy structure with corpus statistics. The proposed approach outperforms other computational models. It gives the highest correlation value, with a benchmark resulting from human similarity judgments.
- Some Experiments on Clustering SimilarSentences of Texts
Identifying similar text passages or sentences plays an important role in many of the text mining applications. The proposal is based on some experiments [14] on clustering similar sentences of texts in the documents. The proposed framework is based on an incremental and unsupervised clustering method which is combined with statistical similarity metrics to measure the semantic distance between sentences [15]. It aims at identifying sets of highly semantically-related sentences from a collection of documents. The Sentence Splitting is performed by a textual-segmentation tool called SENTER, which is based on a list of abbreviations and some sentence delimiters.
- A Clustering Algorithm for Text Summarization using Expectation Maximization
Nowadays, large amount of data is available in the form of texts. It is very difficult for human beings to manually find out useful and significant data[16]. This problem can be solved with the help of text summarization algorithms. Text Summarization is the process of condensing the input text file into shorter version by preserving its overall content and meaning. This paper is about called text summarization using natural language processing. The proposal describes a system, which consists of two steps. In first step, we are implementing the phases of natural language processing that are splitting, tokenization, part of speech tagging, and parsing [17]. In second step, we are implementing Expectation Maximization (EM) Clustering Algorithm[18] to find out sentence similarity between the sentences. Based on the value of sentences similarity, we can easily summarize text.
E. Fuzzy Relational Clustering FRECCA
A fuzzy relational clustering approach is used to produce clusters with sentences, where each of them corresponds to some content. The output of clustering indicates the strength of the association among the data elements. Andrew Skabar and KhaledAbdalgader [1] proposed a novel fuzzy relational clustering algorithm called FRECCA (Fuzzy Relational Eigen Vector Centrality based Clustering Algorithm). The algorithm involves the following steps.
Initialization: cluster membership values are initialized randomly, and normalized. Mixing coefficients are initialized.
Expectation: Calculates the Page Rank value for each object in each cluster.
Maximization: Updating the mixing coefficients based on membership values calculated in the Expectation Step.
- HFRECCA
The general idea of the Hierarchical fuzzy clustering is the partitioning of the data items into a collection of clusters. The data points are assigned membership values for each of the clusters. Many existing clustering techniques have difficulties in handling extreme outliers but fuzzy clustering algorithms tend to give them very small membership degree in surrounding clusters. This algorithm is an extension of fuzzy relational clustering algorithm. An expectation–maximization (EM) algorithm is an iterative process, in which the model mainly depends on some unobserved latent/hidden variables. This algorithm is particularly used in finding maximum likelihood estimates of parameters. The EM algorithm’s iteration alternates between performing an expectation (E) Step; this creates a function to compute the cluster membership probabilities and maximization (M) step, in which these probabilities are then used to reestimate the parameters. These parameter estimates are then used to find out the distribution of the latent variables in the next E step.
PROPOSED WORK
In this wok, we analyze how one can take advantage of the efficiency and stability of clusters, when the data to be clustered are available in the form of similarity relationships between pairs of objects. More precisely, we propose a new Hierarchical fuzzy relational clustering algorithm, based on the existing fuzzy C-means (FCM) algorithm, which does not require any restriction on the relation matrix. This HRECCA algorithm is applied for the clustering of the text data which is present in the form of xml (hierarchical) files. HFRECCA will give the output as clusters which are grouped from text data which is present in a given documents. In this HFRECCA algorithm, Page Rank algorithm is used as similarity measure.
A.Page Rank
We describe the application of the algorithm to data sets, and show that our algorithm performs better than other fuzzy clustering algorithms. In the proposed algorithm, we describe the use of Page Rank and use the Gaussian mixture model approach. Page Rank is used as a graph centrality measure. Page Rank algorithm is used to determine the importance of a particular node within a graph. Importance of node is used as a measure of centrality. This algorithm assigns numerical score (from 0 to 1) to every node in graph. This score is known as Page Rank Score. Sentence is represented by node on a graph and edges are weighted with value representing similarity between sentences. Page Rank can be used within the Expectation- Maximization algorithm to optimize the parameter values and to formulate the clusters. A graph representation of data objects is used in along with the Page Rank algorithm. It operates within an Expectation-Maximization; it is a framework which is a general purpose method for learning knowledge from the incomplete data. Each sentence in a document is represented by a node in the directed graph and the objects with weights indicate the object similarity.
B,EM algorithm
It is an unsupervised method, which does not need any training phase; it tries to find the parameters of the probability distribution that has the maximum likelihood of its parameters. Its main role is to parameter estimation. It is an iterative method, which is mainly used to finding the maximum likelihood parameters of the model. The E-step involves the computation of cluster membership probabilities. The probabilities calculated from E-step are reestimated with the parameters in M-step.
C.APFRSEC
The Main idea of aggregated probabilistic Fuzzy Relational sentence level Expectation maximization clustering Algorithm a clustering algorithm is used in this work to implement the separation of information among the various subsystems, which are organized into a Parallel Collaborative Structure, PCS. Each of these subsystems may contain information related with particular aspects of the system or merely collaborates to the performance of f(x). A PCS structure with n sub models fuzzy systems is depicted in Fig. 1. Each fuzzy system model ihas two outputs: an output variable yiand the correspondent fuzzy system relevance Ri (x)
1
1
This fuzzy system architecture describes the strength of mind collaboration among the different fuzzy models
1
Therefore, the output of the SLIM model is the integral of the individual contributions of each fuzzy subsystem:
1
f(X) = ------(1)