Web Mining
Table of Contents
- Introduction
- Taxonomy of Web Mining
2.1 Web Content Mining
2.2 Web Usage Mining
- Pattern Discovery from Web Transactions
3.1 Preprocessing Tasks
3.2 Discovery Techniques
- Analysis of Discovered Patterns
- Web Usage Mining Architecture
- Research Directions
- Prospects
- Conclusion
1. Introduction
The purpose of Web mining is to develop methods and systems for discovering models of objects and processes on the World Wide Web and for web-based systems that show adaptive performance. Web mining integrates three parent areas: Data Mining (we use this term here also for the closely related areas of Machine Learning and Knowledge Discovery), Internet technology and the World Wide Web, and, more recently, the Semantic Web.
The World Wide Web has made an enormous amount of information electronically accessible. The use of email, news, and markup languages like HTML allows users to publish and read documents on a world-wide scale and to communicate via chat connections, including information in the form of images and voice records. The HTTP protocol that enables access to documents over the network via Web browsers created an immense improvement in communication and access to information. For some years these possibilities were used mostly in the scientific world, but recent years have seen an immense growth in popularity, supported by the wide availability of computers and broadband communication. The use of the Internet for tasks other than finding information and direct communication is increasing, as can be seen from the interest in "e-activities" such as e-commerce, e-learning, e-government, and e-science.
Independently of the development of the Internet, Data Mining expanded out of the academic world into industry. Methods and their potential became known outside the academic world, and commercial toolkits became available that allowed applications at an industrial scale. Numerous industrial applications have shown that models can be constructed from data for a wide variety of industrial problems. The World Wide Web is an interesting area for Data Mining because huge amounts of information are available. Data Mining methods can be used to analyze the behavior of individual users, access patterns of pages or sites, and properties of collections of documents.
Almost all standard data mining methods are designed for data that are organized as multiple "cases" that are comparable and can be viewed as instances of a single pattern, for example patients described by a fixed set of symptoms and diseases, applicants for loans, or customers of a shop. A "case" is typically described by a fixed set of features (or variables). Data on the Web have a different nature. They are not so easily comparable and have the form of free text, semi-structured text (lists, tables) often with images and hyperlinks, or server logs. The aim to learn models of documents has given rise to the interest in Text Mining methods for modeling documents in terms of their properties. Learning from the hyperlink structure has given rise to graph-based methods, and server logs are used to learn about user behavior.
Instead of searching for a document that matches keywords, it should be possible to combine information to answer questions. Instead of retrieving a plan for a trip to Hawaii, it should be possible to automatically construct a travel plan that satisfies certain goals and uses opportunities that arise dynamically. This gives rise to a wide range of challenges. Some of them concern the infrastructure, including the interoperability of systems and the languages for the exchange of information rather than data. Many challenges are in the area of knowledge representation, discovery and engineering. They include the extraction of knowledge from data and its representation in a form understandable by arbitrary parties, the intelligent questioning and the delivery of answers to problems as opposed to conventional queries, and the exploitation of formerly extracted knowledge in this process. The ambition of representing content in a way that can be understood and consumed by an arbitrary reader leads to issues in which cognitive sciences and even philosophy are involved, such as the understanding of an asset's intended meaning.
The Semantic Web proposes several additional innovative ideas to achieve this:
Standardized format.
The Semantic Web proposes standards for a uniform meta-level description language for representation formats. Besides acting as a basis for exchange, this language supports representation of knowledge at multiple levels. For example, text can be annotated with a formal representation of it. The natural language sentence "Amsterdam is the capital of the Netherlands", for instance, can be annotated such that the annotation formalizes knowledge that is implicit in the sentence: Amsterdam can be annotated as "city", Netherlands as "country", and the sentence with the structured "capital-of(Amsterdam, Netherlands)". Annotating textual documents (and also images and possibly audio and video) thus enables a combination of textual and formal representations of knowledge. A small step further is to store the annotated text items in a structured database or knowledge base.
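To make the annotation idea concrete, the following minimal Python sketch represents the example sentence as entity and relation annotations. The vocabulary used here ("City", "Country", "capital_of") is purely illustrative and does not correspond to any particular Semantic Web ontology.

```python
# Minimal sketch: representing the example sentence's annotations as
# structured data. Class and property names are illustrative assumptions.

# Entity annotations: surface text mapped to a class.
annotations = {
    "Amsterdam": "City",
    "Netherlands": "Country",
}

# Sentence-level annotation as a subject-predicate-object triple,
# mirroring "capital-of(Amsterdam, Netherlands)".
triples = [
    ("Amsterdam", "capital_of", "Netherlands"),
]

def describe(entity):
    """Combine entity and relation annotations into readable statements."""
    facts = [f"{entity} is a {annotations[entity]}"]
    facts += [f"{s} {p.replace('_', ' ')} {o}"
              for (s, p, o) in triples if s == entity]
    return facts

print(describe("Amsterdam"))
# ['Amsterdam is a City', 'Amsterdam capital of Netherlands']
```

Storing such triples in a structured database or knowledge base is then a small additional step, as noted above.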
Standardized vocabulary and knowledge.
The Semantic Web encourages and facilitates the formulation of shared vocabularies and shared knowledge in the form of ontologies: if knowledge about university courses is to be represented and shared, it is useful to define and use a common vocabulary and common basic knowledge. The Semantic Web aims to collect this in the form of ontologies and make them available for modeling new domains and activities. This means that a large amount of knowledge will be structured, formalized, and represented to enable automated access and use.
Shared services.
To realize the full Semantic Web, besides static structures, "Web services" are also foreseen. Services mediate between requests and applications and make it possible to automatically invoke applications that run on different systems.
The advent of the World Wide Web (WWW) has overwhelmed home computer users with an enormous flood of information. On almost any topic one can think of, one can find pieces of information that are made available by other Internet citizens, ranging from individual users who post an inventory of their record collection, to major companies that do business over the Web.
With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This describes the automatic search of information resources available online, i.e. Web content mining, and the discovery of user access patterns from Web servers, i.e. Web usage mining. In this paper we provide an overview of the tools, techniques, and problems associated with both dimensions. We present a taxonomy of Web mining and place various aspects of Web mining in their proper context. Increasingly sophisticated types of analyses are to be done on server-side data collections. These include integrating various data sources such as server access logs, referrer logs, and user registration or profile information; resolving difficulties in the identification of users due to missing unique key attributes in collected data; and the importance of identifying user sessions or transactions from usage data, site topologies, and models of user behavior. We devote the main part of this paper to the discussion of issues and problems that characterize Web usage mining. Furthermore, we survey some of the emerging tools and techniques and identify several future research directions.
Many of these systems are based on machine learning and data mining techniques. Just as data mining aims at discovering valuable information that is hidden in conventional databases, the emerging field of web mining aims at finding and extracting relevant information that is hidden in Web-related data, in particular in (hyper-)text documents published on the Web. Like data mining, web mining is a multi-disciplinary effort that draws techniques from fields like information retrieval, statistics, machine learning, natural language processing, and others.
2. Taxonomy of Web Mining
In this section we present a taxonomy of Web mining, i.e. Web content mining and Web usage mining. We also describe and categorize some of the recent work and the related tools or techniques in each area. This taxonomy is depicted in Figure 1.
2.1 Web Content Mining
The lack of structure that permeates the information sources on the World Wide Web makes automated discovery of Web-based information difficult. Traditional search engines such as Lycos, AltaVista, WebCrawler, ALIWEB, MetaCrawler, and others provide some comfort to users, but do not generally provide structural information, nor do they categorize, filter, or interpret documents. A recent study provides a comprehensive and statistically thorough comparative evaluation of the most popular search engines. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, and to extend data mining techniques to provide a higher level of organization for semi-structured data available on the Web. We summarize some of these efforts below:
2.1.1 Agent-Based Approach.
Generally, agent-based Web mining systems can be placed into the following three categories:
1. Intelligent Search Agents:
Several intelligent Web agents have been developed that search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information. Agents such as Harvest, FAQ Finder, Information Manifold, OCCAM, and ParaSite rely either on pre-specified domain information about particular types of documents, or on hard-coded models of the information sources to retrieve and interpret documents. Agents such as ShopBot and ILA (Internet Learning Agent) interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA learns models of various information sources and translates these into its own concept hierarchy.
2. Information Filtering/Categorization:
A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. BO (Bookmark Organizer) combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information.
3. Personalized Web Agents:
This category of Web agents learns user preferences and discovers Web information sources based on these preferences, and those of other individuals with similar interests (using collaborative filtering). A few recent examples of such agents include WebWatcher, PAINT, and Syskill & Webert. For example, Syskill & Webert utilizes a user profile and learns to rate Web pages of interest using a Bayesian classifier.
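The sketch below is not the actual Syskill & Webert system; it only illustrates the underlying idea of learning to rate pages from a user's interesting/not-interesting judgments with a naive Bayes text classifier. The example pages, labels, and the use of scikit-learn are assumptions made for illustration.

```python
# Illustrative sketch (not Syskill & Webert itself): learning a user's
# page preferences with a naive Bayes text classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: page text labelled by the user as
# interesting (1) or not interesting (0).
pages = [
    "machine learning tutorial with worked examples",
    "discount sports shoes online shop",
    "data mining lecture notes and exercises",
    "celebrity gossip and photos",
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)

model = MultinomialNB()
model.fit(X, labels)

# Rate an unseen page: probability that the user would find it interesting.
new_page = ["introduction to web usage mining"]
prob_interesting = model.predict_proba(vectorizer.transform(new_page))[0][1]
print(f"Predicted interest: {prob_interesting:.2f}")
```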
2.1.2 Database Approach.
Database approaches to Web mining have focused on techniques for organizing the semi-structured data on the Web into more structured collections of resources, and using standard database querying mechanisms and data mining techniques to analyze it.
Multilevel Databases:
The main idea behind this approach is that the lowest level of the database contains semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from lower levels and organized in structured collections, i.e. relational or object-oriented databases. For example, Han et al. use a multi-layered database where each layer is obtained via generalization and transformation operations performed on the lower layers. King & Novak propose the incremental integration of a portion of the schema from each information source, rather than relying on a global heterogeneous database schema. The ARANEUS system extracts relevant information from hypertext documents and integrates these into higher-level derived Web Hypertexts, which are generalizations of the notion of database views.
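As a rough illustration of the multilevel idea, the following Python sketch generalizes a handful of semi-structured page records (layer 0) into a small structured relation (layer 1). The attribute names and the generalization rule are invented for this example and are not taken from any of the systems mentioned above.

```python
# Sketch of a two-level "multilevel database": layer 0 holds semi-structured
# page records, layer 1 holds a generalization of them. Attribute names and
# the generalization rule are illustrative assumptions.
from collections import defaultdict

# Layer 0: raw, semi-structured records extracted from Web repositories.
layer0 = [
    {"url": "http://cs.example.edu/courses/db101.html", "topic": "databases", "words": 1200},
    {"url": "http://cs.example.edu/courses/ml201.html", "topic": "machine learning", "words": 900},
    {"url": "http://cs.example.edu/people/han.html", "topic": "faculty", "words": 400},
]

def generalize(records):
    """Generalize by first path segment (a crude document class) and aggregate."""
    groups = defaultdict(list)
    for r in records:
        doc_class = r["url"].split("/")[3]          # e.g. "courses", "people"
        groups[doc_class].append(r)
    # Layer 1: a structured relation that standard query tools can work with.
    return [
        {"doc_class": c, "num_pages": len(rs),
         "avg_words": sum(r["words"] for r in rs) / len(rs)}
        for c, rs in groups.items()
    ]

print(generalize(layer0))
```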
Web Query Systems:
Many Web-based query systems and languages utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for the queries that are used in World Wide Web searches. W3QL combines structure queries, based on the organization of hypertext documents, and content queries, based on information retrieval techniques. WebLog is a logic-based query language for restructuring extracts of information from Web information sources. TSIMMIS extracts data from heterogeneous and semi-structured information sources and correlates them to generate an integrated database representation of the extracted information.
2.2 Web Usage Mining
Web usage mining is the automatic discovery of user access patterns from Web servers. Organizations collect large volumes of data in their daily operations, generated automatically by Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via CGI scripts. Analyzing such data can help organizations determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things. It can also provide information on how to restructure a Web site to create a more effective organizational presence, and shed light on more effective management of workgroup communication and organizational infrastructure. For selling advertisements on the World Wide Web, analyzing user access patterns helps in targeting ads to specific groups of users.
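As a concrete starting point, the following sketch parses server access log entries written in the Common Log Format. The sample line is invented, and real deployments often use extended formats that also record the referrer and user agent.

```python
# Sketch: parsing Web server access log entries in the Common Log Format.
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_entry(line):
    """Return a dictionary of log fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    entry = m.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry

sample = '192.168.1.10 - - [10/Mar/1997:13:55:36 -0700] "GET /courses/db101.html HTTP/1.0" 200 2326'
print(parse_entry(sample))
```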
Most existing Web analysis tools provide mechanisms for reporting user activity in the servers and various forms of data filtering. Using such tools it is possible to determine the number of accesses to the server and to individual files, the times of visits, and the domain names and URLs of users. However, these tools are designed to handle low-to-moderate-traffic servers, and usually provide little or no analysis of data relationships among the accessed files and directories within the Web space. More sophisticated systems and techniques for discovery and analysis of patterns are now emerging. These tools can be placed into two main categories, as discussed below:
2.2.1 Pattern Discovery Tools.
The emerging tools for user pattern discovery use sophisticated techniques from AI, data mining, psychology, and information theory to mine for knowledge from collected data. For example, the WEBMINER system introduces a general architecture for Web usage mining. WEBMINER automatically discovers association rules and sequential patterns from server access logs. Other algorithms have been introduced for finding maximal forward references and large reference sequences. These can, in turn, be used to perform various types of user traversal path analysis, such as identifying the most traversed paths through a Web locality. Pirolli et al. use information foraging theory to combine path traversal patterns, Web page typing, and site topology information to categorize pages for easier access by users.
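The following simplified sketch illustrates the notion of a maximal forward reference: a user's traversal path is split into forward paths, each ended by a backward move to a page already on the path. It is a paraphrase of the idea for illustration, not the published algorithm.

```python
# Simplified sketch of extracting maximal forward references from a single
# user's traversal path. A backward move (revisiting a page already on the
# current path) ends one forward reference.
def maximal_forward_references(path):
    refs = []
    current = []          # pages on the current forward path
    extending = False     # True while the path is still moving forward
    for page in path:
        if page in current:
            # Backward reference: emit the forward path built so far, then
            # truncate back to the revisited page.
            if extending:
                refs.append(list(current))
            current = current[:current.index(page) + 1]
            extending = False
        else:
            current.append(page)
            extending = True
    if extending and current:
        refs.append(list(current))
    return refs

# Example traversal: A -> B -> C -> back to B -> D
print(maximal_forward_references(["A", "B", "C", "B", "D"]))
# [['A', 'B', 'C'], ['A', 'B', 'D']]
```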
2.2.2 Pattern Analysis Tools.
Once access patterns have been discovered, analysts need the appropriate tools and techniques to understand, visualize, and interpret these patterns, e.g. the WebViz system. Others have proposed using OLAP techniques such as data cubes for the purpose of simplifying the analysis of usage statistics from server access logs. The WEBMINER system proposes an SQL-like query mechanism for querying the discovered knowledge (in the form of association rules and sequential patterns). These techniques and others are further discussed in the subsequent sections.
3. Pattern Discovery from Web Transactions
As discussed in Section 2.2, analysis of how users are accessing a site is critical for determining effective marketing strategies and optimizing the logical structure of the Web site. Because of many unique characteristics of the client-server model in the World Wide Web, including differences between the physical topology of Web repositories and user access paths, and the difficulty in identification of unique users as well as user sessions or transactions, it is necessary to develop a new framework to enable the mining process. Specifically, there are a number of issues in preprocessing data for mining that must be addressed before the mining algorithms can be run. These include developing a model of access log data, developing techniques to clean/filter the raw data to eliminate outliers and/or irrelevant items, grouping individual page accesses into semantic units (i.e. transactions), integration of various data sources such as user registration information, and specializing generic data mining algorithms to take advantage of the specific nature of access log data.
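One common way to group a user's page accesses into transactions is a simple inactivity timeout, sketched below. The 30-minute threshold is a widely used heuristic rather than a value prescribed by any particular system, and the sample accesses are invented.

```python
# Sketch: grouping a single user's page accesses into sessions/transactions
# using a simple inactivity timeout.
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # common heuristic, not prescriptive

def sessionize(accesses):
    """accesses: list of (timestamp, url) tuples for one user, in time order."""
    sessions, current = [], []
    last_time = None
    for ts, url in accesses:
        if last_time is not None and ts - last_time > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(url)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

accesses = [
    (datetime(1997, 3, 10, 13, 55), "/index.html"),
    (datetime(1997, 3, 10, 13, 57), "/courses/db101.html"),
    (datetime(1997, 3, 10, 15, 10), "/index.html"),   # > 30 min gap: new session
]
print(sessionize(accesses))
# [['/index.html', '/courses/db101.html'], ['/index.html']]
```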
3.1 Preprocessing Tasks
The first preprocessing task is data cleaning. Techniques to clean a server log to eliminate irrelevant items are of importance for any type of Web log analysis, not just data mining. The discovered associations or reported statistics are only useful if the data represented in the server log gives an accurate picture of the user accesses of the Web site. Elimination of irrelevant items can be reasonably accomplished by checking the suffix of the URL name. For instance, all log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG, and map can be removed.
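A minimal sketch of this suffix-based cleaning step, assuming log entries have already been parsed into dictionaries as in the earlier log-parsing sketch:

```python
# Sketch: removing log entries for embedded image/map files by checking the
# URL suffix, as described above.
IRRELEVANT_SUFFIXES = (".gif", ".jpeg", ".jpg", ".map")

def clean_log(entries):
    """Keep only entries whose URL does not end in an image/map suffix."""
    return [e for e in entries
            if not e["url"].lower().endswith(IRRELEVANT_SUFFIXES)]

entries = [
    {"url": "/index.html"},
    {"url": "/images/logo.GIF"},    # removed: image request, not a page view
    {"url": "/courses/db101.html"},
]
print(clean_log(entries))
# [{'url': '/index.html'}, {'url': '/courses/db101.html'}]
```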
A related but much harder problem is determining if there are important accesses that are not recorded in the access log. Mechanisms such as local caches and proxy servers can severely distort the overall picture of user traversals through a Web site. Current methods to try to overcome this problem include the use of cookies, cache busting, and explicit user registration. However, none of these methods is without serious drawbacks. Cookies can be deleted by the user, cache busting defeats the speed advantage that caching was created to provide and can be disabled, and user registration is voluntary and users often provide false information. Methods for dealing with the caching problem include using site topology or referrer logs, along with temporal information, to infer missing references.
Another problem associated with proxy servers is that of user identification. Use of a machine name to uniquely identify users can result in several users being erroneously grouped together as one user. One proposed algorithm checks to see if each incoming request is reachable from the pages already visited. If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine. In another approach, user session lengths determined automatically based on navigation patterns are used to identify users. Other heuristics involve using a combination of IP address, machine name, browser agent, and temporal information to identify users.
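The reachability heuristic described above can be sketched as follows. The site topology and request sequence are invented, and the attribution rule is a simplification of the published approaches.

```python
# Sketch of the reachability heuristic: requests from one machine are split
# into separate users whenever a requested page is not linked from any page
# that a candidate user has already visited. Topology is an invented example.
site_links = {
    "/index.html": {"/courses.html", "/people.html"},
    "/courses.html": {"/courses/db101.html"},
    "/people.html": {"/people/han.html"},
}

def identify_users(requests, links):
    """requests: ordered list of URLs from a single IP/machine name."""
    users = []  # each user is the list of pages attributed to them
    for url in requests:
        for visited in users:
            # Attribute the request to the first user who could have
            # followed a hyperlink to this page (or is revisiting it).
            if url in visited or any(url in links.get(p, set()) for p in visited):
                visited.append(url)
                break
        else:
            # Not reachable from any existing user's pages: assume a new user.
            users.append([url])
    return users

reqs = ["/index.html", "/courses.html", "/people/han.html", "/courses/db101.html"]
print(identify_users(reqs, site_links))
# [['/index.html', '/courses.html', '/courses/db101.html'], ['/people/han.html']]
```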