Web Mining Pattern Discovery

Paul George
CSE 8331

Submitted to: Dr. M. Dunham

Department of Computer Science

Southern Methodist University, Texas, USA

TABLE OF CONTENTS

ABSTRACT

INTRODUCTION

Web Content Mining

Agent-Based Approach

Intelligent Search Agents

Information Filtering/Categorization

Personalized Web Agents

Database Approach

Multilevel Databases

Web Query Systems

Overview of Crawlers

Personalization

Web Usage Mining

PATTERN DISCOVERY

Preprocessing Tasks

Data Cleaning

Transaction Identification

Discovery Techniques

Path Analysis

Association Rules

Clustering and Classification

Sequential Patterns

WEB USAGE MINING ARCHITECTURE – WEBMINER

Tools

BENEFITS

APPLICATIONS

RESEARCH AREAS

CONCLUSION

BIBLIOGRAPHY

ABSTRACT

Web mining has been, and continues to be, the focus of many research papers. Web mining can be classified into three categories: Web content mining, the process of discovering information from the various resources available on the WWW; Web structure mining, the process of discovering knowledge from the interconnections of hypertext documents; and Web usage mining, the process of pattern discovery and analysis. In this paper, Web content mining approaches are briefly discussed, and Web usage mining is concentrated upon in the area of pattern discovery. I have also listed some applications where Web mining could be applied, and WEBMINER, a system used for Web usage mining, is covered briefly. In the end, I conclude by listing the issues and research directions with respect to the topics covered.

INTRODUCTION

First, let us define data mining. Data mining can be defined as finding hidden information in a database; it is also described as exploratory data analysis, data-driven discovery, and deductive learning [1].

Web mining is data mining applied to the World Wide Web, i.e. the mining of data related to the World Wide Web.

The Web data can be any of the following:

  • Web page content
  • HTML/XML code
  • Automatically generated data stored in server access logs, referrer logs, and cookies residing on the client
  • E-commerce transaction data

When data mining is applied to the Web, it can perform several functions like:

  • Information extraction: this deals with acquiring and interpreting useful information from the Web data, which may lead to business intelligence.
  • Resource discovery: this is the discovery of the locations of unfamiliar files on the network, which may or may not be relevant.
  • Generalization: this relates to the discovery of information patterns [4].

Web mining can be classified as shown in Fig. 1 below: it is divided into Web content mining, Web structure mining, and Web usage mining. In the following sections of this paper, I discuss the approaches used in Web content mining, cover Web structure mining very briefly, and concentrate on pattern discovery in Web usage mining.

Fig. 1: Web mining classification [3].

Web Content Mining

Web content mining can be thought of as a process of information or resource discovery from resources on WWW [5].

There are two approaches in Web content mining:

  • Agent-based
  • Database approaches.

Agent-Based Approach

The agent-based approach involves artificial intelligence systems that can act autonomously or semi-autonomously on behalf of a particular user, to discover and organize Web-based information [7].

Agent-based Web mining systems can be classified into three categories:

Intelligent Search Agents
Many intelligent search agents, e.g. Harvest [10] and FAQ-Finder [11], are available that search for relevant information by using domain-specific characteristics to interpret and organize the discovered information [7].

The agents mentioned above rely on documents that carry domain-specific information, or on “hard coded models of the information sources to retrieve and interpret documents” [7].

Information Filtering/Categorization
Agents in this category use information retrieval (IR) techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize those documents.
For example, HyPursuit [13] uses semantic information embedded in link structures as well as document content to create cluster hierarchies of hypertext documents and to structure an information space [7].

Personalized Web Agents
Another category of Web agents comprises those that learn user preferences, or obtain them from the WWW, and discover Web data sources that correspond to those preferences, and possibly to the preferences of other individuals with similar interests.
An example of such an agent is WebWatcher [14] [7].

Database Approach

The database approach focuses on integrating and organizing the heterogeneous and semi-structured data on the Web into more structured, higher-level collections of resources. Analysis can then be performed on these resources using standard querying mechanisms. Examples of such resources are relational or object-oriented databases [8].

Multilevel Databases
This idea has been proposed as a way to organize Web-based information.
According to this idea, the database is organized into hierarchies. The lowest level contains primitive semi-structured information stored in various Web repositories, such as hypertext documents, whereas as we go higher in the hierarchy the data become generalizations of the lower levels, organized in structured collections such as relational or object-oriented databases. For example, the ARANEUS system [13] extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts, which are generalizations of the notion of database views [8].

Web Query Systems
These systems attempt to utilize standard database query languages such as SQL, and even natural language processing, and to combine them with the types of queries used in World Wide Web searches. For example, W3QL [14] combines structure queries, based on the organization of hypertext documents, with content queries, based on information retrieval techniques [8].

Basic content mining is a type of text mining. It extends the work performed by basic search engines. It can also improve a traditional search engine through techniques such as analyzing the links between pages and making use of user profiles.

Overview of Crawlers

A crawler (also called a spider or robot) is a program that traverses the hypertext structure of the Web [1]. Crawlers are used in Web content mining by search engines.

Before explaining how a crawler works, let me define a seed URL.
Seed URLs: the page or pages with which the crawler starts.

How it works: starting from the seed URLs, the crawler records and saves all links in a queue. These new pages are in turn searched and their links saved. As these robots search the Web, they collect information about each page, such as keywords, and store it in indices for users of the associated search engine [1].
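
To make this crawling loop concrete, here is a minimal breadth-first crawler sketch in Python. The helpers fetch_page, extract_links, and index_page are hypothetical placeholders for a search engine's fetching, parsing, and indexing components; they are not part of any system cited in this paper.

from collections import deque

def crawl(seed_urls, fetch_page, extract_links, index_page, max_pages=1000):
    """Breadth-first crawl starting from the seed URLs (illustrative sketch only)."""
    queue = deque(seed_urls)          # links recorded and saved, waiting to be visited
    visited = set(seed_urls)          # avoid re-crawling the same page
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        page = fetch_page(url)        # hypothetical fetcher; assumed to return None on failure
        if page is None:
            continue
        index_page(url, page)         # store keywords/metadata in the search index
        crawled += 1
        for link in extract_links(page):   # hypothetical parser returning the URLs on the page
            if link not in visited:
                visited.add(link)
                queue.append(link)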

Types of crawlers

Periodic crawler: It is activated periodically. It may visit a certain number of pages and then stop, build an index, and replace the existing index.

Incremental crawler: It updates the existing index as opposed to replacing it.

Focused crawler: This visits pages related to topics of interest.

To show where crawlers fit within a search engine, the general search engine architecture [3], which is self-explanatory, is given below:

Fig. 2: Architecture of a search engine. [3]

Personalization

Personalization is another area in Web content mining. Here, the contents of a Web page are modified to better fit the desires of the user. The goal is to entice a current customer to purchase something he or she may not have thought about purchasing. Personalization includes techniques such as the use of cookies, the use of databases, and more complex methods [1].

Personalization can be viewed as type of [1]:

  • Classification: a user's behavior is determined based on the behavior associated with the class to which the user belongs.
  • Clustering: behavior is determined based on the users to whom he or she is found to be similar.
  • Prediction: used to predict what the user would like to see.

There are three types of Web page personalization [1]:

  • Manual techniques: these work by using the details of user registration preferences, or through the use of rules based on profiles, to classify individuals.
  • Collaborative filtering: here, personalization is achieved by recommending information that has previously been given high ratings by similar users (a small sketch follows this list).
  • Content-based filtering: this retrieves pages based on the similarity between them and a profile of the user.
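
As a rough illustration of the collaborative filtering item above, here is a minimal user-based sketch in Python. The rating data and the cosine-similarity scoring are invented for illustration; deployed recommenders use considerably more elaborate models.

import math

def cosine_similarity(r1, r2):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    norm1 = math.sqrt(sum(v * v for v in r1.values()))
    norm2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (norm1 * norm2)

def recommend(target_user, ratings, top_n=3):
    """Score items the target user has not seen, weighted by similar users' ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target_user:
            continue
        sim = cosine_similarity(ratings[target_user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[target_user]:        # only recommend unseen items
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical rating data: user -> {page: rating}
ratings = {
    "alice": {"/products": 5, "/news": 3},
    "bob":   {"/products": 4, "/news": 4, "/support": 5},
    "carol": {"/news": 2, "/support": 1},
}
print(recommend("alice", ratings))   # ['/support']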

Web Usage Mining

This is the process of discovering user access patterns (or user habits) from data that are automatically collected in daily access logs [4].

Web usage mining performs mining on web usage data or web logs.

A Web log is a listing of page reference data [1]. It is at times referred to as clickstream data, as each entry corresponds to a mouse click. Logs can be examined from two perspectives [1]:

  • Server: the log can be used to uncover information about the site where the Web service resides and to improve the design of the site.
  • Client: by evaluating a client's sequence of clicks, information about a user (or group of users) is detected. This information can be used to perform prefetching and caching of pages, and hence to make Web pages load easily and quickly.

Analyzing logs can help organizations market their products more effectively through promotional campaigns and advertisements. It can also provide important information on how to restructure a Web site for a more effective organizational presence [7].

Web usage mining can be divided into two parts: pattern discovery, which I cover in the following section, and pattern analysis [7].

Web log/Web usage mining can be used for the following [3]:

  • To enhance server performance
  • To improve Web-site navigation
  • To target customers for electronic commerce
  • To identify potential prime advertisement locations

PATTERN DISCOVERY

Having discussed the various uses of Web usage mining, we now turn to the issues in preprocessing data for mining that must be taken care of before the mining algorithms can be run. These include developing a model of access log data, developing techniques to clean/filter the raw data to eliminate outliers and/or irrelevant items, grouping individual page accesses into transactions, integrating various data sources such as user registration information, and specializing generic data mining algorithms to take advantage of the specific nature of access log data [7].

Preprocessing Tasks

This can be divided into two activities:

  • Data Cleaning
  • Transaction Identification

Data Cleaning

This is one of the most important activities, as the presence of irrelevant data during the analysis phase may yield wrong results. Elimination of irrelevant items can be reasonably accomplished by checking the suffix of the URL name. For instance, all log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG, and map can be removed [7].
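
A minimal sketch of this suffix check in Python, assuming the raw log has already been parsed into (timestamp, ip, url) tuples (that parsing step is a simplifying assumption here):

# Suffixes of files that are normally irrelevant for usage analysis;
# lower-casing the URL also covers GIF, JPEG, and JPG.
IRRELEVANT_SUFFIXES = (".gif", ".jpeg", ".jpg", ".map")

def clean_log(entries):
    """Drop log entries whose requested URL ends in an image or map suffix."""
    return [(ts, ip, url) for (ts, ip, url) in entries
            if not url.lower().endswith(IRRELEVANT_SUFFIXES)]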

It may happen that a page is accessed many times but is listed only once in the log (due to caching, for example). Recent methods that make use of cookies, cache busting, and explicit user registration try to overcome this problem. As detailed in [17], all of these methods have drawbacks: cookies can be deleted by the user, cache busting defeats the speed advantage that caching was created to provide and can be disabled, and user registration is voluntary, with users often providing false information. User identification is another problem. Machine names cannot be used, as it is possible for more than one user to be behind that name (here, the name refers to the IP address of the machine); for example, a company network may have a single IP address, yet several people may be accessing the same URL. [18] discusses an algorithm to check whether each incoming request is reachable from the pages already visited; if a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine [7]. Other methods for identifying users, such as the IP address, the browser agent, and temporal information, are discussed in [17].
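
These identification heuristics can be combined in a simple way. The sketch below approximates a user by the (IP address, browser agent) pair; as discussed above, this assumption still merges distinct people who share a proxy and a browser.

from collections import defaultdict

def group_by_user(entries):
    """Group parsed log entries (dicts with 'ip', 'agent', 'url', 'time') by an approximate user key."""
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry)
    return users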

Transaction Identification

The raw server log can be thought of in two ways: either as a single transaction of many page references, or as a set of many transactions, each consisting of a single page reference. Sequences of page references must be grouped into logical units representing Web transactions or user sessions. A user session is all of the page references made by a user during a single visit to a site. A transaction differs from a user session in that the size of a transaction can range from a single page reference to all of the page references in a user session, depending on the criteria used to identify transactions [7].

The goal of transaction identification is to create meaningful clusters of references for each user. Thus, the task can range from dividing a large transaction into multiple smaller ones to merging small transactions into larger ones. [19]

Two types of transactions are defined. The first type is navigation-content, where each transaction consists of a single content reference and all of the navigation references in the traversal path leading to that content reference. These transactions can be used to mine for path traversal patterns. The second type of transaction is content-only, which consists of all of the content references in a given user session. These transactions can be used to discover associations between the content pages of a site [19].

A given page reference is classified as either navigational or content, based on the time spent on the page. This kind of page typing is further delineated in [18], where various page types such as index pages, personal home pages, etc. are used in the discovery of user patterns. [19]

A general transaction definition can be found in [19]. It assumes that each transaction is made up of references from only one user, so arranging the log this way is a merge activity. Once the log has been arranged according to the general transaction definition, one of three divide modules can be applied to it: reference length, maximal forward reference, and time window.

The first two methods identify transactions based on a user browsing behavior model, which uses the concepts of navigation and content pages. According to this model, there are two ways of defining a transaction: one is to define a transaction as all of the navigation references up to and including each content reference for a given user; the second is to define a transaction as all of the content references in a user session. The time window approach is used as a benchmark to compare against the other two algorithms. [19]

The reference length module is based on the assumption that the amount of time a user spends on a page correlates with whether the page should be classified as a navigation or content page for that user. [19]

In the maximal forward reference module, each transaction is defined to be the set of pages in the path from the first page in the log up to the page before a backward reference is made. A new transaction is started when the next forward reference is made. “A forward reference is defined to be a page not already in the set of pages for the current transaction”; similarly, a backward reference is defined to be a page that is already contained in the set of pages for the current transaction. [19]
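
A simplified sketch of this idea in Python, assuming a single user's session is already available as an ordered list of page references (this is a condensed reading of the module, not the exact algorithm of [20]):

def mfr_transactions(session):
    """Split one user's ordered page references into maximal-forward-reference transactions.

    A backward reference (a page already in the current transaction) closes the
    current transaction; the path is then trimmed back to that page, and the next
    forward reference extends a new transaction from there.
    """
    transactions = []
    current = []
    for page in session:
        if page in current:                       # backward reference: close the current transaction
            transactions.append(current)
            current = current[: current.index(page) + 1]
        else:                                     # forward reference: extend the current path
            current.append(page)
    if current:
        transactions.append(current)
    return transactions

# Hypothetical session: A -> B -> C -> back to A -> D
print(mfr_transactions(["A", "B", "C", "A", "D"]))
# [['A', 'B', 'C'], ['A', 'D']]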

The time window transaction identification module “divides the log for a user into time intervals no larger than a specified parameter”. The three modules are discussed in detail in [19].
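
The time window module is simpler still. A minimal sketch, assuming each reference is a (timestamp_in_seconds, page) pair sorted by time and using an arbitrary 30-minute window for illustration:

def time_window_transactions(session, window_seconds=1800):
    """Divide one user's references into transactions no longer than the given time window."""
    transactions = []
    current = []
    window_start = None
    for timestamp, page in session:
        if window_start is None:
            window_start = timestamp
        if timestamp - window_start > window_seconds:   # window exceeded: start a new transaction
            transactions.append(current)
            current = []
            window_start = timestamp
        current.append((timestamp, page))
    if current:
        transactions.append(current)
    return transactions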

Discovery Techniques

Once the log has been divided into transactions, techniques can be applied to perform pattern mining. Some of them are listed below:

  • Path Analysis
  • Association Rules
  • Clustering and Classification
  • Sequential Patterns

Path Analysis

To perform pattern discovery, path analysis makes use of graphs. A graph represents some relation defined on Web pages. Thus, for a Web site, a graph may be drawn with Web pages as nodes and the hypertext links between pages as directed edges [7].

The navigation-content transactions of [19], the maximal forward reference transactions of [20], or the user sessions of [18], some of which have been briefly introduced above, can be used for path analysis. Path analysis can be used to determine the most frequently visited paths in a Web site. [7]
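
As a simple illustration, the most frequently visited paths can be found by counting complete traversal paths across transactions. The sketch below does this over invented transactions; a real analysis would operate on the transaction sets described above.

from collections import Counter

def frequent_paths(transactions, top_n=3):
    """Count traversal paths (ordered page lists) and return the most common ones."""
    return Counter(tuple(t) for t in transactions).most_common(top_n)

# Invented transactions for illustration
transactions = [
    ["/company", "/company/new", "/company/products", "/company/product1", "/company/product2"],
    ["/company", "/company/new", "/company/products", "/company/product1", "/company/product2"],
    ["/company", "/company/products"],
]
print(frequent_paths(transactions))   # the five-page path occurs twice, the short path once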

[18] discusses approaches to extracting structure from the Web that can be used to form higher-level abstractions which reduce the complexity and increase the richness of an overview. The approach discussed in [18] is based on graph-structure representations of similarity, or strength of association, and relations. Three types of graph structures are discussed. The first type represents the link topology of a Web locality by using arcs labeled with unit strengths to connect one node to another, where a node represents a page and an arc represents the hyperlink connecting the two pages. (A Web locality may be thought of as a “complex abstract space in which we have arranged Web pages of different functional categories or types”.) The second type of graph structure represents inter-page text content similarity by labeling the arcs connecting nodes with the computed text similarities between the corresponding Web pages. The third type represents the flow of users through the locality by labeling the arcs between two nodes with the number of users that go from one page to the other. [18]

Examples of information that can be discovered through path analysis are [7]:

  • 75% of clients who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products, and /company/product1: this suggests that there is useful information on the product2 page, but that the path to that page is not clear.
  • 65% of the clients left the site after four or fewer page references.
    This suggests that more information should be provided in the first four pages from the points of entry into the site.

Association Rules

The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence. An association rule is of the form X => Y, where X and Y are sets of items and X ∩ Y is empty. The support of an item (or set of items), denoted by s, is the percentage of transactions in which that item (or itemset) occurs. The confidence of an association rule X => Y, denoted by α, is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X. Confidence measures the strength of the rule, while support measures how often it occurs in the database. Applications include cross-marketing, catalog design, store layout, and customer segmentation based on buying patterns. Also, when dealing with Web mining, we would mostly be working with transaction logs rather than databases. [1][23]
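
A small worked sketch of these two measures over page-reference transactions (the four transactions below are invented for illustration):

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(X => Y) = support(X union Y) / support(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Invented content-only transactions: sets of URLs accessed in one visit
transactions = [
    {"/products", "/cart", "/checkout"},
    {"/products", "/news"},
    {"/products", "/cart"},
    {"/news"},
]
print(support({"/products", "/cart"}, transactions))       # 0.5
print(confidence({"/products"}, {"/cart"}, transactions))  # ~0.67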

Association rule discovery techniques are generally applied to databases of transactions where each transaction consists of a set of items. The difficulty here is to discover all associations and correlations among data items where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items. In the context of Web mining, this problem amounts to discovering the correlations among references to the various files available on the server by a given client. Each transaction consists of the set of URLs accessed by a client in one visit to the server. [7]