A Web Is a Collection of Inter-Related Files on One or More Web Servers

In customer relationship management (CRM), Web mining is the integration of information gathered by traditional data mining methodologies and techniques with information gathered over the World Wide Web. (Mining means extracting something useful or valuable from a baser substance, such as mining gold from the earth.) Web mining is used to understand customer behavior, evaluate the effectiveness of a particular Web site, and help quantify the success of a marketing campaign.

Web mining allows you to look for patterns in data through content mining, structure mining, and usage mining. Content mining is used to examine data collected by search engines and Web spiders. Structure mining is used to examine data related to the structure of a particular Web site. Usage mining is used to examine data related to a particular user's browser, as well as data gathered by forms the user may have submitted during Web transactions.

The information gathered through Web mining is evaluated (sometimes with the aid of software graphing applications) by using traditional data mining parameters such as clustering and classification, association, and examination of sequential patterns.

Web Mining

A Web is a collection of inter-related files on one or more Web servers.

Web mining is the application of data mining techniques to extract knowledge from Web data.

Web data is

- Web content: text, image, records, etc.
- Web structure: hyperlinks, tags, etc.
- Web usage: HTTP logs, app server logs, etc.

Web Mining – History

- Term first used in [E1996], defined in a task oriented manner
- Alternate ‘data oriented’ definition
- 1st panel discussion at ICTAI, 1997 [SM1997]
- Continuing forum
  - WebKDD workshops with ACM SIGKDD, 1999, 2000, 2001, 2002, …; 60–90 attendees
  - SIAM Web analytics workshop, 2001, 2002, …
  - Special issues of DMKD journal, SIGKDD Explorations
  - Papers in various data mining conferences & journals
  - Surveys [MBNL 1999, BL 1999, KB2000]

Pre-processing Web Data

- Web Content
  Extract “snippets” from a Web document that represent the Web document
- Web Structure
  Identify interesting graph patterns, or pre-process the whole Web graph to come up with metrics such as PageRank
- Web Usage
  User identification, session creation, robot detection and filtering, and extracting usage path patterns
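The Web usage pre-processing steps listed above (user identification, robot filtering, session creation) can be sketched as follows. This is only a minimal illustration under stated assumptions: a user is approximated by the (IP address, user agent) pair, robots are recognized by a few illustrative substrings in the agent string, and a 30-minute inactivity gap starts a new session. None of these choices come from the slides; the record layout and function names are hypothetical.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; a commonly used heuristic, assumed here
ROBOT_MARKERS = ("bot", "crawler", "spider")  # illustrative substrings only


def is_robot(user_agent):
    """Flag a request as robot traffic if its agent string contains a marker."""
    agent = user_agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def build_sessions(records, timeout=SESSION_TIMEOUT):
    """Group (ip, user_agent, timestamp, url) records into per-user sessions.

    A user is approximated by the (ip, user_agent) pair. A new session starts
    whenever the gap between consecutive requests exceeds `timeout` seconds.
    """
    by_user = defaultdict(list)
    for ip, agent, ts, url in records:
        if is_robot(agent):
            continue  # robot detection and filtering
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by timestamp
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:
                sessions.append((user, current))  # close the previous session
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user, current))
    return sessions


# Hypothetical sample records: (ip, user_agent, seconds since start, url)
records = [
    ("10.0.0.1", "Mozilla/5.0", 0, "/index.html"),
    ("10.0.0.1", "Mozilla/5.0", 60, "/products.html"),
    ("10.0.0.2", "ExampleBot/1.0", 90, "/index.html"),
    ("10.0.0.1", "Mozilla/5.0", 4000, "/index.html"),
]
for user, path in build_sessions(records):
    print(user[0], path)
```

Each resulting session is an ordered list of URLs, which is the usage path that later pattern extraction would work on.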

Common Mining Techniques

The more basic and popular data mining techniques include:

- Classification
- Clustering
- Associations

Other significant ideas:

- Topic identification, tracking and drift analysis
- Concept hierarchy creation
- Relevance of content
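To make the "Associations" technique listed above concrete, here is a small sketch that derives page-to-page association rules from visit sessions, using the usual support and confidence measures. The session data, the function name `pair_rules`, and the 0.5 support threshold are assumptions chosen for illustration. Only rules between pairs of pages are considered.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for page pairs.

    support    = fraction of sessions containing both pages
    confidence = sessions containing both / sessions containing the antecedent
    """
    n = len(sessions)
    item_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))  # count each page once per session
        item_counts.update(pages)
        pair_counts.update(combinations(pages, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)


# Hypothetical sessions: each is the list of pages one visitor requested
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
for a, b, support, confidence in pair_rules(sessions):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this sample, every session that contains `/cart` also contains `/products`, so the rule `/cart -> /products` has confidence 1.0 with support 0.5.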

Web Content Mining Applications

- Identify the topics represented by a Web document
- Categorize Web documents
- Find Web pages across different servers that are similar
- Applications related to relevance
  - Queries: enhance standard query relevance with user, role, and/or task based relevance
  - Recommendations: list of top “n” relevant documents in a collection or portion of a collection
  - Filters: show/hide documents based on relevance score
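The recommendation and filter ideas above (a top "n" list and a show/hide threshold on a relevance score) can be illustrated with a small sketch. It assumes TF-IDF weighting with cosine similarity as the relevance score, which is one common choice and not necessarily what the slides had in mind. The tokenizer is a plain whitespace split, and all names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf so that every term has a positive weight
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [
        {t: c * idf[t] for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n(query, docs, n=2, min_score=0.0):
    """Return up to n (doc_index, score) pairs with score above min_score."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the collection get weight 0
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > min_score]  # the "filter"
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, round(s, 3)) for s, i in scored[:n]]


docs = [
    "web usage mining of server logs",
    "web structure mining of hyperlinks",
    "cooking pasta at home",
]
print(top_n("server logs", docs))
print(top_n("web mining", docs))
```

Raising `min_score` hides weakly related documents, while `n` caps the length of the recommendation list.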

What is Web Usage Mining?

A Web is a collection of inter-related files on one or more Web servers.

Web Usage Mining

Discovery of meaningful patterns from data generated by client-server transactions on one or more Web localities

Typical Sources of Data

- Automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies
- User profiles
- Meta data: page attributes, content attributes, usage data
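As an illustration of the first data source above, the following sketch parses one server access log line into a record. It assumes the widely used Common/Combined Log Format. The slides do not name a log format, so the regular expression, field names, and sample line are assumptions.

```python
import re
from datetime import datetime

# Common Log Format, with the optional referrer and user-agent fields
# of the Combined Log Format at the end.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Parse one access log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample line in Combined Log Format
sample = (
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326 '
    '"http://example.com/start.html" "Mozilla/4.08"'
)
record = parse_line(sample)
print(record["host"], record["path"], record["status"], record["referrer"])
```

The host, timestamp, requested path, referrer, and agent fields are the raw material for user identification and session creation.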

Conclusions

Web structure is a useful source for extracting information such as:

- Quality of Web page
  - The authority of a page on a topic
  - Ranking of Web pages
- Interesting Web structures
  - Graph patterns like co-citation, social choice, complete bipartite graphs, etc.
- Web page classification
  - Classifying Web pages according to various topics
- Which pages to crawl
  - Deciding which Web pages to add to the collection of Web pages
- Finding related pages
  - Given one relevant page, find all related pages
- Detection of duplicated pages
  - Detection of near-mirror sites to eliminate duplication
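As one illustration of how link structure can yield a ranking of Web pages, here is a minimal PageRank sketch using power iteration. It assumes a damping factor of 0.85 and a fixed number of iterations, and it spreads the rank of pages without out-links evenly over all pages. The tiny link graph and the function name are hypothetical, and this is a simplified textbook formulation rather than any production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a link graph.

    links: {page: [pages it links to]}. Returns {page: score}; scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: D links into the site but nothing links to D
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this sample graph, page C collects links from A, B, and D and ends up ranked highest, while D, which has no incoming links, ranks lowest.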