Designing and Mining Web Applications: a conceptual modeling approach

Rosa Meo

Dipartimento di Informatica, Università di Torino

Corso Svizzera, 185 - 10149 - Torino - Italy

E-mail:

Tel.: +39-011-6706817, Fax: +39-011-751603

Maristella Matera

Dipartimento di Elettronica e Informazione, Politecnico di Milano

P.zza L. da Vinci, 32 – 20133 – Milano – Italy

E-mail:

Tel.: +39-02-23993408, Fax: +39-02-23993411


ABSTRACT

In this Chapter we present the usage of a modeling language, WebML, for the design and the management of dynamic Web applications. WebML also eases the analysis of how users exploit the application contents, even when applications are dynamic. To this end, it relies on some special-purpose logs, called conceptual logs, generated by the application runtime engine.

In this Chapter we report on a case study about the analysis of conceptual logs, which demonstrates the effectiveness of WebML and of its conceptual modeling methods. The Web log analysis methodology is based on the data mining paradigm of itemsets and frequent patterns, and makes full use of constraints on the content of the conceptual logs. As a result, we obtained many patterns that are useful for application management, such as recurrent navigation paths, the most frequently visited page contents, and anomalies.

Keywords: WebML, Web Log Analysis, data mining, association rules, Web Usage Mining, KDD scenarios, Information Presentation, Computer-Assisted Auditing, Computer-Assisted Software Engineering – CASE, Information Analysis Techniques, Software Development Methodologies, Information System Design, Conceptual Design, Clickstream tracking

INTRODUCTION

In recent years the World Wide Web has become the preferred platform for developing Internet applications, thanks to its powerful communication paradigm, based on multimedia contents and browsing, and to its open architectural standards, which facilitate the integration of different types of content and systems (Fraternali, 1999).

Current Web applications are complex and highly sophisticated software products, whose quality, as perceived by users, can heavily determine their success or failure. A number of methods have been proposed for evaluating their effectiveness in content delivery. Content personalization, for instance, aims at tailoring Web contents to the final recipients according to their profiles. Another approach is the adoption of Web usage mining techniques for analyzing the navigational behaviour of Web users by discovering patterns in the Web server log.

Traditionally, to be effective, Web usage mining requires some additional pre-processing, such as the application of page annotation methods for the extraction of meta-data about page semantics, or the construction of a Web site ontology.

In this Chapter, we propose a novel approach to Web usage mining, which has the advantage of integrating Web usage mining goals directly into the Web application development process. Thanks to the adoption of a conceptual modelling method for Web application design, and of its supporting CASE tool, the generated Web applications embed a logging mechanism that, by means of a synchronization tool, is able to produce semantically enriched Web log files. This log, which we call conceptual log (Fraternali et al., 2003), contains additional information with respect to standard (ECLF) Web server logs, and some of this information is useful to the Web mining process. It records not only the composition of Web pages in terms of atomic units of content and the conceptual entities the pages deal with, but also the identifier of the user navigation session, the specific data instances that are published within dynamic pages, and some data concerning the topology of the hypertext. Therefore, no extra effort is needed during or after the application development for reconstructing and analyzing usage behaviour.
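To fix ideas, a conceptual log record could be modelled as in the following sketch. This is a hypothetical, simplified structure for illustration only (field and unit names are invented): the actual format produced by the WebRatio runtime differs, but the sketch shows the kind of schema-level information, absent from standard server logs, that a conceptual log carries per request.

```python
from dataclasses import dataclass, field

@dataclass
class UnitInstance:
    """One content unit computed in a page, with the data it published."""
    unit_id: str          # identifier of the content unit in the schema
    source_entity: str    # entity of the E/R schema the unit draws from
    instance_oids: list = field(default_factory=list)  # OIDs actually shown

@dataclass
class ConceptualLogEntry:
    """One page request, enriched with schema-level information."""
    session_id: str       # identifier of the user navigation session
    page_id: str          # page of the hypertext schema that was served
    units: list = field(default_factory=list)  # UnitInstance objects

# A hypothetical request for the Research Area page, publishing area 12
# and an index of its research topics
entry = ConceptualLogEntry(
    session_id="S42",
    page_id="ResearchArea",
    units=[
        UnitInstance("AreaData", "Research_Area", [12]),
        UnitInstance("TopicsIndex", "Research_Topic", [3, 7, 9]),
    ],
)

# Entities touched by the request, usable for entity-level usage statistics
entities = {u.source_entity for u in entry.units}
```

Because each record names the content units, entities and instance OIDs behind a page, usage statistics can be computed directly at the level of the conceptual schema, rather than reverse-engineered from URLs.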

The main contribution of this Chapter comes from two existing frameworks, which it integrates. The first one is the model-based design and development of Web applications based on the Web Modeling Language (WebML) (Ceri et al., 2000; Ceri et al., 2002) and its supporting CASE tool WebRatio (Ceri et al., 2003). The second one is the evaluation of applications by means of data mining techniques, which starts by collecting application data based both on the static (i.e., compile-time) analysis of conceptual schemas and on the dynamic (i.e., run-time) collection of usage data. The evaluation aims at studying the suitability of the application to respond to users' needs, by observing their most frequent paths, or by observing the application response in different contexts, often affected by network traffic conditions, determined by the users themselves (such as their browser), or even caused by security attacks.

The distinctive merit of WebML and WebRatio in this collection of application-specific data lies in the ease with which relevant data are retrieved, automatically organized and stored. However, the illustrated results are of general validity and apply to any application designed with a model-driven approach, provided that the conceptual schema is available and the application runtime architecture permits the collection of customized log data.

This Chapter presents a case study on the analysis of the conceptual Web log files of the Web site of a University Department. Our objective is to demonstrate the power and versatility of the conceptual modelling of data-intensive Web applications. The aim of our study is manifold: (i) analyzing the Web logs and extracting interesting, usable and actionable patterns; (ii) evaluating the usability (in practical cases) and the expressive power of the conceptual Web logs; (iii) verifying the suitability of some KDD scenarios. In particular, KDD scenarios have been produced as a set of characteristic data mining requests, a sort of templates to be filled in with specific parameter values. KDD scenarios should be able to solve some frequently asked questions (mining problems) posed by users/analysts (Web site administrators and/or information system designers), in order to recover from frequently occurring problems. KDD scenarios for some applications, such as Web mining and financial stock market analysis, have already been studied in (Meo et al., 2005b).
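A KDD scenario of this kind can be viewed as a parameterized frequent-itemset request over user sessions, where the template's parameters are the support threshold and a constraint on the itemset content. The following naive sketch (page names, thresholds and the constraint are illustrative assumptions, not taken from the case study) shows the basic mechanism of mining with a content constraint:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(sessions, min_support, max_size=2, must_contain=None):
    """Naive frequent-itemset miner over session transactions.

    sessions: list of sets of visited pages (one set per user session)
    min_support: minimum number of sessions an itemset must occur in
    must_contain: optional content constraint: keep only itemsets
                  that include this page
    """
    counts = Counter()
    for s in sessions:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(s), k):
                counts[frozenset(combo)] += 1
    result = {i: c for i, c in counts.items() if c >= min_support}
    if must_contain is not None:
        result = {i: c for i, c in result.items() if must_contain in i}
    return result

# Toy sessions over hypothetical page names
sessions = [
    {"Home", "ResearchArea", "Staff"},
    {"Home", "ResearchArea"},
    {"Home", "Teaching"},
]
# Itemsets supported by at least 2 sessions and involving ResearchArea
patterns = frequent_itemsets(sessions, min_support=2,
                             must_contain="ResearchArea")
```

Pushing the constraint into the request, as the KDD scenarios do, narrows the answer to the patterns the analyst actually asked about, instead of returning every frequent itemset.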

BACKGROUND

The majority of the public and shareware tools for the analysis of Web application usage are traffic analysers (see for example Analog, AWSD-WebLog and CAPE WebLogs). Their functionality is limited to producing reports about site traffic (e.g., number of visits, number of hits, page view time), diagnostic statistics (such as server errors and pages not found), referrer statistics (such as search engines accessing the application), and user and client statistics (such as user geographical region, Web browser and operating system). Only a few of them also track user sessions and present specific statistics about individual users' accesses.

A number of methods have been proposed also for evaluating Web application quality. In particular, Web usage mining methods are employed to analyze how users exploit the information provided by the Web site. For instance, they highlight the navigation patterns that correspond to high Web usage, or those that correspond to early leaving (Kohavi & Parekh, 2003). However, Web usage mining approaches rely heavily on the pre-processing of log data as a way to obtain high-level information regarding user navigation patterns, and to ground such information in the actual data underlying the Web application (Cooley, 2002; Facca & Lanzi, 2005; Srivastava et al., 2000).

Pre-processing generally includes four main steps: data cleaning, identification of user sessions, content and structure information retrieval (for mapping users' requests onto the actual information of visited pages) and data formatting. Notwithstanding the pre-processing effort, in most cases the extracted information is insufficient, and much of the knowledge embedded in the application design is lost. Even worse, such approaches, mostly based on Web structure mining, are ineffective on applications that create Web pages dynamically.
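Of the four steps, user session identification is commonly approximated with a timeout heuristic: one user's requests are split into sessions wherever the gap between consecutive timestamps exceeds an inactivity threshold. The sketch below illustrates only this heuristic (real pre-processing pipelines also handle caching, proxies and robot traffic):

```python
def split_sessions(requests, timeout=1800):
    """Group one user's (timestamp, url) requests into sessions.

    requests must be sorted by timestamp; a gap longer than `timeout`
    seconds (30 minutes is a common heuristic) starts a new session.
    """
    sessions, current = [], []
    last_ts = None
    for ts, url in requests:
        if last_ts is not None and ts - last_ts > timeout:
            sessions.append(current)   # close the session at the gap
            current = []
        current.append((ts, url))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

# Hypothetical request stream: the 4900-second gap opens a new session
requests = [(0, "/home"), (100, "/research"), (5000, "/home"), (5100, "/staff")]
sessions = split_sessions(requests)
```

Conceptual logs make this guesswork unnecessary, since the session identifier is recorded directly by the application runtime.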

In (Dai & Mobasher, 2002), the authors propose the use of ontologies to go beyond the classification of pages on the basis of the mere discovery of associations between pages and keywords. The approach uses complex structured objects to represent the items associated with the pages.

Some efforts have recently been undertaken for enriching Web log files using Semantic Web techniques. In (Oberle et al., 2003), the authors exploit RDF annotations of static pages for mapping page URLs onto a set of ontological entities. Within dynamic applications, the same mapping is achieved by analyzing the query strings enclosed within page URLs.

In (Jin et al., 2004) the authors observe that standard mining approaches, such as clustering of user sessions and discovering association rules or frequent navigational paths, do not generally provide the ability to automatically characterize or quantify the unobservable factors that lead to common navigational patterns. The reason is that the semantic relationships among users, as well as between users and Web objects, are generally "hidden", i.e., not available in currently generated Web logs. The authors therefore propose probabilistic latent semantic analysis (PLSA), with the aim of uncovering latent semantic associations among users and pages, based on the co-occurrence patterns of these pages in user sessions.

With respect to the previous works, the approach we present in this Chapter has the advantage of integrating Web usage mining goals directly into the Web application development process. In conceptual modelling, the semantic models of Web applications allow the specification of the application and of its data at an increased level of abstraction. The fundamental issue in the adopted methodology, as we will see throughout this Chapter, is the separation of the distinct tasks in the specification of a Web application: the structure of the information, designed in terms of data entities and of their logical relationships; the composition of pages in terms of content units; and the final presentation of pages and their collocation in the flow of the hypertext navigation performed by the user. This neat separation of roles among the various components of a Web application architecture, together with the clear reference to the actual objects to which the information content of each page refers, gives an enriched semantics to the obtained logs, which can be used immediately for mining, thus improving the overall application quality, its maintenance and the experience of users on the Web site. Therefore, no extra effort is needed for Web mining during or after the application development. Such an effort is instead required by other methods, for page annotation, for the extraction of meta-data about page semantics, or even for the construction of a Web site ontology.

The WebML Method for Web Application Development

In this section, we will shortly illustrate the main features of the adopted design model, WebML (Web Modeling Language), and of the rich logs that WebML-based applications are able to produce.

WebML (Web Modeling Language) is a conceptual model that provides a set of visual primitives for specifying the design of the information content and the hypertexts of data-intensive Web applications (Ceri et al., 2002). It is also complemented with a development methodology that, in line with other model-based development methods (Baresi et al., 2001; Gomez et al., 2001; Rossi et al., 2001), consists of different phases, centred on the definition and/or refinement of the application conceptual design. Thanks to the use of a CASE tool enabling automatic code generation (Ceri et al., 2003), at each iteration the conceptual design can be automatically transformed into a running prototype. This greatly facilitates evaluation activities from the early phases of development.

WebML consists of a Data Model and a Hypertext Model, for specifying respectively the content structure of a Web application and the organization and presentation of contents in one or more hypertexts.

The WebML Data Model allows designers to express the organization of data, through well-known notations (namely, the Entity-Relationship and UML class diagrams). For simplicity, in this Chapter, we will refer to the Entity-Relationship (E/R) model, which mainly consists of entities, defined as containers of data elements, and relationships, defined as semantic connections between entities.

The WebML Hypertext Model describes how contents, whose organization is specified in the data model, are published through elementary units, called content units, whose composition makes up pages. It also specifies how content units and pages are interconnected by links to constitute site views, i.e., the front-end hypertexts.

The WebML Hypertext Model includes:

-  The composition model, concerning the definition of pages and their internal organization in terms of elementary pieces of publishable content, the content units. Content units offer alternative ways of arranging contents dynamically extracted from entities and relationships of the data schema. The binding between the hypertext and the data schema is represented by the source entity and the selector of the content units. The source entity specifies the type of objects published by a content unit, by referencing an entity of the E/R schema. The selector is a filter condition over the instances of the source entity, which determines the actual objects published by the unit.

-  The navigation model, describing links between pages and content units that support information location and hypertext browsing. Links are represented as oriented arcs, and have the double role of enabling user navigation and transporting parameters needed for unit computation.

-  The content management model, consisting of a set of operation units specifying the creation, updating and deletion of content, and the interaction with external services.
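The binding between a content unit and the data schema, described above in terms of source entity and selector, can be sketched as follows. This is a toy in-memory model for illustration only (WebML units are actually specified visually and compiled into running code by WebRatio):

```python
# Toy "entity": instances of Research_Area as dicts carrying an OID
research_areas = [
    {"oid": 1, "title": "Databases"},
    {"oid": 2, "title": "Networks"},
]

def compute_unit(source_entity, selector):
    """Compute a content unit: publish the instances of the source
    entity that satisfy the selector (a filter condition), as in WebML."""
    return [inst for inst in source_entity if selector(inst)]

# Selector: equality of the instance OID with a link-transported parameter
selected_oid = 2  # e.g., carried by the incoming link
published = compute_unit(research_areas, lambda a: a["oid"] == selected_oid)
```

Here the source entity fixes the type of the published objects, the selector filters its instances, and the link parameter plays the role described in the navigation model above.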

Figure 1a) shows the visual specification for the page Research Area taken from the WebML schema of the application we will analyze later on in this Chapter. The page publishes the description of a University Department research area, and the list of the current research topics covered by the area.

Figure 1. A simplified WebML schema for the Research Area page in the DEI Web application.

Among others (we report here a simplified schema), the page includes two content units. The first one is a data unit, publishing some attributes (e.g., the title and the textual description) of a single instance of the Research_Area entity. The instance is retrieved from the database, according to a selector condition that allows selecting an area based on the equality of its OID with the OID of the area previously selected