Data Mining – Semantic Web Mining
Presented by: R.S.Ch. Sridevi, M.B. Sruthi, K.N.V. Bhavani
4th yr B.Tech,
M.V.G.R. College of Engineering
Abstract –
This paper gives an overview of Semantic Web Mining, Semantic Web languages, and applications of the Semantic Web. Semantic Web Mining aims at combining the two fast-developing research areas of Semantic Web and Web Mining. The idea is, on the one hand, to improve the results of Web Mining by exploiting the new semantic structures in the Web, and, on the other hand, to make use of Web Mining for building up the Semantic Web.
Keywords – Semantic Web, Web Mining, World Wide Web, Metadata, RDF, Web Development.
I INTRODUCTION
Semantic Web Mining aims at combining the two areas of Semantic Web and Web Mining. This vision follows our observation that trends in both areas converge: increasing numbers of researchers work on improving the results of Web Mining by exploiting the new semantic structures in the Web, and on making use of Web Mining techniques for building the Semantic Web. Last but not least, these techniques can be used for mining the Semantic Web itself. The term "Semantic Web Mining" emphasizes this spectrum of possible interactions between the two research areas: it can be read both as Semantic (Web Mining) and as (Semantic Web) Mining.
In the past few years, there have been many attempts at "breaking the syntax barrier" on the Web. A number of them rely on the semantic information in text corpora that is implicitly exploited by statistical methods. Some methods also analyze the structural characteristics of data; they profit from standardized syntax such as XML. In this paper, we concentrate on markup and mining approaches that refer to an explicit conceptualization of entities in the respective domain. These relate the syntactic tokens to background knowledge represented in a model with formal semantics. When we use the term "semantic", we thus have in mind a formal logical model for representing knowledge.
The aim of this paper is to give an overview of
where the two areas of Semantic Web and Web Mining
meet today. In our survey, we will first describe the
current state of the two areas and then discuss, using an
example, their combination, thereby outlining future
research topics. We will provide references to typical
approaches. Most of them have not been developed
explicitly to close the gap between the Semantic Web
and Web Mining, but they fit naturally into this scheme.
II SEMANTIC WEB
The Semantic Web was conceived by Tim Berners-Lee, inventor of the WWW, URIs, HTTP, and HTML. A dedicated team of people at the World Wide Web Consortium (W3C) is working to improve, extend, and standardize the system, and many languages, publications, tools, and so on have already been developed.
Today it is almost impossible to integrate information that is spread over several Web or intranet pages. Consider, e.g., the query for a data mining expert in a company intranet, where the only explicitly stored information is the relationships between people and the projects they work on, on the one hand, and between projects and the topics they address, on the other. In that case, a skills management system should be able to combine the information on the employees' home pages with the information on the projects' home pages in order to find the respective expert. To realize such scenarios, metadata have to be interpreted and appropriately combined by machines.
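The following is a minimal sketch of this scenario using Python and the rdflib library; all URIs, names, and the 'worksIn'/'addresses' properties are hypothetical illustrations, not part of any standard vocabulary.

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/intranet#")
g = Graph()

# Explicitly stored information: who works in which project,
# and which topic each project addresses.
g.add((EX.alice, EX.worksIn, EX.projectX))
g.add((EX.projectX, EX.addresses, EX.DataMining))

# A skills management system combines both relations to find the expert.
query = """
    PREFIX ex: <http://example.org/intranet#>
    SELECT ?person WHERE {
        ?person ex:worksIn ?project .
        ?project ex:addresses ex:DataMining .
    }
"""
for row in g.query(query):
    print(row.person)  # -> http://example.org/intranet#alice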
The process of building the Semantic Web is still in its early stages, but the first standards, e.g. for the underlying data model and for an ontology language, have already appeared. However, these structures must now be filled with life in applications. To make this task feasible, one should start with the simpler tasks first. The following steps show the direction in which the Semantic Web is heading:
1. providing a common syntax for machine-understandable statements,
2. establishing common vocabularies,
3. agreeing on a logical language,
4. using the language for exchanging proofs.
2.1. Layers of the Semantic Web
Fig. 1. Layers of the Semantic Web
The Semantic Web layers were suggested by Berners-Lee. On the first two layers, a common syntax is provided. Uniform Resource Identifiers (URIs) provide a standard way to refer to entities, while Unicode is a standard for exchanging symbols. On top of that sits syntactic interoperability in the form of XML. The Extensible Markup Language (XML) fixes a notation for describing labeled trees, and XML Schema allows the definition of grammars for valid XML documents. XML documents can refer to different namespaces to make explicit the context (and therefore the meaning) of different tags.
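As a small sketch of how namespaces disambiguate tags, the following Python fragment uses the standard xml.etree.ElementTree module; the 'proj' namespace URI is made up for illustration.

import xml.etree.ElementTree as ET

doc = (
    '<proj:project xmlns:proj="http://example.org/projects">'
    '<proj:title>Semantic Web Mining</proj:title>'
    '</proj:project>'
)
root = ET.fromstring(doc)
# ElementTree expands each tag to {namespace-uri}localname, so the same
# tag name in different namespaces remains distinguishable.
print(root.tag)  # {http://example.org/projects}project
print(root.find("{http://example.org/projects}title").text)  # Semantic Web Mining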
The Resource Description Framework (RDF) can be seen as the first layer on which information becomes machine-understandable: according to the W3C recommendation, RDF "is a foundation for processing metadata; it provides interoperability between applications that exchange machine-understandable information on the Web."
RDF documents consist of three types of entities: resources, properties, and statements. Resources may be Web pages, parts or collections of Web pages, or any objects that are not directly part of the WWW. Properties are specific attributes, characteristics, or relations describing resources. A resource together with a property having a value for that resource forms an RDF statement. A value is either a literal, a resource, or another statement. Statements can thus be considered as object–attribute–value triples.
Example of RDF statements: In Fig. 2, two authors are represented as the resources 'URI-GST' and 'URI-AHO'. The statement on the lower right consists of the resource 'URI-AHO' and the property 'cooperates-with' with the value 'URI-GST'. The resource 'URI-SWMining' has, as value for the property 'title', the literal 'Semantic Web Mining'.
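These statements can be written down directly; the following Python sketch uses rdflib, with an assumed base URI, since the figure gives only the fragment names.

from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/fig2#")
g = Graph()

# 'URI-AHO' cooperates-with 'URI-GST'
g.add((EX["URI-AHO"], EX["cooperates-with"], EX["URI-GST"]))
# 'URI-SWMining' has the literal 'Semantic Web Mining' as its title
g.add((EX["URI-SWMining"], EX.title, Literal("Semantic Web Mining")))

print(g.serialize(format="turtle"))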
RDF Schema was designed as a simple data-typing model for RDF. Using RDF Schema, we can create properties and classes, as well as do slightly more advanced things such as defining ranges and domains for properties. In Fig. 2, 'URI-SWMining' is an instance of the concept 'Project', and thus by inheritance also of the concept 'Top'. RDF Schema provides additional properties: it allows us to say that one class or property is a subclass or subproperty of another, and its range and domain constructs let us state which classes the subject and object of each property must belong to. RDF Schema also contains a set of properties for annotating schemata, providing comments, labels, and the like. A sketch of such a schema is given after Fig. 2.
Fig. 2. Example of RDF statements
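A minimal sketch of the corresponding schema in rdflib follows; class and property names are taken from Fig. 2, while the namespace URI and the 'Person' class are assumed.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/fig2#")
g = Graph()

g.add((EX.Project, RDFS.subClassOf, EX.Top))       # class hierarchy
g.add((EX["URI-SWMining"], RDF.type, EX.Project))  # instance of Project

# Domain and range restrict which classes the subject and object
# of a property must belong to.
g.add((EX["cooperates-with"], RDFS.domain, EX.Person))
g.add((EX["cooperates-with"], RDFS.range, EX.Person))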
The next layer is the ontology vocabulary. An ontology is "an explicit formalization of a shared understanding of a conceptualization". This high-level definition is realized differently by various research communities, and hence in different ontology representation languages. However, most of these languages have a certain core in common: most of them include a set of concepts, a hierarchy on them, and relations between concepts. Some of them also include axioms in some specific logic. We will discuss the most prominent approaches in more detail in the next section.
Logic is the next layer according to Figure 1. However, research nowadays usually considers the ontology and logic levels together, as ontologies are already based on logic and should allow for logical axioms. By applying logical deduction, one can infer new knowledge from the information that is stated explicitly. For instance, the axiom stating that the 'cooperates-with' relation is symmetric (Figure 2) allows one to infer that the person addressed by 'URI-AHO' is cooperating with the person addressed by 'URI-GST', although only the person 'GST' explicitly specifies his cooperation with the person 'AHO'. The kind of inference that is possible depends heavily on the logic chosen.
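The following sketch shows this symmetry inference in rdflib; in practice an OWL reasoner would derive it from a symmetric-property axiom, but here the symmetric closure is computed by hand, and the URIs are illustrative.

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/fig2#")
g = Graph()
coop = EX["cooperates-with"]

# Only 'GST' states his cooperation with 'AHO' explicitly.
g.add((EX["URI-GST"], coop, EX["URI-AHO"]))

# Symmetric closure: every stated triple entails its mirror image.
for s, p, o in list(g.triples((None, coop, None))):
    g.add((o, p, s))

# The reverse direction can now be queried as well.
print((EX["URI-AHO"], coop, EX["URI-GST"]) in g)  # True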
Proof and trust are the remaining layers. They follow the understanding that it is important to be able to check the validity of statements made in the (Semantic) Web. Therefore, the creators of statements should be able to provide a proof that is verifiable by a machine. At this level, it is not required that the reader's machine find the proof itself; it 'just' has to check the proof provided by the creator. These two layers are rarely tackled in today's research.
III LANGUAGES & TOOLS
The Semantic Web should enable greater access not only
to content but also to services on the Web. Users and
software agents should be able to discover, invoke,
compose, and monitor Web resources offering particular
services and having particular properties, and should be
able to do so with a high degree of automation if desired.
Service descriptions should enable powerful tools across the entire Web service lifecycle.
Ontologies: Languages and Tools. Any knowledge representation mechanism can play the role of a Semantic Web language. Frame Logic (or F-Logic) is one candidate, since it provides a semantically founded knowledge representation based on the frame-and-slot metaphor. The most popular framework at the moment is Description Logics (DLs). DLs are subsets of first-order logic which aim at being as expressive as possible while still being decidable.
The description logic SHIQ provides the basis for DAML+OIL, which, in turn, is the result of joining the efforts of two projects: the DARPA Agent Markup Language (DAML) was created as part of a research programme started in August 2000 by DARPA, a US governmental research organization, while OIL (the Ontology Inference Layer) is an initiative funded by a European Union programme. The successor of DAML+OIL has been released as a W3C Recommendation under the name OWL.
OWL:
An important goal for Semantic Web markup languages, then, is to establish a framework within which such descriptions are made and shared. Web sites should be able to employ a standard ontology, consisting of a set of basic classes and properties, for declaring and describing services, and the ontology-structuring mechanisms of OWL provide an appropriate, Web-compatible representation framework within which to do this. OWL-S is an ontology that serves as a language for describing services, reflecting the fact that it provides a standard vocabulary that can be used together with the other aspects of the OWL description language to create service descriptions.
OWL-S supports three kinds of tasks (a sketch of the first follows the list):
1. Automatic Web service discovery. Automatic Web service discovery is an automated process for locating Web services that can provide a particular class of service capabilities, while adhering to some client-specified constraints.
2. Automatic Web service invocation. Automatic Web service invocation is the automatic invocation of a Web service by a computer program or agent, given only a declarative description of that service, as opposed to the agent having been pre-programmed to be able to call that particular service.
3. Automatic Web service composition and interoperation. This task involves the automatic selection, composition, and interoperation of Web services to perform some complex task, given a high-level description of an objective.
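To make the discovery task concrete, here is a hypothetical Python sketch of capability-based matching: services advertise a capability class, and a request matches any service whose capability is the requested class or a subclass of it. The vocabulary is invented for illustration; real OWL-S uses a richer service profile and an OWL reasoner rather than this manual walk.

from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/services#")
g = Graph()

g.add((EX.BookSelling, RDFS.subClassOf, EX.Selling))
g.add((EX.shopA, EX.providesCapability, EX.BookSelling))
g.add((EX.shopB, EX.providesCapability, EX.CarRepair))

def discover(graph, requested):
    # Return services whose capability is `requested` or a direct subclass.
    matches = set()
    for service, _, cap in graph.triples((None, EX.providesCapability, None)):
        if cap == requested or (cap, RDFS.subClassOf, requested) in graph:
            matches.add(service)
    return matches

print(discover(g, EX.Selling))  # {EX.shopA}: BookSelling is a kind of Selling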
DAML:
DAML is a language created by DARPA as an ontology and inference language based upon RDF. DAML takes RDF Schema a step further by giving us more in-depth properties and classes. DAML allows one to be even more expressive than with RDF Schema, and provides some simple terms for creating inferences.
DAML+OIL
DAML+OIL provides a method of saying things such as inverses, unambiguous properties, unique properties, lists, restrictions, cardinalities, pairwise disjoint lists, datatypes, and so on.
One DAML construct that we shall run through is the daml:inverseOf property. Using this property, we can say that one property is the inverse of another. The rdfs:range and rdfs:domain values of daml:inverseOf are rdf:Property. Here is an example of daml:inverseOf being used:
:hasName daml:inverseOf :isNameOf .
:Sean :hasName "Sean" .
From the first two statements, a reasoner can infer the third:
"Sean" :isNameOf :Sean .
The second useful DAML construct that we shall go through is the daml:UnambiguousProperty class. Saying that a property is a daml:UnambiguousProperty means that if the object of the property is the same, then the subjects are equivalent. For example (the shared mailbox URI is illustrative):
foaf:mbox rdf:type daml:UnambiguousProperty .
:x foaf:mbox <mailto:sean@example.org> .
:y foaf:mbox <mailto:sean@example.org> .
implies that :x daml:equivalentTo :y .
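The following Python sketch applies both rules over an rdflib graph; the URIs are illustrative, and the DAML vocabulary itself is historical (owl:inverseOf and owl:InverseFunctionalProperty are its OWL successors).

from rdflib import Graph, Namespace, Literal, URIRef

EX = Namespace("http://example.org/people#")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
g = Graph()

g.add((EX.Sean, EX.hasName, Literal("Sean")))
g.add((EX.x, FOAF.mbox, URIRef("mailto:sean@example.org")))
g.add((EX.y, FOAF.mbox, URIRef("mailto:sean@example.org")))

# daml:inverseOf -- (s, hasName, o) entails (o, isNameOf, s). Since the
# entailed triple would have a literal subject, which standard RDF
# disallows, we only report the entailment instead of adding it.
for s, _, o in g.triples((None, EX.hasName, None)):
    print(o, "isNameOf", s)

# daml:UnambiguousProperty -- subjects sharing an mbox value are equivalent.
subjects_by_mbox = {}
for s, _, o in g.triples((None, FOAF.mbox, None)):
    subjects_by_mbox.setdefault(o, set()).add(s)
for subjects in subjects_by_mbox.values():
    if len(subjects) > 1:
        print("equivalent:", subjects)  # {EX.x, EX.y}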
DAML is only one in a series of languages that are being used for this purpose.
TOOLS:
Several tools are in use for the creation and maintenance of ontologies and metadata, as well as for reasoning with them. OntoEdit is an ontology editor that is connected to Ontobroker, an inference engine for F-Logic. It provides means for semantics-based query handling over distributed resources. F-Logic has also influenced the development of Triple, an inference engine based on Horn logic, which allows the modelling of features of UML, Topic Maps, or RDF Schema. It can interact with other inference engines, for example FaCT or RACER.
FaCT provides inference services for the description logic SHIQ; reasoning within SHIQ and its relationship to DAML+OIL have been discussed in the literature. This reasoning is implemented in the FaCT inference engine, which also underlies the ontology editor OilEd.
SAHA:
SAHA is an annotation tool. It is used with a web browser, and it supports collaborative, distributed creation of metadata by centrally storing annotations, which can be viewed and edited by different annotators. SAHA supports collaborative annotation of web documents, and it can utilize ontology services for sharing URIs and importing concepts defined in various external ontologies. The tool is targeted especially at creating metadata about web resources in Semantic Web portals.
In SAHA, the primary goal has not been the automation of the annotation process, but rather to support the creation of annotations that cannot be produced automatically. Although it requires a lot of work, such annotation can be seen as a collaborative effort, comparable to the creation of different kinds of wikis. The basic architecture of SAHA is depicted in Figure 3. It consists of the following functional parts: