Data Mining – Semantic Web Mining
Presented by: R.S.Ch.Sridevi, M.B.Sruthi, K.N.V.Bhavani
4th year B.Tech,
M.V.G.R. College of Engineering
Abstract –
This paper gives an overview of Semantic
Web Mining, Semantic Web languages, and
applications of the Semantic Web. Semantic Web Mining
aims at combining the two fast-developing research
areas Semantic Web and Web Mining. The idea is to
improve, on the one hand, the results of Web Mining
by exploiting the new semantic structures in the
Web; and to make use of Web Mining, on the other
hand, for building up the Semantic Web.
Keywords – Semantic Web, Web Mining, World
Wide Web, Metadata, RDF, Web Development.
Semantic Web Mining aims at combining the two
areas Semantic Web and Web Mining. This vision
follows our observation that trends converge in both
areas: increasing numbers of researchers work on
improving the results of Web Mining by exploiting (the
new) semantic structures in the Web, and make use of
Web Mining techniques for building the Semantic Web.
Last but not least, these techniques can be used for
mining the Semantic Web itself. The wording Semantic
Web Mining emphasizes this spectrum of possible
interaction between both research areas: it can be read
both as Semantic (Web Mining) and as (Semantic Web)
Mining.
In the past few years, there have been many
attempts at “breaking the syntax barrier” on the Web. A
number of them rely on the semantic information in text
corpora that is implicitly exploited by statistical
methods. Some methods also analyze the structural
characteristics of data; they profit from standardized
syntax like XML. In this paper, we concentrate on
markup and mining approaches that refer to an explicit
conceptualization of entities in the respective domain.
These relate the syntactic tokens to background
knowledge represented in a model with formal
semantics. When we use the term “semantic”, we thus
have in mind a formal logical model to represent
knowledge.
The aim of this paper is to give an overview of
where the two areas of Semantic Web and Web Mining
meet today. In our survey, we will first describe the
current state of the two areas and then discuss, using an
example, their combination, thereby outlining future
research topics. We will provide references to typical
approaches. Most of them have not been developed
explicitly to close the gap between the Semantic Web
and Web Mining, but they fit naturally into this scheme.
The Semantic Web was thought up by Tim
Berners-Lee, inventor of the WWW, URIs, HTTP, and
HTML. There is a dedicated team of people at the World
Wide Web consortium (W3C) working to improve,
extend and standardize the system, and many languages,
publications, tools and so on have already been
developed.
Today it is almost impossible to integrate
information that is spread over several Web or intranet
pages. Consider, e.g., a query for a data mining expert
in a company intranet, where the only explicit
information stored is, on the one hand, the relationships
between people and the projects they work on and, on the
other hand, the relationships between projects and the
topics they address. In that case, a skills management system should be
able to combine the information on the employees’ home
pages with the information on the projects’ home pages
in order to find the respective expert. To realize such
scenarios, metadata have to be interpreted and
appropriately combined by machines.
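The expert query above amounts to joining two relations. The following is a minimal sketch in Python, assuming the person-to-project and project-to-topic relationships are already available as machine-readable data; all names and values are invented for illustration.

```python
# Hypothetical intranet metadata (invented for this example).
works_on = {                      # person -> projects the person works on
    "alice": {"p1", "p2"},
    "bob": {"p3"},
}
addresses = {                     # project -> topics the project addresses
    "p1": {"data mining"},
    "p2": {"web design"},
    "p3": {"databases"},
}

def find_experts(topic, works_on, addresses):
    """Return people who work on at least one project addressing `topic`."""
    return sorted(
        person
        for person, projects in works_on.items()
        if any(topic in addresses.get(project, set()) for project in projects)
    )

experts = find_experts("data mining", works_on, addresses)
print(experts)
```

Neither relation mentions experts directly; the answer only appears once the two are combined, which is the step that machines cannot perform on unannotated pages.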
The process of building the Semantic Web is
still in its early stages, but the first standards, e.g. for the
underlying data model and an ontology language, have already
appeared. However, those structures now have to be filled with
life in applications. To make this task feasible, one
should start with the simpler tasks first. The following
steps show the direction in which the Semantic Web is heading:
1. providing a common syntax for
machine-understandable statements,
2. establishing common vocabularies,
3. agreeing on a logical language,
4. using the language for exchanging proofs.
2.1. Layers of Semantic Web
Fig 1. Layers of Semantic Web
The Semantic Web layers were suggested by
Berners-Lee. On the first two layers, a common syntax
is provided. Uniform resource identifiers (URIs) provide
a standard way to refer to entities, while Unicode is a
standard for exchanging symbols. On top of that sits
syntactic interoperability in the form of XML. The
Extensible Markup Language (XML) fixes a notation for
describing labeled trees, and XML Schema allows the
definition of grammars for valid XML documents. XML
documents can refer to different namespaces to make
explicit the context (and therefore meaning) of different
tags.
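The role of namespaces described above can be illustrated with a small Python sketch using the standard library XML parser. The document and both namespace URIs are placeholders invented for this example; the point is only that two tags with the same local name stay distinguishable.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are placeholders chosen for this example.
DOC = """
<record xmlns:book="http://example.org/book"
        xmlns:person="http://example.org/person">
  <book:title>Semantic Web Mining</book:title>
  <person:title>Dr.</person:title>
</record>
""".strip()

root = ET.fromstring(DOC)

# ElementTree expands each prefix to the form "{namespace-uri}local-name",
# so the two <title> tags are kept apart by their namespaces.
tags = [child.tag for child in root]
book_title = root.find("{http://example.org/book}title").text
person_title = root.find("{http://example.org/person}title").text
print(tags, book_title, person_title)
```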
The Resource Description Framework (RDF)
can be seen as the first layer where information becomes
machine-understandable: According to the W3C
recommendation, RDF “is a foundation for processing
metadata; it provides interoperability between
applications that exchange machine-understandable
information on the Web.”
RDF documents consist of three types of entities:
resources, properties, and statements. Resources may be
Web pages, parts or collections of Web pages, or any
objects which are not directly part of the WWW.
Properties are specific attributes, characteristics, or
relations describing resources. A resource together with
a property having a value for that resource forms an
RDF statement. A value is either a literal, a resource, or
another statement. Statements can thus be considered as
object–attribute–value triples.
Example of RDF statements: In Fig. 2, two of
the authors of the present paper are represented as
resources ‘URI-GST’ and ‘URI-AHO’. The statement on
the lower right consists of the resource ‘URI-AHO’ and
the property ‘cooperates-with’ with the value
‘URI-GST’. The resource ‘URI-SWMining’ has as value for the
property ‘title’ the literal ‘Semantic Web Mining’.
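The two statements of this example can be written down as triples. The sketch below uses plain Python tuples rather than any RDF library, and the helper `values_of` is a hypothetical name introduced only for illustration.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property
    obj: str        # the value: here a literal or a resource identifier

# The two statements mentioned in the text.
statements = [
    Statement("URI-AHO", "cooperates-with", "URI-GST"),
    Statement("URI-SWMining", "title", "Semantic Web Mining"),
]

def values_of(resource, prop, statements):
    """Return all values that `resource` has for property `prop`."""
    return [s.obj for s in statements
            if s.subject == resource and s.predicate == prop]

title = values_of("URI-SWMining", "title", statements)
print(title)
```

A real RDF store would additionally distinguish literals from resources and allow statements as values; the tuple form here only shows the triple structure.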
RDF Schema was designed to be a simple data
typing model for RDF. Using RDF Schema, we can
create properties and classes, and also define ranges
and domains for properties. In Fig. 2, ‘URI-SWMining’ is
an instance of the concept ‘Project’, and thus by
inheritance also of the concept ‘Top’. RDF Schema
offers further constructs: it allows us
to say that one class or property is a subclass or
subproperty of another, and its ranges and domains
let us say what classes the subject and object of each
property must belong to. RDF Schema also contains a
set of properties for annotating schemata (plural of
schema), providing comments, labels, and the like.
Fig 2. Example of RDF Statements
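The inheritance step in this example (an instance of ‘Project’ is also an instance of ‘Top’) can be sketched as a walk up the class hierarchy. This is a simplified illustration in Python, assuming each class has at most one direct superclass and the hierarchy has no cycles; it is not how an RDF Schema processor is required to work.

```python
# Class hierarchy and typing taken from the example in the text.
SUBCLASS_OF = {"Project": "Top"}            # class -> direct superclass
INSTANCE_OF = {"URI-SWMining": "Project"}   # resource -> direct class

def classes_of(resource):
    """Return the direct class of `resource` followed by all its ancestors."""
    result = []
    cls = INSTANCE_OF.get(resource)
    while cls is not None:          # assumes the hierarchy is acyclic
        result.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return result

print(classes_of("URI-SWMining"))
```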
The next layer is the ontology vocabulary. An
ontology is “an explicit formalization of a shared
understanding of a conceptualization”. This high-level
definition is realized differently by various research
communities and thereby in ontology representation
languages. However, most of these languages have a
certain understanding in common, as most of them
include a set of concepts, a hierarchy on them, and
relations between concepts. Some of them also include
axioms in some specific logic. We will discuss the most
prominent approaches in more detail in the next section.
Logic is the next layer according to Figure 1.
However, current research usually considers the
ontology and the logic levels together, as ontologies are
already based on logic and should allow for logical
axioms. By applying logical deduction, one can infer
new knowledge from the information which is stated
explicitly. For instance, the axiom saying that the
‘cooperates-with’ relation is symmetric (Figure 2) allows
one to infer that the person addressed by
‘URI-AHO’ is cooperating with the person addressed by
‘URI-GST’ although only the person “GST” specifies his
cooperation with the person “AHO”. The kind of
inference that is possible depends heavily on the logics
chosen.
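The symmetry axiom in this example can be applied mechanically. The following Python sketch assumes, as in the paragraph above, that only GST's cooperation with AHO is stated explicitly; the function name `symmetric_closure` is introduced here for illustration.

```python
# Properties declared symmetric by an axiom.
SYMMETRIC = {"cooperates-with"}

# Only this direction is stated explicitly.
stated = {("URI-GST", "cooperates-with", "URI-AHO")}

def symmetric_closure(triples, symmetric_props):
    """Add the reversed triple for every statement on a symmetric property."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in symmetric_props:
            inferred.add((o, p, s))
    return inferred

closure = symmetric_closure(stated, SYMMETRIC)
print(sorted(closure))
```

The reversed statement is never written down by anyone; it follows from the stated triple and the axiom alone.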
Proof and trust are the remaining layers. They
follow the understanding that it is important to be able to
check the validity of statements made in the (Semantic)
Web. Therefore the creators of statements should be able
to provide a proof which is verifiable by a machine. At
this level, it is not required that the machine of the reader
of the statements finds the proof itself; it ‘just’ has to
check the proof provided by the creator. These two
layers are rarely tackled in today’s research.
The Semantic Web should enable greater access not only
to content but also to services on the Web. Users and
software agents should be able to discover, invoke,
compose, and monitor Web resources offering particular
services and having particular properties, and should be
able to do so with a high degree of automation if desired.
Powerful tools should be enabled by service
descriptions, across the Web service lifecycle.
Ontologies: Languages and Tools. Any knowledge
representation mechanism can play the role of a
Semantic Web language. Frame Logic (or F–Logic) is
one candidate, since it provides a semantically founded
knowledge representation based on the frame-and-slot
metaphor. The most popular framework at the moment
is Description Logics (DL). DLs are subsets of first-order
logic which aim at being as expressive as possible
while still being decidable.
The description logic SHIQ provides the basis
for DAML+OIL, which, in turn, is the result of joining
the efforts of two projects. The DARPA Agent Markup
Language (DAML) was created as part of a research
programme started in August 2000 by DARPA, a US
governmental research organization. OIL (Ontology
Inference Layer) is an initiative funded by a European
Union programme. A revision of DAML+OIL has
been released as a W3C Recommendation under the
name OWL.
An important goal for Semantic Web markup
languages, then, is to establish a framework within
which these descriptions are made and shared. Web sites
should be able to employ a standard ontology, consisting
of a set of basic classes and properties, for declaring and
describing services, and the ontology structuring
mechanisms of OWL provide an appropriate, Web-
compatible representation language framework within
which to do this. The OWL-S ontology can be regarded as a language for
describing services, reflecting the fact that it provides a
standard vocabulary that can be used together with the
other aspects of the OWL description language to create
service descriptions.
OWL-S is intended to enable three kinds of tasks:
1. Automatic Web service discovery. Automatic
Web service discovery is an automated process
for location of Web services that can provide a
particular class of service capabilities, while
adhering to some client-specified constraints.
2. Automatic Web service invocation. Automatic
Web service invocation is the automatic
invocation of a Web service by a computer
program or agent, given only a declarative
description of that service, as opposed to when
the agent has been pre-programmed to be able to
call that particular service.
3. Automatic Web service composition and
interoperation. This task involves the automatic
selection, composition, and interoperation of
Web services to perform some complex task,
given a high-level description of an objective.
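The first task, discovery, can be pictured as filtering declarative service descriptions by capability and by client-specified constraints. The Python sketch below is a plain stand-in and does not use OWL-S syntax or any OWL-S tool; the class `ServiceDescription`, the function `discover`, the service names, and the `delivery_days` property are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceDescription:
    """Hypothetical, simplified stand-in for a declarative service description."""
    name: str
    capability: str                      # class of service offered
    properties: dict = field(default_factory=dict)

def discover(registry, capability, constraints):
    """Return names of services that offer `capability` and satisfy every
    constraint; `constraints` maps a property name to a predicate on its value."""
    matches = []
    for service in registry:
        if service.capability != capability:
            continue
        if all(key in service.properties and check(service.properties[key])
               for key, check in constraints.items()):
            matches.append(service.name)
    return matches

# Invented registry entries for illustration.
registry = [
    ServiceDescription("FastBooks", "book-selling", {"delivery_days": 2}),
    ServiceDescription("SlowBooks", "book-selling", {"delivery_days": 10}),
    ServiceDescription("CityCabs", "taxi-booking", {"delivery_days": 0}),
]

found = discover(registry, "book-selling", {"delivery_days": lambda d: d <= 3})
print(found)
```

Because the agent works only from the descriptions, it needs no code written in advance for any particular service, which is the contrast the second task draws with pre-programmed invocation.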
DAML is a language created by DARPA as an
ontology and inference language based upon RDF.
DAML takes RDF Schema a step further, by giving
us more in depth properties and classes. DAML
allows one to be even more expressive than with
RDF Schema, and brings us back on track with our
Semantic Web discussion by providing some simple
terms for creating inferences.
DAML provides us with a way of expressing things
such as inverses, unambiguous properties, unique
properties, lists, restrictions, cardinalities, pairwise
disjoint lists, data types, and so on.
One DAML construct that we shall run through is
the daml:inverseOf property. Using this property, we
can say that one property is the inverse of another.
The rdfs:range and rdfs:domain values of
daml:inverseOf are both rdf:Property. Here is an example
of daml:inverseOf being used:-
:hasName daml:inverseOf :isNameOf .
:Sean :hasName "Sean" .
"Sean" :isNameOf :Sean .
The second useful DAML construct that we shall
go through is the daml:UnambiguousProperty
class. Saying that a Property is a
daml:UnambiguousProperty means that if the
object of the property is the same, then the
subjects are equivalent. For example:-
foaf:mbox rdf:type daml:UnambiguousProperty .
:x foaf:mbox .
:y foaf:mbox .
implies that:-
:x daml:equivalentTo :y .
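The inferences licensed by the two DAML constructs discussed above, daml:inverseOf and daml:UnambiguousProperty, can be sketched in a few lines of Python. This is an illustration over plain tuples, not a DAML reasoner; the function names are introduced here, and the mailbox addresses are hypothetical values chosen for the example.

```python
from collections import defaultdict

def apply_inverses(triples, inverse_of):
    """For every (s, p, o) where p has a declared inverse q, add (o, q, s)."""
    both_ways = dict(inverse_of)
    both_ways.update({q: p for p, q in inverse_of.items()})
    inferred = set(triples)
    for s, p, o in triples:
        if p in both_ways:
            inferred.add((o, both_ways[p], s))
    return inferred

def equivalent_subjects(triples, unambiguous):
    """Group subjects that share a value for an unambiguous property."""
    by_value = defaultdict(set)
    for s, p, o in triples:
        if p in unambiguous:
            by_value[(p, o)].add(s)
    return sorted(sorted(group) for group in by_value.values() if len(group) > 1)

# daml:inverseOf, using the names from the example in the text.
name_facts = apply_inverses({(":Sean", ":hasName", "Sean")},
                            {":hasName": ":isNameOf"})

# daml:UnambiguousProperty; the mailbox values are hypothetical.
MBOX = "mailto:someone@example.org"
mbox_facts = [
    (":x", "foaf:mbox", MBOX),
    (":y", "foaf:mbox", MBOX),
    (":z", "foaf:mbox", "mailto:other@example.org"),
]
groups = equivalent_subjects(mbox_facts, {"foaf:mbox"})
print(sorted(name_facts), groups)
```

Here `:x` and `:y` end up in one group because they share a mailbox, while `:z` stays separate.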
DAML is only one of a series of languages
that are in use for this purpose.
Several tools are in use for the creation and
maintenance of ontologies and metadata, as well as for
reasoning within them. Ontoedit is an ontology editor
which is connected to Ontobroker, an inference engine
for F–Logic. It provides means for semantics-based
query handling over distributed resources. F–Logic has
also influenced the development of Triple, an inference
engine based on Horn logic, which allows the modelling
of features of UML, Topic Maps, or RDF Schema. It can
interact with other inference engines.
FaCT provides inference services for the
Description Logic SHIQ. Reasoning within SHIQ
and its relationship to DAML+OIL have also been discussed in the literature.
Reasoning is implemented in the FaCT inference engine,
which also underlies the ontology editor OilEd.
Saha is an annotation tool. It is used with a
web browser, and it supports collaborative distributed
creation of metadata by centrally storing annotations,
which can be viewed and edited by different annotators.
Saha supports collaborative annotation of web
documents and it can utilize ontology services for
sharing URIs and importing concepts defined in various
external ontologies. The tool is targeted especially for
creating metadata of web resources in semantic web
In Saha, our primary goal has not been the
automation of the annotation process, but rather to
support the creation of annotations that cannot be
produced automatically. Although requiring a lot of
work, such annotation can be seen as a collaborative
effort, comparable to the creation of different kinds of
Wikis. The basic architecture of Saha is depicted in
figure 3. It consists of the following functional parts: