PRESENTATION ON
DATA MINING & WAREHOUSING
ABSTRACT
In today's fast-paced, information-based economy, companies must be able to integrate vast amounts of heterogeneous data and applications from disparate sources in order to support strategic IT initiatives such as Business Intelligence, Business Process Management, Business Process Reengineering, Business Activity Monitoring, and Business Performance Management. Integration software has accordingly continued to evolve to make the integration process easier to learn and use, faster to implement and maintain, and able to operate at the best possible performance; in other words, simply faster integration.
Relational database management systems (RDBMSs) are designed to store data according to the most efficient method of data cataloging, namely that defined by mathematical set theory as expressed in the relational paradigm. In many cases, however, the most efficient method for cataloging data is not the most efficient method for storing and retrieving it. Relational databases do well where the data is most naturally managed as flat lists with simple data types and few associations with data in other lists. When data must be kept in complex interdependent structures, or must be rapidly retrieved by following paths of associations rather than by walking down simple lists, the relational database begins to show its limitations: multiple-index management and traversal, and complex normalized schema structures. These impediments, along with limits in row length or table size, can in some cases be so profound that an RDBMS must be regarded as impractical for certain data management tasks. Although leading RDBMS vendors have introduced features that let their products support data outside the relational paradigm, the fundamental means of managing and accessing such data remains relational and, for the most part, SQL-based. This will continue to make RDBMS products unnecessarily difficult to set up and manage, and too inefficient, for some kinds of databases.
An Introduction to Data Mining
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"
This paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment, along with a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users. The past two decades have seen a dramatic increase in the amount of information, or data, being stored in electronic format. This accumulation of data has taken place at an explosive rate.
Figure 1: The data explosion and the growing base of data.
Data storage became easier as large amounts of computing power became available at low cost; as the cost of processing power and storage fell, data became cheap to keep.
An Architecture for Data Mining
To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on.
The term data mining has been stretched beyond its limits to apply to any form of data analysis. Some of the numerous definitions of Data Mining, or Knowledge Discovery in Databases are:
Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies.
Data mining is the search for relationships and global patterns that exist in large databases but are `hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database.
The following diagram summarizes some of the stages and processes identified in data mining and knowledge discovery.
The phases depicted start with the raw data and finish with the extracted knowledge, acquired through the following stages:
· Selection: Selecting or segmenting the data according to some criteria, e.g. all those people who own a car; in this way subsets of the data can be determined.
· Preprocessing: This is the data cleansing stage, where information deemed unnecessary is removed because it may slow down queries; for example, it is unnecessary to note the sex of a patient when studying pregnancy. The data is also reconfigured to ensure a consistent format, since data drawn from several sources may be encoded inconsistently, e.g. sex may be recorded as f or m in one source and as 1 or 0 in another.
· Transformation: The data is not merely transferred across but transformed, in that overlays may be added, such as the demographic overlays commonly used in market research. The data is made usable and navigable.
· Data mining: This stage is concerned with the extraction of patterns from the data. A pattern can be defined as follows: given a set of facts (data) F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with certainty c, such that S is simpler in some sense than the enumeration of all the facts in Fs.
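The selection and preprocessing stages above can be sketched in a few lines of Python. This is a minimal illustration only; the record layout, field names, and the gender-code mapping are invented for the example, not drawn from any real system.

```python
# Sketch of the selection and preprocessing stages of knowledge discovery.
# Field names and codings are illustrative assumptions.

def select(records, predicate):
    """Selection: keep only the records matching some criterion."""
    return [r for r in records if predicate(r)]

def preprocess(records):
    """Preprocessing: drop unnecessary fields and unify inconsistent codes
    drawn from several sources (e.g. sex recorded as f/m or as 0/1)."""
    sex_map = {"f": "f", "m": "m", "0": "f", "1": "m", 0: "f", 1: "m"}
    cleaned = []
    for r in records:
        r = dict(r)                      # don't mutate the source record
        r.pop("irrelevant_field", None)  # remove data deemed unnecessary
        r["sex"] = sex_map[r["sex"]]     # reconfigure to a consistent format
        cleaned.append(r)
    return cleaned

raw = [
    {"name": "A", "sex": 1, "owns_car": True, "irrelevant_field": "x"},
    {"name": "B", "sex": "f", "owns_car": False},
]
car_owners = select(raw, lambda r: r["owns_car"])   # segment: car owners only
clean = preprocess(car_owners)
print(clean)   # [{'name': 'A', 'sex': 'm', 'owns_car': True}]
```

In a real pipeline each stage would read from and write to the staging area of the warehouse rather than in-memory lists, but the division of labour is the same.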
Applications of Data Mining
Data mining has many and varied fields of application some of which are listed below.
1. Retail/Marketing
· Identify buying patterns from customers
· Find associations among customer demographic characteristics
· Market basket analysis
2. Banking
· Detect patterns of fraudulent credit card use
· Identify `loyal' customers
· Predict customers likely to change their credit card affiliation
· Determine credit card spending by customer groups
3. Insurance and Health Care
· Claims analysis, i.e. which medical procedures are claimed together
· Predict which customers will buy new policies
· Identify behaviour patterns of risky customers
4. Medicine
· Characterise patient behaviour to predict office visits
· Identify successful medical therapies for different illnesses
Data Mining Functions
Data mining methods may be classified by the function they perform or by the class of application in which they can be used. Some of the main techniques used in data mining are:
1. Classification
Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.
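The split between predicting and predicted attributes can be made concrete with a toy learner. The one-rule scheme below (map each value of a single predicting attribute to the most frequent class seen with it) is a stand-in for illustration, not any particular product's algorithm, and the attribute names are invented.

```python
# Toy supervised classification: one predicted attribute (the class),
# the rest predicting. Attribute names are illustrative assumptions.
from collections import Counter, defaultdict

def learn_one_rule(tuples, predicting, predicted):
    """Map each value of the predicting attribute to the most frequent
    value of the predicted attribute observed with it."""
    by_value = defaultdict(Counter)
    for t in tuples:
        by_value[t[predicting]][t[predicted]] += 1
    return {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}

data = [
    {"income": "high", "risk": "good"},
    {"income": "high", "risk": "good"},
    {"income": "low",  "risk": "bad"},
]
model = learn_one_rule(data, predicting="income", predicted="risk")
print(model["low"])   # bad
```

Real classifiers (decision trees, neural networks, and so on) infer far richer models, but all share this shape: predicting attributes in, a predicted class out.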
2. Associations:
Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records that returns affinities or patterns existing among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on the opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.
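The confidence factor of such a rule is simply the fraction of records containing the left-hand items that also contain the right-hand items. A minimal sketch, with invented transactions for illustration:

```python
# Confidence of an association rule lhs -> rhs over a set of records.
# confidence = |records containing lhs and rhs| / |records containing lhs|

def confidence(records, lhs, rhs):
    containing_lhs = [r for r in records if lhs <= r]   # lhs is a subset of r
    if not containing_lhs:
        return 0.0
    containing_both = [r for r in containing_lhs if rhs <= r]
    return len(containing_both) / len(containing_lhs)

transactions = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C"},
    {"D", "E"},
]
# 3 records contain {A, B, C}; 2 of those also contain {D, E}.
print(confidence(transactions, {"A", "B", "C"}, {"D", "E"}))   # 2/3
```

Production association-mining algorithms also prune by support (how often the items occur at all) so that only rules backed by enough records are reported.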
Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users' ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute-force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. Data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.
Data Warehousing
Introduction
When your strategy is deep and far-reaching, then what you gain by your calculations is much, so you can win before you even fight. When your strategic thinking is shallow and near-sighted, then what you gain by your calculations is little, so you lose before you do battle. Much strategy prevails over little strategy, so those with no strategy can only be defeated. So it is said that victorious warriors win first and then go to war, while defeated warriors go to war first and then seek to win.

It is obvious to anyone who culls through the voluminous information technology (I/T) literature, attends industry seminars, user group meetings or expositions, reads the ever-accelerating new product announcements of I/T vendors, or listens to the advice of industry gurus and analysts, that a handful of subjects overwhelmingly dominate I/T industry attention as we move into the late 1990s.
Why we need Data Warehousing
Data mining potential can be enhanced if the appropriate data has been collected and stored in a data warehouse. A data warehouse is a relational database management system (RDBMS) designed specifically to meet the needs of decision support rather than transaction processing. It can be loosely defined as any centralized data repository which can be queried for business benefit, but this will be more clearly defined later.
Data warehousing is a powerful new technique that makes it possible to extract archived operational data and overcome inconsistencies between different legacy data formats. As well as integrating data throughout an enterprise, regardless of location, format, or communication requirements, it is possible to incorporate additional or expert information. The data warehouse is the logical link between what managers see in their decision support and EIS applications and the company's operational activities.
In other words, the data warehouse provides data that is already transformed and summarized, making it an appropriate environment for more efficient DSS and EIS applications.
Characteristics of A Data Warehouse
According to Bill Inmon, author of Building the Data Warehouse and widely considered the originator of the data warehousing concept, there are generally four characteristics that describe a data warehouse:
· Subject-Oriented: Data are organized according to subject instead of application; e.g. an insurance company using a data warehouse would organize its data by customer, premium, and claim, instead of by different products (auto, life, etc.). The data organized by subject contain only the information necessary for decision support processing.
· Integrated: When data resides in many separate applications in the operational environment, encoding of data is often inconsistent. For instance, in one application gender might be coded as "m" and "f", in another as 0 and 1. When data are moved from the operational environment into the data warehouse, they assume a consistent coding convention, e.g. gender data is transformed to "m" and "f".
· Time-Variant: The data warehouse contains a place for storing data that are five to ten years old, or older, to be used for comparisons, trends, and forecasting. These data are not updated.
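The time-variant property can be sketched as an append-only store: each load stamps the rows with their snapshot date, and earlier snapshots are never updated in place. The table layout and field names below are invented for illustration.

```python
# Sketch of the time-variant property of a warehouse: snapshots are
# appended with an as-of date and never updated. Field names are
# illustrative assumptions, not a real schema.
import datetime

warehouse = []   # append-only store of dated historical snapshots

def load_snapshot(as_of, operational_rows):
    """Copy operational data into the warehouse stamped with its date."""
    for row in operational_rows:
        warehouse.append({"as_of": as_of, **row})

load_snapshot(datetime.date(2023, 1, 1), [{"customer": "C1", "premium": 100}])
load_snapshot(datetime.date(2024, 1, 1), [{"customer": "C1", "premium": 120}])

# Both historical values remain available for trends and forecasting:
history = [r["premium"] for r in warehouse if r["customer"] == "C1"]
print(history)   # [100, 120]
```

An operational system would instead overwrite C1's premium in place; keeping every dated snapshot is precisely what makes comparisons and forecasting possible.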