Data Mining with Microsoft SQL Server 2008
Table of Contents
Before You Begin
Lab Setup
Exercise 1: Creating Data Mining Models
Exercise 2: Viewing Mining Accuracy Charts
Exercise 3: Creating a Prediction Query
Exercise 4: Creating a Time Series Model
Information in this document, including URL and other Internet Web site references, is subject to change without notice. Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.
The names of manufacturers, products, or URLs are provided for informational purposes only and Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a manufacturer or product does not imply endorsement of Microsoft of the manufacturer or product. Links are provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not responsible for the contents of any linked site or any link contained in a linked site, or any changes or updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission received from any linked site. Microsoft is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement of Microsoft of the site or the products contained therein.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
Copyright © 2007 Microsoft Corporation. All rights reserved.
Microsoft, SQL Server, and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Version 1.2
Data Mining with Microsoft SQL Server 20081
Before You Begin
Estimated time to complete this lab
90minutes
Objectives
After completing this lab, you will be able to:
- Create Decision Tree and Naïve Bayes data-mining models.
- View mining accuracy charts.
- Create a prediction query.
- Create a Time Series model.
Prerequisites
Before working on this lab, you must have:
- A general understanding of data-mining concepts and terminology.
- A general familiarity with Microsoft SQL Server®Business Intelligence Development Studio.
- A general familiarity with SQL Server Management Studio.
Lab Setup
Lab scenario
AdventureWorks Cycles, a bicycle manufacturing company, uses business analytics to better understand itscustomer base. The company plans to analyze and improve the performance of its bicycle retail sector. Over time, the company has collected information about past customers and sales. It now wants to use thisinformation to gain insights about itscustomers.
Virtual PC
This lab makes use of Microsoft Virtual PC 2007, which is an application that allows you to run multiple virtual computers on the same physical hardware. During the lab, you will use a virtual machine runningWindows Server® 2003.
Before you start the lab, familiarize yourself with the following basics of Virtual PC:
- To switch the focus for your mouse and keyboard to the virtual machine, click inside the virtual machine window.
- To remove the focus from a virtual machine, move the mouse pointer outside the virtual machine window.
- To mimic the CTRL+ALT+DELETE key combination inside a virtual machine, use RIGHT-ALT+DELETE. In Virtual PC, the RIGHT-ALT key is called the host key.
- To enlarge the size of the virtual machine window, drag the lower-right corner of the window as seen in the screenshot.
- To switch to and from full-screen mode, press RIGHT-ALT+ENTER.
Computers in this lab
This lab usesonecomputer as described in the following table. Before you begin the lab, you must start the virtual machine and then log on to the computer. In each exercise, you only have to start the virtual machine that isneeded.
Virtual Machine / Computer Name / User Name / PasswordSQL Server 2008 HOLs / MIAMI / Student / Pa$$w0rd
Start the virtual machine
- Launch Microsoft Virtual PC from the Start menu or desktop. If the Virtual PC console does not appear, double-click its icon in the notification area.
- Select SQL Server 2008 HOLs, and then click Start.
- When the virtual server is running, in the virtual server window,on the Action menu, click Ctrl+Alt+Del (or press RIGHT-ALT+DELETE on your keyboard) to send a CTRL+ALT+DEL sequence to the logon dialog box within the virtual server window.
- Type the following information, and then click OK:
- User name: Student
- Password: P@ssw0rd
Exercise1: Creating Data Mining Models
In this exercise, you will develop an Analysis Services solution by using the Microsoft Business Intelligence Development Studio environment. Business Intelligence Development Studio is an environment based on the Microsoft Visual Studio® environment. Business Intelligence Development Studio provides an integrated development environment for designing, testing, editing, and deploying projects to an Analysis Servicesinstance. You will create and view a data mining structure with Decision Trees and Naïve Bayes data mining models.
SQL Server 2008 Analysis Services providesenhanced data mining functionality through the following data mining techniques:
- Microsoft Association Rules
- Microsoft Clustering
- Microsoft Decision Trees
- Microsoft Naïve Bayes
- Microsoft Neural Network
- Microsoft Sequence Clustering
- Microsoft Time Series
- Microsoft Text Mining
- Microsoft Linear Regression
- Microsoft Logistic Regression
Create views in the AdventureWorksDW database
- Click Start, and then click Computer.
- Browse to theC:\SQLHOLS\Data Mining\Starter folder.
- Double-click Setup.cmd.
- Wait for the Command Prompt window to close before proceeding to the next procedure.
Create an Analysis Services project
- StartSQL Server Business Intelligence Development Studio.
- On theFilemenu, point toNew, andthen clickProject.
- In the New Project dialog box, in the Project Types pane, click Business Intelligence Projects.
- In the Templates pane, click Analysis Services Project.
- In the Name box, type DM Exercise 1
- In the Location box, typeC:\SQLHOLS\Data Mining\Starter\
- Clear the Create directory for Solution checkbox, and then click OK.
Note: The project is created in a new solution. A solution is the largest unit of management in the Business Intelligence Development Studio environment. Each solution contains one or more projects. An Analysis Services project is a group of related files that contain the XML code for all of the objects in an Analysis Services database.
You can view the solution and its projects in the Solution Explorer windowon the right-hand side in Business Intelligence Development Studio. If Solution Explorer is not visible, you can view it by selecting the View, Solution Explorer menu item (or the keyboard shortcut CTRL+ALT+L).
- In Solution Explorer, right-click the DM Exercise 1 project, and then clickProperties.
- In the DM Exercise 1 Property Pages dialog box, under Configuration Properties, clickDeployment.
- In the right pane, in the Deployment Mode drop-down list, click DeployAll, and then click OK.
Note: You can configure the build, debugging, and deployment properties of an Analysis Services project.
Create a data source
- In Solution Explorer, under the DM Exercise 1project, right-click the Data Sources folder, and thenclickNew Data Source.
- In theData Source Wizarddialog box, on the Welcome to the Data Source Wizard page, click Next.
- On the Select how to define the connection page, ensure that the Create a data source based on an existing or new connectionoption is selected, andthen click New.
- In the Connection Manager dialog box, in the Provider drop-down list at the top of the page,clickSqlClient Data Providerin the .Net Providers folder, and then click OK.
- In the Server namebox, type (local)
- Under Log on to the server, click Use Windows Authentication.
- In the Select or enter a database name drop-down list, click AdventureWorksDW.
- Click Test Connection, and then clickOK to dismiss the message box.
- In the Connection Manager dialog box, click OK.
- In the Data Source Wizard dialog box, on the Select how to define the connection page, verify that (local).AdventureWorksDW is selected, and then clickNext.
- On the Impersonation Information page, select Use the service account, and then click Next.
- On the Completing the Data Source Wizardpage, leave the default data source name Adventure Works DW unchanged, and then click Finish.
You have now set up the connection information for the database. You must now define the schema information that you want to use in the solution. You do this by creating aData Source View.
Create a Data Source View
- In Solution Explorer, under the DM Exercise 1project, right-click the Data Source Views folder, and on the shortcut menu, clickNew Data Source View.
- In the Data Source View Wizard dialog box, on the Welcome to the Data Source View Wizard page, click Next.
- On the Select Data Source page, in the Relational data sources pane, verify that Adventure Works DW is selected, and then click Next.
Note: At this point, Analysis Services may take a few moments to read the database schema.
- In this project, the Data Source View is not based on a table; instead, it is based on a view. On the Select Tables and Views page, double-click vDMLabCustomerTrainto add this view to the Included objects list,and then click Next.
Note: You may need to expand the Name column, or the entire dialog box, to be able to select vDMLabCustomerTrain.
- On the Completing the Wizard page, in the Name box, type Customersand then click Finish. The Data Source View Designer will open. The Data Source View Designer is a graphical representation of the data schema that you have defined.
- Right-click the vDMLabCustomerTraintable, and then click Explore Data.
Note: Analysis Services may take a few moments to read the data.
- This opens a new window in which you can view the data for the table. If you want, you can make the tab into a dockable floating window instead. Todo this, right-click the windowtab, and then clickFloating or Dockable.
- In the Explore vDMLabCustomerTrain Table window, click the Chart tab.
Note: The values in the charts show the relationship between the number of records sampled for different properties. These values are based on the sampling that was taken when you clicked Explore Data. You can configure if the sampling should come from the first rows returned or from random rows. You can also configure how many rows you want to sample.
- In the Explore vDMLabCustomerTrain Table window, click the Table tab, scroll to view the data, and then close the Explore VDMLabCustomerTrain Tablewindow.
Create a Data Mining Structure with a Decision Trees mining model
- In Solution Explorer, under the DM Exercise 1database, right-click the Mining Structures folder, and then clickNew Mining Structure.
- In the Data Mining Wizard, on the Welcome to the Data Mining Wizard page, click Next.
Note: The Data Mining Wizard is the starting point for all datamining operations.
- On the Select the Definition Method page, selectFrom existing relational databaseor data warehouse, and then click Next.
- On the Create the Data Mining Structurepage, under Create mining structure with a mining model option, verify that Microsoft Decision Treesis selected, and then click Next.
- On the Select Data Source View page, in the Available data source views pane, verify that the Customersdata source view is selected, and then click Next.
- On the Specify Table Typespage, in the Input tablespane, in the vDMLabCustomerTrainrow, verify that the Case check box is selected, and then click Next.
- On the Specify the Training Datapage, in the Mining model structurepane, select or deselect each cell by selecting or clearing the check box as shown in Figure 1, and then click Next.
Figure 1
Because CustomerKey is the primary key of the source table, the Data Mining Wizard has automatically selected it as the key. The key identifies the cases in the mining model. The attributes selected as Input are analyzed to determine their relationship and influence on the attribute selected as Predictable.
Important:The CustomerKey, FirstName, and LastName columns must not be selected as Input or Predictable columns.
- On the Specify Columns’ Content and Data Type page,review the Content Type column for all numeric rows, and then click Detect.
- When the detection is complete, notice that the NumberCarsOwned and NumberChildrenAtHome fields have been changed from Continuous to Discrete,and then click Next.
Note: The Content and Data Type page shows the Data Type determined from the source data.When you click Detect, Analysis Services scans numeric fields to determine if they are continuous or discrete data. After the detection has occurred, the interface provides you with the flexibility to manually edit both the Data Type and Content Type fields.
- On the Split data into training and testing sets page, ensure that Percentage of testing data is set to 30%, and then click Next.
Note: SQL Server 2008 Analysis Services enables you to partition your input data into training and testing sets of data. The training data will be used by the mining model algorithm to determine patterns and relations. A randomly selected portion of the data will be held to test the accuracy of the datamining models created by comparing model predictions to actual values. All mining models associated with this mining structure will use the training and testing sets defined in this step.
- On the Completing the Wizard page, in the Mining Structure Namebox, type Customers,selectthe Allow drill through check box, and then click Finish. The Mining Structure designer will open.
Note: A data mining structure can contain multiple data mining models. Each data mining model uses a subset of the data referenced by the data mining structure. When the data mining structure is processed, the source data is queried once and then all of the data mining models are processed in parallel.
Modify the mining model
In the following procedures, you will add mining models to the Customers mining structure. To make it easier to identify the different models in this mining structure, you will rename the existing mining model. You can rename the model on the Completing the Wizard page of the Data Mining Wizard, or at any time through the Mining Model properties.
- Select the Mining Models tab to view information about the model.
- In the Mining Models grid, right-click the second column’s heading, and then click Properties.
- In the Properties window, in the Name property box, type Customers DTto rename the mining model, and then press ENTER.
Note: Step 3 renames the Decision Tree mining model but does not rename the mining model structure.
Create a filtered mining model
In this procedure, you will create a filtered model based on the Australia region. By creating this model, you will be able to compare and contrast buying patterns worldwide with those in Australia.
- In the Mining Models tab, click anywhere in the Customers DT model.
- Click the Create a related mining model button on the toolbar in the Mining Models tab.
- In the New Mining Modeldialog box, in the Model Name box, type Australia Customers DT and click OK.
- Verify that all of the columns for the Australia Customers DT model are set to Input except for the Bike Buyer and Customer Key columns. Bike Buyermust be set to PredictOnly and Customer Keymust be set to Key.
- Right-click the Australia Customers DT model, and then click Set Model Filter.
- On the [Australia Customers DT] Model Filter page,in the first row, in the Mining Structure Column box,select English Country Region Name, verify that the Operator is = ,and in the Value column,type Australia.Then click OK.
Modify the mining structure