Dr. Bjarne Berg DRAFT

Overview of the CRISP DM Methodology for Data Mining

The Cross Industry Standard Process for Data Mining (CRISP-DM) was a methodology initially developed by a consortium of members through a set of workshops with industry practitioners in 1996 and 1997. Following input from a panel of Special Interest Groups (SIGs), the authors consolidated and refined the model over the next few years and presented the methodology in 1999 as a tool for practitioners. The methodology consists of a set of six phases that take the practitioner from the inception of a problem to the completion of the analysis.

The first phase is Business Understanding. This phase consists of the tasks of determining the business objectives (why are we doing this) as well as assessing the business situation. This is done to place the data mining effort in the context of the problem and in the context of the organization that is conducting the activity. As a result of this effort, clear goals are established, prioritized and incorporated into a project plan with a scope statement, duration, dependencies and resource allocations. This is often considered to be part of the ‘project preparation’ phase in other, more traditional Systems Development Life-Cycle (SDLC) methodologies.

The next phase is called Data Understanding. In this phase an initial set of data is collected to get a clearer understanding of what is available and to see how the data sets can be used to address the problem(s). The phase includes detailed documentation of the data through the creation of a data library that describes the data. In addition, the phase includes an initial exploration of the data (often done through the use of descriptive statistics) as well as a verification of data quality to uncover any issues early in the project so that they can be addressed before a significant effort is consumed by the project.
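As a minimal sketch of what such an initial exploration and data quality verification can look like in practice, the Python example below uses the pandas library on a hypothetical customers.csv file; the file name and columns are illustrative assumptions and are not prescribed by CRISP-DM.

    import pandas as pd

    # Load an initial extract of the data (hypothetical file for illustration).
    df = pd.read_csv("customers.csv")

    # Initial exploration through descriptive statistics.
    print(df.describe(include="all"))    # counts, means, quartiles, unique values
    print(df.dtypes)                     # data type of each column

    # Early verification of data quality: missing values and duplicate records.
    print(df.isna().sum())               # missing values per column
    print("Duplicate rows:", df.duplicated().sum())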

The third phase is Data Preparation. This includes the selection of the actual data to be used in the project. It may include whole populations, or simply a sample of the data, and it may also include some level of cleansing of the data. The next part of this phase is dedicated to data construction, where samples are created, organized based on the tool(s) selected, and integrated into a storage format that can be accessed by the tool. It may also include reformatting of data into new data types, codes, indicators and flags, as well as more structured formatting of unstructured data such as texts, comments and other non-numeric data.
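A small sketch of these preparation steps, again assuming the hypothetical customers.csv data and column names from the previous example, might look as follows in Python with pandas:

    import pandas as pd

    df = pd.read_csv("customers.csv")   # hypothetical source data

    # Data selection: work with a random sample rather than the whole population.
    sample = df.sample(n=5000, random_state=42)

    # Data cleansing: drop records with missing key fields and cap extreme outliers.
    sample = sample.dropna(subset=["income", "age"])
    sample["income"] = sample["income"].clip(upper=sample["income"].quantile(0.99))

    # Data construction: derive new codes, indicators and flags.
    sample["high_value_flag"] = (sample["income"] > 100_000).astype(int)
    sample["age_band"] = pd.cut(sample["age"], bins=[0, 30, 50, 120],
                                labels=["young", "middle", "senior"])

    # Store the prepared sample in a format the modeling tool can access.
    sample.to_parquet("prepared_sample.parquet")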

The fourth phase is Modeling. The first step is to select the appropriate modeling technique. This should be based on the sample size and data type as well as on the problem that is being addressed. For many problems, there may be more than one technique that can be used, and the modeler can decide to apply several to see which yields the best result. After a technique has been selected, it is important that the modeler does not simply engage in number ‘crunching’ but instead takes a serious look at the test design of the problem. This includes a detailed approach to testing for validity (are we measuring what we think we are) and reliability (is this only valid for this one data set, or can it be repeated). We would also test to see which assumptions may be violated, e.g. random sampling, sampling methods, normality, homoscedasticity, etc., as well as the impact of those violations on the test design and subsequent findings. The next step in this phase is to actually build the model. This is often done with a subset of the sample that is randomly selected. For example, from an overall sample of 5,000, a random subset of 1,000 can be used to build a model and the remaining 4,000 can be used to assess the model results on known data points. This is a very common approach when attempting to optimize the model building through leveraging multiple methods and models. This fourth phase ends with an assessment of the model and its ability to predict, illustrate or explore the findings of the system.
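A minimal sketch of this build-and-assess split, continuing the hypothetical prepared_sample.parquet data and assuming a binary ‘churned’ target column and logistic regression as one possible technique, could look like this:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    sample = pd.read_parquet("prepared_sample.parquet")   # the 5,000-record prepared sample
    X = sample[["income", "age", "high_value_flag"]]      # assumed predictor columns
    y = sample["churned"]                                  # assumed binary target column

    # Randomly set aside 1,000 records to build the model and keep the
    # remaining 4,000 to assess it on known data points, as described above.
    X_build, X_assess, y_build, y_assess = train_test_split(
        X, y, train_size=1000, random_state=42, stratify=y)

    model = LogisticRegression(max_iter=1000).fit(X_build, y_build)

    # Assess the model on the held-out records.
    print("Hold-out accuracy:", round(accuracy_score(y_assess, model.predict(X_assess)), 3))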

The fifth phase of the methodology is Evaluation. In this phase the first step is to evaluate the results and place them back into the business context. This includes discussing what the standard deviation, mean and other statistical measures represent in business terms. Based on this context, the process is re-evaluated to see if any improvements can be made, or if other techniques should be selected. It is important to note that in the previous phases of data preparation and modeling, the methodology recommends an iterative approach to model building, revisiting the data preparation in a cyclical manner until the model has been refined. Later, in the evaluation phase, we are not revisiting the model creation, but merely placing it in the business context for an evaluation of reasonableness, significance (in statistical terms as well as in business terms) and impact on the organization. As a result, the last step in this phase is the determination of the next steps of the process, which may result in repeating the whole project cycle as a distinctly new effort. This is because the findings of one data mining effort may need to be explored further before any deployments based on the results can be considered.
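As a rough, hypothetical illustration of placing model results back into business terms, the sketch below continues the churn example from the modeling phase and converts the hold-out confusion matrix into an estimated net benefit; the cost and value figures are purely assumed for illustration:

    from sklearn.metrics import confusion_matrix

    def net_benefit(y_true, y_pred, offer_cost=50, lost_value=500):
        """Translate classification results into business terms (all figures assumed)."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        campaign_cost = (tp + fp) * offer_cost   # cost of contacting every flagged customer
        avoided_loss = tp * lost_value           # value of churners correctly identified
        return avoided_loss - campaign_cost

    # Using the hold-out predictions from the modeling sketch above.
    print("Estimated net benefit:", net_benefit(y_assess, model.predict(X_assess)))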

If the findings are found to have significance and validity, the project may progress to the last phase, which is known as Deployment. This is the phase where the project team asks how the findings should drive changes in the business model or in organizational behavior. This may result in a new way of interacting with customers through credit or new marketing initiatives, or simply be a validation of known relationships to see how they might have changed over time. If the findings are actionable, a plan for deployment is created in this phase. This includes planning items for the new technology, people, processes and resources needed to take advantage of the findings. It also includes a plan for monitoring and maintenance of the proposed solution, often known as a ‘sustain organization’ or ‘sustain support model’. The last steps in this phase are the creation of the actual final report of the project. This can be delivered through a set of media such as Word documents, collaboration rooms, web pages and any other tool the project may decide to employ. Like any good methodology, CRISP-DM also advocates that the project ends with a “lessons learned” session where the participants sit down and review the project shortly before its termination. The purpose of this step is to make sure that organizational learning occurs and that the mistakes made and approaches learned by the project are applied in future efforts by new team members and leveraged by the current members as well. Unfortunately, this is a step that is often ignored. As a result, many organizations are ‘doomed’ to continue to make the same mistakes project after project.
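As a small, hypothetical sketch of the monitoring part of such a sustain support model, the example below compares the deployed churn model's current accuracy against a baseline recorded at go-live; the file name, column names and thresholds are all assumptions:

    import pandas as pd
    from sklearn.metrics import accuracy_score

    BASELINE_ACCURACY = 0.82   # assumed accuracy recorded when the model went live
    ALERT_THRESHOLD = 0.05     # assumed tolerated drop before remodeling is triggered

    # Hypothetical file of recently scored customers whose outcomes are now known.
    recent = pd.read_parquet("recent_scored_customers.parquet")
    current_accuracy = accuracy_score(recent["actual_churn"], recent["predicted_churn"])

    if BASELINE_ACCURACY - current_accuracy > ALERT_THRESHOLD:
        print("Model performance has degraded; revisit data preparation and modeling.")
    else:
        print("Model healthy: accuracy", round(current_accuracy, 3))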

The CRISP-DM model organizes the sub-levels of the phases into a hierarchical model that consists of generic tasks that are mapped to specialized tasks, which in turn have various process instances. The two lower levels are known as the CRISP processes, and the two higher levels are known as the CRISP Process Model. While the model is short and kept at a generic, high level (it consists of fewer than 100 pages), it provides a solid framework for organizations to approach data mining. It is worth noting that some companies, such as SPSS and SAS, have used this methodology as a reference and have been instrumental in its development.