NSF Topic 4(c): MATHEMATICAL SCIENCES: Statistical Methods NSF Sol. 97-64

Proposal Title: “Fuzzy Data Mining” PIN: SFS-97-22

B. Cover Page

NSF Topic 4(c)

MATHEMATICAL SCIENCES: Statistical Methods

“Fuzzy Data Mining”

Submitted to:

Solicitation 97-64 (SBIR Program)

National Science Foundation PPU

4201 Wilson Blvd Room P60

Arlington VA 22230

703/306-1391

SciFish / - 20 - /

C. Project Summary

SUMMARY

With the proliferation of data, data mining tools are becoming available to meet the market demand for ways to find useful information within that data. One drawback to data mining, specifically data mining of spatial data, is representing vastly different data values and inferring missing data. This is especially evident in data mining applications that seek to find relationships between biological and environmental parameters. Current data mining approaches that utilize neural networks, genetic algorithms, or statistical techniques do not inherently allow for such common data inadequacies. A methodology is needed that can properly represent, and process, data with a large amount of uncertainty.

Scientific Fishery Systems, Inc. (SciFish) proposes the development of a fuzzy data mining methodology that utilizes fuzzy set theory in two key steps of the data mining process. First, fuzzy membership functions are used to represent each data attribute. This allows the data mining practitioner to properly represent each parameter, defining the ranges for low, medium, high, and so on. Second, fuzzy set operations are used during the data mining process, providing different fuzzy correlations that can then be examined to reveal strong trends that traditional correlation techniques might have missed.

COMMERCIAL POTENTIAL

The commercial potential of the proposed fuzzy data mining approach will depend on SciFish’s ability to convince GIS and data mining users that incorporating fuzzy techniques will let them extract more information from their data than they currently can. The best way to make this happen is through a successful demonstration of fuzzy data mining on an application of significant interest to a large community. One such area is fisheries, where the interactions and relationships between various species and their environment are largely unknown. From this foundation, it will then be possible to extend such applications into other areas, such as oil exploration, forest management, wildlife management, retail site exploration, and local zoning and planning.

Table of Contents

B. Cover Page

C. Project Summary

Table of Contents

D. Identification and Significance of the Problem or Opportunity

E. Background and Technical Approach

E.1 Background

E.1.1 The Ten Steps of Data Mining

E.1.2 Key Environmental Factors Affecting Fishes

E.1.3 Fuzzy Data Mining the North Pacific Fisheries: A Test Case

E.2 Technical Approach

E.2.1 Task 1: Data Collection for the North Pacific

E.2.2 Task 2: Develop Fuzzy Representation Methodology

E.2.3 Task 3: Develop Fuzzy Correlation Methodology

E.2.4 Task 4: Specifying the Fuzzy Data Mining Software Product

E.2.5 Task 5: Perform Market Analysis (SciFish Funded)

E.2.6 Task 6: Technology Transfer

E.3 Related Research and Development

E.3.1 Related Work by SciFish

E.3.2 Related Work by Others

F. Phase I Technical Objectives

G. Phase I Research Plan

H. Commercial Potential

I. Principal Investigator and Senior Personnel

I.1 Patrick K. Simpson, Principal Investigator

J. Subcontracts and Consultants

K. Equipment, Instrumentation, Computers, and Facilities

L. Current and Pending Support of PI and Senior Personnel

M. Equivalent or Overlapping Proposals to Other Federal Agencies

N. Proposed Budget

N.1 General Information

N.2 Cost References


D. Identification and Significance of the Problem or Opportunity

With the proliferation of data, data mining tools are becoming available to meet the market demand for ways to find useful information within that data. Data mining is an automated search for new and valuable information in a set of data. The ultimate objective of data mining is knowledge discovery. Data mining methodology extracts hidden predictive information from large databases.

The Problem. One drawback to data mining, specifically data mining of spatial data, is representing vastly different data values and inferring missing data. This is especially evident in data mining applications that seek to find relationships between biological and environmental parameters. As an example, data mining can be a valuable tool if applied to the fisheries. But, with fisheries data, there is a tremendous difference in value ranges, spatial extent, temporal extent and data validity. Current data mining approaches that utilize neural networks, genetic algorithms, or statistical techniques do not inherently allow for such common data inadequacies. A methodology is needed that can properly represent, and process, data with a large amount of uncertainty.

The Opportunity. Scientific Fishery Systems, Inc. (SciFish) proposes the development of a fuzzy data mining methodology that utilizes fuzzy set theory in two key steps of the data mining process. First, fuzzy membership functions are used to represent each data attribute. This allows the data mining practitioner to properly represent each parameter, defining the ranges for low, medium, high, and so on. Second, fuzzy set operations are used during the data mining process, providing different fuzzy correlations that can then be examined to reveal strong trends that traditional correlation techniques might have missed. As an example, it is quite possible that a high degree of young fish is strongly correlated with high water temperatures. Such a result would be immediately available using the proposed technique. Using existing techniques, the same result would not be revealed, because high correlations would be biased toward older fish and higher temperatures, the larger end of both value ranges. An illustration of the entire fuzzy data mining approach is outlined below in Figure 1.
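The first step, representing an attribute with fuzzy membership functions, can be sketched as follows. This is a minimal illustration only: the trapezoidal shape, the function names, and the temperature breakpoints (in degrees Celsius) are assumptions chosen for the example and do not come from the proposal.

```python
import math


def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], 0 outside (a, d), linear between."""
    if b <= x <= c:
        return 1.0
    if x < b:
        return 0.0 if x <= a else (x - a) / (b - a)
    return 0.0 if x >= d else (d - x) / (d - c)


# Hypothetical breakpoints for water temperature; infinite values
# create open "shoulders" at the low and high ends of the range.
TEMPERATURE_SETS = {
    "low": (-math.inf, -math.inf, 2.0, 6.0),
    "medium": (2.0, 6.0, 8.0, 12.0),
    "high": (8.0, 12.0, math.inf, math.inf),
}


def fuzzify(x, sets):
    """Return the degree of membership of x in each labeled fuzzy set."""
    return {label: trapezoid(x, *params) for label, params in sets.items()}


print(fuzzify(4.0, TEMPERATURE_SETS))
print(fuzzify(10.0, TEMPERATURE_SETS))
```

A reading of 4.0 belongs partly to "low" and partly to "medium" rather than being forced into one bin, which is the property the approach relies on when data are uncertain.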

The Benefits. The proposed fuzzy data mining approach will allow the practitioner to partition the parameter space into a set of membership functions that are germane to the task. A large Walleye Pollock has a very different length and weight than a large Pacific Halibut. The proposed approach allows those different ranges to be compared equitably.
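As a rough sketch of how species-specific membership functions put different size ranges on a common footing, the example below defines "large" separately for each species. The ramp shape, the names, and the length breakpoints are hypothetical values chosen for illustration; they are not measured ranges and are not taken from the proposal.

```python
def ramp_up(x, lo, hi):
    """Membership rising linearly from 0 at lo to 1 at hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


# Hypothetical "large" length ranges in centimeters, one per species.
LARGE_LENGTH_CM = {
    "walleye_pollock": (40.0, 60.0),
    "pacific_halibut": (100.0, 160.0),
}


def degree_large(species, length_cm):
    """Degree to which a fish of this species counts as 'large'."""
    lo, hi = LARGE_LENGTH_CM[species]
    return ramp_up(length_cm, lo, hi)


pollock = degree_large("walleye_pollock", 55.0)
halibut = degree_large("pacific_halibut", 145.0)
print(pollock, halibut)
```

Under these assumed ranges, a 55 cm pollock and a 145 cm halibut receive the same degree of "large", so later comparisons operate on membership values rather than on raw lengths.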

In addition to applying fuzzy set technology to the data mining process, the proposed approach emphasizes the exploration of spatial data sets. Although geographic information systems (GIS) are intended to provide analyses of spatial data, such analysis is almost entirely application specific, intended to answer questions such as: Where is the watershed? How much area is covered by trees? Where is the best spot to look for oil? The proposed fuzzy data mining approach will be a significant new tool in that arsenal, providing answers to a whole new set of questions, such as: What parameters have the greatest impact on young fish? What is the relationship between depth and fish size? What other species are most strongly correlated with Walleye Pollock?

Prior Experience. SciFish is an innovative technology company with a proven track record of taking concepts into working field prototypes, and prototypes into the marketplace. Current fisheries-related products include the development of a broadband sonar fish identification system, a broadband sonar temperature profiler, and a fisheries geographic information system entitled Fisherman’s Associate that integrates several data sources to help fishers optimize their operations. This last product is currently being sold commercially. The sonar fish identification system will begin manufacturing and sales in late 1998. The temperature profiler recently completed Phase I development. All of this technology is the result of SBIR funded projects. Although the proposed fuzzy data mining product is not specifically a fisheries-related application, SciFish will be using a fisheries data set to develop the approach. In addition, SciFish’s prior experience in developing a software product provides this project with valuable insights that can enhance the overall probability of becoming a commercial success.

Figure 1. Outline of Fuzzy Data Mining Approach

During Phase I, SciFish will develop a fuzzy data mining software product that can be applied to a myriad of spatial problems. To accomplish this goal, SciFish will develop the fuzzy data mining methodology through its application to fisheries. Several software product specifications will be created for different commercialization opportunities. A detailed market analysis will be conducted with SciFish funding. Finally, a report will be produced that describes the details of each stage of this development process.

During Phase II, SciFish will produce at least one of the software products, as well as extend the fuzzy data mining methodology from local spatial analysis to global and spatiotemporal analysis.

The Commercial Potential. The commercial potential of the proposed fuzzy data mining approach will depend on SciFish’s ability to convince GIS and data mining users that incorporating fuzzy techniques will let them extract more information from their data than they currently can. The best way to make this happen is through a successful demonstration of fuzzy data mining on an application of significant interest to a large community. One such area is fisheries, where the interactions and relationships between various species and their environment are largely unknown. From this foundation, it will then be possible to extend such applications into other areas, such as oil exploration, forest management, wildlife management, retail site exploration, and local zoning and planning.

E. Background and Technical Approach

The following three sections provide background (§E.1), describe the technical approach (§E.2), and review related research in the proposed area (§E.3).

E.1 Background

The following background sections lay the groundwork for the Phase I Work Plan that follows. Three areas are reviewed. First (§E.1.1), a set of ten steps for data mining is outlined. Next (§E.1.2), the key environmental factors that influence fishes are reviewed. Finally (§E.1.3), the motivation for using the Walleye Pollock as a test case during product development is provided.

E.1.1 The Ten Steps of Data Mining

In a recent PC AI article, a set of 10 steps for data mining was described. These are summarized here to provide an overview of the current data mining methodology. In the following sections, the steps that will be modified are steps 7 and 8, which deal with model construction and validating the findings.

  1. Identify the Objective. Clearly define the intent of the analysis.
  2. Select the Data. Select the data available for achieving the goal.
  3. Prepare the Data. Determine which attributes and parameters within the selected data should be used for the analysis, and format the selected parameters.
  4. Audit the Data. Evaluate the resulting data to determine if the data from the various sources has the same level of confidence, range of values, time extent, and spatial extent. Discard all parameters and attributes that are deemed insufficient.
  5. Select the Tools. Decide which tool is the best for meeting the objective. The emphasis of the proposed approach is to utilize a fuzzy systems approach for those data elements with widely varying ranges in value, time, and space.
  6. Format the Solution. Determine the format of the solution. With the fuzzy systems approach, this step includes the creation of fuzzy membership functions for each of the parameters and attributes. For the application presented herein as an example, the format of the data will consist of fuzzy membership values defined for cells of a predefined resolution.
  7. Construct the Model. Apply the selected data mining tool, in this instance a fuzzy correlation approach, to the formatted data. The result of the proposed fuzzy data mining approach will be the identification of strong correlations between variables.
  8. Validate the Findings. Share the results of the data mining with the client. Determine if the results are valid. Make corrections as needed and repeat step 7, if needed.
  9. Deliver the Findings. Provide a report to the client summarizing the results.
  10. Integrate the Solution. Apply the findings as appropriate.
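The "Format the Solution" and "Construct the Model" steps can be illustrated with a small sketch. The proposal does not define its fuzzy correlation measure, so the score used here is an assumption: the ratio of the fuzzy intersection (cell-wise minimum) to the fuzzy union (cell-wise maximum) of two membership grids, which is one common way to measure how strongly two fuzzy sets overlap. The function names and the four-cell membership values are likewise hypothetical.

```python
def fuzzy_overlap(mu_a, mu_b):
    """Assumed score: sum of cell-wise min divided by sum of cell-wise max."""
    inter = sum(min(a, b) for a, b in zip(mu_a, mu_b))
    union = sum(max(a, b) for a, b in zip(mu_a, mu_b))
    return inter / union if union > 0 else 0.0


def correlation_table(attr_x, attr_y):
    """Score every pairing of a fuzzy label from attr_x with one from attr_y."""
    return {
        (lx, ly): fuzzy_overlap(mx, my)
        for lx, mx in attr_x.items()
        for ly, my in attr_y.items()
    }


# Hypothetical membership values for four grid cells (illustrative only).
fish_age = {"young": [0.9, 0.8, 0.1, 0.0], "old": [0.1, 0.2, 0.9, 1.0]}
temperature = {"low": [0.0, 0.1, 0.7, 1.0], "high": [1.0, 0.9, 0.2, 0.0]}

table = correlation_table(fish_age, temperature)
strongest = max(table, key=table.get)
for pair, score in sorted(table.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Because each label pairing is scored separately, a strong association between "young" and "high" can surface on its own terms instead of being averaged into a single coefficient over the raw values.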

E.1.2 Key Environmental Factors Affecting Fishes

There are several environmental factors that influence different aspects of a fish’s life. Some environmental factors, such as current, affect the transportation of larvae, while others are related to food. The following sections outline many of the key environmental factors that affect fishes. With this information, it is then possible to determine which MTPE-derived data products can be used to measure these environmental factors.

Sea and Swell. Waves in the sea, generated by local and distant wind fields (wind waves and swell, respectively) are the most significant phenomena at sea which affect safety, comfort, fishing operations, and fish behavior and availability. There are three different effects of waves on the sea below the surface which are of concern to the fisheries:

·  Vertical mixing by wave action and turbulence caused by breaking waves. This wave mixing can deepen the surface mixed layer depth and sharpen the thermocline gradient. Furthermore, it can affect fish directly by making them seasick and inducing them to move deeper, where the orbital movement of water, caused by waves, is absent.

·  Waves cause current (mass transport by waves) in addition to surface wind drag.

·  Breaking waves cause wave noise which can affect fish behavior.

Surface Currents. Fish sense currents with the rheotactic organ located on the lateral line. Generally, fish head into the current even when they let themselves be carried with it. The swimming speed of fish depends on their size and is affected by temperature, being slower at lower temperatures. Fish eggs, larvae, and small juveniles are carried with currents and dispersed by them. Japanese fisheries scientists as well as fishermen have long recognized that pelagic fish tend to aggregate at current boundaries, where good catches are made.[1] The reasons for this are considered to be threefold:

·  Food supply (micronekton) accumulates at current convergences;

·  The current boundary acts as an environmental boundary; and

·  Migrating fish dynamically aggregate at current boundaries.

Salinity and Basic Nutrients. The chemistry of sea water which might affect fish is little influenced by weather and by climatic changes. Two aspects should, however, be considered: salinity and basic nutrient salts such as phosphates and nitrates. These elements can be limiting factors in basic organic production (phytoplankton production) in the sea. Changes in salinity are small and usually indicative of advective changes and mixing. Of other chemical properties, the changes of nutrient salts are indicative of productivity changes and eutrophication.