In This Research We Are Focusing on the Security Side of Data Mining

Abstract- According to MIT Technology Review magazine, data mining is one of the ten sectors that are going to change the world in the future. Many giant companies, such as Oracle and IBM, have recently entered this sector by supplying software or models for data mining. Other companies, such as Cisco, are interested in the security of data mining. But what makes all these companies interested in data mining? What is behind the large profits earned by data mining companies? Many standards and rules have recently been added to help improve information security. These standards are defined and controlled by strong organizations, and sometimes governments, such as the International Organization for Standardization (ISO). ISO 27001, for managing information security, is one example.

In this paper, we try to link two important and recent aspects of data: its security and its extraction, which is known as data mining. Data mining has emerged alongside the huge databases in use today. Their size increases the risk of losing or damaging these data warehouses, which creates a need for stronger security management to guarantee the reliability, privacy, and integrity of the data. Information security is needed by all organizations and businesses, and by individuals as well. We will try to clarify, as far as possible, the relation between data mining and information security.

In this research we are focusing on the security side of data mining.

Introduction

We are going to talk about a powerful new technology that helps firms and companies focus on the important information in their warehouses. This technology is data mining, which is the extraction of information from large data sets. The future of data mining is bright and promising, and the field is growing very fast to include web and text mining. Much research has been done recently to advance data mining. Data mining allows businesses to make well-informed decisions with tools that predict future trends and behaviors. These tools help find predictive information that experts may miss because it lies outside their expectations.

Data mining techniques can be incorporated into new products and systems as they are brought online, and they can be implemented quickly on existing software and hardware platforms to increase the value of existing information resources.

Information security is an old concept that was already in use during the Second World War, but it has become a large sector because of the technological revolution. Information security reduces risks not only for individuals but also for organizations, businesses, and, most importantly, governments. When we talk about information security, we are talking about the most important issue in data mining. It is a hard, complicated, and long-term undertaking. Information security can never be complete; there is always some risk, and the goal is to reduce it as much as possible.

We will explain data mining and mention its most common techniques. We will also discuss the data warehouse and data security, and then move on to the relation between data mining and information security.

Data Mining

What is data mining?

Data mining is the science of extracting useful information from large data sets or databases. It is a new discipline that lies at the intersection of machine learning, statistics, databases and data management, artificial intelligence, pattern recognition, and other areas. [1]

Data warehouse

"A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."[2]

"Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject." [2]

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.

Fig 1: A data warehouse example

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-Volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered. [2]
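
The integration property described above can be illustrated with a small sketch. It assumes two hypothetical sources, A and B, that identify the same product in different ways, and a made-up mapping table that translates both identifiers into one warehouse key. The field names and keys are illustrative assumptions, not part of any real system.

```python
# Hypothetical source systems that identify the same product differently.
SOURCE_A = [{"sku": "A-100", "name": "Blue Pen", "qty": 5}]
SOURCE_B = [{"item_code": 7001, "title": "blue pen", "units": 3}]

# Assumed mapping from each source's own identifier to one warehouse key.
PRODUCT_KEY = {("A", "A-100"): "P001", ("B", 7001): "P001"}


def integrate(source_a, source_b):
    """Return rows that all use the single warehouse product key."""
    rows = []
    for rec in source_a:
        rows.append({"product_key": PRODUCT_KEY[("A", rec["sku"])],
                     "source": "A", "quantity": rec["qty"]})
    for rec in source_b:
        rows.append({"product_key": PRODUCT_KEY[("B", rec["item_code"])],
                     "source": "B", "quantity": rec["units"]})
    return rows


warehouse = integrate(SOURCE_A, SOURCE_B)
# Both sources now contribute to the same product.
total = sum(row["quantity"] for row in warehouse if row["product_key"] == "P001")
print(total)  # 8
```

Once both sources share one key, a question such as "how many units of this product were sold in total" can be answered without knowing how each source labels the product.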

Data Mining Techniques

We will describe some of the most common data mining algorithms in use today. We have divided the techniques into two sections:

Classical Techniques:
Statistics.
Neighborhoods
Clustering
Next Generation Techniques:
Decision Trees
Neural Networks
Rules [3].

First: Classical techniques.

This section describes techniques that have been used for decades. It should help the user understand the rough differences between the techniques, and provide at least enough information to be well armed and not baffled by the vendors of different data mining tools.

Statistics

By strict definition, "statistics" or statistical techniques are not data mining. They were being used long before the term data mining was coined to apply to business applications. However, statistical techniques are driven by the data and are used to discover patterns and build predictive models. From the user's perspective, you will be faced with a conscious choice when solving a "data mining" problem as to whether you wish to attack it with statistical methods or other data mining techniques. For this reason it is important to have some idea of how statistical techniques work and how they can be applied. [3]

Regression is one of the oldest and best-known statistical techniques used in data mining. It expresses the relationship between variables as a function. Some forms are simple, such as linear regression, which fits a line that predicts the value of one variable from the value of another. There are also more advanced techniques, such as multiple regression, for more complex relations. Successful data mining still requires skilled technical and analytical specialists who can structure the analysis and interpret the output. [4]
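
As a minimal sketch of simple linear regression, the following code fits a line by ordinary least squares and uses it to predict a new value. The data (advertising spend and sales) is made up for illustration, and the function name fit_line is our own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example data: advertising spend and resulting sales.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(spend, sales)
predicted = slope * 5.0 + intercept  # prediction for an unseen spend of 5.0
print(slope, intercept, predicted)  # 2.0 1.0 11.0
```

Multiple regression follows the same idea with several predictor variables instead of one.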

Neighborhoods

Clustering and the Nearest Neighbor prediction technique are among the oldest techniques used in data mining. Most people have an intuition that they understand what clustering is - namely that like records are grouped or clustered together. Nearest neighbor is a prediction technique that is quite similar to clustering - its essence is that in order to predict what a prediction value is in one record, look for records with similar predictor values in the historical database and use the prediction value from the record that is “nearest” to the unclassified record. [3]
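
The nearest neighbor idea can be sketched in a few lines. The example assumes a tiny made-up historical database in which each record has two predictor values already scaled to the range 0 to 1, and it uses Euclidean distance as the measure of "nearness". Both choices are assumptions for illustration.

```python
import math


def nearest_neighbor_predict(history, query):
    """Return the prediction value of the historical record closest to query.

    history is a list of (predictor_values, prediction_value) pairs.
    """
    closest = min(history, key=lambda record: math.dist(record[0], query))
    return closest[1]


# Made-up historical records: (scaled predictors, known outcome).
history = [
    ((0.2, 0.1), "default"),
    ((0.8, 0.9), "repaid"),
    ((0.5, 0.6), "repaid"),
]

# The unclassified record is closest to the first historical record.
prediction = nearest_neighbor_predict(history, (0.25, 0.2))
print(prediction)  # default
```

In practice the predictors usually need to be scaled first, otherwise the attribute with the largest numeric range dominates the distance.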

Clustering

"Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions.

Popular clustering techniques include k-means clustering and expectation maximization (EM) clustering."[5]

Another definition: a grouping of a number of similar things, such as a bunch of trees or a cluster of admirers.
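
To make the k-means technique mentioned above more concrete, here is a simplified sketch. It assumes two-dimensional made-up points, uses the first k points as the starting centroids (real implementations usually pick them randomly), and runs a fixed number of iterations.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Group points into k clusters without knowing the groups in advance."""
    centroids = list(points[:k])  # simplifying assumption: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Made-up data with two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters[0])  # the three points near (1, 1)
print(clusters[1])  # the three points near (9, 9)
```

The algorithm never sees a label for either group; the groups emerge only from the distances between the records.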

Second: Next Generation Techniques.

The next generation techniques are the most often used techniques developed over the last two decades of research. These techniques can be used either for building predictive models or for discovering new information within large databases.

Decision Trees

"Decision tree structure and nodes vary depending on the object of data mining and on the structure of information you possess." [5] An example is shown in Fig 2.

Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
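
A full decision tree is built by repeating one basic step: choosing the split that best separates the classes. The sketch below shows only that single step for one numeric attribute, using Gini impurity, the measure commonly used by CART for classification. The income values and outcomes are made up, and the function names are our own.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Find the threshold on one attribute with the lowest weighted impurity."""
    best_threshold, best_score = None, float("inf")
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must put records on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score


# Made-up data: income (in thousands) and loan outcome.
incomes = [20, 25, 30, 60, 70, 80]
outcomes = ["default", "default", "default", "repaid", "repaid", "repaid"]

threshold, impurity = best_split(incomes, outcomes)
print(threshold, impurity)  # 30 0.0
```

A tree-building method would apply the same search again to the records on each side of the split until the groups are pure enough.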

Fig 2: An example for a Decision Tree.

Fig 3: A simplified view of a neural network for prediction of loan default.

Neural Networks

"To be more precise with the term “neural network” one might better speak of an “artificial neural network”. True neural networks are biological systems (a.k.a. brains) that detect patterns, make predictions and learn. The artificial ones are computer programs implementing sophisticated pattern detection and machine learning algorithms on a computer to build predictive models from large historical databases. Artificial neural networks derive their name from their historical development which started off with the premise that machines could be made to “think” if scientists found ways to mimic the structure and functioning of the human brain on the computer. Thus historically neural networks grew out of the community of Artificial Intelligence rather than from the discipline of statistics. Despite the fact that scientists are still far from understanding the human brain let alone mimicking it, neural networks that run on computers can do some of the things that people can do." [3] Fig 3 shows a simplified view of a neural network.

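As a rough illustration of how an artificial neural network turns inputs into a prediction, the following sketch computes the output of a tiny network with one hidden layer. The two inputs, the weights, and the biases are all made-up values; in a real network the weights are learned from historical data rather than written by hand.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute the output of a network with one hidden layer."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical weights; a trained network would learn these from data.
HIDDEN_WEIGHTS = [[0.5, -1.2], [-0.8, 0.3]]
HIDDEN_BIASES = [0.1, -0.2]
OUTPUT_WEIGHTS = [1.5, -0.7]
OUTPUT_BIAS = 0.05

# Two made-up predictor values for one loan applicant, scaled to 0..1.
applicant = [0.47, 0.65]
default_score = forward(applicant, HIDDEN_WEIGHTS, HIDDEN_BIASES,
                        OUTPUT_WEIGHTS, OUTPUT_BIAS)
print(default_score)  # a value between 0 and 1
```

The output can be read as a score for loan default: the closer it is to 1, the more the network "believes" the applicant will default.
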
Rules

Rule-based techniques find frequent patterns, associations, correlations, or causal structures among sets of items or objects in transactional databases, relational databases, and other information repositories. [6]
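
Two standard measures for such rules are support (how often the items appear together) and confidence (how often the rule holds when its left side is present). The sketch below computes both for a made-up set of shopping transactions.

```python
# Made-up transactional database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions with the antecedent, the fraction with the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


# Rule under test: customers who buy bread also buy milk.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support)  # 0.5
```

Here bread and milk appear together in half of the transactions, and two out of the three transactions with bread also contain milk.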

Fig 4: Data Mining Process

Data mining process

Data must be processed before the algorithms are applied, to bring it into a form suitable for pattern identification. The process consists of six phases, as shown in Fig 4:

Define the problem by identifying variables, objectives, and requirements, then translate them into a problem definition.
Prepare the data by constructing the final data set, which should be clean (error free) and formatted. The major tasks in this phase are selecting tables, records, and attributes, and transforming the data for the next phase.
Explore the data: collect and describe it. Statistics are used in this phase.
Build models by selecting a model and applying functions such as association, classification, and clustering. Different functions can be used for the same data type; some functions can only be used for specific data types.
Evaluate the model. If it does not satisfy expectations, the model is rebuilt until it achieves the objectives.
Deploy the result and present it as a simple report or as a complex database. [7]

What can data mining do?

Using data mining, a retailer can use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop goods and promotions that appeal to specific customer segments.

These days, companies with a strong focus on retail, communication, finance, and marketing use data mining. Data mining enables these companies to find out the impact on sales, customer satisfaction, and profits. It also makes it easier for them to determine relationships among factors such as product, price, staff skills, customer demographics, economic indicators, and positioning. Finally, data mining makes it easy to move from summary information to detailed transactional data. [8]

Here are some examples of companies that use data mining. First, American Express can suggest products to its cardholders based on analysis of their monthly expenditure. Second, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. Third, Walmart has over 2,900 stores in 6 different countries and transmits their data to its 7.5 terabyte data warehouse. It allows more than 3,500 suppliers to access the data and perform analyses. The suppliers use this information to manage local store inventory and identify new opportunities. [8]

Information Security

In the past, people used to carry their money, gold, and silver with them, with a high chance of losing them. Then they realized that they needed a safe place and should avoid carrying valuables. Banks then began operating by guaranteeing the security of customers' savings. We are not straying from our topic; we are trying to show its importance. Today, information in a warehouse can be much more important than savings in a bank. Transferring information needs to be as secure as transferring savings. Companies pay a lot of money to make their data as secure, confidential, and usable as possible.

Fig 5: Governments Security Classification Cost 2009

Fig 5 shows that the US government spends more on information security than on other security matters. All of us are concerned about our information security. Indeed, we need it most of the time to minimize security breaches, even though it cannot end them.

History

During World War II, armies and governments needed to prevent information leaks. They focused on developing new technologies to help hide highly secret information. Cryptography, the study of hiding information, is one of the most popular and powerful techniques and is still in use today. "The US department of Defense and the Department of State improve this technique since the 1970s with expertise in cryptography." [9]

Encryption was once used only by governments, but now it is used by organizations and individuals as well.

It is easy to encrypt your email so that no one other than the receiver can read it in transit. Information security has become an ongoing learning process in a large field that includes techniques, algorithms, issues, and more. Cloud computing, for instance, is used to share and store information easily and safely on servers. Information security is taken seriously in many sectors, such as business and healthcare. As the world grows more concerned about data security, governments and organizations add new principles and strict laws to guarantee it. The ISO27K standards were established by ISO (International Organization for Standardization) to protect the information on which we all depend. Although these laws exist, computer crime is increasing; making people aware of how to avoid information security problems may improve the security of their information.

Definition

There is no universal definition of information security, but we can say it is the process of protecting data by granting authorization to see and use certain data. To understand information security we need to understand its three aspects: confidentiality, integrity, and availability.

First, the data must be confidential, so that every user's information is kept in the system at a very high level of privacy, and no one can reach it without his or her permission.

Providing passwords and IDs can address this issue. But confidentiality is not achieved by the system alone, in other words by the DBMS (database management system).

Take the example of a person who saves sensitive information about his company on a USB drive with no access control (anyone who holds the file can see it), and one bad day the USB drive is stolen. Another example is someone who owns a credit card and sets the password to all zeros or to his birth date. In both cases, the system offered a privacy option, but the person did not use it properly. Let's move to a more complex situation: a company with a very large database of customer information. Hiding all the data is not a good idea, because users want to access as much data as possible with few constraints. It is difficult for the security system to know which data is sensitive and which is not. Precision is an approach whose goal is to reveal as much non-sensitive data as possible while protecting the rest (the sensitive data).
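
The precision idea can be sketched as a simple filter: reveal every field that is not classified as sensitive and hide the rest. The set of sensitive field names and the customer record below are hypothetical; deciding which fields are sensitive is exactly the hard part described above and is assumed here to be already done.

```python
# Assumed classification of fields; in practice this is the hard decision.
SENSITIVE_FIELDS = {"card_number", "password"}


def public_view(record, sensitive_fields=SENSITIVE_FIELDS):
    """Return a copy of the record with the sensitive fields left out."""
    return {key: value for key, value in record.items()
            if key not in sensitive_fields}


# Made-up customer record.
customer = {
    "name": "A. Customer",
    "city": "Example City",
    "card_number": "0000-0000-0000-0000",
    "password": "secret",
}

visible = public_view(customer)
print(sorted(visible))  # ['city', 'name']
```

The original record is left unchanged, so authorized users can still be given the full data while other users see only the non-sensitive part.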

Next is the integrity aspect: the data must be consistent and reliable with respect to the intended data, to minimize data loss and inconsistencies. Information should not be changed or removed arbitrarily.

"A successful attack can happen when integrity is violated first then the system availability or confidentiality" [10]. The DBMS works on this aspect by analyzing and reducing the failures that could happen. Because such failures are common and reconstruction is costly, integrity is very important for organizations.
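
One common way to detect that data has been changed is to store a cryptographic digest of it and compare the digest later. The sketch below uses SHA-256 from Python's standard hashlib module on a made-up record. It is only an illustration of the idea, not a description of how any particular DBMS enforces integrity.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Made-up record and the digest stored when it was written.
original = b"customer=42;balance=100"
stored_digest = fingerprint(original)

# Later, the record is read back and checked against the stored digest.
tampered = b"customer=42;balance=900"
is_intact = fingerprint(original) == stored_digest
tampering_detected = fingerprint(tampered) != stored_digest
print(is_intact, tampering_detected)  # True True
```

A plain digest only helps if the attacker cannot also replace the stored digest; a keyed digest (for example an HMAC) is normally used when that cannot be guaranteed.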

Last but not least is availability, which serves the sharing of information. A system with correct control, storage, and communication processes serves the availability aspect.

Risk Management

Risk management for data refers to the guidelines used to reduce data security risks to an acceptable level. This is done by identifying the weaknesses in the security system that give rise to threats. A security system needs risk management to deliver the value of security well. In other words, it provides a backup plan for when something goes wrong. It is not limited to security issues. It also covers managing the operational and economic costs of establishing a high level of protection for the IT systems and data that support an organization. Some impacts cannot be measured in specific units but can be described as high, medium, or low, for instance loss of public confidence or loss of credibility. In this research, we are concerned only with information security management, not business risk management.

To manage risk in information security, we must first collect the factors that could affect it, which are:

Hardware

Software

People who are using the system

Sensitive data

System interfaces

Critical

"A threat is a circumstance or event with a harm effect to an information system". Threat sources are common. They can be human threats, caused by people such as hackers, or environmental (physical) threats, such as a power failure. Also, a threat can cause direct damage (primary threat) or long-term damage (secondary threat).