University of South Australia
School of Computer and Information Science
Profiling Users for Behavioral Intrusion Detection
Minor Thesis
Student: Grant Pannell
Supervisor: AsPr. Helen Ashman
November 2009
Bachelor of Computer and Information Science (Honours)
Disclaimer
I declare that this thesis does not incorporate, without acknowledgment, any material previously submitted for a degree or diploma in any university; and that to the best of my knowledge it does not contain any materials previously published or written by another person except where due reference is made in the text.
Grant Pannell
Acknowledgements
I wish to express my sincere gratitude to my honours thesis supervisor Helen Ashman, who is an associate professor and director of the Security Lab at the University of South Australia, for her helpful suggestions, support, and encouragement throughout the research and writing of this thesis.
I would also like to thank family, friends, and members of the Security Lab for their support, encouragement, and constructive criticism during my honours year.
Abstract
Intrusion detection systems (IDS) have often been used to analyze network traffic to help network administrators quickly identify and respond to intrusions. These detection systems generally operate over the entire network, identifying “anomalies” deemed to be atypical of the network’s normal collective user activities; however, they could be adapted to be host-based, instead of network-based, so that the normal usage patterns of an individual user, on a single machine, could be profiled. This would allow the detection of anomalies when a user begins to use different applications or uses the same applications in a different manner. This host-based detection system can be a combination of a knowledge-based system, in which a user specifies rules that cannot be broken, and a behavior-based system, in which learning algorithms gather and analyse many different characteristics to create a user profile. This research attempts to determine whether such a system is feasible.
This thesis outlines the background of intrusion detection systems, user profiling, and machine learning, along with relevant literature. The methodology of building a host-based, user-profiling intrusion detection system is described with three aims: firstly, to successfully profile a user by recent usage patterns and specified rules; secondly, to have a low failure-of-detection rate; and finally, to be performance-efficient so that it does not tax system resources. The detection system prototype was developed for Microsoft Windows, and it analyses several behavioral characteristics, such as typing speed, the applications currently open, and the performance values of those applications, e.g. CPU usage and memory usage. The results, from a user study of seven different machines with different purposes, show that the system is able to create a user profile and learn a user's activity, given a learning period of approximately 10 days. The results also show that the most effective characteristics were user-specific habits, such as the websites viewed and the keystroke patterns, and that some characteristics tend to be more useful depending on their relation to the authorized user. The built prototype was able to detect an unauthorized user within 90 seconds of that user beginning to use a machine with a trained profile.
Table of Contents
Disclaimer
Acknowledgements
Abstract
Table of Contents
List of Figures
1. Introduction
1.1 Motivation
1.2 Research Question
1.3 Scope
1.4 Contributions
2. Literature Review
2.1 Behavioral Intrusion Models
2.2 Behavioral Intrusion Detection Systems (User Profiling)
2.3 Behavioral Intrusion Detection Systems (Application Profiling)
2.4 Behavioral Intrusion Detection Systems (Network Profiling)
2.5 Using Multiple Characteristics
2.6 Improving Performance
2.8 Summary
3. Methodology
3.1 Outcomes
3.1.1 Developed System
3.1.2 Testing
3.1.3 Ethics
3.1.4 User Study
3.2 Development Environment
4. Implementation
4.1 Prototype Architecture
4.2 Algorithms
4.2.1 Data Collection Engine
4.2.2 Characteristic Analysis Engines
4.2.2.1 CPU Usage
4.2.2.2 Memory Usage
4.2.2.3 Number of Processes
4.2.2.4 Number of Windows
4.2.2.5 Websites Viewed
4.2.2.6 Keystroke Analysis
4.2.2.7 Misuse Detection
4.2.3 Data Mining Engine
4.2.4 Alert/Action Engine
4.3 Problems and Flaws
5. Results
5.1 Gathering Data
5.2 Detection Rates
5.2.1 False-Positive Rate
5.2.2 False-Negative Rate
5.2.3 True-Positive Rate
5.2.4 True-Negative Rate
5.2.5 Measuring Detection Rates
5.3 Discussion
5.3.1 Individual Characteristics
5.3.1.1 CPU Usage
5.3.1.2 Memory Usage
5.3.1.3 Number of Windows
5.3.1.4 Number of Processes
5.3.1.5 Keystroke Usage
5.3.1.6 Websites Viewed
5.3.2 Combined Characteristics
5.3.3 Intrusions
5.3.4 System Resources
6. Future Work
7. Conclusion
References
Appendix A: Extended Abstract
Appendix B: Memory Usage Graphs With Outlier
Appendix C: Data Tables For Included Graphs
CPU Usage
Memory Usage
Number of Processes
Number of Windows
Websites Viewed
Keystroke Analysis
Data Mining Engine
Intrusions
List of Figures
Figure 1 Architecture of a two-engine detection system (Lunt et al. 1989)
Figure 2 Use of Genetic Algorithms to profile user behaviour (Balajinath & Raghavan 2001)
Figure 3 The architecture of the implemented prototype
Figure 4 An intrusion pop-up from the Alert/Action Engine on triggered digraph 'gr'
Figure 5 Number of False-Positives per Machine for the CPU Usage Characteristic
Figure 6 Table of False-Positive Rates, for each machine, for the CPU Usage Characteristic
Figure 7 Number of False-Positives per Machine shown over time for the CPU Usage Characteristic
Figure 8 Number of False-Positives per Machine for the Memory Usage Characteristic
Figure 9 Table of False-Positive Rates, for each machine, for the Memory Usage Characteristic
Figure 10 Number of False-Positives per Machine shown over time for the Memory Usage Characteristic
Figure 11 Number of False-Positives per Machine for the Number of Windows Characteristic
Figure 12 Number of False-Positives per Machine shown over time for the Number of Windows Characteristic
Figure 13 Number of False-Positives per Machine for the Number of Processes Characteristic
Figure 14 Number of False-Positives per Machine shown over time for the Number of Processes Characteristic
Figure 15 Table of False-Positive Rates, for each machine, for the Memory Usage Characteristic
Figure 16 Number of False-Positives per Machine for the Keystroke Usage Characteristic
Figure 17 Number of False-Positives per Machine shown over time for the Keystroke Usage Characteristic
Figure 18 Number of False-Positives per Machine for the Websites Viewed Characteristic
Figure 19 Number of False-Positives per Machine shown over time for the Websites Viewed Characteristic
Figure 20 Number of False-Positives per Machine for the Data Mining Engine
Figure 21 Number of False-Positives per Machine shown over time for the Data Mining Engine
Figure 22 Table of False-Positive Rates for each Machine
Figure 23 False-Positives Rate of each Characteristic over all Machines
Figure 24 Time to Detect Intrusions for each Intrusion Test
1. Introduction
Intrusion detection is an area of computer security that attempts to detect actions that compromise the confidentiality, integrity, or availability of a computing resource. An intrusion detection system (IDS) is often network-based, so that the system looks for patterns and characteristics in network traffic to determine whether or not the data is malicious; however, as encryption becomes widely adopted it may become impossible to closely inspect packets flowing over a network (Debar 1999). Instead, a host-based intrusion detection system could be used to track either the behavior of applications, so that the execution flow of an application is profiled to determine if it has been exploited (Forrest et al. 1996), or the behavior of the user, so that the way the user uses the system is profiled to determine if the current user is unauthorized. This thesis deals with the latter: tracking the behavior of a user to determine if an intrusion has occurred on a system with a host-based intrusion detection system installed.
Systems can also be classified in the way that they detect intrusions:
· Misuse Detection
· Anomaly Detection
The simplest detection method is misuse detection. Misuse detection analyses incoming data by comparing it to a set of attack signatures in a database. It is often known as rule-based detection, as it explicitly specifies what is and is not allowed. Because of this, misuse detection often gives a high detection rate and a low false-alarm (false-positive) rate; however, as rules are explicitly set, the system only ever responds to known attacks and cannot predict new types of attack (Labib 2004). The other detection method is anomaly detection, in which the system gathers audit data over time and uses learning algorithms to analyse whether new data drifts significantly away from a “normal” pattern. Anomaly detection is more likely to give higher false-positive rates and lower detection rates, as an intrusion is not explicitly defined and behavior may change over time. Because it learns what is normal rather than relying on fixed rules, however, anomaly detection can adjust to new attack types or profiles more easily (Labib 2004).
The system reported in this thesis combines both methods of detection to decrease false-positives, where an attack did not occur but the system detected one, and false-negatives, where an attack occurred but was not detected. For example, when tracking the behavior of a user, the user could specify, using the misuse detection engine, that they never use a specific application, such as Notepad; the system will still attempt to create a user profile by collecting audit data, using the anomaly detection engine, and detect an intrusion based on behavioral use.
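The combination of the two detection methods can be illustrated with a minimal Python sketch. It is not the prototype's implementation: the function names, the forbidden-application list, and the three-standard-deviation threshold are assumptions chosen for illustration only.

```python
from statistics import mean, stdev


def misuse_alert(running, forbidden):
    """Rule-based check: return the running applications the user has forbidden."""
    return {name.lower() for name in running} & {name.lower() for name in forbidden}


def anomaly_alert(history, new_value, threshold=3.0):
    """Behavioral check: flag a value far from the mean of the audit data.

    The threshold (in standard deviations) is an assumed value.
    """
    if len(history) < 2:
        return False  # not enough audit data to describe "normal" yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) > threshold * sigma


def detect(running, forbidden, history, new_value):
    """Combine both engines and return the reasons for raising an alert."""
    reasons = []
    broken = misuse_alert(running, forbidden)
    if broken:
        reasons.append("rule broken: " + ", ".join(sorted(broken)))
    if anomaly_alert(history, new_value):
        reasons.append("behavior outside learned profile")
    return reasons


# Hypothetical audit data: number of windows open at each earlier sample.
windows_history = [4, 5, 6, 5, 4, 6, 5]
print(detect(["explorer.exe", "Notepad.exe"], ["notepad.exe"], windows_history, 6))
print(detect(["explorer.exe"], ["notepad.exe"], windows_history, 20))
```

In this sketch the first call breaks an explicit rule while the window count stays within the learned range, and the second call breaks no rule but deviates from the learned behavior, so each engine catches a case the other would miss.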
A behavioral intrusion detection system that tracks user behavior must look at several different characteristics to gather audit data, which is used to determine “normal” behavior. Characteristics that the system includes are:
· Applications Running
· Number of Windows Open
· Performance Details of Running Applications
· Keystroke Analysis
· Websites Viewed
The applications that are running on a machine would allow profiling of a user to determine the default or typical applications they use. For example, a user may exclusively use Notepad for text editing, and if WordPad is used then this may trigger an alarm, as it is not part of the profile. The number of windows that are currently open could also distinguish one user from another, depending on their style of use. Performance details of the running applications relate to metrics such as CPU usage and memory usage. These performance values could indicate how the applications are being used. For example, abnormally high CPU usage in a database application could mean that an intruder is extracting data. Keystroke analysis includes such characteristics as speed, combinations of keys, and pauses between key presses (Bergadano et al. 2003). This analysis is quite useful, as many users have a characteristic typing style. Websites viewed is also an obvious metric, as many users visit a set of sites on a regular basis. All of these characteristics can also be related to the time of day, duration of use, or time between uses, so that if the machine is suddenly being used in “off-peak” hours, a different portion of the profile is applied. For example, having windows open and typing activity in the early hours of the morning on a workplace machine could mean an intrusion has occurred.
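As an illustration of the keystroke characteristic, the following Python sketch measures the time between consecutive key presses for each two-key combination (digraph). The event format, the function names, and the timings are hypothetical and are not taken from the prototype.

```python
from collections import defaultdict


def digraph_latencies(events):
    """Group press-to-press latencies by digraph.

    `events` is a list of (key, press_time_ms) tuples in typing order;
    this input format is an assumption made for the sketch.
    """
    latencies = defaultdict(list)
    for (key1, time1), (key2, time2) in zip(events, events[1:]):
        latencies[key1 + key2].append(time2 - time1)
    return dict(latencies)


def mean_latencies(latencies):
    """Average latency per digraph, a simple summary of typing style."""
    return {digraph: sum(values) / len(values)
            for digraph, values in latencies.items()}


# Hypothetical key presses: the digraph "gr" is typed twice.
events = [("g", 0), ("r", 120), ("a", 200), ("g", 500), ("r", 600)]
print(mean_latencies(digraph_latencies(events)))
```

A profile built this way could then be compared against the latencies of the current user; how large a difference should count as an anomaly is a design decision that this sketch leaves open.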
An algorithm must be chosen to accurately profile a user while in anomaly detection mode. There are both static and dynamic algorithms. An example of a static algorithm is occurrence frequency distribution, where events are recorded along with the number of times they have occurred. This information is used to build a frequency distribution, which can then be checked to determine whether new values significantly skew the current distribution. A static algorithm is unable to find temporary variations in behavior over a defined period, unlike a dynamic algorithm (Yeung & Ding 2001). A dynamic algorithm can explicitly model temporal variation and is better able to represent temporary irregularities. An example of a dynamic algorithm is a genetic algorithm that gathers the user actions, calculates their fitness using a given function, and combines them to create a new generation of data to evaluate. If the new generation of data solves the problem the algorithm can end; otherwise, the generated data may be fed back into the fitness function to generate yet another set of data (Yeung & Ding 2001). Certain characteristics may be better suited to one type of algorithm than the other; however, for the sake of simplicity the prototype uses static methods.
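A static occurrence frequency distribution can be sketched in a few lines of Python. The use of total variation distance to measure how far recent activity has shifted from the profile is an assumption made here for illustration; the thesis does not prescribe a particular measure, and the function names and sample data are hypothetical.

```python
from collections import Counter


def relative_frequencies(events):
    """Map each event to the fraction of all recorded events it accounts for."""
    if not events:
        raise ValueError("at least one event is required")
    counts = Counter(events)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def distribution_shift(profile_events, recent_events):
    """Total variation distance between two frequency distributions.

    Returns 0.0 for identical distributions and 1.0 when they share no events.
    """
    p = relative_frequencies(profile_events)
    q = relative_frequencies(recent_events)
    return 0.5 * sum(abs(p.get(e, 0.0) - q.get(e, 0.0)) for e in set(p) | set(q))


# Hypothetical audit data: applications seen while the profile was learned.
profile = ["notepad"] * 8 + ["browser"] * 2
print(distribution_shift(profile, ["notepad"] * 4 + ["browser"]))  # same mix of applications
print(distribution_shift(profile, ["wordpad"] * 5))                # application never seen before
```

An alert would be raised when the shift exceeds a chosen threshold. Because the profile is a single set of counts, the sketch shares the limitation noted above: it cannot model how behavior varies over time.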
This thesis aims to investigate whether a host-based intrusion detection system that tracks user behavior by combining several different characteristics is practical to implement, by looking at detection performance and system resource usage.
1.1 Motivation
The motivation behind a behavioral intrusion detection system comes from needing to identify unauthorized use of a machine. Common network-based intrusion detection systems are becoming unable to operate on network traffic due to the adoption of encryption (Debar 1999), and as a result, the need for a host-based intrusion detection system rises.
Research has been performed on designing and implementing behavioral intrusion detection systems; however, as shown in the literature review, many of these systems profile a user around a single characteristic rather than many. Combining these characteristics, as well as adding new characteristics to identify a user, may help in determining unauthorized use more quickly and more precisely.
The majority of previously implemented systems have used the UNIX operating system, which is heavily command-based. These systems look at the ordering of commands and the frequency of command usage, but not at keystroke patterns or at the properties reported in this thesis, which relate to graphical user interface (GUI) elements, such as the number of windows, and to common user activities, such as the sites a user views. The implemented behavioral system uses the Microsoft Windows operating system, which is based on a graphical interface rather than a command-line interface. This means the implemented intrusion detection system can be geared towards an “every-day”, non-specialist user who is not always typing in commands to perform tasks.