The Florida State University
College of Arts and Sciences
INTRUSION DETECTION BASED ON STATIC ANALYSIS OF PROGRAM BEHAVIOR
By
Wei Yu
February, 2002
A master project submitted to the
Department of Computer Science
In partial fulfillment of the requirements for the
Degree of Master of Science
Major Professor: Dr. Robert A. van Engelen
Dr. Robert A. van Engelen
Major Professor
Dr. David Whalley
Committee Member
Dr. Ernest McDuffie
Committee Member
Acknowledgement
I would like to express my sincere appreciation to my major professor, Dr. Robert A. van Engelen. Thanks for his guidance, encouragement, financial assistance, valuable discussion and help. Thanks Dr. David Whalley to be my committee member, and special thanks for his encouragement, valuable discussion and warm help. Thanks Dr. Dr. Ernest McDuffie to be my committee member.
TABLE OF CONTENTS
List of Figures ……………………………..……….………………….3
Abstract ……………………………………..…………………………4
1. INTRODUCTION ………………………..…….………………….5
2. RELATED WORK ………………………..…….………………….7
3. DESIGN AND IMPLEMENTATION …………….……………...10
3.1 GENERAL DESCRIPTION …………………………………….11
3.2 CONSTRUCTION OF A DFA FROM A CFG ……...………….12
3.2.1 FROM ASSEMBLY CODE TO A CFG……………….....….…12
3.2.2 GENERATION OF A DFA FROM A
REGULAR EXPRESSION……………………………..……13
3.3 GENERATION OF A CONTEXT-FREE
GRAMMAR FROM A CFG…………………………...………...16
3.4 OPTIMIZATION ………………………………………….……..17
4.TEST RESULT ………………………………………………..……..19
5. PROBLEMS AND DISCUSSIONS ………………………….…….25
5.1. INDIRECT FUNCTION CALL ……..……..……………….…..25
5.2 LEX TOOL ………………………………………………………25
5.3 LINKING LIBRARY ……………………………………………..26
5.4 AMBIGUITY AND CONFLICTS OF YACC…………………....26
5.5 OTHER IMPLEMENTATION ISSUES
AND COMPLICATIONS ………………………….….….27
6. CONCLUSIONS AND FUTURE WORK ……………….….….…..28
7. REFERENCE ……………………………….…………….…..….…..30
LIST OF FIGURES
Fig 1(a,b) Deleting Empty Node …………………………..…………….31
Fig 2(a,b) Optimization of Diamond Path ………………………………32
Fig3(a,b) Optimization of Loop …………………………………………33
Fig4a Control flow of program “test.c” …………………………………34
Fig4b Control flow of program “test.c” after deleting empty nodes ……35
ABSTRACT
This project investigates the use of static program analysis techniques for direct intrusion detection based on monitoring program behavior. A detailed description of the method and analysis is presented. We use two methods to construct a model of application behavior through analysis of the control flow graph (CFG) of the assembly file of a program: a deterministic finite automaton (DFA) constructed via LEX from the CFG and a pushdown automata (PDA) generated from a context-free grammar parser via YACC from the CFG. The result shows that through modeling complete application behavior either as a DFA or a PDA, we can directly detect intrusions without possible false alarms.
1. INTRODUCTION
Although a great progress has been made in intrusion detection since 1980, the number of operating system vulnerabilities discovered each year is on the rise [5]. Many of these vulnerabilities are fixed within days or weeks of their discovery, but effective intrusion detection systems are still a powerful preventative security measure.
An intrusion detection systemattempts to detect intruders breaking into your system or monitors legitimate users misusing system resources. There are many commercial intrusion detection systems on the market today. But all of the commercial intrusion detection systems have been implemented independently of the operating system they attempt to protect. It has been argued that this approach is flawed [6], and that an intrusion detection system should be integrated into the operating system. Such an integrated intrusion detection system could provide real time detection and response to intrusions. Real-time intrusion detection systems that are part of an operating system are based on monitoring program behavior by analyzing operating system calls and detect attacks by verifying the legitimacy of the operating system calls executed by the program.
There are two main types of intrusion detection based on program behavior: anomaly and misuse detection. The anomaly detection model devises a set of statistical metrics that model the behavior of an entity, usually a user, a group of users or a host computer. The misuse detection model on the other hand, works by searching for a set of known attacks that have been stored in the system's database. The knowledge of the attacks is encoded as a set of attack signatures, which are essentially patterns that occur every time an attack takes place. The former approach has the problem of false alarms, and the later approach may miss intrusions.
An operating system provides a set of basic operations, system calls, which an application uses to ``communicate with the outside’’. All services that an operating system provides can be requested through system calls. From the point of view of a user, the entire functionality of the operating system is defined by the system calls. All attacks that caused damage to a system must have occurred via a set of system calls. The set of all possible system call sequences that a program executes characterizes the fundamental behavior of the program. Therefore, a possible way to find attacks on the program’s code is by checking the system call sequences executed by the program against the set of all system call sequences that a program can execute. System call parameter values may be included, but are generally not necessary to model program behavior. Many common attacks such buffer overrun attacks can be detected. Furthermore, the complete description of program behavior will eliminate false alarms that plague anomaly detection methods that rely on using training data to construct a model of normal behavior.
The problem is to find a suitable representation for the set of all system call sequences that a program can execute, which is possibly infinite. This representation forms the database of normal expected behavior of a program. Once the database is built, the checking of the system calls of the currently executing program against the database must be performed within a limited amount of time, preferably in unit time per system call.
This project will focus on deriving database representations that allow unit time checking per system call. This approach deals with attacks that cause a program to behave in a manner inconsistent with its author’s intent by checking the sequence of system calls. Of course, some security problems are directly attributed to faulty application logic, which allow for execution of arbitrary machine code of the attacker’s choice [7]. One limitation of our intrusion detection system is that it does not detect attacks that exploit explicit logic errors. Since some intrusion will use the program logic bugs and these kinds of errors usually cannot be found by static analysis, we will not consider this kind of attack in this study.
The rest of this paper is organized as follows: Section 2 discusses related work, Section 3 discusses our framework and implementation, Section 4 discusses the result, Section 5 lists the problems existed in this investigation, Section 6 gives the conclusion and future work.
2. RELATED WORK
In order for a process to request resources from the operating system, it must use the operating system call interface. By monitoring this interface we can learn the behavior of the process being monitored. There are many research efforts about intrusion detection systems based on examining sequences of operating system calls. Damon Snyder [1], Forrest et al. [2, 3, 4], use normal behavior of an application by sliding a window across a system call trace of an application and recording the positional relationships among the system calls in the window. A process behaving mysteriously may be monitored via a window into the system call interface. This approach slides a window of size k across the system call trace and records positional relationships among the system calls in the window. Then the recorded information will be used to monitor execution of the deployed program and raise an alarm if the execution diverges from the normal behavior. Like other current model-based approaches, they all share one common problem: a truly robust intrusion detection system must solve a special case of the machine-learning problem. Although the sliding window approach is simple to implement and efficient, it still has the false alarm problem.
The characterization techniques used in this project is similar to David Wagner [9]. In his work, the basic method is to set up the non-deterministic finite automaton (NDFA) model through the static analysis of the control-flow graph (CFG) of a program. He pointed out that there were no false alarms when using the CFG-based static analysis to build a model of normal behavior. Since by construction, every possible path of execution through the CFG corresponds to an accepting path of the NDFA, and thus every dynamically possible execution trace will be accepted by the NDFA. The NDFA can be expressed as non-deterministic pushdown automaton (NDPDA), or equivalently, a context-free grammar. He implemented a parsing routine based on top-down parsing technique. His application was written in Java and the worst-case asymptotic running time is cubic in the length of the system call sequence being analyzed. In our work, the worst-case asymptotic running is linear in the length of the system call sequence, which allows for real-time intrusion detection.
David Wagner also mentioned some problems with the static analysis approach. These problems are discussed below. In our project we attempted to adequately address these problems in the implementation of the intrusion detection system.
INDIRECT JUMPS
The presence of indirect function calls makes the implementation of the intrusion detection method based on static analysis difficult. To build the program call-graph in the presence of function pointers (indirect calls), it is crucial to be able to predict the possible targets of every indirect call. It is impossible to predict the targets, because the pointer values are determined at run time and may point to any of the functions defined in any of the object files linked to the application. The author simply assumed that every pointer could refer to any function whose address has been taken. Even this crude technique still takes significant effort. Since the GNU stdio library implementation uses function pointers extensively to emulate an object-oriented programming style, the inferred models are too imprecise. So the author had to replace the automated analysis results with hand-crafted models. There are many disadvantages to hand-built models: they are time-consuming to construct; they are difficult to get right; and they make it unpleasant to keep up with changes made to the code. Therefore, a more precise automatic analysis of indirect function calls is necessary for a realistic application of the static analysis method.
LIBRARY LINKING
David Wagner’s method requires a model for each library function that might be directly or indirectly called from a program. An automated analysis algorithm needs to recursively analyze all of the related library functions through analyzing the library function dependencies. Because of the extensive use of function pointers by the GNU libraries, which makes predicting the targets of the calls impossible, the author had to use hand-built models. The disadvantage of the hand-built models has been discussed above.
DYNAMIC LINKING
Dynamically linked libraries challenge the static analysis method. Since the binding of the function is determined at run time, the current static analysis cannot determine the possible system call sequences. Therefore the prediction information can be out of date during run time and this can result in false alarm. To avoid this problem, it is necessary to update the model information at run time from object code. The author did not explore this direction.
THREADS
Threads pose yet another challenge, because the context-switching operation introduces another type of implicit control flow, especially for user threads, which is more complicated. There is no good general solution. Threaded code may contain security vulnerabilities due to synchronization bugs and there is no good way to detect those bugs.
Our work will focus on directly exploiting the system call sequence of the application program through C code. We will investigate the characterization of a program by building a DFA and compare this to a context-free grammar representation. We will investigate to what extent YACC can be used to create a parser for the context-free grammar.
3.DESIGN AND IMPLEMENTATION
This section describes the design and implementation of a static analysis algorithm that first constructs CFGs from the assembly of a source program and then constructs either a regular expression representation or a context-free grammar for the set of all possible sequences of system calls of the program.
3.1 GENERAL DESCRIPTION
We consider two implementations in our study to characterize the set of all system call sequences a program can execute: a DFA representation by using LEX to transform a regular expression into a DFA and a context-free grammar by using YACC to construct a parser for this grammar. We do not investigate the use of an NFA as in David Wagner’s work, since both DFA and context-free grammar parser are faster to use in a real-time intrusion detection system (analysis requires unit time per system call). Our approach requires a static analysis of the assembly file for each library function that might be called. In order to get the assembly files of all of the library functions, we changed the Makefile used to build libc.a. The CFGs of all functions of a program statically models the program’s behavior. These can be derived through the analysis of the assembly file generated by the GNU C compiler. The CFGs will show us all possible sequences of the function calls in the application program. The analysis is recursively applied to find the CFGs of the library functions which might be called by the program. Based on the analysis of the CFGs we can finally derive the regular expression or context-free grammar of the system call sequence for the program. A DFA can be constructed via the tool LEX or a context-free grammar parser can be constructed via the tool YACC. With the DFA or context-free grammar parser, the behavior of the program can be monitored by following the transitions on the DFA for each system call executed or can be monitored by using the parser to parse each system call as a terminal symbol of the grammar. For LEX, we need to combine the regular expressions of all functions called by a program into one file for input to LEX. Similarly, we need to combine the context-free grammars of all functions called by a program into one file for input to YACC. The retrieval and construction of CFGs of all functions called by a program is performed automatically by a search algorithm, which we developed for this purpose.
As mentioned in David Wagner’s work, the presence of indirect function calls in a library of program is a big challenge for the static analysis method to succeed. We need to solve this problem in both a regular expression and a context-free grammar. We use the simple assumption that an indirect function call can target any functions whose target address has been used in the program or statically linked libraries.
3.2 CONSTRUCTION OF A DFA FROM A CFG
3.2.1 FROM ASSEMBLY CODE TO A CFG
The text graph expresses the logical relationship of the parts of a program. The text graph is constructed by analyzing the assembly files generated by the compiler. During the preprocessing, the compiler generated the assembly file, which provides us with detailed information about all of the possible running directives of the program. Since we are only concerned about the sequence of the function/system calls in this study, we can neglect the directives and only extract the logical relationship among the segment of the code that involves system calls. The output result is a CFG for each function. The nodes of the CFG consist of the basic blocks. Each basic block has a unique number. Each node has edges to previous blocks and successive blocks in the graph, and each node has references to function and system calls performed in the basic block. The previous blocks are the blocks that appear immediately in front of the current block in the program running procedure. The successive blocks are those blocks that appear immediately after the current block in the program running procedure. Therefore, the CFG represents the relationships among the possible sequences of the function and system calls. By analyzing the CFG of each function, we can extract all possible sequences of the function and system calls, and output a regular expression or context-free grammar that represent the set of all possible system call sequences that a function can execute.
On Linux machine, we use “gcc –S” to generate the assembly files, and then get the CFG by analyzing the assembly files. The following sections give a description of transforming CFGs into regular expressions.
3.2.2 GENERATION OF A DFA FROM A REGULAR EXPRESSION
DATA STRUCTURE FOR CFG
Based on the characteristics of the CFG, we define the basic blocks as nodes of the CFG. Each node has several previous pointers and successive pointers as discussed in the previous section. The previous pointers point to immediate predecessor blocks, and successive pointers point to the immediate successor blocks. The function/system calls are stored as plain string representations in each of the nodes. The CFG can be expressed as a set of strongly connected components that separate the loops in a function. We define node 1 as root, which has no previous nodes (a 0 in degree node). The leaf node is defined as node 0, which has no successive nodes (a 0 out degree node).