Very Large Text File Viewer

2009
Electrical Engineering Faculty, Technion 2009
Software System Laboratory
VERY LARGE TEXT FILE VIEWER
Requirements & Design Document

By:

Amitay Svetlit

David Nasi

Advisor:

Oved Itzhak

Table of Contents

1.Motivation – Very Large Text File Viewer

2.Project Objectives & Goals

3.User Interface Manual

4.General Approach

5.Architecture

6.Basic Layers Design Overview

6.1Data Structure Layer

6.2Business Logic Layer

6.2.1Searching Capabilities Design Concept

6.3Graphic Interface Layer

7.Design Components

8.Class Overview

8.1IndexAndStatistics

8.2Abstract Controller & LTFController

8.3Abstract Searcher & LTFSearcher

1.Motivation – Very Large Text File Viewer

Conventional text file viewer applications (e.g. Notepad, Wordpad) employ simple file access patterns (i.e. reading the entire file upon opening). These patterns make the user response time directly proportional to the file size. As a result conventional text file viewer applications are adequate for moderate files but poorly suitable for very large files (user response time wise). Trying to open a large text file (~500 Mb) in the standard WINDOWS Notepad application results in system thrashing and/or application crash. However very large text files (VLTFs) are in common use in logging processes and often need to be reviewed by professionals that are interested in analyzing them.

Conventional text file applications do not provide tools for custom search and index, Our application: “Large Text File Viewer” is an application optimized for viewing very large text files (VLTFs). In our project we aim to provide the user with a file size independent responsiveness and an overall convenient VLTF viewing and indexing experience.

2.Project Objectives & Goals

A user friendly graphic interface specialized for viewing VLTFs.

Application responsiveness is independent of input file size.

Extensible indexing architecture for employing indexing by arbitrary criteria.

3.User Interface Manual

Our graphic interface is optimized for viewing large text files. In this section we provide the manual for the functionality to be implemented in the VLTF viewer application according to the project requirements:

Basic Layout:

Legend -

3.1 Open File – Opens a windows explorer file browser through which it’s possible to choose the input text file.

3.2 Go to Line – Takes a line number as an input and shows the file text such that the entered line is in the middle of the “Text view area” (3.10).

3.3 Search - Performs a search in the file according to the search architecture rules, the results are presented in the “Search Results Pane” (3.7) sorted according to line indexes.

3.4 Conventional Scroll bar - Scroll bar that is used to scroll over the file by constant gaps.

3.5 Scroll Knob – Mechanism for scrolling with modifiable speed.

3.6 Line Numbers.

3.7 Search Results Pane - Presents the lines that are compatible with the search performed.

3.8 File lines counter – Total number of lines in the text.

3.9 Progress Bar - Shows processes percentage of progress (like the progress of search).

3.10 Text view area.

4.General Approach

Our main goal is to create a highly responsive viewer when displaying VLTFs (in the order of GBs). This excludes the conventional approach of reading the entire file into memory because reading a whole VLTF may render the application unresponsive to user actions for a very long time.

Our solution is based on segmentation of the file: the main idea is to divide the VLTF to chunks and load to memory only the chunks that are currently needed. Upon opening the VLTF ispartitioned to fixed-size segments and indexed. The segmentation process maps line numbers to segments in which they are held. At any given time the application holds in memory only a couple of segments which are needed to show the lines currently viewed by the user in the “Text View Area”. Consequentlyit’s possible to open the file for user view when only a small amount of segments were indexed. The indexing information may be saved persisted for later use (if the same file is opened again the indexing process may be skipped).

Since segmentation requires reading the entire file, a question arises about the user experience before the file is fully indexed. To still provide responsiveness in this case weextrapolate using some metrics (average number of lines in a segment) and provide the user with the estimatedline with an indication that this is only estimation.

5.Architecture

In order to address our objectives we decided to employ a three-tier software design pattern.'Three-tier' is a pattern in which the user interface, functional process logic ("business logic"), computer data storage and data access are developed and maintained as independent modules. The reasons for preferring this architecture are:

Development Stage - It’s intuitive to look at our objectives as a three parts problem: a graphic user interface, extendible functionality and a data store. These parts are hardly interconnected and we may implement a separate solution to each.

Maintenance Stage -Our application’s functionality logic is extendible and its implementation may change over time, however this extensibility does not imply changes in the graphic user interface module nor in the data storage module.

Thus our application will consist of these modules:

Data Structure layer - handling and maintaining the data structure of the VLTF.

Business Logic layer – provides functional algorithms which handles the functionality.

Graphic User Interface layer–provides graphic input/output functionality to the user according to the “User Manual”.

6.Basic Layers Design Overview

6.1 Data Structure Layer

Data Structure Layer is responsible for building the index. Segmentation of a VLTF is a process that goes over the VLTF and divides it to fixed sized segments, each segment holds a certain fixed amount of lines (e.g. segment #1 holds lines #1-#200, segment #2 holds lines #201- #380). Therefore the segmentationprocess builds the index that holds the information about which lines every segment contains. This data allows us to provide (almost) random access by line number. After the few first segments have been processed the application may fetch those segments to the memory and become responsive to the user (present him the beginning of the text file), however the segmentation process itself requires going over the whole VLTF thus the segmentation process dependant on the length of the file. After segmentation is finished, the access to all file lines is possible by their exact number (in contrast to the estimation mechanism that was described earlier).

We also assume that a user may open the same VLTF more than once, in which case it makes sense to cache the index by persisting it.

6.2 Business Logic Layer

Business Logic Layer provides the GUI layer with functionality. It operates as a mediator between the GUI and the Data Structure, upon requests that come from the GUI layer the Business Logic checks if it can satisfy it itself.If it cannot, a request to the Data Structure is made.

In our design this layer holds a couple of cache buffers that contain the recently used segments and provides their text to the GUI. Business Logic receives read request from the GUI. If the GUI needs to display lines that are held within those buffers then the Business Logic Layer provides them without accessing the Data Structure Layer. However if the user performs some other activity that requires lines that are not in thelookup buffers then the Business Logic requests those lines from the Data Structure layer. In order to retrieve data from the Data Structure layer the Business Logic sends asynchronous requests for data.The next request for data from the GUI will be submitted by the Business Logic only when the previous submitted request returns.Whendata request returns the next request to be served is chosen to be the last to arrive from the GUI (all the requests that came from the GUI in the period while the current request was served are discarded) thus skipping all unnecessary data requests that came before. By using this architecture we achieve responsiveness by two means:

The user interface will not block while data is retrieved from the hard drive.
The system skips unnecessary data read and always performs the most recent data request.

6.2.1Searching Capabilities Design Concept

The Business Logic layer provides extensible search architecture. In our design we provide the developers with an abstract class of a searcher. Through implementing this class it’s possible to interconnect with the GUI performing the search requests.

The abstractsearcher class provides the following interfaces:

Construction with a specified file to be searched.
Request for data to be searched with an extendible search criteria parameter.
Prepare for closing notification.
“Add Search result” event fire for the GUI layer notification.
“Clear Search results” event fire for the GUI layer notification.

By providing all notification needed for the basic searching: Open searcher, Start searching, close searcher notifications, we give the developer the freedom of managing the search and indexing implementation.

6.3 Graphic Interface Layer

This layer’s functionality is already expressed above (3) however we would like to emphasize some design issues:

Scroll Knob (3.5) – Scrolling over a VLTF using a conventional scroll bar (where scrolling speed is constant) may be very tedious. In the Scroll Knob feature scrolling speed is adapted according to the distance of the knob from the center of the bar (the speed is increased in a polynomial order of the distance from the center of the knob). Thus giving the user control over the scrolling speed.[OI1][D2]

Status Bar – The GUI contains a status bar that will provide the user information about:

(a) Progress of indexing and data structure construction (last line processed).
(b) Search Progress.

7.Design Components

According to the three tier design we devised the following Components Diagram:

As seen in the above datagram, user extensability is provided in the busness logic layer by adding the pairs of controller-searcher classes derived from the abstract classes.

8.Class Overview

This section contains the system class relation diagram.A detailed design of each class is specified in the next subsections.

8.1 IndexAndStatistics

Class Diagram:

IndexAndStatistics is the Data Structure Layer in the VLTF viewer. It’s responsible for:

Constructing the index file and saving it.

Replying to business logic requests.

We decided to implement this class as “Factory” design pattern not allowing more than one IndexAndStatistics instance per one very large text file. This is done because we want to create logic binding between the VLTF and its index. Our application saves the index data for each VLTF in a separate file by serializing the IndexAndStatistics instance.

SegmentData Types:

The information about each segment (i.e. meta-data) is held in BlockMetaData data type, it contains the following fields:

public int firstByte -Offset of the segment in the VLTF.
public int numberOfBytes - Size of the segment.
public int firstLine – Number of the first line in the segment.
public int numberOfLines – Number of lines in the segment.

BlockDBKeyType inner class:

This inner class implements the IComparable interface. It allows us to compare two distinct segments by their first line number and last line number. The comparison consists of determining which segment precedes the other in the VLTF.

Segments meta-data is saved in a sorted dictionary:

private SortedDictionary<BlockDBKeyType, BlockMetaData> blockDB.

This is the data structure for all segments that are indexed. The DB is sorted by the BlockDBKeyTypekey of each segment.

Building the DB

BuildBlockDBis the method that constructs the index file DB. It goes over the VLTF and maps it to segments by the following method:

Offset of the segment in the VLTF (i.e. first byte of each segment) is the end address of the previous segment + 1.

Size of each segment is bounded by maximumBlockSizesubject to: segments may end only at end of line (ensures that each segment holds only complete lines).

Instantiation of the Index:

Instantiation of the index class is done by “GetIndexAndStatisticsInstance” method.It determines whether an index file already exists (file with the same name as the VLTF with .VLTFI extension).

If the index file does not exist wecreate an IndexAndStatistics instance and invoke the index construction on a differentthread, returning the new instance immediately.
If the index file does exist we create an instance by de-serialization of the file, we then check whether the persistent data is up-to-date with the VLTF. Persistencecompleteness is done by comparing VLTFs last modify date with index file last modify date (assuming that VLTF may be modified only by adding new content to it as this is the case for Log files).
If the index file is not up to date we continue indexing from the last mapped segment by invoking the index construction on a different thread, returning the de-serialized index instance immediately.
If the index file is up to date we return the de-serialized instance of IndexAndStatistics.

Serving Read Requests:

The GetBlock[OI3][D4]MetaData method returns meta-data of a segment that holds a certain text line (by taking line number as an input). By using this method the Business Logic layer may translate a need for a certain line to a segment request in the Data Structure Layer.The requests to the Data Structure Layer are of the type: ReadRequestInfoType.Each request contains the meta-data of the requested segment and a buffer to which the segment will be written.

Request to IndexAndStatistics is done by invoking GetBlockContentAsynchronously method with a specific ReadRequestInfoTypeinstance. The requests are handled asynchronously by running DataReadCallBack method on different threads, in this way we neverblock the data structure layer and it remains responsive to other requests. DataReadCallBack method reads from the VLTF the proper segment and copies it to the buffer of the request.

Estimation Mechanism:

BuildBlockDBmethod saves statistic data about segments in a file:

averageBlockSize

averageLinesInBlock

LastAvailableLineNumber – last indexed line

If the requested line was not yet mapped to a segment then this statistic allows the Business Logic Layer to estimate the position of a line in the file.

This way the Business Logic provides an estimated segment.

8.2 AbstractController & LTFController

Class Diagram:

In order to separate between functionality and implementation we encapsulated all the functionality of the Business Logic Layer in an abstract class. AbstractController provides the functionality of the Business Logic Layer in the VLTF viewer. Our implementation to it is the LTFController class.

LTFController contains two cache buffers that hold two consecutive segments:

Current Segment
Next Segment

We chose to keep two buffers because sometimes the GUI will show text from two consecutive segments at once (text from end of the current segment and the beginning of next one).

Serving Read Requests:

The most important functionality of the AbstractConroller is to serve requests for chunks of text, in our case of the LTFController these requests are made by the GUI layer.

The method that satisfies those requests is: GoToLine.GoToLine takes as an input a line number and fills the GUI buffers with the proper text such that it will be able to present to user the text that contains the given line. This method implements our caching mechanism:

If the line is already in the Current buffer we load the GUI buffers with the proper text lines such that the given line is the first in the buffer.
If the line is in the Next buffer we set the Next segment as Current and bring the next segment to the Next buffer by asynchronous call to the Data Structure Layer. Only when the call returns we load the GUI buffers with the proper text lines such that the given line is the first in the buffer.
If the line is not present in the buffers we bring the needed segment and its consecutive segment by asynchronous call to the Data Structure Layer. Only when the call returns we load the GUI buffers with the proper text lines such that the given line is the first in the buffer.

Non Linear Scroll (Scroll Knob):

This functionality allows scrolling over the VLTF with adaptive speed by polynomial increase/decrease of the delta between the lines invoked byGoToLine.Speed adjustments are dependent on the distance of the knob from the center.