CS 303, Random Access File I/O and the Collections Framework
This document and the accompanying set of assignments cover two major topics: random access file I/O and the Java collections framework. The relationship between these two topics may not be immediately apparent. However, the topic of file I/O can be developed to the point where the contents of files are conceptually treated like the contents of a data collection, and the classes of the Java collections framework can be adapted for use as an index into a random access file. Presenting the two topics together makes it possible to develop a programming assignment which includes multiple interacting parts and illustrates the relationship between dynamic data structures in a program and file data in secondary storage.
Section 1, Using Random Access to Support Records and Fields
This section deals with the translation between objects and the concepts of structured files containing fixed-length records and fields. In Java, the RandomAccessFile class can be used to implement a structured file and access its contents. It is assumed that the student has already been exposed to simple exception handling, character-oriented file I/O, and the Java API. All details on the construction, methods, and use of the RandomAccessFile class, including the use of try, catch, and finally blocks to deal with I/O exceptions, can be found in the Java API.
The federal W4 form will be used as the explanatory example and the basis for the assignments. In the example file, a record consists of the fields shown below. The last four fields are data fields, while the first will be useful in implementing some of the techniques of file management. In general, a structured file may contain multiple records, and each record is of fixed length because it contains a set of fixed-length fields.
deletemarkfield: This is an integer field. For the time being, assume that a value of 0 in this field indicates an active record.
socsecnumfield: This is a text field which should be filled with 9 digits and no dashes.
filingstatusfield: This is a text field that is 8 characters long. Possible values are: single, married, divorced, or widowed.
allowancesfield: This is an integer field.
additionalamtfield: This is a double field.
The initial goal is to manage the writing and reading of such records to and from files. It is assumed that there is a W4 class with suitable instance variables, and that in programs W4 records are implemented and manipulated as objects, not as groups of simple String references or numeric variables. Serializability, a topic which will be covered in a later unit, makes it possible to save whole objects, but that technique is not suitable for illustrating the concepts in this unit. Although the data is encapsulated in W4 objects within a program, each instance variable of a W4 object should be written to and read from a file as an individual String or simple numeric value.
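For concreteness, here is a minimal sketch of what such a W4 class might look like. The field names follow the list above; the constructor signature and accessor methods are illustrative assumptions, and validation is left to the assignment.

// A minimal sketch of a W4 record class; the field names mirror the
// list above. Remaining getters and setters are omitted for brevity.
public class W4 {
    private int deletemarkfield;        // 0 = active record
    private String socsecnumfield;      // exactly 9 digits, no dashes
    private String filingstatusfield;   // padded or truncated to 8 characters
    private int allowancesfield;
    private double additionalamtfield;

    public W4(String ssn, String status, int allowances, double additionalAmt) {
        this.deletemarkfield = 0;
        this.socsecnumfield = ssn;
        this.filingstatusfield = status;
        this.allowancesfield = allowances;
        this.additionalamtfield = additionalAmt;
    }

    public int getDeletemarkfield() { return deletemarkfield; }
    public void setDeletemarkfield(int mark) { deletemarkfield = mark; }
    // getters for the four data fields are assumed below
}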
A Java random access file has a file pointer into it, and the programmer can reposition that pointer and write to or read from a particular location, or offset, in the file. This means that the programmer is not limited to writing sequentially, always appending at the end, or reading sequentially from the beginning to the end of the file. Because a record contains a given set of fields, and the fields are of fixed length, the records themselves are of fixed length. Record based I/O is supported by calculating an offset into a file as a multiple of the size of a record.
Records can be randomly accessed according to their sequential position in a file. The general term used for the position of a record in a file is relative record number, or RRN. The offset into the file then depends on the fixed length of all records and the RRN of a given record of interest. In such a file, there is always the option of appending at the end. It is also possible to overwrite the data for the record at a given offset. Reading can be done starting at the beginning of the file, but it is also possible to read starting at the offset determined by a given RRN. From the user point of view, the granularity of access is the record. That is, complete records are read or written, not the individual field values of the records.
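As an illustration, the following sketch shows one way such record-level I/O might be written with RandomAccessFile. It assumes the W4 class sketched earlier, with the usual getters and setters, and that text fields are written with writeChars (two bytes per character), so a record occupies 4 + 18 + 16 + 4 + 8 = 50 bytes. The class and helper names are hypothetical, not part of the assignment specification.

import java.io.IOException;
import java.io.RandomAccessFile;

public class W4FileIO {
    // Record layout: int(4) + 9 chars(18) + 8 chars(16) + int(4) + double(8)
    static final int RECORD_SIZE = 50;

    // Pad or truncate a string to a fixed number of characters.
    static String fixLength(String s, int len) {
        StringBuilder sb = new StringBuilder(s);
        sb.setLength(len);                 // pads with '\u0000' or truncates
        return sb.toString();
    }

    // Write a record into the slot given by its relative record number.
    static void writeRecord(RandomAccessFile file, long rrn, W4 rec)
            throws IOException {
        file.seek(rrn * RECORD_SIZE);
        file.writeInt(rec.getDeletemarkfield());
        file.writeChars(fixLength(rec.getSocsecnumfield(), 9));
        file.writeChars(fixLength(rec.getFilingstatusfield(), 8));
        file.writeInt(rec.getAllowancesfield());
        file.writeDouble(rec.getAdditionalamtfield());
    }

    // Read the record in the slot given by its relative record number.
    static W4 readRecord(RandomAccessFile file, long rrn) throws IOException {
        file.seek(rrn * RECORD_SIZE);
        int deleteMark = file.readInt();
        StringBuilder ssn = new StringBuilder();
        for (int i = 0; i < 9; i++) ssn.append(file.readChar());
        StringBuilder status = new StringBuilder();
        for (int i = 0; i < 8; i++) status.append(file.readChar());
        int allowances = file.readInt();
        double additional = file.readDouble();
        W4 rec = new W4(ssn.toString(), status.toString().trim(),
                        allowances, additional);
        rec.setDeletemarkfield(deleteMark);
        return rec;
    }
}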
A file may have system fields, like the deletemarkfield, and a program may manipulate such a field individually in order to support a simple file management scheme. If a record is deleted, the deletemarkfield is given a value, say 1, that signifies deletion. If that deletion value is present, the location is available for overwriting. When a new record is added, the file is searched sequentially for the first deleted record, and the new one is put there. If there is no deleted record, the new record is appended to the end of the file.
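A sketch of this sequential scheme, written as a method that could be added to the hypothetical W4FileIO class above, might look like the following. It assumes records start at RRN 0 and that a deletemarkfield value of 1 marks a deleted record.

// Add a record by reusing the first deleted slot, if any; otherwise append.
static void addSequential(RandomAccessFile file, W4 rec) throws IOException {
    long recordCount = file.length() / RECORD_SIZE;
    for (long rrn = 0; rrn < recordCount; rrn++) {
        file.seek(rrn * RECORD_SIZE);
        if (file.readInt() == 1) {          // 1 marks a deleted record
            writeRecord(file, rrn, rec);    // reuse the deleted slot
            return;
        }
    }
    writeRecord(file, recordCount, rec);    // no deleted slot: append at end
}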
Section 2, Recycling File Space Using Stack Logic
Section 1 described a simple file management scheme which involved a sequential search of the file’s contents. The space in the file can also be managed in a way that does not require a search for the first deleted record. The deletemarkfield can be used to organize the records in the file as nodes in a stack. You may have implemented an in-memory stack using linked structures in a programming course you have taken. The use of the stack for file management is different. There is no in-memory data structure. The stack is conceptual and consists of the relationship between deletemarkfield values in the file in secondary storage.
Here is an outline of a scheme for managing file space as a stack using the deletemarkfield:
1. When a file is created, put a header record at offset 0. As long as the file is in existence, this header record should never be deleted. The header record will contain no values except in the deletemarkfield. The initial value will be -1, signifying that there is no empty space in the body of the file resulting from records’ having been deleted. How the value in the deletemarkfield changes will be explained below.
2. Suppose you want to add a record when the header record contains a -1, meaning there are no empty slots in the file. It will be necessary to simply append the data record at the end of the file. The data records for the file begin at offset 1, following the header. The first will be at offset 1, the second at offset 2, and so on. The deletemarkfield takes the value of 0 for all new records entered in this way. 0 is the value which represents a current record which has not been deleted.
3. When you delete a record, put the value found in the deletemarkfield of the header record into the deletemarkfield of the deleted record, and put the RRN or offset of the deleted record into the deletemarkfield of the header record. This builds a stack of deleted record slots in the body of the file. The header record points to the most recently deleted record, while the deletemarkfield of that record points to the next most recently deleted record. The record containing a deletemarkfield value of -1 is the least recently deleted record.
4. Now suppose you want to add a record when the header record contains something besides a -1. The value in the deletemarkfield of the header record is the offset of the empty record slot at the top of the stack. Put the new record there, first saving the deletemarkfield value found at that offset before it is overwritten, and then replace the value in the deletemarkfield of the header record with that saved value. Keep in mind that the new record is valid, so its deletemarkfield should contain 0.
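The following methods sketch this stack-based scheme as hypothetical additions to the W4FileIO class from Section 1. They assume the header record was written at RRN 0 when the file was created, and that the W4 class exposes the setter used below.

// The header record at RRN 0 holds, in its deletemarkfield, the RRN of the
// most recently deleted slot, or -1 if there are no deleted slots.

static int readDeleteMark(RandomAccessFile file, long rrn) throws IOException {
    file.seek(rrn * RECORD_SIZE);
    return file.readInt();
}

static void writeDeleteMark(RandomAccessFile file, long rrn, int mark)
        throws IOException {
    file.seek(rrn * RECORD_SIZE);
    file.writeInt(mark);
}

// Add a record: pop a slot from the stack of deleted slots, or append.
static long addRecord(RandomAccessFile file, W4 rec) throws IOException {
    int top = readDeleteMark(file, 0);            // header's deletemarkfield
    long rrn;
    if (top == -1) {
        rrn = file.length() / RECORD_SIZE;        // no free slot: append
    } else {
        rrn = top;
        // Pop: the freed slot's mark becomes the new top of the stack.
        writeDeleteMark(file, 0, readDeleteMark(file, rrn));
    }
    rec.setDeletemarkfield(0);                    // new records are active
    writeRecord(file, rrn, rec);
    return rrn;
}

// Delete a record: push its slot onto the stack of deleted slots.
static void deleteRecord(RandomAccessFile file, long rrn) throws IOException {
    writeDeleteMark(file, rrn, readDeleteMark(file, 0)); // old top into the slot
    writeDeleteMark(file, 0, (int) rrn);                 // header points to slot
}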
A sequence of insertions and deletions is given below, followed by the contents of the file if the scheme outlined here is used. Recall that the file’s records consist of these fields: deletemarkfield, socsecnumfield, filingstatusfield, allowancesfield, additionalamtfield. Note that this is still based on access by location, not access by value.
1. First insertion: (0, 111111111, single, 1, 1.0)
2. Second insertion: (0, 222222222, single, 2, 2.0)
3. Third insertion: (0, 333333333, single, 3, 3.0)
4. Fourth insertion: (0, 444444444, single, 4, 4.0)
5. Delete RRN 4.
6. Delete RRN 2.
7. Fifth insertion: (0, 555555555, single, 5, 5.0)
8. Sixth insertion: (0, 666666666, single, 6, 6.0)
9. Delete RRN 1.
10. Delete RRN 3.
These should be the contents of the file after the sequence given above:
Offset 0, header record, deletemarkfield = 3.
Offset 1, inactive, deletemarkfield = -1, bottom of stack of empty spots;
Contains (old) data for 111111111.
Offset 2, active, deletemarkfield = 0;
Contains data for 555555555.
Offset 3, inactive, deletemarkfield = 1;
Contains (old) data for 333333333.
Offset 4, active, deletemarkfield = 0;
Contains data for 666666666.
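As a check, a hypothetical driver such as the following, built on the sketches above and placed in the W4FileIO class, would run the same sequence of insertions and deletions and leave the file in the state just listed. The file name is arbitrary.

// Runs the example sequence; afterward the header's deletemarkfield is 3.
public static void main(String[] args) throws IOException {
    RandomAccessFile file = new RandomAccessFile("w4data.dat", "rw");

    // Step 1 of the scheme: create the header record with deletemarkfield -1.
    W4 header = new W4("", "", 0, 0.0);
    header.setDeletemarkfield(-1);
    writeRecord(file, 0, header);

    addRecord(file, new W4("111111111", "single", 1, 1.0));   // goes to RRN 1
    addRecord(file, new W4("222222222", "single", 2, 2.0));   // RRN 2
    addRecord(file, new W4("333333333", "single", 3, 3.0));   // RRN 3
    addRecord(file, new W4("444444444", "single", 4, 4.0));   // RRN 4
    deleteRecord(file, 4);                                    // header -> 4
    deleteRecord(file, 2);                                    // header -> 2
    addRecord(file, new W4("555555555", "single", 5, 5.0));   // reuses RRN 2
    addRecord(file, new W4("666666666", "single", 6, 6.0));   // reuses RRN 4
    deleteRecord(file, 1);                                    // header -> 1
    deleteRecord(file, 3);                                    // header -> 3
    file.close();
}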
Section 3, Value Based Access and Indexes
The previous sections discussed access to records in files according to their locations. A more advanced and more useful capability is accessing records based on the values they contain. Value based access raises a question of granularity. Is it possible to locate records on the basis of the value of a single data field, or does access depend on the contents of the whole record? Likewise, is it possible to write just the value of a single data field into a record, or is it necessary to write the whole record? This section will be concerned with a straightforward approach: records will be located on the basis of the value of a single field, known as the key, and writing complete records will be supported, but not the writing of individual data fields.
Like location based access, value based access can be supported by searching the contents of the file sequentially until either the value is found or the end of the file is reached. In Section 2, the idea of a stack was introduced to eliminate sequential search for location based access to records. Another idea, an index, implemented as a separate data structure, can be used to implement value based access to a file without sequential search. In simplest terms, an index can be thought of as a lookup table containing the sorted values of a given field along with their corresponding RRNs in the file.
A simple index is maintained on a single key data field in the file, where every record has a unique value for that field. It is possible to index on multiple fields or on non-unique fields, but those topics will not be covered here. Let the socsecnumfield be the key field under consideration. Social Security numbers should be unique, so any given value can occur only once in the file and once in the index. As records are entered into and deleted from a file, they have to be entered into and deleted from the index. When records are to be accessed, the field value is found in the index, and the corresponding RRN is used to find the record in the file. When implementing an index it is important to include logic which prevents duplicate entries for a field which should contain a unique value. Since the file is only accessed through the index, this logic will also prevent duplicate entries in the file.
The index has been described as a lookup table. Without additional algorithms, that would imply that looking for a record still involves a sequential search, this time through the index. However, an index can be much smaller than a data file and can be maintained as an in-memory data structure, so searching it will be much quicker than searching the file in secondary storage. Since the index is maintained in sorted order, it is also possible to implement advanced searching techniques, like binary search, in order to improve the performance of indexed access.
It may be desirable for an index to be persistent, that is, available from one program run to the next, but that is not a complexity under consideration here. Although conceptually it can be treated as a lookup table, it would be counter-productive to implement the index as a file itself in secondary storage. Instead, it may be implemented as a dynamic data structure using one of the classes provided by Java. Before looking at the indexing assignment, read the following sections of this unit, where the relevant group of classes is discussed.
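As one possible illustration, the sketch below uses TreeMap, one of the classes discussed in the next section, to hold an in-memory index keyed on socsecnumfield, mapping key values to RRNs. It builds on the hypothetical W4 and W4FileIO sketches from the earlier sections and refuses to insert duplicate keys; the class and method names are assumptions, not a required design.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.TreeMap;

// A minimal in-memory index on socsecnumfield. TreeMap keeps keys in
// sorted order, so lookups do not require a sequential search of the file.
public class W4Index {
    private final TreeMap<String, Long> index = new TreeMap<String, Long>();

    // Add a record through the index; reject duplicate keys so that the
    // key field stays unique in both the index and the file.
    public boolean add(RandomAccessFile file, W4 rec) throws IOException {
        String key = rec.getSocsecnumfield();
        if (index.containsKey(key)) {
            return false;                       // duplicate key: refuse insert
        }
        long rrn = W4FileIO.addRecord(file, rec);
        index.put(key, rrn);
        return true;
    }

    // Look up a record by its key value; returns null if the key is absent.
    public W4 find(RandomAccessFile file, String ssn) throws IOException {
        Long rrn = index.get(ssn);
        return (rrn == null) ? null : W4FileIO.readRecord(file, rrn);
    }

    // Delete a record by its key value.
    public boolean delete(RandomAccessFile file, String ssn) throws IOException {
        Long rrn = index.remove(ssn);
        if (rrn == null) {
            return false;                       // no such key
        }
        W4FileIO.deleteRecord(file, rrn);
        return true;
    }
}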
Section 4, The Java Collections Framework
The Java collections framework provides a consistent set of standards for various kinds of classes whose instances serve as containers for objects. The framework consists of hierarchies of interfaces for collections, along with hierarchies of classes which implement these interfaces. Using the provided framework lessens the conceptual work of figuring out the relationships between different types of collections, and using the implementations saves programming and may lead to more efficient programs. The Java API developers created implementations which are general in nature and which have well-understood performance characteristics. In the Java API, the Collections class documentation contains a link to the Java Collections Framework page, which in turn contains a link to an Overview. On the main Java documentation page maintained by Sun, there is a link to a more extensive tutorial on the collections framework. The overview and tutorial are both important sources of information and are drawn on freely in this section.
The tutorial gives two hierarchies of interfaces as the basis for the overall collections framework. The first of these hierarchies descends from the Collection interface; its principal subinterfaces are Set, with SortedSet below it, List, and, in later versions of the framework, Queue.
The second hierarchy that the collections framework depends on descends from the Map interface; its principal subinterface is SortedMap.