Document.txt

CPSC-131 Project 3:
Document Index
This project is different from the previous two, where you were given skeleton classes with member functions to complete. Its goal is for you to become familiar with the use of the binary search tree by using the Standard Library (SL) implementation of this container: the STL map. It requires you to use existing SL classes to solve a problem, not to fill in the members of those classes as previous projects have done. The project files provide one class, DocumentIndex, with member functions that you must complete.
Application
The project requires the design and implementation of an application program that creates a page index for a text document. The index must have one line for each word in the document, with a list of page numbers following the word in that line. Here is a sample:
Class 12, 34, 56
Warning
Be very careful about the format of the lines in your output file.
Your index file will be checked by a program that expects the format of the file's lines to conform to the example shown-a single word, a space, a set of numbers separated by a comma and a space.
A nonconforming format will cause a test to fail.
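The line format described above can be sketched as a small helper. This is only an illustration: FormatIndexLine is a hypothetical name that is not part of the provided skeleton, and it assumes the page numbers for a word are kept in a std::set<int>.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not in the provided files): builds one index line
// in the form "word p1, p2, p3".
std::string FormatIndexLine(const std::string& word, const std::set<int>& pages)
{
    std::ostringstream line;
    line << word;
    bool first = true;
    for (int page : pages) {
        // One space follows the word; a comma and a space separate
        // every later page number.
        line << (first ? " " : ", ") << page;
        first = false;
    }
    return line.str();
}
```

Because a std::set iterates in ascending order, the page numbers come out sorted with no extra work.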
Document pages are separated by two successive empty lines, created by pressing Return twice after the Return that ends a paragraph. Here is an illustration:
Last line of a page
First line of next page
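One possible way to detect the page separator is to count consecutive empty lines. The sketch below is an assumption about how this could be done, not the required design: PageTracker and PagesOfNonEmptyLines are hypothetical names, and a run of more than two empty lines is assumed to count as a single page break.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: tracks the current page while lines are fed to it.
struct PageTracker {
    int page = 1;      // pages are numbered starting at 1
    int emptyRun = 0;  // consecutive empty lines seen so far

    void Observe(const std::string& line)
    {
        if (line.empty()) {
            // The second successive empty line marks a page break.
            if (++emptyRun == 2) ++page;
        } else {
            emptyRun = 0;
        }
    }
};

// Returns the page number of every non-empty line in the text.
std::vector<int> PagesOfNonEmptyLines(const std::string& text)
{
    std::istringstream in(text);
    PageTracker tracker;
    std::vector<int> pages;
    std::string line;
    while (std::getline(in, line)) {
        tracker.Observe(line);
        if (!line.empty()) pages.push_back(tracker.page);
    }
    return pages;
}
```

A single empty line between paragraphs resets nothing but the counter, so it does not advance the page.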
Words are separated by spaces, tabs, and punctuation, which may be:
period (.)
comma (,)
colon (:)
semicolon (;)
question mark (?)
exclamation point (!)
The punctuation should not appear in the index.
Words can end with a possessive-an apostrophe followed by an s. The index must contain the word without the 's.
Words that contain digits or symbols other than those mentioned above must not appear in the index.
Words can begin with opening double or single quotation marks, or parentheses; they can end with the corresponding closing character. These marks may enclose a string of words, not just a single word. You do not have to check for matching pairs; you only have to strip them off like other punctuation. The marks should not appear in the index. Enclosed words, after the marks are stripped, are subject to the other constraints. For example, the word "Room-131" is not a legal word because of the hyphen and the digits.
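The word rules above might be applied to a single token roughly as follows. This is a sketch under stated assumptions: CleanWord is a hypothetical helper, the token is assumed to have already been split off at spaces or tabs, and an empty return value is used to mean "not a legal word".

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns the cleaned word, or "" if the token is
// not a legal word.
std::string CleanWord(std::string token)
{
    const std::string openers = "\"'(";        // opening quotes, parenthesis
    const std::string closers = ".,:;?!\"')";  // punctuation and closing marks

    // Strip opening marks from the front.
    std::size_t start = 0;
    while (start < token.size() &&
           openers.find(token[start]) != std::string::npos) {
        ++start;
    }
    token.erase(0, start);

    // Strip punctuation and closing marks from the back.
    while (!token.empty() &&
           closers.find(token.back()) != std::string::npos) {
        token.pop_back();
    }

    // Drop a possessive 's.
    if (token.size() >= 2 &&
        token.compare(token.size() - 2, 2, "'s") == 0) {
        token.erase(token.size() - 2);
    }

    // Anything left that is not a letter (digit, hyphen, other symbol)
    // makes the word illegal.
    for (unsigned char c : token) {
        if (!std::isalpha(c)) return "";
    }
    return token;
}
```

With this sketch, "Room-131" is rejected because of the hyphen and digits, while word's is reduced to word.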
If a word appears more than once on a page, the page number must appear only once in the index.
Words in the exception file must not appear in the index.
Words that appear more than ten times in the document must not appear in the index.
The words in the index must appear in sorted order and the page numbers for each word must appear in sorted order.
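The last four rules fit naturally onto a map whose data element holds a count and a set of pages. The following is a minimal sketch, not the required design: WordEntry, Occurrence, and BuildIndex are hypothetical names, and the input is assumed to be a list of (word, page) pairs already produced by the word reader.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical data element: how often a word occurs and on which pages.
struct WordEntry {
    int count = 0;
    std::set<int> pages;  // a set stores each page once, in sorted order
};

using Occurrence = std::pair<std::string, int>;  // (word, page number)

std::map<std::string, std::set<int>> BuildIndex(
    const std::vector<Occurrence>& occurrences,
    const std::set<std::string>& exceptions)
{
    std::map<std::string, WordEntry> tally;
    for (const Occurrence& occ : occurrences) {
        // Words in the exception list never enter the index.
        if (exceptions.count(occ.first) != 0) continue;
        WordEntry& entry = tally[occ.first];
        ++entry.count;
        entry.pages.insert(occ.second);
    }

    std::map<std::string, std::set<int>> index;
    for (const auto& item : tally) {
        // Words that appear more than ten times are left out.
        if (item.second.count <= 10) index[item.first] = item.second.pages;
    }
    return index;
}
```

The map keeps its keys in sorted order, so iterating over the result visits the words in the order the index needs. With the default string comparison, uppercase letters sort before lowercase ones.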
Classes and Functions
The document index program must have a DocumentFile class and a DocumentIndex class with the indicated functions:
DocumentFile
* Close: Close the document file after it's been used. This function's code will be provided to you.
* GetWord: Return the next legal word from the document file, stripping out the allowable punctuation and skipping over the illegal words (those with disallowed characters).
* GetPageNumber: Return the current page number.
* LoadExceptions: Load the word exception list from a file, given the file's name.
* Open: Open the document file, given its name. This function's code will be provided to you.
* Read: Read the next line of the document file; update the page number if appropriate.
DocumentIndex
* Create: Create the index from the document file.
* Write: Write the index to a file.
This list is a summary; the documentindex.h file contains a complete declaration of the classes and the documentindex.cpp file contains definitions of the functions-some skeletons for you to complete and some provided for you to use without having to change them.
Implementation
You are to use the Standard Library (SL) map class to hold index information. You may find it necessary to use other SL containers within the map's data elements.
Remember that the nodes of a tree, which is the structure that the SL map implements, are key/data pairs. The map's insert function takes a key and an associated data element, which can be anything, including a class or structure. The structure can have anything as a member, including other SL containers.
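As an illustration of a key/data pair whose data element is a structure containing another container, consider the sketch below. PageList and AddPage are hypothetical names chosen for this example; your own data element may need different members.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical data element: a structure with an SL container as a member.
struct PageList {
    std::set<int> pages;
};

// Records that a word appears on a page. Returns true if the word was new.
bool AddPage(std::map<std::string, PageList>& index,
             const std::string& word, int page)
{
    // insert leaves the map unchanged if the key already exists; either
    // way, result.first is an iterator to the node holding the key.
    auto result = index.insert(std::make_pair(word, PageList{}));
    result.first->second.pages.insert(page);
    return result.second;
}
```

Calling AddPage twice with the same word and page leaves a single page number in the set, since a set ignores duplicate insertions.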
Resources
These files will be posted on GitHub for you to download, use or modify, and submit:
* documentindex.h: contains declarations of the DocumentIndex class. You may add other declarations as needed for your implementation. Do not add definitions (code) to this file; definitions of functions belong in documentindex.cpp.
* documentindex.cpp: contains skeletons for the required member functions, either completely empty or partially filled in. You may add other functions, member and nonmember, as needed for your implementation.
* document.txt: contains a large multipage document to be used as input for the CreateIndex function.
* exceptions.txt: contains a list of words, one per file line, to be used as input for the LoadExceptions function.
* index.txt: contains a sample index file, created when the document.txt and exceptions.txt files are used as inputs.
* main.cpp: contains a set of test functions that will call DocumentIndex's functions. Do not modify this file. Do not submit this file as part of your project; a controlled version will be used to test your submission.
* pages.txt: contains a small document that is primarily a set of pages; it can be used to test the "next page" detection function.
* words.txt: contains a small document that is simply a collection of legal words, including some with legal punctuation and some with illegal characters such as numbers and symbols; it can be used to test the GetWord function.
All of the input files-exceptions.txt, document.txt, pages.txt, and words.txt-are samples; your final submission will be tested using different files.
A brief description of the relevant map functions-insert, find, erase-is provided in the map.pdf file, including how to insert a key/data pair.
You may find the SL set to be a container of interest. A brief description of its relevant functions-insert and size-is provided in the set.pdf file.
A more extensive description of the SL classes and their functions can be found at
ExpectedIndex.txt
A 1, 4
All 4
Application 1
Be 1
Class 1
Classes 3
Close 3
Create 3
CreateIndex 4
Do 4
Document 1
DocumentFile 3
DocumentIndex 1, 3, 4
Enclosed 3
First 2
For 3
Functions 3
GetPageNumber 3
GetWord 3, 4
GitHub 4
Here 1
If 3
Implementation 3
Index 1
It 1
Its 1
Last 1
Library 1, 3
Load 3
LoadExceptions 3, 4
Open 3
Project 1
Read 3
Remember 3
Resources 4
Return 1, 3
SL 1, 3, 4
STL 1
Standard 1, 3
The 1, 2, 3
These 3, 4
This 1, 3
Warning 1
Word 2
Words 2, 3
Write 3
You 3, 4
Your 1
about 1
above 2
add 4
after 1, 3
allowable 3
an 1, 2, 3
anything 3
apostrophe 2
appear 2, 3
appears 3
application 1
appropriate 3
are 1, 2, 3, 4
as 1, 3, 4
associated 3
at 4
because 3
become 1
been 3
begin 3
belong 4
binary 1
brief 4
by 1, 2
call 4
can 2, 3, 4
careful 1
cause 1
change 3
character 3
characters 4
check 3
checked 1
class 1, 3
classes 1, 3, 4
closing 3
code 3, 4
collection 4
colon 2
comma 1, 2
complete 1, 3
completely 4
conform 1
constraints 3
contain 2
container 1, 4
containers 3
contains 3, 4
controlled 4
corresponding 3
created 1, 4
creates 1
current 3
data 3
declaration 3
declarations 4
definitions 3, 4
description 4
design 1
detection 4
different 1, 4
digits 2, 3
disallowed 3
do 3
done 1
double 3
download 4
each 1, 3
either 4
element 3
elements 3
empty 1, 4
enclose 3
end 2, 3
ends 1
example 1, 3
exception 3
exclamation 2
existing 1
expects 1
extension 4
fail 1
familiar 1
files 1, 4
fill 1
filled 4
final 4
find 3, 4
followed 2
following 1
format 1
found 4
from 1, 3
function 3, 4
functions 1, 3, 4
given 1, 3
goal 1
have 1, 3
having 3
hold 3
how 4
hyphen 3
if 3
illegal 3, 4
illustration 1
implementation 1, 4
implements 3
including 3, 4
indicated 3
information 3
input 4
inputs 4
insert 3, 4
interest 4
is 1, 3, 4
it 3, 4
its 3, 4
just 3
key 3
large 4
legal 3, 4
like 3
line 1, 2, 3, 4
lines 1
list 1, 3, 4
map 1, 3, 4
mark 2
marks 3
matching 3
may 2, 3, 4
member 1, 3, 4
members 1
mentioned 2
modify 4
more 3, 4
multipage 4
must 1, 2, 3
name 3
necessary 3
needed 4
next 2, 3
nodes 3
nonconforming 1
nonmember 4
number 3
numbers 1, 3, 4
off 3
on 3, 4
once 3
one 1, 4
only 3
opening 3
or 2, 3, 4
order 3
other 2, 3, 4
out 3
output 1
over 3
page 1, 2, 3
pages 1, 4
pair 4
pairs 3
paragraph 1
parentheses 3
part 4
partially 4
per 4
period 2
point 2
posted 4
pressing 1
previous 1
primarily 4
problem 1
program 1, 3
project 1, 4
projects 1
provide 1
provided 3, 4
punctuation 2, 3, 4
question 2
quotation 3
relevant 4
required 4
requires 1
s 2
sample 1, 4
samples 4
search 1
semicolon 2
separated 1, 2
set 1, 4
should 2, 3
simply 4
single 1, 3
skeleton 1
skeletons 3, 4
skipping 3
small 4
solve 1
some 3, 4
sorted 3
space 1
spaces 2
string 3
strip 3
stripped 3
stripping 3
structure 3
subject 3
submission 4
submit 4
successive 1
such 4
summary 3
symbols 2, 4
tabs 2
takes 3
ten 3
test 1, 4
tested 4
text 1
th 4
than 3
their 4
them 3
they 3
this 1, 4
those 1, 2, 3
times 3
tree 1, 3
twice 1
two 1
update 3
use 1, 3, 4
used 3, 4
using 1, 4
version 4
very 1
were 1
when 4
where 1
which 2, 3
will 1, 3, 4
within 3
without 2, 3
words 3, 4
you 1, 3, 4
your 1, 4

ExcessiveAppearances.txt

first
second
second
excess
excess
excess
excess
excess
excess
excess
excess
excess
excess
excess

ExclusionTest.txt

some words to include and some to exclude or to ignore or to forget

ExclusionWords.txt

exclude
ignore
forget

GetWord.txt

a legal word and another period. comma, colon: semicolon;

(open parenthesis closed parenthesis)

"opening double quote closing double quote"

'opening single quote closing single quote'

Num9ber OK da-sh Good

PageNumber.txt

line 1-1

line 1-2

line 1-3

line 2-1

line 2-2

line 3-1

line 3-2

line 3-3

line 3-4