Recognizing Table Structure from the
Extracted Cells of Genealogical Microfilm
A Thesis Proposal Presented to the
Department of Computer Science
Brigham Young University
In Partial Fulfillment of the Requirements
for the Degree Master of Science
Kenneth Martin Tubbs Jr.
October 4, 2001
IIntroduction
Millions of people want geological information [FSO20]. A large amount of this information exists in microfilm tables. Yet, this wealth of information is nearly inaccessible because extracting and organizing it by hand is slow, error prone and tedious. Presently, it is only available by monotonous, individual explorations. Individuals that do not live near microfilm libraries must order microfilm in large rolls. This costs them a significant amount of both time and money. Imagine the ability to search thousands of microfilm documents quickly and from any computer. Algorithms are needed that can automatically extract the structure and content of these tabular documents.
The automatic recognition of table structure is difficult for three reasons. First, tables have many different layouts and styles [Lop00]. Even tables representing the same information can be arranged in various layouts of columns and rows. In addition, many documents repeat the same table several times. Second, tables factor data items in different ways. In many cases, factoring must be inferred from the layout and frequency of the data items in the table. Third, often microfilm tables lack information or are ambiguous. These situations require that automated data extraction algorithms make decisions based on what is most likely.
Present table recognition research focuses on modeling the geometry of tabular documents. The current techniques for identifying tabular structure include template matching [Hur97], regular expression creation [Lcc99], statistical modeling [Bru97] [Bru98] and grammar matching [Gre95]. These techniques have been applied to both printed text tables [Lcc99] and images containing tables [Hur97] [Has98]. While all these research efforts exploit the geometric properties of tables, they ignore the constraints of the data presented in the tables. An ontology can describe the relationships and constraints between data objects [Emb98]. Research is needed to explore the combination of geometric and ontological evidence to identify table structure.
Research is currently underway to identify and zone the cells that make up tables found in microfilm [Nie01]. Cells are the individual blocks that make up a table. These raw cells can be stored in an XML document representing the table on the microfilm [Xml]. The attributes stored for each cell in the XML file include the coordinates of each cell, whether or not each cell is empty, and the ASCII for each cell that contains printed text. This information can be processed to determine the structure of the table and the records it contains. Other research is exploring the ability to transmit pieces of microfilm images across the Internet at various resolutions [Ken01]. Once a table’s structure is known, queries about the microfilm table can be answered by transmitting only relevant parts of the original microfilm image.
IIThesis Statement
This thesis proposes research to develop an algorithmic process to identify automatically the record structure from the extracted cells of genealogical, microfilm tables using both geometric and ontological evidence.
IIIMethods
The algorithmic process is composed of three steps. First, collect evidence from the cells described in an input XML file using knowledge of table structure and a genealogical ontology. Second, apply correlation rules to update the collected evidence. Third, verify and store the resulting records.
The algorithmic process accepts raw data previously collected from zoned microfilm images. It receives this data as an XML input file that describes the coordinates of each table cell, the machine printed text in each cell, if any, and whether or not each table cell is empty. Table cells containing machine printed text are either row or column headers, called label cells. Nonempty table cells, without machine printed text, are values within rows or columns, called value cells. The set of all table cells, C, consists of the set of all label cells, L, the set of all value cells, V, and the set of all empty cells, E. We construct a genealogical ontology, O, to describe the relationships between genealogical data objects. The set O contains all the object sets in the ontology.
Collect Evidence
Let D = C O. An evidence matrix, M, is a two-dimensional matrix that relates F ⊆ Din the first dimension to S ⊆ D in the second dimension. The value at position Mij is a real number between one and zero that represents the confidence of a relationship between an element fi in F and an element sj in S. A confidence value of zero indicates that the relationship is inappropriate, while a confidence value of one indicates that the relationship is certain. We generate confidence values by measuring features about the table cells and object sets in D that support or refute the existence of the relationship expressed by M.
We discover the geometric structure of a table using evidence matrices to represent the relationships between its cells. For example, we model the relationship that every value cell is associated with one label cell using an evidence matrix M. In this case, M is a two-dimensional matrix with a set containing all of a table’s label cells, L, in one dimension, and a set containing all of a table’s value cells, V, in the second dimension. We populate the evidence matrix by extracting features about the table cells listed in an XML input file. For instance, to support the relationship that a value cell is associated with a label cell we can measure the distance between the centers of both cells.
For this research, we create five evidence matrices. The first matrix, LV, relates a table’s label cells, L, to its values cells, V. We generate LV to determine whether or not a value cell falls within the row or column headed by a label cell. The second matrix, LL, relates the label cells, L, in the table to other label cells, L. This matrix identifies label cells that are not associated with values but that aggregate other label cells. The third matrix, VV, relates value cells, V, to other value cells, V. We create VV to discover how likely value cells are in the same row or column with other values cells. The fourth matrix, LO, matches a table’s label cells, L, with the object sets in the genealogical ontology. The matrix LO allows the algorithm to identify the records contained in the table. The fifth matrix, FM, describes how confidently a label cell, li in L, factors the other label cells, L, in a table. This matrix is called the factoring matrix.
Apply Correlation Rules
A correlation rule is an expression for updating the values of one evidence matrix using the values of another evidence matrix. Since each matrix is created and populated independently, the algorithm uses the values from one matrix to support or refute the confidence values in another matrix.
The algorithm applies a set of correlation rules to update and improve the five previously populated evidence matrices. For example, value zones that are related with a VV confidence value near one are likely to occur in the same row or column. Value zones in the same row or column should likely be associated with the same label cell, described in LV. To favor this assumption, we can create a correlation rule that modifies the confidence values of LV to promote the association of similar value cell pairs in VV with the same label cell. The algorithm repeatedly applies the set of correlation rules until the number of iteration surpasses a maximum threshold or after the maximum change made to any matrix is smaller than an empirically defined threshold.
The algorithm updates the factoring matrix, FM, using a genealogical ontology, O, instead of using confidence values from other matrices. We construct the genealogical ontology to describe the relationships between objects such as Name, Age, and Location. An expected cardinality ratio, Oij, exists for each pair of objects oi and ojin O. For example, the cardinality ratio between the object set Name and Location could be equal to one third; meaning that if both Name and Location exist in a table, there should be approximately one Location for every three Names. The algorithm updates the factoring matrix, MF, using the following rule: FMij = FMij * [1 - | Oij – Ni/Nj| + C]. The term Ni indicates the number of value cells associated with the label cell, li in L. The algorithm calculates the value Ni for a label cell, li in L, by counting the value cells associated with li in matrix LV.This update rule compares the expected cardinality ratio for a pair of label cells, Oij with the observed cardinality ratio, Ni/Nj, and updates the respective value in MF.
Verify and Store
The algorithm produces a score representing the confidence of the extracted structure for a table. The algorithm extracts three types of structure. First, the algorithm associates value cells with label cells. Second, the algorithm matches the label cells to object sets in the genealogical ontology. Third, the algorithm identifies the record boundaries by observing the factoring hierarchy of the label cells. The algorithm calculates the overall confidence score for a table by taking the product of the confidences assigned to each of the three parts. In addition, the structure will be presented to a human user for verification.
Once verified, the algorithm uses the table’s structure and mapping to the ontology to create SQL insert statements that store the table’s records in a database. For each record field, the algorithm stores the coordinates of the cell that contains the handwritten data in the original microfilm table. Later, the handwritten values can be extracted and added to the database records.
Measurements
We use a set of five real-world, genealogical tables to create and refine the set of extracted features, the set correlation rules, and the cardinality values for the ontology. We measure the success of the research by observing the precision, recall, and accuracy of the algorithm on an additional 15 real-world, genealogical tables. We measure the precision by dividing the number of correctly populated database fields by the total number of database fields populated. We measure the recall by dividing the number of populated database fields divided by the number of value cells that should be extracted with respect to the ontology. We then calculate the accuracy using the formula: 2 / ((1 / precision) + (1 / recall)) [Bae99]. In addition, we compare the overall score generated by the algorithm to the calculated precision.
IV Contribution to Computer Science
This research makes two contributes to the field of computer science. First, it adds to the current techniques of table recognition by exploiting both constraints of an ontology and tabular geometry. Second, this research presents an original technique for combining extracted features using correlation rules.
V Delimitations of the Thesis
This research will not attempt to do the following:
- Study tables that exhibit forms other than rows or columns that are described by one label relating to zero or more values.
- Explore other domains that extend beyond genealogical microfilm tables.
- Investigate documents that extend to languages other than English.
- Work with tables that span multiple documents or work with multiple tables simultaneously.
VIThesis Outline
- Introduction and Related Work (3 pages)
- Input Resources (4 pages)
2.1 XML Input File
2.2 Ontology Input File
3. Methodology (20 pages)
3.1 Process Architecture
3.2 Feature Extraction
3.3 Correlation Rules
3.4 Verification
3.5 Record Storage
3.6 User-Interface
4. Experimental Results, and Analysis (10 pages)
5. Conclusions, Limitations, and Future Work (4 pages)
VIIThesis Schedule
A schedule of this thesis is as follows:
Literature Search and Reading January – August 2001
Chapters: 1- 6September – November 2001
Thesis Revision and Defense December 2001
VIIIBibliography
[Bae99] R. Baeza-Yates and Ribeiro-Neto B. Modern Information Retrieval. Addison-Wesley, 1999.
This book describes techniques for accessing information relative to a query. We measure the accuracy of our results using the accuracy formula presented in this book.
[Bru97]R. Brugger, A. Zramdini, and R. Ingold. Modeling documents for structure recognition using generalized n-grams. In the Proceedings of the International Conference on Document Analysis and Recognition, 1997.
This paper describes a statistical model for recognizing the logical structure of a document. Unlike this thesis, it describes a technique that only exploits the geometry of a table to recognize its structure.
[Bru98]R. Brugger, F. Bapst, and R. Ingold. A DTD Extension for Document Structure Recognition. University of Fribourg, 1998.
This paper describes unifying the present document models used for recognition of document structure with document type definitions. It indicates that a DTD is expressive enough to model document structure. This thesis outputs table structure in DTD format.
[Emb98] D.W. Embley, Object Database Development: Concepts and Principles, Addison-Wesley, Reading, Massachusetts, 1998.
This book describes concepts needed for creating and optimizing database schema and applications. The genealogical ontology used in this thesis is modeled after the object relationship model defined in this book.
[FSO20] Three Billion Hits on FamilySearch Web Site.
This website contains genealogical information. It shows that family history research is done by millions of people.
[Gre95]E. Green and M. Krishnamoorthy, "Model-Based Analysis of Printed Tables," Proc., IAPR 3rd Int'l Conf. on Document Analysis & Recognition, Montreal, Canada, pp. 214--217, August 14--16, 1995.
This paper describes using grammars to represent the lines that make up tables in images. This technique only uses the geometry of a table to recognize a table’s structure.
[Has98]Terrence B. Hass. “The Development of a Prototype Knowledge-Based Table-Processing System”. A Thesis presented to the Department of Computer Science, Provo, Utah, April 1998.
This thesis describes a system for automatically identifying the attribute type and value pairs given the text and geometry of a table. This paper uses Prolog rules learn facts about a table structure. Like this paper, this thesis uses rules to learn structural facts.
[Hur97]M. Hurst and S. Douglas. Layout Language: Preliminary Experiments in Assigning Logical Structure to Table Cells. In Proceedings of 5th Applied Natural Language Processing Conference, pages 217--220, Washington, D.C., 1997.
This paper describes using template to recognize structural patterns within tables. It scores how well a documents structure matches its templates. This thesis uses confidence levels to rate the strength geometric and ontological relationships.
[JAVA] The JAVA home page:
This web site gives information about Java development environment and downloads links. The tool created for this thesis is written in the java programming language.
[Ken01]D. Kennard and W. Barrett. Just-In-Time Browsing for Digital Microfilm. Workshop on Technology for Family History and Genealogical Research, 2001.
This paper describes a method for browsing images across the internet at various resolutions. This paper displays a method for displaying the results of query a microfilm using the table structure identified by this thesis.
[Lcc99] S.W. Liddle, D.M. Campbell, and C. Crawford, Automatically Extracting Structure and Data from Business Reports, In Proceedings of the 8th international conference on Information knowledge management, pp. 86-93, Kansas City, Missouri, November 1999.
This paper describes an algorithm to infer the regular structure of printed business reports and automatically create wrappers to extract relational data. This paper describes a purely geometric method for identifying table structure.
[Lop00]D. Lopresti and G. Nagy. A Tabular Survey of Automated Table Processing. Bell Labs, Lucent Technologies Inc., 2000.
This paper presents and defines the different layouts and types of tables that it exists. It validates the fact that table recognition is a difficult problem.
[Nag00]G. Nagy. Twenty Years of Document Image Analysis in PAMI. In the proceedings of IEEE Transactions on Pattern analysis and Machine Intelligence, Vol. 22, 2000.
[Nie01]H. Nielson and W. Barrett. Automatic Zoning of Digitized Documents.
Workshop on Technology for Family History and Genealogical Research, 2001.
This paper describes zoning tables in microfilm images. This research provides the input XML files containing table cells.
[Xml] The W3C Architecture Domain.
The web site describes the standard grammar of XML, DTDs and list current XML research. This web site demonstrates how the XML input files should be structured.
IX Artifacts
We implement the proposed algorithm as a tool/demo in the Java programming language. The application can be used to extract features, specify rules, display the results and create SQL statements.
X Signatures
This proposal, by Kenneth Martin Tubbs Jr., is accepted in its present form by the Department of Computer Science of Brigham Young University as satisfying the proposal requirement for the degree of Master of Science.
______
David W. Embley, Committee Chairman
______
William A. Barrett, Committee Member
______
Kent E. Seamons, Committee Member
______
David W. Embley, Graduate Coordinator