A Context-Based Approach Towards
Content Processing of Electronic Documents

Karin Haenelt

Fraunhofer Gesellschaft e.V. – FhG
Dolivostraße 15
D 64293 Darmstadt, Germany

Abstract

This paper introduces a text-theoretically founded view on content processing of electronic documents. A central aspect is the representation of the contextual embedding of texts. It provides a basis for modelling the mechanisms of the dynamic development of information and access perspectives during the process of information communication, and for the management of vague and incomplete information. The paper first introduces a basic concept of text production and understanding (section 2). On this basis it develops a text model with a four-layered text representation and text-external context bindings (section 3). It then describes the components of a text analysis process from robust parsing to deep semantic analysis and explains the establishment of conceptual and thematic access perspectives (sections 4 and 5). An outlook sketches an application scenario in which the representation described is used for text and information retrieval and machine translation (section 6).

1 Introduction

Most of our information sources and publications contain essential parts in the form of natural language texts. During the process of publication this information is used by authors and transformed into new documents (e.g., new texts, abstracts, translations). Basically, it is the content of texts which is accessed, not just the surface structure. In order to electronically support applications which are essentially devoted to textual content (e.g., information retrieval, machine translation, hypertext links), natural language components have to provide immediate access to the contents of the various information objects.

Natural language texts are a very flexible means of information handling. They allow for the constitution of information as well as for its communication, and for the handling of heterogeneous and incomplete information as well as for the development of information over time. Successful future information systems will above all have to offer this flexibility of information handling which natural language provides.

The current state of processing of natural language texts is characterized on the one hand by different procedures and methods for individual applications, and on the other hand by results which still do not satisfy users, and which, due to rising expectations, will do so less and less. This has been shown by practical experience and several evaluations. Two examples may serve as illustrations:

In the area of full-text retrieval the figures quoted again and again for some years read as follows: "No more than 40% precision for 20% recall" (Sparck Jones, 1987). In other words: 60% of the results are wrong, and 80% of the information available in the system is not found. More recent figures are: "60% precision for 40% recall or 55% precision for 45% recall" (Will, 1993) (similarly (Harman, 1996), (Voorhees and Harman, 1997)). Although the meaning of such figures is debatable with respect to their application relevance and their methodological basis (cf. (Kowalski, 1997)), the general tendency has been confirmed by users and developers repeatedly. Croft wrote: "We are still doing pretty badly even with the best technique that we have found" (1988), and: "The most interesting thing about text, and the central problem for designers of information retrieval systems, is that the semantics of text is not well represented by surface features such as individual words." and "The number of retrieval errors could be reduced if information retrieval systems used better representations of the content of text." (Croft, 1993).
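
The way these percentages translate into the statements above can be made explicit with a minimal sketch of the standard precision and recall definitions; the collection and the document counts used here are purely hypothetical:

    # Minimal sketch of the precision/recall definitions behind the figures
    # quoted above; the document sets and counts are hypothetical.

    def precision(retrieved, relevant):
        # fraction of the retrieved documents that are actually relevant
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        # fraction of the relevant documents that were actually retrieved
        return len(retrieved & relevant) / len(relevant)

    # Hypothetical collection: 100 relevant documents; the system retrieves
    # 20 of them together with 30 irrelevant ones.
    relevant  = {f"rel{i}" for i in range(100)}
    retrieved = {f"rel{i}" for i in range(20)} | {f"irr{i}" for i in range(30)}

    print(precision(retrieved, relevant))  # 0.4 -> 60% of the results are wrong
    print(recall(retrieved, relevant))     # 0.2 -> 80% of the relevant documents are missed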

In the area of machine translation the situation is similar. The Japanese JEIDA report (Nagao, 1989: 14) describes the result of an evaluation of machine translations as follows: "Some translations were done well. Others, however, were not translated or were translated incorrectly. In some cases, only fragments of sentences were translated and they are directly put into a sequence disregarding linguistic relationships among them."

One major impediment to more sophisticated textual information and document handling is common to many kinds of electronic processing: the objects that really should be handled are interpreted natural language texts, that is, both the text and the knowledge communicated by those texts, rather than uninterpreted character strings. The mechanisms of text constitution or textual communication of knowledge, however, are still poorly understood. Current approaches towards content handling employ statistical methods or pre-coded knowledge bases. Lexical-statistical approaches assume that the choice of vocabulary in a text is a function of subject matter. The results quoted above, however, suggest that this assumption needs refinement. Knowledge bases are utilized for two tasks: concept identification, i.e. determining the concepts corresponding to explicitly introduced information, and bridging inferences, i.e. closing gaps between explicitly introduced concepts in order to construct a cohesive representation. The problems with these approaches have been recognized as being twofold: the descriptions provided in a knowledge base are prepared intellectually, and they are modelled under those aspects which are foreseen on the basis of a particular state of the art and for a particular task (even if a generalization is aimed at). Firstly, this procedure is very costly, and secondly, experience shows that matching texts against these schemata works satisfactorily for small texts in restricted domains, but is less successful if the texts to be processed communicate new or newly organized knowledge. In this case either the concepts available, the granularity of their description or the contexts they appear in do not provide the information which is actually needed. The situation becomes even worse if concept descriptions are accessed and used without consideration of any context (which is typically the case with the application of thesauri).

A prerequisite for managing mass data with improved application results is a better understanding of the natural language mechanisms of information constitution and development. The conception of the KONTEXT model which will be presented in this article has been motivated by the goal of exploring the means natural language provides for constituting, organizing and flexibly communicating information. The model views texts in their context with other texts rather than as isolated units, because this approach provides a basis for explaining mechanisms of the development of perspectives on information. The article focusses on the representation and its use for information processing. A corresponding text analysis prototype is currently under development. Although it is not yet possible to provide a detailed specification of completed research on this process, some of the design considerations and insights gained from prototype development and application will be included in this article.

2 Basic Assumption: Text Production and Text Understanding are Intentional Processes

In many approaches, assumptions about the understanding process have not been made explicit, and it has more or less been taken for granted that the task of a computer is to generate a "correct" and "objective" text representation. Much research work has been devoted to identifying the input resources needed (rule systems, dictionaries, knowledge bases, inferences) for constructing such a representation. Although observations have been reported which do not agree with this assumption, no serious consequences have been drawn with respect to system design - at least as far as conceptual systems are concerned (in statistical approaches changes in a corpus do have effects on the processing result). Kintsch and van Dijk, for example, state: "It is not necessary that the text be conventionally structured. Indeed the special purpose [the reader's goals, K.H.] overrides whatever text structure there is." (Kintsch and van Dijk, 1978: 373). Hellwig (1984) writes that, as a consequence of the hermeneutic character of text descriptions, a certain freedom in text interpretation must be taken into account. Grosz and Sidner (1986: 182) report on differing text segmentations by different readers, and Passoneau and Litman (1997: 108) write: "we do not assume that there are 'correct' segmentations." Similarly, Corriveau (1991 and 1995), in the description of his text analysis system IDIoT, states: "there is no correct interpretation, but rather an interpretation that is reached given a certain private knowledge base and a set of time-related memory parameters that characterize the 'frame of mind' (Gardner, 1983) or 'horizon' (Gadamer, 1976) of a particular individual." (Corriveau, 1991: 9). His consequence is a system design in which "all memory processes are taken to be strictly quantitative, i.e., mechanical and deprived of any linguistic and semantic knowledge" and "all 'knowledge', that is, all qualitative information, manipulated by the proposed comprehension tool is assumed to be strictly user-specifiable" (Corriveau, 1991: 8). Whilst this approach leads to a consistent distinction between data and algorithms, it still uses hierarchically structured domain knowledge bases.

The problem with assuming an "objective" result of a text analysis process and relying on well-structured background knowledge bases is twofold: To begin with, these assumptions set a goal which obviously cannot be reached for theoretical reasons. Moreover, they block the way towards the exploration of the mechanisms of the dynamic development of information and access perspectives during the process of information communication, and towards the management of vague and incomplete information. It is the search for the reasons why texts can be interpreted in different ways - depending on background information and communication goals - which leads to the basic premises of these mechanisms.

A basic assumption of the KONTEXT model is that text production and text understanding are intentional processes with varying results depending on background information and communicative goals. Further assumptions are:

(1) A distinction is made between knowledge and information: Knowledge is understood as unintentional, i.e. as independent of integration into particular tasks and contexts (for a similar definition cf. (Searle, 1980), (Thom, 1990), (Rich and Knight, 1991)). Knowledge which has been manifested (e.g., in natural language texts) for a particular purpose is called information (following a definition by Franck (1990)).

(2) It is assumed that informative texts are manifestations of access to knowledge. They do not, however, present knowledge as a whole. They rather access and fix knowledge in a particular way which serves a particular purpose in a particular communication situation. The information presented in a text is the information which is supposed to be relevant with respect to the communicative goal of the text. It would not serve a communication purpose to communicate all knowledge equivalently and in equal detail (similarly (Lang, 1977: 81/82)).

(3) Each text organizes knowledge in its own way. Besides the communication of knowledge which is supposed to be new to the communication partners, it may be a particular organization of already known facts which creates relations that suit a further communication situation better and shed a new light on previous knowledge.

(4) Information provides a particular view on knowledge and is contextually bound in two ways: Firstly, the information presented in a text highlights pieces of knowledge rather than providing a clearly delimited segment of it. The information selected for textual presentation is not necessarily self-contained; it may rather be contextually bound to further knowledge outside the actual fixing. Secondly, the knowledge fixed for a text is text-internally bound into the organization of the actual fixing.

Based on these observations, textual communication of knowledge can be explained as follows: texts are construction instructions for information (similarly (Kallmeyer, Klein, Meyer-Hermann, Netzer and Siebert, 1986: 44)). Information is not just delivered as a whole to a partner. Instead, understanding is an active process. The reader has to construct information in accordance with the same principles which an author has used to fix knowledge. The author of a text has found a pragmatic solution that leads to a specific goal by a chain of operations on his or her own knowledge, and it is this chain of operations that is imparted to the reader. The author guides the process of understanding by drawing attention to those details which are suitable for the construction of new views and relations. The guidance includes instructions as to which parts of knowledge or previously constituted information are to be accessed, how these parts are to be connected, how parts of the constructions are to be changed, from which perspective the constructions are to be viewed, where the construction shall be continued, etc. In this process the individual expressions have different functions. They are used to refer to areas of knowledge or information, or to constitute contexts and structures which determine access and construction operations. Nouns, for instance, are used for accessing or introducing objects ("Opera House"), verbs are used for accessing or constituting states of affairs ("build") and for establishing relations between objects ('build(Utzon, Opera House, in(Sydney))'), anaphoric pronouns ("their shell roofs", "his personal style") or definite articles ("the interiors") are used for redirecting the reader to previously established information structures, active and passive voice are used for establishing a perspective, etc. The sequentially arranged expressions of a text function as operators which establish constructs like concepts, references to instances, contexts and thematic structures. These constructs in turn determine the access to knowledge and the composition (including modification) of text-specific information.
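
The operator view just described can be illustrated with a schematic sketch in which noun-like expressions introduce objects, verb-like expressions constitute relations, and anaphoric expressions redirect to established referents; the class and method names are illustrative assumptions, not the KONTEXT implementation:

    # Schematic sketch of the "expressions as operators" view described above;
    # names and data structures are illustrative assumptions only.

    class DiscourseState:
        """Information constructed so far: referents and the relations between them."""

        def __init__(self):
            self.referents = {}   # e.g. {"opera_house": {"type": "Opera House"}}
            self.relations = []   # functor-argument structures, e.g. ('build', (...))

        def introduce(self, ref_id, concept):
            # noun-like operator: introduce or access an object
            self.referents.setdefault(ref_id, {"type": concept})
            return ref_id

        def relate(self, functor, *args):
            # verb-like operator: constitute a state of affairs between objects
            self.relations.append((functor, args))

        def resolve(self, description):
            # anaphor/definite-article-like operator: redirect to an established referent
            for ref_id, props in self.referents.items():
                if props["type"] == description:
                    return ref_id
            raise LookupError(f"no antecedent for {description!r}")

    state = DiscourseState()
    utzon = state.introduce("utzon", "Utzon")
    house = state.introduce("opera_house", "Opera House")
    state.relate("build", utzon, house, ("in", "Sydney"))
    # a definite description like "the interiors" would be resolved against
    # previously established referents, here simplified to a type lookup:
    same_house = state.resolve("Opera House")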

As can be observed, a text understanding process can lead, with different readers, to results as different as no understanding at all, partial understanding, misunderstanding, good understanding and new perspectives on previous knowledge. These differences can be explained by the assumption that each reader tries to interpret the newly communicated information on the basis of his or her own background knowledge in such a way that it is internally connected and contextually bound to that background knowledge. The connectedness of a view is not necessarily completely provided by the text itself. As has been mentioned, a text focusses on the information which is supposed to be relevant with respect to its communicative goal, and presents this information to the extent to which it is supposed to be new. Further knowledge is not fixed. Contextual binding of the view presented may, however, be required for connecting the information units of the view. These connections must be provided by each reader's own background knowledge or further accessible information (e.g., reference books). Usually, neither the knowledge area to be involved nor its extent is specified (exceptions are explicit references to background information sources in scientific publications, reference books, legal texts, and others). Obviously, communication succeeds on the basis of a certain breadth and depth of variation and vagueness.

3 The KONTEXT Model: Components

On the basis of the assumptions described, the following components are distinguished in a formal text model which describes the textual communication of information (cf. figure 1):

(1) a text representation which describes the information conveyed in a text and the information describing its contextual organization. This information is structured into four layers (syntactic structure, thematic structure, referential structure, conceptual structure). Two of them (the conceptual and the referential structure) represent the facts which have been acquired from texts; the others represent the text (and fact) structure.

(2) a set of text representations which serves as background information. Each text representation is linked to those representation(s) which provide bridging information for the constitution of a connected view in cases where the bridges have been left implicit in the text under consideration. The link structure between individual text representations describes text-external context bindings. Each text representation may also provide a new view on background information and thus describe the development of information. In this way the link structure also describes the development of access structures. (A schematic sketch of both components is given below.)
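
The following data-structure sketch summarizes the two components; the field names are illustrative assumptions and are not taken from the KONTEXT implementation:

    # Schematic sketch of the two components listed above; field names are
    # illustrative assumptions, not the KONTEXT implementation.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TextRepresentation:
        # component (1): the four layers of a single text representation
        syntactic_structure: list = field(default_factory=list)    # bound to the text sequence
        thematic_structure: list = field(default_factory=list)     # bound to the text sequence
        referential_structure: dict = field(default_factory=dict)  # facts acquired from the text
        conceptual_structure: dict = field(default_factory=dict)   # facts acquired from the text
        # component (2): text-external context bindings, i.e. links to other
        # representations that supply bridging information left implicit here
        context_bindings: List["TextRepresentation"] = field(default_factory=list)

    def bind(text: "TextRepresentation", background: "TextRepresentation") -> None:
        # link a text representation to a background representation
        text.context_bindings.append(background)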

Figure 2 shows an overview of the layers of the text representation. Proposals for structuring a linguistic text description into layers have been made before, and the information contained in the individual layers has been described in more detail in numerous other approaches.
An explicit distinction of layers of text structure has been proposed in the area of text linguistics by Danes (1971 and 1974). He already distinguishes a "semantic" and a "thematic" structure of the "Kommunikat" and suggests extending the structure by a layer of "(co-)reference structure" (Danes, 1974). Kintsch and van Dijk (1978) distinguish a microstructure, a macrostructure, schemata (also called "superstructure" or "hyperstructure" (van Dijk, 1980)) and coherence graphs. Grosz and Sidner (1986) present a discourse model with three components: "the structure of the sequence of utterances (called linguistic structure), a structure of purposes (called intentional structure), and the state of focus of attention (called the attentional state)". In the area of lexical semantics, Semantic Emphasis Theory (Kunze, 1993 and 1991) distinguishes conceptual descriptions (basic semantic forms), perspectives on these descriptions (semantic emphasis) and a referential description (structured sets of representatives of objects, situations, places and times). A further basis for structuring information has been provided by knowledge representation languages. In KL-ONE (Brachman and Schmolze, 1985), for instance, concepts, nexus (representatives of a world), and contexts have been used.

The KONTEXT model proposes an ordering of the layers under the aspects of textual communication and of content-related abstraction. Under the aspect of textual communication, the lower layers are independent of the sequence of an actual text, while the thematic and the syntactic layer also include information on the sequential unfolding. Under content-related aspects, each upper layer drops details of the lower layers and represents specific connections. In addition, intertextual links provide a view on information structures with respect to their role in the process of knowledge communication and their interplay with a dynamically conceived body of background information.

The conceptual structure represents the conceptual fixing of knowledge in terms of natural language lexical units and their syntagmatic relations. The representation units are individual descriptions of states of affairs in terms of functor-argument structures (e.g. 'build(Utzon, house)'). A set of individual descriptions of the same object constitutes the concept structure of this object. The notion of a concept used here is based on definitions by Quillian (1967) and Kintsch (1988). Quillian defines: "A word's full concept is defined in the model memory to be all the nodes that can be reached by an exhaustive tracing process, originating at its original, patriarchical type node, together with the total sum of relationships among these nodes specified by within-plane, token-to-token units." (p. 101). Kintsch writes: "Concepts are not defined in a concept net, but their meaning can be constructed from their position in the net. The immediate associates and semantic neighbors of a node constitute its core meaning. Its complete and full meaning, however, can be obtained only by exploring its relations to all the other nodes in the net. Meaning must be created. [...] It is not possible to deal with the whole, huge knowledge net at once. Instead, at any moment only a tiny fraction of the net can be activated, and only these propositions of the net that are actually activated can affect the meaning of a given concept. Thus, the meaning of a concept is always situation specific and context dependent. It is necessarily incomplete and unstable: Additional nodes could always be added to the activated subnet constituting the momentary meaning of a concept, but at the cost of losing some of the already activated nodes." (p. 165). Readings are not distinguished on this level. They must be constructed on the basis of clustering methods or on the basis of context structures (cf. sections 4 and 5).
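
The idea that a concept's momentary meaning is a small activated subnet rather than a fixed definition can be illustrated with a minimal sketch; the net content and the activation cut-off are illustrative assumptions:

    # Minimal sketch of the concept-net view quoted above: the "momentary meaning"
    # of a concept is a small activated neighbourhood of its node, not a fixed
    # definition. Net content and the activation limit are illustrative assumptions.

    from collections import deque

    concept_net = {
        "build": {"construct", "architect", "house"},
        "house": {"build", "roof", "opera_house"},
        "opera_house": {"house", "Utzon", "Sydney"},
        "architect": {"build", "Utzon"},
        "Utzon": {"architect", "opera_house"},
        "roof": {"house"},
        "construct": {"build"},
        "Sydney": {"opera_house"},
    }

    def momentary_meaning(concept, max_nodes=5):
        # activate only a limited, situation-specific subnet around the concept node
        activated, queue = {concept}, deque([concept])
        while queue and len(activated) < max_nodes:
            for neighbour in concept_net[queue.popleft()]:
                if neighbour not in activated and len(activated) < max_nodes:
                    activated.add(neighbour)
                    queue.append(neighbour)
        return activated

    print(momentary_meaning("opera_house"))  # a small subnet, e.g. including 'house', 'Utzon', 'Sydney'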