Technische Universität Hamburg-Harburg

Arbeitsbereich Softwaresysteme

Online Help and User Manuals

A Syntactical Analysis using

Case Based Reasoning Tools

Project

submitted by

Rolando Armuelles

Information and Communication Systems

Masters Program

Supervisors:

Prof. Dr. Florian Matthes

Dipl-Inform. Ulrike Steffens

March 1st, 2000

Contents

1 Motivation and Goals
2 A Grammar for English Phrases
2.1 Natural and Artificial Languages
2.2 Linguistic competence
2.3 Word structure
2.4 Syntax
2.4.1 The Parts of Speech
2.4.2 Describing Sentence Structure
2.5 A Grammar for English Phrases
3 Syntactical Analysis of Phrases in Tables of Contents
3.1 User Manuals
3.2 Online Help Systems
3.2.1 Windows Help
3.2.2 Getting Help in Windows 98
3.2.3 Analyzing the Syntax
3.3 Comparing Online Help with Manuals
4 Semantics Analysis of Verbs used in Information Technology
4.1 Defining Semantics
4.2 A Classification of Verbs
5 Case Based Reasoning
5.1 Cases
5.2 Methods
5.3 Decomposition of CBR
5.4 A Case Based Reasoning Tool
5.4.1 CBR4 Works Professional
5.5 CBR Applied to Syntax Analysis
6 Conclusion

Appendix

A List of English Verbs

Bibliography

Figures

2.1 A labelled Constituent Structure Diagram
2.2 A Context Free Grammar
3.1 Style I in Table of Contents of Manuals
3.2 Style II using Combinations of Nouns
3.3 a, b Styles III and IV in Software Manuals
3.4 Main Contents in Word for Windows 98 Help
3.5 Style A, The Preferred Construction for Topics in Windows Help
3.6 a, b Style B, Top Level Problem Definitions in Microsoft Word Help
3.7 Style C, Two Nouns Combine to Form an Online Help Topic
3.8 Style D, "Automatically" Topics, with an Adverbial Phrase
4.1 A Classification of Verbs used in Information Technology
5.1 The Case Based Reasoning Cycle
5.2 CBR-Works Application Architecture
5.3 Basic Structure of a CBR-Works Application
5.4 The CBR-Works 4 Concept Manager
5.5 Concept Manager's Graphical View
5.6 The CBR-Works 4 Type Manager


Chapter 1

Motivation and Goals

An increasing number of information system users have limited computing experience and need easy-to-understand assistance. Likewise, more advanced users and system developers need reliable sources for technical support and learning. Current aids include access to human experts, user manuals and, more recently, interactive online help systems.

The goal of this project is to ease and improve information retrieval in online help systems by analyzing the descriptions of problems and solutions they contain, bringing retrieval results closer to those achieved through direct contact with people and through learning from user manuals.

The chosen strategy is to find structures and common patterns in the expressions used for such descriptions: first by dividing them into smaller constituent terms, and then by closely examining the types of those components in order to gain deeper knowledge and possibly improve retrieval results.

A syntactical analysis of the table of contents of a user manual, compared with that of an equivalent software product, is used to find the structural differences and similarities that cause particular patterns of use in humans. A classification of the verbs commonly used in both types of media, based on their semantics, will be proposed.

In addition, Case Based Reasoning is introduced as an alternative to traditional grammatical analysis. By design, it provides not only fast and accurate retrieval but also a means of improving a system through direct or indirect feedback.

Chapter 2 introduces a grammar for the analysis of expressions used for problem definitions in user manuals and online help systems. Chapter 3 applies the rules of the grammar to depict the syntactical structure of the most common styles of problem definitions in printed and online help, and then compares the features of both media. Chapter 4 explores a list of verbs used in information technology and proposes a categorization of them based on their semantics. In Chapter 5, a Case Based Reasoning tool and development environment is tested and evaluated as a platform for natural language analysis. Finally, the conclusions of this project are summed up and future courses of action are suggested.


Chapter 2

A Grammar for Natural Language

2.1 Natural and Artificial Languages

Natural language is something that already exists and fulfils different functions in the process of communicating with other people. It can be used to express anger, grief, commands, questions, ideas, etc. An artificial language, on the other hand, is something that has to be defined. Any of the many programming languages existing today is an example. There are three main differences between natural and artificial languages[1]:

§  The rules introduced in artificial languages leave out ambiguities. This makes them easier to process than natural languages.

§  Statements in programming languages are generally kept simple, because each of them serves a specific objective.

§  Natural languages fulfil many functions, thus making it very difficult to represent the meaning of everything they can be used to express.

Studying the structure of sentences in a natural language is, however, not enough to understand their meaning. The context in which something is expressed also plays an important role. In the interest of simplicity, this aspect will not be considered here.

2.2 Linguistic Performance

Noam Chomsky observed that the language heard in everyday life is inconsistent, containing interruptions, mistakes and other features that make it a very poor basis for discovering underlying regularities. He called this type of linguistic behaviour “linguistic performance”. A scientific approach to language involves an explanation of our linguistic competence, not only our linguistic performance.[2]

Chomsky further argued that a generative grammar must be specified to capture linguistic competence. In general, syntax is a description of how words combine to form sentences. A grammar can additionally cover sounds (phonology) and meaning (semantics). A generative grammar relates to syntax: it is capable of generating all the sentences in a language while generating nothing that is ungrammatical[3].

The techniques to handle artificial languages provide an efficient base for natural language processing and relate to many other fields of study: theory of grammars, automata theory, data structures, logic programming, psychology and the philosophy of language.

2.3 Word structure

Words are not always atomic. They can sometimes be broken down into pieces with meaning. The word premodifying can be broken down, while the word plot cannot. Other words can nevertheless be derived from plot, such as plotter and plotting.

The parts of words that have meaning are called morphemes. Examples of these are pre, er, ing, s, ed. Words that contain more than one morpheme are complex. Complex words have a base or stem morpheme and an attached morpheme. Morphemes can be bound or free, depending on whether they can be words by themselves.

Bound morphemes can be classified as prefixes or suffixes, according to whether they attach at the start or at the end of another morpheme.

The most widely used prefixes and suffixes in the English language are:

Prefixes: a, ante, anti, arch, auto, be, bi, co, counter, de, dis, em, en, ex, extra, fore, hyper, in, inter, mal, mis, non, post, pre, pro, re, semi, sub, super, trans, ultra, un

Suffixes: able, age, al, ance, ate, ation, cy, dom, ed, en, ence, er, ery, est, ful, hood, ible, ion, ing, ise, ish, ist, ity, ize, less, like, ly, ment, ness, ous, s, 's, s', ship, some, ster, teen, ty, ward, way, wise
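The affix lists above can be applied mechanically to split a complex word into its morphemes. The following Python sketch is only an illustration under simplifying assumptions: it uses a small, arbitrarily chosen subset of the affixes, the function name split_affixes and the min_stem threshold are hypothetical, and the matching is purely on spelling, so it does not handle doubled consonants or words that merely look affixed.

```python
# Small subsets of the affix lists above; the selection is illustrative only.
PREFIXES = ("anti", "counter", "pre", "re", "un")
SUFFIXES = ("ing", "er", "ed", "s")


def split_affixes(word, min_stem=3):
    """Split a word into (prefix, stem, suffix) by naive string matching.

    Longer affixes are tried first. An affix is only removed if the
    remaining stem keeps at least min_stem characters (an assumed
    threshold that avoids splitting short words such as "red").
    """
    prefix, suffix, stem = "", "", word
    for p in sorted(PREFIXES, key=len, reverse=True):
        if stem.startswith(p) and len(stem) - len(p) >= min_stem:
            prefix, stem = p, stem[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if stem.endswith(s) and len(stem) - len(s) >= min_stem:
            suffix, stem = s, stem[:-len(s)]
            break
    return prefix, stem, suffix


print(split_affixes("premodifying"))  # ('pre', 'modify', 'ing')
print(split_affixes("plot"))          # ('', 'plot', '')
print(split_affixes("plotter"))       # ('', 'plott', 'er')
```

The last call shows the limit of the approach: the stem comes out as plott because the doubled consonant is a spelling rule, not a morpheme. A real morphological analysis would need such rules in addition to the affix lists.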

The complex words in a language are formed by combining free and bound morphemes. The different word types are known as parts of speech. They represent a classification of words based on their use within a particular language. A word may have more than one part of speech classification.

For an in-depth coverage of the rules governing the formation of the parts of speech, refer to a textbook by Paul Bennet[4] or other books on linguistics.

2.4 Syntax

Syntax studies how words fit together to form structures up to the sentence level. In this section, different kinds of words and their roles in sentence structure are introduced. A context-free grammar to handle the structure of sentences will be presented later.

2.4.1 The Parts of Speech

In modern linguistics, the parts of speech are called syntactic categories. A short explanation of the most common categories will be given: noun, pronoun, determiner, adjective, preposition, verb, adverb, conjunction.

§  Nouns are known as naming words. Proper names are a sub-category of nouns and are used to name particular things, such as persons or places. Proper names behave differently from other nouns, called common nouns, which name whole classes of things. Proper names are not preceded by the or a, and they do not appear in the plural form.[5] Noun phrases are groups of words based on a noun. They can be used to describe things or classes of things.

§  Pronouns are a special case, because they appear on their own in place of other noun phrases. Some common pronouns are: I, me, you, he, him, she, her, it, we, us, them. Another type of pronoun, which performs a joining function, is known as a relative pronoun. For example: who, whom, which, that. They allow a noun phrase to be extended by adding a further phrase.

§  Determiners appear at the beginning of a noun phrase and tend to determine the way the phrase refers. They include the indefinite articles a and an and the definite article the. His and her can also act as determiners[6]. Quantifiers are words like all, some, as well as the numerals. They can also act as determiners.

§  Adjectives modify a noun, thus providing a more specific description. A noun can be modified by more than one adjective.

§  Prepositions often concern possession, direction or location. Some common prepositions are: after, at, before, by, down, during, from, in, inside, of, on, outside, to, up, upon, with, without. Within noun phrases, they sometimes behave similarly to relative pronouns, in that they allow a noun phrase to be modified by other noun phrases.

§  Verbs are thought of as doing words. That is, they express action. For example, delay, format, copy, suspend, run. Verbal groups are groups of verbs working together, and are often regarded as the most significant part of the structure of a sentence. In English, verbal groups are used to handle variations of tense, voice and modality. The tense refers to the place in time at which an event occurs: past, present or future. The aspect of a verb is connected to the tense and can be progressive or perfect. The voice of a verb can be active or passive. The modality indicates possibility, necessity or a degree of certainty. This is usually expressed by auxiliary verbs such as may and can.

§  Adverbs qualify verbs in much the same way that adjectives qualify nouns.

§  Conjunctions are words that connect, e.g. and, but and so. They can be used to join two simple sentences into a compound sentence and also to enumerate elements of another syntactic category.
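The categories above, together with the earlier remark that a word may have more than one part of speech classification, can be represented as a small lexicon. The sketch below is a hypothetical illustration in Python: the word list, the names LEXICON, categories and tag, and the category assignments are assumptions chosen for the example and are not taken from the sources cited in this chapter.

```python
# Hypothetical mini-lexicon: each word maps to the set of syntactic
# categories it can belong to. Words such as "format" and "copy" can be
# used both as nouns and as verbs.
LEXICON = {
    "the": {"determiner"},
    "a": {"determiner"},
    "format": {"noun", "verb"},
    "copy": {"noun", "verb"},
    "run": {"noun", "verb"},
    "document": {"noun", "verb"},
    "new": {"adjective"},
    "automatically": {"adverb"},
    "in": {"preposition"},
    "and": {"conjunction"},
    "it": {"pronoun"},
}


def categories(word):
    """Return the sorted list of categories of a word ([] if unknown)."""
    return sorted(LEXICON.get(word.lower(), set()))


def tag(phrase):
    """Pair every word of a phrase with all of its possible categories."""
    return [(word, categories(word)) for word in phrase.split()]


for word, cats in tag("Format the new document"):
    print(word, cats)
```

Because the lookup returns every possible category, choosing the right one for a given phrase is left to the grammar: in "Format the new document" only the reading of format as a verb and document as a noun yields a well-formed structure.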

2.4.2 Describing Sentence Structure

A sentence can be thought of as being composed of a relatively small number of building blocks, which are called its immediate constituents[7]. These constituents themselves have their own immediate constituents. This goes on until the level of the word is reached.

A phrase is a constituent smaller than a sentence. There are verbal phrases, noun phrases, prepositional phrases and others. Sometimes a constituent of a complex sentence has a sentence-like structure itself. This is commonly called a clause.

A good way of representing this is using a tree structure[8], where all points that contain a word, or label, are called nodes. The single node at the top is called the root node and the nodes at the bottom, which have no further constituents, are called leaf nodes or terminal nodes. Where two nodes are immediately dominated by the same node, they are siblings. An example of this structure can be seen in Figure 2.1. Note that the words themselves only appear on terminal nodes and the non-terminal nodes contain labels of abstract syntactic categories.

Figure 2.1: A labelled constituent structure diagram
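The tree terminology introduced above (root node, leaf nodes, siblings) maps directly onto a simple data structure. The following Python sketch is a minimal illustration: the class Node, the helper siblings and the example sentence are assumptions made for this example and do not reproduce the content of Figure 2.1.

```python
class Node:
    """A node of a constituent structure tree.

    Non-terminal nodes carry a syntactic category label and have
    children; leaf (terminal) nodes carry a word and have none.
    """

    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])

    def is_leaf(self):
        return not self.children

    def leaves(self):
        """Return the words on the leaf nodes, from left to right."""
        if self.is_leaf():
            return [self.label]
        words = []
        for child in self.children:
            words.extend(child.leaves())
        return words


def siblings(root, label):
    """Return the labels of the siblings of the first node (depth-first)
    with the given label, or None if no such node exists below root."""
    for child in root.children:
        if child.label == label:
            return [c.label for c in root.children if c is not child]
    for child in root.children:
        found = siblings(child, label)
        if found is not None:
            return found
    return None


# Hypothetical example sentence: "the user copies a file"
tree = Node("S", [
    Node("NP", [Node("Det", [Node("the")]), Node("N", [Node("user")])]),
    Node("VP", [
        Node("V", [Node("copies")]),
        Node("NP", [Node("Det", [Node("a")]), Node("N", [Node("file")])]),
    ]),
])

print(tree.leaves())        # ['the', 'user', 'copies', 'a', 'file']
print(siblings(tree, "V"))  # ['NP']
```

Here the root node S immediately dominates NP and VP, which are therefore siblings, and the words of the sentence appear only on the leaf nodes.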


2.5 A Grammar for English Phrases

No attempt will be made to build a grammar powerful enough to analyse all English sentences, for this task would be too large and complex. Instead, a grammar that fulfils the following criteria is needed:

a.  It is able to analyse all of the sentences in our study.

b.  It does not generate too many sentences that are ungrammatical in English.

c.  Its phrases and rules are generally applicable, even if in some cases they simplify the way English works.

Although the subset of the English language used for software documentation is reduced in comparison with all possible expressions, we expect this to change over time, as the evolution of information systems tends to simulate human-to-human interaction. The grammar developed in [3] is suitable for our purposes and strong enough to cover most problem definitions in user manuals or online help systems. Not all the rules will be used in the analysis, but they can be of interest for discovering structure or parsing within the text of a document or a larger universe of research objects (e.g. English phrases). The rules are shown in Figure 2.2.
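To make criteria a and b more concrete, the following Python sketch shows how a small context-free grammar can be used to decide whether a phrase is covered. It is a toy example under stated assumptions: the rules in RULES, the words in LEXICON and the function names are hypothetical and far smaller than, and not identical to, the grammar of Figure 2.2. The recognizer works top-down and assumes that the rules contain no left recursion.

```python
# Hypothetical toy grammar: each category maps to its possible expansions.
RULES = {
    "S": [["NP", "VP"], ["VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}

# Hypothetical lexicon: lexical categories and the words they cover.
LEXICON = {
    "Det": {"a", "the"},
    "N": {"document", "file", "user"},
    "V": {"opens", "copy", "print"},
}


def parse(symbol, words, start):
    """Yield every position at which `symbol` can end when it is
    matched against `words` beginning at index `start`."""
    if symbol in LEXICON:
        if start < len(words) and words[start] in LEXICON[symbol]:
            yield start + 1
        return
    for expansion in RULES[symbol]:
        positions = [start]
        for part in expansion:
            positions = [end
                         for pos in positions
                         for end in parse(part, words, pos)]
        yield from positions


def is_grammatical(phrase, symbol="S"):
    """Return True if the whole phrase can be derived from `symbol`."""
    words = phrase.lower().split()
    return any(end == len(words) for end in parse(symbol, words, 0))


print(is_grammatical("the user opens a file"))  # True
print(is_grammatical("copy the document"))      # True
print(is_grammatical("the opens user"))         # False
```

The first two phrases are accepted, which corresponds to criterion a, while the scrambled third phrase is rejected, which corresponds to criterion b. Imperative phrases such as "copy the document" are covered by the rule that lets a sentence consist of a verbal phrase alone.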