Interval / Title page

The Interval Validation Toolkit

LE Reference / LE2-4002
Origin / Author / US / Lee Gillam, Khurshid Ahmad
WP / Task / WP4 / T43
Task Responsible / US
Distribution / CEC / Partners / public / Jose Soler
Status / internal draft / circulated draft / final
Doc. validated by / Contributors / Partners / UC / SC / TEC
Print date / 30/07/97
Nbr pages / 12
File name / T41-0006.DOC
Revision nbr / 1


Table of contents

1. Introduction
  1.1 Purpose
  1.2 Structure
  1.3 Revision history
2. The Validation Process
3. Available resources
4. Architecture of a Validation Subsystem
  4.1 Chuser/LoQator Subsystem
    4.1.1 System dependencies
  4.2 Analysor Module
    4.2.1 Ferret
    4.2.2 KWIC Analyser
    4.2.3 Definition Analyser
    4.2.4 Simple LANGuage
    4.2.5 Domain Classifier
    4.2.6 System Dependencies
  4.3 Report Generator
    4.3.1 IIF Generator
    4.3.2 Report Generation
5. Statement of intent
6. References


1. Introduction

1.1 Purpose

This document describes the proposed Interval Validation Toolkit that will aid terminologists in the validation process and reduce the cost of creating validated multilingual terminology. The document was written to elicit the opinions of the partners regarding the structure, performance and functionality of the validation toolkit.

1.2 Structure

Section 2 of this document describes the need for validation and some of the stages involved in it. Section 3 describes the resources that the project should make use of to create and validate terminology while reducing the costs involved in doing so. Section 4 describes the planned architecture of the Validation Toolkit and its interfacing with existing tools and technologies. Section 5 briefly states the intention to provide such a toolkit and the future direction in which it may proceed.

1.3 Revision history

  • Version 1.0 (Lee Gillam, Khurshid Ahmad / US)

2. The Validation Process

In order to validate terminological resources, a variety of facts about those resources need to be known. These generally concern issues of quantity and quality: in particular, how many terms there are, how many languages they appear in, where the terms came from, how precise and relevant they are, and how well defined they are.

To be able to validate these resources effectively, the facility must exist for the extraction of this type of data from disparate term bases. The IIF (T41) was created for exactly such a purpose. It allows rapid indexing of particular key items within the data; however, it does not allow questions of precision, relevance and quality to be answered. It is hoped that the production of a workbench of tools will help to identify and provide evidence for some, if not all, of the issues raised above.

The productive use of any terminology collection depends on how accurate the terminology is at various levels, including the lexical and semantic. Such questions were discussed only peripherally when the IIF was created. The basic premise behind the creation of the IIF was that terminology needs to be organised into a standard ‘format’ for interchange purposes. The IIF is therefore concerned with the form of the data; it does not account for the content of the data.

Validation of terminology is a crucial part of terminology management. Terminology can only be validated with the help of domain experts. The expert should be provided with term and definition information as well as evidence about the existence and usage of a given term. This evidence takes two forms: how the term occurs in the domain-specific database, and how it is used by the domain community.

All this evidence should be provided to the experts and it is for them to validate the term, or not. Part of the validation toolkit developed at the University of Surrey aims to expedite evidence gathering and reduce the labour, and indeed finance, expended on such efforts.

The availability of terminological resources worldwide through departmental intranets and intercontinental networks, whether free or otherwise, provides the evidence for a term. The other important evidence is the use of the term in text. Before the advent of the Internet, access to such resources was limited. Now a vast amount of terminological information - terms and the documents which use them - is available. This makes the Internet an ideal low-cost resource which can aid the terminologist in the selection of terminologies and terminological evidence. Such resources should be embraced and utilised as extensively as possible.

3. Available resources

Along with the quality term bases already in existence, the continuing expansion of the Internet has seen a variety of well-respected companies publishing their own terminology, mostly in the form of glossaries, in an attempt to make the terms they use de facto standards across the industry. It can be argued that these companies would not knowingly distribute such terminology unless it was in use within the company, and therefore the argument for precision and relevance can easily be made.

The terminology contained in these glossaries is also more likely to be the ‘current’ terminology. It may even contain a few neologisms. Either way, it will provide a useful mechanism for testing against existing terminology collections.

If such terminology exists at a given company site, then it is likely that there are freely available documents, also at that site, which will make use of this terminology - the all-important evidence for usage.

Even if the quality of the terminology available in these term bases is suspect, it still forms a valuable part of the evidence for a term. The same is also true of texts and the evidence of usage.

The problem with many of these resources is that they generally contain a small number of terms, different companies provide different definitions, and each is published using different HTML mark-up as the data format.

The amount and variety of these freely available glossaries/term bases and related texts provide valuable resources that were not so freely available at the start of the project. Being available now, they should be used both as testbeds and as a means to enrich the quality and extend the scope of existing term bases. This proposition fits wholly within the aims and objectives of the Interval project.

In order to make use of these resources therefore, there is a need to convert them into a standard form and, once converted, to allow the terminologist making these comparisons to provide evidence from the literature for the usage of these terms. One stage of this may be to compare these terms with an existing term base using the Consolidation tool already developed and in use within the project.
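
Purely as an illustration of this kind of comparison (and not a description of the project's Consolidation tool), the harvested terms might simply be partitioned into those already present in an existing term base and those that are new; the term lists below are invented:

    # Invented sample data; illustration only, not the Consolidation tool itself.
    existing_termbase = {"router", "gateway", "bridge"}
    harvested_terms = {"router", "firewall", "packet filter"}

    already_present = harvested_terms & existing_termbase          # overlap with the term base
    candidates_for_addition = harvested_terms - existing_termbase  # potential new entries

    print("Already present:", sorted(already_present))
    print("Candidates for addition:", sorted(candidates_for_addition))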

It has been proposed that IIF be used as the standard format in which to encode this data, although this could be easily converted to MARTIF or some other similar SGML-type format.

4. Architecture of a Validation Subsystem

The validation process therefore consists of three stages:

  • locating/choosing resources;
  • analysing the content of these resources; and
  • generating a report regarding quantity, quality and evidence.

These three subsystems will be referred to as chuser/loqator, analysor and report generation. The first will encode and extract data from term resources such as the Web glossaries; it will also accept data exported from existing term bases via IIF. Once the terms and related information have been coded into IIF, the analysis phase is carried out, in which the analysor module finds evidence for terms in a variety of documents. From this, a more complete version of the IIF, along with a report, can be created. These two items can then be consolidated and, once the consolidation is complete, the expert can carry out the validation using the generated report.
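
As an illustration only, the three stages might be chained as sketched below; the function names, record fields and sample data are hypothetical and do not reflect the agreed structure of the IIF.

    # Illustrative sketch of the three-stage pipeline; all names are hypothetical.
    raw_sources = [("packet switching", "A method of routing data in discrete units.",
                    "vendor glossary")]
    corpus_lines = ["Packet switching breaks messages into packets.",
                    "Circuit switching reserves a fixed path."]

    def chuse_and_locate(sources):
        """Stage 1: gather candidate entries from the WWW or from term base exports."""
        return [{"term": t, "definition": d, "source": s} for t, d, s in sources]

    def analyse(entries, corpus):
        """Stage 2: attach usage evidence found in a text collection."""
        for entry in entries:
            entry["evidence"] = [line for line in corpus
                                 if entry["term"].lower() in line.lower()]
        return entries

    def generate_report(entries):
        """Stage 3: summarise quantity and evidence for the domain expert."""
        lines = ["%d candidate terms" % len(entries)]
        for e in entries:
            lines.append("%s (%s): %d citation(s)" % (e["term"], e["source"],
                                                      len(e["evidence"])))
        return "\n".join(lines)

    print(generate_report(analyse(chuse_and_locate(raw_sources), corpus_lines)))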

The above provides a brief overview of the utility of the system. The following sections explain in more detail the methods that will be used to provide its required parts.

4.1 Chuser/LoQator Subsystem

The chuser/loqator subsystem will enable IIF to be created or merged from a variety of sources. The primary sources are the WWW and existing term bases (through export to IIF). For WWW sources, the encoding of the individual pages will have to be substituted for IIF-style encoding. An application will be provided to enable this process to be carried out as autonomously as possible. This application will store known forms, which can be modified for other resources. Creation of the IIF should be as simple as a global search/replace in a word-processing package.
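
As a sketch of the kind of pattern-based substitution envisaged (the HTML layout and the output tags below are assumptions, not the defined IIF mark-up):

    import re

    # Hypothetical glossary layout: <dt>term</dt><dd>definition</dd> pairs.
    html = "<dt>router</dt><dd>A device that forwards packets between networks.</dd>"

    # One substitution pattern per known source layout.
    pattern = re.compile(r"<dt>(.*?)</dt>\s*<dd>(.*?)</dd>", re.S)
    entries = [{"term": m.group(1).strip(), "definition": m.group(2).strip()}
               for m in pattern.finditer(html)]

    # The output tags are placeholders for whatever IIF encoding is agreed.
    for e in entries:
        print("<entry><term>%s</term><def>%s</def></entry>"
              % (e["term"], e["definition"]))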

4.1.1 System dependencies

Inputs / User selections, IIF from Termbases, IIF from other programs
Process / User modifications for structure
Outputs / Audit trail, internal form for IIF, IIF

Once this stage is complete, the three output forms can optionally be used further by the Analysor module.

4.2 Analysor Module

The analysor module should be able to accept data in one of two forms: textual IIF, or an internal form created by the chuser/loqator module. The audit trail is such that an existing trail can be selected and amended with the data created by a previous application. The structure of such a trail will be investigated once the applications have reached a usable stage of development.

There are a variety of applications which can be invoked on the input received. This set of tools would be used both to gain evidence for the usage of the terms that have been encoded, and to check such lexical information as the simplicity of the definition being used.

4.2.1 Ferret

Ferret allows the straightforward extraction of syntactic compounds based on syntactic cues. This approach works in English, but its application to other languages has yet to be evaluated. The tool could be used to extract from, say, definitions, as well as from plain text/HTML/SGML. The utility of Ferret will have to be established through testing.
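
Purely to illustrate extraction on syntactic cues (this is not a description of Ferret's internal workings), closed-class words can be used as boundaries between candidate compounds:

    # Illustration only: closed-class words delimit candidate compounds.
    STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "that", "which"}

    def candidate_compounds(text, min_words=2):
        compounds, current = [], []
        for word in text.lower().replace(",", " ").replace(".", " ").split():
            if word in STOPWORDS:
                if len(current) >= min_words:
                    compounds.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if len(current) >= min_words:
            compounds.append(" ".join(current))
        return compounds

    print(candidate_compounds("The asynchronous transfer mode and the broadband access network."))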

4.2.2 KWIC Analyser

The KWIC analyser extracts key-word-in-context information. This approach is known to work in a variety of languages; again, however, its utility will have to be established through testing.
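
For illustration, a minimal key-word-in-context listing can be produced by printing each occurrence of a term with a fixed window of context on either side (a hypothetical sketch, not the project tool itself):

    def kwic(term, texts, width=30):
        """List each occurrence of term with `width` characters of context."""
        lines = []
        for text in texts:
            low = text.lower()
            start = low.find(term.lower())
            while start != -1:
                left = text[max(0, start - width):start]
                right = text[start + len(term):start + len(term) + width]
                lines.append("%30s [%s] %s" % (left, text[start:start + len(term)], right))
                start = low.find(term.lower(), start + 1)
        return "\n".join(lines)

    print(kwic("bandwidth", ["The available bandwidth is shared between users.",
                             "Bandwidth on demand is negotiated at call set-up."]))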

4.2.3 Definition Analyser

It has been noted in the general-language lexicography literature that definitions have a particular structure, which leads to a typology of definitions. Sinclair et al. (1994) identified a number of definition structures in the Cobuild dictionary; generally only one or two of these structures are used. Being able to identify and analyse these structures (or a lack of such structure), or to suggest how a definition could be structured more effectively, would provide a measure of confidence in the quality of the definition itself. Work on this is in progress at the University.
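
As a rough sketch of checking a definition against surface patterns (the patterns below are invented examples, not Sinclair et al.'s typology):

    import re

    # Invented surface patterns; not Sinclair et al.'s actual typology.
    PATTERNS = [
        ("X is a Y which/that Z",
         re.compile(r"^(an?|the)\s+.+\s+is\s+(an?|the)\s+.+\s+(which|that)\s+.+", re.I)),
        ("X: a Y used for Z",
         re.compile(r"^.+:\s+(an?|the)\s+.+\s+used\s+(for|to)\s+.+", re.I)),
    ]

    def classify_definition(definition):
        for name, pattern in PATTERNS:
            if pattern.match(definition.strip()):
                return name
        return "no recognised structure"

    print(classify_definition("A router is a device that forwards packets between networks."))
    print(classify_definition("Fast, cheap, good."))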

4.2.4 Simple LANGuage

Appropriate for definition analysis, the idea behind this tool is the detection of complex linguistic structures and their conversion to reduced forms. At present this is a highly user-interactive process, because the surrounding syntactic elements affect the structure and meaning of the passage. The approach has only been tested on English, although it is believed that it could be extended to other languages. The idea itself comes from the Plain English Campaign, an independent UK organisation, although its origins can be traced back at least to Sir Ernest Gowers (1948). The Plain English Campaign’s ‘A-Z Guide of alternative words’ booklet is the principal lexical resource used by the tool.
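
A sketch of the substitution step only; the word pairs below are commonplace examples and are not quoted from the Campaign's booklet:

    # Commonplace simpler-alternative pairs; illustration only.
    ALTERNATIVES = {
        "utilise": "use",
        "commence": "begin",
        "terminate": "end",
        "in the event that": "if",
    }

    def simplify(text):
        """Propose simpler wording; a user would confirm each change interactively."""
        suggestion = text
        for complex_form, simple_form in ALTERNATIVES.items():
            suggestion = suggestion.replace(complex_form, simple_form)
        return suggestion

    print(simplify("Users utilise this service in the event that the primary link should terminate."))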

4.2.5 Domain Classifier

The domain classifier would determine the exact domain, or subdomain, to which each term belonged. Research into the utility of such a mechanism is currently in progress.
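
One simple, purely illustrative approach (not necessarily the mechanism under investigation) would be to score the contexts in which a term occurs against small keyword lists for each candidate subdomain:

    # Invented keyword lists per candidate subdomain; illustration only.
    DOMAIN_KEYWORDS = {
        "telecommunications": {"network", "switch", "bandwidth", "call"},
        "finance": {"market", "price", "share", "interest"},
    }

    def guess_domain(contexts):
        """Return the subdomain whose keywords occur most often in the contexts."""
        scores = {domain: 0 for domain in DOMAIN_KEYWORDS}
        for sentence in contexts:
            words = set(sentence.lower().split())
            for domain, keywords in DOMAIN_KEYWORDS.items():
                scores[domain] += len(words & keywords)
        return max(scores, key=scores.get)

    print(guess_domain(["The switch allocates bandwidth per call.",
                        "Each call is routed over the network."]))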

4.2.6 System Dependencies

Input / Textual IIF, Internal IIF, Audit trail, text
Process / Analysis/Identification/Substitution
Output / Extended audit trail, IIF

Once again, the output of this phase can be used in either of two ways. The most effective mechanism, however, will be to allow the full generation of the IIF and the report, which can then be used for the consolidation process and the final full validation by the domain expert.

4.3 Report Generator

The final tools in this subsystem produce the full version of the Interval Interchange Format, for interpretation by the Consolidation tool, and the report for the domain expert.

4.3.1 IIF Generator

If the IIF has been passed directly from other tools, it will need to be generated in either textual form or some other internal format that can then be processed by the Consolidation tool; hence a generation tool is necessary. This tool may be used at various stages; however, it is expected to be used prior to the creation of the final report.

4.3.2 Report Generation

A report is generated from the audit trail, detailing the stages undertaken and the results obtained from the tools so far. This report can be sent to a domain expert for the final validation. The audit trail will have been amended by the consolidation tool with the actions undertaken. The report should provide a useful overview both of the work carried out in identifying the terms and of the current structure of the term base.
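
Purely as an illustration of turning an audit trail into a readable summary (the record fields and figures below are invented):

    # Hypothetical audit-trail records; the field names and figures are invented.
    audit_trail = [
        {"tool": "chuser/loqator", "action": "imported 120 entries from a vendor glossary"},
        {"tool": "KWIC analyser", "action": "found 57 citations for 45 terms"},
        {"tool": "consolidation", "action": "merged 38 entries into the existing term base"},
    ]

    def report(trail):
        lines = ["Validation report", "================="]
        for step, record in enumerate(trail, 1):
            lines.append("%d. %s: %s" % (step, record["tool"], record["action"]))
        return "\n".join(lines)

    print(report(audit_trail))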

5. Statement of intent

Given the proposed structure of the system, the task of improving and extending existing terminological resources should, with such tools, become a continuous, semi-autonomous process. The ideal tool would be one which automatically trawls the world to find new terminology, locates as much related information as possible and only then reports such items to a terminologist. This validation subsystem will be one step along the way to such a goal.

A prototype system (System Quirk: Tracker) has been developed which can be used for the Chuser/Loqator tasks. Other parts of the system already exist in various forms. Given the appropriate resources, we hope to integrate and refine these tools.

6. References

Sinclair, J., Hoelter, M. and Peters, C. (1994). The Languages of Definition: The Formalisation of Dictionary Definitions for Natural Language Processing. Brussels/Luxembourg: Office for Official Publications of the European Communities.

Gowers, Sir E. (1948). The Complete Plain Words. Penguin Books Ltd., Victoria, Australia.
