A Web based

Multilingual Indexing

and Retrieval System

Technical Report

prepared for the ML-Images!

e-Content project

Yanis Maistros, Stella Markantonatou, Marina Vassiliou

Athens 2002

TABLE OF CONTENTS

TABLE OF FIGURES

1. Introduction

2. Digitisation techniques and documentation schemas

3. Setup Consideration Issues

3.1 “Reference” and “Thumbnail” image

3.2 Standards for exchanging information

4. The ML-Images! System

5. The Back-Office System

6. Providing multilinguality

6.1 The Multilingual ML-Images! Matrix (MMM)

6.2 Utilization of the Multilingual ML-Images! Matrix (MMM)

6.2.1 Indexing

6.2.2 Searching

6.3 Implementation of the Multilingual ML-Images! Matrix (MMM)

6.4 Tagger

6.5 Lemmatiser

6.6 MMM construction

7. ML-Images! DataBase

8. Multilingual Query Processing Tool (MQPT)

8.1 Components of the Multilingual Query Interface

9. System Specifications and Design Issues

10. Glossary

11. References

12. Notes

13. Appendix I

13.1 KPA

13.1.1 KPA DataBase Record

13.1.2 ML-Images DataBase Record

13.1.3 KPA - Picture24 Record Fields

13.1.4 KPA - Other DataBases

13.2 CHOL

13.2.1 CHOL XML Files

13.2.2 CHOL DataBase Record Fields

13.2.3 CHOL Fields' Information

13.2.4 CHOL DB Example Record

13.3 GMM (GIUNTI MULTIMEDIA)

13.3.1 GMM Record

TABLE OF FIGURES

Figure 1. Overall System Architecture

Figure 2. Initial Setting-up / Back-Office System

Figure 3. Concept Id Generation at Indexing Phase, Back-Office System

Figure 4. Structure of the MMM (a hierarchy of concepts)

Figure 5. ML-Images! Database Record and its Fields' Description

Figure 6. The Multilingual Query Processing Tool (MQPT)

Figure 7. The Multilingual Query Processing Tool and the Back-Office System, sharing common resources

Figure 8. General System Architecture

1.  Introduction

The project deals with the exploitation of image archives by using an information system that best meets the requirements of the content providers and the image users.

The project started with a survey of user needs, which made clear that users are especially interested in:

1.  a common digitisation and documentation schema

2.  support for multilingual search

3.  a variety of ways of searching (keyword search and thematic search)

4.  "intelligent" interaction with the system

The design of the system presented in this document relies strongly on these results, which we will collectively refer to as “the requirements”. In the remainder of this document we present the design decisions by describing and discussing each main system component in turn. These components concern:

·  digitisation techniques and documentation schemas

·  “thumbnail” and “reference” images

·  standard communication protocols and data formats

·  the ML-Images! Indexing system (Back-Office System)

·  the ML-Images! Searching system (Multilingual Query Processing Tool)

2.  Digitisation techniques and documentation schemas

In order to satisfy point 1 of the requirements, common standards and guidelines in the field of digitisation and documentation must be adopted.

The digitisation methods implemented by the content providers in their archives and DataBases may be summarised as follows:

ML-Images! Archive partners (content providers) possess archives organised in DataBases (DBs). These DBs have been built on different platforms (Oracle 8i, Claris FileMaker) and operating systems (Mac OS, Unix, Windows). Each record of these DBs contains information on the image, including its annotation (title, caption, keywords, etc.).

The record structure (schema) of these DBs differs in the number of fields and in the field contents (field values). The schemas do not follow a single meta-data standard (such as Dublin Core).

During the digitisation phase, each archive partner follows its own annotation method (e.g. KPA uses IPTC). Two of them (KPA, GMM) rely on manual annotation, whereas one (CHOL) is about to adopt an automatic annotation process.

The indexing techniques used by the content providers' annotators do not involve any thesaurus or ontology. Annotation is stored in a single language (CHOL, KPA) or in two (GMM). One partner (KPA) allows for monolingual search and the other two for bilingual search (GMM: Italian and English; CHOL: French and English).

Several good guidelines are available on which to base a digitisation and documentation project. A definitive standard, however, does not exist, as several factors have to be taken into consideration. Having discussed the outcomes of the preparatory studies, the ML-Images! project team has reached the following conclusions or action lines:

i.  English has been selected as the interlingua.

ii.  Image information stored in the archives will be mapped onto a common structure, or common representation schema, following a meta-data standard (Dublin Core).

iii.  The construction of a new thesaurus is considered essential, in an effort to unify the diverse documentation practices of the partners. This thesaurus will function as a common system of concepts and will aid both concept-based indexing and cross-language retrieval (cf. section 6).

iv.  The thesaurus of point iii should be translated into the six languages of the project, yielding a set of monolingual thesauri which stand in a relation of translational equivalence to one another.

3.  Setup Consideration Issues

Common digital image specifications and description specifications will be developed for "resource discovery" across geographically distributed DataBases. To avoid duplicating work, it is important that these specifications co-operate smoothly with the existing databases. Derivative information from the Archives may be extracted to form a front-end Inventory, which the end user may visit first.

The alternative approach of "harvesting" decentralised information systems and image collections with the help of standards such as Z39.50 is not feasible, as few information systems in the archival community currently support protocols of this kind. This is mainly because the archives of the project are typically designed for controlled access to the digital holdings, so that hacking and misuse of the data are prevented.

3.1  “Reference” and “Thumbnail” image

Agreement is possible both on the specification of the ML-Images! "reference image" (the original Archive image) and “thumbnail” image required by the information system, and on common elementary documentation elements (of a technical nature), as these specifications are subsets of the more extensive specifications used by the local archives.

3.2  Standards for exchanging information

For the exchange of information, standard communication protocols and data formats should be used. The File Transfer Protocol (FTP) will be used to upload new or updated descriptions and images to the server of the ML-Images! System (mirroring techniques). Standard clients will be used for the communication between an archive and the end user regarding the ordering of photographs (e-commerce). The documentation of the images will be created in the standard XML format (eXtensible Markup Language), complying with the Dublin Core meta-data standard. The Resource Description Framework [W3C-RDF] standard may also be used to encode the semantics of the Dublin Core meta-data.
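As a minimal sketch of what such an image description could look like, the following Python fragment serialises one description as XML using Dublin Core elements. The choice of elements, the sample values and the wrapper element name `record` are illustrative assumptions and not the project's actual schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Serialise a dict of Dublin Core element names to an XML string.

    `fields` maps an element name (e.g. "title") to a string or to a list
    of strings; repeatable elements such as "subject" use a list.
    The wrapper element <record> is a hypothetical choice.
    """
    record = ET.Element("record")
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(record, encoding="unicode")


# Invented sample values, for illustration only.
sample = {
    "title": "Harbour at dawn",
    "description": "Fishing boats leaving the harbour.",
    "subject": ["harbour", "boat"],
    "language": "en",
    "identifier": "example-0001",
}
xml_text = build_dc_record(sample)
print(xml_text)
```

Each keyword becomes its own `dc:subject` element, which keeps the keywords individually addressable for later indexing.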

4.  The ML-Images! System

Derivative versions of the content providers' archives (DataBases), integrated in the ML-Images! Knowledge Base (cf. section 7), will serve as the Images Inventory of the system. The derivation may be performed automatically by the ML-Images! Back-Office System (cf. section 5).

End users will browse and search this inventory (ML-Images! Knowledge Base) and will forward their orders to the original archives (kept by the content providers) via an ML-Images! component. Purchase and payment transactions may also be performed using the ML-Images! e-commerce module.

Figure 1.  Overall System Architecture

5.  The Back-Office System

The ML-Images! Back-Office System (Figures 2 and 7) will be part of the "Administrative Tool (AT)" of the ML-Images! System. This module will be responsible for building, maintaining and updating, offline, the ML-Images! DataBase, a major component of the ML-Images! Knowledge Base. The Back-Office System will use an automated "mapping" procedure to create "derivative records" for central online access. These derivatives will be used by the ML-Images! System, which will eventually provide access to all the source collections (Archives/DataBases).

During the maintenance process, "Digital Master Files" will be generated from the existing archives by exporting all the information that is relevant to the ML-Images! System. In other words, only those fields of the original records which are relevant to online access will be mapped onto the Digital Master File. One Digital Master File will thus be created for each archive DataBase.

Derivative records will then be automatically extracted from all the Digital Master Files. (This can be accomplished using existing software, for instance Fotoweb 2000.) Each Digital Master File record yields exactly one derivative record. All derivatives are sent to the ML-Images! server via the standard Internet protocol FTP. The derivatives will be built in compliance with the Dublin Core meta-data standard, whereas the original archives (DataBases) are unlikely to conform to this format.
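The field selection and one-to-one derivation described above can be sketched as follows. The source field names (`PictureId`, `Headline`, `Caption`, `Keywords`, `InternalShelfMark`) and the mapping table are hypothetical. Since the source schemas differ, each archive would need its own table.

```python
# Hypothetical mapping from one archive's field names to Dublin Core
# element names. Fields absent from the table are treated as not relevant
# to online access and are dropped.
FIELD_MAP = {
    "Headline": "title",
    "Caption": "description",
    "Keywords": "subject",
    "PictureId": "identifier",
}


def to_derivative(source_record, field_map=FIELD_MAP):
    """Build one derivative record from one source record."""
    derivative = {}
    for source_field, dc_element in field_map.items():
        value = source_record.get(source_field)
        if value not in (None, "", []):  # skip missing or empty fields
            derivative[dc_element] = value
    return derivative


def derive_all(master_file):
    """Produce exactly one derivative per master record, in order."""
    return [to_derivative(record) for record in master_file]


# Invented sample records, for illustration only.
master_file = [
    {"PictureId": "example-0001", "Headline": "Harbour at dawn",
     "Keywords": ["harbour", "boat"], "InternalShelfMark": "B-17"},
    {"PictureId": "example-0002", "Headline": "Market square", "Caption": ""},
]
derivatives = derive_all(master_file)
print(derivatives)
```

Keeping the mapping in a table rather than in code means that supporting another archive only requires a new table, not a new procedure.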

The archive administrators, who are responsible for each local DataBase, may add, delete or replace images independently. The Back-Office System will update the ML-Images! System periodically.

Figure 2.  Initial Setting-up / Back-Office System

During the automatic generation of the Digital Master File, a 'Digital Master Record' will be extracted from each image record stored in the original DataBases (archives), carrying over to the derivative (XML file) only the fields which are relevant to online access. From each Digital Master Record a Derivative Record will be created. A special agent will integrate all derivative records to build the "ML-Images! DataBase" and will relate each keyword they contain to a common system of concepts. This integration, or embedding, is accomplished by consulting the Multilingual ML-Images! Matrix (MMM) to encode the thematic domains related to each image (cf. sections 6 and 7).

6.  Providing multilinguality

The Multilingual Query Processing Tool (see section 8) will be used to search the image inventories in a cross-lingual way (Soergel 1997). Under this approach images can be retrieved even if their annotation language differs from the query language.

Concept-based retrieval is implemented instead of free-text searching. Under this approach textual fields (such as title, caption, related keywords, etc.) are mapped onto a special data model, the "Multilingual ML-Images! Matrix" (MMM) (cf. section 6.1).

This concept-based approach is expected to increase precision and specificity.

6.1  The Multilingual ML-Images! Matrix (MMM)

At the heart of the ML-Images! System, the "Multilingual ML-Images! Matrix" (MMM) is used for both indexing and searching. This Matrix, which is in effect a defined set of concepts or a simple controlled vocabulary (taxonomy), will be created by unifying information from four resources:

(1)  the IPTC classification standard;

(2)  thesauri which are used by the content providers;

(3)  lists of keywords used for the annotation of images in the archives (wherever that applies); and

(4)  lists of words which will be extracted by processing the textual fields of the archives (wherever that applies).

The IPTC classification standard will be used because it has served as a basis for the development of resources (2) and (3). We propose to have the ML-Images! System tuned to IPTC (rather than relying on the providers’ thesauri only) because we want to make sure that the ML-Images! System will be able to adapt to any (international) developments concerning the IPTC standard and its uses. KPA and GMM have provided data for resources (2) and (3). Images from the KPA Archives are annotated in German and the images from GMM in Italian and English. The data for resource (4) will be delivered to ILSP, who are responsible for the multilingual component of the project.

6.2  Utilization of the Multilingual ML-Images! Matrix (MMM)

The Multilingual ML-Images! Matrix (MMM) will be used as a knowledge base module for:

·  The integration (embedding) of all derivative records imported to the ML-Images! DataBase. They will be related to one or many thematic domains by generating a new indexing field, named "Concept id" (indexing phase) (cf. Figure 2 and Figure 7).

·  The automatic translation of the words contained in the end user query (searching phase) (cf. Figure 6 and Figure 7).
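The searching-phase use of the MMM can be illustrated with a toy example. It is only a sketch under stated assumptions: the matrix layout (concept id, then language code, then lemma), the concept ids, the terms and the record structure are all invented here and do not reproduce the actual MMM implementation.

```python
# Toy matrix: concept id -> language code -> lemma. Ids and terms invented.
MMM = {
    "C001": {"en": "harbour", "de": "hafen", "it": "porto"},
    "C002": {"en": "boat", "de": "boot", "it": "barca"},
}


def concepts_for(word, language, matrix=MMM):
    """Return the ids of all concepts lexicalised by `word` in `language`."""
    word = word.lower()
    return {cid for cid, terms in matrix.items() if terms.get(language) == word}


def translate(word, source_language, target_language, matrix=MMM):
    """Translate a query word via its concept(s); unknown words yield []."""
    return sorted(
        matrix[cid][target_language]
        for cid in concepts_for(word, source_language, matrix)
        if target_language in matrix[cid]
    )


def search(query_words, language, records, matrix=MMM):
    """Return ids of records sharing at least one concept with the query."""
    wanted = set()
    for word in query_words:
        wanted |= concepts_for(word, language, matrix)
    return [r["id"] for r in records if wanted & set(r["concept_ids"])]


# Invented records; the annotation language no longer matters once the
# records carry concept ids.
records = [
    {"id": "img-1", "concept_ids": ["C001", "C002"]},
    {"id": "img-2", "concept_ids": ["C002"]},
]
print(translate("Hafen", "de", "it"))
print(search(["porto"], "it", records))
```

Because matching happens on concept ids rather than on words, an Italian query can retrieve an image annotated in German without any pairwise bilingual dictionary.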

6.2.1  Indexing

During the building or updating phase of the ML-Images! DataBase (Back-Office System), all the keywords extracted from the source (original) archive, that is, from the derivative equivalent of the Digital Master File, will be checked by an agent (the concept agent) against the Multilingual ML-Images! Matrix (MMM). The MMM will enrich every keyword found with a classification index, which we name the "Concept id". This index classifies each keyword under a thematic domain (Figure 3). This is possible because the MMM encodes a taxonomy, or Common System of Concepts (cf. section 6.3, Figure 4). The "Concept id" will be transferred, along with all fields relevant to the ML-Images! project, to a new record of the generated ML-Images! DB.

In Figure 3, "Concept id" is the name of a generated multiple-value field in the target ML-Images! DB. This field contains one value (index, or concept-identifying code) for each keyword that is contained in the textual fields or the keywords field of the source DB record (title, caption, keywords) and occurs in the MMM.
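The generation of the "Concept id" field might be sketched as below. The lookup table, concept ids and record layout are invented for illustration. Lower-casing and splitting on non-letters stands in for the tagging and lemmatisation steps, and collapsing repeated concepts into a single value is an assumption of this sketch rather than something the text specifies.

```python
import re

# Toy lookup derived from the MMM: (language, lemma) -> concept id. Invented.
TERM_TO_CONCEPT = {
    ("de", "hafen"): "C001",
    ("de", "boot"): "C002",
    ("en", "harbour"): "C001",
}


def concept_ids(record, language, lookup=TERM_TO_CONCEPT):
    """Collect the concept ids of every word of the record found in the lookup.

    Words are taken from the title, caption and keywords fields.
    """
    text = " ".join([record.get("title", ""), record.get("caption", ""),
                     *record.get("keywords", [])])
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    ids = []
    for word in words:
        cid = lookup.get((language, word))
        if cid is not None and cid not in ids:
            ids.append(cid)  # keep first-seen order, no duplicates
    return ids


# Invented derivative record annotated in German.
record = {"title": "Boot im Hafen", "caption": "", "keywords": ["Hafen", "Morgen"]}
indexed = {**record, "concept_id": concept_ids(record, "de")}
print(indexed["concept_id"])
```

Words that do not occur in the lookup (here "im" and "Morgen") contribute no value, so the field can legitimately be empty for a record whose vocabulary is not covered.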

Derivative Record