USER DOCUMENTATION

Application of Unicode

© Ex Libris Ltd., 2002

Version 16

Last Update: August 18, 2003

Table of Contents

1 Overview

2 Basic changes from non-Unicode versions

3 Introducing Unicode and UTF-8

4 Pre-Unicode multi-script

5 Unicode multi-script display

5.1 WEB Browser

5.2 PC Client

6 Comparison of PC settings (pre-14.2 – 14.2)

7 Using Unicode in ALEPH - /alephe/unicode tables

7.1 tab_character_conversion_line

8 MARC8_TO_UTF conversion program specifications

9 Basic Unicode tables (/alephe/unicode)

10 Comparison of tables pre-14 and post-14 (from 14.2) (/alephe/char_conv and /alephe/unicode)

11 Cataloging

12 Filing and Word Breaking

1 Overview

Unicode was introduced to the ALEPH 500 system with release 14.1, in July 2000.

Release 14.1 was the first step in Unicode implementation: all bibliographic data was stored in UTF-8, but all administrative data (e.g., patron registration, vendors, item records) was stored in the local standard (such as ISO Latin-1). From release 14.2 (December 2000), the entire system is in Unicode, and administrative data is also stored in UTF-8.

Within this new development, ALEPH 500 retains the “char_conv” principle familiar to users of previous versions. The tables enable translation of characters from two-byte to one-byte representation and vice-versa, for sorting and display purposes. ALEPH is retaining the “alpha” indicator in fields, although this may appear to be redundant in a Unicode environment. For the moment, it is used for detecting right-to-left fields (H, A). It might have additional functionality in the future.

2 Basic changes from non-Unicode versions

The /alephe/char_conv directory is no longer in use. It has been replaced by /alephe/unicode. All the character conversion tables are new and conform to UTF-8 standards.

Deciding which table is relevant for each instance has changed from program control to table control. The library installation is now able to assign the table to be used for a particular purpose (e.g., building sort keys for patron index, order index, etc.).

Another basic change is the revamping of word building and filing (sorting) procedures. This is not directly related to Unicode, but happened at the same time.

3 Introducing Unicode and UTF-8

To quote from the Unicode Consortium’s introduction, “What is Unicode?”:

“Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.

Unicode is changing all that!

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”

Until the application of Unicode, characters were grouped in sets limited to a code space of 256 characters. This is not sufficient: for example, over 400 characters are required to cover all the characters used by European languages based on the Latin script. As a result, multiple national standards developed, each fitting the character repertoire of a specific language into the limited code space. The result has been multiple, inconsistent character code sets, no easy way to deal with multilingual data, and no transparent transfer of data between computer systems.

With Unicode, there is one standard encoding for all characters, theoretically using two bytes (16 bits) for each character. UTF-8 maps Unicode values, using one byte for “English” characters (i.e., A-Z, a-z, numbers, punctuation, etc.), two bytes for many other characters (accented characters, Hebrew, Arabic, etc.), and three bytes for CJK (Chinese, Japanese and Korean).
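The variable byte lengths described above can be verified with any UTF-8 encoder; a minimal Python illustration:

```python
# UTF-8 uses a variable number of bytes per character:
samples = {
    "A": 1,   # basic Latin ("English") character - one byte
    "é": 2,   # accented Latin character - two bytes
    "א": 2,   # Hebrew character - two bytes
    "中": 3,  # CJK ideograph - three bytes
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected
```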

4 Pre-Unicode multi-script

Multi-script functionality of bibliographic data in non-Unicode versions of ALEPH was possible due to the presence of a script identifier (“alpha”) for the field. The “alpha” designation of a field or record, together with simultaneous activation of several code pages, meant that a large range of characters could be displayed in spite of the single-byte environment. A different font was assigned for each of the “alpha” sets (e.g., FontS01=18Courier New Cyr for “alpha S”). The mapping of special characters in the ANSEL and MAB character sets in the char_conv tables allowed for expansion of the character set, but always within the limit of the designated font.

5 Unicode multi-script display

5.1 WEB Browser

Netscape 6 and Microsoft Internet Explorer 5 support display of UTF-8. However, all data on the displayed page must be in the UTF-8 standard. A problem might arise if a user’s browser is not Unicode-enabled. Therefore, in order to display UTF-8 data in browsers that do not support UTF-8, ALEPH uses a “fallback” 256-character code page. The mapping from Unicode to the code page is defined in a table in the Unicode directory (e.g., unicode_to_8859_1). The default table is set in the /alephe/www_server_defaults file:

setenv server_default_charset "iso-8859-1"

Note:

WEB OPAC users should always set the MS Internet Explorer Encoding to Auto-Select.

There are known problems with the Unicode implementation on Netscape 6, and therefore it is preferable to use MS Internet Explorer 5.

5.2 PC Client

You can set more than one font for the PC client system, assigning a different font to different ranges of the Unicode character set. Although one Unicode font can be set for the entire range of characters, you might want to use multiple fonts, for two reasons. First, a complete Unicode font is heavy on computer resources and may not be required for many sites. Second, most Unicode fonts (such as Cyberbit) do not support the full range of Unicode characters, and a different font may be required for special characters (e.g., old Cyrillic).

There are two places in the client concerned with font setup:

\Alephcom\Tab\font.ini

\Catalog\Tab\catalog.ini

\Alephcom\Tab\font.ini

font.ini is used to define the font (and its attributes) to be used for a particular range of characters in the Unicode set, for each part of the GUI client window. The structure of the font.ini table is as follows:

Column 1: window part. Possible values are:

- EditorField: text in the cataloging draft window

- EditorDescription: description of the tag in the cataloging draft window (taken from codes.<lng>)

- EditorTag: tag in the cataloging draft window

- ListBoxCaption: caption at the head of a column in a list box

- ListBox##: text in a list box. Each column is identified by a number; ## can be used to signify all columns

- UnicodeEdit: window for inputting text for Find, Scan and Jump

- EditorForm: cataloging form

Columns 2 and 3:

-Range of characters in Unicode set (from-to inclusive). The list is read top-down, and a general catchall range (0000-FFFF) can be defined as the last range in the list.

Column 4:

-Face name of font

Columns 5, 6, 7:

-Attributes (5=bold, 6=italic, 7=underline)

Column 8:

-Font size; note that the font size must be coordinated with the grid in which it is displayed. For example, this parameter should be coordinated with the line height parameter for the cataloging draft window in the catalog.ini.

Column 9:

-Opening mode; this defines the character set within the font, and is related to the fact that one font can contain many character sets. For example, “Courier New Cyr” contains both ISO Latin-2 and Cyrillic characters, and you can define which character set is used with the font. The default is DEFAULT_CHARSET, which is the Windows default for the PC. You can check this through Programs -> Accessories -> Character Map on your PC.

Possible values are:

ANSI_CHARSET

DEFAULT_CHARSET

SYMBOL_CHARSET

SHIFTJIS_CHARSET

HANGEUL_CHARSET

GB2312_CHARSET

CHINESEBIG5_CHARSET

OEM_CHARSET

JOHAB_CHARSET

HEBREW_CHARSET

ARABIC_CHARSET

GREEK_CHARSET

TURKISH_CHARSET

THAI_CHARSET

EASTEUROPE_CHARSET

RUSSIAN_CHARSET

MAC_CHARSET

BALTIC_CHARSET

Example:

EditorTag 0000 FFFF Courier N N N 16 DEFAULT_CHARSET

EditorField 0000 00FF Tahoma N N N 16 DEFAULT_CHARSET

EditorField 0401 045F Tahoma N N N 16 DEFAULT_CHARSET

EditorField 0384 03CE Tahoma N N N 16 DEFAULT_CHARSET

EditorField 05D0 05EA Tahoma N N N 16 DEFAULT_CHARSET

EditorField 0000 FFFF Bitstream Cyberbit N N N 16 DEFAULT_CHARSET

ListBoxCaption 0000 00FF Tahoma N N N 14 DEFAULT_CHARSET

ListBoxCaption 0401 045F Tahoma N N N 14 DEFAULT_CHARSET

ListBoxCaption 0384 03CE Tahoma N N N 14 DEFAULT_CHARSET

ListBoxCaption 05D0 05EA Tahoma N N N 14 DEFAULT_CHARSET

ListBoxCaption 0000 FFFF Bitstream Cyberbit N N N 14 DEFAULT_CHARSET

ListBox## 0000 00FF Tahoma N N N 16 DEFAULT_CHARSET

ListBox## 0401 045F Tahoma N N N 16 DEFAULT_CHARSET

ListBox## 0384 03CE Tahoma N N N 16 DEFAULT_CHARSET

ListBox## 05D0 05EA Tahoma N N N 16 DEFAULT_CHARSET

ListBox## 0000 FFFF Bitstream Cyberbit N N N 16 DEFAULT_CHARSET

UnicodeEdit 0000 00FF Tahoma N N N 16 DEFAULT_CHARSET

UnicodeEdit 0401 045F Tahoma N N N 16 DEFAULT_CHARSET

UnicodeEdit 0384 03CE Tahoma N N N 16 DEFAULT_CHARSET

UnicodeEdit 05D0 05EA Tahoma N N N 16 DEFAULT_CHARSET

UnicodeEdit 0000 FFFF Bitstream Cyberbit N N N 16 DEFAULT_CHARSET

EditorForm 0000 FFFF Courier N N N 16 DEFAULT_CHARSET

Note:

The cataloging EditorForm remains in grid implementation, one character per grid square. Therefore, a proportional font is not suitable.
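The top-down, first-match reading of the font ranges can be sketched as follows. This is a minimal illustration using the EditorField lines of the example above; font_for is a hypothetical helper, not part of ALEPH:

```python
# Each entry: (start, end, face name), read top-down; the first
# matching range wins, so the catchall range is listed last.
EDITOR_FIELD_FONTS = [
    (0x0000, 0x00FF, "Tahoma"),
    (0x0401, 0x045F, "Tahoma"),
    (0x0384, 0x03CE, "Tahoma"),
    (0x05D0, 0x05EA, "Tahoma"),
    (0x0000, 0xFFFF, "Bitstream Cyberbit"),  # catchall
]

def font_for(ch):
    """Return the face name used to display character ch."""
    cp = ord(ch)
    for start, end, face in EDITOR_FIELD_FONTS:
        if start <= cp <= end:  # from-to inclusive
            return face
    return None
```

For example, "A" (U+0041) falls in the first range and is displayed in Tahoma, while a CJK ideograph falls through to the Cyberbit catchall.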

\Catalog\Tab\catalog.ini

The font size definition in catalog.ini defines the character grid for tags, indicators and subfield codes, all of which must be set in a non-proportional font. It also sets the character grid for cataloging forms, as long as these forms remain non-proportional. (This will change in the future.)

FontSizeX=10

FontSizeY=16

6 Comparison of PC settings (pre-14.2 – 14.2)

-Font definitions have been moved from alephcom.ini and catalog.ini to alephcom\tab\font.ini.

-F9 in the client, for setting fonts and colors, is no longer in use. Colors are now set by the Windows setup.

-Alephcom\tab\charset.dat is no longer in use; its functionality is now part of font.ini.

7 Using Unicode in ALEPH - /alephe/unicode tables

Character conversion is required by various aspects of the system. The tab_character_conversion_line table defines which conversion table is to be used for each of these aspects.

7.1 tab_character_conversion_line

This table defines the procedure and table to be used in various instances when character conversion is needed. The character conversion procedure is system-set, but the character conversion table is determined by the library application. The system continues to use “alpha” for fields, and the table is set up taking this into account. Most of the lines in the table translate the data for communication with other systems.

The columns of the table are the following:

  • col. 1: instance

(e.g., LOCATE - translation of data when creating the string for locate)

  • col. 2: environment

(e.g., PC, WWW or “any”)

  • col. 3: alpha code of the line or record, for further refinement of col. 1

(e.g., H, L, R, S, A)

  • col. 4: name of the procedure to run

- line_sb2line_utf (translates a line of data from single-byte characters to UTF-8)

- line_utf2line_sb (translates a line of data from UTF-8 to single-byte characters)

  • col. 5: character conversion table
  • col. 6: backslash notation indicator

This is used for transposition from single byte to UTF and vice versa (sb_to_utf and utf_to_sb) for import/export, in order not to lose data. The notation is a backslash followed by the hexadecimal value of the character.
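The backslash notation principle can be sketched as follows. This is a minimal illustration only; the function name and the four-digit hexadecimal width are assumptions, not the actual ALEPH routine:

```python
# Sketch of the backslash notation: a character that has no mapping in
# the single-byte target set is written as a backslash plus its
# hexadecimal Unicode value, so the data can be restored on re-import.
def utf_to_sb_with_backslash(text, mappable=lambda ch: ord(ch) < 0x100):
    out = []
    for ch in text:
        if mappable(ch):
            out.append(ch)          # representable in the target set
        else:
            out.append("\\%04X" % ord(ch))  # e.g. Hebrew alef -> \05D0
    return "".join(out)
```

For example, utf_to_sb_with_backslash("abc\u05D0") keeps "abc" as-is and writes the Hebrew alef as the sequence \05D0.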

The instances for column 1 are:

RLIN_TO_UTF

Translation of data imported from RLIN (UE_03) to UTF-8

YBP_TO_UTF

Translation of data imported from YBP (p-file-96) to UTF-8

UTF_TO_URL

Translation of the URL link in field 856 from UTF-8 to the standard required for URLs

UTF_TO_WEB_MAIL

Translation of UTF-8 bibliographic data for the MAIL and PRINT options in the WEB OPAC

LOCATE

Translation of data for the locate query; this data can be further translated according to the setup of the particular conf file in /alephe/gate

FILING-KEY-nn

Translation for filing purposes. This is not system-set, but it must be coordinated with the char_conv line of the library’s /tab/tab_filing table. The filing table listed for FILING-KEY is created using UTIL P/3. (See further under Basic Unicode tables: 7. Unicode_to_filing_01_source.)

VENDOR_NAME_KEY

Translation for sorting the Vendor index by name

COURSE_NAME_KEY

Translation for sorting the Course Reading index

ADM_KEYWORD_KEY

Translation for keyword indexing in ADM clients, such as budget, vendor, etc.

BORROWER_NAME_KEY

Translation for sorting the Patron index by name

ACQ_INDEX

Translation for the Acquisitions Order Index

OCLC_TO_UTF

Translation of data imported from OCLC to UTF-8

MARC8_TO_UTF

Translation of MARC-8 data to UTF-8

The following routines are defined for clients such as Z39.50, which work in a single character set environment.

8859_1_TO_UTF

UTF_TO_8859_1

8859_8_TO_UTF

UTF_TO_8859_8

8859_7_TO_UTF

UTF_TO_8859_7

8859_5_TO_UTF

UTF_TO_8859_5

UTF_TO_MARC8

UTF_TO_MAB

MARC8_TO_UTF conversion is different from the above procedures. For this procedure, col. 5 (the character conversion table) should be left blank, since the procedure is set to use the following tables:

marc8_ara_to_unicode

marc8_heb_to_unicode

marc8_eacc_to_unicode

marc8_lat_to_unicode

marc8_greek_to_unicode

marc8_rus_to_unicode

In addition, some of the conversion values are set in the program itself, and not in the tables.

8 MARC8_TO_UTF conversion program specifications

The program takes a string of up to 2000 characters in MARC-8 encoding.

Each record can contain sequences in more than one character set. The start of such a sequence is identified by an escape character (X"1B") plus one or two additional characters that define a specific character set, as follows:

X"1B" + "(B" - Latin character set

X"1B" + "(2" - Hebrew character set

X"1B" + "(3" - Arabic character set

X"1B" + "(N" - Cyrillic character set

X"1B" + "(S" - Greek character set

X"1B" + "$1" - EACC character set

X"1B" + "s" - Latin character set

X"1B" + "g" - Greek symbol set

X"1B" + "b" - Subscript set

X"1B" + "p" - Superscript set

When there is no character set escape sequence, the default set is Latin.
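The escape-sequence scanning described above can be sketched as follows. This is an illustration only; split_by_charset is a hypothetical helper mirroring the escape list above, not the actual conversion program:

```python
# MARC-8 escape sequences: escape char + one- or two-character code.
ESC = "\x1b"
CHARSET_ESCAPES = {
    "(B": "latin", "(2": "hebrew", "(3": "arabic",
    "(N": "cyrillic", "(S": "greek", "$1": "eacc",
    "s": "latin", "g": "greek_symbol", "b": "subscript", "p": "superscript",
}

def split_by_charset(data):
    """Split data into (charset, text) runs; the default set is Latin."""
    runs, charset, buf = [], "latin", []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            # try two-character codes first, then one-character codes
            for length in (2, 1):
                code = data[i + 1:i + 1 + length]
                if code in CHARSET_ESCAPES:
                    if buf:
                        runs.append((charset, "".join(buf)))
                        buf = []
                    charset = CHARSET_ESCAPES[code]
                    i += 1 + length
                    break
            else:
                buf.append(data[i]); i += 1
        else:
            buf.append(data[i]); i += 1
    if buf:
        runs.append((charset, "".join(buf)))
    return runs
```

For example, a record containing abc, an escape to Hebrew, then def splits into a Latin run and a Hebrew run.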

Each sequence is translated to Unicode and then to UTF-8, using a table specific to the character set. The tables are:

marc8_ara_to_unicode

marc8_lat_to_unicode

marc8_eacc_to_unicode

marc8_greek_to_unicode

marc8_rus_to_unicode

marc8_heb_to_unicode

In the Hebrew, Arabic, Cyrillic and Greek character sets, as well as the Greek symbol, Subscript and Superscript sets, each character is translated to one Unicode character (single-byte translation).

The EACC character set translates every three MARC-8 characters to one Unicode character.

The MARC-8 (Latin) character set can contain combining characters (between X"E0" and X"FE", except X"FC" and X"FD"). For these characters, the translation is done on sequences up to the end of the field or subfield, whichever comes first.

In MARC-8, combining characters always precede the character with which they are combined. In Unicode, combining characters always come after the character with which they are combined. Some character sequences that contain a combining character can be translated to a single Unicode character (e.g. “a” with grave accent).

The marc8_lat_to_unicode table defines sequences of combining + base characters that translate to a single Unicode character. The left-hand column of the table is the Unicode character. The right-hand column can include up to 4 characters, which, taken together, are equivalent to one UNICODE character. For example:

01E3 e5b5

When the program finds a combining character, it examines the next character(s) until it finds a non-combining character, within the next 3 characters. This results in a pair, a triplet or a quadruplet of characters. The group is checked against the marc8_lat_to_unicode table, and translated accordingly. If the pair, triplet or quadruplet is not found in the table, the combining characters are transposed after the non-combining character, and each of the characters is individually translated from MARC-8 to Unicode. For example, the MARC-8 input is: X"E0" + X"E1" + X"41". If no such combination is found in the table, the output will be: X"41" + X"E0" + X"E1" (when each character is translated to Unicode, this will become U+0041 U+0309 U+0300).

If no non-combining character is found within the 4-character string, the program continues to look for a non-combining character in the string (until end of field or subfield), and positions the combining characters after a non-combining character, as described, bypassing the marc8_lat_to_unicode table check.

If there is a sequence of combining characters with no following non-combining character, the characters are translated to UNICODE, and left in their original order.
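The table lookup and transposition logic for a single group can be sketched as follows. This is illustrative only: the one-line table mirrors the 01E3 e5b5 example above, and handle_group is a hypothetical helper, not the actual conversion program:

```python
# One line of marc8_lat_to_unicode: combining E5 + base B5 -> U+01E3.
COMBINED = {("\xe5", "\xb5"): "\u01e3"}

def is_combining(b):
    """MARC-8 Latin combining range: E0-FE, except FC and FD."""
    o = ord(b)
    return 0xE0 <= o <= 0xFE and o not in (0xFC, 0xFD)

def handle_group(chars):
    """chars = run of combining characters followed by one base character
    (the caller collects the run using is_combining)."""
    key = tuple(chars)
    if key in COMBINED:           # whole group found in the table
        return [COMBINED[key]]
    # not found: transpose the combiners after the base character;
    # each character is then translated individually to Unicode
    return [chars[-1]] + chars[:-1]
```

With this sketch, the group E5+B5 collapses to the single character U+01E3, while E0+E1+41 (not in the table) becomes 41 followed by the two combiners, matching the example above.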

If combining characters appear before an Escape (denoting character set change):

  • if the Escape is to-Latin, the group of combining characters is dealt with in relation to the first character after the Escape sequence; in other words, the Escape is ignored.
  • if the Escape is to a non-Latin single-byte character set (Hebrew, Greek, etc.), then the first character after the Escape is translated to Unicode, and the combining characters listed before the Escape are transposed (placed after the character) and translated from MARC-8 to Unicode. There is no attempt to translate to a combined character (as is done with Latin, using the marc8_lat_to_unicode table).
  • if the Escape is to the EACC character set, each of the combining characters that preceded this sequence is translated from MARC-8 to UNICODE and the EACC sequence is translated to Unicode. In other words, the combining characters are not transposed.

The treatment of combining half marks (ligature and double tilde) is:

  • input: X"EB" (left ligature) <character> X"EC" (right ligature) <character>
  • output: <character> X"EB" <character> X"EC" (after which each character is translated to Unicode)
  • input: X"FA" (left double tilde) <character> X"FB" (right double tilde) <character>
  • output: <character> X"FA" <character> X"FB" (after which each character is translated to Unicode)

In other words, the combining half marks behave exactly the same as combining characters. Thus, there is no special handling for them.

Whenever single character translation to UNICODE fails as a result of a missing value in a table, this character is translated to X"FFFD" (replacement character, used to replace an incoming character whose value is unknown or unrepresentable in Unicode).

9 Basic Unicode tables (/alephe/unicode)

Some of the following tables are used directly by the system, without referring to the alephe/unicode/tab_character_conversion_line table. These tables must retain the names listed here. Other tables are defined in alephe/unicode/tab_character_conversion_line, which determines which table to use in each instance where conversion is required. For these tables, the naming scheme listed here does not have to be adhered to. However, for easier maintenance of the system, it is recommended that libraries retain the Ex Libris naming conventions.

In all the tables, the Unicode value is in the left-hand column, in hexadecimal notation.

  1. unicode_case

This three-column table lists the Unicode character in the left-hand column, the corresponding uppercase character in the middle column, and the corresponding lowercase character in the right-hand column. This table is used by the utf_change_case procedure.
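The lookup performed against such a table can be sketched as follows. The sample rows are hypothetical (the actual unicode_case table covers the full character set), and the helper functions are illustrations, not the ALEPH procedure:

```python
# Three columns per line: character, uppercase, lowercase,
# all as hexadecimal Unicode values.
CASE_TABLE = """
0041 0041 0061
0061 0041 0061
05D0 05D0 05D0
"""

def load_case_table(text):
    table = {}
    for line in text.splitlines():
        if line.strip():
            char, upper, lower = (chr(int(v, 16)) for v in line.split())
            table[char] = (upper, lower)
    return table

def to_upper(s, table):
    # characters missing from the table are left unchanged
    return "".join(table.get(ch, (ch, ch))[0] for ch in s)
```

Note that a caseless character (such as the Hebrew alef in the hypothetical third row) simply maps to itself in both columns.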