EVA’99-Moscow

E. Kolmanovskaia

Yandex.Ru – search and research engine

Elena Kolmanovskaia

Yandex project manager

Phone: (095) 785-25-25

E-mail:

Internet: www.yandex.ru

www.comptek.ru

Yandex.Ru – Russian Web Search Engine

Yandex.Ru is a unique product for indexing Russian-language resources (sites) on the Web, comparable to AltaVista, Excite and similar engines. Its search area is the "Russian" Internet: the 'su' and 'ru' domains, the domains of other former USSR countries (e.g. 'ua', 'kz'), and Web sites in other domains that contain Russian text of any kind. The Russian Web now consists of about 35 thousand servers holding more than 60 GB of text. The approximate number of online users is 1.5 million. The Russian Web has two main languages, Russian and English. It is growing quickly: a year ago there were fewer than 5 thousand servers.

Yandex.Ru includes a web spider, an HTML parser and an indexing module. CompTek developed all the algorithms except the Porter algorithm for English morphology, and wrote all the software.

Yandex was first announced as a full-text morphological retrieval product line on 18 October 1996. The name stands for "yet another indexer" in English transcription, or "language indexer" in Russian ("Ya" is the last letter of the Russian alphabet and the first letter of the Russian word for "language" [yazyk, yazykovyi]). We usually write it with the Russian first letter followed by "ndex" in Latin letters, Яndex, to underline the local meaning (and pride) of the product. Yandex.Ru was opened for public access on 23 September 1997.

Russian Web search has an additional problem unknown to English sites: the peaceful coexistence of several Russian charsets. The most widespread are Windows-1251 and UNIX KOI8-R, followed by ISO-8859-5, Alt-866 and Macintosh. Some sites are clever enough to present the same information in the requested charset, and some are not. For example, an AltaVista search for Russian words gives two different results in Windows and KOI. A Russian Web search engine must therefore understand all the charsets, recognize whether they represent the same information (site) or not, and be able to show the results to the user in the corresponding charset. Yandex.Ru can do this and more: it can determine the uniqueness of documents with respect not only to charsets but also to mirrors.
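The charset problem can be illustrated with a minimal Python sketch. It assumes the charset of each copy is already known (charset detection is not shown), and the normalization and hashing steps are illustrative choices, not a description of the Yandex.Ru implementation. The codec names are the standard Python names for the five charsets listed above.

```python
import hashlib

# Standard Python codec names for Windows-1251, KOI8-R, ISO-8859-5,
# Alt-866 and Macintosh Cyrillic.
CHARSETS = ["cp1251", "koi8_r", "iso8859_5", "cp866", "mac_cyrillic"]

def fingerprint(raw, charset):
    """Decode bytes with a known charset and hash the normalized text."""
    text = raw.decode(charset)
    normalized = " ".join(text.split()).lower()  # collapse whitespace, ignore case
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# The same page served in five charsets: the bytes differ, the text does not.
page = "Поиск по русскому Интернету"
copies = {charset: page.encode(charset) for charset in CHARSETS}
fingerprints = {fingerprint(raw, charset) for charset, raw in copies.items()}

print(len(fingerprints))  # one fingerprint means one unique document
```

Comparing fingerprints of decoded text, rather than raw bytes, is what lets copies in different charsets be recognized as the same document.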

Yandex.Ru statistics as of today (September 1999):

-  41 635 indexed Web-servers

-  10 949 302 indexed pages

-  more than 99.12 GB of indexed information (index database less than 25 GB)

-  more than 25'000 unique IP addresses every day

-  more than 150'000 unique IP addresses every week day

Yandex kernel

All products with the Yandex prefix share the same Yandex kernel. They differ only in the application built on top of it (the external interface).

The Yandex-kernel features are:

-  Russian morphology module (90,000-word vocabulary; correct treatment of unknown and new words; based on one of the world's best linguistic schools, Melchuk/Apresian; morphological analysis and synthesis; learning vocabulary) plus English morphology

-  indexing module (index size is 35% of the text size, i.e. very small, which is very important for huge texts; stores the full address of each word, including the token number; highlighting of found words; indexing speed of 2 MB/min on a PC; very fast retrieval)

-  parsing tools (SGML-like text mark-up language, external text mark-up language)

-  rich query language (Boolean operators, distances between words and paragraphs, text zones)

-  high-quality sorting algorithm for query results (very important for huge texts and heavy queries)

-  natural language queries, search for similar documents

All Russian and English words are normalized both at indexing and at search time. Not only words are indexed but also numbers and marks (mixtures of letters and digits). Natural language queries simplify the use of the search engine. The simplest way to query the Yandex search engine is to write exactly what you need in the query field.
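The combination of normalization, token addresses and distance operators can be sketched as a small positional index in Python. This is a toy under stated assumptions: the `LEMMAS` table is a hypothetical stand-in for a real morphology module, and the class and method names are invented for illustration; they are not the Yandex kernel API.

```python
import re
from collections import defaultdict

# Hypothetical lemma table standing in for a morphology module.
LEMMAS = {"engines": "engine", "searching": "search", "searched": "search"}

def normalize(token):
    """Reduce a token to its normal form (lowercase, then table lookup)."""
    token = token.lower()
    return LEMMAS.get(token, token)

def tokenize(text):
    """Words, numbers and letter-digit mixtures such as 'mp3' are all tokens."""
    return re.findall(r"\w+", text)

class PositionalIndex:
    def __init__(self):
        # normal form -> list of (document id, token number)
        self.postings = defaultdict(list)

    def add(self, doc_id, text):
        for position, token in enumerate(tokenize(text)):
            self.postings[normalize(token)].append((doc_id, position))

    def search(self, word):
        """Documents containing any form of the word."""
        return sorted({doc for doc, _ in self.postings.get(normalize(word), [])})

    def near(self, first, second, max_distance):
        """Documents where the two words occur within max_distance tokens."""
        hits = set()
        for doc_a, pos_a in self.postings.get(normalize(first), []):
            for doc_b, pos_b in self.postings.get(normalize(second), []):
                if doc_a == doc_b and abs(pos_a - pos_b) <= max_distance:
                    hits.add(doc_a)
        return sorted(hits)

index = PositionalIndex()
index.add(1, "Searching Russian search engines")
index.add(2, "MP3 archive 1999")
print(index.search("engine"), index.search("mp3"), index.near("russian", "engine", 2))
```

Because the same `normalize` function runs at indexing and at search time, a query for one word form finds documents containing any other form known to the table.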

Yandex product line

Yandex.Site - a tool for indexing and searching a user's own Web site

No matter how well you have organized your Web server, real life is usually more complicated than the scheme. As your site grows, visitors need to go deeper and deeper, and clicking again and again can annoy them. Yandex.Site can deliver information from any level in a couple of clicks: the visitor only has to ask and look at the result.

Yandex.Site is easy to configure for your server: the administrator specifies which directories must be indexed and which must not, and which file types must be excluded. An additional feature is catalog search: any directories can be logically united into one catalog item, and Yandex.Site can search independently within one or several items. The same directory can be included in different items. It is also possible to create different catalog sets.

The user's site can be reindexed as often as necessary. Once a day (at night) is usually enough, but a news site can be reindexed every hour or even more often. Indexing does not stop searching; the two processes are transparent to each other. The only exception is a couple of seconds when a new index base becomes available. During these seconds search queries wait in line, which is usually unnoticeable.
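One common way to get this behaviour is to build the new index separately and swap it in under a short lock. The Python sketch below illustrates that general pattern only; the `SearchService` class and `build_index` helper are hypothetical names, and nothing here is claimed to be how Yandex.Site does it.

```python
import threading

def build_index(pages):
    """Build a simple word -> sorted list of URLs index (can take a long time)."""
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(url)
    return {word: sorted(urls) for word, urls in index.items()}

class SearchService:
    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def search(self, word):
        # Queries only wait while another thread holds the lock for the swap.
        with self._lock:
            index = self._index
        return index.get(word.lower(), [])

    def publish(self, new_index):
        # The swap itself is brief; the slow build happened outside the lock.
        with self._lock:
            self._index = new_index

service = SearchService(build_index({"/a": "old news"}))
before = service.search("news")
service.publish(build_index({"/a": "old news", "/b": "fresh news"}))
after = service.search("news")
print(before, after)
```

Searches keep using the old index while the new one is being built, so the only contention is the moment of the swap.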

A special version of Yandex.Site was created for ISPs: it supports several virtual servers on the same computer (or local network). From the ISP's point of view it is a single Yandex.Site program; from each host owner's point of view there is an independent Yandex.Site for every host.

Yandex.CD - a tool for searching static texts

Yandex.CD is quite similar to Yandex.Site. The main difference is that Yandex.CD does not need a Web server: the search part can be installed on any computer with 32-bit Windows and an Internet browser (IE or Netscape 3.0 or higher). The idea is that the texts do not change, so they need to be indexed only once. The index database is supplied together with the texts. This product is used in CD editions.

Yandex.Lib – a fully functional Yandex library.

Yandex.Lib is a stand-alone module and library, ready to be built into third-party retrieval systems. It includes three groups of functions: indexing, search and highlighting. Yandex.Lib can work with several databases simultaneously.

Yandex.Dict - a Russian morphology module only (without indexing).

Yandex.Dict is also a library to be built into third-party products (usually with pre-indexed texts). As an example of Yandex.Dict we show an extension to Digital's AltaVista search engine. Just imagine: a simple query "new Russians" ("новый русский") in all Russian forms looks like:

(((новый | нов | новейший) ~ русский) | ((нового | новейшего) ~ русского) | ((новому | новейшему) ~ русскому) | ((новым | новейшим) ~ русским) | ((новом | новейшем) ~ русском) | ((новые | новы | новейшие) ~ русские) | ((новых | новейших) ~ русских) | ((новыми | новейшими) ~ русскими))
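The structure of the expanded query above can be reproduced with a short Python sketch. The form groups are copied by hand from that query; in a real extension they would come from the morphology module, which is not modelled here. The function names are illustrative assumptions.

```python
# Forms of "новый" and "русский" that agree with each other,
# copied by hand from the expanded query above.
FORM_GROUPS = [
    (["новый", "нов", "новейший"], ["русский"]),
    (["нового", "новейшего"], ["русского"]),
    (["новому", "новейшему"], ["русскому"]),
    (["новым", "новейшим"], ["русским"]),
    (["новом", "новейшем"], ["русском"]),
    (["новые", "новы", "новейшие"], ["русские"]),
    (["новых", "новейших"], ["русских"]),
    (["новыми", "новейшими"], ["русскими"]),
]

def alternatives(forms):
    """Join alternative forms with '|'; a single form needs no brackets."""
    if len(forms) == 1:
        return forms[0]
    return "(" + " | ".join(forms) + ")"

def expand(groups):
    """Build a query in which '~' links the agreeing forms of the two words."""
    parts = ["(" + alternatives(first) + " ~ " + alternatives(second) + ")"
             for first, second in groups]
    return "(" + " | ".join(parts) + ")"

query = expand(FORM_GROUPS)
print(query)
```

Grouping the forms by agreement keeps the query from matching ungrammatical pairs such as "нового русским".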

Yandex.Ru – Russian Web ReSearch Engine

The search engine makes it possible to research the Russian Internet: both its content and its users.

What kind of information can you find on the Russian Web? According to the sites in the Yandex.Ru base (data for the beginning of 1999):

·  Business and marketing (including advertising and public relations) – about 35%

·  Self-expression (home pages) – 13.5%

·  Internet-life (download programs, Internet projects, on-line libraries etc.) – 11.8%

·  Science, medicine (schools, universities etc.) – 10.2%

·  Culture (theatres, museums) – 9.5%

·  Mass-media (newspapers, magazines, radio, TV) – 6.7%

·  Adult (sex) – 2%

·  Services (mails, trading, soft delivery) – 1.3%

·  Officials – 1%

Who lives on the Russian Internet? The pioneers are, as usual, hi-tech companies, followed by consulting and advertising firms. They quickly realized that an Internet presentation is much less expensive than one in the mass media. Travel agencies and hotels, real estate firms, and vendors of cars and various devices have already learned to use the Internet as a powerful weapon for finding more clients. Internet users now represent about three percent of the Russian population, but they are its most active and educated (at least technically) part, and mainly middle class. Off-line research by Gallup and Comcon confirms this impression.

Yandex.Ru also studies queries. For example, we found that a week before the August crisis the words "bank" and "currency rate" grew sharply in queries and overtook the usual top 5, such as "Moscow", "sex", "porno", "Russia" and "referat". We have now begun to study queries systematically. We invented the NINI index (Internet Users' Interest Inconstancy). The index consists of its value, the 5 words that grew most in queries during the last week in comparison with the previous one, and the 5 words that fell most. These ten words represent the change in users' interests. The study area can be restricted to any word set, for example politics (we also publish a polit-NINI) or trade marks.
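The word lists of such an index can be sketched in Python as follows. The text does not say how growth is measured or how the index value is computed, so this sketch assumes absolute change in weekly query counts and leaves the value out; the query data is invented for illustration.

```python
from collections import Counter

def risers_and_fallers(previous_week, last_week, top_n=5):
    """Return the words whose query counts grew most and fell most.

    Assumption: change is the absolute difference in counts between weeks.
    """
    previous = Counter(previous_week)
    last = Counter(last_week)
    change = {word: last[word] - previous[word]
              for word in set(previous) | set(last)}
    # Ties are broken alphabetically so the result is deterministic.
    by_growth = sorted(change, key=lambda word: (-change[word], word))
    by_fall = sorted(change, key=lambda word: (change[word], word))
    risers = [word for word in by_growth if change[word] > 0][:top_n]
    fallers = [word for word in by_fall if change[word] < 0][:top_n]
    return risers, fallers

# Invented sample data: one list entry per query containing the word.
previous_week = ["moscow"] * 5 + ["sex"] * 4 + ["bank"] * 1
last_week = ["moscow"] * 3 + ["sex"] * 4 + ["bank"] * 6 + ["currency rate"] * 4

risers, fallers = risers_and_fallers(previous_week, last_week)
print(risers, fallers)
```

Restricting the study area to a word set, as with polit-NINI, would amount to filtering both lists before counting.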

Yandex.Ru is an Internet product in common use, so a lot of Internet users come there. It is not only a place for advertising but also a place to make inquiries. We asked people which information source they trust. The answers were:

Internet 35.99%

TV 16.99%

Newspapers and magazines 10.34%

Rumours 1.50%

Give no credence to anybody 35.18%

We can also find, for any word, its most frequent neighbours in queries. For example, the word “art” (“искусство”) is usually asked together with the following words:

·  battle

·  museum, figurative

·  applied

·  contemporary

·  decorative

·  history, love
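A neighbour list like the one above can be produced by counting which words share a query with the target word. The Python sketch below assumes whitespace tokenization without morphological normalization, and the sample queries are invented; it is an illustration, not the Yandex.Ru procedure.

```python
from collections import Counter

def neighbours(queries, word, top_n=5):
    """Count the words that appear in the same query as the given word."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        if word in words:
            # Each neighbour is counted once per query.
            counts.update(w for w in set(words) if w != word)
    # Sort by frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

# Invented sample queries.
queries = [
    "contemporary art",
    "art museum",
    "museum of contemporary art",
    "contemporary art moscow",
    "moscow hotels",
]

top = neighbours(queries, "art", top_n=2)
print(top)
```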

Culture in Internet

I was asked to say a few words about «Culture» resources on the Russian Internet. To investigate this I examined thematic catalogues. @Rus (formerly «Au», www.atrus.ru/rus/) lists 2917 resources in its «Culture and art» section, the Rambler counter (counter.rambler.ru/top100/) lists 2576, and List.Ru (www.list.ru) lists 3744. This corresponds to the same share (9-10%) that I quoted above from Yandex.Ru data.

What are the main culture resources? The content most characteristic of the Internet comes first: texts (Moshkov's text collection has already existed for 5 years), images (photos, pictures) and music (mp3 format). All these resources are collected mainly by independent individuals. Then come organisations (theatres, museums, libraries, artists' unions), and then publications and information (posters, encyclopedias).

Here is the list of the best resources from the three catalogues.

@Rus, composite criterion: elite league + popularity

·  Библиотека Максима Мошкова http://lib.ru/

·  Литература http://www.litera.ru/

·  Центр современного искусства Сороса www.sccamoscow.ru/

·  Государственный академический Большой театр России http://www.bolshoi.ru/

·  Союз архитекторов России http://www.uar.ru/

·  Gumilevica: гипотезы, теории, мировоззрение http://kulichki.rambler.ru/~gumilev

·  Госфильмофонд http://www.aha.ru/~filmfond

·  Государственный Эрмитаж http://www.hermitage.ru/

·  Государственная Третьяковская галерея http://www.tretyakov.ru/

·  Государственный музей изобразительных искусств им. А. С. Пушкина http://www.museum.ru/gmii

·  Музей Рериха в Нью-Йорке http://www.roerich.org/ru/home_ru.html

·  Культура - информационное агентство http://www.guelman.ru/culture

·  Кирилл и Мефодий - досуг http://www.km.ru/

Here are the positions of «culture sites» in the Rambler counter. They represent to some extent «users' demand» but also the «folk character» of the catalogue.

58  Music phone www.cdru.com (Музыка)

65 Referat.Ru - сервер для студентов и школьников (Образование)

74 Библиотека Максима Мошкова (lib.ru) (Литература)

80 Full Albums in MP3 (Музыка)

95 MP3 European & American Charts. Full Albums MP3. (Музыка)

100 Cyber Archive of Mp3z, Gamez, Appz (Музыка)

102 Музыка! Гитара! Блюз! Система запроросов! ЖМИ! (Музыка)

Here are the List.Ru data, sorted by Yandex citation index.

CI, the Citation Index, is a common measure of the significance of a scientific work or person. The index value is the number of references to this work (or name) made by other scientists in their works.

As applied to the WWW, Citation Yandex is a measure of the publicity of a Web page or Web site among the creators of other Web resources, i.e. among “writers”. This is the main difference between CY and counters such as Rambler Top100, Top List and Count.ru, which measure publicity among “readers”. CY, Citation Yandex, is the number of Internet resources that contain links to the given resource, as measured from Yandex data.

1194 Библиотека Мошкова www.lib.ru

1077 Все музеи России www.museum.ru

653  Music.Ru www.music.ru

647  Музыкальная Шкатулка www.cdru.com

520 http://www.mtv.com www.mtv.com

456  Гос.Эрмитаж www.hermitage.ru

413 Современное искусство в сети www.guelman.ru
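A count of this kind can be sketched in Python from a list of links. The sketch assumes that a "resource" is a host name, that a leading "www." is ignored, and that links within the same host do not count; these are illustrative assumptions, since the text does not define them. The `.example` hosts are invented.

```python
from urllib.parse import urlparse

def host(url):
    """Return the lowercase host of a URL, without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

def citation_index(links):
    """Map each target host to the number of distinct other hosts linking to it.

    links: iterable of (source_url, target_url) pairs.
    """
    citing = {}
    for source_url, target_url in links:
        source, target = host(source_url), host(target_url)
        if source and target and source != target:  # skip self-links
            citing.setdefault(target, set()).add(source)
    return {target: len(sources) for target, sources in citing.items()}

# Invented link data; only the target hosts come from the list above.
links = [
    ("http://a.example/page1", "http://www.lib.ru/"),
    ("http://a.example/page2", "http://lib.ru/x"),
    ("http://b.example/", "http://www.lib.ru/"),
    ("http://lib.ru/a", "http://lib.ru/b"),
    ("http://b.example/", "http://www.museum.ru/"),
]

scores = citation_index(links)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

Counting distinct linking hosts rather than individual links means that many links from one site count only once.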

Elena S. Kolmanovskaia

Elena Kolmanovskaia, Yandex project manager, graduated from the Moscow Institute of Oil and Gas. She received an MS degree in Applied Mathematics in 1987 and has conducted considerable research in data analysis and structure simulation at the All-Russian Scientific Oil Geological Research Institute. For two years she worked in the USA as chief programmer at East Coast Sheet Metal Corporation. Since 1996 she has headed the Yandex team and project (full-text retrieval systems that take Russian morphology into account). She is also the author of "Tales of Russian Internet", published on the Russian Internet search engine Yandex.Ru.
