Compositional vs. Frozen Sequences

Jorge Baptista

University of Algarve, Portugal

LabEL- IST-UTL

1. Introduction

Compound words and frozen expressions constitute a major part of the lexicon of any language. Their definition is not easy, and conceptual and terminological discussions abound in the literature. Traditionally, compound words are defined on semantic grounds using the criterion of non-compositionality: a multiword expression is a compound word when its global meaning cannot be calculated from the meanings that its individual elements have when they are used separately in the language. It is also sometimes noted, as a secondary point, that compound expressions often present some formal, syntactic (or combinatorial) constraints.

In fact, in many cases, compound words are semantically ‘opaque’: you may be surprised to learn that a dog-collar is to be used by people (priests), and that a dogfight may involve fighter planes and no dogs at all. It is clear that in some of these word combinations the form dog is remotely, but still somehow, related to the general meaning of the word dog (the animal). However, in other combinations that relation, although it might be historically explainable, has been completely lost, and a new lexical entry has been formed.

In many cases, compound words are only ‘half opaque’: even if you know that a dogfish is a fish, you may not know precisely what kind of fish it is; even if you know that a fish knife may be a kind of knife used to chop fish, you may not be able to describe it unless you already know its shape beforehand. When you speak of a radioactive element’s half-life, this means the period of time (life) it takes to lose half of its radioactivity.

Notice also that spelling rules – i.e. the orthographical agglutination of two words, or the use of a hyphen as opposed to a blank separator – are just writing conventions (you use neither blanks nor hyphens when you speak!), and cannot be considered an infallible guide for determining whether a word combination is a compound or not. At most, orthography consecrates writing habits, as the variation attested in dictionaries can easily confirm (fish knife/fish-knife; fish finger/fish-finger).

There are many compound words that are (or at least seem to be) semantically ‘transparent’: a heavy element is indeed heavy, but in a physical and not in an ordinary sense. When archaeologists talk about some date before present, one has to know that present was conventionally defined as 1950. In many ordinary expressions, like time adverbs, the choice of determiners and prepositions and the resulting meaning are completely unpredictable:

at noon but not *at morning

in the evening but not *on the evening

in the morning but not *in morning

by morning but also by the morning

(please refer to Prof. Machonis’ lecture on the distinction between idioms of encoding and of decoding).

Defining compound words based solely on their non-compositional meaning presupposes that it is possible to identify clearly the meaning of individual, isolated words. It seems like common sense that people usually know what a word means. However, as M. Gross has often shown, the meaning of a word is inextricably related to the word’s syntax, i.e. the words it co-occurs with. The only way of determining the meaning of a given word is to insert it in several different sentences and, by carefully controlling formal changes in those sentences, to look for changes (or invariance) in meaning.

In fact, I expect that most of you may have disagreed on which of the previous examples were to be considered ‘transparent’, ‘half-transparent’ or even ‘opaque’ word combinations. Intuitions about meaning are almost always vague and too imprecise to be used in a reproducible way. We would therefore rather use syntactic, formal criteria to identify compounds, so that, in these examples, we may say that the words are ‘frozen’ together, even if the meaning of the adverbs is relatively ‘transparent’.

By ‘frozen’ we mean that two or more elements of the expression do not show any distributional variation. If we consider the set of time-related nouns (dawn, morning, afternoon, sunset, evening, night), only some of these nouns can appear with a given preposition or with some determiners and modifiers. This blocking of distributional variation cannot be predicted, and the acceptable combinations have to be included in the lexicon; they should therefore be treated as compound lexical units.
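The idea that unpredictable combinations must be listed whole can be illustrated with a minimal sketch. The set below contains only the time adverbs accepted in the examples above; the names FROZEN_TIME_ADVERBS and is_listed are hypothetical and chosen for illustration, and the absence of an entry only means that the toy lexicon does not list it.

```python
# Frozen time adverbs attested in the examples above. Each entry is stored
# whole, as a single lexical unit, because the choice of preposition and
# determiner cannot be derived by a general rule.
FROZEN_TIME_ADVERBS = {
    "at noon",
    "in the evening",
    "in the morning",
    "by morning",
    "by the morning",
}


def is_listed(expression: str) -> bool:
    """Return True if the expression is listed as a unit in the toy lexicon."""
    # Normalise case and whitespace before the lookup.
    normalised = " ".join(expression.lower().split())
    return normalised in FROZEN_TIME_ADVERBS


for candidate in ["at noon", "at morning", "in the evening", "on the evening", "in morning"]:
    print(candidate, "->", "listed" if is_listed(candidate) else "not listed")
```

A lookup of this kind says nothing about why at morning is excluded; it simply records that the combination is not part of the lexicon, which is the point made above.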

Every part of speech shows both simple and compound words. For example, word combinations such as the man in the street could very well be regarded as an indefinite pronoun (similar to everyone):

Politicians always cared about the opinion of the man in the street

Usually, many compound prepositions and conjunctions have already been included in current dictionaries:

John stopped in the middle of the street

John came to Paris by way of Madrid

John came to Paris in spite of my warnings against it

John came to Paris because of my warnings

There are some (productive?) rules to produce compound adjectives:

-like : to be life-like, Algol-like languages

-proof : to be (bullet + water + …) -proof

Other compound adjectives are frozen in purely combinatorial ways:

John is (sick and tired + *tired and sick) of saying that

Moreover, in English, verb+particle combinations forming phrasal verbs can be considered a special case of compound verbs (as these were already presented in much detail by Prof. Machonis, we will not discuss them further):

John ran (for a mile)

John ran away (to Brazil)

The batteries are running down

John ran into Mary

John ran off to Brazil

John ran off with a book

John’s lecture ran on

The printer ran out of paper

The truck ran over the dog

John ran through the entire proceeding

Some compound words can be described in a regular way, by means of finite-state transducers, as is the case, for example, for the (potentially infinite) set of compound numerals:

twenty-one,

one hundred and twenty-one,

twenty-one thousand two hundred and twenty-one
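As a rough illustration of this regularity, the sketch below builds a regular expression (equivalent to a finite-state recognizer) for English numerals below one million. It is only the recognizing half of a transducer, it assumes the hyphen and "and" conventions used in the three examples above, and the names UNITS, TEENS, TENS, NUMERAL and is_compound_numeral are hypothetical.

```python
import re

UNITS = "one|two|three|four|five|six|seven|eight|nine"
TEENS = ("ten|eleven|twelve|thirteen|fourteen|fifteen|"
         "sixteen|seventeen|eighteen|nineteen")
TENS = "twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety"

# 1..99: "twenty", "twenty-one", "eleven", "seven"
BELOW_100 = rf"(?:(?:{TENS})(?:-(?:{UNITS}))?|{TEENS}|{UNITS})"
# 100..999: "one hundred", "one hundred and twenty-one"
HUNDREDS = rf"(?:{UNITS}) hundred(?: and {BELOW_100})?"
BELOW_1000 = rf"(?:{HUNDREDS}|{BELOW_100})"
# Up to 999,999: an optional "thousand" block followed by an optional remainder.
NUMERAL = re.compile(
    rf"(?:{BELOW_1000} thousand(?: {HUNDREDS}| and {BELOW_100})?|{BELOW_1000})"
)


def is_compound_numeral(text: str) -> bool:
    """Return True if the whole string is a numeral accepted by the pattern."""
    return NUMERAL.fullmatch(text) is not None


for example in ["twenty-one",
                "one hundred and twenty-one",
                "twenty-one thousand two hundred and twenty-one",
                "twenty-eleven"]:
    print(example, "->", is_compound_numeral(example))
```

The pattern is bounded at 999,999 to keep the sketch short; adding further blocks (million, billion, and so on) follows the same scheme, which is what makes a finite-state description of the open-ended set practical.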

The number of compound words in a text, particularly in scientific and technical texts, is usually very high. They constitute meaning units that must be identified as a block and not as a string of simple words. This becomes even more crucial if one considers that the majority of compounds have an unpredictable overall meaning that cannot be directly calculated from the meaning of their internal elements.
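Identifying compounds "as a block" can be sketched with a greedy longest-match lookup against a compound dictionary. This is only one simple strategy, assumed here for illustration; the toy lexicon reuses compounds mentioned in this text, the names COMPOUNDS and segment are hypothetical, and the sketch does not resolve strings that are ambiguous between a compound and a free combination.

```python
# Toy compound lexicon: each entry is a tuple of lower-case tokens.
COMPOUNDS = {
    ("square", "root"),
    ("hot", "dog"),
    ("in", "spite", "of"),
    ("because", "of"),
    ("by", "way", "of"),
}
MAX_LEN = max(len(entry) for entry in COMPOUNDS)


def segment(tokens):
    """Group tokens into lexical units, preferring the longest listed compound."""
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in COMPOUNDS:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts here: keep the simple word.
            units.append(tokens[i])
            i += 1
    return units


print(segment("john came to paris in spite of my warnings".split()))
print(segment("john calculated the square root of 9".split()))
```

With this lexicon, in spite of and square root each come out as a single unit, while every other token is kept as a simple word.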

In this lecture, we will focus on syntactic properties that can be used to identify compounds. Since compounds make up a major part of the lexicon of many languages, retrieving them and describing them in dictionaries is not a trivial task, especially if these dictionaries are meant to be used in natural language processing.

While many statistical methods have been put in place to retrieve compound (or multiword) lexical units from texts, it remains the linguist’s task to validate those word combinations as compound lexical units and to build the dictionaries for them. In order to do this, linguists have to rely on syntactic properties, which is only possible once the general syntactic rules of the language are known. It is only then that linguists can find out the combinatorial constraints that multiword expressions impose on those rules.

This presentation is structured in two parts: first we will present some of the major syntactical properties distinguishing compound nouns from ordinary noun phrases, and in the second part we will give some examples of how the same methodology can be applied to the identification of compound adverbs.

1. Compound nouns.

Compound nouns are probably the best-known case of compounding, and they constitute the largest of all compound word classes. There is a linguistic reason for this: in every domain (scientific, technical, political, etc.) there is a constant need to coin new denominations for new objects, tools, concepts, products and so on, and nouns are the most natural part of speech (POS) to accommodate such new designations.

Available lists of compound nouns show that these are formed by sequences of grammatical categories identical to those appearing in ordinary (i.e. not frozen) noun phrases (see G. Gross et al. 1986 for a comprehensive typology of French compound nouns):

a nice dog (a dog) / a hot dog (a sandwich)

a square table (a table) / a square root (a mathematical function)

Adam’s orange (an orange) / Adam’s apple (a part of the body)

In view of this formal identity, defining compound words becomes a matter of stating the differences between compounds and free word combinations, especially in the case of non-opaque compounds. As we shall see, this distinction is not as clear-cut as dictionaries and grammars might sometimes lead one to believe. This presentation will show some of the basic syntactic properties that can help distinguish compounds from free word combinations.

It is clear from what has been said before that we have moved away from the strict framework of traditional grammar studies, which place compounding as part of Morphology. In the Lexicon-grammar approach, compounds are described with the very same tools used to describe the syntax of noun phrases.

In order to identify a compound as such, it is therefore necessary to check whether that particular word combination shows any constraints on the combinatorial properties that one would expect to find in a noun phrase (NP) formed by the same internal POS sequence (G. Gross 1988, 1989). This amounts to describing the grammar of noun phrases and then comparing those syntactic properties with the properties of any word combination that is a candidate for the status of compound word. To make things clearer in this presentation, our examples here will consist of already well-known compound nouns. By analogy, the same methodology can be extended to other, more complex word combinations.

Let’s take the examples square table / square root. In a free NP with the internal structure Adjective + Noun (AN), where the adjective is a free modifier of the noun, the predicative function of the adjective over the noun can be made explicit by a paraphrase with a relative clause containing the auxiliary verb be:

a square table : a table that is square

This is not the case with the compound square root:

a square root : *a root that is square

and the same holds for many other compound nouns, in which we say that the adjective loses its predicativity. Also, in a free NP the adjective can be further modified by an adverb, which is not possible in the compound:

a square table : a perfectly square table

a square table : a table that is perfectly square

a square root : *a perfectly square root

a square root : *a root that is perfectly square
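The tests just illustrated can be generated mechanically for any AN candidate, leaving the acceptability judgment to the linguist. The sketch below only builds the three paraphrase frames used above; the function name predicativity_tests and its parameters are hypothetical, and the determiner is passed in as is (no a/an adjustment is attempted).

```python
def predicativity_tests(adjective, noun, det="a", adverb="perfectly"):
    """Build the paraphrases used to test whether the adjective is predicative."""
    return [
        f"{det} {noun} that is {adjective}",           # relative clause with be
        f"{det} {adverb} {adjective} {noun}",          # adverb modifying the adjective
        f"{det} {noun} that is {adverb} {adjective}",  # both tests combined
    ]


for adjective, noun in [("square", "table"), ("square", "root")]:
    print(f"{adjective} {noun}:")
    for paraphrase in predicativity_tests(adjective, noun):
        print("   ", paraphrase)
```

The output for square table reproduces the acceptable paraphrases above, while the same frames applied to square root yield exactly the starred forms; the code generates candidates and does not decide acceptability.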

When the AN combination is free, both the adjective and the noun can vary, provided that basic distributional constraints are respected. Therefore, table can be replaced by other nouns:

a square (table + door + carpet + …)

in the same way as square can be replaced by other distributionally similar adjectives:

a (square + oval + triangular + oblong + …) table

However, when an AN combination forms a compound noun, distributional variation is blocked:

a square (root + *twig + *branch + …)

a (square + *oval + *triangular + *oblong + …) root
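The substitution test can be sketched as a small procedure over a table of acceptability judgments. The judgments below are copied from the examples above (the starred forms are those rejected for the compound reading); the names JUDGMENTS, accepted_variants and is_frozen are hypothetical, and treating a pair as frozen only when no substitution at all is accepted is a simplifying assumption.

```python
# Acceptability judgments copied from the examples above (True = acceptable).
JUDGMENTS = {
    ("square", "table"): True, ("square", "door"): True, ("square", "carpet"): True,
    ("oval", "table"): True, ("triangular", "table"): True, ("oblong", "table"): True,
    ("square", "root"): True, ("square", "twig"): False, ("square", "branch"): False,
    ("oval", "root"): False, ("triangular", "root"): False, ("oblong", "root"): False,
}


def accepted_variants(adjective, noun, adjectives, nouns, judgments=JUDGMENTS):
    """Return the adjective and noun substitutions judged acceptable for an AN pair."""
    adj_variants = [a for a in adjectives
                    if a != adjective and judgments.get((a, noun), False)]
    noun_variants = [n for n in nouns
                     if n != noun and judgments.get((adjective, n), False)]
    return adj_variants, noun_variants


def is_frozen(adjective, noun, adjectives, nouns, judgments=JUDGMENTS):
    """An AN pair counts as frozen here when no substitution at all is accepted."""
    adj_variants, noun_variants = accepted_variants(
        adjective, noun, adjectives, nouns, judgments)
    return not adj_variants and not noun_variants


SHAPES = ["square", "oval", "triangular", "oblong"]
print(is_frozen("square", "table", SHAPES, ["table", "door", "carpet"]))
print(is_frozen("square", "root", SHAPES, ["root", "twig", "branch"]))
```

With these judgments the first call prints False (square table varies freely) and the second prints True (square root admits no substitution). The judgments themselves still have to come from a linguist; the code only organizes them.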

In some cases, the same string is ambiguous. For example, round table can be analyzed either as a free combination or as a compound noun. In this case, only the syntactic environment, i.e. the remaining words it appears with in the sentence, may help to disambiguate it:

I have bought a round table for my dining room (= a table)

I have attended a round table on Chinese syntax (= an event)

While in the free AN combination the noun table denotes a concrete object that can be purchased, the compound noun designates an event, which makes it possible for it to appear as a complement of to attend. Although many compound nouns may be ambiguous with free word combinations, they are usually much less ambiguous than simple words.

Usually, in a free NP, adjectives are just optional modifiers of the noun. They can be deleted without changing the overall meaning of the NP (nor the meaning of the sentence where the NP is inserted):

John bought a (E + square) table

However, with some abstract nouns that express predicates and are hence called predicative nouns (M. Gross 1981; see below), the presence of a modifier is often obligatory (Meunier 1981; Giry-Schneider 1995; Laporte 1997):

He had an immense esteem for tradition (Henry James, Portrait of a Lady)

*He had esteem for tradition

*He had an esteem for tradition

When the adjective is not a mere modifier of the noun, it usually cannot be deleted, for it is the AN combination that forms a compound lexical unit. This is particularly clear with semantically opaque compound nouns:

John attended a round table on Chinese Syntax

*John attended a table on Chinese Syntax

John calculated the square root of 9

*John calculated the root of 9

But in some compounds, even if the adjective is frozen with the noun, it can be deleted. For example, most of the time people calculate square roots, so that in some languages – Portuguese, for instance – the adjective equivalent to square can, unless otherwise stated, be zeroed without any loss of information: