Formalised Representation of Collocations in a Danish Computational Lexicon
Anna Braasch & Sussi Olsen
In this connection, lexicographers are concerned with the following
In the following, we use the term 'dictionary' for published lexical data collections for humans, and the term 'lexicon' for computational lexical data collections for machine use, e.g. for natural language processing, henceforth NLP.
In NLP, non-compositionality is a crucial but so far little-studied problem for the lexicon. Generally, NLP systems are based on linguistic rules and regular patterns that describe the predictable and systematic behaviour of language; supplementary non-predictable behaviours and arbitrary choices are treated as exceptions to these rules. Linguistic information represented in a lexicon must be very detailed, unambiguous, explicit, exhaustive and formalised. Therefore, e.g. for machine translation, the lexicographer has to consider some additional questions originating from the specific requirements of computational applications. For example, word sense disambiguation is clearly crucial, especially to machine translation and to ontology-based information extraction. To this end, information on the variety and frequency of co-occurrences is essential. In order to discover which words co-occur and how frequent a collocation is, it is important to work both with existing dictionaries and with large corpora, as discussed in [FONTENELLE 1992a].
2. Approach and method
Well aware that a considerable number of lexical units in a text are recurring bound word combinations, we regard encoding such word combinations, including collocations, in the lexicon as one of the most important tasks in order to extend the lexical and linguistic coverage. With the exception of subcategorisation information in the form of valency patterns (cf. section 3), these have until now not been incorporated into the STO lexicon.
Our investigation into recurrent bound word combinations is based on two Danish corpora. The first and largest one comprises 20 million tokens from newspaper texts; the second one is a corpus of 4 million tokens from newspapers, magazines and books. Neither corpus is part-of-speech tagged or lemmatised; the processing of corpus evidence therefore involves several manually controlled steps, e.g. the manual partitioning of concordances into subsets based on part-of-speech information. Extension of the available corpora as well as tagging of the corpora is in progress. We use the XKWIC corpus tool [CHRIST 1993] for the corpus investigations. We are aware that our main corpus is not well balanced and that the predominance of newspaper text might result in biased concordances.
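As a rough illustration of why untagged and unlemmatised material calls for manual partitioning, the following minimal Python sketch builds a keyword-in-context listing for a few inflected forms of tage. It is not the XKWIC tool; the function name kwic, the (non-exhaustive) list of word forms and the sample sentences are our own assumptions for the purpose of the example.

```python
import re

# Illustrative, non-exhaustive list of inflected forms of 'tage' (assumption).
TAGE_FORMS = {"tage", "tager", "tog", "taget", "tages", "tag"}

def kwic(text, forms, width=3):
    """Return (left context, keyword, right context) triples for each token in `forms`."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok in forms:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text; the second 'tog' is the noun 'train', not a verb form.
sample = "Hun tager stilling til forslaget. Han tog et tog til byen."
hits = kwic(sample, TAGE_FORMS)
for left, key, right in hits:
    print(f"{left:>25} | {key} | {right}")
```

A pure string match returns the noun tog 'train' alongside the verb form tog 'took', which is exactly the kind of hit that has to be sorted out by hand as long as the corpus carries no part-of-speech information.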
In order to verify our findings, we also wanted to trace the candidate collocations identified in our corpora in published dictionaries. To this end we chose three dictionaries of various types. Unfortunately, no collocational dictionary as such exists for Danish, unlike the situation for English. The only dictionary of a related type is Dansk Sprogbrug ('Danish Usage'), henceforth DS, which is a 'Dictionary of style and constructions'. We used the 1st edition (1976); a 2nd, only slightly modified, edition was published in 1995. The second monolingual dictionary used is Nudansk Ordbog, NDO [Politiken 1999], a medium-size dictionary of current Danish. Being the only one on the market, it is used both by native speakers and by language learners. The third one is the largest Danish-English dictionary, Dansk-Engelsk Ordbog, henceforth D-E [Vinterberg/Bodelsen 1991]. The last two dictionaries are also published on CD-ROM. These dictionaries were created for rather different purposes, but all of them rely on the native language competence of the users. Although all of them incorporate a number of collocations, none of them provides thorough and systematic descriptions that would be sufficient for language learners to avoid defective or inappropriate production of collocations in Danish.
This is not the right place to discuss this topic in more detail, therefore we just mention a few general observations that are valid for all dictionaries
This illustrates the lack of and need for a quantitatively and qualitatively reliable dictionary of collocations - although we also found a large amount of valuable information in the electronically searchable dictionaries, which make related information given in different places easily accessible.
3. Collocations vs. valency structures from an NLP point of view
A preliminary definition of a bound word combination was formulated as follows: a frequently co-occurring word combination of two or more components showing a certain degree of structural and meaning cohesion. We deal especially with word combinations having a more specific cohesion than valency constructions. The term 'valency' is used in our approach not only for the number of arguments for which a particular verb subcategorises, but more generally for subcategorisation requirements of verbs, nouns and adjectives.
A distinction between syntactically subcategorised or valency structures (described by patterns) and lexical collocations is not always clear cut, because subcategorisation is often further specified with respect to the selection of a particular co-occurring lexical item. A valency structure contains one content word (verb, noun or adjective) and a governed grammatical structure (such as a prepositional phrase, infinitive, finite or non-finite clause). Collocations consist of two main immediate constituents that are (groups of) content words (nouns, verbs, adjectives and adverbs). The nominal constituent typically carries the meaning and is the semantically fixed part ('base'), whereas the verbal one has a weakened meaning ('collocate') which modifies the semantic aspect of the collocation and provides it with properties of verbal inflection.
In the following, we will focus on verbal collocations consisting of the verb 'tage' - that can combine with a wide selection of words, leaving us with very heterogeneous material - and a noun phrase, which in some cases is preceded by a preposition. We outline a method to describe the heterogeneous constraints to which these collocation types are subject. The selection of a representative number of collocations was based on an exhaustive analysis of the concordances extracted from the two corpora.
At the current stage of the STO lexicon, the most important properties to be analysed are constraints on syntactic and lexical variability, which basically differentiate bound word combinations from free combinations. In this respect, it is also important to discern collocations consisting of a verb and a prepositional phrase from valency instances of a verb, having particularly strong subcategorisation and selectional restrictions. The examples below illustrate that collocations in (1) and (2) look very similar to instances of structures subcategorised for in (3) and (4) on the surface:
(1) tage til genmæle
The head noun of the PP in (1) does not allow for any modification at all, neither determination nor the insertion of an attributive, and the only modification allowed for in (2) is an adverbial modification of the verb. The valency-bound PPs in (3) and (4) allow for all kinds of modifications of the head noun.
4. The PAROLE-model of lexical description
Thus, this model does not operate with a pre-defined lexical unit similar to that in paper dictionaries. However, a 'dictionary entry' containing the lemma with all represented morphological and syntactic (and semantic) information can be compiled from the relevant units of the three layers of description. This description method has the advantage of not being static with regard to a presentation of the lexical item together with all related information in a single dictionary entry. In printed dictionaries, information is only linearly accessible beginning from the top of the entry. Decisions regarding the point of representation of fixed expressions and collocations in the structure of the lexicon (as lemmas or sub-lemmas) are therefore in our context not of primary theoretical relevance, in contrast to the questions discussed in [MOON 1992] and [HEER HENRIKSEN 1995].
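The idea of compiling a 'dictionary entry' on demand from separately stored units can be sketched as follows. This is only a toy illustration under our own assumptions: the identifiers, the attribute names and the direction of the links between the layers are hypothetical and are not taken from the PAROLE specifications.

```python
# All identifiers and attribute names below are hypothetical.
morphological_units = {
    "mu_tage": {"lemma": "tage", "pos": "verb"},
}
syntactic_units = {
    "synu_tage_1": {"morph": "mu_tage", "description": "verb with object NP"},
}
semantic_units = {
    "semu_tage_1": {"synt": "synu_tage_1", "gloss": "to take"},
}

def compile_entry(lemma):
    """Collect the units of all three layers that are linked to `lemma`."""
    morph_ids = [uid for uid, unit in morphological_units.items()
                 if unit["lemma"] == lemma]
    synt_ids = [uid for uid, unit in syntactic_units.items()
                if unit["morph"] in morph_ids]
    sem_ids = [uid for uid, unit in semantic_units.items()
               if unit["synt"] in synt_ids]
    return {"lemma": lemma, "morphology": morph_ids,
            "syntax": synt_ids, "semantics": sem_ids}

entry = compile_entry("tage")
print(entry)
```

Because the entry is assembled by following links rather than read from a fixed record, the same units can be presented from any starting point, which is the non-static property referred to above.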
4.1 Perspectives for the use of the PAROLE lexicons
5. Towards a formalised representation of collocations in the PAROLE
In the following section, we give a number of simplified examples in order to illustrate the pattern construction procedure. The linguistic properties described in these examples are recognised for each of the selected search words in a large number of corpus occurrences. One of the frequent Danish verbs, tage 'to take', has roughly 29,000 instances in its various inflected forms, of which the eight most frequent collocations make up a total of approximately 8,000 occurrences, including the collocation tage ansvar 'take/shoulder the responsibility' with 3,128 occurrences. However, we are aware that such findings have rather limited value because of the size and the too homogeneous composition of the corpus.
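To put these counts into proportion, the small Python calculation below derives the relative shares from the figures quoted above. The variable names are ours, and the input figures are the approximate ones given in the text, so the resulting percentages are approximate as well.

```python
# Approximate counts quoted in the text.
tage_instances = 29_000   # all inflected forms of 'tage'
top8_total = 8_000        # the eight most frequent collocations together
tage_ansvar = 3_128       # 'tage ansvar' alone

share_top8 = top8_total / tage_instances
share_ansvar = tage_ansvar / tage_instances
share_within_top8 = tage_ansvar / top8_total

print(f"top eight collocations / all instances: {share_top8:.1%}")         # 27.6%
print(f"tage ansvar / all instances:            {share_ansvar:.1%}")       # 10.8%
print(f"tage ansvar / top eight collocations:   {share_within_top8:.1%}")  # 39.1%
```

On these figures, roughly a quarter of all instances of tage fall within the eight most frequent collocations, and tage ansvar alone accounts for about one in ten instances.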
5.1 Selected constraint types
Vcoll subtype   Number and Definiteness of Object   Collocation example in canonical form
5.1.2 Passive transformation of the collocation
Vcoll subtype   Passivisation of the collocation as a whole   Collocation example in canonical form
                Ø                                             tage stilling <til [ngt]>
                                                              'make up one's mind about [sth]'
                                                              (lit.: take attitude to sth)
Modification of the base noun by an attributively used adjective is either not possible (marking no_a) or semantically restricted (marking r_a), depending on the particular Vcoll subtype. We expect to find some instances of collocations that allow for unrestricted or at least very flexible sets of modifying adjectives, but the material studied so far shows no instances of such behaviour.
The marking r_a means that the insertion of adjectives is semantically highly restricted to a finite set of intensifying lexical items, e.g. særlig 'special', stor 'big', afgørende 'decisive'.
Vcoll subtype     Adjective insertion   Collocation example in canonical form
                  r_a                   tage stilling <til [ngt]>
                                        'make up one's mind about [sth]'
                                        (lit.: take attitude to sth)
V+[NP(obj)]+PP    no_a                  tage [ngt] i §brug
                                        'put [sth] into service'
                                        (lit.: take [sth] into use)
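The two markings for adjective insertion can be read as a simple check on a lexicon record, as in the following minimal Python sketch. The class Vcoll, its field names and the lemma-based lookup are our own assumptions and not part of the STO encoding; the set of intensifying adjectives contains only the three examples mentioned above and is presumably incomplete.

```python
from dataclasses import dataclass
from typing import FrozenSet

# The three intensifying adjectives cited in the text; assumed to be a partial set.
R_A_ADJECTIVES = frozenset({"særlig", "stor", "afgørende"})

@dataclass(frozen=True)
class Vcoll:
    """Hypothetical record for a verbal collocation and its adjective constraint."""
    canonical: str                                  # canonical form as in the table
    adj_marking: str                                # "no_a" or "r_a"
    allowed_adjectives: FrozenSet[str] = frozenset()

    def allows_adjective(self, adjective: str) -> bool:
        """Check whether an adjective lemma may modify the base noun."""
        if self.adj_marking == "no_a":
            return False
        if self.adj_marking == "r_a":
            return adjective in self.allowed_adjectives
        raise ValueError(f"unknown marking: {self.adj_marking}")

tage_stilling = Vcoll("tage stilling <til [ngt]>", "r_a", R_A_ADJECTIVES)
tage_i_brug = Vcoll("tage [ngt] i §brug", "no_a")

print(tage_stilling.allows_adjective("afgørende"))  # True
print(tage_stilling.allows_adjective("rød"))        # False
print(tage_i_brug.allows_adjective("stor"))         # False
```

The check operates on adjective lemmas; matching inflected adjective forms in running text would require the morphological layer as well.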
5.2 Formalised description
but allows for the grammatical, impersonal sentence
tage æren for [ngt]
but allows for well-formed sentences, like
Further properties that can be subject to constraints on the collocation are e.g. topicalisation, making sentences like the following ungrammatical
and also pronominalisation of the object which prevents the following kind of sentences
Another aspect, which we have not dealt with here, is the choice of preposition following the noun in many Vcoll subtypes. The preposition of the collocation is often but not always the preposition which the noun subcategorises for. The noun hensyn 'consideration' normally appears with the complex preposition over for 'towards' but in the collocation tage hensyn til ('show consideration for someone'/ 'take sth into consideration') the preposition has changed. In our approach we simply bind the preposition to the collocational pattern, an approach which is appropriate and practical
We can think of an opposite direction of reuse, that is, an exploitation of the material contained in a lexical database. Although this subject seems to be less frequently discussed, it is obvious that detailed information given in an unambiguous and explicit way could readily be utilised for purposes other than NLP. The STO lexicon will provide a more comprehensive description of collocations than can be found in existing Danish dictionaries.
In our opinion, particularly lexicographers working on learner's dictionaries could take advantage of well-structured information that can be computationally derived from the lexical database. On the other hand, meaning descriptions and other semantic information - which make up an essential part of dictionaries - must be dealt with additionally.
The approach presented brings together results of linguistic analysis, computational methods and application requirements. The general strategy we opted for was firstly to subdivide information on complex linguistic features into many parts in accordance with the layers of description, secondly to formalise the pieces of information in accordance with the descriptive language, and finally to link them coherently together through the layers. This strategy, developed in detail for the encoding of verbal collocations, can be applied to further types of complex lexical items since it is adapted to a conceptual model that allows for complex and structured descriptions. The selection and linguistic analysis of further frequent types of bound word combinations are still outstanding, as is the design of practical encoding routines. Moreover, the quantitative and qualitative impact of the extension methods on the lexicon needs to be verified.
In a wider context, STO is the first national follow-up of the PAROLE project, but other national groups will probably follow. Therefore, it is important to be consistent with the PAROLE model and descriptive methods in order to ensure that the nationally produced lexicons remain compatible with each other. Multilingual linking of the lexicons for NLP applications is a topical and challenging prospect.
Allan, R., P. Holmes, T. Lundskær-Nielsen (1995). Danish, A Comprehensive Grammar, Routledge, London and New York.
Bahns, J. (1996). Kollokationen als lexikographisches Problem, Niemeyer, Tübingen.
Benson, M., E. Benson, R. Ilson (1986). The BBI Combinatory Dictionary of English. A Guide to Word Combinations, Benjamins, Amsterdam, Philadelphia.
Blom, B. (1998). "A statistical and structural approach to extracting collocations likely to be of relevance in relation to an LSP sub-domain text". In Nodalida '98 Proceedings.
Boguraev, B. & T. Briscoe (eds.) (1989). Computational Lexicography for Natural Language Processing, Longman, London and New York.
Boje, F. & A. Braasch (1991). 'Hvad får man skudt i skoene? Flerordsenheder i aktive ordbøger for mennesker og maskiner'. In R. Vatvedt Fjeld (Ed.) Nordiske studier i leksikografi, Oslo.
Braasch, A. (1994). "How far do Printed Dictionaries and MT-Lexicons Share Information?" In Studies in machine translation and natural language processing, Vol.8., Lexical Issues in Machine Translation (eds. Alberto, P. & P. Bennet), EC, Luxembourg.
Braasch, A., A. B. Christensen, S. Olsen & B.S. Pedersen (1998). "A Large-Scale Lexicon for Danish in the Information Society". In Proceedings from First International Conference on Language Resources & Evaluation, Granada.
Calzolari, N., U. Heid, H. Khachadourian, J. McNaught, B. Menon, N. Modiano (1994). EAGLES LEXICON. Report on Architecture.
Christ, O. (1993). The Xkwic User Manual, IMS, Universität Stuttgart.
Cowie, A.P, R. Mackin, I. R. McCaig (1983). Oxford Dictionary of Current Idiomatic English. Vol. 2, Oxford University Press, Oxford.
Cruse, D. A. (1986). Lexical Semantics, Cambridge University Press, Cambridge.
Fontenelle, Thierry. (1992a). "Collocation acquisition from a corpus or from a dictionary: a comparison". In EURALEX '92 Proceedings II, Tampere.
Fontenelle, Thierry. (1992b). "Co-occurrence Knowledge, Support Verbs and Machine Readable Dictionaries". In Papers in Computational Lexicography, COMPLEX '92. Budapest.
Heer Henriksen, Berit (1995). "Korpusbaserede relationsoplysninger og lemmatisering af flerordsforbindelser". In Nordiske studier i leksikografi III. Reykjavik.
Heid, Ulrich. (1998). "Towards a corpus-based dictionary of German noun-verb collocations". In Euralex '98 Proceedings, Université de Liège.
Heyn, Matthias (1992). Zur Wiederverwendung maschinenlesbarer Wörterbücher. Lexicographica Series Maior 45. Niemeyer, Tübingen.
LE-PAROLE (1998). Report on the Syntactic Layer. Internal Report, Erli, Paris.
LE-PAROLE (1998). Danish Lexicon Documentation. Internal report, CST, Copenhagen.
Moon, Rosamund (1992). "Fixed expressions in native-speaker dictionaries". In EURALEX '92 Proceedings, I-II. Tampere.
Navarretta, Costanza (1997). "Encoding Danish Verbs in the PAROLE Model". In R. Mitkov, N. Nicolov & N. Nicolov (Eds.), Proceedings of Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria.
Sinclair, John (1991). Corpus, Concordance, Collocation, Oxford University Press, Oxford.
Dansk Sprogbrug. En stil- og konstruktionsordbog af Erik Bruun (1978). Gyldendal, København.
Politikens Nudansk Ordbog med etymologi (1999). Politikens Forlag A/S, Denmark
Dansk-Engelsk Ordbog, Vinterberg, H., C.A. Bodelsen (1991). Gyldendal,