A Large Scale Lexicon for Danish in the Information Society
Anna Braasch, Sussi Olsen, Bolette S. Pedersen
Structure of the project
Danish is spoken by approx. 5 million people and is thus a "less widely used" language, a fact which increases the need for language engineering technology to support Danish participation in international communication. The production of various language processing tools creates rapidly growing demands for large, reusable, computerised lexical data collections. Because the Danish language community is relatively small, cost-effectiveness is an essential factor in the development of relevant products. This need can only be met by developing reusable multipurpose resources through goal-oriented and coordinated national efforts.
In Denmark, a group of users and developers of language technology products, together with representatives of relevant authorities, have defined a number of areas for national initiatives to promote this field, one of them being the development of a large size reusable lexical resource. CST has been proposed to undertake this task because of our previous experience within the field of computational lexicography. CST also has the responsibility of initiating, organising and coordinating a network of co-operating partners. CST has recently initiated the establishment of a consortium of approx. 10 partners that should form the kernel of the network.
1.2 The STO project seen within a broader context
The STO project has primary relevance to language industry applications. A high quality lexicon component forms a kernel part of products like MT systems, style and grammar checkers, language education software, systems for intelligent information retrieval and automatic abstracting. Also language research, e.g. investigations into computational linguistics, testing of various research methods of descriptive linguistics, statistically oriented lexicology and grammar projects can benefit from the STO lexical data collection. Additionally, the project will hopefully lead to mutual inspiration with dictionary publishing companies.
The formulation of linguistic specifications for STO complies with the relevant national Danish Standard in order to ensure the interchangeability and reusability of the data to be produced. The Danish Standard (Part 1 in print March 1998) contains a general taxonomy of lexical information categories and types. The linguistic information types specified for STO are related to this standard, providing a basis e.g. for comparison with other data collections and external user requirements. They also follow established international recommendations, e.g. the morphosyntactic part of the lexical descriptions is in accordance with the general and the Danish language specific recommendations of EAGLES (1995).
Adaptation and integration of reusable lexical data from internal and external sources is an important piece of work in the STO project. The Danish PAROLE lexicon (PAROLE-DK), an outcome of the PAROLE EU project completed by the end of April 1998, serves as the point of departure. This is the largest and most systematically encoded Danish lexicon for language engineering purposes. It contains approx. 20,000 general language entries that are provided with morphological, morphosyntactic and syntactic information. This internal material, elaborated at CST, makes up a valuable initial capital for a large size lexicon project. STO is planned to contain approx. 45,000 general and specialised language entries, provided with semantic information in addition to the above-mentioned information types. This will result in approx. 100,000 semantic readings.
As regards lexical data from external sources, the LSP parts of the Danish METAL lexicon, developed for machine translation by Southern Business School, are now being integrated into the STO format (discussed in section 3.2).
1.3 Feasibility study
The first introductory action was to perform a feasibility study exploring the basic conditions for implementing the initiative, and thereby to procure the necessary basis for decisions before the start-up of the project. The concluding report of this feasibility study (August 1997) came up with:
guidelines for the composition of the general language vocabulary to be covered (extensions of the PAROLE-DK coverage especially concern lexicalised compounds and multi-word units)
a first proposal for the selection of domains and text types to be covered
draft guidelines for the composition of the LSP (language for specific purposes) vocabulary to be covered
a summing-up of the investigations of various corpus tools wrt. their functionalities, technical requirements, user-friendliness, etc. with a view to suitability for the project
a rough outline of the selection and organisation of linguistic information types.
The report indicates the central points and working areas where we need to build up specific knowledge and appropriate practice. These tasks are subject to further investigations, which are also expected to provide guidelines for composing a well-functioning consortium comprising all the types of professional competence required to carry out the project.
1.4 Pilot project
A pilot project (completed in April 1998) followed up the recommendations of the feasibility study. Although the work mainly concentrated on the formulation of linguistic specifications, we also investigated some important administrative and organisational tasks. We came up with an overview of the field of professional co-operation comprising potential project partners and data suppliers, users and customers. A number of preparatory contacts have also been established, and an outline of various co-operation models has been prepared for implementation within a national network. These models concern tasks like encoding new lexicon entries, getting access to reusable external lexical and corpus material and involving expert assistance. It is important to establish a flexible but powerful network structure as a precondition for the efficiency of the consortium.
1.5 The main project
After the completion of the pilot project and an evaluation of its outcome the main STO project will be launched. The duration of the project will be approx. 6 years. The core task of the main project is to compile the full-size lexicon. We plan to make the lexicon development modular in the sense that we aim at producing self-contained parts (e.g. a module covering a given domain vocabulary or a module containing syntactic information only).
2.1 Basic features of the PAROLE model
Since LE-PAROLE constitutes the point of departure for STO, this section briefly sketches the main features of the PAROLE model (see also Calzolari et al. 1994, Normier & Nossin 1990 and Navarretta 1997). The model was designed to fulfil the needs of a wide range of NLP applications by representing different kinds of information in an integrated and coherent way without commitment to a particular linguistic theory. One of the basic features of the model is its modularity with respect to morphological, syntactic and semantic information, as illustrated in Figure 1:
Figure 1: The PAROLE model
This division into layers, each with its own units, implies that there is no such thing as a lexical unit in the traditional lexicographical sense; instead, each level of representation is described independently, although the levels are coherently connected to one another. In this way, the model makes it possible to distinguish different syntactic behaviours on purely syntactic criteria, independently of whether they share the same meaning. Furthermore, it permits the refinement of one level (e.g. syntax or semantics) without changing the description of the others. Genericity and explicitness are two of the central requirements that this architecture aims to meet.
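The layered organisation can be sketched as a small data structure. The following is a minimal illustration in Python, not the PAROLE or STO database schema: the class names, field names and frame labels are assumptions introduced here, and the glosses are simplified.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SemanticUnit:
    gloss: str  # simplified English gloss (illustrative only)


@dataclass
class SyntacticUnit:
    frame: str  # hypothetical valency label, not an actual PAROLE 'Description'
    semantic_units: List[SemanticUnit] = field(default_factory=list)


@dataclass
class MorphologicalUnit:
    lemma: str
    pos: str
    syntactic_units: List[SyntacticUnit] = field(default_factory=list)


# One morphological unit linked to two independently described syntactic
# units that happen to share the same meaning.
grave = MorphologicalUnit(
    lemma="grave",
    pos="verb",
    syntactic_units=[
        SyntacticUnit(frame="NP", semantic_units=[SemanticUnit("dig")]),
        SyntacticUnit(frame="NP +DIRECTION", semantic_units=[SemanticUnit("dig")]),
    ],
)

# Refining the semantic layer touches neither morphology nor syntax.
frames_before = [su.frame for su in grave.syntactic_units]
grave.syntactic_units[0].semantic_units.append(
    SemanticUnit("hypothetical second reading")
)
frames_after = [su.frame for su in grave.syntactic_units]
```

In this sketch the list of syntactic frames is identical before and after the semantic refinement, which is the kind of independence between levels that the model is meant to provide.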
2.2 Tailoring the PAROLE model for Danish
In STO, we plan to extend the use of the facilities inherent in the PAROLE model e.g. when expanding the linguistic coverage to include also the treatment of collocations. In particular, this is important for the coverage of the LSP vocabulary which usually contains a relatively high percentage of various types of multi-word units. Furthermore, some changes will be made in order to simplify the data model. The reasons for this are firstly, that the PAROLE model has been designed to cope with 12 rather different (Romance, Germanic, Finnish, etc.) languages and therefore it has a number of linguistic descriptive facilities that are not used for encoding Danish entries. Secondly, STO is going to use its own lexicon encoding tools and database system and thus is not restricted by the tools used within the PAROLE project. The database changes include structural as well as logical changes.
The structural changes primarily include a reduction of the number of database entities, i.e. the number of "boxes" into which the database is divided. As mentioned earlier, the PAROLE model is designed to cope with many languages; therefore the PAROLE model includes global as well as local features that have been used differently within PAROLE. Since STO deals with Danish only we can exclude the features that are not used in Danish and thereby render superfluous the distinction between global and local PAROLE features. By this reduction of the number of features, the structure becomes unnecessarily deep for Danish since some of the database entities include none or only very few features. As a result of this we foresee a simplification of the structure by collapsing some of the entities, though maintaining the overall three-layered structure of the PAROLE model as described above.
The logical changes include a reduction of the number of valency frames. The Danish PAROLE model covers more than 500 valency frames for verbs alone (called 'Descriptions' in the PAROLE model (Navarretta 1997)). As the valency frames include valency bound prepositions and particles, their number can be reduced drastically by coding the prepositions and particles somewhere else, i.e. have the valency frames describe the basic valency patterns of the verb/noun/adjective and include the preposition under e.g. each verb/noun/adjective. The general aim of doing this is to simplify the model; in this way,
2.3 Linguistic specifications
The linguistic specifications for STO also originate from the PAROLE-DK specifications, although they are modified in several ways. In particular, two aspects of the dictionary are being reconsidered, namely the level of linguistic detail and lexical generalisations. Apart from these central points, some new aspects which were not part of PAROLE-DK are being considered, namely collocations, multi-word units, word classes that were not completed in PAROLE (adverbs, conjunctions etc.) and, last but not least, the semantic layer of the dictionary, which is only just now being initiated in PAROLE's successor SIMPLE.
As regards the level of linguistic detail, the 'maximalist' approach adopted in PAROLE-DK has been taken over in STO with a few exceptions. On the general assumption that linguistic information can always be disregarded if it is considered unnecessary for a specific application, whereas it is always more complicated to deepen the level of detail at a later stage, a rather nuanced and wide-ranging view of valency has been adopted with regard to verbs, nouns and adjectives (see Christensen et al. 1998).
Another case for reconsideration is phrasal verbs, which are a very frequent phenomenon in Danish, partly because of a relatively small verb vocabulary in which different senses of a verb are often expressed by means of different particles, and partly because Danish is a typical 'satellite-framed' language which generally expresses a large amount of verbal content through particles and prepositions. In PAROLE-DK, all particle constructions are in essence lexicalised as phrasal verbs (with particular 'selfs', see Navarretta 1997). This contrasts with traditional Danish lexica, which normally distinguish between phrasal verbs on the one hand and particle constructions on the other. Phrasal verbs are lexicalised and defined as verbs with a meaning that cannot be predicted on the basis of the meaning of the core verb and the particle in composition, or as verbs that occur more often with a particular particle than without and are therefore candidates for fully lexicalised status. Particle constructions, in contrast, are defined as predictable and are therefore described by means of valency patterns in the 'core' entry (e.g. +DIRECTION). To illustrate this distinction, consider the following examples of a phrasal verb and a particle construction, respectively: dukke op (appear, lit. duck up) and grave (ngt) ned (dig (something) down).
A similar distinction should be introduced in STO since the degree of compositionality is highly relevant in several NLP contexts, for instance machine translation, where a phrasal verb can only rarely be translated compositionally into another language (cf. *'he ducked up at four o'clock'). The distinction is, however, a semantic one and should therefore be made at the semantic level. Seen in this perspective, we opt for a valency approach to all particle constructions at the syntactic level (in contrast to PAROLE-DK), which then opens the way for a lexicalised view of the 'real' phrasal verbs at the semantic level.
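The proposed division of labour between the two levels can be illustrated with a small sketch. It is only an assumption about how such a distinction might be recorded, not the STO encoding: the class names, the frame labels and the boolean `compositional` flag are hypothetical, while the two Danish examples and their glosses are taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntacticReading:
    verb: str
    particle: str
    frame: str  # hypothetical label for the valency pattern


@dataclass(frozen=True)
class SemanticReading:
    syntactic: SyntacticReading
    gloss: str
    compositional: bool  # False marks a 'real' (lexicalised) phrasal verb


def is_lexicalised_phrasal_verb(reading: SemanticReading) -> bool:
    """A reading counts as a phrasal verb on semantic grounds only."""
    return not reading.compositional


# Both constructions receive the same kind of syntactic description ...
dukke_op = SyntacticReading("dukke", "op", "+PARTICLE")
grave_ned = SyntacticReading("grave", "ned", "NP +DIRECTION")

# ... and are only told apart at the semantic level.
readings = [
    SemanticReading(dukke_op, "appear", compositional=False),
    SemanticReading(grave_ned, "dig (something) down", compositional=True),
]

phrasal_verbs = [
    f"{r.syntactic.verb} {r.syntactic.particle}"
    for r in readings
    if is_lexicalised_phrasal_verb(r)
]
```

A machine translation component could, under these assumptions, consult the semantic flag to decide whether a particle construction may be translated part by part or must be looked up as a whole.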
As regards lexical generalisations, we greatly benefit from the internal validation of the PAROLE lexicon. The large resource of computable data of Danish represented in PAROLE-DK has enabled us to reconsider the generalisation aspects of the lexicon on a more empirical basis. As an example, in the case of predicative nouns we now have the empirical (i.e. corpus-based) experience needed to consider establishing a set of alternation rules, in Danish typically illustrated by valency-bound genitives alternating with obliques. In relation to verbs and adjectives, too, we opt for a wider use of generalisation facilities for alternations.
3. Technical issues
In relation to the linguistic tasks mentioned above a few technical issues must be considered. Among these are the establishment of an LSP corpus to be used when encoding LSP vocabulary, conversion of reusable data as well as the development of encoding tools.
3.1 Establishing an LSP corpus
In the feasibility study we arrived at the conclusion that the LSP corpora of STO should primarily consist of texts written by experts for semi-experts or laymen, i.e. text books, instructions, manuals etc. (see Bergenholtz et al. 1994, p. 156). The main reason is that texts written by experts for experts are mostly in English or, if in Danish, use a vocabulary so technical and specific that it falls outside the vocabulary which STO intends to cover. Each corpus is planned to comprise between half a million and 1 million tokens, distributed over several text samples which may be of different sizes. We consider it a relevant text selection criterion that various authors are represented, so as to reduce the influence of any one author's style.
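The two corpus criteria mentioned here, overall size and spread across authors, lend themselves to a simple check. The sketch below is illustrative only: the function name, the report fields and the sample figures are invented, and no threshold for an acceptable author share is implied, since none is stated for the project.

```python
from collections import Counter

MIN_TOKENS = 500_000    # lower bound of the planned corpus size
MAX_TOKENS = 1_000_000  # upper bound of the planned corpus size


def corpus_report(samples):
    """Summarise a corpus given (author, token_count) pairs, one per text sample."""
    per_author = Counter()
    for author, tokens in samples:
        per_author[author] += tokens
    if not per_author:
        raise ValueError("corpus contains no text samples")
    total = sum(per_author.values())
    author, author_tokens = per_author.most_common(1)[0]
    return {
        "total_tokens": total,
        "within_target": MIN_TOKENS <= total <= MAX_TOKENS,
        "largest_author": author,
        "largest_author_share": author_tokens / total,
    }


# Invented sample sizes, for illustration only.
report = corpus_report([("A", 300_000), ("B", 250_000), ("A", 100_000)])
```

A high value of `largest_author_share` would indicate that one author's style dominates the corpus and that further samples by other authors are needed.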
The first domain selected for the LSP corpus is 'computer technology'. The selection of domains for other corpora is still outstanding but two strong candidates are 'environment' and 'economy'.
It has been decided to use the corpus tool XKWIC as the most suitable for our purposes. Installation and testing have been carried out with promising results, though we also found some shortcomings regarding the possibility of seeing exactly which text sample a given excerpt is taken from.
3.2 Conversion of reusable data
The development of a large size computational lexicon is costly and time-consuming work; therefore existing lexical resources will be integrated into the STO database to the greatest possible extent, provided that it is technically and legally feasible. Such resources are e.g. lexical data collections that have not been integrated into the PAROLE lexicon because they cover LSP vocabulary and were therefore outside the scope of the PAROLE project. However, the number of suitable resources for Danish is somewhat limited; thus we have to take each accessible data collection of reasonable size and quality into consideration. Therefore, both machine-readable dictionary material and the lexicon components of machine translation systems are assumed to be suitable resources. These external sources show great diversity with respect to computational properties, lexical coverage, formalism and descriptive language, linguistic approach, depth of lexical description, quality of encoding etc., because they were developed for rather different purposes and systems. As long as they are well-structured and well-documented, it is possible to get an idea of their potential usefulness.
Thus, as a starting point a general approach and some basic preparatory actions had to be defined for conversion of reusable lexical data. Without entering into details here, we state the following fundamental steps:
However, there is still at least one crucial factor to be considered carefully, i.e. the cost/benefit rate. Roughly said, the reuse of existing resources is only cost-efficient provided that the amount of work to be performed (deepening and completion of descriptions, filling gaps, correcting mistakes, etc.) is worthwhile compared to starting from scratch.
The approach described above has been adopted in the reusability investigations of the METAL lexicon, where the first three steps are completed and the fourth is now in progress. It was necessary to systematically remove any unclear points from the STO specifications and to resolve a few contradictions detected in the source material. We feel that our first experiments are encouraging, although much work remains to be done before the METAL lexicon is fully compatible with the specifications of STO, especially because of some differences in information granularity and structure.
3.3 Encoding tools
The STO database is physically going to be located at CST, and the coming partners will be located at their respective sites. This fact gives us at least two possible solutions for the encoding of new words. It can either take place directly in the STO database by means of a distributional database system, i.e. every partner encodes new words into the same central database; or the encoding can be carried out decentrally, i.e. at each partner's site, by collecting all the encoded words and then filling them into the STO database afterwards.
A distributional database could have an encoding tool running on the net. This way the partners can open their usual net-browser and start the STO encoding tool from there. The advantages of this solution are that every coder uses the same tool, i.e. there will be no problems with different releases of the tool, and that integrity problems can be checked and solved on-line. A disadvantage for CST, though, is that we will have to find a way of coping with external users, a situation that we are not used to at present. A distributional database encoding tool is illustrated in Figure 2:
Figure 2: Distributional database encoding
A decentral encoding tool would have all the partners encode at their own platform with the same encoding program. The program collects the encoded words in a dump file to be filled into the STO database later. An advantage of this method is that the encoding can start whenever the encoding tool has been programmed. A disadvantage, though, is that one must keep track of different releases of the encoding program and dump file format and that the integrity checking cannot take place before the words are actually filled into the database, which - obviously - is at a different place and time than the actual coding.
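The decentral workflow, with its two drawbacks of release tracking and deferred integrity checking, can be sketched as follows. This is a hedged illustration rather than a design of the STO tools: the JSON-lines dump layout, the `format_version` header, the function names and the duplicate-entry check are all assumptions, and a dictionary stands in for the central database.

```python
import io
import json

DUMP_FORMAT_VERSION = 2  # hypothetical release number of the dump file format


def write_dump(entries, stream, version=DUMP_FORMAT_VERSION):
    """Partner side: collect encoded words in a dump file (one JSON object per line)."""
    stream.write(json.dumps({"format_version": version}) + "\n")
    for entry in entries:
        stream.write(json.dumps(entry, ensure_ascii=False) + "\n")


def fill_database(database, stream):
    """Central side: fill a dump into the database, checking integrity only now."""
    header = json.loads(stream.readline())
    if header.get("format_version") != DUMP_FORMAT_VERSION:
        raise ValueError("dump file was written by a different release")
    rejected = []
    for line in stream:
        entry = json.loads(line)
        key = (entry["lemma"], entry["pos"])
        if key in database:
            rejected.append(entry)  # conflict is detected long after encoding
        else:
            database[key] = entry
    return rejected


# A partner encodes two words off-line; one already exists centrally.
central = {("grave", "verb"): {"lemma": "grave", "pos": "verb"}}
dump = io.StringIO()
write_dump(
    [{"lemma": "grave", "pos": "verb"}, {"lemma": "dukke", "pos": "verb"}],
    dump,
)
dump.seek(0)
rejected_entries = fill_database(central, dump)
```

The rejected entry only surfaces when the dump is filled in, which illustrates why integrity problems are harder to resolve in the decentral set-up than in the on-line one.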
The development process
The decision whether to encode new words with a distributional or a decentral encoding tool should be seen in relation to the development process of the STO project. If we decide to encode via a distributional database encoding tool, we cannot encode any new words until the STO database has actually been implemented. If we decide to encode decentrally, we can start encoding as soon as the decentral encoding tool has been developed, and then, at a later stage, fill the dump file into the STO database. A third and probably more feasible solution is a combination of the two, i.e. to start off by programming a decentral encoding tool, then design and implement the STO database, and then develop an encoding tool as a net application for a distributional database. In this way, we can start encoding new words more or less immediately, without having to wait until the STO database has been designed and implemented, since filling the words into the STO database will not be an integrated part of the encoding. In this case, though, one must cope with the time delay between encoding and filling and the integrity problems this might cause; however, considering how time-consuming word encoding is, we find that this might be the best solution.
4. Conclusions
The main incentive for the work presented is our awareness of the increased relevance of effective communication and of the national responsibility for being a full member of the Information Society. With the STO project, CST and the consortium contribute to the development of language engineering products for Danish, in parallel with other national initiatives that integrate the results of the PAROLE project into their follow-up projects.
Allan, R., P. Holmes, T. Lundskær-Nielsen (1995). Danish, A Comprehensive Grammar, Routledge, London and New York.
Bergenholtz, H., J. Pedersen, S. Tarp (1994). Basic Issues in LSP Lexicography, in: H. Bergenholtz, A.L. Jakobsen, B. Maegaard, H. Mørk & P. Skyum Nielsen (eds.): Translating LSP Texts, OFT Symposium, Copenhagen Business School.
Calzolari, N., U. Heid, H. Khachadourian, J. McNaught, B. Menon, N. Modiano (1994). EAGLES LEXICON. Report on Architecture.
Christ, O. (1993). The Xkwic User manual, Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.
Christensen, A.B., S. Olsen, B. S. Pedersen (1998). Lingvistiske specifikationer for en stor Sprogteknologisk Ordbog, Technical report, Center for Sprogteknologi.
Christoffersen, E. (1995). Testning af fagsproglige valensrammer i et maskinoversættelsessystem, in: UDOG Rapport no.3, Odense Universitet.
Danish Standard DS 30564. (In press). Collections of Lexical Data - description of data categories and data structure. Part 1: Taxonomy for the classification of information types.
EAGLES (1995). Synopsis and Comparison of Morphosyntactic Phenomena Encoded in Lexicons and Corpora. A Common Proposal and Applications to European Language. Monachini, M. and Calzolari, N. (Coords.) Istituto di Linguistica Computazionale, Pisa.
Paggio, P. (1996). The Treatment of Information Structure in Machine Translation, PhD Thesis, University of Copenhagen.
Navarretta, C. (1997). Encoding Danish Verbs in the PAROLE model, in: R. Mitkov, N. Nicolov & N. Nikolov (eds.) Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria.
Normier, B., M. Nossin (1990). GENELEX Project, EUREKA for Linguistic Engineering. Proceedings of the International Workshop on Electronic Dictionaries, Oiso, Kanagawa, Japan.
Talmy, L. (1985). Lexicalisation Patterns: Semantic Structures in Lexical Forms, in: T. Shopen (ed.) Grammatical Categories and the Lexicon vol. 3, Press Syndicate of the University of Cambridge.
Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S
Tlf: +45 35329090 - Fax: +45 35329089