STO - A Large Computational Lexicon for Danish - Ready for Applications
The Danish lexicon resource for language technology applications is available world-wide for research and commercial purposes through the Evaluations & Language resources Distribution Agency (ELDA), www.elda.org.
The Centre for Language Technology (CST) is in charge of a national project developing a large-scale Danish lexicon for HLT and NLP applications. The short name of the project is STO, which stands for "SprogTeknologisk Ordbase" (Lexical Database for Language Technology). The project is funded by the Danish Ministry for Information Technology and Research for a period of three years (2001-03).
The objective of the STO project is to develop a comprehensive, generic lexical database from which various dedicated lexicon modules can be derived and adapted to particular applications.
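As a rough illustration of what deriving a dedicated module from a generic database can mean, the following minimal sketch filters a small generic entry list by description layer and part of speech. The entry layout, the field names, the function `derive_module` and all labels such as "V-trans" are assumptions made for this example; they are not the STO data model or its encoding.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical entry layout for illustration; not the actual STO data model.
@dataclass(frozen=True)
class LexicalEntry:
    lemma: str
    pos: str
    morphology: Optional[str] = None  # invented inflection pattern label
    syntax: Optional[str] = None      # invented syntactic frame label
    semantics: Optional[str] = None   # invented semantic type label

LAYERS = ("morphology", "syntax", "semantics")

def derive_module(entries, layers, pos=None):
    """Keep entries that carry every requested layer (and match `pos`,
    if given), reduced to lemma, pos and the requested layers."""
    unknown = set(layers) - set(LAYERS)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    module = []
    for entry in entries:
        if pos is not None and entry.pos != pos:
            continue
        if any(getattr(entry, layer) is None for layer in layers):
            continue
        record = {"lemma": entry.lemma, "pos": entry.pos}
        record.update({layer: getattr(entry, layer) for layer in layers})
        module.append(record)
    return module

# A toy "generic" lexicon; all labels are made up for the example.
GENERIC = [
    LexicalEntry("hus", "noun", morphology="N-neuter-1", syntax="N-0", semantics="Building"),
    LexicalEntry("bil", "noun", morphology="N-common-1"),
    LexicalEntry("spise", "verb", morphology="V-1", syntax="V-trans"),
]

# A module for an application that needs verbs with syntactic descriptions.
verb_module = derive_module(GENERIC, ["morphology", "syntax"], pos="verb")
```

An application that only needs inflectional information would request just the `morphology` layer and receive all three entries, whereas one that needs semantic readings would receive only the entries already encoded at that layer.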
In order to fulfil the project objectives of data production with a fixed scheme, we defined three areas of work: the organisational, the computational and the linguistic. In the following we give an account of the most important finished and ongoing tasks of each area.
As project manager, CST is responsible for the proper performance of the work processes. Currently, the lexicon encoding is carried out by sixteen project members (mainly part-timers) employed by four different institutes. In order to run the project effectively, the work processes must be carefully co-ordinated.
The goal is to populate the lexicon with at least 50,000 lemmas. Current figures: 40,000 entries with morphological descriptions, 24,000 entries also provided with syntactic descriptions, and 10,000 related semantic readings originating from the SIMPLE project.
Some features of the linguistic aspect
In order to enhance the applicability of the lexicon we have implemented, among other things, the linguistic description of noun compounding, which is the most productive means of word formation in Danish. In the next step derivational information will be added. Both contribute to a dynamic exploitation of the data.
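To make the idea of a dynamically exploitable compounding description concrete, here is a minimal sketch that forms Danish noun compounds from a first element, a linking element and a head. The table `LINKING_ELEMENT` and the function `compound` are hypothetical and do not reflect how STO encodes compounding. Assigning a single linking element to each first element is a simplification, since a real description has to handle nouns that take different linking elements in different compounds.

```python
# Hypothetical table: linking element used when the noun is the first
# part of a compound. Illustrative values only, one linker per noun.
LINKING_ELEMENT = {
    "hund": "e",  # hund + e + hus  -> hundehus
    "land": "s",  # land + s + by   -> landsby
    "bil": "",    # bil + fabrik    -> bilfabrik
}

def compound(modifier, head):
    """Join modifier and head using the modifier's linking element."""
    if modifier not in LINKING_ELEMENT:
        raise KeyError(f"no compounding information for {modifier!r}")
    return modifier + LINKING_ELEMENT[modifier] + head

examples = [compound("hund", "hus"), compound("land", "by"), compound("bil", "fabrik")]
```

With information of this kind stored on the entry for the first element, compounds need not be listed exhaustively in the lexicon but can be generated or analysed on demand.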
The area also incorporates research work into particular linguistic tasks where there is no systematic and exhaustive description accessible that could be implemented in a straightforward way.
It is worth noting that the STO lexicon complies with the recommendations for classification of information types laid down by the Danish Standard for lexical data collections. This is an essential prerequisite for a broad applicability of the lexicon.
Although the STO project focuses on the monolingual information content and data structure, we are also aware of the need for a Danish lexicon that can be integrated into multilingual lexical resources. To this end, the lexical data produced are kept compatible with the PAROLE descriptive language and, especially as regards the semantic layer, we remain attentive to structures produced within follow-up projects such as SIMPLE.
Braasch, Anna & Sussi Olsen (2000):
A. Braasch, A. Buhr Christensen, S. Olsen, B. Pedersen (1998):
The model for STO.
Online user interface (in Danish)
Anna Braasch anna @ cst.dk
Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S
Tel: +45 35329090 - Fax: +45 35329089