
EURALEX 2002

Pre-EURALEX tutorial

Title: Sketching words. Organized by Adam Kilgarriff, Brighton, U.K.

Goal

That participants acquire the technical understanding, motivation and confidence to prepare word sketches for their lexicography project.

Corpus lexicography background

In the pre-computer age, a corpus comprised filing cabinets full of index cards, each with a citation. These are valuable resources, but focus on the unusual rather than the usual, say very little about common words, and give little indication of the norms of a word's behaviour.

Since COBUILD inaugurated the age of computerised corpora for lexicography in the early 1980s, the Key Word In Context (KWIC) concordance has been the corpus lexicographer's main tool. Various well-designed interfaces, with flexible sorting criteria and support for sophisticated searches, have been developed (e.g. WordSmith, Xkwic).

The limitations of KWIC become apparent for large corpora and common words: there is too much data.

Church and Hanks (1989) introduced the Mutual Information statistic to summarise the data for cases where there was too much to read in full, so that high-salience patterns were brought to the lexicographer's attention. This statistic, and others proposed since, have improved access to the information implicit in the corpus. However, the statistics have not been ideal, and the lists have thrown together items taking different grammatical roles in relation to the word being investigated. Thus the same list will include a verb's subjects, objects, modifiers, and other material which has frequently been found in the vicinity of the keyword but does not stand in any linguistically significant relation to it.
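As a rough illustration of the kind of statistic involved, mutual information for a word pair can be written as log2(P(x,y) / (P(x) P(y))) and estimated from raw corpus counts. The sketch below assumes that co-occurrence counts have already been collected (for example within a fixed window around the keyword); the function name and parameters are illustrative and are not taken from the tutorial.

```python
import math


def mutual_information(pair_count, x_count, y_count, corpus_size):
    """Estimate log2( P(x,y) / (P(x) * P(y)) ) from raw counts.

    pair_count  -- times x and y were observed together (assumed pre-counted)
    x_count     -- occurrences of word x in the corpus
    y_count     -- occurrences of word y in the corpus
    corpus_size -- total number of tokens in the corpus
    """
    p_xy = pair_count / corpus_size
    p_x = x_count / corpus_size
    p_y = y_count / corpus_size
    return math.log2(p_xy / (p_x * p_y))
```

A score of zero means the two words co-occur about as often as chance would predict; higher scores mark pairs that co-occur more often than chance.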

Word sketches are a response to these shortcomings. They provide different lists for each grammatical relation (and use an improved statistic for lexicographic salience).

Corpus requirements

Availability of an appropriate corpus is the greatest limitation on the development of word sketches. As for other lexicographic uses, the corpus needs to be very large, and balanced: it needs to reflect, in appropriate proportions, the language types that the dictionary aims to describe.

The following issues will be discussed with reference to the British National Corpus (BNC):

- number of instances of a word required for a useful word sketch
- Zipf's Law and the corpus size required for n useful word sketches
- frequency list distortions and specialist text types
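The link between Zipf's Law and corpus size can be roughed out with a back-of-the-envelope calculation. The sketch below assumes an idealised Zipf distribution with exponent 1 over a fixed vocabulary, so that the word of rank r is expected to occur N / (r * H_V) times in a corpus of N tokens, where H_V is the V-th harmonic number. The vocabulary size and the minimum number of instances per word are assumptions to be supplied by the user; real corpora, including the BNC, deviate from this idealisation.

```python
import math


def harmonic(vocab_size):
    """Return the harmonic number H_V = 1 + 1/2 + ... + 1/V."""
    return sum(1.0 / rank for rank in range(1, vocab_size + 1))


def min_corpus_size(n_words, min_instances, vocab_size):
    """Smallest corpus size N such that, under idealised Zipf (exponent 1),
    the n_words-th most frequent word is expected to occur at least
    min_instances times:  N / (n_words * H_V) >= min_instances.
    """
    return math.ceil(min_instances * n_words * harmonic(vocab_size))


# Example call with purely illustrative values:
# min_corpus_size(n_words=10_000, min_instances=500, vocab_size=100_000)
```

Under these assumptions the required corpus size grows linearly with both the number of words to be covered and the number of instances needed per word.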

NLP technology

NLP = Natural Language Processing, also known as Computational Linguistics, Language Engineering or Human Language Technology (HLT).

NLP provides various tools for enriching text: turning implicit linguistic information within a text into explicit data, which can be used by the computer. The tools of interest for corpus lexicography are, in order of increasing sophistication, tokenisers; sentence-splitters; lemmatisers; part-of-speech taggers; and parsers. Each will be defined and described.

Regular expressions

The regular expression formalism provides a straightforward way to implement some or even all of these technologies. Regular expressions will be introduced.

This part of the tutorial is aimed at people who are not familiar with regular expressions. It will cover the ground rapidly, pointing people to web pages for fuller accounts, and will not take more than 30 minutes.
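To give a flavour of what regular expressions can do for the simpler tools, here is a minimal sketch of a tokeniser and a naive sentence-splitter using Python's standard `re` module. The patterns and function names are illustrative assumptions, not the ones used in the tutorial, and they are deliberately simplistic: the sentence-splitter, for instance, will wrongly split after abbreviations such as "Mr." when a capitalised word follows.

```python
import re

# A token is a run of word characters (optionally joined by internal
# hyphens or apostrophes), or any single non-space punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

# A sentence boundary is whitespace that follows ., ! or ? and
# precedes an upper-case ASCII letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def tokenise(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def split_sentences(text):
    """Split text into sentences at naive punctuation boundaries."""
    return SENTENCE_BOUNDARY.split(text.strip())
```

For example, `tokenise("Don't panic!")` keeps "Don't" as one token and separates the exclamation mark from "panic".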

Choosing and implementing grammatical relations

In Brighton, a set of 27 grammatical relations has been used. The goal is that they cover the significant grammatical and collocational behaviour of English nouns, verbs and adjectives. We shall describe in detail how they have been implemented, and invite discussion on whether the same strategies would work for other languages.
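One way a grammatical relation might be identified with regular expressions is by matching patterns over part-of-speech-tagged text. The sketch below is a hypothetical illustration only, not the Brighton implementation: it assumes input as space-separated "lemma/TAG" tokens with a toy tag set (DET, ADJ, NOUN, VERB, PUNCT), and the relation name "object" is chosen for the example.

```python
import re

# Hypothetical pattern: a verb, an optional determiner, any number of
# adjectives, then a noun taken to be the verb's object.
OBJECT_PATTERN = re.compile(
    r"(?<!\S)(?P<verb>[^\s/]+)/VERB"     # the verb lemma
    r"(?:\s+[^\s/]+/DET)?"               # optional determiner
    r"(?:\s+[^\s/]+/ADJ)*"               # zero or more adjectives
    r"\s+(?P<noun>[^\s/]+)/NOUN(?!\S)"   # the noun lemma
)


def object_triples(tagged_sentence):
    """Return (relation, verb, noun) triples found in a tagged sentence."""
    return [
        ("object", match.group("verb"), match.group("noun"))
        for match in OBJECT_PATTERN.finditer(tagged_sentence)
    ]
```

Collecting such triples across a whole corpus gives, for each word, a separate list of collocates per grammatical relation. A pattern this simple will miss many cases (passives, intervening adverbs, noun compounds) and is only meant to show the general approach.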

Sorting: Salience statistics

We shall describe the different statistics which have been used for sorting collocations by lexicographic salience, and present the statistic we use.

Building word sketches

Once all the grammatical relations in the corpus have been identified, many options remain regarding how the information should be presented to a lexicographer. In Brighton we have developed two kinds of word sketches. These will be displayed, and various design choices will be discussed, including the ordering of lists, the presentation of statistics, links to the corpus, and the salience of the grammatical relation vs. the salience of the collocation.

Word sketches for English can be seen here.

adam@itri.bton.ac.uk

Date, place, practicalities

- Date: 13 August 2002. Sessions: 09:30 – 12:30 and 13:15 – 16:15
- Place: University of Copenhagen
Details will be announced later.

To register

Please fill in the Tutorial Registration Form and return it with the Congress Registration Form.
Registration for the tutorial is independent from the registration for the Congress.


Registration fee: DKK 1,100
Lunch, coffee/tea and tutorial material are included
Deadline: 15 June 2002
Number of participants is limited to 40 ('first come - first served')


Njalsgade 80 - DK-2300 KBH S - Tlf: +45 35329090 - Fax: +45 35329089 - webmaster@cst.dk