CSTLEMMA - the CST Lemmatiser


This distribution contains the following directories and files:
---------------------------------------------------------------

bc5           This directory contains a project file for Borland C++ 5.02

bin           This directory contains two executables for DOS:
                 bin\bc5\cstlemma.exe is compiled with Borland C++ 5.02
                 bin\vc6\Release\cstlemma.exe is compiled with Visual C++ 6.0

doc           This directory contains documentation of the program.

src           This directory contains the source code and a Makefile.

vc6           This directory contains a project file for Microsoft Visual 
              C++ 6.0

Changelog     A document describing changes between versions.

COPYING       The full text of the GNU public licence.

Readme        This file.



CSTLEMMA has been compiled and run on the following platforms:
----------------------------------------------------------------

Platform      Compiler(s)

DOS (32 bit)  Borland C++ 5 and Microsoft Visual C++ 6.0
Solaris       GNU C++ 2.95.3
Linux         GNU C++ 3.3.1


Important notice: the binary dictionary is NOT portable between platforms.
However, the Borland- and Microsoft-compiled executables generate identical
binary dictionaries.


Installation:
-------------


DOS:
      Either use one of the supplied executables in the 'bin' directory or
      compile and link the program using the project file in the top-level
      'bc5' or 'vc6' directory.

Solaris and Linux:

      Change directory to the 'src' directory.
      Run 'make' or 'make cstlemma'. To get rid of object files, run 
      'make clean'.

Running:
--------

For running the CST lemmatiser you need as a minimum a file containing flex
rules. The absolute minimal set of flex rules is the empty set, in which case
the lemmatiser assumes that all words in your input text are perfectly
lemmatised already.

Thus, for checking that the lemmatiser runs OK, you could do the following:

touch my_empty_rule_file
cstlemma -L -t- -f my_empty_rule_file -i my_text_file.txt

This would create a file my_text_file.txt.lemma that has two tab-separated
columns: the left column contains a word from your text and the right column
contains the same word, converted to lower-case. The -L option tells the
program lemmatise (as opposed to generating flex rules or creating a binary
dictionary). The -t- option tells the program not to expect tagged input. The
-f and -i options tells the program which rules and which input text to read.

You can hand-craft the rules or let the lemmatiser generate flex rules from
a full-form dictionary. The full-form dictionary can also be used to generate
a binary dictionary, which the program can use to even better lemmatise your
input text.

If you want to lemmatise a Danish text, please contact us. We have a full form
dictionary with 70000 head words that we have used to train the lemmatiser for
the Danish language.


Contact info:
-------------

For questions and remarks about the program, please feel free to contact us.

Our postal address is:

Center for Sprogteknologi
University of Copenhagen
Njalsgade 80
2300 Copenhagen S.
Denmark

On the internet, you can visit us at http://www.cst.dk
Here you can also try the CST lemmatiser for Danish.
