CSTLEMMA - the CST Lemmatiser

This distribution contains the following directories and files:
---------------------------------------------------------------

bin         This directory contains executables:
                bin\windows32\cstlemma.exe
                bin\linux\64\cstlemma
doc         This directory contains documentation of the program.
src         This directory contains the source code and a Makefile.
danish      Directory containing flex rules for untagged Danish text
            (as of 20101122). The rules have been tested on a 64-bit
            Linux system and on a 32-bit Windows system, but as the
            file is in a binary format, there is no guarantee that
            they will work in all environments.
Changelog   A document describing changes between versions.
COPYING     The full text of the GNU public licence.
Readme      This file.

CSTLEMMA has been compiled and run on the following platforms:
----------------------------------------------------------------

Platform            Compiler(s)
DOS (32/64 bit)     Borland C++ 5 and newest Microsoft Visual C++
Solaris             GNU C++ 2.95.3
Linux               newest GNU C++

Important notice: the binary dictionary is NOT portable between
platforms. However, the Borland- and Microsoft-compiled executables
generate identical binary dictionaries.

Installation:
-------------

DOS:
    Either use a compiled version in the 'bin' directory or compile
    and link the program.
Linux:
    Change directory to the 'src' directory.
    Run 'make' or 'make cstlemma'.
    To get rid of object files, run 'make clean'.

Running:
--------

To run the CST lemmatiser you need, as a minimum, a file containing
flex rules. The absolute minimal set of flex rules is the empty set,
in which case the lemmatiser assumes that all words in your input text
are perfectly lemmatised already.
Thus, to check that the lemmatiser runs OK, you could do the following:

    touch my_empty_rule_file
    cstlemma -L -t- -f my_empty_rule_file -i my_text_file.txt

This would create a file my_text_file.txt.lemma that has two
tab-separated columns: the left column contains a word from your text
and the right column contains the same word, converted to lower case.

The -L option tells the program to lemmatise (as opposed to generating
flex rules or creating a binary dictionary).
The -t- option tells the program not to expect tagged input.
The -f and -i options tell the program which rules and which input
text to read.

You can hand-craft the rules or let the lemmatiser generate flex rules
from a full-form dictionary. The full-form dictionary can also be used
to generate a binary dictionary, which the program can use to lemmatise
your input text even better.

If you want to lemmatise a Danish text, please contact us. We have a
full-form dictionary with 70000 head words that we have used to train
the lemmatiser for the Danish language.

Contact info:
-------------

For questions and remarks about the program, please feel free to
contact us.

Our postal address is:

    Center for Sprogteknologi
    University of Copenhagen
    Njalsgade 140
    2300 Copenhagen S.
    Denmark

On the internet, you can visit us at http://cst.ku.dk/
Here you can also try the CST lemmatiser for Danish and several other
languages.
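
Example (reading the output file):
----------------------------------

The section 'Running' describes the output of the empty-rule test as
two tab-separated columns (word, then lemma). The Python sketch below
shows one way such a file might be read and checked. It is not part of
the distribution: the function name read_lemma_pairs is hypothetical,
the sample lines are invented for illustration rather than taken from
real cstlemma output, and the exact output layout may differ depending
on the options and version you use.

```python
import io


def read_lemma_pairs(lines):
    """Yield (word, lemma) tuples from tab-separated lines.

    Assumes one word per line, with the lemma after the first tab,
    as described for the empty-rule test run in this Readme.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        word, _, lemma = line.partition("\t")
        yield word, lemma


# Invented sample data standing in for my_text_file.txt.lemma.
sample = io.StringIO("The\tthe\nHouses\thouses\n")
pairs = list(read_lemma_pairs(sample))

# With an empty rule file, each lemma should be the lower-cased word.
all_lowercased = all(lemma == word.lower() for word, lemma in pairs)
print(pairs, all_lowercased)
```

To read a real file instead of the sample, an open file object can be
passed in the same way, for example
open("my_text_file.txt.lemma", encoding="utf-8"), assuming the file is
UTF-8 encoded.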