Download of software

The following software is under the GNU General Public License (GPL).

CST's RTFreader
The package consists of the C++ source code of CST's RTFreader. Once compiled for your platform (Linux, Unix, Windows), the program reads a flat text file or an RTF (Rich Text Format) file and converts it to segmented text with one sentence per line. Where full stops are missing, the program uses layout information such as character size and typeface to decide where a sentence ends. Optionally, the program delivers tokenised output and uses heuristics to decide whether a dot belongs to a token or delimits a sentence. You can supplement the tokenisation process with lists of abbreviations and multi-word units (MWUs). Such language-dependent resources are not part of the package.
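
The dot heuristic can be illustrated with a minimal sketch. This is not CST's implementation: the abbreviation list, the function name `split_sentences` and the two rules (known abbreviation, following lower-case word) are assumptions chosen for illustration, and the layout-based cues for RTF input are left out entirely.

```python
# Hypothetical abbreviation list; the real package expects such
# language-dependent resources to be supplied by the user.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "prof."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Return a list of sentences, using a simple dot heuristic."""
    sentences = []
    current = []
    tokens = text.split()
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "!", "?")):
            continue
        # A dot that belongs to a known abbreviation is not a delimiter.
        if token.lower() in abbreviations:
            continue
        # A following lower-case word suggests the sentence continues.
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if nxt and nxt[0].islower():
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        # Text without a final full stop still yields a last sentence.
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith arrived late. He brought tools, e.g. a hammer. Done"
    print("\n".join(split_sentences(sample)))  # one sentence per line
```

Without the abbreviation list, "Dr." would end the first sentence after one word, which is why such lists improve the result.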

Get the most recent version of the source code at GitHub.

CST's lemmatiser
The package contains the C++ source code of CST's lemmatiser. After compiling it for your platform (Linux, Unix, Windows), you can train the program. For languages with rich morphology you need large full-form word lists (more than 100 000 entries) to attain reasonably good results. Flex rules for untagged Danish text are included as of 2010/11/22, but not the (optional) built-in dictionary. Moreover, since 2012 flex rules and built-in dictionaries for Russian and Polish can be downloaded here.

Get the most recent version of the source code at GitHub. If you want to train affix rules for the lemmatiser (think of German abgebrannt → abbrennen), you need the program affixtrain in addition to cstlemma.
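
To give an idea of what an affix rule does, here is a rough sketch in which a rule may change the beginning, middle and end of a word at once. The rule format shown (regular expressions tried from most to least specific) is an assumption made for illustration; it is not the format produced by affixtrain or read by cstlemma, and the three rules are hand-written rather than trained.

```python
import re

# Hypothetical rules, ordered from most to least specific. Each pair is
# (pattern on the full form, replacement that yields the lemma).
RULES = [
    (re.compile(r"^(ab)ge(.+)annt$"), r"\1\2ennen"),  # abgebrannt -> abbrennen
    (re.compile(r"^ge(.+)t$"), r"\1en"),              # gesagt -> sagen
    (re.compile(r"^(.+)$"), r"\1"),                   # fallback: leave unchanged
]


def lemmatise(word, rules=RULES):
    """Apply the first rule whose pattern matches the whole word."""
    for pattern, replacement in rules:
        if pattern.match(word):
            return pattern.sub(replacement, word)
    return word


if __name__ == "__main__":
    for form in ("abgebrannt", "gesagt", "Haus"):
        print(form, "->", lemmatise(form))
```

The example shows why suffix stripping alone is not enough for such forms: the first rule removes the infix "ge" and rewrites the stem vowel as well as the ending.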

Contact CST if you are interested in CST's resources that are not covered by the GPL, for example to train the lemmatiser. (See CST's linguistic resources for Danish.)

Bracmat
Bracmat is an interpreted programming language that Bart Jongejan has been developing since 1986. It was originally designed as a computer algebra system, but it has proved its merits in natural language processing as well. It has been used in the field of General Relativity for the algebraic computation of Ricci tensors from given space-time metrics, for the implementation of a dialogue manager in the Staging project, for the analysis of texts in the "Controlled Language" part of the VID project, for automatic error correction of CST's many HTML pages, and for many corpus validation tasks. Bracmat has also proved useful in real-world applications, for example to identify persons, companies etc. in pre-tagged texts that must be anonymised.

The most advanced application of Bracmat to date is as workflow planner and executor in DK-Clarin's Tools module. Instead of letting the user choose between tools that she may not know very well, the Tools module asks the user to specify what kind of output she wants. With this information, the Tools module computes all combinations of tools and parameter settings (not necessarily sequential) that form workflows guaranteed to produce the specified output from the given input. The computed list is condensed into a short format that highlights the differences between the workflows and leaves out less important details.
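
The idea of planning from the desired output can be sketched as follows. The sketch is written in Python rather than Bracmat, and it is not the Tools module's algorithm: the tool names and data types in `TOOLS` are invented for illustration, parameter settings are ignored, and only linear chains are enumerated, whereas the Tools module also handles non-sequential combinations.

```python
# Hypothetical tool registry: name -> (input type, output type).
TOOLS = {
    "rtfreader": ("rtf", "segments"),
    "tokeniser": ("segments", "tokens"),
    "tagger": ("tokens", "tagged"),
    "lemmatiser_plain": ("tokens", "lemmas"),
    "lemmatiser_tagged": ("tagged", "lemmas"),
}


def plan_workflows(source, goal, tools=TOOLS, used=()):
    """Return every tool chain that turns `source` data into `goal` data."""
    if source == goal:
        return [[]]  # nothing left to do: one empty workflow
    workflows = []
    for name, (needs, gives) in tools.items():
        # Each tool is used at most once per chain, so cycles terminate.
        if needs == source and name not in used:
            for rest in plan_workflows(gives, goal, tools, used + (name,)):
                workflows.append([name] + rest)
    return workflows


if __name__ == "__main__":
    for workflow in plan_workflows("rtf", "lemmas"):
        print(" -> ".join(workflow))
```

With this registry, asking for lemmas from RTF input yields two alternative chains, one with and one without the tagging step. The user only states the goal; the choice between chains is what a condensed overview of differences would then present.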

Get the most recent version of the source code at GitHub.

Read more about Bracmat.

Other licenses than GNU

CST uses free third-party software that we have adapted to our needs, typically to enable us to run the software on platforms that it originally had not been written for. We are happy to pass on these programmes to other users under the same licenses:

POS-tagger written by Eric Brill
CST uses this POS-tagger in many applications for the analysis of English texts (using Eric Brill's linguistic resources, in some cases with small adaptations) as well as Danish texts (with CST's linguistic resources). The distribution comprises Eric Brill's original distribution and a Zip-file with CST's software adaptations. Note that the training part of Brill's tagger is unchanged! We have made the following adaptations:
  • Reformatting from UNIX-style C to standard C++,
  • Replacement of some UNIX-specific functions with standard C functions,
  • Better handling of capitals in (supposed) headings,
  • The introduction of an option file "xoptions" to make the source code independent of language and tagset, and
  • Reading and writing XML formatted files, storing the POS tag in an attribute of the element containing the word.
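
The last adaptation, storing the POS tag in an attribute of the element that contains the word, can be pictured with a small sketch. The element name `w`, the attribute name `pos` and the toy tagger are assumptions for illustration only; the adapted tagger's actual element and attribute names are not stated here.

```python
import xml.etree.ElementTree as ET


def tag_words(xml_text, tagger, element="w", attribute="pos"):
    """Add a POS attribute to every word element and return the XML."""
    root = ET.fromstring(xml_text)
    for node in root.iter(element):
        # The tag goes into an attribute; the word text stays untouched.
        node.set(attribute, tagger(node.text or ""))
    return ET.tostring(root, encoding="unicode")


def toy_tagger(word):
    """Stand-in for a real tagger: determiner for 'the', noun otherwise."""
    return "DT" if word.lower() == "the" else "NN"


if __name__ == "__main__":
    print(tag_words("<s><w>The</w><w>gnu</w></s>", toy_tagger))
```

Keeping the tag in an attribute means the document structure and the word text survive tagging unchanged, so other tools can still read the same file.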


Get the most recent version of the source code at GitHub.

The CASS parser written by Steven Abney
CST used the CASS parser in the VID project to mark up noun phrases in large text corpora. The distribution comprises Steven Abney's original distribution and a Zip file with CST's adaptations. The adaptations are minimal but relevant if you want to compile the program with one of the newer GNU C++ compilers. (UPDATE: after we made these adaptations, Steven Abney made adaptations of his own to solve the same compatibility problems. However, neither CST's distribution nor Steven Abney's seems to compile with the newest generation of GNU C++ compilers (version 4 and above)!)

Linguistic resources

If you are interested in obtaining linguistic resources that have been produced under the auspices of CST (STO, training data for the POS-tagger or for the lemmatiser, grammars for the NP-recogniser, rules for the name recogniser), please contact Hanne Fersøe (hanne@cst.dk).


Njalsgade 140-142, building 25, DK-2300 Copenhagen S
Tel: +45 35329090 - Fax: +45 35329089