Download of software
GNU
The following software is under the GNU General Public License (GPL).
- CST's lemmatiser
The package contains the source code (C++) of CST's lemmatiser. After compiling it
for your platform (Linux, Unix, Windows) you can train the programme. For languages
with rich morphology you need large full-form word lists (more than 100 000 entries) in order
to attain reasonably good results. Flex rules for untagged Danish text are included
as of 2010/11/22, but not the (optional) built-in dictionary. Contact CST if you
are interested in CST's resources that are not covered by the GPL, for example to
train the lemmatiser. (See CST's linguistic resources for Danish.)
- Bracmat
Bracmat is an interpreted programming language that has been developed by one of CST's
staff members since 1986. It was originally designed as a computer algebra system,
but it has shown its merits in natural language processing as well. It has been
used in the field of General Relativity for the algebraic computation of Ricci tensors
from given space-time metrics, for the implementation of a dialogue manager in the
Staging-project, for the analysis of texts in the "Controlled
Language"-part of the VID-project and for automatic error correction
of CST's many HTML pages. Bracmat has also proved useful in some real-world
applications, for example to identify persons, companies etc. in pre-tagged texts
that must be anonymised. The most advanced application of Bracmat to date is as
a workflow planner and executor in DK-Clarin's Tools
module. Instead of letting the user choose between tools, which the user may not
know very well, the Tools module asks the user to specify what kind of output she
wants. With this information the Tools module computes all (not necessarily sequential)
combinations of tools and their parameter settings that combine into workflows that
are guaranteed to produce the specified output from the given input. The computed
list is condensed into a short format that highlights the differences between the
workflows for the user and leaves out all that is of less importance.
Read more about Bracmat.
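The planning idea behind the Tools module can be illustrated with a small sketch. It is written in Python rather than Bracmat and is not the DK-Clarin implementation: the tool names, the format labels and the function find_workflows are hypothetical, and the sketch only handles strictly sequential chains without parameter settings, whereas the Tools module also computes non-sequential combinations.

```python
from typing import NamedTuple


class Tool(NamedTuple):
    name: str      # hypothetical tool name
    consumes: str  # kind of data the tool takes as input
    produces: str  # kind of data the tool delivers


# Hypothetical tool registry, for illustration only.
TOOLS = [
    Tool("tokeniser", "text", "tokens"),
    Tool("pos_tagger", "tokens", "tagged"),
    Tool("lemmatiser", "tokens", "lemmas"),
    Tool("tag_lemmatiser", "tagged", "lemmas"),
]


def find_workflows(tools, have, want, seen=frozenset()):
    """Return every chain of tool names that turns `have` into `want`."""
    if have == want:
        return [[]]  # nothing left to do: one empty workflow
    seen = seen | {have}  # never produce a kind of data twice
    workflows = []
    for tool in tools:
        if tool.consumes == have and tool.produces not in seen:
            for rest in find_workflows(tools, tool.produces, want, seen):
                workflows.append([tool.name] + rest)
    return workflows


if __name__ == "__main__":
    # The user only states the desired output; the planner lists the options.
    for workflow in find_workflows(TOOLS, "text", "lemmas"):
        print(" -> ".join(workflow))
```

With this registry, asking for "lemmas" from "text" yields two candidate workflows, one that goes through the POS-tagger and one that does not. Presenting such alternatives in condensed form is the step that the Tools module performs for the user.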
Licenses other than GNU
CST uses free third-party software that we have adapted to our needs, typically
to enable us to run the software on platforms for which it was not originally
written. We are happy to pass on these programmes to other users under the same licenses:
- POS-tagger written by Eric Brill
CST uses this POS-tagger in many applications for the analysis of English texts
(using Eric Brill's linguistic resources, in some cases with small adaptations)
as well as Danish texts (with CST's linguistic resources). The distribution comprises
Eric Brill's original distribution and a Zip-file with CST's software adaptations.
Note that the training part of Brill's tagger is unchanged! We have made the following
adaptations:
- Reformatting from UNIX-style C to standard C++,
- Replacement of some UNIX-specific functions with standard C functions,
- Better handling of capitals in (presumed) headings, and
- The introduction of an option file "xoptions" to make the source code independent
of language and tagset.
- The CASS parser written by Steven Abney
CST has used the CASS parser in the VID-project for marking up
noun phrases in large text corpora. The distribution comprises Steven Abney's original
distribution and a Zip-file with CST's adaptations. The adaptations are minimal
but relevant if you want to compile the programme with one of the newer GNU C++
compilers. (UPDATE: after we made these adaptations, Steven Abney made his own changes
to solve the same compatibility problems. However, neither CST's distribution nor Steven
Abney's seems to compile with the newest generation of GNU C++ compilers (version
4 and above)!)
Linguistic resources
If you are interested in obtaining linguistic resources that have been produced
under the auspices of CST (STO, training data for the POS-tagger or for the lemmatiser,
grammars for the NP-recogniser, rules for the name recogniser), please contact Hanne
Fersøe (hanne@cst.dk).