Download of software
GNU
The following software is released under the GNU General Public License (GPL).
-
CST's RTFreader
-
The package consists of the source code (C++) of CST's RTFreader. After compilation for your
preferred platform (Linux, Unix, Windows), the program can read a flat text file or an RTF
(Rich Text Format) file and convert it to segmented text with one sentence per line.
Where full stops are missing, the program uses layout information such as character size and typeface
to decide where a sentence ends. Optionally, the program delivers tokenised output,
using heuristics to decide whether a dot belongs to a token or is a sentence delimiter.
You can supplement the tokenisation process with lists of abbreviations and multi-word units (MWUs).
Such language-dependent resources are not part of the package.
Get the most recent version of the source code at GitHub.
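To give a feel for the kind of dot heuristic described above, here is a minimal sketch in Python. It is not CST's algorithm and does not use RTFreader's code: the function name `split_sentences`, the sample abbreviation list and the "next token starts with a capital or digit" rule are all assumptions made for illustration.

```python
# Hypothetical abbreviation list. Real lists are language-dependent
# and, as noted above, are not part of the RTFreader package.
ABBREVIATIONS = {"dr.", "e.g.", "etc.", "prof."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, one list item per sentence.

    A token-final '.', '!' or '?' counts as a sentence delimiter only if
    the token is not a known abbreviation and the next token (if any)
    starts with an uppercase letter or a digit.
    """
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "!", "?")):
            continue
        if token.lower() in abbreviations:
            continue  # the dot belongs to the token
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if following is None or following[0].isupper() or following[0].isdigit():
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final delimiter
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("See e.g. the manual. It helps."):
        print(sentence)
```

Adding or removing entries in the abbreviation set changes where the text is cut, which is why such lists matter for the quality of the segmentation.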
-
CST's lemmatiser
-
The package contains the source code (C++) of CST's lemmatiser. After compilation
for your platform (Linux, Unix, Windows) you can train the program. For languages
with rich morphology you need large full-form word lists (more than 100 000 entries) in order
to attain a reasonably good result. Flex rules for untagged Danish text have been included
since 2010/11/22, but the (optional) built-in dictionary is not. In addition, since 2012
flex rules and built-in dictionaries for Russian and Polish can be downloaded here.
Get the most recent version of the source code at
GitHub. If you want to train affix rules for the lemmatiser (think of German
abgebrannt → abbrennen),
you need the program affixtrain in addition to cstlemma.
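The following Python sketch illustrates what an affix rule for the German example might look like. The wildcard notation, the function `apply_rule` and the rule itself are assumptions made for illustration. They are not the rule format that affixtrain produces or that cstlemma reads.

```python
import re


def apply_rule(word, pattern, replacement):
    """Apply one hypothetical wildcard rule to a full form.

    Each '*' in `pattern` matches one or more characters. The matched
    pieces are copied, in order, into the '*' positions of `replacement`.
    The sketch assumes both strings contain the same number of '*'.
    Returns the lemma, or None if the pattern does not match the word.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    regex = "^" + "(.+?)".join(literal_parts) + "$"
    match = re.match(regex, word)
    if match is None:
        return None
    pieces = iter(match.groups())
    return "".join(next(pieces) if ch == "*" else ch for ch in replacement)


if __name__ == "__main__":
    # Illustrative rule: drop the infix "ge" and rewrite the ending.
    print(apply_rule("abgebrannt", "*ge*annt", "**ennen"))
```

A rule like this changes a prefix-internal infix and a suffix in one step, which a pure suffix-stripping rule cannot do.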
Contact CST if you are interested in CST's resources that are not covered by the
GPL, for example to train the lemmatiser. (See
CST's linguistic resources for Danish.)
-
Bracmat
-
Bracmat is an interpreted programming language that has been developed by Bart Jongejan
since 1986. It was originally designed as a computer algebra system, but it has
proved its merits in natural language processing as well. It has been used in the
field of General Relativity for the algebraic computation of Ricci tensors from
given space-time metrics, for the implementation of a dialogue manager in the
Staging project, for the analysis of texts in the "Controlled
Language" part of the VID project, for automatic error correction
of CST's many HTML pages and for many corpus validation tasks. Bracmat has also
shown its utility in some real-world applications: for example to identify persons,
companies etc. in pre-tagged texts that must be anonymised. The most advanced
application of Bracmat to date is as workflow planner and executor in
DK-Clarin's Tools module. Instead of letting the user choose between tools,
which the user may not know very well, the Tools module asks the user to specify
what kind of output she wants. With this information the Tools module computes all
(not necessarily sequential) combinations of tools and their parameter settings
that combine into workflows that are guaranteed to produce the specified output
from the given input. The computed list is condensed into a short format that highlights
the differences between the workflows for the user and leaves out all that is of
less importance.
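The planning idea can be sketched in a few lines of Python. This is not the Tools module's Bracmat code. The tool names and data types below are invented, and the sketch only handles purely sequential chains, whereas the Tools module also computes non-sequential combinations and parameter settings.

```python
from typing import NamedTuple


class Tool(NamedTuple):
    name: str
    consumes: str  # type of data the tool reads
    produces: str  # type of data the tool writes


# Hypothetical tool registry, for illustration only.
TOOLS = [
    Tool("rtf-reader", "rtf", "segments"),
    Tool("tokeniser", "segments", "tokens"),
    Tool("pos-tagger", "tokens", "tagged"),
    Tool("lemmatiser", "tokens", "lemmas"),
    Tool("tagged-lemmatiser", "tagged", "lemmas"),
]


def find_workflows(have, want, tools=TOOLS, used=()):
    """Return every chain of tool names that turns `have` into `want`.

    Depth-first search over the registry. Each tool is used at most once
    per chain, so the search terminates even if the registry has cycles.
    """
    if have == want:
        return [[]]  # nothing left to do
    workflows = []
    for tool in tools:
        if tool.consumes == have and tool.name not in used:
            for rest in find_workflows(tool.produces, want, tools, used + (tool.name,)):
                workflows.append([tool.name] + rest)
    return workflows


if __name__ == "__main__":
    for workflow in find_workflows("rtf", "lemmas"):
        print(" -> ".join(workflow))
```

The user states only the input type and the wanted output type. Every chain returned is usable by construction, because each step consumes exactly what the previous step produces.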
Get the most recent version of the source code at GitHub.
Read more about Bracmat.
Other licenses than GNU
CST uses free third-party software that we have adapted to our needs, typically
to enable us to run the software on platforms for which it was not originally
written. We are happy to pass on these programs to other users under the same licenses:
-
POS-tagger written by Eric Brill
-
CST uses this POS-tagger in many applications for the analysis of English texts
(using Eric Brill's linguistic resources, in some cases with small adaptations)
as well as Danish texts (with CST's linguistic resources). The distribution comprises
Eric Brill's original distribution and a Zip-file with CST's software adaptations.
Note that the training part of Brill's tagger is unchanged! We have made the following
adaptations:
- Reformatting from UNIX-style C to standard C++,
- Replacement of some UNIX-specific functions with standard C functions,
- Better handling of capitals in (presumed) headings,
- The introduction of an option file "xoptions" to make the source code independent
of language and tagset, and
- Reading and writing XML-formatted files, storing the POS tag in an attribute of
the element containing the word.
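The last adaptation in the list can be pictured with a short Python sketch that uses the standard library's `xml.etree.ElementTree`. It does not call the tagger. The element name `w`, the attribute name `pos` and the `tagger` callback are assumptions for illustration, not the names used by CST's adaptation.

```python
import xml.etree.ElementTree as ET


def annotate(xml_text, tagger, word_element="w", pos_attribute="pos"):
    """Store a POS tag in an attribute of every word element.

    `tagger` is any callable that maps a list of words to a list of tags
    of the same length. It stands in for a real POS-tagger here.
    """
    root = ET.fromstring(xml_text)
    words = list(root.iter(word_element))
    tags = tagger([element.text or "" for element in words])
    for element, tag in zip(words, tags):
        element.set(pos_attribute, tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    def toy_tagger(words):
        # Toy stand-in: tags "the" as determiner and everything else as noun.
        return ["DT" if word.lower() == "the" else "NN" for word in words]

    print(annotate("<s><w>The</w><w>cat</w></s>", toy_tagger))
```

Keeping the tag in an attribute leaves the text content of the document untouched, so other tools that read the same file still see the original words.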
Get the most recent version of the source code at GitHub.
-
The CASS parser written by Steven Abney
-
CST has used the CASS parser in the VID project for marking up
noun phrases in large text corpora. The distribution comprises Steven Abney's original
distribution and a Zip-file with CST's adaptations. The adaptations are minimal
but relevant if you want to compile the program with one of the newer GNU C++
compilers. (UPDATE: after we made these adaptations, Steven Abney made adaptations of his own
to solve the same compatibility problems. However, neither CST's distribution nor Steven
Abney's seems to compile with the newest generation of GNU C++ compilers (version
4 and above)!)
Linguistic resources
If you are interested in obtaining linguistic resources that have been produced
under the auspices of CST (STO, training data for the POS-tagger or for the lemmatiser,
grammars for the NP-recogniser, rules for the name recogniser), please contact Hanne
Fersøe (hanne@cst.dk).