Extremely Efficient High-throughput Lemmatization
This section provides links to the downloadable content and gives some basic information about it.
The software was developed in three stages, which correspond directly to the three main versions described below. Incremental versions not mentioned here (e.g. v1.1) are not stable releases. When downloading the software, always pick the latest version for your target platform: v2.2 for C++, v2.1 for C++ wrapped in the .Net framework, v2.5 for the Python wrapper and v3.0 for the C#.Net implementation. Older versions are listed here only for those who want to repeat the exact experiments described in the listed papers or dissertation.
The image above gives a brief overview of the development of the LemmaGen system with reference to the targeted platforms. The rest of this section describes the image in detail and provides links to the downloadable content.
- v1.0: (get) The first version of LemmaGen, developed for the bachelor’s dissertation. It is written in C++ with the aim of being a self-sufficient command-line application and/or a minimalistic lemmatisation class library that can be included in other projects. The solution can be built on Windows using the included makefile or the Microsoft Visual Studio project file. On Linux, the makefile can be used to build an executable.
- v1.5: (get) Incremental upgrade of v1.0 with many bug fixes and added functionality. The behaviour of the main algorithm is the same as in v1.0.
- v2.0: (get) Incremental upgrade of v1.5, but with significant changes to the lemmatisation algorithm. Some new heuristics were added to improve accuracy.
- v2.2: (get) Incremental upgrade of v2.0 with bug fixes and new functionality; the behaviour is the same as in v2.0.
- v2.1: (get) The first version that was ported to the .Net framework. This was done by wrapping the existing code in managed C++ code, which can be built with the included Visual Studio project file. The wrapped functionality is used as a code library that can be included in other projects, not as a standalone command-line application.
- v2.5: (get) The first version that was ported to the .Net framework. This was done by wrapping the existing code in managed C++ code, which can be built with the included Visual Studio project file. The wrapped functionality is used as a code library that can be included in other projects, not as a standalone command-line application.
- v3.0: (get) Complete rewrite and improvement of the library in C# under the .Net framework. It also works on Linux using the Mono framework. Like v2.5, this is not a standalone command-line version but only a code library to be used by other projects. The best-known library in which LemmaGen has been used is Latino.
Models and Data for Pretrained Lemmatisers
The Models and Data section is divided into two main parts:
- Input data for training lemmatizers (lexicons)
- Prebuilt lemmatizer models for different versions of the applications
The Multext and Multext-East lexicons, which we use to build our models, are not licensed as open source, so we cannot provide all of them. Here are just two example languages: the Slovene and English lexicons (Multext-East v3). These two examples have not been modified in any way.
Additional and updated lexicons can be found on the Multext-East or Multext websites. Furthermore, some of the latest versions of the lexicons have a free licence, or are at least free for research use, so it is worth checking. Of course, you can also use any other resources (lexicons), as long as you have a set of wordform-lemma pairs.
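As an illustration of what "a set of wordform-lemma pairs" can look like in practice, here is a minimal Python sketch that reads such pairs from a text source. The tab-separated layout (wordform first, lemma second, any further columns ignored) and the function name `read_pairs` are assumptions made for this example, not a format required by LemmaGen; check the tutorial page for the layout the tools actually expect.

```python
import io


def read_pairs(lines):
    """Yield (wordform, lemma) tuples from tab-separated lines.

    Assumed layout (hypothetical): wordform <TAB> lemma [<TAB> extra columns].
    Blank lines and lines starting with '#' are skipped.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected at least two tab-separated fields: {line!r}")
        yield fields[0], fields[1]


# A tiny in-memory stand-in for a lexicon file.
sample = io.StringIO(
    "# wordform\tlemma\ttag\n"
    "walked\twalk\tVERB\n"
    "walking\twalk\tVERB\n"
    "mice\tmouse\tNOUN\n"
)
pairs = list(read_pairs(sample))
print(pairs)
```

With a real file, the same function can be fed an open file object instead of the `io.StringIO` sample.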
Unfortunately (from the user-friendliness perspective), the models are not the same for all versions of our application. As we developed the application, we kept adding new functionality that the old models did not support. Nevertheless, we constrained ourselves so that all releases within a major version of the application (e.g. v1, v2, v3, …) can handle the same models (or at least models that are compatible with each other). Thus, we list below the models for all three major versions released so far.
The code downloads for both published v1 versions of the application (v1.0 and v1.5) already contain the models for all lexicons available at that time (14 basic lexicons / 12 languages). We no longer provide support for v1 versions. However, if you really want to use this "old" version of the software, you can build your own models by taking the available lexicons and following the instructions on the tutorial page.
There are two types of models: human-readable textual models and optimised binary ones. Anyone who has gone through the tutorial on creating a new model knows that the first step is the creation of a lemmatisation tree with the tool lemLearn. The output of this tool is a human-readable, well-formatted model of a lemmatiser. This model is then passed to the lemBuild procedure, which builds an optimised binary representation that is finally used for the actual high-speed lemmatisation with the lemmatize tool. Therefore, we offer both files (textual and binary) for each of the 12 languages (14 lexicons) in the table below. Alternatively, you can download all textual models together (get) and all binary models together (get).
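| Lexicon Type | Language | Textual | Binary | Lexicon Size: Wordforms | Lexicon Size: Lemmas |
|---|---|---|---|---|---|
| Multext-East | English | (get) | (get) | 71,784 | 27,467 |
| Multext-East | Slovene | (get) | (get) | 557,970 | 16,389 |
| Multext-East | Bulgarian | (get) | (get) | 55,200 | 22,982 |
| Multext-East | Czech | (get) | (get) | 184,628 | 23,435 |
| Multext-East | Estonian | (get) | (get) | 135,094 | 46,933 |
| Multext-East | French | (get) | (get) | 306,795 | 29,446 |
| Multext-East | Hungarian | (get) | (get) | 64,042 | 28,090 |
| Multext-East | Romanian | (get) | (get) | 428,194 | 39,359 |
| Multext-East | Serbian | (get) | (get) | 20,294 | 8,392 |
| Multext | English | (get) | (get) | 66,216 | 22,874 |
| Multext | French | (get) | (get) | 306,795 | 29,446 |
| Multext | German | (get) | (get) | 233,858 | 10,655 |
| Multext | Italian | (get) | (get) | 145,530 | 8,877 |
| Multext | Spanish | (get) | (get) | 510,709 | 13,236 |
| All | All | (get) | (get) | NA | NA |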
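To make the difference between the textual and the binary model more concrete, the following toy Python sketch mimics the learn, build and lemmatise steps described above. It is only an illustration under loud assumptions: the functions `learn`, `build`, `load` and `lemmatize` are hypothetical and are not the interface of the lemLearn, lemBuild or lemmatize tools; the simple suffix-rewrite rules stand in for the real lemmatisation tree; and JSON and pickle stand in for the actual textual and binary model formats.

```python
import json
import pickle


def learn(pairs):
    """Learn suffix rewrite rules and return them as a readable text model."""
    rules = {}
    for word, lemma in pairs:
        # Length of the longest common prefix of wordform and lemma.
        i = 0
        while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
            i += 1
        if i < len(word):  # skip pairs whose wordform leaves no suffix to match on
            rules[word[i:]] = lemma[i:]
    return json.dumps(rules, indent=2, sort_keys=True)


def build(text_model):
    """Convert the readable text model into a binary blob (pickle as a stand-in)."""
    return pickle.dumps(json.loads(text_model))


def load(binary_model):
    """Load the rule table back from the binary blob."""
    return pickle.loads(binary_model)


def lemmatize(rules, word):
    """Rewrite the longest known suffix of word; return word unchanged otherwise."""
    for start in range(len(word)):
        suffix = word[start:]
        if suffix in rules:
            return word[:start] + rules[suffix]
    return word


pairs = [("walked", "walk"), ("walking", "walk"), ("mice", "mouse")]
text_model = learn(pairs)        # human-readable, can be inspected or edited
rules = load(build(text_model))  # binary form, loaded once before lemmatising
print(text_model)
print(lemmatize(rules, "talked"), lemmatize(rules, "jumping"), lemmatize(rules, "cat"))
```

The split mirrors the reason for offering both files: the textual form is convenient for inspection, while the binary form is the one that is loaded for fast lookups.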
v3.x Applications
If you have followed the link for the v3 software, then you understand the philosophy behind the different releases (Data/Compact/Compressed/Full) of the prebuilt library.