Lemmatisation decision tree output example

Let's say you want a file containing all the lemmatisation rules used by a selected lemmatiser, in the same format that LemmaGen v2.x (the C++ code) outputs and accepts as input. What is the procedure? (Beware, however, if you actually feed the file produced by the procedure below to LemmaGen v2.x: the encoding may be incorrect. The C++ version should be given ASCII, while C# writes UTF-8 by default, I think, so be careful!)
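If the output is meant for the C++ version, the encoding can be set explicitly when the writer is created. The following is only a minimal sketch, assuming ASCII really is what LemmaGen v2.x expects (not verified here). The helper WriteAscii and the file name are made up for illustration. Note that Encoding.ASCII replaces every character outside ASCII with '?', so rules containing such characters would be damaged.

```csharp
using System;
using System.IO;
using System.Text;

static class AsciiOutputSketch
{
    // Hypothetical helper: writes text to path as ASCII instead of the default UTF-8.
    public static void WriteAscii(string path, string text)
    {
        // append: false overwrites an existing file; "using" flushes and closes the writer.
        using (StreamWriter writer = new StreamWriter(path, false, Encoding.ASCII))
        {
            writer.Write(text);
        }
    }

    static void Main()
    {
        // Illustrative file name and rule line, not real LemmaGen output.
        WriteAscii("ExampleFileAscii.txt", "RULE: i\"a\" t\"a\"->\"um\";\n");
        Console.WriteLine(new FileInfo("ExampleFileAscii.txt").Length + " bytes written");
    }
}
```

The same three-argument StreamWriter constructor could be used in place of the writer in the main program if ASCII output is wanted.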

So the procedure goes like this:

  1. Create a new empty console project in Visual Studio.
  2. Copy/paste the source code below (Image 1) into the main program file.
  3. Download the sources (at least Base + PrebuiltCompact from here) and add the following projects to your solution: "LemmaSharp", "LemmaSharpPrebuilt", "LemmaSharpPrebuiltCompact".
  4. Add references to these projects in your original project.
  5. Change the following variables from private to public in project "LemmaSharp":
    • class LemmaTreeNode, variables: bWholeWord, sCondition, dictSubNodes and lrBestRule
    • class LemmaRule, variables: iFrom and sTo
  6. Execute. The output "ExampleFile.txt" should look similar to the beginning of the file shown below in Image 2.
  7. Extend the solution, experiment and have fun.

 using System;
 using System.IO;
 using System.Text;
 using LemmaSharp;

 namespace LemaOutTree
 {
     class Program
     {
         static void Main(string[] args)
         {
             ILemmatizer lmtz = new LemmatizerPrebuiltCompact(LemmaSharp.LanguagePrebuilt.Czech);
             // File.Create truncates an existing file; "using" flushes and closes the writer on exit.
             using (StreamWriter tw = new StreamWriter(File.Create("ExampleFile.txt")))
             {
                 Output(((Lemmatizer)lmtz).RootNode, tw, 0, false);
             }
         }

         // Writes the rule stored in node ltn, then recurses into its sub-nodes one level deeper.
         private static void Output(LemmaTreeNode ltn, TextWriter sb, int iLevel, bool first) {
             // The first child shares a line with the "{:" of its parent, so it needs a single tab.
             sb.Write(new string('\t', first ? 1 : iLevel));
             sb.Write("RULE: ");
             // i"...": the condition (word ending); "#" marks a whole-word match.
             sb.Write("i\"" + (ltn.bWholeWord ? "#" : "") + ltn.sCondition + "\" ");
             // t"from"->"to": the last iFrom characters of the condition are replaced by sTo.
             sb.Write("t\"" + ltn.sCondition.Substring(ltn.sCondition.Length - ltn.lrBestRule.iFrom) + "\"->\"" + ltn.lrBestRule.sTo + "\";");
             sb.WriteLine();
             if (ltn.dictSubNodes != null) {
                 sb.Write(new string('\t', iLevel));
                 sb.Write("{:");
                 bool firstInner = true;
                 foreach (LemmaTreeNode ltnChild in ltn.dictSubNodes.Values) {
                     Output(ltnChild, sb, iLevel + 1, firstInner);
                     firstInner = false;
                 }
                 sb.Write(new string('\t', iLevel));
                 sb.Write(":}");
                 sb.WriteLine();
                 sb.WriteLine();
             }
         }
     }
 }
 
 

Image 1: Example source code
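
Step 5 edits the library source. If you would rather leave LemmaSharp untouched, private fields can in principle be read through reflection instead. The following is only a sketch of that idea: StandInNode is a hypothetical stand-in class, not the real LemmaTreeNode, and ReadField is a made-up helper. Only the field names sCondition and bWholeWord are taken from step 5, and this has not been tried against the actual library.

```csharp
using System;
using System.Reflection;

// Hypothetical stand-in for a library class with private fields (NOT the real LemmaTreeNode).
class StandInNode
{
    private string sCondition = "cteria";
    private bool bWholeWord = false;
}

static class PrivateFieldSketch
{
    // Reads a private instance field declared on the target's own type.
    public static T ReadField<T>(object target, string fieldName)
    {
        FieldInfo field = target.GetType().GetField(
            fieldName, BindingFlags.Instance | BindingFlags.NonPublic);
        if (field == null)
            throw new ArgumentException("No such private field: " + fieldName);
        return (T)field.GetValue(target);
    }

    static void Main()
    {
        StandInNode node = new StandInNode();
        Console.WriteLine(ReadField<string>(node, "sCondition"));
        Console.WriteLine(ReadField<bool>(node, "bWholeWord"));
    }
}
```

Reflection is slower than direct field access and breaks silently if a field is renamed, so for a one-off export changing the modifiers as in step 5 is probably the simpler route.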

RULE: i"" t""->"";
{:	RULE: i"#'" t"'"->"have";
	RULE: i"'E" t"'E"->"he";
	RULE: i"a" t""->"";
	{:	RULE: i"nda" t""->"";
		{:	RULE: i"oranda" t"a"->"um";
			RULE: i"enda" t"a"->"um";
			{:	RULE: i"genda" t""->"";
			:}
		:}
		RULE: i"ia" t""->"";
		{:	RULE: i"nnia" t"a"->"um";
			RULE: i"ria" t""->"";
			{:	RULE: i"teria" t""->"";
				{:	RULE: i"cteria" t"a"->"um";
					RULE: i"iteria" t"a"->"on";
				:}
				RULE: i"atoria" t"a"->"um";
			:}
			RULE: i"osia" t"a"->"um";
		:}
		RULE: i"ima" t""->"";
		{:	RULE: i"nima" t"a"->"um";
			RULE: i"xima" t"a"->"um";
		:}
		RULE: i"omena" t"a"->"on";
		RULE: i"zoa" t"a"->"on";
		RULE: i"ra" t""->"";
		{:	RULE: i"pora" t"ora"->"us";
			RULE: i"ctra" t"a"->"um";
		:}
		RULE: i"ta" t""->"";
		{:	RULE: i"rata" t"a"->"um";
			RULE: i"uanta" t"a"->"um";
		:}
		RULE: i"nua" t"a"->"um";
	:}
 
... continues ...

Image 2: First few lines from the output file
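
Judging from the code in Image 1, each line can be read like this: i"..." is the word ending that selects the rule (a leading # marks a whole-word match), and t"from"->"to" says that the trailing "from" is replaced by "to". The sketch below applies a few of the rules from Image 2 under that assumed reading. ApplyRule is a hypothetical helper, not part of LemmaSharp, and the example words are my own.

```csharp
using System;

static class RuleSketch
{
    // Hypothetical helper: removes the last fromLength characters of word and appends to.
    public static string ApplyRule(string word, int fromLength, string to)
    {
        if (fromLength > word.Length)
            return word; // the rule cannot apply, so the word is left unchanged
        return word.Substring(0, word.Length - fromLength) + to;
    }

    static void Main()
    {
        // i"cteria" t"a"->"um"
        Console.WriteLine(ApplyRule("bacteria", 1, "um"));  // bacterium
        // i"iteria" t"a"->"on"
        Console.WriteLine(ApplyRule("criteria", 1, "on"));  // criterion
        // i"pora" t"ora"->"us"
        Console.WriteLine(ApplyRule("corpora", 3, "us"));   // corpus
    }
}
```

A rule such as i"a" t""->"" then simply leaves the word as it is, which matches the empty "from" and "to" parts in the output.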