NovelEssay.com Programming Blog

Exploration of Big Data, Machine Learning, Natural Language Processing, and other fun problems.

Tesseract 4.0 C# .Net Wrapper Released!

This article is about the Tesseract 4.0 C# .Net Wrapper that is only a few days old as of April 2017.


You are probably familiar with the Tesseract 3.04 C# .Net Wrapper found here:

https://github.com/charlesw/tesseract

That is already available as a Nuget package and has many downloads.


Just about a week ago, an Alpha release of the Tesseract 4.0 C# .Net wrapper was published here:

https://github.com/tdhintz/tesseract4win64

This is an x64-only .Net assembly.


Find the Tesseract 4.0 language packs here:

https://github.com/tesseract-ocr/tessdata

When I load only the English language pack, it uses a reasonable 180MB of RAM. When I tried to load all languages, it used over 8GB of RAM.


This build is incredibly slow in debug mode. It runs 5-8X slower than in release mode, so watch out for that.


Amazingly, the .Net wrapper API works exactly the same as the Tesseract C# .Net 3.0 wrapper! (When you read about how much the engine has changed, including its move to LSTM networks, this will be even more amazing to you.)


A very simple usage example works like this:

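// Load the English language data from the tessdata folder.
var tessEngine = new TesseractEngine(tessdataPath, "eng");
// Page is disposable, so process the image inside a using block.
using (Page page = tessEngine.Process(myImage))
{
    string resultText = page.GetText();
}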


Be sure to drop these two files into an x64 sub-folder of your \bin\debug or \bin\release folder, like this:

.\bin\release\x64\libtesseract400.dll
.\bin\release\x64\liblept1741.dll

When the Tesseract.dll 4.0 assembly loads, it needs to find those DLLs; otherwise it will throw an exception in your application.
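
Since a missing native DLL only shows up as an exception at load time, it can help to check for the files yourself before creating the engine. Here is a minimal sketch of such a check. The NativeDllCheck class and FindMissing method are hypothetical helper names of my own, not part of the wrapper; the two file names are the ones listed above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of the Tesseract wrapper).
static class NativeDllCheck
{
    // Native DLL names taken from the paths listed in this article.
    static readonly string[] RequiredDlls = { "libtesseract400.dll", "liblept1741.dll" };

    // Returns the full paths of any required native DLLs that are
    // missing from the x64 sub-folder of baseDir.
    public static List<string> FindMissing(string baseDir)
    {
        var missing = new List<string>();
        foreach (string name in RequiredDlls)
        {
            string path = Path.Combine(baseDir, "x64", name);
            if (!File.Exists(path))
            {
                missing.Add(path);
            }
        }
        return missing;
    }
}
```

You could call NativeDllCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) at startup and log or throw with the list of missing paths, which gives a clearer message than the load failure.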


There is a very nice Accuracy and Performance overview report of 3.04 versus 4.0 here:

https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance

I generally agree with its findings, but my own tests do not show nearly as much improvement over 3.04. I have a regression test that contains about 2200 pages, and I'm observing plenty of slower and less precise OCR results with Tesseract 4.0. It is certainly not all "better and faster" as of April 2017. Since this is an extremely new Alpha release, I have high hopes that it will improve over time.
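
If you want to run this kind of comparison on your own pages, a small timing harness is enough to get started. The sketch below is not my actual regression test; OcrBenchmark, PageResult and Run are hypothetical names, and the OCR call is passed in as a delegate so the same harness can be pointed at either wrapper version.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical result record: one entry per OCR'd page.
public sealed class PageResult
{
    public string FilePath;
    public string Text;
    public long Milliseconds;
}

// Hypothetical harness (not part of the Tesseract wrapper).
public static class OcrBenchmark
{
    // Runs the supplied OCR delegate over every page and records
    // the recognized text and elapsed time for each one.
    public static List<PageResult> Run(IEnumerable<string> pagePaths, Func<string, string> ocr)
    {
        var results = new List<PageResult>();
        foreach (string path in pagePaths)
        {
            Stopwatch timer = Stopwatch.StartNew();
            string text = ocr(path);
            timer.Stop();
            results.Add(new PageResult
            {
                FilePath = path,
                Text = text,
                Milliseconds = timer.ElapsedMilliseconds
            });
        }
        return results;
    }
}

// Possible delegate, assuming the 4.0 wrapper keeps the 3.x Pix API:
//   path => {
//       using (var pix = Pix.LoadFromFile(path))
//       using (var page = tessEngine.Process(pix))
//           return page.GetText();
//   }
```

Run it once per wrapper version over the same pages, then compare the timings and diff the text against known-good output. Remember to measure a release build, given the debug-mode slowdown mentioned earlier.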