NovelEssay.com Programming Blog

Exploration of Big Data, Machine Learning, Natural Language Processing, and other fun problems.

Text Extraction using C# .Net and Apache Tika


You want to use C# to extract text from documents and web pages. You want the results to be high quality, and you want the tool to be free. Try the .Net wrapper for the Apache Tika library!


Let's build a sample app to show the use case. First, create a C# console application in Visual Studio. Then use the NuGet package manager to install the TikaOnDotNet.TextExtractor package.



Then, try this sample code. It shows text extraction from three kinds of sources: a byte array, a Url, and a file.

using System;
using System.Text;
using TikaOnDotNet.TextExtraction;

namespace TikaTest
{
    class Program
    {
        static void Main(string[] args)
        {

            TextExtractor textExtractor = new TextExtractor();

            // Fun Utf8 strings found here: http://www.columbia.edu/~fdc/utf8/
            string utf8InputString = @"It's a small village in eastern Lower Saxony. The ""oe"" in this case turns out to be the Lower Saxon ""lengthening e""(Dehnungs-e), which makes the previous vowel long (used in a number of Lower Saxon place names such as Soest and Itzehoe), not the ""e"" that indicates umlaut of the preceding vowel. Many thanks to the Óechtringen-Namenschreibungsuntersuchungskomitee (Alex Bochannek, Manfred Erren, Asmus Freytag, Christoph Päper, plus Werner Lemberg who serves as Óechtringen-Namenschreibungsuntersuchungskomiteerechtschreibungsprüfer) for their relentless pursuit of the facts in this case. Conclusion: the accent almost certainly does not belong on this (or any other native German) word, but neither can it be dismissed as dirt on the page. To add to the mystery, it has been reported that other copies of the same edition of the PLZB do not show the accent! UPDATE (March 2006): David Krings was intrigued enough by this report to contact the mayor of Ebstorf, of which Oechtringen is a borough, who responded:";
            // Convert string to byte array
            byte[] byteArrayInput = Encoding.UTF8.GetBytes(utf8InputString);
            // Text Extraction Example for Byte Array
            TextExtractionResult result = textExtractor.Extract(byteArrayInput);
            Console.WriteLine(result.Text);

            // Text Extraction Example for Uri:
            result = textExtractor.Extract(new Uri("http://blog.novelessay.com"));
            Console.WriteLine(result.Text);

            // Text Extraction Example for File
            result = textExtractor.Extract(@"c:\myPdf.pdf");
            Console.WriteLine(result.Text);

            // Note that result also has metadata collection and content type attributes
            //result.Metadata
            //result.ContentType
        }
    }
}

Notice that the TextExtractionResult has a Metadata collection and a ContentType attribute in addition to the extracted text. The metadata contains many things, including author, dates, keywords, title, and description.
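To see what comes back for a given document, you can print the content type and walk the metadata collection. The following is a minimal sketch, not a definitive implementation. It assumes the same TikaOnDotNet.TextExtractor package as the sample above, and it assumes that Metadata enumerates as string key/value pairs. The class name MetadataExample and the short input string are only for illustration.

```csharp
using System;
using System.Text;
using TikaOnDotNet.TextExtraction;

namespace TikaTest
{
    class MetadataExample
    {
        static void Main(string[] args)
        {
            TextExtractor textExtractor = new TextExtractor();

            // Any source works here; a short UTF-8 byte array avoids
            // depending on a file or a network connection.
            byte[] input = Encoding.UTF8.GetBytes("Hello from Tika.");
            TextExtractionResult result = textExtractor.Extract(input);

            // The content type that Tika detected for this input.
            Console.WriteLine("ContentType: " + result.ContentType);

            // Assumption: each metadata entry exposes a string Key and Value.
            foreach (var entry in result.Metadata)
            {
                Console.WriteLine(entry.Key + ": " + entry.Value);
            }
        }
    }
}
```

The keys you get depend on the document type, so a PDF or Word file will usually report far more entries than a plain string does.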


      

I've been very pleased with Tika's quality and ability to handle many different file types. I hope you try it out and enjoy it too.