NovelEssay.com Programming Blog

Exploration of Big Data, Machine Learning, Natural Language Processing, and other fun problems.

Extracting important snippets with C# and Log Likelihood


How does text summary software pick out the most important ideas to present?

One solution is to use Log Likelihood to build a summary from the sentences that contain the most important terms and cover the widest range of topics.


You can read about what the likelihood function is and why this approach works here: https://en.wikipedia.org/wiki/Likelihood_function


This article will focus on implementing it in C#. Here's the code I wrote as a translation from the Mahout version. I tried to make it as readable as possible rather than optimizing for performance.

// Log Likelihood code roughly translated from here:
// http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-math/0.3/org/apache/mahout/math/stats/LogLikelihood.java#LogLikelihood.logLikelihoodRatio%28int%2Cint%2Cint%2Cint%29
private static double ShannonEntropy(List<Int64> elements)
{
    double sum = 0;
    foreach (Int64 element in elements)
    {
        sum += element;
    }
    double result = 0.0;
    foreach (Int64 element in elements)
    {
        if (element < 0)
        {
            throw new ArgumentException("Should not have negative count for entropy computation (" + element + ")");
        }
        // zeroFlag keeps a zero count from producing Math.Log(0);
        // a zero count contributes nothing to the sum either way.
        int zeroFlag = (element == 0 ? 1 : 0);
        result += element * Math.Log((element + zeroFlag) / sum);
    }
    return result;
}
/*
    Calculate the raw log-likelihood ratio for two events, call them A and B.
    The 2x2 contingency table is:

                          Event A                    Everything but A
    Event B               A and B together (k_11)    B, but not A (k_12)
    Everything but B      A without B (k_21)         Neither A nor B (k_22)

    Parameters:
    k11 The number of times the two events occurred together
    k12 The number of times the second event occurred without the first event
    k21 The number of times the first event occurred without the second event
    k22 The number of times something else occurred (i.e. was neither of these events)
*/
public static double LogLikelihoodRatio(Int64 k11, Int64 k12, Int64 k21, Int64 k22)
{
    double rowEntropy = ShannonEntropy(new List<Int64>() { k11, k12 }) + ShannonEntropy(new List<Int64>() { k21, k22 });
    double columnEntropy = ShannonEntropy(new List<Int64>() { k11, k21 }) + ShannonEntropy(new List<Int64>() { k12, k22 });
    double matrixEntropy = ShannonEntropy(new List<Int64>() { k11, k12, k21, k22 });
    // Note the sign convention of this translation: strongly associated events
    // produce large negative scores, while independent events score near zero.
    // That is why the sentence-scoring step below checks for results below -10.
    return 2 * (matrixEntropy - rowEntropy - columnEntropy);
}

Now, we have a simple LogLikelihoodRatio function we can call with 4 parameters and get the score result.
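As a quick sanity check, here are a couple of calls with made-up counts. They illustrate the sign convention noted in the comment above: strong association gives a large negative score, while independent counts give a score near zero.

// Hypothetical counts, just to illustrate the sign convention.
double associated = LogLikelihoodRatio(10, 0, 0, 10);  // roughly -27.7 (strong association)
double independent = LogLikelihoodRatio(5, 5, 5, 5);   // 0.0 (no association)
Console.WriteLine(associated + " " + independent);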


Let's say we want to pick out the most important sentences from a particular Wikipedia article in order to summarize it. (See this article for loading Wikipedia into ElasticSearch: http://blog.novelessay.com/post/loading-wikipedia-in-to-elasticsearch)

Follow these steps:

  1. Pick a Wikipedia article.
  2. Get a Term Frequency dictionary for the whole article.
  3. Parse the article into sentences.
  4. For each token in each sentence, calculate the Log Likelihood score with the above LogLikelihoodRatio function.
  5. If the result of LogLikelihoodRatio is less than -10, add +1 to that sentence's weight value.
  6. At the end of each sentence, you have an accumulated weight value. Normalize it by the number of words in the sentence.
  7. After you've computed the weight score from #6 for all of the sentences in an article, you can sort them and pick the most important ones.
For extra credit, you'll want to avoid redundant important sentences. To do that, score each candidate sentence against the summary output you've already built as you add sentences to it.


Here's some code with comments about populating the input values passed to the LogLikelihoodRatio function (a fuller sketch of the whole sentence-scoring loop follows below). Be sure to check that the result score is less than -10 before adding the +1 weight.

// http://www.cs.columbia.edu/~gmw/candidacy/LinHovy00.pdf - Section 4.1
Int64 k11 = /* frequency of the current term in this article */;
Int64 k12 = /* frequency of the current term in all of Wikipedia - k11 */;
Int64 k21 = /* total count of all terms in this article - k11 */;
Int64 k22 = /* total count of all terms in Wikipedia - k11 - k12 - k21 */;
double termWeight = LogLikelihoodRatio(k11, k12, k21, k22);

if (termWeight < -10)
{
    weightSum++;
}

Obviously, you don't want to be calculating Term Frequency across all of Wikipedia on the fly in the above. k11 and k21 get calculated as you process an article, but k12 and k22 should be calculated in advance and cached in a lookup dictionary.
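Putting steps 3 through 7 together, here's a rough sketch of a per-sentence scoring loop. It assumes the LogLikelihoodRatio function above; Tokenize, SplitSentences, and the corpus-wide count parameters are hypothetical placeholders for whatever tokenizer, sentence splitter, and precomputed cache you actually use.

// Rough sketch of steps 3 through 7. Tokenize(), SplitSentences(),
// corpusTermFrequency, and corpusTotalTerms are hypothetical placeholders.
// Requires: using System; using System.Collections.Generic; using System.Linq;
static List<Tuple<string, double>> ScoreSentences(
    string articleText,
    Dictionary<string, long> corpusTermFrequency, // term -> frequency across all of Wikipedia (precomputed)
    long corpusTotalTerms)                        // total term count across all of Wikipedia (precomputed)
{
    // Step 2: Term Frequency dictionary for the whole article
    var articleTermFrequency = new Dictionary<string, long>();
    long articleTotalTerms = 0;
    foreach (string token in Tokenize(articleText))
    {
        long count;
        articleTermFrequency.TryGetValue(token, out count);
        articleTermFrequency[token] = count + 1;
        articleTotalTerms++;
    }

    var scored = new List<Tuple<string, double>>();
    foreach (string sentence in SplitSentences(articleText)) // step 3
    {
        List<string> tokens = Tokenize(sentence).ToList();
        double weightSum = 0;
        foreach (string token in tokens) // step 4
        {
            long k11 = articleTermFrequency.ContainsKey(token) ? articleTermFrequency[token] : 1;
            long corpusCount = corpusTermFrequency.ContainsKey(token) ? corpusTermFrequency[token] : k11;
            long k12 = corpusCount - k11;
            long k21 = articleTotalTerms - k11;
            long k22 = corpusTotalTerms - k11 - k12 - k21;
            if (LogLikelihoodRatio(k11, k12, k21, k22) < -10) // step 5
            {
                weightSum++;
            }
        }
        double normalizedWeight = tokens.Count > 0 ? weightSum / tokens.Count : 0.0; // step 6
        scored.Add(Tuple.Create(sentence, normalizedWeight));
    }
    // Step 7: highest weight first
    return scored.OrderByDescending(s => s.Item2).ToList();
}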


I use LevelDb as my Term Frequency lookup dictionary. You can read about using that here: http://blog.novelessay.com/post/fast-persistent-key-value-pairs-in-c-with-leveldb


In order to build your Term Frequency lookup dictionary cache, you could process each document and create your own term frequency output, or use the ElasticSearch plugin for _termList here: https://github.com/jprante/elasticsearch-index-termlist
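If you go the do-it-yourself route, the cache can be as simple as aggregating token counts across all documents and persisting the result. Here's a minimal sketch, where GetAllDocumentTexts and Tokenize are hypothetical placeholders:

// Minimal sketch of building a corpus-wide Term Frequency cache.
// GetAllDocumentTexts() and Tokenize() are hypothetical placeholders.
var corpusTermFrequency = new Dictionary<string, long>();
long corpusTotalTerms = 0;
foreach (string text in GetAllDocumentTexts())
{
    foreach (string token in Tokenize(text))
    {
        long count;
        corpusTermFrequency.TryGetValue(token, out count);
        corpusTermFrequency[token] = count + 1;
        corpusTotalTerms++;
    }
}
// Persist corpusTermFrequency (e.g. to LevelDb) so it only has to be built once.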



Querying Wikipedia in ElasticSearch with C# Nest client



This article assumes that you've already loaded the Wikipedia articles into your local ElasticSearch instance. If you haven't, follow the instructions in this previous article on loading ElasticSearch with the entire content of Wikipedia:

http://blog.novelessay.com/post/loading-wikipedia-in-to-elasticsearch


Start a Visual Studio C# console application project, and install the ElasticSearch Nest NuGet package.


In your code, create a Nest ElasticClient instance that is configured for your ElasticSearch instance. We are using localhost:9200 and the index named "mywiki" as the location of our Wikipedia data. 

var node = new Uri("http://localhost:9200");
var settings = new ConnectionSettings(
    node,
    defaultIndex: "mywiki"
).SetTimeout(int.MaxValue);
ElasticClient esClient = new ElasticClient(settings);


The Wikipedia index schema has a particular field format. We'll need a Page class like this for Nest to map the fields into:

public class Page
{
    public List<string> category { get; set; }
    public bool special { get; set; }
    public string title { get; set; }
    public bool stub { get; set; }
    public bool disambiguation { get; set; }
    public List<string> link { get; set; }
    public bool redirect { get; set; }
    public string text { get; set; }
}


Now, we can start querying our Wikipedia ElasticSearch index using our Nest client. Here's a simple example that pulls down the first 10 Wikipedia articles:

var result = esClient.Search<Page>(s => s
    .From(0)
    .Size(10)
    .MatchAll()
    );

You can check the response for errors and loop through the Page hits like this:

if (result.IsValid)
{
    foreach (var page in result.Hits)
    {
        // page.Source.text contains the Wikipedia article text
    }
}

After this, you can loop through all Wikipedia documents by changing the arguments passed to From and Size in the ElasticSearch query call.
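Here's a rough sketch of that paging pattern, in the same NEST 1.x style as the examples above. For a full pass over millions of articles the scan/scroll API is more efficient, but From/Size is the simplest place to start.

// Sketch: page through the index in blocks of 100 using From/Size.
// Assumes the esClient and Page class defined above, and `using System.Linq;`.
int pageSize = 100;
for (int from = 0; ; from += pageSize)
{
    var page = esClient.Search<Page>(s => s
        .From(from)
        .Size(pageSize)
        .MatchAll()
    );
    if (!page.IsValid || !page.Hits.Any())
        break;
    foreach (var hit in page.Hits)
    {
        // hit.Source.text contains the Wikipedia article text
    }
}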


Here's a query example that emulates a Google-like search via a QueryString. Notice the use of Operator.And. I suggest you change it to Operator.Or and observe the effect on your results.

var result = esClient.Search<Page>(s => s
    .Take(10)
    .Query(q => q
        .QueryString(p => p.Query("cats dogs birds").DefaultOperator(Operator.And))
    )
);


If you're ready to start getting fancy, you can write a function that builds a Nest SearchDescriptor based on your query criteria, then use that SearchDescriptor in your query to ElasticSearch. I wanted to search Wikipedia without getting redirect results, so in the example below I set some ignore options that exclude #redirect terms from my search descriptors.

public static SearchDescriptor<Page> GetDocumentSearchDescriptorFromSearchParameters(string queryString, bool queryAnd, string ignoreQuery)
{
    string ignoreA = "#redirect";
    string ignoreB = "redirect";

    var searchDescriptor = new SearchDescriptor<Page>()
            .Query(q =>
                q.QueryString(p => p.Query(queryString).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
                && !q.Term(p => p.text, ignoreA)
                && !q.Term(p => p.text, ignoreB)
                && !q.QueryString(p => p.Query(ignoreQuery).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
            );
    return searchDescriptor;
}
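Once you have a descriptor, one way to use it with the NEST client is to return it from the search selector. Here's a minimal usage sketch, where the query strings are just example values:

// Build a descriptor that requires both "cats" and "dogs" while ignoring
// redirect pages and an extra example ignore query, then run the search.
var descriptor = GetDocumentSearchDescriptorFromSearchParameters("cats dogs", true, "list of");
var result = esClient.Search<Page>(s => descriptor);
if (result.IsValid)
{
    foreach (var hit in result.Hits)
    {
        Console.WriteLine(hit.Source.title);
    }
}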

If this SearchDescriptor example is a little confusing, stay tuned for a future article I intend to write on clustering Wikipedia with ElasticSearch. In the meantime, you should be set up to query your Wikipedia ElasticSearch index with the C# Nest client.



Loading Wikipedia into ElasticSearch



This article gives instructions for loading Wikipedia articles into ElasticSearch. I did this on Windows, but all of these steps should work on any Java-friendly platform.
  1. Download ElasticSearch
  2. Download stream2es
  3. Download Wikipedia articles
  4. Start ElasticSearch
  5. Run stream2es


Download ElasticSearch

Go to Elastic.co and download ElasticSearch here: https://www.elastic.co/downloads/elasticsearch
Download and unzip ElasticSearch into a folder of your choice.

Download stream2es

Go here and download the stream2es java jar file: http://download.elasticsearch.org/stream2es/stream2es
Optional: See stream2es on github for options: https://github.com/elastic/stream2es

Download Wikipedia articles

Go here and download the Wikipedia article archive: https://dumps.wikimedia.org/enwiki/latest/
There are many options, but the specific one I downloaded was this: enwiki-latest-pages-articles.xml.bz2
(It's over 12GB, so be sure you have plenty of disk space.)

Start ElasticSearch

I'm on Windows, so I opened a command window and ran this: 
c:\elasticsearch-1.5.2\bin\elasticsearch.bat
That starts up your local ElasticSearch instance at localhost:9200

Run stream2es

  • Move the stream2es file to your ElasticSearch bin folder. I put stream2es here c:\elasticsearch-1.5.2\bin\
  • Move the Wikipedia archive (enwiki-latest-pages-articles.xml.bz2) to your ElasticSearch bin folder too.
  • Run the stream2es jar file: 
C:\elasticsearch-1.5.2\bin>java -jar stream2es wiki --target http://localhost:9200/mywiki --log debug --source /enwiki-latest-pages-articles.xml.bz2

Notes:
  1. You can change "mywiki" to whatever you want your ElasticSearch index name to be.
  2. I had some trouble getting stream2es to find my Wikipedia archive path on Windows, but putting a / in front of the file name worked.



I ran this all locally on my Windows desktop, and it took 6-8 hours. It appeared to be locked up near the end, but it did eventually exit.

Now you should have over 16 million Wikipedia articles loaded into your local ElasticSearch index. Enjoy.
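If you want to sanity-check the load, one option is to count the documents in the index with the Nest client setup shown in the querying article above. Here's a quick sketch, assuming the esClient and Page class from that article:

// Sketch: count documents in the "mywiki" index using the Nest client
// configured earlier (http://localhost:9200, default index "mywiki").
var countResponse = esClient.Count<Page>(c => c.Query(q => q.MatchAll()));
if (countResponse.IsValid)
{
    Console.WriteLine("Documents in mywiki: " + countResponse.Count);
}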

I plan on doing future articles on using this Wikipedia data for machine learning, natural language processing, and topic clustering.