Extracting important snippets with C# and Log Likelihood


How does text summarization software pick out the most important ideas to present?

One solution is to use Log Likelihood to build a summary from the sentences that contain the most important terms and cover the widest range of topics.

You can read about what Log Likelihood is and why it works here: https://en.wikipedia.org/wiki/Likelihood_function

This article will focus on implementing it in C#. Here’s the code I wrote as a translation from the Mahout version. I tried to make it as readable as possible rather than optimizing for performance.

// Log Likelihood code roughly translated from here:
// http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-math/0.3/org/apache/mahout/math/stats/LogLikelihood.java#LogLikelihood.logLikelihoodRatio%28int%2Cint%2Cint%2Cint%29
static private double ShannonEntropy(List<Int64> elements)
{
    // Sum the counts so each count can be converted to a proportion.
    double sum = 0;
    foreach (Int64 element in elements)
    {
        sum += element;
    }
    double result = 0.0;
    foreach (Int64 element in elements)
    {
        if (element < 0)
        {
            throw new ArgumentException("Should not have negative count for entropy computation (" + element + ")");
        }
        // zeroFlag avoids taking the log of zero; a zero count contributes nothing.
        int zeroFlag = (element == 0 ? 1 : 0);
        result += element * Math.Log((element + zeroFlag) / sum);
    }
    // Note: this returns the un-negated sum of count * log(proportion), so the
    // result is always zero or negative.
    return result;
}
/*
    Calculate the raw log-likelihood ratio for two events, call them A and B.
    The 2x2 contingency table of counts is:

                            Event A                    Everything but A
        Event B             A and B together (k_11)    B, but not A (k_12)
        Everything but B    A without B (k_21)         Neither A nor B (k_22)

    Parameters:
    k11 The number of times the two events occurred together
    k12 The number of times the second event occurred WITHOUT the first event
    k21 The number of times the first event occurred WITHOUT the second event
    k22 The number of times something else occurred (i.e. was neither of these events)

    Note: because ShannonEntropy above omits the usual negation, this function
    returns the negative of the standard log-likelihood ratio. Scores are zero or
    below, and a more significant association produces a more strongly negative
    score, which is why the threshold used later in this article is "less than -10".
*/
static public double LogLikelihoodRatio(Int64 k11, Int64 k12, Int64 k21, Int64 k22)
{
    double rowEntropy = ShannonEntropy(new List<Int64>() { k11, k12 }) + ShannonEntropy(new List<Int64>() { k21, k22 });
    double columnEntropy = ShannonEntropy(new List<Int64>() { k11, k21 }) + ShannonEntropy(new List<Int64>() { k12, k22 });
    double matrixEntropy = ShannonEntropy(new List<Int64>() { k11, k12, k21, k22 });
    return 2 * (matrixEntropy - rowEntropy - columnEntropy);
}

Now we have a simple LogLikelihoodRatio function that we can call with four counts to get a score.
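
As a quick sanity check, here’s a tiny usage example with made-up counts (the numbers are purely illustrative) for a term that shows up often in one article but rarely in the rest of a corpus:

// Purely illustrative counts: a term occurring 25 times in a 1,000-token article
// and only 40 more times across a 2,000,000-token corpus.
Int64 k11 = 25;                        // term in this article
Int64 k12 = 40;                        // term in the rest of the corpus
Int64 k21 = 1000 - k11;                // other terms in this article
Int64 k22 = (2000000 - 1000) - k12;    // other terms in the rest of the corpus
double score = LogLikelihoodRatio(k11, k12, k21, k22);
// With the sign convention above, a strongly associated term like this one yields
// a large negative score (well below -10); an unremarkable term stays near zero.
Console.WriteLine(score);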

Let’s say we want to pick out the most important sentences from a particular Wikipedia article in order to summarize it. (See this article for loading Wikipedia into ElasticSearch: http://blog.novelessay.com/post/loading-wikipedia-in-to-elasticsearch)

Follow these steps (a sketch of the scoring loop follows the list):

  1. Pick a Wikipedia article.
  2. Get a Term Frequency dictionary for the whole article.
  3. Parse the article into sentences.
  4. For each token in each sentence, calculate the Log Likelihood score with the above LogLikelihoodRatio function.
  5. If the result of LogLikelihoodRatio is less than -10, add +1 to that sentence’s weight value.
  6. At the end of each sentence you have a weight value, which can be normalized by the number of words in the sentence.
  7. After you’ve obtained the weight score from #6 for all of the sentences in an article, you can sort them and pick the most important ones.
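
Here’s a rough sketch of that loop. Tokenize and SplitSentences are hypothetical helpers you’d supply yourself, and wikiTermFrequency / wikiTotalTerms are the Wikipedia-wide counts discussed later in this article (assumes using System.Linq and System.Collections.Generic):

static List<KeyValuePair<string, double>> RankSentences(
    string articleText,
    Dictionary<string, Int64> wikiTermFrequency,  // term -> count across all of Wikipedia
    Int64 wikiTotalTerms)                         // total tokens across all of Wikipedia
{
    // Step 3: parse the article into sentences and tokenize each one once.
    List<string> sentenceTexts = SplitSentences(articleText).ToList();
    List<List<string>> sentenceTokens = sentenceTexts.Select(s => Tokenize(s).ToList()).ToList();

    // Step 2: term frequency dictionary for the whole article, built from the same tokens.
    var articleTermFrequency = new Dictionary<string, Int64>();
    Int64 articleTotalTerms = 0;
    foreach (List<string> tokens in sentenceTokens)
    {
        foreach (string token in tokens)
        {
            articleTermFrequency[token] = articleTermFrequency.TryGetValue(token, out Int64 count) ? count + 1 : 1;
            articleTotalTerms++;
        }
    }

    var rankedSentences = new List<KeyValuePair<string, double>>();
    for (int i = 0; i < sentenceTexts.Count; i++)
    {
        double weight = 0;
        foreach (string token in sentenceTokens[i])                // Step 4
        {
            Int64 k11 = articleTermFrequency[token];
            Int64 wikiCount = wikiTermFrequency.TryGetValue(token, out Int64 w) ? w : k11;
            Int64 k12 = wikiCount - k11;
            Int64 k21 = articleTotalTerms - k11;
            Int64 k22 = (wikiTotalTerms - articleTotalTerms) - k12;
            if (LogLikelihoodRatio(k11, k12, k21, k22) < -10)      // Step 5
            {
                weight += 1;
            }
        }
        if (sentenceTokens[i].Count > 0)
        {
            // Step 6: normalize the weight by the number of words in the sentence.
            rankedSentences.Add(new KeyValuePair<string, double>(sentenceTexts[i], weight / sentenceTokens[i].Count));
        }
    }

    // Step 7: most important sentences first.
    return rankedSentences.OrderByDescending(pair => pair.Value).ToList();
}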

For extra credit, you’ll want to avoid redundant sentences in the summary. To do that, score each candidate sentence against the summary you’ve built so far before adding it; a simple version is sketched below.
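
One simple way to do that (just a sketch, using the ranked list and the same hypothetical Tokenize helper from above): greedily pick sentences, discounting each candidate by the fraction of its tokens the summary already covers.

static List<string> PickSummarySentences(List<KeyValuePair<string, double>> rankedSentences, int sentenceCount)
{
    var summary = new List<string>();
    var coveredTokens = new HashSet<string>();
    while (summary.Count < sentenceCount)
    {
        string bestSentence = null;
        double bestScore = double.MinValue;
        foreach (KeyValuePair<string, double> candidate in rankedSentences)
        {
            if (summary.Contains(candidate.Key))
            {
                continue;
            }
            List<string> tokens = Tokenize(candidate.Key).ToList();
            double overlap = tokens.Count == 0 ? 0 : tokens.Count(coveredTokens.Contains) / (double)tokens.Count;
            double adjustedScore = candidate.Value * (1.0 - overlap);  // penalize redundancy
            if (adjustedScore > bestScore)
            {
                bestScore = adjustedScore;
                bestSentence = candidate.Key;
            }
        }
        if (bestSentence == null)
        {
            break;
        }
        summary.Add(bestSentence);
        foreach (string token in Tokenize(bestSentence))
        {
            coveredTokens.Add(token);
        }
    }
    return summary;
}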


Here’s some code with comments showing how to populate the input values passed to the LogLikelihoodRatio function. Be sure to check that the result score is less than -10 before adding +1 to the weight.

// http://www.cs.columbia.edu/~gmw/candidacy/LinHovy00.pdf - Section 4.1
// term is the current token being scored; articleTermFrequency, wikiTermFrequency,
// articleTotalTerms, and wikiTotalTerms are placeholder lookups/counters you
// populate from your own term frequency data.
Int64 k11 = articleTermFrequency[term];                  // frequency of current term in this article
Int64 k12 = wikiTermFrequency[term] - k11;               // frequency of current term in all of Wikipedia, minus k11
Int64 k21 = articleTotalTerms - k11;                     // total count of all terms in this article, minus k11
Int64 k22 = (wikiTotalTerms - articleTotalTerms) - k12;  // total count of all terms in the rest of Wikipedia, minus k12
double termWeight = LogLikelihoodRatio(k11, k12, k21, k22);

if (termWeight < -10)
{
    weightSum++;
}

Obviously, in the above you don’t want to be calculating Term Frequency across Wikipedia on-the-fly. k11 and k21 get calculated as you process an article, but the Wikipedia-wide counts behind k12 and k22 should be calculated in advance and cached in a lookup dictionary.
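
To make the caching idea concrete, here’s a minimal in-memory sketch of the one-time build; EnumerateAllWikipediaDocuments is a hypothetical helper, and in practice you’d write the counts out to a persistent store like the one described next rather than keep them in memory:

// One-time build of the Wikipedia-wide term frequency cache.
var wikiTermFrequency = new Dictionary<string, Int64>();
Int64 wikiTotalTerms = 0;
foreach (string document in EnumerateAllWikipediaDocuments())
{
    foreach (string token in Tokenize(document))
    {
        wikiTermFrequency[token] = wikiTermFrequency.TryGetValue(token, out Int64 count) ? count + 1 : 1;
        wikiTotalTerms++;
    }
}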

I use LevelDb as my Term Frequency lookup dictionary. You can read about using it here: http://blog.novelessay.com/post/fast-persistent-key-value-pairs-in-c-with-leveldb

To build your Term Frequency lookup dictionary cache, you could process each document and create your own term frequency output, or use the ElasticSearch plugin for _termList here: https://github.com/jprante/elasticsearch-index-termlist

Using C# Nest with ElasticSearch Carrot2 Clustering Plugin


ElasticSearch Carrot2 Clustering Plugin:

https://github.com/carrot2/elasticsearch-carrot2

This article will walk you through setting up ElasticSearch and Carrot2 clustering, so that you can implement something awesome like clustering topics on the publicly available Hillary Clinton email data set. Then you can use FoamTree to display the resulting clusters visually.

On to the example!

Let’s say we want to get clusters for our Wikipedia index that we have loaded into ElasticSearch, and we also want to be able to get clusters based on queries.

First, we’ll want to build a SearchDescriptor based on query input. Let’s just use a simple query string example for now. Here’s example code to build a SearchDescriptor (which uses a special ignore “redirect” string that is custom for our Wikipedia index):

public static SearchDescriptor<Page> GetDocumentSearchDescriptorFromSearchParameters(string queryString, bool queryAnd, string ignoreQuery)
{
	string ignoreA = "#redirect";
	string ignoreB = "redirect";

	var searchDescriptor = new SearchDescriptor<Page>()
			.Query(q =>
				q.QueryString(p => p.Query(queryString).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
				&& !q.Term(p => p.text, ignoreA)
				&& !q.Term(p => p.text, ignoreB)
				&& !q.QueryString(p => p.Query(ignoreQuery).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
			);
	return searchDescriptor;
}

Next, we build the ElasticSearch client from its connection settings, add the fields, size, and from values to the search descriptor, and then use the NEST client’s serializer to convert the request to JSON:

public static EsCarrotClusters GetSearchCarrotClusters(string esUrl, string queryString, int from, int size, bool queryAnd, string ignoreQuery)
{
	ConnectionSettings Settings = new ConnectionSettings(new Uri(esUrl));
	ElasticClient Client = new ElasticClient(Settings);
	var searchDescriptor = GetDocumentSearchDescriptorFromSearchParameters(queryString, queryAnd, ignoreQuery);
	searchDescriptor.Fields(f => f.Add("text"));
	searchDescriptor.Size(size);
	searchDescriptor.From(from);

	var jsonQuery = Encoding.UTF8.GetString(Client.Serializer.Serialize(searchDescriptor));
	jsonQuery = jsonQuery.Replace("\"query\": {},", "");
	jsonQuery = "{ \"search_request\" : " + jsonQuery;
	jsonQuery += ", \"query_hint\" : \"";
	jsonQuery += queryString == null ? "" : queryString;
	jsonQuery += "\",\"field_mapping\":{\"title\":[\"fields.text\"],\"content\":[]}}";

	string esClusterQueryRequestJson = jsonQuery;

	EsCarrotClusters clusters = null;

	string json = GetEsJsonFromAPI(esUrl, "_search_with_clusters", "", esClusterQueryRequestJson);
	if (json.Length > 0)
	{
		try
		{
			clusters = JsonConvert.DeserializeObject<EsCarrotClusters>(json);
		}
		catch (JsonException)
		{
			// Deserialization failed; clusters stays null.
		}
	}
	return clusters;
}

Example calling code that uses “cats and dogs” as a query string input:

EsCarrotClusters esCarrotClusters = EsHttpWebRequestApi.GetSearchCarrotClusters("http://localhost:9200/mywiki", "cats and dogs", 0, 10, true, "");

Special Note:
The GetEsJsonFromAPI function simply does an HttpWebRequest POST to the ElasticSearch URI with the JSON content written to the request stream like this:
            using (var streamWriter = new StreamWriter(request.GetRequestStream()))
            {
                streamWriter.Write(esRequestJson);
                streamWriter.Flush();
                streamWriter.Close();
            }
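
For reference, here’s a minimal version of such a helper; the exact signature is inferred from the call above, and it assumes using System.Net and System.IO:

private static string GetEsJsonFromAPI(string esUrl, string endpoint, string extra, string esRequestJson)
{
    string uri = esUrl.TrimEnd('/') + "/" + endpoint + extra;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
    request.Method = "POST";
    request.ContentType = "application/json";

    // Write the JSON request body to the request stream.
    using (var streamWriter = new StreamWriter(request.GetRequestStream()))
    {
        streamWriter.Write(esRequestJson);
    }

    // Read the JSON response back as a string.
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}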

Lastly, you’ll want to see my EsCarrotClusters classes, so you can deserialize the HttpWebRequest’s response back to a C#-friendly object. Enjoy:

    public class Shards
    {
        public int total { get; set; }
        public int successful { get; set; }
        public int failed { get; set; }
    }

    public class Fields
    {
        public List<string> text { get; set; }
    }

    public class Hit
    {
        public string _index { get; set; }
        public string _type { get; set; }
        public string _id { get; set; }
        public double _score { get; set; }
        public Fields fields { get; set; }
    }

    public class Hits
    {
        public int total { get; set; }
        public double max_score { get; set; }
        public List<Hit> hits { get; set; }
    }

    public class Cluster
    {
        public int id { get; set; }
        public double score { get; set; }
        public string label { get; set; }
        public List<string> phrases { get; set; }
        public List<string> documents { get; set; }
        public bool? other_topics { get; set; }
    }

    public class Info
    {
        public string algorithm { get; set; }
        [JsonProperty("search-millis")]
        public string searchmillis { get; set; }
        [JsonProperty("clustering-millis")]
        public string clusteringmillis { get; set; }
        [JsonProperty("total-millis")]
        public string totalmillis { get; set; }
        [JsonProperty("include-hits")]
        public string includehits { get; set; }
        [JsonProperty("max-hits")]
        public string maxhits { get; set; }
    }

    public class EsCarrotClusters
    {
        public int took { get; set; }
        public bool timed_out { get; set; }
        public Shards _shards { get; set; }
        public Hits hits { get; set; }
        public List<Cluster> clusters { get; set; }
        public Info info { get; set; }
    }


After I run this against my Wikipedia ElasticSearch index for “cats and dogs”, I get clusters like these:

  • Polynueropath in Dogs and Cats
  • Album Cats
  • Canine
  • Domestic Cats
  • Missing Disease

Notice that each cluster also comes with a score property, which you can use to weight topics or to display them differently to the user.
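
For example, a quick loop over the deserialized result (using the classes above) can list each cluster’s label, score, and document count:

// List each cluster's label, score, and document count from the deserialized result.
foreach (Cluster cluster in esCarrotClusters.clusters)
{
    int documentCount = cluster.documents == null ? 0 : cluster.documents.Count;
    Console.WriteLine(cluster.label + " (score: " + cluster.score + ", documents: " + documentCount + ")");
}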