Using C# Nest with ElasticSearch Carrot2 Clustering Plugin


ElasticSearch Carrot2 Clustering Plugin:

https://github.com/carrot2/elasticsearch-carrot2

This article walks you through setting up ElasticSearch and Carrot2 clustering, so that you can build something awesome like topic clustering on the publicly available Hillary Clinton email data set, and then display the clusters visually with FoamTree.

On to the example!

Let’s say we have loaded a Wikipedia index into ElasticSearch and want to get clusters for it, including clusters based on the results of a query.

First, we’ll want to build a SearchDescriptor based on query input. Let’s just use a simple query string example for now. Here’s example code to build a SearchDescriptor (it also excludes “redirect” terms, a filter that is specific to our Wikipedia index):

public static SearchDescriptor<Page> GetDocumentSearchDescriptorFromSearchParameters(string queryString, bool queryAnd, string ignoreQuery)
{
	string ignoreA = "#redirect";
	string ignoreB = "redirect";

	var searchDescriptor = new SearchDescriptor<Page>()
			.Query(q =>
				q.QueryString(p => p.Query(queryString).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
				&& !q.Term(p => p.text, ignoreA)
				&& !q.Term(p => p.text, ignoreB)
				&& !q.QueryString(p => p.Query(ignoreQuery).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
			);
	return searchDescriptor;
}

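Next, we build the ElasticSearch request: create the Nest client from the connection URL, set the size and from values, and use the Nest client serializer to convert the search descriptor to JSON. That JSON is then wrapped in the request format the clustering plugin expects: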

public static EsCarrotClusters GetSearchCarrotClusters(string esUrl, string queryString, int from, int size, bool queryAnd, string ignoreQuery)
{
	ConnectionSettings Settings = new ConnectionSettings(new Uri(esUrl));
	ElasticClient Client = new ElasticClient(Settings);
	var searchDescriptor = GetDocumentSearchDescriptorFromSearchParameters(queryString, queryAnd, ignoreQuery);
	searchDescriptor.Fields(f => f.Add("text"));
	searchDescriptor.Size(size);
	searchDescriptor.From(from);

	var jsonQuery = Encoding.UTF8.GetString(Client.Serializer.Serialize(searchDescriptor));
	jsonQuery = jsonQuery.Replace("\"query\": {},", "");
	jsonQuery = "{ \"search_request\" : " + jsonQuery;
	// JsonConvert.ToString quotes and escapes the hint, so quotes in the query cannot break the JSON.
	jsonQuery += ", \"query_hint\" : " + JsonConvert.ToString(queryString ?? "");
	jsonQuery += ",\"field_mapping\":{\"title\":[\"fields.text\"],\"content\":[]}}";

	string esClusterQueryRequestJson = jsonQuery;

	EsCarrotClusters clusters = null;

	string json = GetEsJsonFromAPI(esUrl, "_search_with_clusters", "", esClusterQueryRequestJson);
	if (json.Length > 0)
	{
		try
		{
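			clusters = JsonConvert.DeserializeObject<EsCarrotClusters>(json);
		}
		catch (JsonException)
		{
			// Assumed completion (the original listing breaks off above): return null on malformed JSON.
			clusters = null;
		}
	}
	return clusters;
}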
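
The request format is easier to see in isolation. Here is a minimal sketch that assembles the same three-part body (search_request, query_hint, field_mapping) in a hypothetical helper named ClusterRequestBuilder. It is not part of the original code, and it uses System.Text.Json instead of Json.NET only so that it compiles without extra packages:

```csharp
using System;
using System.Text.Json;

// Hypothetical helper, not from the original article.
public static class ClusterRequestBuilder
{
    // Wraps an already-serialized search body in the clustering request:
    // { "search_request": ..., "query_hint": "...", "field_mapping": {...} }
    public static string Build(string searchRequestJson, string queryHint)
    {
        // Serializing a string yields a quoted, escaped JSON string,
        // so quotes or backslashes in the hint stay valid JSON.
        string hint = JsonSerializer.Serialize(queryHint ?? "");

        return "{\"search_request\":" + searchRequestJson
             + ",\"query_hint\":" + hint
             + ",\"field_mapping\":{\"title\":[\"fields.text\"],\"content\":[]}}";
    }
}
```

Calling ClusterRequestBuilder.Build(jsonQuery, queryString) would produce the same kind of body as the concatenation in the function above.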

Example calling code that uses “cats and dogs” as a query string input:

EsCarrotClusters esCarrotClusters = EsHttpWebRequestApi.GetSearchCarrotClusters("http://localhost:9200/mywiki", "cats and dogs", 0, 10, true, "");

Special Note:
The GetEsJsonFromAPI function simply does an HttpWebRequest POST to the ElasticSearch URI, with the JSON content written to the request stream like this:
            using (var streamWriter = new StreamWriter(request.GetRequestStream()))
            {
                streamWriter.Write(esRequestJson);
                streamWriter.Flush();
                streamWriter.Close();
            }
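
Since only the stream-writing part is shown, here is a rough sketch of what the whole helper could look like. The class name EsHttpSketch, the parameter names, and the reading of the third argument as an optional URL query string are all guesses on my part. Returning an empty string on failure is also an assumption, chosen because the caller checks json.Length > 0:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Sketch only: names and the meaning of queryString are assumptions.
public static class EsHttpSketch
{
    // Joins the index URL and the API name, for example
    // "http://localhost:9200/mywiki" + "_search_with_clusters".
    public static Uri BuildUri(string esUrl, string apiName, string queryString)
    {
        string url = esUrl.TrimEnd('/') + "/" + apiName.TrimStart('/');
        if (!string.IsNullOrEmpty(queryString))
            url += "?" + queryString.TrimStart('?');
        return new Uri(url);
    }

    public static string GetEsJsonFromAPI(string esUrl, string apiName, string queryString, string esRequestJson)
    {
        try
        {
#pragma warning disable SYSLIB0014 // WebRequest is marked obsolete on newer .NET versions
            var request = (HttpWebRequest)WebRequest.Create(BuildUri(esUrl, apiName, queryString));
#pragma warning restore SYSLIB0014
            request.Method = "POST";
            request.ContentType = "application/json";

            // Disposing the writer flushes and closes the request stream.
            using (var streamWriter = new StreamWriter(request.GetRequestStream()))
            {
                streamWriter.Write(esRequestJson);
            }

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
        catch (WebException)
        {
            // Connection problems and HTTP error statuses end up here.
            return "";
        }
    }
}
```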

Lastly, here are my EsCarrotClusters classes, which let you deserialize the HttpWebRequest’s response back into a C#-friendly object. Enjoy:

    public class Shards
    {
        public int total { get; set; }
        public int successful { get; set; }
        public int failed { get; set; }
    }

    public class Fields
    {
        public List<string> filename { get; set; }
    }

    public class Hit
    {
        public string _index { get; set; }
        public string _type { get; set; }
        public string _id { get; set; }
        public double _score { get; set; }
        public Fields fields { get; set; }
    }

    public class Hits
    {
        public int total { get; set; }
        public double max_score { get; set; }
        public List<Hit> hits { get; set; }
    }

    public class Cluster
    {
        public int id { get; set; }
        public double score { get; set; }
        public string label { get; set; }
        public List<string> phrases { get; set; }
        public List<string> documents { get; set; }
        public bool? other_topics { get; set; }
    }

    public class Info
    {
        public string algorithm { get; set; }
        [JsonProperty("search-millis")]
        public string searchmillis { get; set; }
        [JsonProperty("clustering-millis")]
        public string clusteringmillis { get; set; }
        [JsonProperty("total-millis")]
        public string totalmillis { get; set; }
        [JsonProperty("include-hits")]
        public string includehits { get; set; }
        [JsonProperty("max-hits")]
        public string maxhits { get; set; }
    }

    public class EsCarrotClusters
    {
        public int took { get; set; }
        public bool timed_out { get; set; }
        public Shards _shards { get; set; }
        public Hits hits { get; set; }
        public List<Cluster> clusters { get; set; }
        public Info info { get; set; }
    }


After I run this against my Wikipedia ElasticSearch index for “cats and dogs”, I get clusters like these:

  • Polyneuropathy in Dogs and Cats
  • Album Cats
  • Canine
  • Domestic Cats
  • Missing Disease

Notice that each cluster also comes with a score property, which you can use to weight topics or to show them differently to the user.
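
As one way to use the score, here is a small sketch that ranks clusters from highest to lowest. The Cluster class below is a trimmed-down stand-in for the one shown earlier, ClusterRanking is a made-up name, and I am assuming that other_topics marks the catch-all group, as the property name suggests:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down stand-in for the Cluster class shown earlier.
public class Cluster
{
    public double score { get; set; }
    public string label { get; set; }
    public List<string> documents { get; set; }
    public bool? other_topics { get; set; }
}

// Hypothetical helper, not from the original article.
public static class ClusterRanking
{
    // Returns up to `count` clusters ordered by descending score,
    // skipping any cluster flagged as other_topics.
    public static List<Cluster> TopClusters(IEnumerable<Cluster> clusters, int count)
    {
        return clusters
            .Where(c => c.other_topics != true)
            .OrderByDescending(c => c.score)
            .Take(count)
            .ToList();
    }
}
```

With the response object from earlier, ClusterRanking.TopClusters(esCarrotClusters.clusters, 5) would give you the five strongest topics to display first.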

Querying Wikipedia in ElasticSearch with C# Nest client

This article assumes that you’ve already loaded the Wikipedia articles into your local ElasticSearch. If you haven’t, please follow the instructions in this previous article on how to load your ElasticSearch with the entire content of Wikipedia:

http://blog.novelessay.com/post/loading-wikipedia-in-to-elasticsearch

Start a Visual Studio C# console application project, and install the ElasticSearch Nest NuGet package.

In your code, create a Nest ElasticClient instance that is configured for your ElasticSearch instance. We are using localhost:9200 and the index named “mywiki” as the location of our Wikipedia data. 

var node = new Uri("http://localhost:9200");
var settings = new ConnectionSettings(
    node,
    defaultIndex: "mywiki"
).SetTimeout(int.MaxValue);
ElasticClient esClient = new ElasticClient(settings);

The Wikipedia index schema has a particular field format. We’ll need a Page class like this for Nest to map fields into:

public class Page
{
    public List<string> category { get; set; }
    public bool special { get; set; }
    public string title { get; set; }
    public bool stub { get; set; }
    public bool disambiguation { get; set; }
    public List<string> link { get; set; }
    public bool redirect { get; set; }
    public string text { get; set; }
}

Now, we can start querying our Wikipedia ElasticSearch index using our Nest client. Here’s a simple example that pulls down the first 10 Wikipedia articles:

var result = esClient.Search<Page>(s => s
    .From(0)
    .Size(10)
    .MatchAll()
    );

You can check the response for errors and loop through the Page hits like this:

if (result.IsValid)
{
    foreach (var page in result.Hits)
    {
        // page.Source.text contains the Wikipedia article text
    }
}

After this, you can loop through all Wikipedia documents by changing the arguments passed to From and Size in the ElasticSearch query call.
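
Here is a rough sketch of such a paging loop. To keep it runnable without ElasticSearch, the hypothetical Paging.ReadAll helper takes a delegate that fetches one page. With Nest, that delegate would wrap a Search call that passes the offset to From and the page size to Size. Keep in mind that deep paging with large From values gets expensive, so treat this as a sketch for modest result sets:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original article.
public static class Paging
{
    // Calls fetchPage(offset, size) repeatedly until a short page comes back
    // or maxDocs documents have been returned.
    public static IEnumerable<T> ReadAll<T>(
        Func<int, int, IReadOnlyList<T>> fetchPage, int pageSize, int maxDocs)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));

        int offset = 0;
        while (offset < maxDocs)
        {
            int size = Math.Min(pageSize, maxDocs - offset);
            IReadOnlyList<T> page = fetchPage(offset, size);

            foreach (T item in page)
                yield return item;

            if (page.Count < size)
                yield break; // a short page means there is nothing left

            offset += page.Count;
        }
    }
}
```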

Here’s a query example that emulates a Google-like search by using a QueryString. Notice the use of Operator.And. I suggest you change it to Operator.Or and observe the difference in your results.

var result = esClient.Search<Page>(s => s
    .Take(10)
    .Query(q => q
        .QueryString(p => p.Query("cats dogs birds").DefaultOperator(Operator.And))
    )
);
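
To make the And versus Or difference concrete, here is a toy in-memory illustration. It is not how ElasticSearch evaluates a query string (there is no analysis or scoring here), and OperatorDemo is a made-up name. It only shows the matching rule: And requires every term, Or requires at least one:

```csharp
using System;
using System.Linq;

// Toy illustration only, not from the original article.
public static class OperatorDemo
{
    // Splits on spaces and compares lowercase words.
    public static bool Matches(string text, string query, bool requireAll)
    {
        var separators = new[] { ' ' };
        var words = text.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries);
        var terms = query.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries);

        return requireAll
            ? terms.All(t => words.Contains(t))   // like Operator.And: every term must appear
            : terms.Any(t => words.Contains(t));  // like Operator.Or: one term is enough
    }
}
```

A document containing only cats and birds fails the And version of “cats dogs birds” but passes the Or version, which is why Or usually returns many more hits.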

If you’re ready to start getting fancy, you can write a function that builds a Nest SearchDescriptor based on your query criteria. Then use the SearchDescriptor in your query to ElasticSearch. I wanted to search Wikipedia without getting redirect link results, so I set some ignore options in the example below that exclude #redirect terms for my search descriptors.

public static SearchDescriptor<Page> GetDocumentSearchDescriptorFromSearchParameters(string queryString, bool queryAnd, string ignoreQuery)
{
    string ignoreA = "#redirect";
    string ignoreB = "redirect";

    var searchDescriptor = new SearchDescriptor<Page>()
            .Query(q =>
                q.QueryString(p => p.Query(queryString).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
                && !q.Term(p => p.text, ignoreA)
                && !q.Term(p => p.text, ignoreB)
                && !q.QueryString(p => p.Query(ignoreQuery).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
            );
    return searchDescriptor;
}

If this SearchDescriptor example is a little confusing, stay tuned for the future article on ElasticSearch Wikipedia clustering that I intend to write. In the meantime, you should be set up to query your Wikipedia ElasticSearch index with the C# Nest client.