Programming Blog

Exploration of Big Data, Machine Learning, Natural Language Processing, and other fun problems.

Using C# Nest with ElasticSearch Carrot2 Clustering Plugin

ElasticSearch Carrot2 Clustering Plugin:

This article will walk you through setting up ElasticSearch and Carrot2 clutering, so that you can implement something awesome like clustering topics on the publicly available Hillary Clinton email data set. Then, use foam tree to visually display it like this:

On to the example!

Let's say we want to get clusters for our Wikipedia index that we have loaded in to ElasticSearch, and we want to be able to also get clusters based on queries.

First, we'll want to build a SearchDescriptor based on query input. Let's just use a simple query string example for now. Here's example code to build a SearchDescriptor (which uses a special ignore "redirect" string that is custom for our Wikipedia index):

public static SearchDescriptor<Page> GetDocumentSearchDescriptorFromSearchParameters(string queryString, bool queryAnd, string ignoreQuery)
	string ignoreA = "#redirect";
	string ignoreB = "redirect";

	var searchDescriptor = new SearchDescriptor<Page>()
			.Query(q =>
				q.QueryString(p => p.Query(queryString).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
				&& !q.Term(p => p.text, ignoreA)
				&& !q.Term(p => p.text, ignoreB)
				&& !q.QueryString(p => p.Query(ignoreQuery).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
	return searchDescriptor;

Next, we need to build our ElasticSearch query with connection strings along with size and from values. Then, we use the Nest client serializer to convert our request to JSON:

public static EsCarrotClusters GetSearchCarrotClusters(string esUrl, string queryString, int from, int size, bool queryAnd, string ignoreQuery)
	ConnectionSettings Settings = new ConnectionSettings(new Uri(esUrl));
	ElasticClient Client = new ElasticClient(Settings);
	var searchDescriptor = GetDocumentSearchDescriptorFromSearchParameters(queryString, queryAnd, ignoreQuery);
	searchDescriptor.Fields(f => f.Add("text"));

	var jsonQuery = Encoding.UTF8.GetString(Client.Serializer.Serialize(searchDescriptor));
	jsonQuery = jsonQuery.Replace("\"query\": {},", "");
	jsonQuery = "{ \"search_request\" : " + jsonQuery;
	jsonQuery += ", \"query_hint\" : \"";
	jsonQuery += queryString == null ? "" : queryString;
	jsonQuery += "\",\"field_mapping\":{\"title\":[\"fields.text\"],\"content\":[]}}";

	string esClusterQueryRequestJson = jsonQuery;

	EsCarrotClusters clusters = null;

	string json = GetEsJsonFromAPI(esUrl, "_search_with_clusters", "", esClusterQueryRequestJson);
	if (json.Length > 0)
			clusters = JsonConvert.DeserializeObject<EsCarrotClusters>(json);

Example calling code that uses "cats and dogs" as a query string input:

EsCarrotClusters esCarrotClusters = EsHttpWebRequestApi.GetSearchCarrotClusters("http://localhost:9200/mywiki", "cats and dogs", 0, 10, true, "");
Special Note:
The GetEsJsonFromAPI function simlpy does a HttpWebRequest POST to the ElasticSearch Uri with the JSON content written to the stream like this:
            using (var streamWriter = new StreamWriter(request.GetRequestStream()))

Lastly, you'll want to see my EsCarrotClusters classes, so you can deserialize the HttpWebRequest's response back to a C# friendly object. Enjoy:

    public class Shards
        public int total { get; set; }
        public int successful { get; set; }
        public int failed { get; set; }

    public class Fields
        public List<string> filename { get; set; }

    public class Hit
        public string _index { get; set; }
        public string _type { get; set; }
        public string _id { get; set; }
        public double _score { get; set; }
        public Fields fields { get; set; }

    public class Hits
        public int total { get; set; }
        public double max_score { get; set; }
        public List<Hit> hits { get; set; }

    public class Cluster
        public int id { get; set; }
        public double score { get; set; }
        public string label { get; set; }
        public List<string> phrases { get; set; }
        public List<string> documents { get; set; }
        public bool? other_topics { get; set; }

    public class Info
        public string algorithm { get; set; }
        public string searchmillis { get; set; }
        public string clusteringmillis { get; set; }
        public string totalmillis { get; set; }
        public string includehits { get; set; }
        public string maxhits { get; set; }

    public class EsCarrotClusters
        public int took { get; set; }
        public bool timed_out { get; set; }
        public Shards _shards { get; set; }
        public Hits hits { get; set; }
        public List<Cluster> clusters { get; set; }
        public Info info { get; set; }

After I run this against my Wikipedia ElasticSearch index for "cats and dogs", I get clusters like these:

  • Polynueropath in Dogs and Cats
  • Album Cats
  • Canine
  • Domestic Cats
  • Missing Disease

Notice that you also are given a Score property, which you can use to weight topics or visually show them differently to the user.