NovelEssay.com Programming Blog

Exploration of Big Data, Machine Learning, Natural Language Processing, and other fun problems.

Using IEqualityComparer for finding near duplicates with custom business logic in C#

Example use case:

Let's say you're crawling the web gathering up information about people, and you want to group any matches of "John Smith" that might actually be the same person. 


In the generic case, we're walking through how to manage business logic cases for finding near duplicates using good software design principles.


We'll start by having a class like the Person example that follows:

    public class Person
    {
        [Key]
        public long Id { get; set; }
        public string Name { get; set; }
        public string LinkedInUrl { get; set; }
        public string TwitterUrl { get; set; }
        public string FacebookUrl { get; set; }


Let's say we get a hit from some blog about "John Smith", and his TwitterUrl is twitter.com/jsmith. We'll populate a Person object like that and toss it in our database, repository, or whatever storage we're using like this:

Person foundPerson = new Person() { Name = "John Smith", TwitterUrl = "twitter.com/jsmith" };
List<Person> allPeopleFound = new List<Person>();
allPeopleFound.Add(foundPerson);

Later, we find another "John Smith", and his LinkedInUrl is linkedin.com/in/jsmith. We'll add that Person to our collection:

Person foundPerson = new Person() { Name = "John Smith", LinkedInUrl = "linkedin.com/in/jsmith" };
allPeopleFound.Add(foundPerson);

Finally, we find another "J. P. Smith", and his LinkedInUrl is linkedin.com/in/jsmith and Facebook URL is facebook.com/jps. We'll add that Person to our collection:

Person foundPerson = new Person() { Name = "John Smith", LinkedInUrl = "linkedin.com/in/jsmith", FacebookUrl = "facebook.com/jps" };
allPeopleFound.Add(foundPerson);



We could group allPeopleFound by Name, but that's certainly going to have many false positives in our "John Smith" group. That approach will also not let us group "J. P. Smith" with "John Smith".

Let's show the code we want to happen before we show the solution we need.

var nearDupePeople = allPeopleFound.GroupBy(c => c, new PersonComparer());
foreach (var nearDupePerson in nearDupePeople)
{
    foreach (var person in nearDupePerson)
    {
        // Here we are iterating through all person objects that were grouped to gether by the PersonComparer above
        // TODO: Now, that "JP Smith" and "John Smith" are found equal, we need to have business rules about multi-valued fields

Now, you should be thinking - What's with that PersonComparer class?


Not messing around, we'll show off the PersonComparer class that implements an IEqualityComparer.

public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person p1, Person p2)
    {
        // Social media matches, various social network identity matching here:
        if (!string.IsNullOrEmpty(p1.LinkedInUrl) && !string.IsNullOrEmpty(p2.LinkedInUrl) && p1.LinkedInUrl.Equals(p2.LinkedInUrl))
        {
            return true;
        }
        if (!string.IsNullOrEmpty(p1.TwitterUrl) && !string.IsNullOrEmpty(p2.TwitterUrl) && p1.TwitterUrl.Equals(p2.TwitterUrl))
        {
            return true;
        }
        if (!string.IsNullOrEmpty(p1.FacebookUrl) && !string.IsNullOrEmpty(p2.FacebookUrl) && p1.FacebookUrl.Equals(p2.FacebookUrl))
        {
            return true;
        }
        return false;
    }
    public int GetHashCode(Person p)
    {
        return (p.LinkedInUrl + p.TwitterUrl + p.FacebookUrl).GetHashCode();
    }
}

Notice that our IEqualityComparer implementation needs to have two functions implemented: Equals and GetHashCode.


In our code, we'll call two Person objects equal if their LinkedIn Urls are the same, or their Twitter Urls are the same, or their Facebook Urls are the same. We don't consider two Person instances equal if their Name is equal.


Our GetHashCode function needs to account for all 3 properties we are using for equating Person objects, so we concatenate our 3 Url properties to get our object's hash code.


That's all there is to executing a custom "near duplicate" grouping and easily handling the business logic inside your implementation for IEqualityComparer.


C# Entity Framework implementing 'Contains' Query for MySQL

This article will show some tricks on how to use C# Entity Framework queries with 'Contains' predicates for a MySQL database.

The goal is to use C# queries like this against a MySQL database:

var result = dbContext.People.Where(p => p.Name.Contains("john"));

First, we need to create a Full-Text Search Interceptor. We'll start with the one found at the following link:

http://www.entityframework.info/Home/FullTextSearch

That one is specific to SQL Server, so we'll need to make some changes. After my changes, a version for MySQL is this:

static class LanguageExtensions
{
    public static bool In<T>(this T source, params T[] list)
    {
        return (list as IList<T>).Contains(source);
    }
}
public class FtsInterceptor : IDbCommandInterceptor
{
    private const string FullTextPrefix = "-FTSPREFIX-";
    public static string Fts(string search)
    {
        return string.Format("({0}{1})", FullTextPrefix, search);
    }
    public void NonQueryExecuting(DbCommand command, DbCommandInterceptionContext<int> interceptionContext)
    {
    }
    public void NonQueryExecuted(DbCommand command, DbCommandInterceptionContext<int> interceptionContext)
    {
    }
    public void ReaderExecuting(DbCommand command, DbCommandInterceptionContext<DbDataReader> interceptionContext)
    {
        RewriteFullTextQuery(command);
    }
    public void ReaderExecuted(DbCommand command, DbCommandInterceptionContext<DbDataReader> interceptionContext)
    {
    }
    public void ScalarExecuting(DbCommand command, DbCommandInterceptionContext<object> interceptionContext)
    {
        RewriteFullTextQuery(command);
    }
    public void ScalarExecuted(DbCommand command, DbCommandInterceptionContext<object> interceptionContext)
    {
    }
    public static void RewriteFullTextQuery(DbCommand cmd)
    {
        string text = cmd.CommandText;
        for (int i = 0; i < cmd.Parameters.Count; i++)
        {
            DbParameter parameter = cmd.Parameters[i];
            if (parameter.DbType.In(DbType.String, DbType.AnsiString, DbType.StringFixedLength, DbType.AnsiStringFixedLength))
            {
                if (parameter.Value == DBNull.Value)
                    continue;
                var value = (string)parameter.Value;
                if (value.IndexOf(FullTextPrefix) >= 0)
                {
                    parameter.Size = 4096;
                    parameter.DbType = DbType.AnsiStringFixedLength;
                    value = value.Replace(FullTextPrefix, "");
                    value = value.Substring(1, value.Length - 2);
                    parameter.Value = value;
                    cmd.CommandText = Regex.Replace(
                        text,
                        string.Format(@"`(\w*)`.`(\w*)`\s*LIKE\s*@{0}\s?", parameter.ParameterName),
                        string.Format(@"match(`$1`.`$2`) against(@{0})", parameter.ParameterName)
                    );

                    if (text == cmd.CommandText)
                        throw new Exception("FTS was not replaced on: " + text);
                    text = cmd.CommandText;
                }
            }
        }
    }
}

The most significant difference is in the Regex.Replace. For MySQL, we need to make our SQL use Full Text indices using the Match-Against combo like this:

select * from people WHERE MATCH (Name) against ('john')

Note: Be sure that your table has a Full Text index on the columns you are querying. You can always add a Full Text index to your table like this:

ALTER TABLE people ADD FULLTEXT INDEX people_name_FT_IDX (Name);


Next, we'll need to add our FtsInterceptor to our DbContext like this:

public partial class MyAppMySqlDbEntities : DbContext
{
    public MyAppMySqlDbEntities()
        : base("name=MyAppMySqlDbEntities")
    {
        DbInterception.Add(new FtsInterceptor());
        Configuration.UseDatabaseNullSemantics = true;

You'll also need a using statement to go with those DbInterception calls.

using System.Data.Entity.Infrastructure;
using System.Data.Entity.Infrastructure.Interception;

Finally, we're ready to make some queries:

var personNameFtsI = FtsInterceptor.Fts('john');
var query = myDbContext.People.Where(p => p.Name.Contains(personNameFtsI));
int count = query.Count();

I suggest you set a break point in the FtsInterceptor.RewriteFullTextQuery function, and when the query.Count line is executed that breakpoint should get hit. Then, you can step through that function and see how your SQL is changing to use the Match-Against combo that MySQL needs for a Full Text index search.


A special big thank you goes out to our friends at AlphaProspect.com for help figuring all of this out.


One final "gotcha" to watch out for is that MySQL seems to only like using one Full Text index per query, so if you do something like this:

var personNameFtsI = FtsInterceptor.Fts('john');
var personJobFtsI = FtsInterceptor.Fts('engineer');
var query = myDbContext.People.Where(p => p.Name.Contains(personNameFtsI) && p.JobTitle.Contains(personJobFtsI));
int count = query.Count();

The SQL result will look correct, and it should have two Match-Against clauses, but MySQL still won't use both Full Text indices. I suggest you capture your SQL from in the FtsInterceptor.RewriteFullTextQuery function, and then execute it in the MySQL workbench. Get the execution plan or "explain" result for your query. That'll show that MySQL is really only using one of the Full Text indices.


There's a way to rewrite the queries to make use of both Full Text indices, but I have no idea how to make that happen using C# Entity Framework and Linq. Here's a link about how MySQL can do it with temp tables:

http://stackoverflow.com/questions/42097593/mysql-slow-query-when-combining-two-very-fast-queries

If you solve this, I'd love to hear from you about how you did it. Thank you.


AngularJs Pagination with custom Total Count reuse.

I have an AngularJS application that paginates through a table. You can filter the table by column. I'm using angular-paginate-anything as my AngularJS plugin for my paginated table. My table has over a million rows, and the filtered queries are not fast. Selecting "top 10" records is much faster than "get total count", so I'd like to reuse the total count value if the table filters don't change between pages. Unfortunately, angular-paginate-anything doesn't do that out of the box. Fortunately, this article shows how to do that.


Background, I have another article that shows how to use angular-paginate-anything for an Mvc.Net project with pagination. Consider that the starting point. Go ahead and read it quick. We'll wait right here...

http://blog.novelessay.com/post/mvc-net-with-angularjs-and-pagination


Next, we'll open the paginate-anything-tpls.js file and work with that (since it's not minimized). Again, our goal is to only request a new total count when the parameters change.


When the pagination app initially loads, we need to get a total count so we'll initialize that state to true:

              controller: ['$scope', '$http', function ($scope, $http) {
                  $scope.urlParamsChanged = true;

After we get the paginated results from the server, we want to set the urlParamsChange state to false.

function requestRange(request) {
$scope.$emit('pagination:loadStart', request);
$http({
...
}).success(function (data, status, headers, config) {
$scope.urlParamsChanged = false;

Then, we need to set urlParamsChanged to true again when the filter parameters do change:

                  $scope.$watch('urlParams', function (newParams, oldParams) {
                      if ($scope.passive === 'true') { return; }
                      if (!angular.equals(newParams, oldParams)) {
                          $scope.urlParamsChanged = true;

When we do make a request to the server we need to check the urlParamsChanged and indicate if we need total count or not:

                      $http({
                          method: 'GET',
                          url: $scope.url,
                          params: $scope.urlParams,
                          headers: angular.extend(
                            {}, $scope.headers,
                            { 'Range-Unit': 'items', Range: $scope.urlParamsChanged == true ? [request.from, request.to].join('-') : [request.from, request.to, $scope.numItems].join('-') }
                          ),
That's all the java script changes we need. Next, we need to modify our server to check if the total count is passed back to the server. Originally, total count on the input paging model was always 0, so we'll check if it's non-zero in the new code.

[Route("GetSearchResults"), HttpGet()]
[WithPaging]
public IHttpActionResult Search([FromUri] PagingModel pagingModel
, [FromUri] string Name = null
, [FromUri] string Region = null
)
{
IQueryable <company> query = ...
PagingModeSearch result = new PagingModeSearch();
result.From = pagingModel.From;
result.To = pagingModel.To;
if (pagingModel.TotalRecordCount > 0)
{
result.TotalRecordCount = pagingModel.TotalRecordCount;
}

else
{
result.TotalRecordCount = query.Count();
}


Finally, we need to update our WithPagingAttribute class in our NgPaginateModel.cs file to parse the range header for the total count value like this:

    public class WithPagingAttribute : ActionFilterAttribute
    {
        private readonly string _paramName;
        private static readonly Regex RangeHeaderRegex = new Regex(@"(?<from>\d+)-(?<to>\d*)-(?<total>\d*)", RegexOptions.Compiled);
And, very lastly this:
        private static PagingModel IncomingPagingModel(HttpRequestHeaders headers, PagingModel pagingModel)
        {
            if (pagingModel == null)
                pagingModel = new PagingModel();
            IEnumerable<string> rangeUnitValues;
            IEnumerable<string> rangeValues;
            if (headers.TryGetValues("Range-Unit", out rangeUnitValues)
                && rangeUnitValues.First().Equals("items", StringComparison.OrdinalIgnoreCase)
                && headers.TryGetValues("Range", out rangeValues))
            {
                var rangeHeaderMatch = RangeHeaderRegex.Match(rangeValues.First());
                if (!string.IsNullOrWhiteSpace(rangeHeaderMatch.Groups["from"].Value))
                    pagingModel.From = int.Parse(rangeHeaderMatch.Groups["from"].Value);
                if (!string.IsNullOrWhiteSpace(rangeHeaderMatch.Groups["to"].Value))
                    pagingModel.To = int.Parse(rangeHeaderMatch.Groups["to"].Value);
                if (!string.IsNullOrWhiteSpace(rangeHeaderMatch.Groups["total"].Value))
                    pagingModel.TotalRecordCount = int.Parse(rangeHeaderMatch.Groups["total"].Value);

            }
            return pagingModel;
        }


This makes navigating through each page of a search result much faster when the total count can be reused.















Must change item for configuring MySQL for Windows

I recently installed MySQL and the MySQL workbench on my Windows 10 PC. It's working great. I loaded up about 2 GB of data in two 3 tables with roughly 2M rows total. Most of my simple (single table by ID or Full Text Index) queries ran really fast, and I was happy.


Then, I started working on queries that join tables. They were slow - like 26 seconds slow for a two table join by integer ID using a FK (simple stuff).


First, I checked my indices. I added several indices just trying things and hoping it'd work. That did not solve the slowness.


I had to tinker in Workbench quite a bit imagining what could be wrong and how do I identify the source problem, which turned out to be good learning exercise.


Eventually, I saw (on the Workbench Server's dashboard) that InnoDB buffer usage went to 100% during my long running queries. I didn't really know what that meant, but I was out of "good" ideas to try so I was happy to have any leads.


I asked the internet sages (search engines) about this "InnoDB buffer usage at 100%", and I found some leads. It turns out that buffer size is set in "my.ini" (on Windows) or "my.cnf" (Not-windows). The default was 8MB, which is amazingly low for any modern computer. (MySQL on my phone would want a bigger value than that as a default.)


I found my.ini at this path on my Windows 10 PC:

C:\programdata\MySQL\MySQL Server 5.7\my.ini

Specifically, edit that file with notepad or notepad++. Look for this value:

innodb_buffer_pool_size=8M

I changed mine to 1G first, and now I'm using 2G. I don't even think that's a lot, but this is all on my dev-box so I don't want to give it all of my RAM.

innodb_buffer_pool_size=2G

After you change that my.ini value, be sure to restart your MySQL service. 

  1. Type services.msc in to the Run box on Windows to bring up your services manager. 
  2. Find the MySQL service, and right click it. 
  3. There should be a "restart" option, so click that.
After this change, my 26 second simple table join query went to sub-second time.

C# IQueryable with a Dynamic Predicate Builder for Entity Framework and Linq

Scenario:

You have a table in a web page (or WinForms) with filterable columns. A user can add filter requirements or not, which makes your query's "where" clause very dynamic. This is easy to solve if you just have a few columns, but what if you have a dozen columns all which may or may not have filters that need to be applied to the "where" clause?


Solution:

Use the PredicateBuilder in the LinqKit Nuget package.


1) Install the LinqKit Nuget package.


2) Add the typical using statement for LinqKit

using LinqKit;


3) Create a function to populate your PredicateBuilder

IQueryable<person> GetPeoplePredicate(SearchParametersPeople searchParameters)
{
	var predicate = PredicateBuilder.True<person>();

	if (!string.IsNullOrEmpty(searchParameters.personName))
	{
		predicate = predicate.And(p => p.Name.Contains(searchParameters.personName));
	}
	if (!string.IsNullOrEmpty(searchParameters.personTitle))
	{
		predicate = predicate.And(p => p.JobTitle.Contains(searchParameters.personTitle));
	}
	if (!string.IsNullOrEmpty(searchParameters.personLocation))
	{
		predicate = predicate.And(p => p.Locations.Contains(searchParameters.personLocation));
	}
	if (!string.IsNullOrEmpty(searchParameters.personBio))
	{
		predicate = predicate.And(p => p.Bio.Contains(searchParameters.personBio));
	}
	// Entity Framework requires AsExpandable
	//return _db.companies.Where(predicate);
	return _db.people.AsExpandable().Where(predicate);
}

This example shows several Person fields that might be filterable, and does a predicate.And call to append a new requirement to the "where" clause of our query. This example shows the "and" operation, so hits on all filters must pass to be part of the result set.

Notice that Entity Framework requires the AsExpandable() . We are using EF in this example.

Let's say you are doing a Keywords collection and want to have all hits for any keyword. Then, use a PredicateBuiler.False instead of PredicateBuilder.True. Notice, in this example the conditions have the "or" operator applied with each other via the predicate.Or call.

IQueryable<Product> SearchProducts (params string[] keywords)
{
  var predicate = PredicateBuilder.False<Product>();
  foreach (string keyword in keywords)
  {
    string temp = keyword;
    predicate = predicate.Or (p => p.Description.Contains (temp));
  }
  return db.Products.Where (predicate);
}


Now, let's take our dynamic predicate IQueryable query and join with another table. Here we join a Person table with a Company table:

IQueryable<person> query = GetPeoplePredicate(searchParametersPeople);
IQueryable<company> companyQuery = from company in _db.companies
		where company.NameLower.Contains(companyName.ToLower())
		select company;
query = from peopleResult in query
		join c in companyQuery on peopleResult.Company_Id equals c.Id
		select peopleResult;

In this example, we use our PredicateBuilder function to get a dynamically created Person query. Then, we create a second query to filter our Company table by a company names. Lastly, we use LINQ to join our Person query with our Company query to get a result set that contains People that have a relationship to our Company table filtered by the company name predicate.


Now, we have a really powerful dynamic query builder that we can let our users run adhoc queries against any number of fields across multiple related tables!