NovelEssay.com Programming Blog

Exploration of Big Data, Machine Learning, Natural Language Processing, and other fun problems.

Using IEqualityComparer for finding near duplicates with custom business logic in C#

Example use case:

Let's say you're crawling the web gathering up information about people, and you want to group any matches of "John Smith" that might actually be the same person. 


More generally, this post walks through how to keep the business logic for finding near duplicates in one place, using good software design principles.


We'll start by having a class like the Person example that follows:

    // [Key] comes from System.ComponentModel.DataAnnotations
    public class Person
    {
        [Key]
        public long Id { get; set; }
        public string Name { get; set; }
        public string LinkedInUrl { get; set; }
        public string TwitterUrl { get; set; }
        public string FacebookUrl { get; set; }
    }


Let's say we get a hit from some blog about "John Smith", and his TwitterUrl is twitter.com/jsmith. We'll populate a Person object like that and toss it in our database, repository, or whatever storage we're using like this:

Person foundPerson = new Person() { Name = "John Smith", TwitterUrl = "twitter.com/jsmith" };
List<Person> allPeopleFound = new List<Person>();
allPeopleFound.Add(foundPerson);

Later, we find another "John Smith", and his LinkedInUrl is linkedin.com/in/jsmith. We'll add that Person to our collection:

foundPerson = new Person() { Name = "John Smith", LinkedInUrl = "linkedin.com/in/jsmith" };
allPeopleFound.Add(foundPerson);

Finally, we find a "J. P. Smith" whose LinkedInUrl is linkedin.com/in/jsmith and whose FacebookUrl is facebook.com/jps. We'll add that Person to our collection:

foundPerson = new Person() { Name = "J. P. Smith", LinkedInUrl = "linkedin.com/in/jsmith", FacebookUrl = "facebook.com/jps" };
allPeopleFound.Add(foundPerson);



We could group allPeopleFound by Name, but that's certainly going to have many false positives in our "John Smith" group. That approach will also not let us group "J. P. Smith" with "John Smith".
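To make the problem with name-based grouping concrete, here's a minimal sketch. NameGroupingDemo and GroupByName are names made up for this illustration, and the Person class is trimmed down to the string fields so the snippet stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copy of the Person class, just for this sketch.
public class Person
{
    public string Name { get; set; }
    public string LinkedInUrl { get; set; }
    public string TwitterUrl { get; set; }
    public string FacebookUrl { get; set; }
}

// Hypothetical helper, not part of any library.
public static class NameGroupingDemo
{
    // Groups purely on the Name string and returns one "name: count" line per group.
    public static List<string> GroupByName(IEnumerable<Person> people)
    {
        return people
            .GroupBy(p => p.Name)
            .Select(g => g.Key + ": " + g.Count())
            .ToList();
    }
}
```

With the three people from above, this gives one "John Smith" group holding both John Smiths (whether or not they are the same person) and a separate "J. P. Smith" group, even though he shares a LinkedInUrl with one of them.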

Let's look at the code we want to be able to write before we look at the solution that makes it work.

var nearDupePeople = allPeopleFound.GroupBy(c => c, new PersonComparer());
foreach (var nearDupePerson in nearDupePeople)
{
    foreach (var person in nearDupePerson)
    {
        // Here we are iterating through all Person objects that were grouped together by the PersonComparer above
        // TODO: Now that "J. P. Smith" and "John Smith" are found equal, we need business rules for multi-valued fields
    }
}
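One possible way to handle the TODO above is sketched here. The rule is an assumption made for illustration (keep the first non-empty value for each field, and keep every distinct name seen), and PersonMerger, Merge, and DistinctNames are hypothetical names; your own business rules may well differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copy of the Person class, just for this sketch.
public class Person
{
    public string Name { get; set; }
    public string LinkedInUrl { get; set; }
    public string TwitterUrl { get; set; }
    public string FacebookUrl { get; set; }
}

// Hypothetical helper that collapses one group of near duplicates.
public static class PersonMerger
{
    // Assumed rule: for each field, keep the first non-empty value in the group.
    public static Person Merge(IEnumerable<Person> group)
    {
        var members = group.ToList();
        return new Person
        {
            Name = FirstNonEmpty(members.Select(p => p.Name)),
            LinkedInUrl = FirstNonEmpty(members.Select(p => p.LinkedInUrl)),
            TwitterUrl = FirstNonEmpty(members.Select(p => p.TwitterUrl)),
            FacebookUrl = FirstNonEmpty(members.Select(p => p.FacebookUrl))
        };
    }

    // Keeps every distinct non-empty name so "J. P. Smith" is not lost in the merge.
    public static List<string> DistinctNames(IEnumerable<Person> group)
    {
        return group
            .Select(p => p.Name)
            .Where(n => !string.IsNullOrEmpty(n))
            .Distinct()
            .ToList();
    }

    // Returns null when no value in the sequence is non-empty.
    private static string FirstNonEmpty(IEnumerable<string> values)
    {
        return values.FirstOrDefault(v => !string.IsNullOrEmpty(v));
    }
}
```

Inside the outer foreach, something like PersonMerger.Merge(nearDupePerson) would then produce a single Person per group.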

Now you should be thinking: what's with that PersonComparer class?


Without further ado, here is the PersonComparer class, which implements IEqualityComparer<Person>.

public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person p1, Person p2)
    {
        // Social media matches, various social network identity matching here:
        if (!string.IsNullOrEmpty(p1.LinkedInUrl) && !string.IsNullOrEmpty(p2.LinkedInUrl) && p1.LinkedInUrl.Equals(p2.LinkedInUrl))
        {
            return true;
        }
        if (!string.IsNullOrEmpty(p1.TwitterUrl) && !string.IsNullOrEmpty(p2.TwitterUrl) && p1.TwitterUrl.Equals(p2.TwitterUrl))
        {
            return true;
        }
        if (!string.IsNullOrEmpty(p1.FacebookUrl) && !string.IsNullOrEmpty(p2.FacebookUrl) && p1.FacebookUrl.Equals(p2.FacebookUrl))
        {
            return true;
        }
        return false;
    }
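    public int GetHashCode(Person p)
    {
        // Equal objects must return equal hash codes. Two people can match on any
        // single Url while their other Urls differ, so no hash built from the Urls
        // is safe. Returning a constant sends every comparison through Equals.
        return 0;
    }
}

Notice that our IEqualityComparer implementation needs two methods: Equals and GetHashCode.


In our code, we'll call two Person objects equal if their LinkedIn Urls are the same, or their Twitter Urls are the same, or their Facebook Urls are the same. An empty or missing Url never counts as a match, and we don't consider two Person instances equal just because their Names are equal.


Our GetHashCode function has to agree with Equals: GroupBy only calls Equals on objects whose hash codes already match, so any two Person objects we want treated as equal must return the same hash code. Concatenating the 3 Url properties and hashing the result would break that rule. The second "John Smith" and "J. P. Smith" share a LinkedInUrl, but only one of them has a FacebookUrl, so their concatenated strings differ and they would almost certainly land in different groups. Because a match on any single Url is enough, we return a constant instead, so every comparison is decided by Equals. The price is speed: with every object in the same hash bucket, grouping n people takes on the order of n² comparisons. Also keep in mind that this Equals is not transitive (A can match B on Twitter and B can match C on LinkedIn while A and C share nothing), so the group a Person lands in can depend on the order in which people are encountered.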
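To sanity-check a comparer like this, here's a minimal sketch rather than a definitive implementation. UrlMatchComparer and GroupingCheck are names made up for this illustration. The comparer uses the same "any shared Url" rule, adds null handling, and returns a constant hash code, which is assumed here as the simplest choice that keeps GetHashCode consistent with Equals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copy of the Person class, just for this sketch.
public class Person
{
    public string Name { get; set; }
    public string LinkedInUrl { get; set; }
    public string TwitterUrl { get; set; }
    public string FacebookUrl { get; set; }
}

// Hypothetical comparer: two people match if they share any non-empty Url.
public class UrlMatchComparer : IEqualityComparer<Person>
{
    private static bool SameNonEmpty(string a, string b)
    {
        return !string.IsNullOrEmpty(a) && string.Equals(a, b);
    }

    public bool Equals(Person p1, Person p2)
    {
        if (ReferenceEquals(p1, p2)) return true;
        if (p1 == null || p2 == null) return false;
        return SameNonEmpty(p1.LinkedInUrl, p2.LinkedInUrl)
            || SameNonEmpty(p1.TwitterUrl, p2.TwitterUrl)
            || SameNonEmpty(p1.FacebookUrl, p2.FacebookUrl);
    }

    // Constant hash: every pair of people is decided by Equals.
    public int GetHashCode(Person p)
    {
        return 0;
    }
}

// Hypothetical helpers for checking a comparer against sample data.
public static class GroupingCheck
{
    // Sizes of the groups GroupBy produces, in the order the groups are created.
    public static List<int> GroupSizes(IEnumerable<Person> people, IEqualityComparer<Person> comparer)
    {
        return people.GroupBy(p => p, comparer).Select(g => g.Count()).ToList();
    }

    // True when some pair is equal according to Equals but has different hash codes.
    public static bool ViolatesHashContract(IList<Person> people, IEqualityComparer<Person> comparer)
    {
        for (int i = 0; i < people.Count; i++)
        {
            for (int j = i + 1; j < people.Count; j++)
            {
                if (comparer.Equals(people[i], people[j])
                    && comparer.GetHashCode(people[i]) != comparer.GetHashCode(people[j]))
                {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Passing the three example people and a UrlMatchComparer to GroupingCheck.GroupSizes gives two groups: the Twitter-only "John Smith" on his own, and the two people who share linkedin.com/in/jsmith together. ViolatesHashContract is the kind of check that fits nicely in a unit test for any custom comparer.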


That's all there is to executing a custom "near duplicate" grouping and easily handling the business logic inside your implementation for IEqualityComparer.