Programming Blog

Exploration of Big Data, Machine Learning, Natural Language Processing, and other fun problems.

Handling inconsistent Url formats in C#

Let's say you have a whole bunch of Urls in a database or whatever collection format, but some are like these examples:


Is there an easy way to handle these variety of formats in C#?

Yes, we'll do some prefix detection and then use a Uri class to help parse the data.

First, we assume our original data is in the HomePage variable on the Record object. We'll detect if HomePage starts with "http://". If it doesn't, we'll add that:

if (!Record.HomePage.StartsWith(HTTP_PREFIX))
Record.HomePage = HTTP_PREFIX + Record.HomePage;

Next, we'll try to parse the HomePage with the Uri class like this:

    Uri myUri = new Uri(Record.HomePage);
catch (Exception e)
    // HomePage is not parsable as Uri object

If HomePage is still in a bad format, the Uri constructor will throw an exception that we want to catch.

Finally, if an exception isn't thrown, we can use the new myUri to extract the base domain and other parts of the Uri like this:
Record.Domain = myUri.Host;
One last trick, is to check the Record.Domain or myUri.Host to see if it begins with a www. prefix. Depending on how you want to normalize your host information, you may want to add or remove the www prefix on the Host (or Domain) value.

That's all. Happy Uri parsing!

Using IEqualityComparer for finding near duplicates with custom business logic in C#

Example use case:

Let's say you're crawling the web gathering up information about people, and you want to group any matches of "John Smith" that might actually be the same person. 

In the generic case, we're walking through how to manage business logic cases for finding near duplicates using good software design principles.

We'll start by having a class like the Person example that follows:

    public class Person
        public long Id { get; set; }
        public string Name { get; set; }
        public string LinkedInUrl { get; set; }
        public string TwitterUrl { get; set; }
        public string FacebookUrl { get; set; }

Let's say we get a hit from some blog about "John Smith", and his TwitterUrl is We'll populate a Person object like that and toss it in our database, repository, or whatever storage we're using like this:

Person foundPerson = new Person() { Name = "John Smith", TwitterUrl = "" };
List<Person> allPeopleFound = new List<Person>();

Later, we find another "John Smith", and his LinkedInUrl is We'll add that Person to our collection:

Person foundPerson = new Person() { Name = "John Smith", LinkedInUrl = "" };

Finally, we find another "J. P. Smith", and his LinkedInUrl is and Facebook URL is We'll add that Person to our collection:

Person foundPerson = new Person() { Name = "John Smith", LinkedInUrl = "", FacebookUrl = "" };

We could group allPeopleFound by Name, but that's certainly going to have many false positives in our "John Smith" group. That approach will also not let us group "J. P. Smith" with "John Smith".

Let's show the code we want to happen before we show the solution we need.

var nearDupePeople = allPeopleFound.GroupBy(c => c, new PersonComparer());
foreach (var nearDupePerson in nearDupePeople)
    foreach (var person in nearDupePerson)
        // Here we are iterating through all person objects that were grouped to gether by the PersonComparer above
        // TODO: Now, that "JP Smith" and "John Smith" are found equal, we need to have business rules about multi-valued fields

Now, you should be thinking - What's with that PersonComparer class?

Not messing around, we'll show off the PersonComparer class that implements an IEqualityComparer.

public class PersonComparer : IEqualityComparer<Person>
    public bool Equals(Person p1, Person p2)
        // Social media matches, various social network identity matching here:
        if (!string.IsNullOrEmpty(p1.LinkedInUrl) && !string.IsNullOrEmpty(p2.LinkedInUrl) && p1.LinkedInUrl.Equals(p2.LinkedInUrl))
            return true;
        if (!string.IsNullOrEmpty(p1.TwitterUrl) && !string.IsNullOrEmpty(p2.TwitterUrl) && p1.TwitterUrl.Equals(p2.TwitterUrl))
            return true;
        if (!string.IsNullOrEmpty(p1.FacebookUrl) && !string.IsNullOrEmpty(p2.FacebookUrl) && p1.FacebookUrl.Equals(p2.FacebookUrl))
            return true;
        return false;
    public int GetHashCode(Person p)
        return (p.LinkedInUrl + p.TwitterUrl + p.FacebookUrl).GetHashCode();

Notice that our IEqualityComparer implementation needs to have two functions implemented: Equals and GetHashCode.

In our code, we'll call two Person objects equal if their LinkedIn Urls are the same, or their Twitter Urls are the same, or their Facebook Urls are the same. We don't consider two Person instances equal if their Name is equal.

Our GetHashCode function needs to account for all 3 properties we are using for equating Person objects, so we concatenate our 3 Url properties to get our object's hash code.

That's all there is to executing a custom "near duplicate" grouping and easily handling the business logic inside your implementation for IEqualityComparer.

C# Entity Framework implementing 'Contains' Query for MySQL

This article will show some tricks on how to use C# Entity Framework queries with 'Contains' predicates for a MySQL database.

The goal is to use C# queries like this against a MySQL database:

var result = dbContext.People.Where(p => p.Name.Contains("john"));

First, we need to create a Full-Text Search Interceptor. We'll start with the one found at the following link:

That one is specific to SQL Server, so we'll need to make some changes. After my changes, a version for MySQL is this:

static class LanguageExtensions
    public static bool In<T>(this T source, params T[] list)
        return (list as IList<T>).Contains(source);
public class FtsInterceptor : IDbCommandInterceptor
    private const string FullTextPrefix = "-FTSPREFIX-";
    public static string Fts(string search)
        return string.Format("({0}{1})", FullTextPrefix, search);
    public void NonQueryExecuting(DbCommand command, DbCommandInterceptionContext<int> interceptionContext)
    public void NonQueryExecuted(DbCommand command, DbCommandInterceptionContext<int> interceptionContext)
    public void ReaderExecuting(DbCommand command, DbCommandInterceptionContext<DbDataReader> interceptionContext)
    public void ReaderExecuted(DbCommand command, DbCommandInterceptionContext<DbDataReader> interceptionContext)
    public void ScalarExecuting(DbCommand command, DbCommandInterceptionContext<object> interceptionContext)
    public void ScalarExecuted(DbCommand command, DbCommandInterceptionContext<object> interceptionContext)
    public static void RewriteFullTextQuery(DbCommand cmd)
        string text = cmd.CommandText;
        for (int i = 0; i < cmd.Parameters.Count; i++)
            DbParameter parameter = cmd.Parameters[i];
            if (parameter.DbType.In(DbType.String, DbType.AnsiString, DbType.StringFixedLength, DbType.AnsiStringFixedLength))
                if (parameter.Value == DBNull.Value)
                var value = (string)parameter.Value;
                if (value.IndexOf(FullTextPrefix) >= 0)
                    parameter.Size = 4096;
                    parameter.DbType = DbType.AnsiStringFixedLength;
                    value = value.Replace(FullTextPrefix, "");
                    value = value.Substring(1, value.Length - 2);
                    parameter.Value = value;
                    cmd.CommandText = Regex.Replace(
                        string.Format(@"`(\w*)`.`(\w*)`\s*LIKE\s*@{0}\s?", parameter.ParameterName),
                        string.Format(@"match(`$1`.`$2`) against(@{0})", parameter.ParameterName)

                    if (text == cmd.CommandText)
                        throw new Exception("FTS was not replaced on: " + text);
                    text = cmd.CommandText;

The most significant difference is in the Regex.Replace. For MySQL, we need to make our SQL use Full Text indices using the Match-Against combo like this:

select * from people WHERE MATCH (Name) against ('john')

Note: Be sure that your table has a Full Text index on the columns you are querying. You can always add a Full Text index to your table like this:

ALTER TABLE people ADD FULLTEXT INDEX people_name_FT_IDX (Name);

Next, we'll need to add our FtsInterceptor to our DbContext like this:

public partial class MyAppMySqlDbEntities : DbContext
    public MyAppMySqlDbEntities()
        : base("name=MyAppMySqlDbEntities")
        DbInterception.Add(new FtsInterceptor());
        Configuration.UseDatabaseNullSemantics = true;

You'll also need a using statement to go with those DbInterception calls.

using System.Data.Entity.Infrastructure;
using System.Data.Entity.Infrastructure.Interception;

Finally, we're ready to make some queries:

var personNameFtsI = FtsInterceptor.Fts('john');
var query = myDbContext.People.Where(p => p.Name.Contains(personNameFtsI));
int count = query.Count();

I suggest you set a break point in the FtsInterceptor.RewriteFullTextQuery function, and when the query.Count line is executed that breakpoint should get hit. Then, you can step through that function and see how your SQL is changing to use the Match-Against combo that MySQL needs for a Full Text index search.

A special big thank you goes out to our friends at for help figuring all of this out.

One final "gotcha" to watch out for is that MySQL seems to only like using one Full Text index per query, so if you do something like this:

var personNameFtsI = FtsInterceptor.Fts('john');
var personJobFtsI = FtsInterceptor.Fts('engineer');
var query = myDbContext.People.Where(p => p.Name.Contains(personNameFtsI) && p.JobTitle.Contains(personJobFtsI));
int count = query.Count();

The SQL result will look correct, and it should have two Match-Against clauses, but MySQL still won't use both Full Text indices. I suggest you capture your SQL from in the FtsInterceptor.RewriteFullTextQuery function, and then execute it in the MySQL workbench. Get the execution plan or "explain" result for your query. That'll show that MySQL is really only using one of the Full Text indices.

There's a way to rewrite the queries to make use of both Full Text indices, but I have no idea how to make that happen using C# Entity Framework and Linq. Here's a link about how MySQL can do it with temp tables:

If you solve this, I'd love to hear from you about how you did it. Thank you.

AngularJs Pagination with custom Total Count reuse.

I have an AngularJS application that paginates through a table. You can filter the table by column. I'm using angular-paginate-anything as my AngularJS plugin for my paginated table. My table has over a million rows, and the filtered queries are not fast. Selecting "top 10" records is much faster than "get total count", so I'd like to reuse the total count value if the table filters don't change between pages. Unfortunately, angular-paginate-anything doesn't do that out of the box. Fortunately, this article shows how to do that.

Background, I have another article that shows how to use angular-paginate-anything for an Mvc.Net project with pagination. Consider that the starting point. Go ahead and read it quick. We'll wait right here...

Next, we'll open the paginate-anything-tpls.js file and work with that (since it's not minimized). Again, our goal is to only request a new total count when the parameters change.

When the pagination app initially loads, we need to get a total count so we'll initialize that state to true:

              controller: ['$scope', '$http', function ($scope, $http) {
                  $scope.urlParamsChanged = true;

After we get the paginated results from the server, we want to set the urlParamsChange state to false.

function requestRange(request) {
$scope.$emit('pagination:loadStart', request);
}).success(function (data, status, headers, config) {
$scope.urlParamsChanged = false;

Then, we need to set urlParamsChanged to true again when the filter parameters do change:

                  $scope.$watch('urlParams', function (newParams, oldParams) {
                      if ($scope.passive === 'true') { return; }
                      if (!angular.equals(newParams, oldParams)) {
                          $scope.urlParamsChanged = true;

When we do make a request to the server we need to check the urlParamsChanged and indicate if we need total count or not:

                          method: 'GET',
                          url: $scope.url,
                          params: $scope.urlParams,
                          headers: angular.extend(
                            {}, $scope.headers,
                            { 'Range-Unit': 'items', Range: $scope.urlParamsChanged == true ? [request.from,].join('-') : [request.from,, $scope.numItems].join('-') }
That's all the java script changes we need. Next, we need to modify our server to check if the total count is passed back to the server. Originally, total count on the input paging model was always 0, so we'll check if it's non-zero in the new code.

[Route("GetSearchResults"), HttpGet()]
public IHttpActionResult Search([FromUri] PagingModel pagingModel
, [FromUri] string Name = null
, [FromUri] string Region = null
IQueryable <company> query = ...
PagingModeSearch result = new PagingModeSearch();
result.From = pagingModel.From;
result.To = pagingModel.To;
if (pagingModel.TotalRecordCount > 0)
result.TotalRecordCount = pagingModel.TotalRecordCount;

result.TotalRecordCount = query.Count();

Finally, we need to update our WithPagingAttribute class in our NgPaginateModel.cs file to parse the range header for the total count value like this:

    public class WithPagingAttribute : ActionFilterAttribute
        private readonly string _paramName;
        private static readonly Regex RangeHeaderRegex = new Regex(@"(?<from>\d+)-(?<to>\d*)-(?<total>\d*)", RegexOptions.Compiled);
And, very lastly this:
        private static PagingModel IncomingPagingModel(HttpRequestHeaders headers, PagingModel pagingModel)
            if (pagingModel == null)
                pagingModel = new PagingModel();
            IEnumerable<string> rangeUnitValues;
            IEnumerable<string> rangeValues;
            if (headers.TryGetValues("Range-Unit", out rangeUnitValues)
                && rangeUnitValues.First().Equals("items", StringComparison.OrdinalIgnoreCase)
                && headers.TryGetValues("Range", out rangeValues))
                var rangeHeaderMatch = RangeHeaderRegex.Match(rangeValues.First());
                if (!string.IsNullOrWhiteSpace(rangeHeaderMatch.Groups["from"].Value))
                    pagingModel.From = int.Parse(rangeHeaderMatch.Groups["from"].Value);
                if (!string.IsNullOrWhiteSpace(rangeHeaderMatch.Groups["to"].Value))
                    pagingModel.To = int.Parse(rangeHeaderMatch.Groups["to"].Value);
                if (!string.IsNullOrWhiteSpace(rangeHeaderMatch.Groups["total"].Value))
                    pagingModel.TotalRecordCount = int.Parse(rangeHeaderMatch.Groups["total"].Value);

            return pagingModel;

This makes navigating through each page of a search result much faster when the total count can be reused.

Must change item for configuring MySQL for Windows

I recently installed MySQL and the MySQL workbench on my Windows 10 PC. It's working great. I loaded up about 2 GB of data in two 3 tables with roughly 2M rows total. Most of my simple (single table by ID or Full Text Index) queries ran really fast, and I was happy.

Then, I started working on queries that join tables. They were slow - like 26 seconds slow for a two table join by integer ID using a FK (simple stuff).

First, I checked my indices. I added several indices just trying things and hoping it'd work. That did not solve the slowness.

I had to tinker in Workbench quite a bit imagining what could be wrong and how do I identify the source problem, which turned out to be good learning exercise.

Eventually, I saw (on the Workbench Server's dashboard) that InnoDB buffer usage went to 100% during my long running queries. I didn't really know what that meant, but I was out of "good" ideas to try so I was happy to have any leads.

I asked the internet sages (search engines) about this "InnoDB buffer usage at 100%", and I found some leads. It turns out that buffer size is set in "my.ini" (on Windows) or "my.cnf" (Not-windows). The default was 8MB, which is amazingly low for any modern computer. (MySQL on my phone would want a bigger value than that as a default.)

I found my.ini at this path on my Windows 10 PC:

C:\programdata\MySQL\MySQL Server 5.7\my.ini

Specifically, edit that file with notepad or notepad++. Look for this value:


I changed mine to 1G first, and now I'm using 2G. I don't even think that's a lot, but this is all on my dev-box so I don't want to give it all of my RAM.


After you change that my.ini value, be sure to restart your MySQL service. 

  1. Type services.msc in to the Run box on Windows to bring up your services manager. 
  2. Find the MySQL service, and right click it. 
  3. There should be a "restart" option, so click that.
After this change, my 26 second simple table join query went to sub-second time.