Handling inconsistent URL formats in C#

Let’s say you have a whole bunch of URLs in a database or some other collection, and they come in inconsistent formats like these examples:

  • NovelEssay.com
  • www.NovelEssay.com
  • http://NovelEssay.com
  • http://www.NovelEssay.com

Is there an easy way to handle this variety of formats in C#?

Yes. We’ll do some prefix detection and then use the Uri class to help parse the data.

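First, we assume our original data is in the HomePage variable on the Record object, and that HTTP_PREFIX is a string constant holding “http://”. We’ll detect whether HomePage already starts with “http://” (or “https://”, so that secure URLs are not given a second prefix). If it doesn’t, we’ll add “http://”:

// Prepend a scheme only when none is present; the comparison ignores case.
if (!Record.HomePage.StartsWith(HTTP_PREFIX, StringComparison.OrdinalIgnoreCase) &&
    !Record.HomePage.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
{
    Record.HomePage = HTTP_PREFIX + Record.HomePage;
}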

Next, we’ll try to parse the HomePage with the Uri class like this:

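// Declared outside the try block so it is still in scope afterwards.
Uri myUri = null;
try
{
    myUri = new Uri(Record.HomePage);
}
catch (UriFormatException)
{
    // HomePage is not parsable as a Uri object; myUri stays null
}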

If HomePage is still in a bad format, the Uri constructor throws a UriFormatException, which we want to catch.
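
If you would rather not use exceptions for control flow, Uri.TryCreate reports failure through its return value instead. The following is only a sketch of that alternative: the HomePageParser class and TryParseHomePage method are hypothetical names made up for illustration, and restricting the result to http and https is an assumption about what counts as a valid home page.

```csharp
using System;

public static class HomePageParser
{
    // Hypothetical helper: returns true and sets 'uri' when 'homePage'
    // parses as an absolute http or https URL; returns false otherwise.
    public static bool TryParseHomePage(string homePage, out Uri uri)
    {
        // Uri.TryCreate returns false on bad input instead of throwing.
        if (Uri.TryCreate(homePage, UriKind.Absolute, out uri) &&
            (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
        {
            return true;
        }

        uri = null;
        return false;
    }
}
```

A caller could then write something like if (HomePageParser.TryParseHomePage(Record.HomePage, out Uri myUri)) { ... } and only read myUri.Host inside that branch.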

Finally, if an exception isn’t thrown, we can use the new myUri to extract the host name and other parts of the Uri like this:

Record.Domain = myUri.Host;

One last trick is to check the Record.Domain or myUri.Host to see if it begins with a www. prefix. Note that Host returns the full host name, so http://www.NovelEssay.com and http://NovelEssay.com produce different values. Depending on how you want to normalize your host information, you may want to add or remove the www prefix on the Host (or Domain) value.
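
As a rough sketch of the "remove" option, assuming you want hosts stored without the www. prefix, a small helper could look like this. HostNormalizer and StripWww are hypothetical names used only for illustration, and the input is assumed to be non-null.

```csharp
using System;

public static class HostNormalizer
{
    private const string WwwPrefix = "www.";

    // Hypothetical helper: removes one leading "www." from a host name,
    // ignoring case. A host that is only "www." is returned unchanged.
    public static string StripWww(string host)
    {
        if (host.StartsWith(WwwPrefix, StringComparison.OrdinalIgnoreCase) &&
            host.Length > WwwPrefix.Length)
        {
            return host.Substring(WwwPrefix.Length);
        }

        return host;
    }
}
```

With this in place, the assignment above could become Record.Domain = HostNormalizer.StripWww(myUri.Host); so that both the www and non-www forms of a URL map to the same Domain value.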

That’s all. Happy Uri parsing!