NovelEssay.com Programming Blog

Exploration of Big Data, Machine Learning, Natural Language Processing, and other fun problems.

Machine Learning for Network Security with C# and Vowpal Wabbit


This article will discuss a solution to the KDD99 Network Security contest using C# and the Vowpal Wabbit machine learning library. This is a very quick and naive approach that yields 97%+ accuracy. Proper data preprocessing and better feature selection should result in much better predictions. Details about that Network Security contest can be found here:

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html


Start by creating a new Visual Studio C# project and install the Vowpal Wabbit Nuget package.


Then, we build a class that describes the data records and the target labels that we have for training.

    public class DataRecord : VWRecord
    {
        public VWRecord GetVWRecord()
        {
            return (VWRecord) this;
        }

        public string label { get; set; }

        public int labelInt { get; set; }

        public bool isKnownAttackType { get; set; }
    }


Create a VWRecord class with Vowpal Wabbit annotations. Vowpal Wabbit uses these annotations to map the properties to features in the correct format. In this example, I set Enumerize = true as much as I can and lump the features into a single feature group. I didn't try splitting the features into different groups, but that seems like a smart and reasonable thing to explore; a rough sketch of that idea follows the class below.

    public class VWRecord
    {
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float duration { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public string protocol_type { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public string service { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public string flag { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float src_bytes { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_bytes { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float land { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float wrong_fragment { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float urgent { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float hot { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float num_failed_logins { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float logged_in { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float num_compromised { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float root_shell { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float su_attempted { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float num_root { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float num_file_creations { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float num_shells { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float num_access_files { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float num_outbound_cmds { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float is_host_login { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float is_guest_login { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float count { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float srv_count { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float serror_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float srv_serror_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float rerror_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float srv_rerror_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float same_srv_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float diff_srv_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float srv_diff_host_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_count { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_srv_count { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_same_srv_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_diff_srv_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_same_src_port_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_srv_diff_host_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_serror_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_srv_serror_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_rerror_rate { get; set; }
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public float dst_host_srv_rerror_rate { get; set; }

    }
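
As a sketch of the group-splitting idea mentioned above, each feature (or logical cluster of features) could get its own FeatureGroup letter, which also opens the door to -q / --cubic interactions between groups. The VWRecordGrouped class name and group letters here are made up for illustration and aren't used elsewhere in this article:

    public class VWRecordGrouped
    {
        [Feature(FeatureGroup = 'a', Enumerize = true)]
        public string protocol_type { get; set; }

        [Feature(FeatureGroup = 'b', Enumerize = true)]
        public string service { get; set; }

        [Feature(FeatureGroup = 'c', Enumerize = true)]
        public float src_bytes { get; set; }

        [Feature(FeatureGroup = 'd', Enumerize = true)]
        public float dst_bytes { get; set; }

        // ... remaining features follow the same pattern ...
    }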


Next, we create a Vowpal Wabbit wrapper class. After instantiating VWWrapper, call Init before any calls to Train or Predict.

    public class VWWrapper
    {
        VW.VowpalWabbit<VWRecord> vw = null;

        public void Init(bool train = true)
        {
            vw = new VW.VowpalWabbit<VWRecord>(new VowpalWabbitSettings
            {
                EnableStringExampleGeneration = true,
                Verbose = true,
                Arguments = string.Join(" "
                , "-f vw.model"
                , "--progress 100000"
                , "-b 27"
                )
            });

        }

        public void Train(DataRecord record)
        {
            VWRecord vwRecord = record.GetVWRecord();
            SimpleLabel label = new SimpleLabel() { Label = record.labelInt };
            vw.Learn(vwRecord, label);
        }

        public float Predict(VWRecord record)
        {
            return vw.Predict(record, VowpalWabbitPredictionType.Scalar);
        }
    }

The whole program is just a few lines. Open a data source to the training or test data set, loop through the records, and call Train or Predict. For predictions, compare each prediction against the actual label and score it accordingly.

Notice the cutoff score of 0.4 in the evaluation code below. Vowpal Wabbit gives a prediction between 0 and 1, and you can tune the cutoff score to get whatever precision and recall behavior suits your needs. A higher cutoff score will result in more "attack" records being predicted as "normal".

        static string trainingSource = @"C:\kaggle\kdd99\kddcup.data.corrected";
        static string testSource = @"C:\kaggle\kdd99\corrected";

        static VWWrapper vw = new VWWrapper();

        static int trainingRecordCount = 0;
        static int testRecordCount = 0;
        static int evaluateRecordCount = 0;

        static int correctNormal = 0;
        static int correctAttack = 0;
        static int totalNormal = 0;
        static int totalAttack = 0;

        static void Main(string[] args)
        {
            Stopwatch swTotal = new Stopwatch();
            swTotal.Start();
            vw.Init();
            DoTraining();
            DoEvaluate();
            swTotal.Stop();
            Console.WriteLine("Done. ElapsedTime: " + swTotal.Elapsed);
        }


        static void DoEvaluate()
        {
            float cutoffScore = 0.4f;

            DataSource sourceEval = new DataSource(testSource);
            while (sourceEval.NextRecord())
            {
                if(sourceEval.Record.isKnownAttackType)
                {
                    float prediction = vw.Predict(sourceEval.Record.GetVWRecord());

                    if(sourceEval.Record.labelInt == 0)
                    {
                        totalNormal++;
                        if (prediction < cutoffScore) correctNormal++;
                    }
                    if(sourceEval.Record.labelInt == 1)
                    {
                        totalAttack++;
                        if (prediction >= cutoffScore) correctAttack++;
                    }
                    evaluateRecordCount++;
                }
            }

            Console.WriteLine("Evaluate Complete. evaluateRecordCount = " + evaluateRecordCount);
            Console.WriteLine("Evaluate totalNormal = " + totalNormal + " correctNormal = " + correctNormal);
            Console.WriteLine("Evaluate totalAttack = " + totalAttack + " correctAttack = " + correctAttack);
            Console.WriteLine("Evaluate DONE!");
        }

        static void DoTraining()
        {
            DataSource source = new DataSource(trainingSource);
            while (source.NextRecord())
            {
                vw.Train(source.Record);
                trainingRecordCount++;
            }
            Console.WriteLine("Train Complete. trainingRecordCount = " + trainingRecordCount);
        }


This solution was incredibly quick and easy to implement, and it yields about 99.6% correct predictions on normal records and about 97% correct predictions on attack records.
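
For reference, those percentages are just the DoEvaluate counters expressed as ratios; a couple of lines like these at the end of DoEvaluate will print them:

            // Accuracy percentages computed from the DoEvaluate counters:
            Console.WriteLine("Normal accuracy: " + (100.0 * correctNormal / totalNormal).ToString("F1") + "%");
            Console.WriteLine("Attack accuracy: " + (100.0 * correctAttack / totalAttack).ToString("F1") + "%");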


Finally, here is the data source example so you can cut & paste to try this out yourself:

    public class DataSource
    {
        // Columns
        private static int COLUMN_COUNT = 42;
        private static int COLUMN_COUNT_TEST = 42;

        // Current Record Attributes
        public DataRecord Record = new DataRecord();


        private string sourceReport;

        private System.IO.StreamReader fileReader;
        private int sourceIndex;

        public DataSource(string sourceReport)
        {
            this.sourceReport = sourceReport;
            Reset();
        }

        public bool NextRecord()
        {
            bool foundRecord = false;
            while (!fileReader.EndOfStream)
            {
                try
                {
                    //Processing row
                    string line = fileReader.ReadLine();
                    string[] fields = line.TrimEnd('.').Split(',');

                    // Expect COLUMN_COUNT columns
                    if (fields.Count() != COLUMN_COUNT && fields.Count() != COLUMN_COUNT_TEST)
                    {
                        throw new Exception(string.Format("sourceReportParser column count [{0}] != expected COLUMN_COUNT [{1}]", fields.Count(), COLUMN_COUNT));
                    }

                    Record = new DataRecord();

                    if (fields.Count() == COLUMN_COUNT)
                    {
                        Record.duration = float.Parse(fields[0]);
                        Record.protocol_type = fields[1];
                        Record.service = fields[2];
                        Record.flag = fields[3];
                        Record.src_bytes = float.Parse(fields[4]);
                        Record.dst_bytes = float.Parse(fields[5]);
                        Record.land = float.Parse(fields[6]);
                        Record.wrong_fragment = float.Parse(fields[7]);
                        Record.urgent = float.Parse(fields[8]);
                        Record.hot = float.Parse(fields[9]);
                        Record.num_failed_logins = float.Parse(fields[10]);
                        Record.logged_in = float.Parse(fields[11]);
                        Record.num_compromised = float.Parse(fields[12]);
                        Record.root_shell = float.Parse(fields[13]);
                        Record.su_attempted = float.Parse(fields[14]);
                        Record.num_root = float.Parse(fields[15]);
                        Record.num_file_creations = float.Parse(fields[16]);
                        Record.num_shells = float.Parse(fields[17]);
                        Record.num_access_files = float.Parse(fields[18]);
                        Record.num_outbound_cmds = float.Parse(fields[19]);
                        Record.is_host_login = float.Parse(fields[20]);
                        Record.is_guest_login = float.Parse(fields[21]);
                        Record.count = float.Parse(fields[22]);
                        Record.srv_count = float.Parse(fields[23]);
                        Record.serror_rate = float.Parse(fields[24]);
                        Record.srv_serror_rate = float.Parse(fields[25]);
                        Record.rerror_rate = float.Parse(fields[26]);
                        Record.srv_rerror_rate = float.Parse(fields[27]);
                        Record.same_srv_rate = float.Parse(fields[28]);
                        Record.diff_srv_rate = float.Parse(fields[29]);
                        Record.srv_diff_host_rate = float.Parse(fields[30]);
                        Record.dst_host_count = float.Parse(fields[31]);
                        Record.dst_host_srv_count = float.Parse(fields[32]);
                        Record.dst_host_same_srv_rate = float.Parse(fields[33]);
                        Record.dst_host_diff_srv_rate = float.Parse(fields[34]);
                        Record.dst_host_same_src_port_rate = float.Parse(fields[35]);
                        Record.dst_host_srv_diff_host_rate = float.Parse(fields[36]);
                        Record.dst_host_serror_rate = float.Parse(fields[37]);
                        Record.dst_host_srv_serror_rate = float.Parse(fields[38]);
                        Record.dst_host_rerror_rate = float.Parse(fields[39]);
                        Record.dst_host_srv_rerror_rate = float.Parse(fields[40]);

                        Record.label = fields[41];
                        Record.isKnownAttackType = true;

                        switch (Record.label)
                        {
                            case "buffer_overflow":
                                Record.labelInt = 1;
                                break;
                            case "ftp_write":
                                Record.labelInt = 2;
                                break;
                            case "guess_passwd":
                                Record.labelInt = 3;
                                break;
                            case "imap":
                                Record.labelInt = 4;
                                break;
                            case "ipsweep":
                                Record.labelInt = 5;
                                break;
                            case "land":
                                Record.labelInt = 6;
                                break;
                            case "loadmodule":
                                Record.labelInt = 7;
                                break;
                            case "multihop":
                                Record.labelInt = 8;
                                break;
                            case "neptune":
                                Record.labelInt = 9;
                                break;
                            case "nmap":
                                Record.labelInt = 10;
                                break;
                            case "normal":
                                Record.labelInt = 11;
                                break;
                            case "perl":
                                Record.labelInt = 12;
                                break;
                            case "phf":
                                Record.labelInt = 13;
                                break;
                            case "pod":
                                Record.labelInt = 14;
                                break;
                            case "portsweep":
                                Record.labelInt = 15;
                                break;
                            case "rootkit":
                                Record.labelInt = 16;
                                break;
                            case "satan":
                                Record.labelInt = 17;
                                break;
                            case "smurf":
                                Record.labelInt = 18;
                                break;
                            case "spy":
                                Record.labelInt = 19;
                                break;
                            case "teardrop":
                                Record.labelInt = 20;
                                break;
                            case "warezclient":
                                Record.labelInt = 21;
                                break;
                            case "warezmaster":
                                Record.labelInt = 22;
                                break;
                            case "back":
                                Record.labelInt = 23;
                                break;
                            default:
                                //Console.WriteLine("ERROR: Invalid Label Type");
                                Record.isKnownAttackType = false;
                                break;
                        }

                        if(Record.label == "normal")
                        {
                            Record.labelInt = 0;
                        }
                        else
                        {
                            Record.labelInt = 1;
                        }
                    }
                    else
                    {


                    }

                    sourceIndex++;

                    // Getting here means we have a good record. Break the loop.
                    foundRecord = true;
                    break;
                }
                catch (Exception e)
                {
                    Console.WriteLine("ERROR: NextRecord failed for line: " + sourceIndex + " with exception: " + e.Message + " Stack: " + e.StackTrace);
                    sourceIndex++;
                }
            }
            return foundRecord;
        }

        public void Reset()
        {
            fileReader = new System.IO.StreamReader(sourceReport);
            // Burn column headers
            string line = fileReader.ReadLine();
            string[] fields = line.Split(',');
            if (fields.Count() != COLUMN_COUNT && fields.Count() != COLUMN_COUNT_TEST)
            {
                throw new Exception(string.Format("sourceReportParser column count [{0}] != expected COLUMN_COUNT [{1}]", fields.Count(), COLUMN_COUNT));
            }
            sourceIndex = 0;
        }
    }




Extracting important snippets with C# and Log Likelihood


How does text summary software pick out the most important ideas to present?

One solution is to use Log Likelihood to generate a summary from the sentences that contain the important terms and cover the most distinct topics.


What Log Likelihood is and why it works can be read here: https://en.wikipedia.org/wiki/Likelihood_function


This article will focus on implementing it in C#. Here's the code I wrote as a translation from the Mahout version. I tried to make it as readable as possible rather than optimizing for performance.

// Log Likelihood code roughly translated from here:
// http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-math/0.3/org/apache/mahout/math/stats/LogLikelihood.java#LogLikelihood.logLikelihoodRatio%28int%2Cint%2Cint%2Cint%29
static private double ShannonEntropy(List<Int64> elements)
{
    double sum = 0;
    foreach (Int64 element in elements)
    {
        sum += element;
    }
    double result = 0.0;
    foreach (Int64 element in elements)
    {
        if(element < 0)
        {
            throw new Exception("Should not have negative count for entropy computation (" + element + ")");
        }
        int zeroFlag = (element == 0 ? 1 : 0);
        result += element * Math.Log((element + zeroFlag) / sum);
    }
    return result;
}
/*
    Calculate the raw log-likelihood ratio for two events, call them A and B.
    The 2x2 contingency table is:

                            Event A                  Everything but A
        Event B             A and B together (k_11)  B, but not A (k_12)
        Everything but B    A without B (k_21)       Neither A nor B (k_22)

    Parameters:
    k11 The number of times the two events occurred together
    k12 The number of times the second event occurred WITHOUT the first event
    k21 The number of times the first event occurred WITHOUT the second event
    k22 The number of times something else occurred (i.e. was neither of these events)
*/
static public double LogLikelihoodRatio(Int64 k11, Int64 k12, Int64 k21, Int64 k22)
{
    double rowEntropy = ShannonEntropy(new List<Int64>() { k11, k12 }) + ShannonEntropy(new List<Int64>() { k21, k22 });
    double columnEntropy = ShannonEntropy(new List<Int64>() { k11, k21 }) + ShannonEntropy(new List<Int64>() { k12, k22 });
    double matrixEntropy = ShannonEntropy(new List<Int64>() { k11, k12, k21, k22 });
    return 2 * (matrixEntropy - rowEntropy - columnEntropy);
}

Now, we have a simple LogLikelihoodRatio function we can call with 4 parameters and get the score result.
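
For example, calling it with some purely made-up counts looks like this:

// Made-up counts following the parameter descriptions above: the term appears 10 times
// in a 200-term article and 1,000 times across a 1,000,000-term corpus.
double score = LogLikelihoodRatio(10, 990, 190, 999010);
Console.WriteLine("Log likelihood score: " + score);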


Let's say we want to pick out the most important sentences from a particular Wikipedia article in order to summarize it. (See this article for loading Wikipedia into ElasticSearch: http://blog.novelessay.com/post/loading-wikipedia-in-to-elasticsearch)

Follow these steps:

  1. Pick a Wikipedia article.
  2. Get a Term Frequency dictionary for the whole article.
  3. Parse the article into sentences.
  4. For each token in each sentence, calculate the Log Likelihood score with the above LogLikelihoodRatio function.
  5. If the result of LogLikelihoodRatio is less than -10, give that sentence +1 to a weight value.
  6. At the end of each sentence, you have a +X weight value. That can be normalized by the number of words in the sentence.
  7. After you've obtained the weight score from #6 for all of the sentences in an article, you can sort them and pick the most important ones.
For extra credit, you'll want to avoid redundant important sentences. To do that, you'll need to score candidate sentences against the summary output as you build it.
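
Putting steps 4 through 6 together, here is a minimal sketch of scoring one sentence. The term-count dictionaries and totals are hypothetical inputs you would build from your own term frequency data, and it reuses the LogLikelihoodRatio function above:

// A sketch of steps 4-6: weight one sentence by its statistically significant terms.
// articleTermCounts, wikiTermCounts, articleTotalTerms, and wikiTotalTerms are hypothetical
// inputs built from your own term frequency dictionaries.
static double ScoreSentence(string sentence,
    Dictionary<string, long> articleTermCounts, Dictionary<string, long> wikiTermCounts,
    long articleTotalTerms, long wikiTotalTerms)
{
    string[] tokens = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    double weightSum = 0;
    foreach (string token in tokens)
    {
        long k11 = articleTermCounts.ContainsKey(token) ? articleTermCounts[token] : 0;
        long wikiCount = wikiTermCounts.ContainsKey(token) ? wikiTermCounts[token] : 0;
        long k12 = Math.Max(0, wikiCount - k11);
        long k21 = articleTotalTerms - k11;
        long k22 = wikiTotalTerms - k12;
        // Step 5: count terms whose score crosses the -10 threshold used in this article.
        if (LogLikelihoodRatio(k11, k12, k21, k22) < -10) weightSum++;
    }
    // Step 6: normalize by the number of words in the sentence.
    return tokens.Length == 0 ? 0 : weightSum / tokens.Length;
}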


Here's some code with comments about populating the input values passed to the LogLikelihoodRatio function. Be sure to check that the result score is less than -10 before adding the +1 weight.

// http://www.cs.columbia.edu/~gmw/candidacy/LinHovy00.pdf - Section 4.1
// The count variables below are hypothetical placeholders - populate them from your own
// term frequency dictionaries.
Int64 k11 = termCountInArticle;              // frequency of current term in this article
Int64 k12 = termCountInAllWikipedia - k11;   // frequency of current term in all of Wikipedia - k11
Int64 k21 = totalTermCountInArticle - k11;   // total count of all terms in this article - k11
Int64 k22 = totalTermCountInWikipedia - k12; // total count of all terms in Wikipedia - k12
double termWeight = LogLikelihoodRatio(k11, k12, k21, k22);

if(termWeight < -10)
{
    weightSum++;
}

Obviously, in the above you don't want to be calculating term frequency across Wikipedia on the fly. k11 and k21 get calculated as you process an article, but k12 and k22 should be calculated in advance and cached in a lookup dictionary.


I use LevelDb as my term frequency lookup dictionary. You can read about using it here: http://blog.novelessay.com/post/fast-persistent-key-value-pairs-in-c-with-leveldb


In order to build your term frequency lookup dictionary cache, you could process each document and create your own term frequency output, or use the ElasticSearch plugin for _termlist here: https://github.com/jprante/elasticsearch-index-termlist



Text Extraction using C# .Net and Apache Tika


You want to use C# to extract text from documents and web pages. You want the results to be high quality, and you want it to be free. Try the .Net wrapper for the Apache Tika library!


Let's build a sample app to show the use case. First, start a C# console application in Visual Studio. Then use the Nuget package manager to install the TikaOnDotNet.TextExtractor package.



Then, try this sample code. It shows text extraction examples for file, URL, and byte array sources.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using TikaOnDotNet.TextExtraction;

namespace TikaTest
{
    class Program
    {
        static void Main(string[] args)
        {

            TextExtractor textExtractor = new TextExtractor();

            // Fun Utf8 strings found here: http://www.columbia.edu/~fdc/utf8/
            string utf8InputString = @"It's a small village in eastern Lower Saxony. The ""oe"" in this case turns out to be the Lower Saxon ""lengthening e""(Dehnungs-e), which makes the previous vowel long (used in a number of Lower Saxon place names such as Soest and Itzehoe), not the ""e"" that indicates umlaut of the preceding vowel. Many thanks to the Óechtringen-Namenschreibungsuntersuchungskomitee (Alex Bochannek, Manfred Erren, Asmus Freytag, Christoph Päper, plus Werner Lemberg who serves as Óechtringen-Namenschreibungsuntersuchungskomiteerechtschreibungsprüfer) for their relentless pursuit of the facts in this case. Conclusion: the accent almost certainly does not belong on this (or any other native German) word, but neither can it be dismissed as dirt on the page. To add to the mystery, it has been reported that other copies of the same edition of the PLZB do not show the accent! UPDATE (March 2006): David Krings was intrigued enough by this report to contact the mayor of Ebstorf, of which Oechtringen is a borough, who responded:";
            // Convert string to byte array
            byte[] byteArrayInput = Encoding.UTF8.GetBytes(utf8InputString);
            // Text Extraction Example for Byte Array
            TextExtractionResult result = textExtractor.Extract(byteArrayInput);
            Console.WriteLine(result.Text);

            // Text Extraction Example for Uri:
            result = textExtractor.Extract(new Uri("http://blog.novelessay.com"));
            Console.WriteLine(result.Text);

            // Text Extraction Example for File
            result = textExtractor.Extract(@"c:\myPdf.pdf");
            Console.WriteLine(result.Text);

            // Note that result also has metadata collection and content type attributes
            //result.Metadata
            //result.ContentType
        }
    }
}

Notice that the TextExtractionResult has a Metadata collection and a ContentType attribute. The metadata returned along with the extracted text contains many things, including author, dates, keywords, title, and description.
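
A quick way to see what came back is to dump the metadata; this assumes the Metadata property is exposed as a string key/value dictionary:

// Dump whatever metadata Tika extracted for the document.
foreach (var key in result.Metadata.Keys)
{
    Console.WriteLine(key + " = " + result.Metadata[key]);
}
Console.WriteLine("ContentType = " + result.ContentType);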



I've been very pleased with Tika's quality and ability to handle many different file types. I hope you try it out and enjoy it too.


Fast Persistent Key Value Pairs in C# with LevelDb



Let's say we want to crawl the internet, but we don't want to request any given URL more than once. We need to have a collection of URL keys that we can look up. It would be nice if we could have key-value pairs, so that we can give URL keys a value in case we change our minds and want to allow URL request updates every X days. We want it to handle billions of records and be really fast (and free). This article will show how to accomplish that using LevelDb and its C# wrapper.


First, start a Visual Studio C# project and download the LevelDb.Net Nuget package. There are a few different ones, but this is my favorite.


You can also find this LevelDb.Net at this Github location:

https://github.com/AntShares/leveldb


To start, I'm going to show how to use LevelDb from C#. Later in this article, the code shows how to insert and select a large number of records for speed testing.


Let's create a LevelDb:

            Options levelDbOptions = new Options();
            levelDbOptions.CreateIfMissing = true;
            LevelDB.DB levelDb = LevelDB.DB.Open("myLevelDb.dat", levelDbOptions);

Next, we'll insert some keys:

            levelDb.Put(LevelDB.WriteOptions.Default, "Key1", "Value1");
            levelDb.Put(LevelDB.WriteOptions.Default, "Key1", "Value2");
            levelDb.Put(LevelDB.WriteOptions.Default, "Key1", "Value3");
            levelDb.Put(LevelDB.WriteOptions.Default, "Key2", "Value2");

Then, we'll select some keys:

            LevelDB.Slice outputValue;
            if (levelDb.TryGet(LevelDB.ReadOptions.Default, "Key2", out outputValue))
            {
                Console.WriteLine("Key2: Value = " + outputValue.ToString());// Expect: Value2
            }
            if (levelDb.TryGet(LevelDB.ReadOptions.Default, "Key1", out outputValue))
            {
                Console.WriteLine("Key1: Value = " + outputValue.ToString()); // Expect: Value3
            }
            if (!levelDb.TryGet(LevelDB.ReadOptions.Default, "KeyXYZ", out outputValue))
            {
                Console.WriteLine("KeyXYZ: NOT FOUND.");
            }

LevelDb supports many different types of keys and values (strings, int, float, byte[], etc...).

  1. Open instance handle.
  2. Insert = Put
  3. Select = TryGet

That's it! 
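
Tying this back to the crawler use case, here is a minimal sketch of a "should I re-crawl this URL?" check that stores the last crawl date as the value. The key, date format, and X-day window here are made up for illustration:

            // Hypothetical URL-crawl check: skip URLs crawled within the last X days.
            string urlKey = "http://example.com/page1";
            int allowedAgeDays = 30;
            LevelDB.Slice lastCrawled;
            bool shouldCrawl = true;
            if (levelDb.TryGet(LevelDB.ReadOptions.Default, urlKey, out lastCrawled))
            {
                DateTimeOffset lastCrawlDate = DateTimeOffset.Parse(lastCrawled.ToString());
                shouldCrawl = (DateTimeOffset.UtcNow - lastCrawlDate).TotalDays > allowedAgeDays;
            }
            if (shouldCrawl)
            {
                // ... crawl the URL, then record the crawl date ...
                levelDb.Put(LevelDB.WriteOptions.Default, urlKey, DateTimeOffset.UtcNow.ToString("o"));
            }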

But, how fast is it?

Let's build a collection of MD5 hash keys and insert them:

            List<string> seedHashes = new List<string>();
            for (int idx = 0; idx < 500000; idx++)
            {
                byte[] encodedPassword = new UTF8Encoding().GetBytes(idx.ToString());
                byte[] hash = ((HashAlgorithm)CryptoConfig.CreateFromName("MD5")).ComputeHash(encodedPassword);
                string encoded = BitConverter.ToString(hash).Replace("-", string.Empty).ToLower();
                seedHashes.Add(encoded);
            }

            // Start Insert Speed Tests
            Stopwatch stopWatch = new Stopwatch();
            stopWatch.Start();
            foreach(var key in seedHashes)
            {
                levelDb.Put(LevelDB.WriteOptions.Default, key, "1");
            }
            stopWatch.Stop();
            Console.WriteLine("LevelDb Inserts took (ms) " + stopWatch.ElapsedMilliseconds);


Next, let's select each of the keys we just inserted several times:

            // Start Lookup Speed Tests
            stopWatch.Start();
            for (int loopIndex = 0; loopIndex < 10; loopIndex++)
            {
                for(int seedIndex = 0; seedIndex < seedHashes.Count; seedIndex++)
                {
                    if (!levelDb.TryGet(LevelDB.ReadOptions.Default, seedHashes[seedIndex], out outputValue))
                    {
                        Console.WriteLine("ERROR: Key Not Found: " + seedHashes[seedIndex]);
                    }
                }
            }
            stopWatch.Stop();
            Console.WriteLine("LevelDb Lookups took (ms) " + stopWatch.ElapsedMilliseconds);

On my junky 4-year-old desktop, 500,000 inserts took just under 60 seconds and 5 million selects took just over 2 minutes (roughly 8,000+ inserts and 40,000+ lookups per second).


The complete code sample is below:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using LevelDB;
using System.Security.Cryptography;
using System.Diagnostics;

namespace LevelDbExample
{
    class Program
    {
        static void Main(string[] args)
        {

            Options levelDbOptions = new Options();
            levelDbOptions.CreateIfMissing = true;
            LevelDB.DB levelDb = LevelDB.DB.Open("myLevelDb.dat", levelDbOptions);

            // Insert some records
            levelDb.Put(LevelDB.WriteOptions.Default, "Key1", "Value1");
            levelDb.Put(LevelDB.WriteOptions.Default, "Key1", "Value2");
            levelDb.Put(LevelDB.WriteOptions.Default, "Key1", "Value3");
            levelDb.Put(LevelDB.WriteOptions.Default, "Key2", "Value2");

            // Select some records
            LevelDB.Slice outputValue;
            if (levelDb.TryGet(LevelDB.ReadOptions.Default, "Key2", out outputValue))
            {
                Console.WriteLine("Key2: Value = " + outputValue.ToString());// Expect: Value2
            }
            if (levelDb.TryGet(LevelDB.ReadOptions.Default, "Key1", out outputValue))
            {
                Console.WriteLine("Key1: Value = " + outputValue.ToString()); // Expect: Value3
            }
            if (!levelDb.TryGet(LevelDB.ReadOptions.Default, "KeyXYZ", out outputValue))
            {
                Console.WriteLine("KeyXYZ: NOT FOUND.");
            }

            // Build a collection of hash keys
            List<string> seedHashes = new List<string>();
            for (int idx = 0; idx < 500000; idx++)
            {
                byte[] encodedPassword = new UTF8Encoding().GetBytes(idx.ToString());
                byte[] hash = ((HashAlgorithm)CryptoConfig.CreateFromName("MD5")).ComputeHash(encodedPassword);
                string encoded = BitConverter.ToString(hash).Replace("-", string.Empty).ToLower();
                seedHashes.Add(encoded);
            }

            // Start Insert Speed Tests
            Stopwatch stopWatch = new Stopwatch();
            stopWatch.Start();
            foreach(var key in seedHashes)
            {
                levelDb.Put(LevelDB.WriteOptions.Default, key, "1");
            }
            stopWatch.Stop();
            Console.WriteLine("LevelDb Inserts took (ms) " + stopWatch.ElapsedMilliseconds);

            // Start Lookup Speed Tests
            stopWatch.Start();
            for (int loopIndex = 0; loopIndex < 10; loopIndex++)
            {
                for(int seedIndex = 0; seedIndex < seedHashes.Count; seedIndex++)
                {
                    if (!levelDb.TryGet(LevelDB.ReadOptions.Default, seedHashes[seedIndex], out outputValue))
                    {
                        Console.WriteLine("ERROR: Key Not Found: " + seedHashes[seedIndex]);
                    }
                }
            }
            stopWatch.Stop();
            Console.WriteLine("LevelDb Lookups took (ms) " + stopWatch.ElapsedMilliseconds);

            return;
        }
    }
}

Using C# Nest with ElasticSearch Carrot2 Clustering Plugin


ElasticSearch Carrot2 Clustering Plugin:

https://github.com/carrot2/elasticsearch-carrot2

This article will walk you through setting up ElasticSearch and Carrot2 clustering so that you can implement something awesome, like clustering topics on the publicly available Hillary Clinton email data set, and then use the FoamTree visualization to display the clusters.



On to the example!

Let's say we want to get clusters for the Wikipedia index that we have loaded into ElasticSearch, and we also want to be able to get clusters based on queries.


First, we'll want to build a SearchDescriptor based on the query input. Let's just use a simple query string example for now. Here's example code to build a SearchDescriptor (which uses a special ignore "redirect" string that is custom to our Wikipedia index):

public static SearchDescriptor<Page> GetDocumentSearchDescriptorFromSearchParameters(string queryString, bool queryAnd, string ignoreQuery)
{
	string ignoreA = "#redirect";
	string ignoreB = "redirect";

	var searchDescriptor = new SearchDescriptor<Page>()
			.Query(q =>
				q.QueryString(p => p.Query(queryString).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
				&& !q.Term(p => p.text, ignoreA)
				&& !q.Term(p => p.text, ignoreB)
				&& !q.QueryString(p => p.Query(ignoreQuery).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
			);
	return searchDescriptor;
}

Next, we need to build our ElasticSearch cluster query, setting the connection, size, and from values. Then we use the Nest client serializer to convert our request to JSON:

public static EsCarrotClusters GetSearchCarrotClusters(string esUrl, string queryString, int from, int size, bool queryAnd, string ignoreQuery)
{
	ConnectionSettings Settings = new ConnectionSettings(new Uri(esUrl));
	ElasticClient Client = new ElasticClient(Settings);
	var searchDescriptor = GetDocumentSearchDescriptorFromSearchParameters(queryString, queryAnd, ignoreQuery);
	searchDescriptor.Fields(f => f.Add("text"));
	searchDescriptor.Size(size);
	searchDescriptor.From(from);

	var jsonQuery = Encoding.UTF8.GetString(Client.Serializer.Serialize(searchDescriptor));
	jsonQuery = jsonQuery.Replace("\"query\": {},", "");
	jsonQuery = "{ \"search_request\" : " + jsonQuery;
	jsonQuery += ", \"query_hint\" : \"";
	jsonQuery += queryString == null ? "" : queryString;
	jsonQuery += "\",\"field_mapping\":{\"title\":[\"fields.text\"],\"content\":[]}}";

	string esClusterQueryRequestJson = jsonQuery;

	EsCarrotClusters clusters = null;

	string json = GetEsJsonFromAPI(esUrl, "_search_with_clusters", "", esClusterQueryRequestJson);
	if (json.Length > 0)
	{
		try
		{
			clusters = JsonConvert.DeserializeObject<EsCarrotClusters>(json);
		}
		catch (Exception e)
		{
			Console.WriteLine("ERROR: Failed to deserialize clusters: " + e.Message);
		}
	}

	return clusters;
}

Example calling code that uses "cats and dogs" as a query string input:

EsCarrotClusters esCarrotClusters = EsHttpWebRequestApi.GetSearchCarrotClusters("http://localhost:9200/mywiki", "cats and dogs", 0, 10, true, "");
Special Note:
The GetEsJsonFromAPI function simply does an HttpWebRequest POST to the ElasticSearch Uri with the JSON content written to the request stream like this:
            using (var streamWriter = new StreamWriter(request.GetRequestStream()))
            {
                streamWriter.Write(esRequestJson);
                streamWriter.Flush();
                streamWriter.Close();
            }
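
For completeness, here is a minimal sketch of what such a helper might look like. The method name and parameter order match the calling code above, but the exact URL composition, the meaning of the (unused here) third parameter, and the lack of error handling are assumptions:

static string GetEsJsonFromAPI(string esUrl, string apiName, string queryString, string esRequestJson)
{
    // Compose something like http://localhost:9200/mywiki/_search_with_clusters
    string requestUrl = esUrl.TrimEnd('/') + "/" + apiName;
    if (!string.IsNullOrEmpty(queryString)) requestUrl += "?" + queryString;

    var request = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(requestUrl);
    request.Method = "POST";
    request.ContentType = "application/json";

    using (var streamWriter = new System.IO.StreamWriter(request.GetRequestStream()))
    {
        streamWriter.Write(esRequestJson);
    }

    using (var response = (System.Net.HttpWebResponse)request.GetResponse())
    using (var streamReader = new System.IO.StreamReader(response.GetResponseStream()))
    {
        return streamReader.ReadToEnd();
    }
}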


Lastly, you'll want to see my EsCarrotClusters classes, so you can deserialize the HttpWebRequest's response back to a C# friendly object. Enjoy:

    public class Shards
    {
        public int total { get; set; }
        public int successful { get; set; }
        public int failed { get; set; }
    }

    public class Fields
    {
        public List<string> filename { get; set; }
    }

    public class Hit
    {
        public string _index { get; set; }
        public string _type { get; set; }
        public string _id { get; set; }
        public double _score { get; set; }
        public Fields fields { get; set; }
    }

    public class Hits
    {
        public int total { get; set; }
        public double max_score { get; set; }
        public List<Hit> hits { get; set; }
    }

    public class Cluster
    {
        public int id { get; set; }
        public double score { get; set; }
        public string label { get; set; }
        public List<string> phrases { get; set; }
        public List<string> documents { get; set; }
        public bool? other_topics { get; set; }
    }

    public class Info
    {
        public string algorithm { get; set; }
        [JsonProperty("search-millis")]
        public string searchmillis { get; set; }
        [JsonProperty("clustering-millis")]
        public string clusteringmillis { get; set; }
        [JsonProperty("total-millis")]
        public string totalmillis { get; set; }
        [JsonProperty("include-hits")]
        public string includehits { get; set; }
        [JsonProperty("max-hits")]
        public string maxhits { get; set; }
    }

    public class EsCarrotClusters
    {
        public int took { get; set; }
        public bool timed_out { get; set; }
        public Shards _shards { get; set; }
        public Hits hits { get; set; }
        public List<Cluster> clusters { get; set; }
        public Info info { get; set; }
    }


After I run this against my Wikipedia ElasticSearch index for "cats and dogs", I get clusters like these:

  • Polynueropath in Dogs and Cats
  • Album Cats
  • Canine
  • Domestic Cats
  • Missing Disease

Notice that you also are given a Score property, which you can use to weight topics or visually show them differently to the user.
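
For example, with the classes above you can order the clusters by that score before displaying them (this uses System.Linq):

// Order the deserialized clusters by score, highest first.
foreach (var cluster in esCarrotClusters.clusters.OrderByDescending(c => c.score))
{
    Console.WriteLine(cluster.label + " (score: " + cluster.score + ", documents: " + cluster.documents.Count + ")");
}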



Querying Wikipedia in ElasticSearch with C# Nest client



This article assumes that you've already loaded the Wikipedia articles into your local ElasticSearch instance. Follow the instructions in this previous article to load ElasticSearch with the entire content of Wikipedia:

http://blog.novelessay.com/post/loading-wikipedia-in-to-elasticsearch


Start a Visual Studio C# console application project, and install the ElasticSearch Nest Nuget package. 


In your code, create a Nest ElasticClient instance that is configured for your ElasticSearch instance. We are using localhost:9200 and the index named "mywiki" as the location of our Wikipedia data. 

var node = new Uri("http://localhost:9200");
var settings = new ConnectionSettings(
    node,
    defaultIndex: "mywiki"
).SetTimeout(int.MaxValue);
ElasticClient esClient = new ElasticClient(settings);


The Wikipedia index schema has a particular field format. We'll need a Page class like this for Nest to map the fields into:

public class Page
{
    public List<string> category { get; set; }
    public bool special { get; set; }
    public string title { get; set; }
    public bool stub { get; set; }
    public bool disambiguation { get; set; }
    public List<string> link { get; set; }
    public bool redirect { get; set; }
    public string text { get; set; }
}


Now, we can start querying our Wikipedia ElasticSearch index using our Nest client. Here's a simple example that pulls down the first 10 Wikipedia articles:

var result = esClient.Search<Page>(s => s
    .From(0)
    .Size(10)
    .MatchAll()
    );

You can check the response for errors and loop through the Page hits like this:

if (result.IsValid)
{
    foreach (var page in result.Hits)
    {
        // page.Source.text contains the wikipedia article text
        Console.WriteLine(page.Source.title);
    }
}

After this, you can loop through all Wikipedia documents by changing the arguments passed to From and Size in the ElasticSearch query call.
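
For example, a simple paging loop might look like this. (Note that deep From/Size paging gets slow on a large index; ElasticSearch's scan/scroll API is a better fit for a full pass.)

// Page through the index 100 documents at a time until a page comes back empty (uses System.Linq).
int pageSize = 100;
for (int from = 0; ; from += pageSize)
{
    var page = esClient.Search<Page>(s => s.From(from).Size(pageSize).MatchAll());
    if (!page.IsValid || !page.Hits.Any()) break;
    foreach (var hit in page.Hits)
    {
        // hit.Source.text contains the article text
    }
}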


Here's a query example that emulates a Google-like search via the use of a QueryString. Notice the use of Operator.And. I suggest you change it to Operator.Or and observe the effect on your results.

var result = esClient.Search<Page>(s => s
    .Take(10)
    .Query(q => q
        .QueryString(p => p.Query("cats dogs birds").DefaultOperator(Operator.And))
    )
);


If you're ready to start getting fancy, you can write a function that builds a Nest SearchDescriptor based on your query criteria. Then use the SearchDescriptor in your query to ElasticSearch. I wanted to search Wikipedia without getting redirect link results, so I set some ignore options in the example below that exclude #redirect terms for my search descriptors.

public static SearchDescriptor<Page> GetDocumentSearchDescriptorFromSearchParameters(string queryString, bool queryAnd, string ignoreQuery)
{
    string ignoreA = "#redirect";
    string ignoreB = "redirect";

    var searchDescriptor = new SearchDescriptor<Page>()
            .Query(q =>
                q.QueryString(p => p.Query(queryString).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
                && !q.Term(p => p.text, ignoreA)
                && !q.Term(p => p.text, ignoreB)
                && !q.QueryString(p => p.Query(ignoreQuery).DefaultOperator(queryAnd ? Operator.And : Operator.Or))
            );
    return searchDescriptor;
}

If this SearchDescriptor example is a little confusing, stay tuned for the future article I intend to write on ElasticSearch Wikipedia clustering. In the meantime, you should be set up to query your Wikipedia ElasticSearch index with the C# Nest client.



Loading Wikipedia in to ElasticSearch



This article gives instructions for loading Wikipedia articles into ElasticSearch. I did this on Windows, but all of these steps should work on any Java-friendly platform.
  1. Download ElasticSearch
  2. Download stream2es
  3. Download Wikipedia articles
  4. Start ElasticSearch
  5. Run stream2es


Download ElasticSearch

Go to Elastic.co and download ElasticSearch here: https://www.elastic.co/downloads/elasticsearch
Download and unzip the ElasticSearch archive into a folder of your choice.

Download stream2es

Go here and download the stream2es java jar file: http://download.elasticsearch.org/stream2es/stream2es
Optional: See stream2es on github for options: https://github.com/elastic/stream2es

Download Wikipedia articles

Go here and download the wikipedia article archive: https://dumps.wikimedia.org/enwiki/latest/
There are many options, but the specific one I downloaded was this: enwiki-latest-pages-articles.xml.bz2
(It's over 12GB, so be sure you have plenty of disk space.)

Start ElasticSearch

I'm on Windows, so I opened a command window and ran this: 
c:\elasticsearch-1.5.2\bin\elasticsearch.bat
That starts up your local ElasticSearch instance at localhost:9200

Run stream2es

  • Move the stream2es file to your ElasticSearch bin folder. I put stream2es here c:\elasticsearch-1.5.2\bin\
  • Move the Wikipedia archive (enwiki-latest-pages-articles.xml.bz2) to your ElasticSearch bin folder too.
  • Run the stream2es java file: 
C:\elasticsearch-1.5.2\bin>java -jar stream2es wiki --target http://localhost:9200/mywiki --log debug --source /enwiki-latest-pages-articles.xml.bz2

Notes:
  1. You can change the "mywiki" to whatever you want your specific ElasticSearch index name to be.
  2. I had some trouble getting stream2es to find my wikipedia archive path on Windows, but the / in front of the file name worked.



I ran this all locally on my Windows desktop, and it took 6-8 hours. It appeared to lock up near the end, but it did eventually exit.

Now, you should have over 16 million Wikipedia articles loaded into your local ElasticSearch index. Enjoy.

I plan on doing future articles on using this Wikipedia data for machine learning, natural language processing, and topic clustering.

Vowpal Wabbit C# solution to Kaggle competition: Grupo Bimbo Inventory Demand

Kaggle Competition: Grupo Bimbo Inventory Demand

grupo-bimbo-inventory-demand

Vowpal Wabbit C# solution summary:

  1. Download data sets from the Kaggle competition site.
  2. Create a new Visual Studio 2015 solution and project.
  3. Install Vowpal Wabbit Nuget package to your Visual Studio project.
  4. Create a DataSource class to read the data set files.
  5. Create a VWRecord class with annotated properties.
  6. Create a VWWrapper class to manage the Vowpal Wabbit engine instance, training, and prediction API calls.
  7. Create a test program to manage training, prediction, evaluation jobs.
  8. Run test program and tinker until satisfied.


Download data sets from the Kaggle competition site.

Specifically here: grupo-bimbo-inventory-demand/data

You don't necessarily need to download the data to understand this article, but some of the data fields are referenced below. Review the field descriptions to give yourself some context.


Create a new Visual Studio 2015 solution and project.

I created a C# console application in Visual Studio to run my test projects. Nothing fancy about that.


Install Vowpal Wabbit Nuget package to your Visual Studio project.

Use the Nuget package manager in Visual Studio. Search for vowpal wabbit. Install the latest version.


Create a DataSource class to read the data set files.

This class opens the training (or test) file and reads each line. Each line is split into fields, which are mapped to properties by column position. The number of columns tells this class which mapping to use: the training data has 11 columns and the test data has 7 columns. Correct - this is not fancy, but it works.

    public class DataSource
    {
        // Columns
        private static int COLUMN_COUNT = 11;
        private static int COLUMN_COUNT_TEST = 7;

        // Current Record Attributes
        public DataRecord Record = new DataRecord();


        private string sourceReport;

        private System.IO.StreamReader fileReader;
        private int sourceIndex;

        public DataSource(string sourceReport)
        {
            this.sourceReport = sourceReport;
            Reset();
        }

        public bool NextRecord()
        {
            bool foundRecord = false;
            while (!fileReader.EndOfStream)
            {
                try
                {
                    //Processing row
                    string line = fileReader.ReadLine();
                    string[] fields = line.Split(',');

                    // Expect COLUMN_COUNT columns
                    if (fields.Count() != COLUMN_COUNT && fields.Count() != COLUMN_COUNT_TEST)
                    {
                        throw new Exception(string.Format("sourceReportParser column count [{0}] != expected COLUMN_COUNT [{1}]", fields.Count(), COLUMN_COUNT));
                    }

                    Record = new DataRecord();

                    if (fields.Count() == COLUMN_COUNT)
                    {
                        // Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
                        Record.WeekId = fields[0];
                        Record.SalesDepotID = int.Parse(fields[1]);
                        Record.SalesChannelID = int.Parse(fields[2]);
                        Record.RouteID = int.Parse(fields[3]);
                        Record.Cliente_ID = int.Parse(fields[4]);
                        Record.Producto_ID = int.Parse(fields[5]);
                        Record.Venta_uni_hoy = fields[6];
                        Record.Venta_hoy = fields[7];
                        Record.Dev_uni_proxima = fields[8];
                        Record.Dev_proxima = fields[9];
                        Record.Demanda_uni_equil = fields[10];
                    }
                    else
                    {
                        //id,Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID
                        Record.Id = fields[0];
                        Record.WeekId = fields[1];
                        Record.SalesDepotID = int.Parse(fields[2]);
                        Record.SalesChannelID = int.Parse(fields[3]);
                        Record.RouteID = int.Parse(fields[4]);
                        Record.Cliente_ID = int.Parse(fields[5]);
                        Record.Producto_ID = int.Parse(fields[6]);
                    }

                    sourceIndex++;

                    // Getting here means we have a good record. Break the loop.
                    foundRecord = true;
                    break;
                }
                catch (Exception e)
                {
                    Console.WriteLine("ERROR: NextRecord failed for line: " + sourceIndex + " with exception: " + e.Message + " Stack: " + e.StackTrace);
                    sourceIndex++;
                }
            }
            return foundRecord;
        }

        public void Reset()
        {
            fileReader = new System.IO.StreamReader(sourceReport);
            // Burn column headers
            string line = fileReader.ReadLine();
            string[] fields = line.Split(',');
            if (fields.Count() != COLUMN_COUNT && fields.Count() != COLUMN_COUNT_TEST)
            {
                throw new Exception(string.Format("sourceReportParser column count [{0}] != expected COLUMN_COUNT [{1}]", fields.Count(), COLUMN_COUNT));
            }
            sourceIndex = 0;
        }

    }


Create a VWRecord class with annotated properties.

This class has a property for each of the fields in the test data except the Week ID. I observed that the Week ID was better used for splitting the training data than as a feature for training or prediction. Notice that the annotation on each property uses a separate FeatureGroup. I tinkered with putting the properties in the same feature group, but that yielded worse results. Also notice the Enumerize = true annotations: the properties are integers, but they are ID values, hence the Enumerize = true setup.

    public class VWRecord
    {        
        [Feature(FeatureGroup = 'h', Enumerize = true)]
        public int SalesDepotID { get; set; }

        [Feature(FeatureGroup = 'i', Enumerize = true)]
        public int SalesChannelID { get; set; }

        [Feature(FeatureGroup = 'j', Enumerize = true)]
        public int RouteID { get; set; }


        [Feature(FeatureGroup = 'k', Enumerize = true)]
        public int Cliente_ID { get; set; }

        [Feature(FeatureGroup = 'l', Enumerize = true)]
        public int Producto_ID { get; set; }
    }

Create a VWWrapper class to manage the Vowpal Wabbit engine instance, training, and prediction API calls.

The VW instance is created with a call to Init. Notice the several quadratic and cubic feature interactions specified. I originally tried --loss_function=squared but found much better results with --loss_function=quantile. Notice the commented-out lines in the Train function that let you view the serialized VW input strings; I found that useful for debugging and sanity checking.

    public class VWWrapper
    {
        VW.VowpalWabbit<VWRecord> vw = null;

        public void Init(bool train = true)
        {
            vw = new VW.VowpalWabbit<VWRecord>(new VowpalWabbitSettings {
                EnableStringExampleGeneration = true,
                Verbose = true,
                Arguments = string.Join(" "
                , "-f vw.model"
                //, "--loss_function=squared"
                , "--loss_function=quantile"
                , "--progress 100000"
                //, "--passes 2"
                //, "--cache_file vw.cache"
                , "-b 27"

                , "-q lk"
                , "-q lj"
                , "-q li"
                , "-q lh"

                , "--cubic hkl"
                , "--cubic ikl"
                , "--cubic jkl"

                , "--cubic hjl"
                //, "--cubic hil"
                , "--cubic hij"
                )
            });
            
        }

        public void Train(VWRecord record, float label)
        {
            // Comment this in if you want to see the VW serialized input records:
            //var str = vw.Serializer.Create(vw.Native).SerializeToString(record, new SimpleLabel() { Label = label });
            //Console.WriteLine(str);
            vw.Learn(record, new SimpleLabel() { Label = label });
        }

        public void SaveModel()
        {
            vw.Native.SaveModel();
        }

        public float Predict(VWRecord record)
        {
            return vw.Predict(record, VowpalWabbitPredictionType.Scalar);
        }
    }

Create a test program to manage training, prediction, evaluation jobs.

If you created a Console Application (like I did), then you can set your Main up like this. There are two primary modes: train/evaluate and train/predict. Train/evaluate lets you measure your model against labeled data that was not used during training. Vowpal Wabbit also does a good job of reporting its own average loss as it goes.


Notice the "Median" collections. When VW predicts a 0 label, we're going to ignore that and use some Median values instead.

        static string trainingSource = @"C:\kaggle\GrupoBimboInventoryDemand\train.csv";
        static string testSource = @"C:\kaggle\GrupoBimboInventoryDemand\test.csv";
        static string outputPredictions = @"C:\kaggle\GrupoBimboInventoryDemand\myPredictions.csv";

        static VWWrapper vw = new VWWrapper();

        static int trainingRecordCount = 0;
        static int testRecordCount = 0;
        static int evaluateRecordCount = 0;

        static Dictionary<int, float> productDemand = new Dictionary<int, float>();
        static Dictionary<int, int> productRecords = new Dictionary<int, int>();
        static Dictionary<int, List<float>> productDemandInstances = new Dictionary<int, List<float>>();
        static List<float> demandInstances = new List<float>();
        static Int64 totalDemand = 0;

        static Dictionary<int, float> productMedians = new Dictionary<int, float>();
        static float allProductsMedian = 0;

        static void Main(string[] args)
        {
            Stopwatch swTotal = new Stopwatch();
            swTotal.Start();

            vw.Init();

            bool testOnly = true;
            if(testOnly)
            {
                DoTraining(true);
                DoEvaluate();
            }
            else
            {
                DoTraining();
                DoPredictions();
            }

            // Save model here if you want to load and reuse it.
            // Current test app doesn't ever load/reuse it.
            //vw.SaveModel();

            swTotal.Stop();
            Console.WriteLine("Done. ElapsedTime: " + swTotal.Elapsed);
        }
        // Median of an unsorted array: for an odd count both indices land on the
        // middle element; for an even count the two middle elements are averaged.
        static float Median(float[] xs)
        {
            var ys = xs.OrderBy(x => x).ToList();
            double mid = (ys.Count - 1) / 2.0;
            return (ys[(int)(mid)] + ys[(int)(mid + 0.5)]) / 2;
        }
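If you do save the model, a later run could reload it instead of retraining. Here is a minimal sketch of such an init path on VWWrapper, assuming the same VowpalWabbitSettings pattern used above; the "-i" flag loads an existing regressor and "-t" disables further learning.

        // Hypothetical addition to VWWrapper (not part of the original test app):
        // load a previously saved model and run in test-only mode.
        public void InitFromSavedModel(string modelPath = "vw.model")
        {
            vw = new VW.VowpalWabbit<VWRecord>(new VowpalWabbitSettings
            {
                EnableStringExampleGeneration = true,
                Verbose = true,
                Arguments = "-i " + modelPath + " -t"
            });
        }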


The DoTraining function loops through the DataSource records and maps the fields into a VWRecord. The VWRecord is sent to the Train function along with the known label. Each record's label is also saved into the product demand and instance collections so that we can calculate median demand values later. If skipEightNine is set to true, the training data for weeks 8 & 9 is not used for training; that data is used later for model evaluation.

        static void DoTraining(bool skipEightNine = false)
        {
            DataSource source = new DataSource(trainingSource);
            while (source.NextRecord())
            {
                if (skipEightNine && (source.Record.WeekId == "8" || source.Record.WeekId == "9")) continue;

                VWRecord vwRecord = new VWRecord();
                vwRecord.Cliente_ID = source.Record.Cliente_ID;
                vwRecord.Producto_ID = source.Record.Producto_ID;
                vwRecord.RouteID = source.Record.RouteID;
                vwRecord.SalesChannelID = source.Record.SalesChannelID;
                vwRecord.SalesDepotID = source.Record.SalesDepotID;

                float label = float.Parse(source.Record.Demanda_uni_equil);

                demandInstances.Add(label);
                totalDemand += (Int64)label;   // demand labels are whole numbers
                if (productDemand.ContainsKey(source.Record.Producto_ID))
                {
                    productDemand[source.Record.Producto_ID] += label;
                    productRecords[source.Record.Producto_ID]++;
                    productDemandInstances[source.Record.Producto_ID].Add(label);
                }
                else
                {
                    productDemand.Add(source.Record.Producto_ID, label);
                    productRecords.Add(source.Record.Producto_ID, 1);
                    productDemandInstances.Add(source.Record.Producto_ID, new List<float>() { label });
                }

                trainingRecordCount++;

                vw.Train(vwRecord, label);
            }

            // Calculate medians
            foreach(var product in productDemandInstances.Keys)
            {
                productMedians.Add(product, Median(productDemandInstances[product].ToArray()));
            }
            allProductsMedian = Median(demandInstances.ToArray());
            // end median calculations

            Console.WriteLine("Train Complete. trainingRecordCount = " + trainingRecordCount);
        }


The DoEvaluate function loops through the DataSource records and only processes the week 8 & 9 records. Each of those records is mapped to a VWRecord, and a Vowpal Wabbit prediction is made on it. If the prediction is 0, the median values are used as the prediction instead.


The squared difference between the log of the predicted value and the log of the actual value is accumulated and reported as the RMSLE, the Root Mean Squared Logarithmic Error, which is the competition's evaluation metric:

https://www.kaggle.com/wiki/RootMeanSquaredLogarithmicError
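As a point of reference, the same measure can be computed over any pair of prediction/actual collections. Here is a minimal standalone sketch; this helper is mine and is not part of the competition code or the program below.

        // Hypothetical helper: Root Mean Squared Logarithmic Error.
        // RMSLE = sqrt( (1/n) * sum( (ln(prediction + 1) - ln(actual + 1))^2 ) )
        static double Rmsle(IList<float> predictions, IList<float> actuals)
        {
            double sumSquaredLogDiff = 0;
            for (int i = 0; i < predictions.Count; i++)
            {
                double logDiff = Math.Log(predictions[i] + 1) - Math.Log(actuals[i] + 1);
                sumSquaredLogDiff += logDiff * logDiff;
            }
            return Math.Sqrt(sumSquaredLogDiff / predictions.Count);
        }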

        static void DoEvaluate()
        {
            double logDiffSquaredSum = 0;

            DataSource sourceEval = new DataSource(trainingSource);
            while (sourceEval.NextRecord())
            {
                if (sourceEval.Record.WeekId != "8" && sourceEval.Record.WeekId != "9") continue;

                VWRecord vwRecord = new VWRecord();

                vwRecord.Cliente_ID = sourceEval.Record.Cliente_ID;
                vwRecord.Producto_ID = sourceEval.Record.Producto_ID;
                vwRecord.RouteID = sourceEval.Record.RouteID;
                vwRecord.SalesChannelID = sourceEval.Record.SalesChannelID;
                vwRecord.SalesDepotID = sourceEval.Record.SalesDepotID;

                float actualLabel = float.Parse(sourceEval.Record.Demanda_uni_equil);

                float prediction = vw.Predict(vwRecord);

                if (prediction == 0)
                {
                    if (productDemand.ContainsKey(sourceEval.Record.Producto_ID))
                    {
                        prediction = productMedians[sourceEval.Record.Producto_ID];
                    }
                    else
                    {
                        prediction = allProductsMedian;
                    }
                }

                double logDiff = Math.Log(prediction + 1) - Math.Log(actualLabel + 1);
                logDiffSquaredSum += Math.Pow(logDiff, 2);


                evaluateRecordCount++;
            }
             
            Console.WriteLine("Evaluate Complete. evaluateRecordCount = " + evaluateRecordCount);
            Console.WriteLine("Evaluate RMSLE = " + Math.Pow(logDiffSquaredSum / evaluateRecordCount, 0.5));
            Console.WriteLine("Evaluate DONE!");
        }
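The zero-prediction fallback above is repeated verbatim in DoPredictions below, so it could be factored into a small helper; a sketch (the method name is mine):

        // Hypothetical refactoring: when VW predicts 0, fall back to the product's median
        // demand, or to the overall median if the product never appeared in the training data.
        static float ApplyMedianFallback(int productoId, float prediction)
        {
            if (prediction != 0)
                return prediction;

            return productMedians.ContainsKey(productoId)
                ? productMedians[productoId]
                : allProductsMedian;
        }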


After you have a model you like from the train & evaluate process, you will want to run train & predict. The DoPredictions function sets up the test data set as a DataSource, loops through the records, and calls Predict for each one. Whenever the prediction is 0, the median values are used as the prediction instead. The final output is written to a file with two columns: the test record ID and the predicted value.

        static void DoPredictions()
        {
            System.IO.File.WriteAllText(outputPredictions, "id,Demanda_uni_equil\n");

            StringBuilder predictionSB = new StringBuilder();

            DataSource sourceTest = new DataSource(testSource);
            while (sourceTest.NextRecord())
            {
                VWRecord vwRecord = new VWRecord();

                vwRecord.Cliente_ID = sourceTest.Record.Cliente_ID;
                vwRecord.Producto_ID = sourceTest.Record.Producto_ID;
                vwRecord.RouteID = sourceTest.Record.RouteID;
                vwRecord.SalesChannelID = sourceTest.Record.SalesChannelID;
                vwRecord.SalesDepotID = sourceTest.Record.SalesDepotID;

                testRecordCount++;
                if (testRecordCount % 100000 == 0)
                {
                    Console.WriteLine("sourceTest recordCount = " + testRecordCount);
                }

                float prediction = vw.Predict(vwRecord);

                if (prediction == 0)
                {
                    if (productDemand.ContainsKey(sourceTest.Record.Producto_ID))
                    {
                        prediction = productMedians[sourceTest.Record.Producto_ID];
                    }
                    else
                    {
                        prediction = allProductsMedian;
                    }
                }

                string outputLine = sourceTest.Record.Id + "," + prediction + "\n";
                predictionSB.Append(outputLine);
            }

            System.IO.File.AppendAllText(outputPredictions, predictionSB.ToString());

            Console.WriteLine("Predict Complete. recordCount = " + testRecordCount);
        }


Run the test program and tinker until satisfied.

If you run this program exactly as it is, you'll get an RMSLE score of about 0.548 from train & evaluate. That yields a Kaggle competition score of 0.532. You can tinker with the features, the feature spaces, the quadratic/cubic interactions, and other Vowpal Wabbit engine configurations to try to improve your scores. I'm sure there are many ways to improve this particular solution, but hopefully it provides you with a good framework to start from.
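If you want concrete starting points for that tinkering, here is an illustrative alternative set of engine arguments. None of these values are tuned recommendations; they are just knobs worth experimenting with first.

            // Illustrative only: alternative VW engine settings to experiment with.
            // Values are not tuned. Multiple passes require a cache file.
            var experimentalSettings = new VowpalWabbitSettings
            {
                EnableStringExampleGeneration = true,
                Verbose = true,
                Arguments = string.Join(" "
                    , "-f vw.model"
                    , "--loss_function=quantile"
                    , "-b 29"                  // more feature-hash bits (uses more memory)
                    , "-l 0.25"                // learning rate
                    , "--l2 1e-7"              // L2 regularization
                    , "--passes 3"             // multiple passes over the data
                    , "--cache_file vw.cache"  // required when using --passes
                    , "-q lk"
                    , "--cubic jkl"
                    )
            };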