Kaggle: Zillow’s Zestimate competition: C# Vowpal Wabbit training and prediction – Part 3

Parts 1 and 2 of my Kaggle: Zillow’s Zestimate competition series covered reading the source data files and preprocessing the records. This is Part 3, which focuses on training and prediction using Vowpal Wabbit.

Be sure to install the Vowpal Wabbit NuGet package in your Visual Studio C# project.

Let’s create a VWRecord class that is essentially the same as our Parcel class from Part 1 of this series, except that its properties are annotated with Vowpal Wabbit feature attributes that assign each one to a namespace (feature group).

    public class VWRecord
    {
        [Feature(FeatureGroup = 'a')]
        public float airconditioningtypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float architecturalstyletypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float basementsqft { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float bathroomcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float bedroomcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float buildingclasstypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float buildingqualitytypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float calculatedbathnbr { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float decktypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedfloor1squarefeet { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float calculatedfinishedsquarefeet { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet12 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet13 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet15 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet50 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float finishedsquarefeet6 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float fips { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float fireplacecnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float fullbathcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float garagecarcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float garagetotalsqft { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float hashottuborspa { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float heatingorsystemtypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float latitude { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float longitude { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float lotsizesquarefeet { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float poolcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float poolsizesum { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float pooltypeid10 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float pooltypeid2 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float pooltypeid7 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public string propertycountylandusecode { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float propertylandusetypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public string propertyzoningdesc { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float rawcensustractandblock { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float regionidcity { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float regionidcounty { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float regionidneighborhood { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float regionidzip { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float roomcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float storytypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float threequarterbathnbr { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float typeconstructiontypeid { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float unitcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float yardbuildingsqft17 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float yardbuildingsqft26 { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float yearbuilt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float numberofstories { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float fireplaceflag { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float structuretaxvaluedollarcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float taxvaluedollarcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float assessmentyear { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float landtaxvaluedollarcnt { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float taxamount { get; set; }
        [Feature(FeatureGroup = 'a')]
        public string taxdelinquencyflag { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float taxdelinquencyyear { get; set; }
        [Feature(FeatureGroup = 'a')]
        public float censustractandblock { get; set; }
    }
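To make the feature group idea a bit more concrete, here is a small sketch of what a labeled example looks like in VW's plain-text input format, where a namespace is introduced with a `|` followed by its name. The VwTextFormat helper is a hypothetical illustration only; it is not part of the Vowpal Wabbit package, and the C# serializer does not necessarily emit exactly this string for every property type.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper for illustration; not part of the Vowpal Wabbit package.
public static class VwTextFormat
{
    // Builds one line of the form: label |namespace name:value name:value ...
    public static string ToLine(float label, char featureGroup,
        IEnumerable<KeyValuePair<string, float>> features)
    {
        // Invariant culture keeps the decimal separator a '.' on every machine.
        var parts = features.Select(f =>
            f.Key + ":" + f.Value.ToString(CultureInfo.InvariantCulture));
        return label.ToString(CultureInfo.InvariantCulture)
            + " |" + featureGroup + " " + string.Join(" ", parts);
    }
}
```

Calling ToLine with a label of 0.5, group 'a', and the two features bathroomcnt = 2 and bedroomcnt = 3 gives "0.5 |a bathroomcnt:2 bedroomcnt:3".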

Then, we create a wrapper we use to call the Vowpal Wabbit API and hold the VW engine instance handle. You can extend the Init function to try various loss functions, learning parameters, and other VW tuning approaches. A good first loss function for the Zillow’s Zestimate Kaggle competition is quantile.

    public class VWWrapper
    {
        VW.VowpalWabbit<VWRecord> vw = null;

        public void Init()
        {
            string vwArgs = string.Join(" "
                , "-f vw.model"
                //, "--loss_function=squared"
                , "--loss_function=quantile"
                //, "--loss_function=hinge"
                //, "--loss_function=logistic"
                , "--progress 10000"
                //, "--learning_rate " + learningRate
                //, "--power_t " + powerRates
                //, "--l2 " + l2Value
                //, "--binary"
                , "-b 27"
                );

            vw = new VW.VowpalWabbit<VWRecord>(new VowpalWabbitSettings
            {
                EnableStringExampleGeneration = true,
                Verbose = true,
                Arguments = vwArgs
            });
        }
        public VowpalWabbitPerformanceStatistics GetStats()
        {
            return vw.Native.PerformanceStatistics;
        }
        public VWRecord GetVwRecord(Parcel parcel)
        {
            VWRecord vwRecord = new VWRecord();
            vwRecord.airconditioningtypeid = parcel.airconditioningtypeid;
            vwRecord.architecturalstyletypeid = parcel.architecturalstyletypeid;
            vwRecord.basementsqft = parcel.basementsqft;
            vwRecord.bathroomcnt = parcel.bathroomcnt;
            vwRecord.bedroomcnt = parcel.bedroomcnt;
            vwRecord.buildingclasstypeid = parcel.buildingclasstypeid;
            vwRecord.buildingqualitytypeid = parcel.buildingqualitytypeid;
            vwRecord.calculatedbathnbr = parcel.calculatedbathnbr;
            vwRecord.decktypeid = parcel.decktypeid;
            vwRecord.finishedfloor1squarefeet = parcel.finishedfloor1squarefeet;
            vwRecord.calculatedfinishedsquarefeet = parcel.calculatedfinishedsquarefeet;
            vwRecord.finishedsquarefeet12 = parcel.finishedsquarefeet12;
            vwRecord.finishedsquarefeet13 = parcel.finishedsquarefeet13;
            vwRecord.finishedsquarefeet15 = parcel.finishedsquarefeet15;
            vwRecord.finishedsquarefeet50 = parcel.finishedsquarefeet50;
            vwRecord.finishedsquarefeet6 = parcel.finishedsquarefeet6;
            vwRecord.fips = parcel.fips;
            vwRecord.fireplacecnt = parcel.fireplacecnt;
            vwRecord.fullbathcnt = parcel.fullbathcnt;
            vwRecord.garagecarcnt = parcel.garagecarcnt;
            vwRecord.garagetotalsqft = parcel.garagetotalsqft;
            vwRecord.hashottuborspa = parcel.hashottuborspa;
            vwRecord.heatingorsystemtypeid = parcel.heatingorsystemtypeid;
            vwRecord.latitude = parcel.latitude;
            vwRecord.longitude = parcel.longitude;
            vwRecord.lotsizesquarefeet = parcel.lotsizesquarefeet;
            vwRecord.poolcnt = parcel.poolcnt;
            vwRecord.poolsizesum = parcel.poolsizesum;
            vwRecord.pooltypeid10 = parcel.pooltypeid10;
            vwRecord.pooltypeid2 = parcel.pooltypeid2;
            vwRecord.pooltypeid7 = parcel.pooltypeid7;
            vwRecord.propertycountylandusecode = parcel.propertycountylandusecode;
            vwRecord.propertylandusetypeid = parcel.propertylandusetypeid;
            vwRecord.propertyzoningdesc = parcel.propertyzoningdesc;
            vwRecord.rawcensustractandblock = parcel.rawcensustractandblock;
            vwRecord.regionidcity = parcel.regionidcity;
            vwRecord.regionidcounty = parcel.regionidcounty;
            vwRecord.regionidneighborhood = parcel.regionidneighborhood;
            vwRecord.regionidzip = parcel.regionidzip;
            vwRecord.roomcnt = parcel.roomcnt;
            vwRecord.storytypeid = parcel.storytypeid;
            vwRecord.threequarterbathnbr = parcel.threequarterbathnbr;
            vwRecord.typeconstructiontypeid = parcel.typeconstructiontypeid;
            vwRecord.unitcnt = parcel.unitcnt;
            vwRecord.yardbuildingsqft17 = parcel.yardbuildingsqft17;
            vwRecord.yardbuildingsqft26 = parcel.yardbuildingsqft26;
            vwRecord.yearbuilt = parcel.yearbuilt;
            vwRecord.numberofstories = parcel.numberofstories;
            vwRecord.fireplaceflag = parcel.fireplaceflag;
            vwRecord.structuretaxvaluedollarcnt = parcel.structuretaxvaluedollarcnt;
            vwRecord.taxvaluedollarcnt = parcel.taxvaluedollarcnt;
            vwRecord.assessmentyear = parcel.assessmentyear;
            vwRecord.landtaxvaluedollarcnt = parcel.landtaxvaluedollarcnt;
            vwRecord.taxamount = parcel.taxamount;
            vwRecord.taxdelinquencyflag = parcel.taxdelinquencyflag;
            vwRecord.taxdelinquencyyear = parcel.taxdelinquencyyear;
            vwRecord.censustractandblock = parcel.censustractandblock;
            return vwRecord;
        }
        public void Train(Parcel parcel, float label)
        {
            VWRecord vwRecord = GetVwRecord(parcel);
            SimpleLabel simpleLabel = new SimpleLabel() { Label = label };
            // Uncomment the next two lines to see the VW serialized input records:
            //var str = vw.Serializer.Create(vw.Native).SerializeToString(vwRecord, simpleLabel);
            //Console.WriteLine(str);
            vw.Learn(vwRecord, simpleLabel);
        }

        public float Predict(Parcel parcel)
        {
            VWRecord vwRecord = GetVwRecord(parcel);
            return vw.Predict(vwRecord, VowpalWabbitPredictionType.Scalar);
        }

        public void SaveModel()
        {
            vw.Native.SaveModel();
        }
    }
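Why quantile as a first loss function? The competition is scored on mean absolute error of the log error, and quantile (pinball) loss with tau = 0.5 is exactly half the absolute error, so minimizing one minimizes the other. VW's --quantile_tau option defaults to 0.5. The sketch below only illustrates the formula; QuantileLoss is a hypothetical helper, not VW's internal implementation.

```csharp
using System;

// Hypothetical helper for illustration; VW computes its loss internally.
public static class QuantileLoss
{
    // Pinball (quantile) loss for one example.
    // Under-predictions are weighted by tau, over-predictions by (1 - tau).
    public static double Pinball(double label, double prediction, double tau)
    {
        double diff = label - prediction;
        return diff >= 0 ? tau * diff : (tau - 1.0) * diff;
    }
}
```

With tau = 0.5 the loss is symmetric: missing by 1.0 in either direction costs 0.5. With tau = 0.9, predicting too low costs nine times more than predicting too high.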

Notice the GetVwRecord function, which maps the Parcel class to the VWRecord class. The VWRecord class (with its annotations) is needed to call Predict and Learn on the VW engine instance.
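A hand-written mapping with this many lines is easy to get out of sync with the class it fills. As an alternative, here is a minimal sketch that copies same-named properties using reflection. It assumes both classes expose public properties with matching names and types (if your Parcel class uses public fields, you would use GetFields instead). PropertyCopier, DemoParcel, and DemoRecord are hypothetical names used only for this example.

```csharp
using System;
using System.Reflection;

// Hypothetical stand-ins for Parcel and VWRecord, kept tiny for the example.
public class DemoParcel
{
    public int parcelid { get; set; }
    public float bathroomcnt { get; set; }
    public string propertyzoningdesc { get; set; }
}

public class DemoRecord
{
    public float bathroomcnt { get; set; }
    public string propertyzoningdesc { get; set; }
    public float yearbuilt { get; set; }
}

public static class PropertyCopier
{
    // Copies every readable source property onto the writable target
    // property with the same name, when the types are compatible.
    public static TTarget CopyMatching<TSource, TTarget>(TSource source)
        where TTarget : new()
    {
        var target = new TTarget();
        foreach (PropertyInfo targetProp in typeof(TTarget).GetProperties())
        {
            PropertyInfo sourceProp = typeof(TSource).GetProperty(targetProp.Name);
            if (sourceProp == null || !sourceProp.CanRead || !targetProp.CanWrite)
                continue; // no matching source property: leave the default value
            if (!targetProp.PropertyType.IsAssignableFrom(sourceProp.PropertyType))
                continue; // incompatible types: skip rather than throw
            targetProp.SetValue(target, sourceProp.GetValue(source));
        }
        return target;
    }
}
```

Reflection is slower than direct assignment, so for a large data set you may still prefer the explicit mapping and use something like this only as a check that nothing was missed.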

Please refer back to Part 2 of this article series and insert a vwWrapper.Train call in the transactionList training loop like this:

    Parcel parcel = null;
    if (parcelMap.TryGetValue(transactionTrain.parcelid, out parcel))
    {
        vwWrapper.Train(parcel, transactionTrain.logerror);
    }
    else
    {
        Console.WriteLine("ERROR: TRAIN: Failed to find parcelMap item for parcel id: " + transactionTrain.parcelid);
    }

You’ll also want to insert a vwWrapper.Predict call in the predictionList loop like this:

    Parcel parcel = null;
    if (parcelMap.TryGetValue(prediction.parcelid, out parcel))
    {
        float predictedValue = vwWrapper.Predict(parcel);
        prediction.LogErr201610 = predictedValue;
        prediction.LogErr201611 = predictedValue;
        prediction.LogErr201612 = predictedValue;
        prediction.LogErr201710 = predictedValue;
        prediction.LogErr201711 = predictedValue;
        prediction.LogErr201712 = predictedValue;
    }
    else
    {
        Console.WriteLine("ERROR: TEST: Failed to find parcelMap item for parcel id: " + prediction.parcelid);
    }

At this point, you should be able to use the code from Parts 1, 2, and 3 of this series to train a model and produce a complete submission for the $1.2M Zillow’s Zestimate Kaggle competition using Vowpal Wabbit and C#.

The total run time of this solution is under 10 minutes on a very modest Windows laptop.

Kaggle: Zillow’s Zestimate competition: C# Classes for reading source data files – Part 2

In my previous Kaggle: Zillow’s Zestimate competition article (Part 1), we loaded the Parcel data source file into a Dictionary map.

Now we will process the Transaction and Prediction data files. We’ll start by making classes to hold the rows in Transaction and Prediction data files.

    public class Prediction
    {
        // ParcelId,201610,201611,201612,201710,201711,201712
        public int parcelid;
        public float LogErr201610;
        public float LogErr201611;
        public float LogErr201612;
        public float LogErr201710;
        public float LogErr201711;
        public float LogErr201712;
    }
    public class Transaction
    {
        // parcelid,logerror,transactiondate
        public int parcelid;
        public float logerror;
        public DateTime transactiondate;
    }

Initially, I tried to put these two files in a Dictionary map, but there were duplicate parcel ID values in both the training and prediction data sets. We’ll be iterating through them, but we don’t really need a parcel ID lookup for them (yet). Because of that, we’ll load these two data sources into a simple List.

    string train_2016 = @"C:\kaggle\zillow\train_2016.csv";
    string sample_submission = @"C:\kaggle\zillow\sample_submission.csv";
    List<Prediction> predictionList = dataSource.GetPredictionList(sample_submission);
    List<Transaction> transactionList = dataSource.GetTransactionList(train_2016);
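If you want to see the duplicate problem for yourself, note that Dictionary.Add throws an ArgumentException when the key already exists. A quick way to list the repeated IDs is to group them, as in this small sketch. DuplicateIds is a hypothetical helper name, and the IDs in the usage note are made up.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration.
public static class DuplicateIds
{
    // Returns the IDs that occur more than once, in first-seen order.
    public static List<int> FindDuplicates(IEnumerable<int> ids)
    {
        return ids.GroupBy(id => id)
                  .Where(g => g.Count() > 1)
                  .Select(g => g.Key)
                  .ToList();
    }
}
```

For the input 1, 2, 1, 3, 2, 1 this returns the list 1, 2. With the real data you would pass something like transactionList.Select(t => t.parcelid).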

In GetPredictionList and GetTransactionList, open a StreamReader on each file and loop through the lines like this:

    List<Transaction> output = new List<Transaction>();
    while (!fileReader.EndOfStream)
    {
        try
        {
            // Process one row: parcelid,logerror,transactiondate
            string line = fileReader.ReadLine();
            string[] fields = line.Split(',');
            Transaction row = new Transaction();
            int.TryParse(fields[0], out row.parcelid);
            float.TryParse(fields[1], out row.logerror);
            DateTime.TryParse(fields[2], out row.transactiondate);
            output.Add(row);
        }
        catch (Exception ex)
        {
            // Assumed handling: report the bad row and keep going
            Console.WriteLine("ERROR: Failed to parse row: " + ex.Message);
        }
    }
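Two things are worth watching in a loop like this. The first line of the file is a header, which TryParse silently turns into a row of zeros, and the two-argument TryParse overloads use the machine's current culture, so a decimal like 0.0276 can be misread on a system that uses a comma as the decimal separator. Here is a minimal sketch of a stricter line parser. TransactionLineParser is a hypothetical helper, and the sample row in the usage note is made up to match the parcelid,logerror,transactiondate layout.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration.
public static class TransactionLineParser
{
    // Returns false for blank lines, the header, or any row that does not parse cleanly.
    public static bool TryParse(string line, out int parcelId, out float logError, out DateTime date)
    {
        parcelId = 0;
        logError = 0f;
        date = default(DateTime);

        if (string.IsNullOrWhiteSpace(line)) return false;
        string[] fields = line.Split(',');
        if (fields.Length < 3) return false;

        // Invariant culture: '.' is always the decimal separator.
        return int.TryParse(fields[0], NumberStyles.Integer, CultureInfo.InvariantCulture, out parcelId)
            && float.TryParse(fields[1], NumberStyles.Float, CultureInfo.InvariantCulture, out logError)
            && DateTime.TryParse(fields[2], CultureInfo.InvariantCulture, DateTimeStyles.None, out date);
    }
}
```

The header line "parcelid,logerror,transactiondate" is rejected because "parcelid" is not an integer, while a row such as "11016594,0.0276,2016-01-01" parses into its three values.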

Parsing the sample_submission.csv file is not very interesting since the example log error values are all 0. We need to process this list to get a list of prediction parcel IDs that we need. The test set parcel IDs are not provided anywhere else, so we obtain them from the sample_submission.csv file.

                    Prediction row = new Prediction();
                    int.TryParse(fields[0], out row.parcelid);
                    float.TryParse(fields[1], out row.LogErr201610);
                    float.TryParse(fields[2], out row.LogErr201611);
                    float.TryParse(fields[3], out row.LogErr201612);
                    float.TryParse(fields[4], out row.LogErr201710);
                    float.TryParse(fields[5], out row.LogErr201711);
                    float.TryParse(fields[6], out row.LogErr201712);

Next, we’ll iterate through our training set and look up the parcel properties associated with each training transaction:

    foreach (var transactionTrain in transactionList)
    {
        Parcel parcel = null;
        if (parcelMap.TryGetValue(transactionTrain.parcelid, out parcel))
        {
            // Train record here
        }
        else
        {
            Console.WriteLine("ERROR: TRAIN: Failed to find parcelMap item for parcel id: " + transactionTrain.parcelid);
        }
    }

Lastly, we’ll loop through the prediction list and pick out associated parcel properties. We’ll call a (not yet implemented) prediction function, and assign that prediction value to all of the Log Error fields.

    foreach (var prediction in predictionList)
    {
        Parcel parcel = null;
        if (parcelMap.TryGetValue(prediction.parcelid, out parcel))
        {
            float predictedValue = vwWrapper.Predict(parcel);
            prediction.LogErr201610 = predictedValue;
            prediction.LogErr201611 = predictedValue;
            prediction.LogErr201612 = predictedValue;
            prediction.LogErr201710 = predictedValue;
            prediction.LogErr201711 = predictedValue;
            prediction.LogErr201712 = predictedValue;
        }
        else
        {
            Console.WriteLine("ERROR: TEST: Failed to find parcelMap item for parcel id: " + prediction.parcelid);
        }
    }

Clearly, we’ll want to enhance this to have better prediction values per actual prediction date range field. This sets us up with a nice top level initial framework to load our data, iterate training data, iterate test data, and have a prediction result. This prediction result will make a valid submission for the Kaggle competition.

Finally, we’ll create a new output file that contains our submission data that we can upload directly to the Kaggle competition submission form:

    // Now, write out our predictionList to a prediction output file
    string predictionFileName = @"C:\kaggle\zillow\OutputPrediction.txt";
    using (System.IO.StreamWriter file = new System.IO.StreamWriter(predictionFileName))
    {
        file.WriteLine("ParcelId,201610,201611,201612,201710,201711,201712");
        StringBuilder sbOrderLine = new StringBuilder();
        foreach (var prediction in predictionList)
        {
            sbOrderLine.Clear();
            sbOrderLine.Append(prediction.parcelid);
            sbOrderLine.Append(",");
            sbOrderLine.Append(prediction.LogErr201610);
            sbOrderLine.Append(",");
            sbOrderLine.Append(prediction.LogErr201611);
            sbOrderLine.Append(",");
            sbOrderLine.Append(prediction.LogErr201612);
            sbOrderLine.Append(",");
            sbOrderLine.Append(prediction.LogErr201710);
            sbOrderLine.Append(",");
            sbOrderLine.Append(prediction.LogErr201711);
            sbOrderLine.Append(",");
            sbOrderLine.Append(prediction.LogErr201712);
            file.WriteLine(sbOrderLine.ToString());
        }
    }
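One caveat: StringBuilder.Append(float) formats the number with the machine's current culture, so on a system that uses a comma as the decimal separator the output would contain extra commas and no longer be valid CSV. Here is a minimal sketch of a culture-safe line formatter. SubmissionLine is a hypothetical helper, and the values in the usage note are made up.

```csharp
using System.Globalization;

// Hypothetical helper for illustration.
public static class SubmissionLine
{
    // Formats "parcelId,v1,v2,..." using '.' as the decimal separator on every machine.
    public static string Format(int parcelId, params float[] values)
    {
        var parts = new string[values.Length + 1];
        parts[0] = parcelId.ToString(CultureInfo.InvariantCulture);
        for (int i = 0; i < values.Length; i++)
        {
            parts[i + 1] = values[i].ToString(CultureInfo.InvariantCulture);
        }
        return string.Join(",", parts);
    }
}
```

For example, Format(10754147, 0.0123f, 0f, -0.5f) returns "10754147,0.0123,0,-0.5". In the loop above you would pass the six LogErr fields of each prediction and write the result with file.WriteLine.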

More to come with feature modeling, Vowpal Wabbit implementation, and non-trivial prediction results.