Kaggle: Zillow’s Zestimate competition: C# Classes for reading source data files – Part2

In my previous Kaggle: Zillow’s Zestimate competition article (Part1), we loaded the Parcel data source file into a Dictionary keyed by parcel ID.

Now we will process the Transaction and Prediction data files. We’ll start by making classes to hold the rows in Transaction and Prediction data files.

    public class Prediction
    {
        // ParcelId,201610,201611,201612,201710,201711,201712
        public int parcelid;
        public float LogErr201610;
        public float LogErr201611;
        public float LogErr201612;
        public float LogErr201710;
        public float LogErr201711;
        public float LogErr201712;
    }

    public class Transaction
    {
        // parcelid,logerror,transactiondate
        public int parcelid;
        public float logerror;
        public DateTime transactiondate;
    }

Initially, I tried to put these two files into a Dictionary keyed by parcel ID, but both the training and prediction data sets contain duplicate parcel ID values. We’ll be iterating through them, and we don’t really need a parcel ID lookup for them (yet), so we’ll load these two data sources into a simple List.

string train_2016 = @"C:\kaggle\zillow\train_2016.csv";
string sample_submission = @"C:\kaggle\zillow\sample_submission.csv";
List<Prediction> predictionList = dataSource.GetPredictionList(sample_submission);
List<Transaction> transactionList = dataSource.GetTransactionList(train_2016);
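
To illustrate why a plain Dictionary does not fit here: Dictionary.Add throws an ArgumentException when the key already exists, so repeated parcel IDs cannot be added as-is. If a parcel ID lookup is needed later, one option is to group the rows per ID. The following is only a minimal sketch of that idea, not the code used in this article; TransactionIndex and GroupByParcel are hypothetical names, and the Transaction class is repeated so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;

public class Transaction
{
    public int parcelid;
    public float logerror;
    public DateTime transactiondate;
}

// Hypothetical helper, not part of the original project.
public static class TransactionIndex
{
    // Groups transactions by parcel ID so that repeated IDs are kept rather than rejected.
    public static Dictionary<int, List<Transaction>> GroupByParcel(IEnumerable<Transaction> transactions)
    {
        var groups = new Dictionary<int, List<Transaction>>();
        foreach (var transaction in transactions)
        {
            List<Transaction> bucket;
            if (!groups.TryGetValue(transaction.parcelid, out bucket))
            {
                // First time this parcel ID is seen: start a new bucket for it.
                bucket = new List<Transaction>();
                groups.Add(transaction.parcelid, bucket);
            }
            bucket.Add(transaction);
        }
        return groups;
    }
}
```

With this shape, a parcel that was sold twice simply ends up with two entries in its bucket, while the flat List stays the simplest choice when we only iterate.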

In GetPredictionList and GetTransactionList, we open a StreamReader on each file and loop through the lines like this:

List<Transaction> output = new List<Transaction>();
while (!fileReader.EndOfStream)
{
    // Processing row
    string line = fileReader.ReadLine();
    string[] fields = line.Split(',');
    Transaction row = new Transaction();
    int.TryParse(fields[0], out row.parcelid);
    float.TryParse(fields[1], out row.logerror);
    DateTime.TryParse(fields[2], out row.transactiondate);
    output.Add(row);
}
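
For context, here is one way the loop above might sit inside a complete method. This is a sketch under a few assumptions of mine rather than the exact code from my project: it reads from a TextReader (so it works with a StreamReader or, for testing, a StringReader), it assumes the first line is a column header, it skips rows that fail to parse, and it parses with the invariant culture so a machine's regional settings cannot change how the decimal point is read. TransactionReader and ReadTransactions are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public class Transaction
{
    public int parcelid;
    public float logerror;
    public DateTime transactiondate;
}

// Hypothetical helper, not part of the original project.
public static class TransactionReader
{
    // Reads rows of the form "parcelid,logerror,transactiondate" from any TextReader.
    public static List<Transaction> ReadTransactions(TextReader reader)
    {
        var output = new List<Transaction>();
        reader.ReadLine(); // skip the column header row (assumed to be present)
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(',');
            if (fields.Length < 3)
            {
                continue; // blank or truncated line
            }
            var row = new Transaction();
            bool parsed =
                int.TryParse(fields[0], NumberStyles.Integer, CultureInfo.InvariantCulture, out row.parcelid) &&
                float.TryParse(fields[1], NumberStyles.Float, CultureInfo.InvariantCulture, out row.logerror) &&
                DateTime.TryParse(fields[2], CultureInfo.InvariantCulture, DateTimeStyles.None, out row.transactiondate);
            if (parsed)
            {
                output.Add(row); // keep only rows where every field parsed
            }
        }
        return output;
    }
}
```

It could be called with something like ReadTransactions(new StreamReader(train_2016)), ideally inside a using block so the file handle is released.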

Parsing the sample_submission.csv file is not very interesting since the example log error values are all 0. We still need to process it, though, because the test set parcel IDs are not provided anywhere else: this file is where we get the list of parcel IDs we have to predict.

    Prediction row = new Prediction();
    int.TryParse(fields[0], out row.parcelid);
    float.TryParse(fields[1], out row.LogErr201610);
    float.TryParse(fields[2], out row.LogErr201611);
    float.TryParse(fields[3], out row.LogErr201612);
    float.TryParse(fields[4], out row.LogErr201710);
    float.TryParse(fields[5], out row.LogErr201711);
    float.TryParse(fields[6], out row.LogErr201712);

Next, we’ll iterate through our training set and look up the parcel properties that go with each training transaction:

foreach (var transactionTrain in transactionList)
{
    Parcel parcel = null;
    if (!parcelMap.TryGetValue(transactionTrain.parcelid, out parcel))
    {
        Console.WriteLine("ERROR: TRAIN: Failed to find parcelMap item for parcel id: " + transactionTrain.parcelid);
        continue;
    }
    // Train record here
}

Lastly, we’ll loop through the prediction list and pick out associated parcel properties. We’ll call a (not yet implemented) prediction function, and assign that prediction value to all of the Log Error fields.

foreach (var prediction in predictionList)
{
    Parcel parcel = null;
    if (!parcelMap.TryGetValue(prediction.parcelid, out parcel))
    {
        Console.WriteLine("ERROR: TEST: Failed to find parcelMap item for parcel id: " + prediction.parcelid);
        continue;
    }
    float predictedValue = vwWrapper.Predict(parcel);
    prediction.LogErr201610 = predictedValue;
    prediction.LogErr201611 = predictedValue;
    prediction.LogErr201612 = predictedValue;
    prediction.LogErr201710 = predictedValue;
    prediction.LogErr201711 = predictedValue;
    prediction.LogErr201712 = predictedValue;
}

Clearly, we’ll want to enhance this so that each prediction date field gets its own value instead of one shared number. Still, this gives us a nice top-level framework to load our data, iterate the training data, iterate the test data, and produce a prediction result. That result is already enough to make a valid submission for the Kaggle competition.

Finally, we’ll create a new output file that contains our submission data that we can upload directly to the Kaggle competition submission form:

// Now, write out our predictionList to a prediction output file
string predictionFileName = @"C:\kaggle\zillow\OutputPrediction.txt";
using (System.IO.StreamWriter file = new System.IO.StreamWriter(predictionFileName))
{
    StringBuilder sbOrderLine = new StringBuilder();
    foreach (var prediction in predictionList)
    {
        // Append parcelid and the six log error values to sbOrderLine, then write the line to the file
    }
}
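
The snippet above only shows the start of the write-out loop. As a rough sketch of how the body could look: write the header row first (taken from the column comment in the Prediction class), then one comma-separated line per prediction. The SubmissionWriter class, its Write method, and the "0.####" number format (four decimal places) are my own assumptions for illustration, so adjust them to whatever the submission rules require. It writes to a TextWriter, so it works with the StreamWriter above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public class Prediction
{
    public int parcelid;
    public float LogErr201610;
    public float LogErr201611;
    public float LogErr201612;
    public float LogErr201710;
    public float LogErr201711;
    public float LogErr201712;
}

// Hypothetical helper, not part of the original project.
public static class SubmissionWriter
{
    // Writes the header row followed by one comma-separated line per prediction.
    public static void Write(TextWriter writer, IEnumerable<Prediction> predictions)
    {
        writer.WriteLine("ParcelId,201610,201611,201612,201710,201711,201712");
        foreach (var p in predictions)
        {
            writer.WriteLine(string.Join(",",
                p.parcelid.ToString(CultureInfo.InvariantCulture),
                Format(p.LogErr201610),
                Format(p.LogErr201611),
                Format(p.LogErr201612),
                Format(p.LogErr201710),
                Format(p.LogErr201711),
                Format(p.LogErr201712)));
        }
    }

    // Invariant culture keeps '.' as the decimal separator regardless of regional settings.
    private static string Format(float value)
    {
        return value.ToString("0.####", CultureInfo.InvariantCulture);
    }
}
```

Inside the using block this would be called as SubmissionWriter.Write(file, predictionList).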

More to come with feature modeling, Vowpal Wabbit implementation, and non-trivial prediction results.

Kaggle: Zillow’s Zestimate competition: C# Classes for reading source data files – Part1

Kaggle: Zillow’s Zestimate competition just launched this week, and there is $1.2M in prize money.

I wanted to quickly get out some C# classes for reading the source data files. Hope this helps you get off the ground faster for this competition.

For the properties_2016.csv file, I am reading records into this Parcel class:

public class Parcel
{
    // parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,finishedfloor1squarefeet,calculatedfinishedsquarefeet,finishedsquarefeet12,finishedsquarefeet13,finishedsquarefeet15,finishedsquarefeet50,finishedsquarefeet6,fips,fireplacecnt,fullbathcnt,garagecarcnt,garagetotalsqft,hashottuborspa,heatingorsystemtypeid,latitude,longitude,lotsizesquarefeet,poolcnt,poolsizesum,pooltypeid10,pooltypeid2,pooltypeid7,propertycountylandusecode,propertylandusetypeid,propertyzoningdesc,rawcensustractandblock,regionidcity,regionidcounty,regionidneighborhood,regionidzip,roomcnt,storytypeid,threequarterbathnbr,typeconstructiontypeid,unitcnt,yardbuildingsqft17,yardbuildingsqft26,yearbuilt,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
    public int parcelid;
    public float airconditioningtypeid;
    public float architecturalstyletypeid;
    public float basementsqft;
    public float bathroomcnt;
    public float bedroomcnt;
    public float buildingclasstypeid;
    public float buildingqualitytypeid;
    public float calculatedbathnbr;
    public float decktypeid;
    public float finishedfloor1squarefeet;
    public float calculatedfinishedsquarefeet;
    public float finishedsquarefeet12;
    public float finishedsquarefeet13;
    public float finishedsquarefeet15;
    public float finishedsquarefeet50;
    public float finishedsquarefeet6;
    public float fips;
    public float fireplacecnt;
    public float fullbathcnt;
    public float garagecarcnt;
    public float garagetotalsqft;
    public float hashottuborspa;
    public float heatingorsystemtypeid;
    public float latitude;
    public float longitude;
    public float lotsizesquarefeet;
    public float poolcnt;
    public float poolsizesum;
    public float pooltypeid10;
    public float pooltypeid2;
    public float pooltypeid7;
    public string propertycountylandusecode;
    public float propertylandusetypeid;
    public string propertyzoningdesc;
    public float rawcensustractandblock;
    public float regionidcity;
    public float regionidcounty;
    public float regionidneighborhood;
    public float regionidzip;
    public float roomcnt;
    public float storytypeid;
    public float threequarterbathnbr;
    public float typeconstructiontypeid;
    public float unitcnt;
    public float yardbuildingsqft17;
    public float yardbuildingsqft26;
    public float yearbuilt;
    public float numberofstories;
    public float fireplaceflag;
    public float structuretaxvaluedollarcnt;
    public float taxvaluedollarcnt;
    public float assessmentyear;
    public float landtaxvaluedollarcnt;
    public float taxamount;
    public string taxdelinquencyflag;
    public float taxdelinquencyyear;
    public float censustractandblock;
}

As you can see, I left most of the fields as float type. There are only a few strings that I’ve seen so far, and I left the parcel ID as an integer.
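
One thing to keep in mind with this choice: a float only holds roughly 7 significant decimal digits, so ID-like columns with long values may not round-trip exactly. The small check below is a sketch to show the effect, not part of my loader. PrecisionCheck is a hypothetical helper, and the 14-digit sample value in the usage note is made up for illustration rather than taken from the file.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the original project.
public static class PrecisionCheck
{
    // True when parsing a whole-number string as float and converting back gives the same number.
    public static bool SurvivesFloat(string text)
    {
        float asFloat = float.Parse(text, CultureInfo.InvariantCulture);
        long original = long.Parse(text, CultureInfo.InvariantCulture);
        return (long)asFloat == original;
    }

    // Same check using double, which holds roughly 15 to 16 significant digits.
    public static bool SurvivesDouble(string text)
    {
        double asDouble = double.Parse(text, CultureInfo.InvariantCulture);
        long original = long.Parse(text, CultureInfo.InvariantCulture);
        return (long)asDouble == original;
    }
}
```

For example, SurvivesFloat("1066") holds, while SurvivesFloat("60371066461001") does not, even though SurvivesDouble accepts the same value. If exact IDs matter for a column, double, long, or string may be a safer field type than float.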

This file is about 600 MB, so I’m going to load it all into a dictionary with parcel ID as the key.

Dictionary<int, Parcel> parcelMap = new Dictionary<int, Parcel>();

Lastly, here’s my code to open the parcel source data file and populate the fields for each record:

public Dictionary<int, Parcel> GetParcelMap(string sourceReport)
{
    Dictionary<int, Parcel> output = new Dictionary<int, Parcel>();
    // Line counter, used only for the error message below
    int sourceIndex = 0;
    try
    {
        using (var fileReader = new System.IO.StreamReader(sourceReport))
        {
            // Burn column headers
            fileReader.ReadLine();
            sourceIndex++;
            while (!fileReader.EndOfStream)
            {
                // Processing row
                string line = fileReader.ReadLine();
                sourceIndex++;
                string[] fields = line.Split(',');
                Parcel row = new Parcel();
                int.TryParse(fields[0], out row.parcelid);
                float.TryParse(fields[1], out row.airconditioningtypeid);
                float.TryParse(fields[2], out row.architecturalstyletypeid);
                float.TryParse(fields[3], out row.basementsqft);
                float.TryParse(fields[4], out row.bathroomcnt);
                float.TryParse(fields[5], out row.bedroomcnt);
                float.TryParse(fields[6], out row.buildingclasstypeid);
                float.TryParse(fields[7], out row.buildingqualitytypeid);
                float.TryParse(fields[8], out row.calculatedbathnbr);
                float.TryParse(fields[9], out row.decktypeid);
                float.TryParse(fields[10], out row.finishedfloor1squarefeet);
                float.TryParse(fields[11], out row.calculatedfinishedsquarefeet);
                float.TryParse(fields[12], out row.finishedsquarefeet12);
                float.TryParse(fields[13], out row.finishedsquarefeet13);
                float.TryParse(fields[14], out row.finishedsquarefeet15);
                float.TryParse(fields[15], out row.finishedsquarefeet50);
                float.TryParse(fields[16], out row.finishedsquarefeet6);
                float.TryParse(fields[17], out row.fips);
                float.TryParse(fields[18], out row.fireplacecnt);
                float.TryParse(fields[19], out row.fullbathcnt);
                float.TryParse(fields[20], out row.garagecarcnt);
                float.TryParse(fields[21], out row.garagetotalsqft);
                float.TryParse(fields[22], out row.hashottuborspa);
                float.TryParse(fields[23], out row.heatingorsystemtypeid);
                float.TryParse(fields[24], out row.latitude);
                float.TryParse(fields[25], out row.longitude);
                float.TryParse(fields[26], out row.lotsizesquarefeet);
                float.TryParse(fields[27], out row.poolcnt);
                float.TryParse(fields[28], out row.poolsizesum);
                float.TryParse(fields[29], out row.pooltypeid10);
                float.TryParse(fields[30], out row.pooltypeid2);
                float.TryParse(fields[31], out row.pooltypeid7);
                row.propertycountylandusecode = fields[32];
                float.TryParse(fields[33], out row.propertylandusetypeid);
                row.propertyzoningdesc = fields[34];
                float.TryParse(fields[35], out row.rawcensustractandblock);
                float.TryParse(fields[36], out row.regionidcity);
                float.TryParse(fields[37], out row.regionidcounty);
                float.TryParse(fields[38], out row.regionidneighborhood);
                float.TryParse(fields[39], out row.regionidzip);
                float.TryParse(fields[40], out row.roomcnt);
                float.TryParse(fields[41], out row.storytypeid);
                float.TryParse(fields[42], out row.threequarterbathnbr);
                float.TryParse(fields[43], out row.typeconstructiontypeid);
                float.TryParse(fields[44], out row.unitcnt);
                float.TryParse(fields[45], out row.yardbuildingsqft17);
                float.TryParse(fields[46], out row.yardbuildingsqft26);
                float.TryParse(fields[47], out row.yearbuilt);
                float.TryParse(fields[48], out row.numberofstories);
                float.TryParse(fields[49], out row.fireplaceflag);
                float.TryParse(fields[50], out row.structuretaxvaluedollarcnt);
                float.TryParse(fields[51], out row.taxvaluedollarcnt);
                float.TryParse(fields[52], out row.assessmentyear);
                float.TryParse(fields[53], out row.landtaxvaluedollarcnt);
                float.TryParse(fields[54], out row.taxamount);
                row.taxdelinquencyflag = fields[55];
                float.TryParse(fields[56], out row.taxdelinquencyyear);
                float.TryParse(fields[57], out row.censustractandblock);
                output.Add(row.parcelid, row);
            }
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("ERROR: GetParcelMap failed for line: " + sourceIndex + " with exception: " + e.Message + " Stack: " + e.StackTrace);
    }
    return output;
}
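
A side effect of calling float.TryParse this way is that a blank or non-numeric field silently becomes 0, because TryParse returns false and sets the out value to zero. That makes a missing value indistinguishable from a real zero. If that distinction matters for your features, one possible variation (my own sketch, not what the loader above does) is a small helper that returns a nullable float; FieldParser and ParseNullableFloat are hypothetical names.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the original project.
public static class FieldParser
{
    // Returns null for blank or unparseable text instead of silently yielding 0.
    public static float? ParseNullableFloat(string field)
    {
        float value;
        if (float.TryParse(field, NumberStyles.Float, CultureInfo.InvariantCulture, out value))
        {
            return value;
        }
        return null;
    }
}
```

Using it would mean declaring the relevant Parcel fields as float? and assigning, for example, row.basementsqft = FieldParser.ParseNullableFloat(fields[3]).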

I don’t have a prediction result yet, and I’m working on that next. I’ll be sure to post my solution when it becomes available.