Data Cleaning Process - Yillow

Hold on - Why does this page look different?

I have a confession to make.

Originally, Yillow wasn't meant to be a project where I scraped tons of data, made an amazing model, and then did tons of cool data science stuff. A while ago, I wanted to do a project where I used convolutional neural networks to extract features from pictures of houses and build an ML model on top of them - that's all. No webscraping, no Yillow maps, just a simple ConvNet project.

What you see on this page was my very first time coding in HTML, before I even made my personal website. I was just trying to make a simple page to show off how extracting features from pictures of houses could improve ML performance - that's it.

But a couple of months later, I wanted Yillow to be more than just this, so I did the rest of the project and made the rest of the website. This part of the project never made it into the final product, because I didn't figure out how to scrape images for all the houses and I don't think my laptop could have processed all that image data anyway. I still wanted to show it off, because engineering features from images is a really, really cool idea I explored once.

So here it is - the original Yillow page. Marvel at the simplicity of it all, and if you want to move past this and go onto the Yestimate construction, click the button below.

Next Page: Yestimate Construction

Yillow: A Parody of Zillow

What if you could estimate a house's value from some pictures of it?

A Frontal Picture

A Kitchen Picture

A Bedroom Picture

A Bathroom Picture

This is the concept I set out to explore in this project, so read on if you want to learn more about it!

Structured Data Benchmark

Hold on! Before we dive into the deep end and talk about how
pictures can be used to estimate house value, let's first go back
a little. As a benchmark for this project, let's first look at how
well we can estimate house value using only structured data. This
is a useful starting point for a couple of reasons. First, machine
learning models have historically performed very well on
structured data, which is data that can be organized into tables
and columns. Second, it is well-known that the largest factor in
a house's value is its location, which can be represented by the
structured data of the house's zip code. So, an ML model's performance
on structured data alone is a good baseline for judging how much
unstructured data, like pictures, adds on top of it.

Structured data is quantitative and can usually be arranged in tables and columns. The four structured data values I used are zip code, square footage, number of bedrooms, and number of bathrooms. Machine learning models perform extraordinarily well on this kind of data, so this will serve as a good benchmark for the rest of the project.

The Performance of Machine Learning on Only Structured Data

Training Root Mean Squared Error (RMSE)

Validation Root Mean Squared Error (RMSE)

As shown by our training and validation performance, machine learning on our small dataset of about 300 houses is not very accurate, with a best median validation RMSE of about $200,000. This is not surprising given the small size of our dataset. However, it is still interesting to see how well a simple model can perform on only structured data.
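
For readers who want to see how a number like the median validation RMSE above is put together, here is a minimal sketch. The prices, predictions, and the idea of three separate runs are all made up for illustration; this is not the code or data behind the plots.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of prices."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (actual prices, predicted prices) pairs from three validation runs.
runs = [
    ([300_000, 500_000], [400_000, 400_000]),
    ([250_000, 750_000], [450_000, 550_000]),
    ([600_000, 900_000], [300_000, 1_200_000]),
]

# One RMSE per run, then the median across runs.
per_run_rmse = [rmse(actual, predicted) for actual, predicted in runs]
median_rmse = statistics.median(per_run_rmse)
print(per_run_rmse, median_rmse)
```

Because the errors are squared before averaging, a few badly mispriced houses can dominate the RMSE, which is one reason taking the median over several runs is a sensible way to summarize a small dataset.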

With this benchmark in mind, let's see how well we can do with pictures!

Convolutional Neural Network Performance

The first step in the image processing pipeline is to extract features from the house pictures. For this project, I chose to extract five features about the houses, and then use these five features together with the previous four features about the house (zip code, square footage, number of bedrooms, number of bathrooms) to predict the house's value.
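
As a rough sketch of what "use these five features together with the previous four" means in practice, each house ends up as one row of nine values. The field names, the function `build_feature_row`, and every number below are hypothetical; the five image features are stand-ins, not the actual ConvNet outputs.

```python
# Names of the four structured values described above (identifiers are illustrative).
STRUCTURED_FIELDS = ("zip_code", "square_feet", "bedrooms", "bathrooms")
N_IMAGE_FEATURES = 5


def build_feature_row(structured, image_features):
    """Join the 4 structured values and 5 image-derived values into one row."""
    if len(structured) != len(STRUCTURED_FIELDS):
        raise ValueError("expected 4 structured values")
    if len(image_features) != N_IMAGE_FEATURES:
        raise ValueError("expected 5 image-derived values")
    return list(structured) + list(image_features)


# A made-up house: zip code, square footage, bedrooms, bathrooms.
structured = [90210, 2400, 4, 3]
# Made-up values standing in for features extracted from the pictures.
image_features = [0.12, 0.80, 0.33, 0.05, 0.61]

row = build_feature_row(structured, image_features)
print(len(row))
```

The downstream model then sees one flat row per house and does not need to know which columns came from pictures.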

I use three convolutional networks and compare their performance. The first is the Amir algorithm from a Kaggle competition (credit at bottom of page), and the other two come from the Ahmed and Moustafa paper. All three were tweaked slightly to work with my filtered dataset.

The Performance of Machine Learning on Structured and Picture Data

Training Root Mean Squared Error (RMSE)

Validation Root Mean Squared Error (RMSE)

With the addition of pictures, our model's performance has improved noticeably. Our lowest median validation RMSE is now about $125,000, which is roughly 63% of the previous model's error, or a reduction of about 37%. This is a very promising result, and it suggests that pictures can be used to improve the accuracy of machine learning models.
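
As a quick arithmetic check on the two rounded RMSE figures quoted on this page (about $200,000 and about $125,000), the relative change can be computed directly. This is only the arithmetic on those rounded numbers, not a rerun of the models.

```python
baseline_rmse = 200_000      # structured data only (rounded figure from this page)
with_images_rmse = 125_000   # structured + picture features (rounded figure)

# Fraction of the baseline error that remains, and the fraction removed.
ratio = with_images_rmse / baseline_rmse
reduction = 1 - ratio

print(f"new RMSE is {ratio:.1%} of the baseline")
print(f"error reduced by {reduction:.1%}")
```

So the new error is about 63% of the old one, which corresponds to a reduction of a bit under 40%.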

Although I could continue to explore how to improve the performance of the machine learning aspect of this project, I learned through this project the importance of the data used in machine learning. After all, if you give an ML model bad data, you will get bad results. This project demonstrates that well: the dataset I used was very small, so no matter what ML techniques I implemented, the model would likely never perform as well as it could on a much larger dataset.
Because of this, I'm going to do more projects that focus on data science. Stay tuned for future projects where I explore techniques in data science!

References

To lean into the idea of a parody-project, this website is meant to look like the Zillow website, and I used much of their code as well as the main house image on their website to make that happen.
The dataset I am using comes from Ahmed and Moustafa in their 2016 paper, House Price Estimation from Visual and Textual Features, here. Some of the zip codes in the dataset contained only one house, so I kept only the zip codes (and their houses) that contain 25 or more houses.
One of the ML algorithms used, Amir, came from this Kaggle competition.
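
The zip code filter mentioned above (keep only zip codes with 25 or more houses) can be sketched like this. The function name `keep_common_zip_codes`, the `zip_code` field, and the sample houses are assumptions for illustration, not the original preprocessing code.

```python
from collections import Counter


def keep_common_zip_codes(houses, min_houses=25):
    """Keep only houses whose zip code appears at least `min_houses` times."""
    counts = Counter(house["zip_code"] for house in houses)
    return [house for house in houses if counts[house["zip_code"]] >= min_houses]


# Made-up data: 25 houses in one placeholder zip code, 1 house in another.
houses = [{"zip_code": "11111", "price": 400_000 + i} for i in range(25)]
houses.append({"zip_code": "22222", "price": 950_000})

filtered = keep_common_zip_codes(houses)
print(len(houses), len(filtered))
```

Dropping rare zip codes gives the model enough examples per location to learn from, at the cost of shrinking an already small dataset.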

Thank you for visiting my website!
Made by Brandon Bonifacio, check out my LinkedIn!