Harnessing Machine Learning - Yillow

Machine Learning Model Construction

The first step in machine learning is choosing which models to use. Because this is a regression problem, and because I wanted to cover a broad range of approaches, I chose nine popular, openly available regression models: 'RandomForestRegressor', 'AdaBoostRegressor', 'LGBMRegressor', 'XGBRegressor', 'Ridge', 'Lasso', 'ElasticNet', 'SVR', and 'GradientBoostingRegressor'. Between them they cover tree ensembles, boosting, regularized linear models, and a support vector machine, so each one learns from my dataset in a different way. All nine looked promising on a quick spot check, so I moved on to tuning them.

Hyperparameter Tuning via Optuna

Hyperparameter tuning is an extremely important part of the machine learning pipeline because each dataset has its own set of best model parameters, and these are nearly impossible to determine without experimental testing. In the past, I've used a variety of techniques to tune hyperparameters, from tuning them manually, to using RandomizedSearchCV, to (my personal favorite) using a genetic algorithm. However, I decided to use a tool for this project that I had never used before: Optuna. Optuna is an open-source hyperparameter optimization framework that searches for the best hyperparameters for a given model by optimizing a given objective function, which I chose to be a simple K-fold cross-validation score for the model under the candidate parameters. Not only that, but Optuna has a setting that allows for multi-threading, which made the search process much, much faster! I tried tuning some of the models myself to see if I could beat Optuna, and I just couldn't! I'll be sure to make good use of Optuna, or tuning packages like it, whenever I need to train a lot of ML models!
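To make the idea of a K-fold cross-validation objective concrete, here is a minimal sketch in plain Python. It is not the project's code: the one-feature ridge model, the toy data, and the helper names (kfold_indices, fit_ridge_1d, cv_objective) are all assumptions made for illustration, and a small list of candidate values stands in for Optuna's search. With Optuna, a function like cv_objective would instead receive a trial and ask it for a value (for example with trial.suggest_float), and the study would decide which values to try next.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge slope for a one-feature model with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def cv_objective(alpha, xs, ys, k=5):
    """Mean validation MAE over k folds for one hyperparameter value."""
    fold_maes = []
    for held_out in kfold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the other k-1 folds, score on the held-out fold.
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        errors = [abs(ys[i] - w * xs[i]) for i in held_out]
        fold_maes.append(sum(errors) / len(errors))
    return sum(fold_maes) / k

# Toy data (hypothetical): y is roughly 3 * x plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [3.0 * x + rng.gauss(0, 1) for x in xs]

# A fixed candidate list stands in for the optimizer's suggestions.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {alpha: cv_objective(alpha, xs, ys) for alpha in candidates}
best_alpha = min(scores, key=scores.get)
print(best_alpha, round(scores[best_alpha], 3))
```

The key property is that every candidate is scored only on data it was not fitted on, so the value that wins is the one that generalizes best across folds rather than the one that memorizes the training set.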

Ensembling to Get the Yestimate

After the models had been tuned, I ensembled the four best-performing ones to form a stacked model that was... the Yestimate! By ensembling models together, I was able to take advantage of the strengths of each model while offsetting their weaknesses. In this way, I was able to create a model that was more robust and resistant to overfitting than any of the individual models. The performance of the nine individual models as well as the stacked Yestimate is given below with a comparison to the Zestimate baseline. After much tuning and playing with the models, I'm happy to say that the Yestimate's test mean absolute error is only about 2.1x that of the real Zestimate. Although this may sound impressive, there are a lot of nuances to this result that will be discussed on the next page.
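The intuition behind combining models can be shown with a much simpler relative of stacking: a two-model blend whose single weight is fit by least squares on held-out predictions. This is only a sketch under stated assumptions, not the Yestimate's actual stacking setup. The numbers are made-up toy values, and the helper names (fit_blend_weight, blend) are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_blend_weight(y_true, pred_a, pred_b):
    """Least-squares weight w for the blend w * pred_a + (1 - w) * pred_b."""
    num = sum((t - b) * (a - b) for t, a, b in zip(y_true, pred_a, pred_b))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    return num / den if den else 0.5  # identical models: any weight works

def blend(pred_a, pred_b, w):
    """Combine two prediction lists with weight w on the first."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]

# Toy held-out targets and predictions (illustrative only, not Yillow data).
y_val = [200.0, 350.0, 120.0, 480.0, 275.0]
pred_a = [220.0, 370.0, 140.0, 500.0, 300.0]  # this model tends to overshoot
pred_b = [185.0, 330.0, 110.0, 455.0, 260.0]  # this model tends to undershoot

w = fit_blend_weight(y_val, pred_a, pred_b)
blended = blend(pred_a, pred_b, w)

print("model A MSE:", mse(y_val, pred_a))
print("model B MSE:", mse(y_val, pred_b))
print("blend MSE:  ", mse(y_val, blended))
```

Because one model errs high and the other errs low, the blend cancels part of both errors. On the data used to fit the weight, the blend can never be worse than either model alone; whether that advantage holds on unseen data is exactly what a separate test set has to confirm.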

RMSLE Manufactured
This is a plot of how our models perform on the root-mean-squared-logarithmic-error (RMSLE) metric, with the Zestimate's score on the same data shown as a dashed line.
MAE Manufactured
This is a plot of how our models perform on the mean absolute error (MAE) metric, with the Zestimate's score on the same data shown as a dashed line.
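For readers who want the two metrics spelled out, here is a small sketch of how they are commonly computed. It assumes the usual RMSLE definition based on log(1 + x); the page does not state which variant was used, and the two example prices are made up.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the miss, in dollars."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both sides."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Two hypothetical homes, each mispriced by $30,000.
actual = [300_000, 150_000]
predicted = [330_000, 120_000]

print("MAE:  ", mae(actual, predicted))
print("RMSLE:", rmsle(actual, predicted))

# Per-home log errors: the same dollar miss is a bigger relative miss
# on the cheaper home, so it contributes more to RMSLE.
log_errors = [math.log1p(p) - math.log1p(t) for t, p in zip(actual, predicted)]
print(log_errors)
```

MAE treats every dollar of error the same, while RMSLE measures relative error, so a model can rank differently on the two plots depending on whether its misses fall on cheap or expensive homes.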

What I Learned & Summary

Although I have some experience with training machine learning models, I still learned a lot from this process. Most notably, I learned how to use Optuna, which is a super useful tool that sped up this process considerably. I'll be sure to bring Optuna with me into the future, and I'll also be on the lookout for new tools like it!

Through a careful selection of diverse models, meticulous hyperparameter tuning, and ensembling, I've finally done it: the Yestimate is real! And, even if I do say so myself, I don't think it's half-bad! Its MAE is only about 2.1 times that of the real Zestimate, which I consider to be a major accomplishment, especially considering that the Zestimate is a world-class model that has been trained on millions of data points.

However, comparing the Yestimate to the Zestimate in this way is a bit misleading, and there are so many nuances to this result that I need to devote a whole page to it.

Next Page: Performance Nuances

Exact Performance Statistics: Yestimate MAE: $64,349 | Yestimate RMSLE: 0.4936 | Zestimate MAE: $30,490 | Zestimate RMSLE: 0.3238
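As a quick check of the "about 2.1x" figure, the ratios can be worked out directly from the statistics above. The variable names are just labels for this snippet; the four values are the ones reported on this page.

```python
# Values copied from the performance statistics on this page.
yestimate_mae = 64_349
zestimate_mae = 30_490
yestimate_rmsle = 0.4936
zestimate_rmsle = 0.3238

mae_ratio = yestimate_mae / zestimate_mae
rmsle_ratio = yestimate_rmsle / zestimate_rmsle

print(f"MAE ratio:   {mae_ratio:.2f}")    # prints "MAE ratio:   2.11"
print(f"RMSLE ratio: {rmsle_ratio:.2f}")  # prints "RMSLE ratio: 1.52"
```

The gap looks smaller on RMSLE (about 1.5x) than on MAE (about 2.1x), which is one reason a single headline ratio does not tell the whole story.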


Yillow was created by Brandon Bonifacio with the help of a variety of sources which are credited on our References page.

Come check out my personal website or connect with me on LinkedIn!

Disclaimer: Yillow is an independent project, not affiliated with or endorsed by Zillow in any way. It is created for educational purposes and is not intended to infringe on any rights of Zillow.

No rights reserved - whatsoever.
