For any data science project, acquiring the right dataset is the crux of the journey. For Yillow, I needed a dataset that was comprehensive, varied, and representative of the current housing market. To continue the theme of parodying Zillow, the inspiration behind this project, I gathered the data from Zillow's own website. I did this through respectful web scraping, with deliberate multisecond delays between web requests so as not to place undue burden on the Zillow website. I wrote the scraper myself in Python and designed it to minimize disruption while still gathering the data necessary for the project. As a final note, I chose to scrape only houses in Washington because Washington is on my radar as a place to live one day. I also knew that my dataset wouldn't be gigantic, so predicting housing prices for all states would have been difficult given its size.
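The delayed-request approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual Yillow scraper: the function name `fetch_politely`, the `fetch` and `sleep` parameters, and the 3 to 6 second delay range are all assumptions chosen for the example. The source only says the delays were "multisecond".

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 3.0,  # assumed lower bound, in seconds
    max_delay: float = 6.0,  # assumed upper bound, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing several seconds between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause for a randomized multisecond interval before every
            # request after the first, so the site is never hit in a burst.
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages
```

Passing `fetch` in as a parameter keeps the pacing logic separate from whichever HTTP library does the downloading, and lets the delay behavior be checked without touching the network.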
The first dataset is a collection of approximately 2,500 single family houses in Washington that were sold recently. To get a varied and representative sample, I split the search by number of bedrooms, from one up to five. For each bedroom count, I collected 20 pages of Zillow results, which gave me a well-rounded dataset covering a range of home sizes. I chose single family homes because I might live in one someday, and I wanted to learn more about their price trends.
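To make the sampling scheme above concrete, here is a small sketch that enumerates the result pages it implies. The names `BEDROOM_COUNTS`, `PAGES_PER_CATEGORY`, and `crawl_plan` are hypothetical and used only for illustration. Five bedroom counts times 20 pages gives 100 result pages, so roughly 2,500 homes would work out to about 25 usable listings per page on average.

```python
from itertools import product

BEDROOM_COUNTS = range(1, 6)  # one through five bedrooms
PAGES_PER_CATEGORY = 20       # result pages collected per bedroom count


def crawl_plan():
    """Return every (bedrooms, page_number) pair to visit, in order."""
    page_numbers = range(1, PAGES_PER_CATEGORY + 1)
    return list(product(BEDROOM_COUNTS, page_numbers))


plan = crawl_plan()
print(len(plan))  # 100 result pages in total
print(plan[0])    # (1, 1): first page of one-bedroom homes
print(plan[-1])   # (5, 20): last page of five-bedroom homes
```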
The second dataset mirrors the first but focuses on manufactured homes instead. I acquired it in the same fashion as the single family home dataset, and I think it will be really interesting to compare and contrast housing trends for these two types of houses. With the advent of 3D printed houses and the increasing popularity of tiny homes, I also think it's very possible I will live in a manufactured home one day, so I wanted to learn more about the price trends for these houses as well.
The thirteen primary features extracted for all ~5,000 houses were price, zestimate, zipcode, bathrooms, bedrooms, living area, address, city, state, home type, longitude, latitude, and url. During the web scraping process, I only collected houses for which all thirteen of these features were present, so that the dataset would be complete for the most important features. To further enrich my dataset and analysis, I went the extra mile by visiting the individual webpage for each house listed. This took a very long time because I delayed my web requests by multiple seconds to be respectful of Zillow's website. It allowed me to gather six more secondary features about each property, such as the year the house was built, the heating and air conditioning status of the house, the parking status of the house, the offer review date, and more information on the type of house. As a note, some houses were missing secondary features, so I'll definitely need to do some data cleaning after this. Below is a picture of some of my data.
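The completeness rule for the thirteen primary features can be sketched as a simple filter. This is an illustrative assumption about how such a check might look, not the code used for Yillow: the dictionary keys (for example `living_area` and `home_type`) and the function names are hypothetical, and treating `None` and the empty string as "missing" is a choice made for this example.

```python
PRIMARY_FEATURES = (
    "price", "zestimate", "zipcode", "bathrooms", "bedrooms",
    "living_area", "address", "city", "state", "home_type",
    "longitude", "latitude", "url",
)


def has_all_primary_features(listing: dict) -> bool:
    """True only if every primary feature is present and non-empty."""
    return all(
        listing.get(name) not in (None, "") for name in PRIMARY_FEATURES
    )


def keep_complete(listings):
    """Drop any listing that is missing at least one primary feature."""
    return [item for item in listings if has_all_primary_features(item)]
```

Secondary features are deliberately left out of this check, since the text notes that they are sometimes missing and are handled later during data cleaning.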
I had never done any web scraping before this project, and it was really fun to learn even though it took a lot of time. I learned how to use the BeautifulSoup library in Python, which is a really powerful tool for web scraping. This part of Yillow took a really long time, but I wanted to get it right because the entire project relies on the quality of the data I could collect. In hindsight, I probably could have collected a lot more features about each house, such as the price of the nearest neighbor, the Walkability score, and more. If I ever go through a data collection process like this again, I'll be sure to squeeze as much data as possible from my data source, because it makes a big difference in the end even when I'm tempted to jump ahead to the next steps.
Next Page: Data Cleaning
Yillow was created by Brandon Bonifacio with the help of a variety of sources which are credited on our References page.
Disclaimer: Yillow is an independent project, not affiliated with or endorsed by Zillow in any way. It is created for educational purposes and is not intended to infringe on any rights of Zillow.
No rights reserved - whatsoever.