Feature engineering is the process of creating new features from our existing ones, or modifying the ones we have, so that machine learning algorithms work better. It helps because it extracts more information for the machine learning algorithm, adds useful new dimensions to the dataset, and can transform our current features into forms that work better for machine learning.
The first step I took was to parse the string features. For example, if the parking feature read '2 Carport Spots', I extracted the '2' and used it as the numerical value for a new 'parking spaces' feature. That way, instead of every parking string being its own unique value, I had a separate feature representing how many parking spots there were. I used this string-extraction technique on a lot of features: some heating strings listed multiple methods, so I added two more features to capture each way a house was heated. I also pulled plenty of features out of the address string, such as the unit number, street type (like 'Way' or 'Drive'), street direction (e.g., 'NE' or 'W'), and the name of the street. I also looked for special keywords like 'Lake', 'Riverside', or 'Canyon', and included a boolean feature indicating whether the address has a hyphen. Finally, I used string parsing to identify the type of property (such as 'unit' or 'trailer'). After this, I had thirty total features for each house, which I was fairly happy with.
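To give a sense of what this looked like, here's a minimal sketch of the kind of parsing I'm describing; the column names and rows are placeholders rather than the exact ones from my dataset:

```python
import pandas as pd

# Hypothetical example rows; column names are illustrative only.
df = pd.DataFrame({
    "parking": ["2 Carport Spots", "1 Garage Spot", "None"],
    "address": ["123 Lakeview Way NE", "45-B Canyon Drive W", "678 Main St"],
})

# Pull the leading number out of strings like '2 Carport Spots'.
df["parking_spaces"] = (
    df["parking"].str.extract(r"(\d+)", expand=False).fillna(0).astype(int)
)

# A few of the address-derived features: keyword flags, a hyphen indicator,
# and the street type.
df["near_lake"] = df["address"].str.contains("Lake", case=False)
df["has_hyphen"] = df["address"].str.contains("-")
df["street_type"] = df["address"].str.extract(r"\b(Way|Drive|St|Ave|Blvd)\b", expand=False)
```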
Unfortunately, machine learning algorithms can't take strings as input, so all of the categorical variables needed to be converted into numbers. To do this, I used a handy trick called one-hot encoding, via pd.get_dummies, to create a new feature for each category of a categorical variable. In other words, each category of a qualitative variable gets its own column, marked '1' if the category applies to a house and '0' otherwise. This way, the machine learning algorithm can process the data without implicitly assigning order or magnitude where there isn't any.
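For example, here's roughly what that encoding step looks like (again with a placeholder column):

```python
import pandas as pd

# 'heating' stands in for one of my categorical columns.
df = pd.DataFrame({"heating": ["Forced Air", "Baseboard", "Forced Air"]})

# One column per category, 1 where the category applies and 0 where it doesn't.
encoded = pd.get_dummies(df, columns=["heating"], prefix="heating", dtype=int)
print(encoded)
#    heating_Baseboard  heating_Forced Air
# 0                  0                   1
# 1                  1                   0
# 2                  0                   1
```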
An interesting note with encoding is that, originally, I didn't encode the zip code because I thought it was a quantitative variable (after all, it's a number, right?). However, when the machine learning model wasn't getting results as good as I was expecting, I realized that zip code made the machine learning work significantly better as a categorical variable than as a quantitative one! This makes a lot of sense in hindsight, because a zip code is just a unique identifier for a region, not a quantity where a bigger or smaller value actually means more or less of something, like the number of bedrooms does.
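In practice, the fix was just making sure the zip code got one-hot encoded along with the other categorical variables rather than being left as a number; something along these lines (the column name is illustrative):

```python
# Treat the zip code as a label, not a number, before one-hot encoding it.
df["zipcode"] = df["zipcode"].astype(str)
df = pd.get_dummies(df, columns=["zipcode"], prefix="zip", dtype=int)
```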
After encoding my features, I noticed something interesting: features that mapped almost one-to-one to individual houses had a correlation that was super close to 1, even though they don't give us any new information about the house! A prime example of this is the url, which was unique for each house. Features like this led to overfitting in the machine learning process, so I had to remove them.
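A quick way to catch columns like that is to check how many unique values they have relative to the number of rows; this is a sketch of the kind of check I mean, not my exact code:

```python
# Drop columns that are (nearly) unique per row, like the listing URL:
# they identify the house rather than describe it, which invites overfitting.
n_rows = len(df)
id_like_cols = [
    col for col in df.columns
    if df[col].nunique() / n_rows > 0.95  # the threshold is a judgment call
]
df = df.drop(columns=id_like_cols)
```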
Machine learning algorithms tend to work best when all quantitative features and (in the case of regression) the prediction target are normally distributed and on a small scale. In other words, the closer a feature is to a normal distribution with mean 0 and standard deviation 1, the better. Although most of my features were small and approximately normally distributed, there were two notable exceptions: living area and price.
Thankfully, this is an easy fix if you're aware of it. Both features had long rightward tails, so I used a log transformation to bring them down to a smaller scale and make them much closer to normal. Although I would need to be extra careful in my code to keep track of where I was using the logarithm of the price/living area and where I wasn't, that was a small price to pay for the benefit this transformation gives the machine learning.
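Concretely, the transformation (and its inverse, for turning predictions back into dollars) looks something like this; log1p/expm1 are just the numpy helpers I'd reach for, not necessarily the exact calls in my notebook:

```python
import numpy as np

# Compress the long right tails of price and living area.
df["log_price"] = np.log1p(df["price"])
df["log_living_area"] = np.log1p(df["living_area"])

# After predicting log_price, undo the transform to get dollars back.
predicted_log_price = df["log_price"].mean()  # placeholder for a model's prediction
predicted_price = np.expm1(predicted_log_price)
```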
A technique I played with was using convolutional neural networks to extract house features from pictures, such as how many stories the house had. However, my journey with this technique doesn't fit on this page, so there's a whole subpage devoted to it.
Side Page: House Pictures

Although I have done a lot of feature engineering in the past, there are always new things to learn. In this case, I learned that deciding whether each variable is categorical or quantitative is not as simple as asking "is it a string?" or "is it a number?". For example, as we'll see in the data analysis section, the zip code is one of the most important features for determining price, so I sure am glad I figured out it was a categorical variable!
However, looking back on this section, I think I could have done a lot more to create features. For example, I could have used the latitude and longitude values to extract a lot of information about the location of the houses, such as the elevation, the distance to the nearest body of water, proximity to railroads, or the distance from the nearest major city like Seattle.
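For instance, the distance to downtown Seattle could be computed from latitude and longitude with the haversine formula; this is just a sketch of the idea, not something I actually built into the project:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Approximate coordinates of downtown Seattle.
SEATTLE_LAT, SEATTLE_LON = 47.6062, -122.3321
df["km_to_seattle"] = haversine_km(df["latitude"], df["longitude"],
                                   SEATTLE_LAT, SEATTLE_LON)
```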
With these feature engineering techniques, I was able to prepare my dataset to better fit into the Yestimate. By crafting these new features, I could capture more nuances in the data, potentially improving the performance of the Yestimate. Let's hope it was enough! Check out the next page to see if it was...
Next Page: Yestimate Construction