Using machine learning to predict your rent

Part 1: Speaking in machine learning

Feb 4, 2021

Last year, my girlfriend and I were apartment hunting in Vancouver, a notoriously expensive city. Walking into a 700-square-foot apartment that’s listed for $2000 (and feels like a shoebox) makes one go “this is crazy.” Having that happen multiple times makes one go “wait, is everyone crazy?”

Well, let’s find out. Can we use machine learning to predict what a “good” rental price is for a given apartment size?

(This is the first of a few articles on the basics of machine learning. To set expectations, we’re not going to solve the problem in this article, but we ARE going to introduce a bunch of terms and concepts that will help us build a solution.)

Starting with the data

Let’s start with a bunch of data on rental prices and square footages. (If you’re interested in reading an article about how to scrape this data from Craigslist, click here.)

Imagine the data looks something like this:

Which we can plot like so (not exact points):

Our first approach will be simple. We want to find the “line of best fit”, the line that goes right through that data:

We could calculate that ourselves, but it’s more interesting to use machine learning. Plus, one day, we can use our machine learning chops to take a more sophisticated approach, like plotting a curved line through the points.
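For the curious, here is roughly what "calculating that ourselves" could look like. This is a minimal sketch of ordinary least squares in plain Python, not the machine learning approach this series is building toward. The listings are made-up numbers for illustration, and `fit_line` is a name I picked for this example.

```python
# Hypothetical (square footage, monthly rent) pairs, invented for illustration.
listings = [(500, 1450), (600, 1550), (700, 1850), (800, 1950), (900, 2200)]


def fit_line(points):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # How x and y vary together, and how x varies on its own.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(listings)
print(f"rent is roughly {slope:.2f} * square_feet + {intercept:.0f}")
```

With these invented numbers, the slope works out to about $1.90 of rent per extra square foot. Real listings would be much noisier.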

Defining the problem

Before we get to the actual machine learning, let me introduce a couple of terms, so we can talk about the problem more fluently. Our goal is this: given a square footage, what would be the expected rent?

We can visualize it like so:

… where for a given x value (square footage), we want to find the y value (rent). In other words, what rent value corresponds with the red circle?

The first question we need to answer: what kind of problem is this?

Regression problems

In the above graph, we can see that we’re trying to find the rent from a range of values, from 0 to a million bajillion. If we were idiots, we wouldn’t have any idea where in that range the y value would fall. Since we’re smart, we can assume it’s probably between $500 and $50,000. If we were even smarter, we could look at our dataset and assume a given rent will fall between the lowest and the highest values there.

But… we don’t know WHERE in that range the rent will fall. It could be any dollar amount, in theory. Since we’re trying to find a value in a range, AKA a continuous value, we call this kind of problem a regression problem.

Regression involves picking one value out of a continuous range of values. On a scale of $500 to $50,000 a month in rent, we’re trying to pick one (say, $2000) as the expected rent for a given square footage.
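In code, the answer to a regression problem is just a number. Here is a small sketch, assuming we already had a line to work with. The slope and intercept below are invented for illustration (we haven't learned anything from data yet), and `predict_rent` is a hypothetical helper name.

```python
# Hypothetical line: rent = SLOPE * square_feet + INTERCEPT.
# Both numbers are made up for this example.
SLOPE = 2.0
INTERCEPT = 400.0


def predict_rent(square_feet):
    """Return a predicted monthly rent, which can be any dollar amount."""
    return SLOPE * square_feet + INTERCEPT


# Nearby square footages give nearby (but different) rents.
predictions = {sqft: predict_rent(sqft) for sqft in (550, 700, 725)}
for sqft, rent in predictions.items():
    print(f"{sqft} sq ft -> ${rent:.0f}")
```

Notice that the output isn't limited to a handful of options. A slightly bigger apartment gets a slightly higher number, which is what makes this a continuous value.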

Classification problems

The counterpart to a regression problem is a classification problem, which is when you’re trying to sort something into one of a fixed set of categories. The simplest version is a YES/NO question. If, for example, we wanted to say whether a given rental price was above or below the average rent, that would be a classification problem.

In a classification problem like this one, you’re looking for one of two answers: is it above or below the average rent? Those are the only possibilities. In a regression problem, you’re looking for an answer within a range: $500 to $50,000.
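To make the contrast concrete, here is a tiny sketch of the above-or-below-average question. It isn't a trained classifier, just a hand-written rule that shows the shape of the answer: one of two labels. The rents are invented, and `is_above_average` is a name chosen for this example.

```python
# Hypothetical monthly rents, invented for illustration.
rents = [1400, 1650, 1800, 2100, 2550]

average_rent = sum(rents) / len(rents)


def is_above_average(rent, average):
    """Answer the YES/NO question: is this rent above the average?"""
    return rent > average


# Every listing gets exactly one of two labels: True or False.
labels = [is_above_average(rent, average_rent) for rent in rents]
for rent, label in zip(rents, labels):
    print(f"${rent}: {'above' if label else 'at or below'} average")
```

Compare this with regression, where the answer could be any dollar amount in the range.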

Clustering and association problems

But those aren’t the only types of machine learning problems. Let me throw two more at you: clustering and association.

Clustering problems involve organizing data into meaningful chunks. You might ask your machine learning algorithm to split our rental properties into three groups, and the algorithm might come back with a group at around $1000/month, a group between $1200 and $1500, and a group around $2000.
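Here is one way that three-group split could be sketched. I'm assuming k-means as the clustering algorithm (the article hasn't picked one), simplified to a single dimension. The rents are invented, and `k_means_1d` is a hypothetical helper that assumes `k` is at least 2.

```python
# Hypothetical monthly rents, invented for illustration.
rents = [950, 1000, 1050, 1200, 1350, 1500, 1950, 2000, 2100]


def k_means_1d(values, k, iterations=20):
    """Split numbers into k groups using a simple one-dimensional k-means."""
    ordered = sorted(values)
    # Start with k centers spread evenly across the sorted values (k >= 2).
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = []
    for _ in range(iterations):
        # Assign every value to its nearest center.
        groups = [[] for _ in range(k)]
        for value in ordered:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


groups = k_means_1d(rents, k=3)
for group in groups:
    print(group)
```

We only told the algorithm how many groups we wanted. Where the boundaries between groups fall is something it works out from the data.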

Association problems are about finding relationships between data, which kind of sounds like the same thing, at first glance. An example can illustrate, though: let’s say we wanted to find what apartment features are associated with each price range. Our algorithm might come back and tell us that gas fireplaces are usually only in apartments with $2000+ rent, or that $1000 apartments are usually under 1000 square feet.
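A rough sketch of the fireplace example: for listings that have a given feature, what share of them rent for at least some amount? In association rule mining, that share is usually called the confidence of the rule. A real association algorithm would search for rules like this on its own. Here we only check a rule we already have in mind. The listings and the `rule_confidence` helper are made up for illustration.

```python
# Hypothetical listings, invented for illustration.
listings = [
    {"rent": 1000, "features": {"laundry"}},
    {"rent": 1100, "features": {"laundry", "balcony"}},
    {"rent": 2100, "features": {"gas fireplace", "balcony"}},
    {"rent": 2300, "features": {"gas fireplace", "laundry"}},
    {"rent": 2500, "features": {"balcony"}},
]


def rule_confidence(listings, feature, min_rent):
    """Share of listings with `feature` whose rent is at least `min_rent`."""
    with_feature = [item for item in listings if feature in item["features"]]
    if not with_feature:
        return None  # The feature never appears, so the rule can't be checked.
    matching = [item for item in with_feature if item["rent"] >= min_rent]
    return len(matching) / len(with_feature)


fireplace_confidence = rule_confidence(listings, "gas fireplace", 2000)
balcony_confidence = rule_confidence(listings, "balcony", 2000)
print(f"gas fireplace -> $2000+: {fireplace_confidence:.2f}")
print(f"balcony -> $2000+: {balcony_confidence:.2f}")
```

In this toy data, every apartment with a gas fireplace rents for $2000 or more, while balconies show up across price ranges, so the fireplace is the stronger association.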

To give an example from a different context, let’s say you’re a data scientist at Amazon. Clustering can help you divide your products into groups based on their sales volume, while association can tell you which products are usually bought together.

Summarizing the problem types

Here’s a summary of all four, pretending we are working with a dataset of people’s birthdays and their weight:

Classification: is a given person’s weight higher than average?

Regression: given an age, what is a person’s most likely weight?

Clustering: how can we divide our data into four meaningful groups?

Association: if we had more data in our set, an association algorithm might be able to find meaningful patterns with, say, weight and grams of carbohydrates eaten per week.

Supervised vs unsupervised learning

Speaking of clustering, we can actually divide these four problem types into two groups. Here’s the difference between them:

In classification and regression, we already know what kind of answer we’re looking for, and our dataset contains examples where that answer is known. With classification, it’s a YES/NO, and with regression, it’s a value in a range.

But with clustering and association, we don’t know the answer. What clusters will our values be grouped into? What associations will be found? Who knows!

You can view this distinction as the difference between doing a test and pursuing a research project. In a test, there’s a right answer, but with an open-ended research project, you don’t know what you’ll turn up!

The terms for these two groups of problems are supervised and unsupervised. Classification and regression are supervised machine learning problems, since they have a “right” answer. Clustering and association are unsupervised machine learning problems, since they’re more ambiguous.
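One way to see the difference is in the shape of the data. In this sketch (invented numbers, hypothetical names), supervised data pairs each input with a known answer, so we can grade a guess the way a test gets graded. Unsupervised data has inputs only, so there is nothing to grade against.

```python
# Supervised: each example pairs an input with a known "right" answer.
# Hypothetical (square footage, monthly rent) pairs.
labeled_listings = [(500, 1450), (700, 1850), (900, 2200)]

# Unsupervised: inputs only, with no answer attached to any of them.
unlabeled_rents = [950, 1000, 1200, 1500, 2000]


def hypothetical_predict(square_feet):
    """A made-up guess at rent, used only to show how grading works."""
    return 2 * square_feet + 400


def mean_absolute_error(predict, labeled):
    """Average gap, in dollars, between predictions and the known answers."""
    gaps = [abs(predict(x) - answer) for x, answer in labeled]
    return sum(gaps) / len(gaps)


error = mean_absolute_error(hypothetical_predict, labeled_listings)
print(f"average miss: ${error:.2f}")
```

There is no equivalent score for `unlabeled_rents`, because there are no known answers to compare against.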

Defining the problem, part two

Okay, that was a lot of terminology. But now we can say the following with confidence: we need a supervised machine learning algorithm to solve the regression problem of finding a reasonable apartment rent based on square footage.

Crafting that statement may not seem like a lot of progress, but we now know exactly what we’re seeking, and can start building a solution.

If you liked this article, and want to see me get around to actually solving this problem, subscribe to my newsletter below.

VOLTA

Learning by building, trying things out, turning complexity into simplicity.

scottdomes.substack.com

This article is based heavily on Andrew Ng’s excellent course on machine learning. All credit for my knowledge goes to him.

Other sources:

https://medium.com/quick-code/regression-versus-classification-machine-learning-whats-the-difference-345c56dd15f7