For most people, buying a property will be the biggest purchase they ever make. So what’s the smart way for a data scientist to figure out the best flat deal in Hong Kong? Scrape and hack the market, of course.
Normal home buyers may burn a lot of shoe leather visiting dozens of real estate agents, spend hours on websites looking at individual entries, and maybe even start a spreadsheet. Alternatively, there are many websites with dozens of pages as research fodder.
But using a bit of data, or in this case around 1.65 million transaction records, we can start to see through the sales pitches and get a feel for the total market.
What’s missing is a tool that allows you to view the actual sale prices throughout Hong Kong with a single user interface. That’s what we developed during Data Science Hong Kong’s fifth unhackathon using data kindly donated by Hong Kong’s premier Open Data platform dataguru.hk.
A market exploration tool
Using a dataset of 1.65 million transactions scraped from the site of one of Hong Kong’s main real estate agencies, this tool (pictured below) shows the average price per square foot by location and time.
Reading the data
Each circle represents a building, colour-coded and shaded darker to show higher prices per square foot. Where more sales are registered in the user’s chosen time period the circle grows larger.
With this tool, understanding the market becomes a lot easier and more intuitive. In a single interface, all the actual transactions can be summarized and filtered by attributes such as price, size, address or sale date, for example to show only flats between 300 and 400 square feet.
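As a rough illustration of that kind of filtering, the sketch below averages the price per square foot by address for one year and one size band. The records and field names (`address`, `year`, `area_sqft`, `price_hkd`) are made up for the example and are not the real schema of the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; the field names and values are assumptions for
# illustration, not the real dataset.
transactions = [
    {"address": "Building A", "year": 2017, "area_sqft": 350, "price_hkd": 5_000_000},
    {"address": "Building A", "year": 2017, "area_sqft": 380, "price_hkd": 5_600_000},
    {"address": "Building B", "year": 2017, "area_sqft": 620, "price_hkd": 9_300_000},
    {"address": "Building B", "year": 2016, "area_sqft": 310, "price_hkd": 4_030_000},
]


def average_price_per_sqft(records, year, min_area, max_area):
    """Average price per square foot by address, for one year and size band."""
    per_address = defaultdict(list)
    for record in records:
        in_band = min_area <= record["area_sqft"] <= max_area
        if record["year"] == year and in_band:
            per_address[record["address"]].append(
                record["price_hkd"] / record["area_sqft"]
            )
    return {address: mean(values) for address, values in per_address.items()}


# Flats between 300 and 400 square feet sold in 2017.
summary = average_price_per_sqft(transactions, 2017, 300, 400)
```

Only "Building A" survives the filter here, since the other flats are either too large or sold in a different year.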
At first glance, the colours of the markers would suggest that Central and the west of Hong Kong Island look like the most expensive areas, followed by Tsim Sha Tsui, and the southern Kowloon peninsula.
Looking closer at the northern edge of Kowloon Park, we can set the transaction year to 2017. A summary appears upon hovering over the property, and we can see that three flat sales occurred at 10-24 Parkes Street (known as Wing Fu Mansion), for an average price of under HK$14,400 a square foot.
The following animation shows how easy it is to navigate through the districts and market history, and get a clear and visual idea of prices in the area.
You can try it yourself by following this link.
How to get the data
To build this tool, we first need to collect the data.
This dataset was scraped using the Python web-scraping library Scrapy. For most transactions, it records the flat, its floor and floor area, the building address and, most importantly, the price and date of each sale.
After cleaning the data, our team began looking at prices and transaction volumes, including outliers. You can see a presentation of the data here.
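The exact cleaning and outlier rules are not described here, so the following is only a sketch of one common approach: skip records with a missing area or price, then flag prices per square foot that fall outside 1.5 times the interquartile range. The field names and the IQR rule are assumptions, not necessarily what the team used.

```python
from statistics import quantiles


def price_per_sqft(record):
    """Return price per square foot, or None when area or price is missing."""
    area = record.get("area_sqft")    # hypothetical field name
    price = record.get("price_hkd")   # hypothetical field name
    if not area or not price:
        return None
    return price / area


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Illustrative prices per square foot, with one suspicious entry.
prices = [12000, 12500, 13000, 13500, 14000, 14500, 90000]
outliers = flag_outliers(prices)
```

Flagged values are worth inspecting by hand before dropping them, since a very high price can be a data entry error or a genuine luxury sale.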
Collecting geolocation data
To be able to correctly place a flat on a map, we need its geolocation, consisting of a longitude and latitude pair. The only location data we have is the address, so we can use the Google Maps API to convert the addresses to lat/lon pairs.
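For reference, a request to the Google Geocoding web service might look like the sketch below. It only builds the request URL and parses a response, so it runs without network access. The API key is a placeholder and the sample response is a trimmed, made-up payload with illustrative coordinates.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(address, api_key):
    """Build the request URL for one address."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def parse_lat_lon(payload):
    """Extract (lat, lon) from a Geocoding API JSON response, or None."""
    data = json.loads(payload)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    location = data["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# "YOUR_API_KEY" is a placeholder; the coordinates below are illustrative only.
url = geocode_url("10-24 Parkes Street, Hong Kong", "YOUR_API_KEY")
sample_response = json.dumps({
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 22.3, "lng": 114.17}}}],
})
coordinates = parse_lat_lon(sample_response)
```

Sending the request itself, for example with `urllib.request.urlopen(url)`, is left out so that the sketch stays free of network calls and quota usage.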
But Google limits API calls to 2,500 addresses, charging 50 cents for each additional 1,000 requests. That is small change, but fortunately it’s not necessary in Hong Kong, as other organizations also offer this service.
The OFCA government website, used to check whether a building has digital television or fibre-optic broadband, also returns the geolocation of the searched address. We automatically requested all block addresses. In the following screenshots, we verify correct resolution of the address.
After joining transactions and geolocations, we can start building the tool.
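The join itself can be as simple as a dictionary lookup keyed on the address. The sketch below assumes the geolocations are held in a mapping from address to a (latitude, longitude) pair, and keeps track of transactions whose address could not be resolved. All names and values are hypothetical.

```python
def join_geolocations(transactions, geolocations):
    """Attach lat/lon to each transaction; collect those with no match."""
    joined, unmatched = [], []
    for transaction in transactions:
        coords = geolocations.get(transaction["address"])
        if coords is None:
            unmatched.append(transaction)
        else:
            lat, lon = coords
            joined.append({**transaction, "lat": lat, "lon": lon})
    return joined, unmatched


# Hypothetical inputs for illustration.
transactions = [
    {"address": "Building A", "price_hkd": 5_000_000},
    {"address": "Building C", "price_hkd": 7_200_000},
]
geolocations = {"Building A": (22.3, 114.17)}

joined, unmatched = join_geolocations(transactions, geolocations)
```

Keeping the unmatched records aside makes it easy to see how many addresses still need to be resolved or corrected.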
Building the visualization tool
The main steps to build this tool are:
- Creating a basic HTML template with text blocks and containers for the title, caption, transaction year filter and popup text
- Loading the transaction data and looping over it. For each year and address, we compute the circle parameters.
- The radius is a linear function of the square root of the number of transactions. With this choice, circle area, rather than radius, scales linearly with the number of transactions.
- The colors (red and green channels) are simple linear functions of the price.
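The circle parameters described above can be sketched as follows. The scale factor, the price bounds and the direction of the colour ramp (green for cheap, red for expensive) are assumptions made for the example. The tool's actual constants are not given here.

```python
import math


def circle_radius(n_transactions, scale=3.0):
    """Radius proportional to sqrt(n), so circle area is proportional to n."""
    return scale * math.sqrt(n_transactions)


def price_colour(price, low, high):
    """Map a price to a hex colour with linear red and green channels.

    Assumed ramp: green at `low`, red at `high`, clamped outside that range.
    """
    t = (price - low) / (high - low)
    t = min(1.0, max(0.0, t))
    red = round(255 * t)
    green = round(255 * (1 - t))
    return f"#{red:02x}{green:02x}00"


# Four times as many sales doubles the radius and quadruples the area.
small = circle_radius(1)
large = circle_radius(4)
```

Scaling the radius by the square root avoids the visual exaggeration that comes from making the radius itself proportional to the count, where doubling the sales would quadruple the circle's area.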
Taking the next steps
Various improvements to this first version can be implemented:
- Some data quality issues have been raised (missing flat areas, block address, etc.) and need to be corrected
- The decision to buy a flat depends on other factors that could be incorporated into this visualization tool, for instance car traffic, public transport options, school networks (some of which are already included in the data), average pollution levels and altitude/topography
- Adding current flat sale offers in the vicinity to help find the best deals
If you have any questions or want to explore the data further, don’t hesitate to send a message in the Slack group: datasciencehk.slack.com
Thanks go to dataguru.hk for the data and support.
Thanks go to the Data Science Hong Kong organizers for this event.