Welcome to Data Science Hong Kong

Data science is starting to become embedded in Hong Kong. We are a community for all data scientists in Hong Kong — from beginners to multi-decade practitioners of BI, artificial intelligence and data warehouse design, and from students to professors — and their fellow travellers, from business, government and academia. We want to create an environment where data scientists can learn from each other and share their stories, and a community that non-data scientists can turn to when they want to understand more about what data science can do for them. We will organise monthly events such as unhackathons and lectures as well as hosting social media platforms.

Join us on our other social platforms to keep up-to-date with our activities and to become part of our community:


Slack, Meetup, Facebook, LinkedIn

Hong Kong property data can work for you: here’s how

Buying a property will, for most people, be the biggest purchase they ever make. So what’s the smart way for a data scientist to figure out the best flat deal in Hong Kong? Scrape and hack the market, of course.

Normal home buyers may burn a lot of shoe leather visiting dozens of real estate agents, spend hours on websites looking at individual entries, maybe even start a spreadsheet.

But using a bit of data, or in this case around 1.65 million transaction records, we start to see through the sales pitches and get a feel for the total market.

Alternatively, there are many websites with dozens of pages as research fodder.

Only 90 pages, no biggie

What’s missing is a tool that allows you to view the actual sale prices throughout Hong Kong with a single user interface. That’s what we developed during Data Science Hong Kong’s fifth unhackathon.

A market exploration tool

Using a dataset of 1.65 million transactions scraped from the site of one of Hong Kong’s main real estate agencies, this tool (pictured below) shows the average price per square foot by location and time.

A visual representation of property price per square foot in 2017

Reading the data

Each circle represents a building, colour-coded and shaded darker to show higher prices per square foot. Where more sales are registered in the user’s chosen time period the circle grows larger.

With this tool, market understanding becomes a lot easier and more intuitive. In one single interface, all the actual transactions can be summarized and filtered using attributes like price, size, address or sale date — for example, flats between 300 and 400 square feet.

At first glance, the colours of the markers would suggest that Central and the west of Hong Kong Island look like the most expensive areas, followed by Tsim Sha Tsui, and the southern Kowloon peninsula.

Looking closer at the northern edge of Kowloon Park, we can set the transaction year to 2017. A summary appears when hovering over a property, and we can see that three flat sales occurred at 10-24 Parkes Street (known as Wing Fu Mansion), at an average price of under HK$14,400 a square foot.

Sales in the north of Kowloon Park in 2017 

The following animation shows how easy it is to navigate through the districts and market history, and get a clear and visual idea of prices in the area.

Only the beginning for our property search tool

You can try it yourself by following this link.

How to get the data

To build this tool, we first need to collect the data.

How the original data was presented online

This dataset was scraped using the Python web-scraping library Scrapy. For most transactions, it records the flat, floor, floor area and building address and, most importantly, the sale price and date.

One transaction in JSON format
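
As a rough sketch of how one such JSON line can be turned into the price per square foot shown in the tool, the snippet below parses a record with Python's standard library. The field names (`address`, `area_sqft`, `price_hkd`, `date`) and the sample values are assumptions made for illustration; the scraped data may well use different keys.

```python
import json
from typing import Optional

# Hypothetical record: field names and values are illustrative, not the real schema.
SAMPLE_LINE = (
    '{"address": "1 Example Road", "floor": "12", "flat": "A", '
    '"area_sqft": 400, "price_hkd": 6000000, "date": "2017-05-01"}'
)


def price_per_sqft(line: str) -> Optional[float]:
    """Return the price per square foot for one JSON line, or None if it cannot be computed."""
    record = json.loads(line)
    area = record.get("area_sqft")
    price = record.get("price_hkd")
    # Some transactions lack a floor area, so guard against missing or zero values.
    if not area or price is None:
        return None
    return price / area


if __name__ == "__main__":
    print(price_per_sqft(SAMPLE_LINE))
```

Returning `None` rather than raising keeps incomplete records out of the averages without stopping the whole run.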

After cleaning the data, our team began looking at prices and transaction volumes, including outliers. You can see a presentation of the data here.

Collecting geolocation data

To be able to correctly place a flat on a map, we need its geolocation, consisting of a longitude and latitude pair. The only location data we have is the address, so we can use the Google Maps API to convert the addresses to lat/lon pairs.

But Google limits free API calls to 2,500 addresses, with a 50 cent charge for each additional 1,000 requests. That is small change, but fortunately it’s not necessary in Hong Kong as other organizations also offer this service.

The OFCA government website, used to check whether a building has digital television or fibre-optic broadband, also returns the geolocation of the searched address. We automatically requested all block addresses. In the following screenshots, we verify correct resolution of the address.

Geolocation results for 103 Robinson Road
Google Maps results for this geolocation
Street view of the result: building #103

After joining transactions and geolocations, we can start building the tool.
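
The join itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: transactions are dictionaries with an `address` key, geolocations are a mapping from address to a (latitude, longitude) pair, and addresses are matched after a simple whitespace and case normalisation. The names and the coordinates in the usage example are hypothetical.

```python
from typing import Dict, List, Tuple

LatLon = Tuple[float, float]


def normalise(address: str) -> str:
    """Collapse whitespace and upper-case an address so small formatting differences still match."""
    return " ".join(address.upper().split())


def join_geolocations(
    transactions: List[dict], geolocations: Dict[str, LatLon]
) -> List[dict]:
    """Attach lat/lon to each transaction whose address has a known geolocation."""
    lookup = {normalise(addr): latlon for addr, latlon in geolocations.items()}
    joined = []
    for tx in transactions:
        latlon = lookup.get(normalise(tx["address"]))
        if latlon is None:
            # No geolocation found: skip the record rather than guess a position.
            continue
        joined.append({**tx, "lat": latlon[0], "lon": latlon[1]})
    return joined


if __name__ == "__main__":
    # Illustrative values only; the coordinates are made up.
    txs = [{"address": "1  example road", "price_hkd": 6000000}]
    geo = {"1 Example Road": (22.28, 114.15)}
    print(join_geolocations(txs, geo))
```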

Building the visualization tool

The main steps to build this tool are:

  • Creating a basic HTML template with text blocks and containers for titles, caption, transaction year filter and popup text
  • Loading the map from the ESRI JavaScript API. The parameters are set to focus on Hong Kong.
  • Loading the transaction data and looping over it. For each year and address, we compute the circle parameters.
    • The radius is a linear function of the square root of the number of transactions. With this choice, the circle area scales linearly with the number of transactions.
    • The colours (red and green channels) are simple linear functions of the price.
  • Managing user events: when the user hovers the mouse over a point, a JavaScript function is called to display the data related to that point. When the mouse leaves, the data is cleared.
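
The circle parameters described in the steps above can be sketched as follows. The tool itself runs in JavaScript; this Python version only illustrates the arithmetic. The constants `base` and `scale`, the price bounds, and the choice of green for cheap and red for expensive are assumptions, not the values used in the tool.

```python
import math
from typing import Tuple


def circle_radius(n_transactions: int, base: float = 2.0, scale: float = 3.0) -> float:
    """Radius as a linear function of the square root of the transaction count.

    With base=0 the circle area is exactly proportional to the count.
    """
    return base + scale * math.sqrt(n_transactions)


def circle_colour(price: float, low: float, high: float) -> Tuple[int, int, int]:
    """Map a price per square foot to an (R, G, B) triple, linear in price.

    Prices outside [low, high] are clamped to the ends of the scale.
    """
    t = (price - low) / (high - low)
    t = max(0.0, min(1.0, t))
    red = round(255 * t)          # red channel rises with price
    green = round(255 * (1 - t))  # green channel falls with price
    return (red, green, 0)


if __name__ == "__main__":
    print(circle_radius(4))
    print(circle_colour(5000, low=5000, high=25000))
```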

Taking the next steps

Various improvements to this first version can be implemented:

  • Some data quality issues have been raised (missing flat areas, block address, etc.) and need to be corrected
  • The decision to buy a flat depends on other factors that could be incorporated into this visualization tool, such as car traffic, public transport options, school networks (some of which are already included in the data), average pollution levels and altitude/topography
  • Adding current flat sale offers in the vicinity to help find the best deals

If you have any questions or want to explore the data further, don’t hesitate to send a message in the Slack group: datasciencehk.slack.com

Credits:
Thanks go to the Data Science Hong Kong organizers for this event.

Data science news round up

Our tight-knit community of data scientists has shared a wealth of news and inspiring projects from around the web over the past couple of months. Here is a brief round-up of the more interesting articles, and remember, you can join in on our Slack group.

Millions of Chinese farmers reap benefits of huge crop experiment

An article that demonstrates the world-changing potential of evidence-based approaches to the world’s problems. For me, it’s also a reminder that it’s often not the latest buzzwords or the most glamorous topics that have the most impact.

Winning with Data Science

Next is an article examining the business and organisational side of data science. This is a topic that probably doesn’t get enough attention compared to the latest and coolest algorithm. It’s important for data scientists to take an interest in how organisations should adapt. If they don’t, it will probably be decided by someone not qualified to make the decision!

What Comes After Deep Learning?

This article examines whether deep learning is actually a blind alley and considers what new approaches might be next for data science. It also briefly examines the question of the US vs China in the AI “arms race”.

‘Who’s Leading AI’ Isn’t the Intelligent Question

Our final article explores the much talked about question of whether the US or China is winning and why it’s not the right question to ask.

If you found any of these articles interesting then do come and join the discussion on our Slack group, where you will also find details of meetups. https://datasciencehk.slack.com/

April 15 Unhackathon #7

We are organizing another Un-Hackathon on April 15th! You can sign up here! We have organized a number of talks and a day of collaborative, hands-on problem solving.

Details:

  • 9.30am – Arrive, registration
  • 10am – Welcome
  • 10.15am – Talks begin
  • 11.30am – Pitch session, recruitment
  • 12pm – Work on projects
  • 5.30 pm – Present results of work session

Location:

11F, 40-44 Bonham Strand, Sheung Wan, Hong Kong

Requirements:

  • Laptop and charger for those joining the coding.
  • Prepared data and project pitches for those submitting projects.
  • If presenting, send us your presentation slides ahead of time so we can prepare them.
  • 50 HKD in cash for admin and organisation.

Recommendations for project submissions:

  • Prepare data in advance as much as you can; spending the day cleaning or retrieving data won’t draw crowds of data scientists! Contact the organisers if you need a data repository to share data with all your team members.
  • If the project is already underway, prepare an introduction to it so that people can join (if you’re presenting slides, send them to us before you arrive). Make sure the task you propose is feasible in the time available, and describe the skills you expect your team to have: R or Python? AWS? Spark?

For final presentations:

  • Start writing the final presentation right from the start and add elements little by little throughout the day. Recall the context of the project and structure the presentation so that non-specialists in the audience can follow it.
  • If you wish, your work will be published on the website datasciencehongkong.com with your bio, name, etc.

Other details:

50 participants max
Food / drink: Only water, coffee and snacks are provided. Attendees can order their own food to the venue, take a break to find a restaurant in Kennedy Town or bring their own lunch.
Price: 50 HKD. We charge a fee to cover costs. We are not a for-profit organisation and aim to keep the costs of our events as low as possible to make them accessible to all.

March 18 Unhackathon #6

Data scientists: we are organising another Un-hackathon in our monthly series. There will be talks and a day of collaborative, hands-on problem solving.

Sign up on our Eventbrite page and stay up to date with upcoming events on our meetup page.

 

Details:

  • 9.30am – Arrive, registration
  • 10am – Welcome
  • 10.15am – Talks begin
  • 11.30am – Pitch session, recruitment
  • 12pm – Work on projects
  • 5.30 pm – Present results of work session

Location:

16F, 40-44 Bonham Strand, Sheung Wan, Hong Kong

Requirements:

  • Laptop and charger for those joining the coding.
  • Prepared data and project pitches for those submitting projects.
  • If presenting, send us your presentation slides ahead of time so we can prepare them.
  • 50 HKD in cash for admin and organisation.

Recommendations for project submissions:

  • Prepare data in advance as much as you can; spending the day cleaning or retrieving data won’t draw crowds of data scientists! Contact the organisers if you need a data repository to share data with all your team members.
  • If the project is already underway, prepare an introduction to it so that people can join (if you’re presenting slides, send them to us before you arrive). Make sure the task you propose is feasible in the time available, and describe the skills you expect your team to have: R or Python? AWS? Spark?

For final presentations:

  • Start writing the final presentation right from the start and add elements little by little throughout the day. Recall the context of the project and structure the presentation so that non-specialists in the audience can follow it.
  • If you wish, your work will be published on the website datasciencehongkong.com with your bio, name, etc.

Other details:

50 participants max
Food / drink: Only water, coffee and snacks are provided. Attendees can order their own food to the venue, take a break to find a restaurant in Kennedy Town or bring their own lunch.
Price: 50 HKD. We charge a fee to cover costs. We are not a for-profit organisation and aim to keep the costs of our events as low as possible to make them accessible to all.

Women in data science – WiDS 2018

The Stanford Women in Data Science conference 2018 starts on March 6th at 1am Hong Kong time.

Live Broadcast

We encourage everyone to follow the broadcast here 

You can tweet using the hashtag #WiDS2018Q

Program

The program can be found here; we reproduce it below in Hong Kong time for convenience.

1:00-1:10am: Opening Remarks: Margot Gerritsen, Senior Associate Dean and Director of ICME, Stanford University
1:10-1:30am: Welcome Address: Maria Klawe, President, Harvey Mudd College
1:30-2:05am: Keynote Address: Leda Braga, CEO, Systematica Investments
2:05-2:10am: Regional Event Check-in
2:10-2:50am: Technical Vision Talks:
     2:10-2:30am: Mala Anand, EVP, President, SAP Leonardo Data Analytics
     2:30-2:50am: Lada Adamic, Research Scientist Manager, Facebook
2:50-3:10am: Morning break
3:10-3:15am: WiDS Datathon Winners Announced
3:15-3:55am: Technical Vision Talks:
     3:15-3:35am: Nathalie Henry Riche, Researcher, Microsoft Research
     3:35-3:55am: Daniela Witten, Associate Professor of Statistics and Biostatistics, University of Washington
3:55-4:30am: Keynote Address: Latanya Sweeney, Professor of Government and Technology in Residence, Harvard University
4:30-6:00am: Lunch and Breakouts (NO LIVESTREAM)
6:00-6:35am: Keynote Address: Jia Li, Head of Cloud R&D, Cloud AI, Google
6:35-7:15am: Technical Vision Talks:
     6:35-6:55am: Bhavani Thuraisingham, Professor of Computer Science and Executive Director of Cyber Research and Education Institute, University of Texas at Dallas
     6:55-7:15am: Elena Grewal, Head of Data Science, Airbnb
7:15-7:30am: Afternoon break
7:30-7:35am: Regional Event Check-in
7:35-8:15am: Career Panel moderated by Margot Gerritsen:
     Bhavani Thuraisingham, Professor of Computer Science and Executive Director of Cyber Research and Education Institute, University of Texas at Dallas
     Ziya Ma, Vice President of Software and Services Group and Director of Big Data Technologies, Intel Corporation
     Elena Grewal, Head of Data Science, Airbnb
     Jennifer Prendki, Head of Data Science, Atlassian
8:15-8:55am: Technical Vision Talks:
     8:15-8:35am: Risa Wechsler, Associate Professor of Physics, Stanford University
     8:35-8:55am: Dawn Woodard, Senior Data Science Manager of Maps, Uber
8:55-9:00am: Closing Remarks

 

Unhackathon #5: Discovering trends in property data and scams in ICOs

Our fifth un-hackathon kicked off the year with 30 eager data scientists attending. The event, held at Makerhive in Kennedy Town, combined industry talks with project-based collaboration.

As the mercury dipped outside, the fires of creativity burned bright among our attendees. Five project leaders suggested projects to focus the skills of our data scientists on uncovering new insights into the property market in Hong Kong, using a 1.6-million-row record of transactions over the past 20 years. Another project aimed to discover whether public data can spot a scam initial coin offering, or ICO.

Presentations

Pranav Agrawal, a Hong Kong University of Science and Technology student, presented a code tutorial on multi-layer perceptrons in PyTorch. The in-depth, code-centric tutorial took us step by step through the process. The documentation is linked here:

Github

Presentation


Hang Xu presented his method of analysing DNA using the word2vec (word-to-vector) model. He said that adapting word2vec to DNA outperformed the best results from the current approach, which represents DNA sequences as one-hot vectors.

Presentation

Projects

  • Guy’s property data analysis
  • Jenson’s ICO scam detector
  • Kirill’s Ansible machine learning speed booster

Property data analysis

Using a 1.2GB table of 1.6 million property transactions in Hong Kong, from 1997 to today, this group looked for trends and insights in the property market. One of the central questions was how fast property prices were growing relative to wages in the city.

They found some bargains, even in the current market. See their presentation with their findings.

Ansible speed boosting for NumPy and R

A lot of machine learning tools depend on matrix manipulation libraries such as NumPy. In a basic configuration, NumPy runs linear algebra computations, such as matrix multiplication, SVD or eigenvalue decomposition, on the CPU. OpenBLAS speeds up these computations 4-10x via Fortran bindings.

Github

See their presentation here.

Is this ICO a scam?

The group pulled a list of more than 1,600 ICOs from the past two years and evaluated their value, asking whether it was possible to establish if a given ICO was a scam. The second step was to gather the return on investment for each ICO and the country it was reported to have come from.

See their presentation and findings here.

Job explorer

Morris Wong worked on scraping a dataset to build a structured system that helps jobseekers vet a company before joining. Using user-generated data from stealjobs.com, he aims to build an explorer in the style of GitXplore based on four metrics: income, working hours, promotion prospects and happiness.

See you all at our next event in March.

 

Un-hackathon #5 – February 4

We have a new event on February 4 at the Hive.
At this event, attendees will have the chance to pitch their projects or join other people’s. At the beginning of the day, we will host some fantastic industry specialists who will share their experiences of working in the data science field.
Please sign up on our Eventbrite ticket site to secure your place.

Speakers:

  • Pranav Agrawal, HKUST student researching morphological natural language processing.
    This talk will focus on the basics of PyTorch, briefly covering how to set up PyTorch and implement a basic neural network and a convolutional neural network. Basic knowledge of deep learning is assumed.
  • Hang Xu, PhD candidate, will speak about his project: Application of word2vec to represent biological sequences

Schedule of events:

9.30am – Arrive, registration
10am – Welcome
10.15am – Talks begin
11.30am – Pitch session, recruitment
12pm – Work on projects
5.30 pm – Present results of work session

Location:

10/F Cheung Hing Industrial Building
12P Smithfield Road, Kennedy Town
See you on the 4th!

Coindex by Xavier M

Xavier M gave a quick overview of a project called Coindex at the December Un-hackathon, aimed at studying the array of cryptocurrencies from a market point of view. The goal of the study is to devise quantitative systematic strategies that trading bots would execute.

After acknowledging that yes, Bitcoin matched all the criteria that characterize a bubble, Xavier refocused the challenge on building something profitable out of it, whether it’s a bubble or not.

In a talk session that did not require or demand any programming, quantitative analyst Xavier proposed several applications that some research on cryptocurrencies could provide: trading cross-exchange arbitrages, identifying and following trends and investing in low-frequency trading strategies which provide a return similar to an individual trade while mitigating the risk.

Short-term horizon trading

The first category is trading with a short-term horizon, also known as day trading. Xavier showed a simple cross-market arbitrage monitor that tracks, in real time and across six exchanges, opportunities to profit by buying cheap on one exchange and simultaneously selling high on another.
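
To make the idea concrete, here is a minimal sketch of the comparison such a monitor performs. It is not Xavier's app: the exchange names, the quote format (best bid, best ask) and the prices are all hypothetical, and fees, transfer times and order sizes are ignored.

```python
from typing import Dict, Optional, Tuple

Quote = Tuple[float, float]  # (best bid, best ask) on one exchange


def best_arbitrage(quotes: Dict[str, Quote]) -> Optional[Tuple[str, str, float]]:
    """Return (buy_exchange, sell_exchange, spread) for the widest positive spread, or None."""
    best = None
    for buy_name, (_, ask) in quotes.items():
        for sell_name, (bid, _) in quotes.items():
            if buy_name == sell_name:
                continue
            # Buy at the ask on one exchange, sell at the bid on another.
            spread = bid - ask
            if spread > 0 and (best is None or spread > best[2]):
                best = (buy_name, sell_name, spread)
    return best


if __name__ == "__main__":
    # Made-up quotes for three hypothetical exchanges.
    quotes = {"A": (100.0, 101.0), "B": (103.0, 104.0), "C": (99.0, 100.0)}
    print(best_arbitrage(quotes))
```

A real monitor would refresh the quotes continuously and subtract trading fees before flagging an opportunity.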

The app is elementary and aims simply to instruct, but could be extended easily to more complex real-time arbitrages, and also by adding trading functions for identified arbitrages.

Building more complex arbitrages or simply understanding the detailed working of the market, or microstructure, means a bit of data science has to come into play.

To do this, we can exploit the order books that each exchange publicly releases in real time. This kind of data allows study of the market microstructure and enables the design of high-frequency strategies. A full field of research can be explored (see for example Marco Avellaneda and Sasha Stoikov’s High-frequency trading in a limit order book or Rama Cont, Stoikov and Rishi Talreja’s A stochastic model for order book dynamics).

An example strategy would be to examine whether one market lags the others. If it does, the other markets could be used as predictors for it.
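
One simple way to look for such a lag is to correlate one series against shifted copies of another and see which shift fits best. The sketch below does this with a plain Pearson correlation; the method, the function names and the toy series are assumptions for illustration, not the approach presented in the talk.

```python
import math
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(leader: List[float], follower: List[float], max_lag: int) -> Tuple[int, float]:
    """Return the lag (in samples) at which follower correlates best with leader's past."""
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        # Compare leader[t] with follower[t + lag].
        corr = pearson(leader[:-lag], follower[lag:])
        if corr > best[1]:
            best = (lag, corr)
    return best


if __name__ == "__main__":
    leader = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]
    follower = [0.0, 0.0] + leader[:-2]  # toy series that trails by two samples
    print(best_lag(leader, follower, max_lag=3))
```

In practice this would be run on returns rather than raw prices, and a lag found in-sample would need testing on fresh data before being traded on.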

Another example would be to study big orders, and see how to make a profit out of these.

If, for example, crypto markets are completely manipulated, as is often heard, then it could be interesting to identify manipulation and, based on the properties of such events, use it for profit.

Xavier provided a database of order books from six exchanges, retrieved every 30 seconds, which is available to any data scientist wanting to design price prediction models or other strategies based on order book data.

Low-frequency investment strategies

The second category was about designing low-frequency investment strategies, where trading seldom occurs but which carry a lower risk than simply holding Bitcoins. Such risk reduction can classically be achieved through diversification, but as was shown in a study by R. Porsch, currencies have recently tended to correlate with each other, reducing the benefit of diversification.

Nevertheless, other tactics are possible. One is systematic rebalancing with fixed weights for each currency: every month, week or day, the portfolio is rebalanced so that it holds the same value of every currency in USD equivalent. Following this approach, while Bitcoin returned an impressive 15x, the worst-performing portfolio returned 30x at an equivalent level of volatility.
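
The mechanics of equal-weight rebalancing can be sketched as below. This is an illustration only: the currency names and prices are made up, transaction costs are ignored, and the numbers say nothing about the returns quoted above.

```python
from typing import Dict, List


def rebalanced_value(prices: Dict[str, List[float]], start_value: float = 1.0) -> float:
    """Final value of a portfolio reset to equal USD weights at the start of every period."""
    names = list(prices)
    n_periods = len(prices[names[0]])
    value = start_value
    for t in range(1, n_periods):
        # With equal weights, the portfolio grows by the average of the per-currency growth factors.
        growth = sum(prices[n][t] / prices[n][t - 1] for n in names) / len(names)
        value *= growth
    return value


def buy_and_hold_value(prices: Dict[str, List[float]], start_value: float = 1.0) -> float:
    """Final value when the initial equal split is never rebalanced."""
    names = list(prices)
    total_growth = sum(prices[n][-1] / prices[n][0] for n in names) / len(names)
    return start_value * total_growth


if __name__ == "__main__":
    # Toy price paths for two hypothetical currencies.
    toy = {"X": [1.0, 2.0, 1.0], "Y": [1.0, 1.0, 2.0]}
    print(rebalanced_value(toy), buy_and_hold_value(toy))
```

In this toy case the rebalanced portfolio ends higher because it sells part of whichever currency has just risen before that currency falls back; with trending prices the comparison can go the other way.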

These are extremely simple investment ideas, and many more can be designed to reduce risk (volatility) but not the return.

About the author

Xavier Mathieu developed his career as a quantitative team manager with BNP Paribas. He is now the CEO of Modwize Limited and co-organises the Data Science Hong Kong group.

December 10: Un-hackathon #4

Our fourth Un-hackathon took on a new structure. Adding to the programme of hands-on, project style skillsharing and development over a day of focused hacking, organisers arranged talks from industry leaders and accomplished practitioners.

About 40 people came to hear the talks and join the traditional hackathon groups which we go into detail about in separate posts, linked below. For more detailed writing on each project’s challenges and achievements, scroll past the talks section.

The talks

Lavine Hemlani and Bilal Khan from Accelerated HK began the talks with an inspiring speech on the future of artificial intelligence. Khan posed this call to action: “Do we let AI be in the hands of very few people?” We don’t, and the pair told us their strategy to teach AI and grow a community of practitioners so that this emerging power is not just a tool for big business.

Robert Porsch, PhD, spoke on genomics (the mapping and study of the genome) and the problems he faces in dealing with 80-90 gigabyte genome data sets. His work on human genomes involves seeking out the unique DNA patterns of complex illnesses, sometimes hidden in chains of thousands of mutations, in order to identify them and predict the genetic causes of diseases such as Huntington’s or the risk of developing cancer.

Michal Szczecinski of Gogovan, Hong Kong’s first unicorn company, took us through his role in predicting demand and spotting fraud in the business. By constructing visualisations and dashboards, he helps the business work with greater oversight. As he says, his role is “to facilitate a smarter decision”. He shared his six steps to general optimisation (learn, brainstorm, prioritise, develop, execute, analyse) and showed us a methodically produced manual on practising data science.

Ho Wa Wong, an open data activist, has been turning government data that the Hong Kong government hasn’t yet made conveniently usable into structured datasets. The data.gov.hk site has public government data, but it’s limited in both history and range, and the data is often in various formats. Wong aims to add to the pool of available data by coding systems to scrape and clean the data and make the sets available here. He also parsed the LegCo transcript and made it available here.

On a similarly public-focused level, Data Science Hong Kong organiser Wang Xiaozhou has been using a hidden Markov model in an attempt to improve geocoding in Hong Kong. The model is trained to label the parts of an address, such as the street type or the district name. As Wang showed, trained this way the model learns fast, picking up speed after a few steps, even when the format of the address changes.

From the academic sector, Leif Sigerson has been mining the Twitter API to find dialogues from small communities talking about psychological problems. Using R and rtweet he would scrape sets of up to 3200 tweets from identified users to build an “ego map”, which connects dots between a user and their followers, and then aims to map out the connectedness of their followers to each other. The psychology department at his university was excited by the prospect of a large sample size but he said it was still sceptical about the methodologies employed in his approach.
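
The structure of such an “ego map” can be sketched as a small directed graph. Sigerson worked in R with rtweet; the Python below is only an illustration of the idea, with hypothetical user names and a simple density measure chosen here as one possible way to summarise how connected the followers are.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (follower, followed)


def ego_network(ego: str, followers: List[str], follows: Dict[str, Set[str]]) -> Set[Edge]:
    """Build edges from each follower to the ego, plus edges between followers."""
    members = set(followers)
    edges: Set[Edge] = {(f, ego) for f in followers}
    for user in followers:
        for target in follows.get(user, set()):
            # Keep only links that stay inside the ego's follower group.
            if target in members and target != user:
                edges.add((user, target))
    return edges


def follower_density(ego: str, followers: List[str], edges: Set[Edge]) -> float:
    """Share of possible follower-to-follower links that actually exist."""
    n = len(followers)
    if n < 2:
        return 0.0
    between = sum(1 for _, dst in edges if dst != ego)
    return between / (n * (n - 1))


if __name__ == "__main__":
    # Hypothetical accounts: who follows whom among the ego's followers.
    follows = {"a": {"b", "ego"}, "b": {"c"}, "c": set()}
    edges = ego_network("ego", ["a", "b", "c"], follows)
    print(sorted(edges), follower_density("ego", ["a", "b", "c"], edges))
```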

The projects

Jason Chan Jin-an had built a predictor of MMA match outcomes that compares fighters on factors such as their winning record and physical attributes. He said it had more than 70 per cent accuracy and was indeed earning him some money. Jason came to the un-hackathon to enhance his predictor by automatically setting each fighter’s status to active or retired, thereby avoiding comparisons between active and non-active fighters.

See more about his project here.

Xavier Mathieu, a Data Science Hong Kong organiser and former quantitative team manager at BNP Paribas, gave a talk on how to study cryptocurrencies and develop low- and high-frequency trading strategies, as well as how to identify market manipulation and the opportunities it creates.

See more about his project here.

The UFC MMA Predictor Web App by Jason Chan Jin-an

The UFC MMA Predictor is a web app built by Jason Chan Jin-an to predict the winners of upcoming UFC fights. The web app is built entirely in Python and uses a combination of dynamic web scraping, data cleansing, machine learning and web development.

The challenge

The project was showcased at the December 10 Un-hackathon hoping to overcome the challenge of displaying lists of active and inactive fighters. Before the Un-hackathon, the fighter list was displaying all fighters that had ever fought in the UFC, some of whom had retired. This would mean results would return irrelevant items and mislead users.

The achievement

During the Un-hackathon, thanks to feedback from participants, Chan found a frequently updated Wikipedia page that lists the current fighters in the UFC.

Chan then built a Scrapy spider to crawl the page and retrieve the active fighter list, which is used to subset his fighter database to active fighters only. Chan then redeployed the web app.
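
The subsetting step might look something like the sketch below. It is not taken from Chan's repository: the record layout (a `name` key per fighter), the case-insensitive matching and the fighter names are assumptions made for illustration.

```python
from typing import Iterable, List


def subset_active(fighters: List[dict], active_names: Iterable[str]) -> List[dict]:
    """Keep only the fighters whose name appears in the scraped active list."""
    # Normalise names so differences in case or stray spaces do not break the match.
    active = {name.strip().lower() for name in active_names}
    return [f for f in fighters if f["name"].strip().lower() in active]


if __name__ == "__main__":
    # Hypothetical database rows and scraped names.
    database = [
        {"name": "Fighter One", "wins": 10},
        {"name": "Fighter Two", "wins": 7},
    ]
    scraped = ["fighter one "]
    print(subset_active(database, scraped))
```

Name matching of this kind is brittle when spellings differ between sources, so a real pipeline would likely need a more careful reconciliation step.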

Web app process

The spiders are scheduled to run every week. The data is then automatically pushed to Amazon S3, where the website then reads the data. The fighter data are kept current.

For more information

For contacts and more information about the web app and documentation, please visit the following links:

GitHub: https://github.com/jasonchanhku/UFC-MMA-Predictor

Jupyter Documentation: https://github.com/jasonchanhku/UFC-MMA-Predictor/blob/master/UFC%20MMA%20Predictor%20Workflow.ipynb

LinkedIn: https://www.linkedin.com/in/jason-chan-jin-an-45a76a76/