Our 8th Hackathon for Data Science: a full day of fun and working together on YOUR data science projects!
At this event attendees will have the chance to pitch their projects or join other people’s. At the beginning of the day we will also host some fantastic industry specialists, who will share their experiences of working in the data science field.
Laptop / charger for those joining the coding
Prepared data and project pitches for those submitting projects
If presenting, send us your presentation slides ahead of time so we can prepare them
50HKD in cash for the space rental
Recommendations for project submissions:
Send us your presentation slides! Drop a link to one of the organisers on Slack or another channel. We want to minimise time spent switching laptops, so we will run your slides from our PC. Prepare your data in advance as much as you can; spending the day cleaning or retrieving data won’t draw a crowd of data scientists! Contact the organisers if you need a data repository to share data with all your team members.
If the project is already underway, prepare an introduction to it so that people can join. If you’re presenting slides, send them to us before you arrive, make sure the task you propose is feasible during the time of the event, and describe the skills you expect your team to have: R or Python? AWS, Spark? etc.
For final presentations:
Start writing the final presentation right from the start and add elements little by little throughout the day. Articulate why you want to do the project and what the solution is. Make it understandable to everyone.
If you wish, your work will be published on this website with your bio, name, etc.
50 participants max
Food/drink: Only water, coffee and tea are provided. Attendees can order their own food to the venue, take a break to find a restaurant nearby or bring their own lunch.
Price: 50 HKD. We charge a fee to cover venue and food costs. We are a not-for-profit organisation and will aim to keep the costs of our events as low as possible to make it accessible to all.
What time do people rent share bikes in San Jose? Houston and a group of data scientists looked at bike-share data in California and made some curious observations at our April unhackathon.
We also heard from Nick Lam-wai, who is building a database of Hong Kong’s budget, the blueprint of government spending and priorities. Chris Choy, who was working with Nick, also discovered how to take historical PDFs of the budget and read the tables into Nick’s database. Expect big things from this group.
Our second meetup at Accellerate in Sheung Wan started with a discussion of the CatBoost library by Daniil Chepenko, who explained its benefits over other methods such as random forests.
CatBoost is a gradient boosting library for decision trees, developed by the Russian search engine Yandex and building on many years of development in this field.
Willis sought to find out what makes a Kickstarter project work. He came to the hackathon with data from 2009-2017 and a trained model with 60% accuracy, up from 30% at the beginning of his work. Knowing whether a Kickstarter will succeed is a huge investment advantage, so watch the short videos to see how he got on.
Elizabeth Briel and Ben Davis have been seeking new ways to tell the story of global warming’s effects on Arctic sea ice, and came to the hackathon with data they wanted to turn into a song. See the results below.
Ram de Guzman presented this analysis of Overwatch team strategies using scraped data from Winston’s Lab (which gathers it directly from game videos). His insight revealed how the best teams in South Korea arranged their teams and fought.
In the video he describes the process of gathering his data, then shows in impressive visualisations how that data relates to actual game strategy.
Watch his talk at our 6th unhackathon in March here:
Buying a property will — for most people — be the biggest purchase they will ever make. So what’s the smart way for a data scientist to figure out the best flat deal in Hong Kong? Scrape and hack the market of course.
Normal home buyers may burn a lot of shoe leather visiting dozens of real estate agents, spend hours on websites looking at individual entries, maybe even start a spreadsheet.
But using a bit of data, or in this case around 1.65 million transaction records, we start to see through the sales pitches and get a feel for the total market.
Alternatively, there are many websites with dozens of pages as research fodder.
Only 90 pages, no biggie
What’s missing is a tool that allows you to view the actual sale prices throughout Hong Kong with a single user interface. That’s what we developed during Data Science Hong Kong’s fifth unhackathon.
A market exploration tool
Using a dataset of 1.65 million transactions scraped from the site of one of Hong Kong’s main real estate agencies, this tool (pictured below) shows the average price per square foot by location and time.
Reading the data
Each circle represents a building, colour-coded and shaded darker to show higher prices per square foot. Where more sales are registered in the user’s chosen time period the circle grows larger.
With this tool, market understanding becomes a lot easier and more intuitive. In one single interface, all the actual transactions can be summarized and filtered using attributes like price, size, address or sale date — for example, flats between 300 and 400 square feet.
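The kind of filtering described above can be sketched with pandas (the column names and values here are hypothetical placeholders, not the real dataset’s schema):

```python
import pandas as pd

# Hypothetical schema; the real scraped dataset's columns may differ.
df = pd.DataFrame({
    "address": ["10-24 Parkes Street", "1 High Street", "5 Des Voeux Road"],
    "area_sqft": [350, 420, 310],
    "price_hkd": [5_000_000, 8_400_000, 4_300_000],
    "sale_year": [2017, 2016, 2017],
})

# Example filter from the text: flats between 300 and 400 square feet.
small_flats = df[df["area_sqft"].between(300, 400)]

# The derived metric the tool displays: price per square foot.
small_flats = small_flats.assign(psf=small_flats["price_hkd"] / small_flats["area_sqft"])
```

The same pattern extends to filtering by price, address or sale date before aggregating per building.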
At first glance, the colours of the markers would suggest that Central and the west of Hong Kong Island look like the most expensive areas, followed by Tsim Sha Tsui, and the southern Kowloon peninsula.
Looking closer at the northern edge of Kowloon Park, we can set the transaction year to 2017. A summary appears on hovering over a property, and we can see that 3 flat sales occurred at 10-24 Parkes Street (known as Wing Fu Mansion), for an average price of under HK$14,400 a square foot.
The following animation shows how easy it is to navigate through the districts and market history, and get a clear and visual idea of prices in the area.
To build this tool, we first need to collect the data.
This dataset was scraped using Scrapy, a Python web-scraping library. For most transactions it includes the flat, floor area, floor and building address, and most importantly the prices and dates of sales.
After cleaning the data, our team began looking at prices and transaction volumes, including outliers. You can see a presentation of the data here.
Collecting geolocation data
To be able to correctly place a flat on a map, we need its geolocation, consisting of a longitude and latitude pair. The only location data we have is the address, so we can use the Google Maps API to convert the addresses to lat/lon pairs.
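A sketch of that conversion using the Google Maps Geocoding API (the API key is a placeholder; the parsing follows the API’s documented `results[0].geometry.location` response shape, and the sample payload at the end is illustrative):

```python
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def parse_latlng(payload):
    """Extract a (lat, lng) pair from a Geocoding API JSON payload."""
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address, api_key):
    """Resolve an address to (lat, lng). Requires a valid API key."""
    resp = requests.get(GEOCODE_URL, params={"address": address, "key": api_key})
    resp.raise_for_status()
    return parse_latlng(resp.json())


# Offline demonstration with a payload shaped like the API's output.
sample = {"results": [{"geometry": {"location": {"lat": 22.3027, "lng": 114.1706}}}]}
lat, lng = parse_latlng(sample)
```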
But Google limits free API calls to 2,500 addresses, with a 50-cent charge per additional 1,000 requests. Small change, but fortunately it isn’t necessary in Hong Kong, as other organisations also offer this service.
The OFCA government website, used to check whether a building has digital television or fibre-optic broadband, also returns the geolocation of the searched address. We automatically requested all block addresses. In the following screenshots, we verify that the addresses resolve correctly.
After joining transactions and geolocations, we can start building the tool.
Building the visualization tool
The main steps to build this tool are:
Creating a basic html template with text blocks and containers for titles, caption, transaction year filter and popup text
Loading the transaction data and looping over it. For each year and address, we compute the circle parameters.
The radius is a linear function of the square root of the transaction count, so that circle area is proportional to the number of transactions.
The colors (red and green channels) are simple linear functions of the price.
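The circle parameters above can be sketched as follows (the scaling constant and price bounds are illustrative assumptions, not the tool’s actual values):

```python
import math

# Illustrative bounds; the real tool derives these from the dataset.
MIN_PSF, MAX_PSF = 5_000, 30_000   # HKD per square foot
BASE_RADIUS = 4.0                  # pixels per sqrt(transaction)


def circle_radius(n_transactions):
    # Radius linear in sqrt(count), so circle AREA is linear in count.
    return BASE_RADIUS * math.sqrt(n_transactions)


def circle_color(price_psf):
    # Red and green channels are linear in price:
    # cheap flats render green, expensive flats render red.
    t = (price_psf - MIN_PSF) / (MAX_PSF - MIN_PSF)
    t = min(max(t, 0.0), 1.0)  # clamp outliers to the colour range
    red, green = round(255 * t), round(255 * (1 - t))
    return f"rgb({red},{green},0)"
```

Making area (rather than radius) proportional to transaction count avoids visually exaggerating busy buildings, since perceived size tracks area.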
Taking the next steps
Various improvements to this first version can be implemented:
Some data quality issues have been raised (missing flat areas, block address, etc.) and need to be corrected
The decision to buy a flat involves other factors that could be incorporated into this visualization tool: for instance, car traffic, public transport options, school networks (some of which are already included in the data), average pollution levels, and altitude/topography
Adding current flat sale offers in the vicinity to help find the best deals
If you have any questions or want to explore the data further, don’t hesitate to send a message in the Slack group: datasciencehk.slack.com
Thanks go to the Data Science Hong Kong organizers for this event.
The Stanford Women in Data Science conference 2018 starts on March 6th at 1am Hong Kong time
We encourage everyone to follow the broadcast here
You can tweet using the hashtag #WiDS2018Q
The program can be found here; we reproduce it below, converted to the HK time zone, for convenience
1:00-1:10am: Opening Remarks: Margot Gerritsen, Senior Associate Dean and Director of ICME, Stanford University
1:10-1:30am: Welcome Address: Maria Klawe, President, Harvey Mudd College
1:30-2:05am: Keynote Address: Leda Braga, CEO, Systematica Investments
2:05-2:10am: Regional Event Check-in
2:10-2:50am: Technical Vision Talks:
  2:10-2:30am: Mala Anand, EVP, President, SAP Leonardo Data Analytics
  2:30-2:50am: Lada Adamic, Research Scientist Manager, Facebook
2:50-3:10am: Morning break
3:10-3:15am: WiDS Datathon Winners Announced
3:15-3:55am: Technical Vision Talks:
  3:15-3:35am: Nathalie Henry Riche, Researcher, Microsoft Research
  3:35-3:55am: Daniela Witten, Associate Professor of Statistics and Biostatistics, University of Washington
3:55-4:30am: Keynote Address: Latanya Sweeney, Professor of Government and Technology in Residence, Harvard University
4:30-6:00am: Lunch and Breakouts (no livestream)
6:00-6:35am: Keynote Address: Jia Li, Head of Cloud R&D, Cloud AI, Google
6:35-7:15am: Technical Vision Talks:
  6:35-6:55am: Bhavani Thuraisingham, Professor of Computer Science and Executive Director of Cyber Research and Education Institute, University of Texas at Dallas
  6:55-7:15am: Elena Grewal, Head of Data Science, Airbnb
7:15-7:30am: Afternoon break
7:30-7:35am: Regional event check-in
7:35-8:15am: Career Panel, moderated by Margot Gerritsen, with Bhavani Thuraisingham, Professor of Computer Science and Executive Director of Cyber Research and Education Institute, University of Texas at Dallas; Ziya Ma, Vice President of Software and Services Group and Director of Big Data Technologies, Intel Corporation; Elena Grewal, Head of Data Science, Airbnb; and Jennifer Prendki, Head of Data Science, Atlassian
8:15-8:55am: Technical Vision Talks:
  8:15-8:35am: Risa Wechsler, Associate Professor of Physics, Stanford University
  8:35-8:55am: Dawn Woodard, Senior Data Science Manager of Maps, Uber
8:55-9:00am: Closing Remarks
Our fifth un-hackathon kicked off the year with 30 eager data scientists attending. The event at Makerhive in Kennedy Town combined industry talks with project-based collaboration.
As the mercury dipped outside, the fires of creativity burned bright among our attendees. Five project leaders suggested projects to focus the skills of our data scientists on uncovering new insights into the property market in Hong Kong, drawing on a 1.6-million-row record of transactions over the past 20 years. Another project aimed to discover whether public data can spot a scam initial coin offering, or ICO.
Pranav Agrawal, a Hong Kong University of Science and Technology student, presented a code tutorial on multi-layer perceptrons in PyTorch. The in-depth, code-centric tutorial took us step by step through the process. We can share links to the documentation here:
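For readers who missed the session, a minimal multi-layer perceptron in PyTorch looks something like this (our own sketch, not the tutorial’s code; layer sizes and hyperparameters are illustrative):

```python
import torch
from torch import nn

# A small MLP: 4 input features, one hidden layer of 16 units,
# 3 output classes. Sizes here are arbitrary illustrations.
model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 3),
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on a random toy batch.
x = torch.randn(8, 4)            # batch of 8 samples
y = torch.randint(0, 3, (8,))    # integer class labels
optimizer.zero_grad()
logits = model(x)
loss = loss_fn(logits, y)
loss.backward()
optimizer.step()
```

In practice this step runs in a loop over mini-batches drawn from a `DataLoader`.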
Hang Xu presented his method of analysing DNA with the word2vec model. He argued that adapting word2vec to DNA sequences outperforms the current standard approach of one-hot vector encoding.
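A common first step in such approaches is to turn a DNA sequence into overlapping k-mer “words” so that a word2vec implementation can treat them like tokens in a sentence. This sketch is our illustration, not Hang’s code, and k=3 is an assumption:

```python
def kmer_sentence(sequence, k=3):
    """Split a DNA sequence into overlapping k-mer 'words'.

    The resulting lists can be fed to a word2vec implementation
    (e.g. gensim's Word2Vec) as training sentences.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sentence = kmer_sentence("ATGCGT")
```

Each k-mer then gets a dense learned embedding, in contrast to the sparse one-hot vectors of the standard method.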
Using a 1.2gb table of 1.6 million property transactions in Hong Kong, from 1997 to today, this group looked for trends and insights in the property market. Some of the central questions were quantifying the rate that property prices were growing in relation to wage growth in the city.
They found some bargains, even in the current market. See their presentation with their findings.
Ansible speed boosting for NumPy and R
A lot of machine learning tools depend on matrix-manipulation libraries such as NumPy. In a basic configuration, NumPy uses the CPU for linear algebra computations such as matrix multiplication, SVD or eigenvalue decomposition. Linking against OpenBLAS, via its Fortran bindings, speeds these computations up 4-10x.
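To check which BLAS backend a NumPy install is linked against, and get a rough sense of its speed, something like this works (the matrix size is arbitrary):

```python
import time
import numpy as np

# Print which BLAS/LAPACK libraries this NumPy build is linked
# against; look for "openblas" in the output.
np.show_config()

# Rough benchmark: a large matrix multiplication.
n = 1000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start
print(f"{n}x{n} matmul took {elapsed:.3f}s")
```

Comparing this timing before and after installing OpenBLAS shows the speed-up on your own machine.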
The group pulled a list of over 1,600 ICOs from the past two years and evaluated them, asking whether public data could establish that an ICO was a scam. The second step was to gather the return on investment for each ICO, along with the countries they were reported to have come from.
Morris Wong worked on scraping a dataset to build a structured system that helps jobseekers vet a company before joining. Using data from stealjobs.com, he aims to build an explorer in the style of GitXplore with four metrics: income, working hours, promotion prospects and happiness. The data is user-generated.
Data science is starting to become embedded in Hong Kong. We are a community for all data scientists in Hong Kong — from beginners to multi-decade practitioners of BI, artificial intelligence and data warehouse design, and from students to professors — and their fellow travellers, from business, government and academia. We want to create an environment where data scientists can learn from each other and share their stories, and a community that non-data scientists can turn to when they want to understand more about what data science can do for them. We will organise monthly events such as unhackathons and lectures as well as hosting social media platforms.
Join us on our other social platforms to keep up-to-date with our activities and to become part of our community: