Unhackathon #5: Discovering trends in property data and scams in ICOs

Our fifth un-hackathon kicked off the year with 30 eager data scientists attending. The event, held at Makerhive in Kennedy Town, combined industry talks with project-based collaboration.

As the mercury dipped outside, the fires of creativity burned bright among our attendees. Five project leaders suggested projects to focus the skills of our data scientists on uncovering new insights into the property market in Hong Kong, using a 1.6-million-row record of transactions from the past 20 years. Another project aimed to discover whether public data can spot a scam initial coin offering, or ICO.

Presentations


Pranav Agrawal, a student at the Hong Kong University of Science and Technology, presented a code tutorial on multi-layer perceptrons in PyTorch. The in-depth, code-centric tutorial took us step by step through the process. We can share links to the documentation here:

Github

Presentation


Hang Xu presented his method of analysing DNA with the word-to-vector (word2vec) model. He said that adapting word2vec to DNA outperformed the best results from the current approach, which represents sequences as one-hot vectors.
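
The talk's details are not reproduced here, but the difference between the two input representations can be sketched in a few lines of Python. This is an illustrative sketch only, not Hang Xu's code: the choice of overlapping 3-letter k-mers as "words" and the function names are assumptions.

```python
BASES = "ACGT"

def one_hot(sequence):
    """Encode each base as a 4-element indicator vector (A, C, G, T)."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence]

def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers, the 'words' a word2vec model would see."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

seq = "ACGTAC"
print(one_hot(seq)[:2])  # [[1, 0, 0, 0], [0, 1, 0, 0]]
print(kmers(seq))        # ['ACG', 'CGT', 'GTA', 'TAC']
```

The k-mer lists play the role of sentences and could be passed to any word2vec implementation, which learns a dense vector per k-mer. One-hot encoding, by contrast, gives every base a fixed vector that carries no notion of context.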

Presentation

Projects

  • Guy’s property data analysis
  • Jenson’s ICO scam detector
  • Kirill’s Ansible machine learning speed booster

Property data analysis

Using a 1.2 GB table of 1.6 million property transactions in Hong Kong, from 1997 to today, this group looked for trends and insights in the property market. One of the central questions was how fast property prices were growing relative to wage growth in the city.

They found some bargains, even in the current market. See their presentation with their findings.

Ansible speed boosting for NumPy and R

A lot of machine learning tools depend on matrix manipulation libraries such as NumPy. In a basic configuration it uses the CPU for linear algebra computations such as matrix multiplication, SVD or eigenvalue decomposition. OpenBLAS speeds up these computations 4-10x via Fortran bindings.

Github

See their presentation here.

Is this ICO a scam?

The group pulled a list of more than 1,600 ICOs from the past two years and evaluated their value, asking whether public data could establish that a given ICO is a scam. The second step was to gather the return on investment for each ICO and the country it was reported to have come from.

See their presentation and findings here.

Job explorer

Morris Wong worked on scraping a dataset to build a structured system to help jobseekers vet a company before joining. Using stealjobs.com data, he aims to build an explorer in the shape of GitXplore based on four metrics: income, working hours, promotion prospects and happiness. The data is user generated.

See you all at our next event in March.

 

Un-hackathon #5 – February 4

We have a new event on February 4 at the Hive.
At this event attendees will have the chance to pitch their projects, or join other people’s. At the start of the day we will host some fantastic industry specialists, who will share their experiences of working in the data science field.
Please sign up to our Eventbrite ticket site to secure your place.

Speakers:

  • Pranav Agrawal, HKUST student researching morphological natural language processing.
    This talk will focus on the basics of PyTorch: setting it up, then implementing a basic neural network and a convolutional neural network. Basic knowledge of deep learning is assumed.
  • Hang Xu, PhD Candidate, will speak about his project: Application of word2vec to represent biological sequences

Schedule of events:

9.30am – Arrive, registration
10am – Welcome
10.15am – Talks begin
11.30am – Pitch session, recruitment
12pm – Work on projects
5.30pm – Present results of work session

Location:

10/F Cheung Hing Industrial Building
12P Smithfield Road, Kennedy Town
See you on the 4th!

Coindex by Xavier M

Xavier M gave a quick overview of a project called Coindex at the December Un-hackathon, aimed at studying the array of cryptocurrencies from a market point of view. The aim of the study is to devise quantitative, systematic strategies that trading bots would execute.

After acknowledging that yes, Bitcoin met all the criteria that characterise a bubble, Xavier refocused the challenge on building something profitable out of it, whether it’s a bubble or not.

In a talk session that did not require any programming, quantitative analyst Xavier proposed several applications that research on cryptocurrencies could support: trading cross-exchange arbitrages, identifying and following trends, and investing in low-frequency trading strategies that provide a return similar to an individual trade while mitigating the risk.

Short-term horizon trading

The first category is trading with a short-term horizon, also known as day-trading. Xavier showed a simple cross-market arbitrage monitor that tracks, in real time and across six exchanges, the profit that could be made by buying cheap on one exchange and simultaneously selling high on another.

image1

The app is elementary and aims simply to instruct, but could be extended easily to more complex real-time arbitrages, and also by adding trading functions for identified arbitrages.
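
The core check behind such a monitor can be sketched as follows. This is a minimal illustration under stated assumptions, not Xavier's app: the exchange names and quotes are made up, `find_arbitrage` is a hypothetical helper, and fees, order sizes and transfer delays are all ignored.

```python
def find_arbitrage(quotes, min_spread=0.0):
    """Return (buy_exchange, sell_exchange, profit_per_unit) tuples.

    quotes maps an exchange name to its (best_bid, best_ask) for the same pair.
    An opportunity exists when one exchange's bid exceeds another's ask.
    """
    opportunities = []
    for buy_ex, (_, ask) in quotes.items():
        for sell_ex, (bid, _) in quotes.items():
            if buy_ex != sell_ex and bid - ask > min_spread:
                opportunities.append((buy_ex, sell_ex, round(bid - ask, 2)))
    # Largest profit per unit first.
    return sorted(opportunities, key=lambda o: o[2], reverse=True)

# Made-up quotes in USD for illustration only.
quotes = {
    "exchange_a": (9990.0, 10000.0),
    "exchange_b": (10050.0, 10060.0),
    "exchange_c": (10020.0, 10030.0),
}
for opportunity in find_arbitrage(quotes):
    print(opportunity)
```

A real monitor would refresh the quotes continuously and subtract trading and withdrawal fees before reporting a spread as profitable.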

Building more complex arbitrages or simply understanding the detailed working of the market, or microstructure, means a bit of data science has to come into play.

To do this, we can exploit the order books that each exchange publicly releases in real time. This kind of data allows study of the market microstructure and enables the design of high-frequency strategies. A full field of research can be explored (see, for example, Marco Avellaneda and Sasha Stoikov’s High-frequency trading in a limit order book, or Rama Cont, Stoikov and Rishi Talreja’s A stochastic model for order book dynamics).

An example strategy would be to examine whether one market lags behind the others. If this is the case, then the other markets could be used as predictors.
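
One simple way to test for such a lag is to correlate one market's returns with another's shifted by a few steps and see which shift fits best. The sketch below is an assumption about how this could be done, not a method presented in the talk; the return series are synthetic and the function names are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def lagged_correlation(leader, follower, lag):
    """Correlation between leader[t] and follower[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])

def best_lag(leader, follower, max_lag):
    """Lag (in steps) at which the follower tracks the leader most closely."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(leader, follower, k))

# Synthetic returns: the follower repeats the leader two steps later.
leader = [0.01, -0.02, 0.03, 0.00, -0.01, 0.02, -0.03, 0.01]
follower = [0.0, 0.0] + leader[:-2]
print(best_lag(leader, follower, max_lag=3))
```

With the 30-second order book snapshots described below, one step would correspond to 30 seconds.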

Another example would be to study big orders, and see how to make a profit out of these.

image3

If, for example, crypto markets are completely manipulated, as is often heard, then it could be interesting to identify manipulation and, based on the properties of such events, use it for profit.

Xavier provided a database of order books from six exchanges, retrieved every 30 seconds, which is available to any data scientist wanting to design price prediction models or other strategies based on order book data.

Low-frequency investment strategies

The second category was about designing low-frequency investment strategies, in which trading seldom occurs but which carry a lower risk than simply holding Bitcoin. Such risk reduction can classically be achieved through diversification but, as a study by R. Porsch showed, currencies have recently tended to correlate with each other, reducing the benefit of diversification.
Nevertheless, other tactics are possible. One is systematic rebalancing with fixed weights for each currency: every month, week or day, the portfolio is rebalanced so that it holds the same USD-equivalent value of every currency. Under this approach, while Bitcoin did an impressive 15x, the worst-performing portfolio did 30x, at an equivalent level of volatility.
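
The mechanics of equal-weight rebalancing can be sketched as follows. This is a toy illustration, not the code behind the figures quoted above: the prices are invented, `rebalanced_value` is a hypothetical helper, and transaction costs are ignored.

```python
def rebalanced_value(price_paths, start_value=1.0):
    """Value of a portfolio reset to equal USD weights at every step.

    price_paths maps a currency name to a list of USD prices of equal length.
    """
    n_steps = len(next(iter(price_paths.values())))
    value = start_value
    for t in range(1, n_steps):
        # With equal weights, the portfolio return is the mean of the asset returns.
        returns = [p[t] / p[t - 1] - 1 for p in price_paths.values()]
        value *= 1 + sum(returns) / len(returns)
    return value

# Invented prices: both coins end where they started.
prices = {
    "coin_a": [100.0, 200.0, 100.0],
    "coin_b": [100.0, 50.0, 100.0],
}
print(rebalanced_value(prices))  # 1.5625
```

In this toy case a buy-and-hold investor ends exactly where they started, while the rebalanced portfolio gains about 56% because it sells part of each coin after a rise and buys after a fall. Real markets are far less accommodating, so the example only shows the mechanism.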

image2

These are extremely simple investment ideas, and many more can be designed to reduce risk (volatility) but not the return.

About the author

Xavier Mathieu developed his career as a quantitative team manager with BNP Paribas. He is now the CEO of Modwize Limited. He co-organises the Data Science Hong Kong group.

December 10: Un-hackathon #4

Our fourth Un-hackathon took on a new structure. Adding to the programme of hands-on, project-style skill sharing and development over a day of focused hacking, organisers arranged talks from industry leaders and accomplished practitioners.

About 40 people came to hear the talks and join the traditional hackathon groups which we go into detail about in separate posts, linked below. For more detailed writing on each project’s challenges and achievements, scroll past the talks section.

The talks

Lavine Hemlani and Bilal Khan from Accelerated HK began the talks with an inspiring speech on the future of artificial intelligence. Khan posed this call to action: “Do we let AI be in the hands of very few people?” We don’t, and the pair told us their strategy to teach AI and grow a community of practitioners so that this emerging power is not just a tool for big business.

Robert Porsch, PhD, spoke on genomics (the mapping and study of the genome) and the problems he faced in dealing with 80-90 gigabyte genome data sets. His work on human genomes involves seeking out the unique DNA patterns of complex illnesses, sometimes hidden in chains of thousands of mutations, in order to identify them and predict the genetic causes of diseases such as Huntington’s or the risk of developing cancer.

Michal Szczecinski of Gogovan, Hong Kong’s first unicorn company, took us through his role in predicting demand and spotting fraud in the business. By constructing visualisations and dashboards, he gives the business greater oversight. As he says, his role is “to facilitate a smarter decision”. He shared his six steps to general optimisation (learn, brainstorm, prioritise, develop, execute, analyse) and showed us a methodically produced manual on practising data science.

Ho Wa Wong, an open data activist, has been turning government data that the Hong Kong government hasn’t yet made conveniently usable into structured datasets. The data.gov.hk site has public government data, but it is limited in both history and range, and the data is often in various formats. Wong aims to add to the pool of available data by coding systems to scrape and clean the data and make the sets available here. He also parsed the Legco transcript and made it available here.

On a similarly public-focused level, Data Science Hong Kong organiser Wang Xiaozhou has been using a hidden Markov model in an attempt to improve geocoding in Hong Kong. The model is being trained to label the components of an address, such as the street type or the district name. As Wang showed, the model learns fast and picks up speed after only a few steps, even when the format of the address varies.

From the academic sector, Leif Sigerson has been mining the Twitter API to find dialogues from small communities talking about psychological problems. Using R and rtweet he would scrape sets of up to 3200 tweets from identified users to build an “ego map”, which connects dots between a user and their followers, and then aims to map out the connectedness of their followers to each other. The psychology department at his university was excited by the prospect of a large sample size but he said it was still sceptical about the methodologies employed in his approach.
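
The "ego map" idea can be illustrated with a small sketch. Sigerson worked in R with rtweet; the Python below is only an assumed illustration of the structure, with invented account names and hypothetical helper functions, and it does not call the Twitter API.

```python
from itertools import combinations

def ego_network(ego, followers_of):
    """Return (alters, ties) for one user.

    followers_of maps a user name to the list of accounts following them.
    alters are the ego's followers; ties are undirected links between alters.
    """
    alters = set(followers_of.get(ego, []))
    ties = set()
    for a, b in combinations(sorted(alters), 2):
        if a in followers_of.get(b, []) or b in followers_of.get(a, []):
            ties.add((a, b))
    return alters, ties

def density(alters, ties):
    """Share of possible follower-to-follower links that actually exist."""
    possible = len(alters) * (len(alters) - 1) // 2
    return len(ties) / possible if possible else 0.0

# Invented data: ana, ben and cat follow the ego; ben also follows ana.
followers_of = {
    "ego": ["ana", "ben", "cat"],
    "ana": ["ben"],
    "ben": [],
    "cat": [],
}
alters, ties = ego_network("ego", followers_of)
print(ties, density(alters, ties))
```

A density near 1 would indicate a tightly knit community around the user, while a value near 0 would indicate followers who do not know each other.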

The projects

Jason Chan Jin-an had built a predictor of MMA match outcomes that compares fighters on factors such as their winning record and physical attributes. He said it was more than 70 per cent accurate and was indeed earning him some money. Jason came to the un-hackathon to enhance his predictor by automatically setting each fighter’s status to active or retired, thereby avoiding comparisons between active and non-active fighters.

See more about his project here.

Xavier Mathieu, a Data Science Hong Kong organiser and former quantitative team manager at BNP Paribas, gave a talk on how to study cryptocurrencies, develop low- and high-frequency trading strategies, identify manipulation and spot the opportunities it creates.

See more about his project here.

The UFC MMA Predictor Web App by Jason Chan Jin-an

The UFC MMA Predictor is a web app built by Jason Chan Jin-an to predict the winners of upcoming UFC fights. The web app is built entirely in Python and uses a combination of dynamic web scraping, data cleansing, machine learning and web development.

image001

The challenge

The project was showcased at the December 10 Un-hackathon in the hope of overcoming one challenge: separating active from inactive fighters. Before the Un-hackathon, the fighter list displayed every fighter who had ever fought in the UFC, some of whom had retired. As a result, the app returned irrelevant results that could mislead users.

The achievement

During the Un-hackathon, thanks to feedback from participants, Chan found a Wikipedia page that has the list of current fighters in the UFC which is frequently updated.

Chan then built a Scrapy spider to crawl the page and retrieve the active fighter list, which is used to subset his fighter database to active fighters only. He then redeployed the web app.
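
The subsetting step might look something like the sketch below. It is not taken from the project's repository: the record layout, the `keep_active` helper and the names are hypothetical, and the scraped list is hard-coded here rather than produced by a spider.

```python
def keep_active(fighters, active_names):
    """Return only the fighter records whose name is on the active list."""
    # Normalise case and whitespace so small scraping differences still match.
    active = {name.strip().lower() for name in active_names}
    return [f for f in fighters if f["name"].strip().lower() in active]

# Hypothetical records standing in for the fighter database.
fighters = [
    {"name": "Fighter A", "wins": 20},
    {"name": "Fighter B", "wins": 15},
    {"name": "Fighter C", "wins": 9},
]
# Hypothetical output of the spider that crawls the list of current fighters.
scraped_active = ["fighter a ", "Fighter C"]

print([f["name"] for f in keep_active(fighters, scraped_active)])
# ['Fighter A', 'Fighter C']
```

In practice, names on the two sources may differ by more than case or spacing (nicknames, accents, transliteration), so a real pipeline would probably need a more forgiving match.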

image003

Web app process

The spiders are scheduled to run every week. The data is then automatically pushed to Amazon S3, where the website reads it, so the fighter data stays current.

For more information

For contacts and more information about the web app and documentation, please visit the following links:

GitHub: https://github.com/jasonchanhku/UFC-MMA-Predictor

Jupyter Documentation: https://github.com/jasonchanhku/UFC-MMA-Predictor/blob/master/UFC%20MMA%20Predictor%20Workflow.ipynb

LinkedIn: https://www.linkedin.com/in/jason-chan-jin-an-45a76a76/

Book Review: Computer Age Statistical Inference

jacket_wave_cropped

It took me a while, but I finally have some time to write my review of Efron and Hastie’s new book Computer Age Statistical Inference: Algorithms, Evidence and Data Science.

Bradley Efron is probably best known for his bootstrap resampling technique. With his new book Computer Age Statistical Inference he provides a rather short overview of statistics as a whole. The book covers topics from the early beginnings of statistics to the new era of machine learning. As one can imagine, covering such a huge amount of content is not an easy task, and the two authors did their best to focus on a number of interesting aspects.

The book is separated into three major parts: classic statistical inference, early computer-age methods, and twenty-first century topics. I will review each part individually. Although the book covers a great number of topics, it is definitely not meant for beginners. The authors assume a fair knowledge of algebra, probability theory and statistics. Nevertheless, I found it a great way to not only refresh my knowledge, but also delve deeper into various aspects of classical and modern statistics.

Classic Statistical Inference

Overall I think this is the strongest part of the book. The authors did not go into extensive detail but covered interesting aspects of frequentist and Bayesian inference. In addition, Efron and Hastie put emphasis on Fisherian inference and maximum likelihood estimation, and demonstrated parallels between these different approaches as well as their historical connections. This really helped me to classify and interconnect all of these different methods. However, I found it a bit surprising how little space is dedicated to frequentist and Bayesian inference compared with Fisherian inference. On the one hand, I really appreciated reading more about Fisher’s ideas and methods, since they are insufficiently covered in most textbooks. On the other hand, I had hoped for some new insight into Bayesian statistics.

Overall, I really enjoyed this part of the book. It helped me to get a deeper understanding of classical statistical methods.

Early Computer-Age Methods

This part of the book covers quite a variety of topics, from empirical Bayes through generalized linear models (GLMs) to cross-validation and the bootstrap. The bootstrap in particular is covered extensively and pops up in a number of chapters. While this is not particularly surprising given the background of the authors, it does feel a bit too much. Furthermore, I find that GLMs are covered insufficiently (only 20 pages), considering the importance of linear models in all areas of statistics. However, given the extensive scope of this part of the book, the authors do a fairly good job of discussing each topic in detail while not being too general.

I especially liked the notes at the end of each chapter, which provided additional historic and mathematical annotations. I often enjoyed these notes more than the actual chapter.

Twenty-first century topics

This is probably the weakest part of the book. While topics such as local false-discovery rate (FDR), sparse modeling and lasso are covered clearly and in detail, topics such as neural networks and random forests feel sparse and are in my view insufficiently discussed. The discussion of neural networks feels especially rudimentary. Again, this is not particularly surprising given that neither author is an expert in machine learning. However, the book is good enough without venturing into machine learning topics. The additional space could have been used for more extensive discussions of FDR or GLMs.

Hence if you are interested in learning more about machine learning this book might not be ideal for you. However, that does not mean that individual chapters of this book are bad. Indeed, topics such as support vector machines (SVM) and lasso are very well discussed. Nevertheless, although I enjoyed refreshing my knowledge about these methods I did not feel that I gained a deeper understanding compared to the previous parts of the book.

Conclusion

Overall I really enjoyed reading the book. It gave me a great view of current and past statistical applications. It was especially rewarding to discover and understand connections between various methods and ideas. Furthermore, the book is filled with nice examples (the data and code for each example are also available on the authors’ website).

If you want to refresh or update your knowledge about general statistics, Efron and Hastie’s Computer Age Statistical Inference is an excellent choice. You can download the free PDF from the authors’ website.

Unhackathon #4: December 10th

Our next event is coming up on December 10th.
This time, on top of the usual “coding day” where people propose their projects and form teams to work on them, we have added two features:
– a beginner’s corner, for those starting off with Python, R or data science itself.
– a talks corner, where you can spend 30 minutes sharing some thoughts or an experience, or introducing your project in depth. Three talks are already planned for December 10th. If you feel like giving one, just let us know!
All details, including the location and the list of talks, are on the Eventbrite ticket.
See you on the 10th!

November Unhackathon

Our third event!

Once again, a small crowd of data scientists was courageous enough to resist the impulse to just chill out in Hong Kong’s wonderful Sunday weather, and instead came to hone their skills on two topics:

  • An exploration of HKEX data and its links to HK financial markets
  • A study of the very hyped cryptocurrencies

Crypto-currencies correlation

This topic is a follow-up to the previous “Coindex” subject.
Studying correlation should give an idea of how much diversification matters in a portfolio or index of crypto-currencies; in other words, how well an index conveys the true performance of the currencies in the crypto world.

Here the focus was a classical-flavored study of correlation among the currencies available on the Poloniex exchange on September 16th, 2017.
First of all, a joyplot showed the shapes of the return distributions for many currencies:
ridge_plot.jpeg
Some currencies, such as OMG (OmiseGo) and CVC (Civic), are too new: their short history makes their returns far from normally distributed, so they were treated as outliers and removed from the scope.

Then we computed the correlations proper:

heatmap.png

This gives a global average correlation of 36% (the average of all pairwise correlations), hinting that diversification could be an important driver of portfolio efficiency.
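
The "average of all pairwise correlations" measure can be sketched as follows. This is an illustrative reimplementation on made-up series, not the notebook published on GitHub; the function names and data are assumptions.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def average_pairwise_correlation(returns):
    """Mean of the correlations over every distinct pair of series."""
    pairs = combinations(returns.values(), 2)
    return mean([pearson(x, y) for x, y in pairs])

# Made-up series: coin_a and coin_b move together, coin_c moves against them.
returns = {
    "coin_a": [1.0, 2.0, 3.0, 4.0],
    "coin_b": [2.0, 4.0, 6.0, 8.0],
    "coin_c": [4.0, 3.0, 2.0, 1.0],
}
print(average_pairwise_correlation(returns))  # about -0.333
```

Tracking the measure over time, as described next, would amount to recomputing it on a rolling window of returns.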

If we graph this measure over time, we see that the correlation tends to increase, suggesting some re-correlation of crypto markets.

histocorrel.png
The next step might be to understand why this re-correlation happens.

The complete analysis, including the data used, can be found on GitHub.

 

September un-Hackathon

original

Our second event!

Following the success of our first event, we again met up at the MakerHive in Kennedy Town for our un-hackathon. This is our term for a hackathon where the agenda would be set by participants and people would have fun coding together, instead of being a competition. It’s a way to improve your skills and share projects you are passionate about with the community.

Some projects from our previous event were pitched again while a number of new projects were also started. After teams were formed, the coding quickly got under way. Attendees gathered for the presentation as the teams showed off their results.

Web scraping

An initiative to scrape public data with Python and R. Scrapy was used to pull HKEX data.

Visualisation of the block chain

On 12th May, computers worldwide were hit by the WannaCry ransomware attack. The attackers demanded that ransom payments be made to a number of bitcoin wallets. Blockchain data about these wallets from the period of the attack was sourced and visualised using D3.

Horse racing prediction

“Anomalies” in the betting market for horse racing mean that the outcome of a horse race could be predicted. RapidMiner and Python were used to scrape the data and create a predictive model.

horse racing team

The team were well organised and even produced a presentation of their results!

Traffic analysis

This team scraped data on traffic incidents using Scrapy (Python) and then visualised it using R.

clean

corr

 

Crypto-currencies investment strategies

This project is a follow-up to the previous unhackathon, at the end of which we remained puzzled by some unexplainable moves in certain currencies.
This time we had a better grasp of it, and we went on to analyse the correlations and properties of simple indices made of a basket of currencies.

The global correlation among the first 20 currencies has amounted to 36% since 2017.

2017_10_13_13_40_32_Coindex_Google_Slides

This is low enough to hope for some diversification effect to take place.

Building an index in which each currency has the same weight does provide real outperformance if we take BTCUSD as the benchmark.
Moreover, scaling the index down so that its volatility, or risk, matches that of Bitcoin against USD still produces a significant gain of 15% over BTC.
2017_10_13_13_41_10_Coindex_Google_Slides

On top of this, the skew, while negative for Bitcoin, becomes positive for the index: this means that the frequent small losses incurred by the index are compensated by less frequent but much bigger gains!
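
The two statistics used here, volatility matching and skew, can be sketched in a few lines. The returns below are invented and the helper names are hypothetical; this is only meant to show what is being computed, not to reproduce the project's figures.

```python
from statistics import mean, pstdev

def skewness(returns):
    """Population skewness: the mean cubed deviation in units of standard deviation."""
    mu, sigma = mean(returns), pstdev(returns)
    return mean([((r - mu) / sigma) ** 3 for r in returns])

def scale_to_volatility(returns, target_vol):
    """Rescale returns so that their standard deviation equals target_vol."""
    factor = target_vol / pstdev(returns)
    return [r * factor for r in returns]

# Invented daily returns: frequent small losses and one large gain.
index_returns = [-0.01, -0.01, -0.01, -0.01, 0.08]
benchmark_vol = 0.02  # assumed volatility of the benchmark

scaled = scale_to_volatility(index_returns, benchmark_vol)
print(pstdev(scaled))            # matches benchmark_vol up to rounding
print(skewness(index_returns))   # positive: rare big gains dominate
```

Scaling by a positive factor changes the volatility but leaves the skew unchanged, which is why the two properties can be discussed separately.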

This encourages us to build other indices and strategies, and the project could lead to promising applications:

  • Trading strategies, either short or medium term, dynamic or static, including machine learning algorithms for the discovery of alpha in this market
  • The development of algorithmic trading tools following these strategies
  • Online analytics on single currencies or portfolios of them
  • Potentially some advisory for portfolio construction

 

Our first event: Unhackathon at the Hive

hackathonDSHK

What is an Unhackathon anyway?

Data Science Hong Kong was set up as a way for people interested in data science to network and share ideas. We have an active public Slack group where people regularly share articles and discuss all things tech and data science. The group has organised a number of informal meetups before, but we wanted to start a regular event based around coding and presenting, and not just on talking and networking.

There are many IT, tech and data science events in Hong Kong but they are infrequent and often serve primarily as a marketing or recruitment tool. Not satisfied with the state of tech events in Hong Kong, we set out to create an event that was started from the bottom up and would focus on who knew the most and not who spoke the loudest, which is inviting to beginners but not to those uninterested in technical details.

We have therefore started a regular unhackathon. This is our term for a hackathon where the agenda would be set by participants and people would have fun coding together, instead of being a competition. It’s a way to improve your skills and share projects you are passionate about with the community.

Our first event gets under way

Our first gathering was made possible by The Hive. They were very keen on supporting the data science community in Hong Kong and let us use the MakerHive in Kennedy Town which was a fantastic venue for our first event.

The event started with the floor being opened to pitches. After signing up for a slot by putting up a post-it, pitchers were given 5 minutes to convince others to work on their project.


There were many great ideas and teams were formed around those that attracted enough interest. Discussions were soon under way on what each team wanted to achieve by the end of the day.

 

Of course, being a hackathon, there was coding, coding and more coding!

 

When it was time for lunch, teams headed out to the centre of Kennedy Town to find a restaurant. Any loss of coding output was more than made up for by the opportunity that people got to better know their teammates. Real data scientists don’t skip lunch!

Presentation time

Four hours and much coding later, the deadline for presentations loomed. All the teams gladly accepted a 20-minute grace period to put the final touches on their work.

 

Some of the projects presented were:

  • Address mapping in Hong Kong
  • Twitter topic analysis
  • Crypto-currency analysis
    2017_08_25_16_11_50_Coindex_Google_Slides.jpg
    This team aimed to build an index of cryptocurrencies similar to the usual financial market indices, to be used as a benchmark or refined to explore portfolio strategies.

 

  • Facial Expression Recognition using Keras
    The team of three took an MNIST convolutional neural network model and retrained it on facial expression data from Kaggle, reaching 55% accuracy over seven categories.

 
Everyone had made great progress on their projects and a common theme across presentations was that so much more could have been accomplished with just a bit more time. It’s good then that we have already started planning for our next event in September!

Just because the event is over does not mean the coding stops! If you enjoyed the project you worked on or more importantly enjoyed the people you worked with then do continue collaborating and share with us what you did at our next event!

If this event seems interesting then please contact us by email or social media, or join our Slack group. We’ll keep you updated there about any future events.

Data Science Hong Kong