Multiple Rstudio Server instances using a single R/LaTeX install with KVM

Introduction: RStudio Server Open Source Edition (OSE) is offered with some key limitations compared to the Pro Edition. A few of these limitations are easy to circumvent using basic Linux sysadmin skills (such as encrypting traffic by using a reverse proxy), but most of RStudio Server OSE’s limitations are not so easy to work with, …

Flask: An Easy Access Door to API development

Photo by Chris Ried on Unsplash. The world has gone through a huge transition: from separating pieces of code into functions in procedural languages to the development of libraries; from RPC calls to Web Service specifications in Service-Oriented Architecture (SOA) such as SOAP and REST. This has paved the way to Web APIs and microservices, …

Deep Learning Approach for Separating Fast and Slow Components

Some Background: (A slide deck for this work can be found at https://speakerdeck.com/jchin/decomposing-dynamics-from-different-time-scale-for-time-lapse-image-sequences-with-a-deep-cnn) I left my job as a Scientific Fellow at PacBio after a 9-year venture helping to make single-molecule sequencing useful for the scientific community (see my story about the first couple of years at PacBio there). Most of my technical/scientific work had something to …

Combining supervised learning and unsupervised learning to improve word vectors

To achieve state-of-the-art results in NLP tasks, researchers try many ways to let machines understand language and solve downstream tasks such as textual entailment and semantic classification. OpenAI released a new model named Generative Pre-Training (GPT). After reading this article, you will understand: Finetuned Transformer LM Design, Architecture, Experiments, Implementation, Take Away. Finetuned Transformer …

How to deploy your website to a custom domain

This blog documents the steps needed to deploy a website written in Python with the Flask framework to a custom domain using Heroku and NameCheap. Flask is a micro-framework that allows us to use Python in the back-end to interact with our front-end code in HTML/CSS or JavaScript to build web sites. People also use other …

How to do Deep Learning on Graphs with Graph Convolutional Networks

Part 2: Semi-Supervised Learning with Spectral Graph Convolutions. Machine learning on graphs is a difficult task due to the highly complex, but also informative graph structure. This post is the second in a series on how to do deep learning on graphs with Graph Convolutional Networks (GCNs), a powerful type of neural network designed to …

Machine Learning Project: Predicting Boston House Prices With Regression

Introduction: In this project, we will develop and evaluate the performance and predictive power of a model trained and tested on data collected from houses in Boston’s suburbs. Once we obtain a good fit, we will use this model to predict the monetary value of a house in that area. A …

My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 6,7,8

This is the final set of presentations in my series ‘Elements of Neural Networks and Deep Learning’. This set follows the earlier 2 sets of presentations, namely: 1. My presentations on ‘Elements of Neural Networks & Deep Learning’ -Part1,2,3; 2. My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 4,5. In this final set of …

Lessons Learned from Kaggle’s Airbus Challenge.

Over the last three months, I have participated in the Airbus Ship Detection Kaggle challenge. As evident from the title, it is a computer vision detection (segmentation, to be more precise) competition proposed by Airbus (its satellite data division) that consists of detecting ships in satellite images. Before I started this challenge, …

I wrote a program that speaks like the collective hive-mind of The Straits Times Forum

Results: I very diligently studied thousands of the Straits Times Forum Letters and was able to create a second-order Markov chain capturing the “style” of the forum letters. I then generated my own articles using the above-mentioned second-order Markov chain; you can play with it here: Straits Times Forum Letter Generator. Here are some of my …
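For readers unfamiliar with the technique named in this excerpt, here is a minimal sketch of a second-order (word-level) Markov chain text generator. It is not the author's code: the function names `build_chain` and `generate`, the whitespace tokenisation, and the toy corpus are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each pair of consecutive words to the words seen right after it."""
    words = text.split()  # naive whitespace tokenisation (an assumption)
    chain = defaultdict(list)
    for first, second, third in zip(words, words[1:], words[2:]):
        chain[(first, second)].append(third)
    return chain


def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting pair, emitting up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a fixed seed is reproducible
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:  # dead end: this pair only occurred at the end of the corpus
            break
        next_word = rng.choice(followers)
        output.append(next_word)
        state = (state[1], next_word)  # slide the two-word window forward
    return " ".join(output)


# Hypothetical toy corpus standing in for the forum letters.
corpus = "the cat sat on the mat and the cat ran on the road"
chain = build_chain(corpus)
print(generate(chain, length=8, seed=0))
```

"Second-order" here means the next word depends on the previous two words rather than one, which is why the dictionary keys are word pairs. Repeated followers are kept in the lists, so more frequent continuations are sampled more often.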

Statistics is the Grammar of Data Science — Part 1

Data Types: We cannot go more basic than this. Data is split into three categories, based on which a Data Scientist chooses how to further analyse and process it. #1. Numerical data represents quantifiable, measurable information and is further divided into two subcategories: Discrete data, which is integer-based (e.g. number of …

A Common Data Science Mistake: Prediction/Recommendation by Manipulating Model Inputs

“We trained a machine learning model with high performance. However, it did not work and was not useful in practice.” I have heard this sentence several times, and each time I was eager to find out the reason. There could be different reasons that a model failed to work in practice. As these issues are …

Welcome to the Forest. London Borough of Culture 2019 Twitter Analysis

Welcome to the Forest. We’ve got fun and games! Last weekend, from Friday 11th January to Sunday 13th January 2019, Waltham Forest, a Borough of London, threw a huge three-day event to celebrate being chosen as the first ever Mayor’s London Borough of Culture. The event was called Welcome to the Forest and was described as …

A Newbie’s Guide to Making A Pull Request (for an R package)

I had the wonderful opportunity to participate in the {tidyverse} Developer Day the day after rstudio::conf 2019 officially wrapped up. One of the objectives of the event was to encourage open-source contributor newbies (like me) to gain some experience, namely through submitting pull requests to address issues with {tidyverse} packages. Having only ever worked with my own packages/repos before, I found this was …

GeoPAT2: Entropy calculations for local landscapes

GeoPAT 2 is open-source software written in C and dedicated to pattern-based spatial and temporal analysis. Four main types of analysis available in GeoPAT 2 are (i) search, (ii) change detection, (iii) segmentation, and (iv) clustering. However, additional applications are also possible, including extracting information about spatial patterns. Global landscape diversity (based on Shannon entropy of …

AI or marketing hype? (My first lunch and learn at work)

I’m the only data scientist at my company. It allows me to have a huge amount of breadth in my work, which is great, but it leaves me few people to really nerd out with. I mean the type of nerding out that’s specific to data science; there’s definitely a lot of nerding out that …

Roadmap for multi-class sentiment analysis with deep learning

A practical guide to creating incrementally better models. Sentiment analysis quickly gets difficult as we increase the number of classes. For this blog, we’ll have a look at what difficulties you might face and how to get around them when you try to solve such a problem. Instead of prioritizing theoretical rigor, I’ll focus on …

Ridesharing my way — Uber

USA: Uber only provides you with the trip start and end coordinates. I calculated the haversine distance between the coordinates. This provided me with a lower-bound estimate for the ride distance. Haversine distance is basically Euclidean distance but on a sphere. It takes into consideration the latitude and longitude to calculate the straight line …
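The haversine calculation mentioned in this excerpt can be sketched as follows. This is an illustration rather than the author's code: the function name `haversine_km`, the mean Earth radius of 6371 km, and the sample coordinates are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the true radius varies slightly


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the two points.
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    a = min(1.0, a)  # guard against floating-point overshoot before asin
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Made-up pickup and drop-off coordinates, for illustration only.
pickup = (40.7580, -73.9855)
dropoff = (40.6413, -73.7781)
print(f"Lower-bound trip distance: {haversine_km(*pickup, *dropoff):.1f} km")
```

Because a car must follow roads, the driven distance can only be equal to or longer than this great-circle figure, which is why it serves as a lower bound.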

Rat City: Visualizing New York City’s Rat Problem

Is Your Neighborhood a Rat Hotspot too? Check out the interactive rat sighting map here: https://nbviewer.jupyter.org/github/lksfr/rats_nyc/blob/master/rats_for_nbviewer_only.ipynb Introduction: If you have ever spent a significant amount of time in New York City, you have very likely come across rats. Regardless of whether you are waiting for the subway or strolling through Washington Square Park, your chances of running …

Simply deep learning: an effortless introduction

Conquer artificial neural network basics in less than 15 minutes. This article is part of the Intro to Deep Learning: Neural Networks for Novices, Newbies, and Neophytes Series. Photo by ibjennyjenny on Pixabay. What is an artificial neural network, how does it work, and what does it have to do with deep learning? Let’s start with a …

Startup Funding, Investments, and Acquisitions

Exploratory Data Analysis (EDA): Funding. I am going to jump straight in and figure out whether we can answer our first question. Well, we can break it down a bit since there are a number of parts to this question. Let’s first look at the average amount funded, total funding and the number of …

Gentle Introduction of XGBoost Library

If things don’t go your way in predictive modeling, use XGBoost. The XGBoost algorithm has become the ultimate weapon of many data scientists. It’s a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities of data. In this article, you will discover XGBoost and get a gentle introduction to what it is, where …

From FaceApp to Deepfakes

Thoughts on appropriation and AI. Considering my background in both photography and Gender Studies, perhaps it’s no surprise that I became interested in the works of people like Yasumasa Morimura and Cindy Sherman. Both artists used self-portraiture to explore the performance of identity, often referencing other media. Sherman became known for her series Untitled Film Stills, …

Prediction task with Multivariate TimeSeries and VAR model.

Time series data can be confusing, but very interesting to explore. The reason this sort of data grabbed my attention is that it can be found in almost every business (sales, deliveries, weather conditions etc.). For instance: how to explore weather effects on NYC using Google BigQuery (link). The main steps in the task: Problem …

Computer Designed Humans — The AI Revolution in the Test Tube

Forget self-driving cars and voice-controlled speakers: the most dramatic effects of artificial intelligence will be seen in a very different area in the coming years. These days there are always reports from the world of science whose cross connections and consequences are not immediately obvious. A current example can be found in the latest edition …

Pricing diamonds using scatterplots and predictive models

My last post railed against the bad visualizations that people often use to plot quantitative data by groups, and pitted pie charts, bar charts and dot plots against each other for two visualization tasks. Dot plots came out on top. I argued that this is because humans are good at the cognitive task of comparing …

Implementing a Corporate AI Strategy

There is a cost to moving too slowly, almost as much as moving too fast. In the wake of this generation’s digital transformation, machine learning and the greater promise of artificial intelligence create wonder in people’s minds and effervescence within organizations. And the attraction to the field is justified: troves of process improvements are announced every day, …

Create R Markdown reports and presentations even better with these 3 practical tips

Including R Markdown in the workflow for presenting and publishing analyses that use code in R or other languages is a great way to make presentations, dashboards or reports good-looking, reproducible and version-controllable. In this post, we will look at three simple ways to improve that workflow even further with methods that are …

simmer 4.2.1

The 4.2.1 release of simmer, the Discrete-Event Simulator for R, is on CRAN with quite interesting new features and fixes. As discussed in the mailing list, there is a way to handle the specific case in which an arrival is rejected because a queue is full: library(simmer) reject <- trajectory() %>% log_("kicked off…") patient <- …

A Crash course on proving the Halting Problem

Explained in an informally rigorous way. A plan for Charles Babbage’s Analytical Engine circa 1840, which would have been a Turing-complete mechanical computer had it ever been built. CC BY 4.0. Suppose Jeff Bezos announced over Twitter: “I will offer $1 Billion to the person who can write a program that can test any and all …

The easy way to use Maxmind GeoIP with Redshift

Photo by Westley Ferguson on Unsplash. It always starts with an innocent observation. “We get a lot of traffic from Boston,” your boss remarks. You naturally throw out a guess or two and discuss why that might be. Until your boss drops the bomb: “Can you dig into that?” Darn it. You walked right into that …

Extracting colours from your images with Image Quantization

magick really does the “Magic!” I have been playing around a bit with the package “magick”, and I think I am now hooked… Although I haven’t been able to understand everything written in the vignette just yet. One function I got really excited about is image_quantize. This function will reduce the number of unique colours used in the …

What is data?

Musings on information, memory, analytics, and distributions. Everything our senses perceive is data, though its storage in our cranial wet stuff leaves something to be desired. Writing it down is a bit more reliable, especially when we write it down on a computer. When those notes are well-organized, we call them data… though I’ve seen …

Autoencoders for the compression of stock market data

A Pythonic exploration of diverse neural-network autoencoders to reduce the dimensionality of Bitcoin price time series. Stock market data space is high-dimensional and, as such, algorithms that try to exploit potential patterns or structure in the price formation can suffer from the so-called “curse of dimensionality”. In this short article, we will explore the potential …

Predicting Breast Cancer with Decision Trees

How to implement decision trees with bagging, boosting and random forest to predict breast cancer from routine blood tests. Photo by Hello I’m Nik on Unsplash. In a previous post, I introduced the theory of decision trees and how their performance can be improved using bagging, boosting or random forests. Now, we implement these techniques to predict …

Recommender Systems and Hyper-parameter tuning

Photo by rawpixel on Unsplash. The (often) forgotten child of Machine Learning. Everyone with an internet connection has been subjected to a recommender system (RS). Almost all media services have a particular section where the system recommends things to you, whether a movie on Netflix, a product to buy on Amazon, a playlist …

Data Science and the Paradox of Predictions

Paradox by Nick Youngson. How the act of knowing changes what we know. Many data science projects are a hunt for knowledge. As history has taught us through the years, the mere act of knowing can change what it is we believe to know. Professor Harari explores this topic in Homo Deus with the skill we’ve become …

On the role of technology in Regulatory Modernization

The challenge with regulations: Regulations are instruments of legislative power and have the force of law. They carry out the intent of corresponding Acts which set out requirements that businesses must adhere to. Regulations are necessary to protect the health, safety and security of individual consumers and the environment as well as to support commerce …

Window Aggregate operator in batch mode in SQL Server 2019

This came as a surprise while I was calculating simple statistics on my dataset, in particular min, max and median. The first two are trivial. The last one was the one that caught my attention. While looking for the fastest way of calculating the median (statistic: median) for a given dataset, I stumbled upon an interesting …

Improve your workflow by managing your machine learning experiments using Sacred

Model tuning is my least favorite task as a Data Scientist. I hate it. I think it’s because managing the experiments is something that always gets very messy. While searching for tools to help me with that I saw a lot of people mentioning Sacred, so I decided to give it a try. In this …