Predicting Customer Churn with Spark

For many companies, churn is a major concern. It is natural that some people stop using the service, but if this proportion becomes too large it can hinder growth, regardless of revenue sources (ad sales, subscriptions or a mix of both). With that in mind, the ability for firms to predict churn by identifying customers …

A Comprehensive List of Handy R Packages

Stuff I have found super useful for work and life. Gang Su, Jan 21. Whether Python or R is superior for Data Science / Machine Learning is an open debate. Despite its quirkiness and not-so-true-but-generally-perceived slowness, R really shines in exploratory data analysis (EDA), in terms of data wrangling, visualizations, dashboards, myriad choices of …

Detecting malaria using deep learning.

Set-up: First, create a folder/directory to store the project. Then, create a directory inside that called malaria, download the dataset into the directory and open it up.

$ cd whatever-you-named-your-directory
$ mkdir malaria
$ cd malaria
$ wget https://ceb.nlm.nih.gov/proj/malaria/cell_images.zip
$ unzip cell_images.zip

We’re going to switch back to our parent directory and make another directory called cnn where we …

QuickBlarks

The next chart, generated from this “R” code,

difficulty %>%
  group_by(block.bin) %>%
  summarize(sum.diff.delta = sum(diff.delta, na.rm = TRUE)) %>%
  ggplot(aes(x = block.bin, y = sum.diff.delta)) +
  geom_line()

shows the accumulated sum of the diff.delta values. You can clearly see the battle waged by the pre-Byzantium difficulty bomb. Up, down, up, down. The fact that the difficulty hovers around a target is exactly what the difficulty …

Quality over quantity: building the perfect data science project

Credit: https://www.housetohouse.com/diamonds-in-the-rough/ In startup lingo, a “vanity metric” is a number that companies keep track of in order to convince the world (and sometimes themselves) that they’re doing better than they actually are. To pick on a prominent example, about eight years ago Twitter announced that 200 million tweets per day were being sent on its app. …

3 Methods for Parallelization in Spark

Source: geralt on Pixabay. Scaling data science tasks for speed. Spark is great for scaling up data science tasks and workloads! As long as you’re using Spark data frames and libraries that operate on these data structures, you can scale to massive data sets that distribute across a cluster. However, there are some scenarios where libraries may …

Artificial Intelligence is just a Tool

There are scenarios in which AI applications deliver better results, but no general superiority can be derived from this. More importantly, IT managers have to check very carefully what they want to use for each project. To answer this question, decision-makers must consider AI in connection with other concepts. AI does not replace, AI supplements …

Building an interactive computer vision demo in a few hours on AWS DeepLens

A couple months ago, I posted an article explaining my job as a technology consultant to my daughter’s preschool class of 3-year-olds. One of the more understandable parts of what I’m doing these days is working on computer vision problems. People (even the toddler crowd) inherently understand the idea of recognizing what is in …

Counting No. of Parameters in Deep Learning Models by Hand

5 simple examples to count parameters in FFNN, RNN and CNN models. Counting the number of trainable parameters of deep learning models is considered too trivial, because your code can already do this for you. But I’d like to keep my notes here for us to refer to once in a while. Here are the models …
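The excerpt above stops before the worked examples, so the following is only a minimal sketch of the standard hand-counting formulas for three common layer types, assuming every layer has a bias term. The function names are hypothetical and are not taken from the article.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Fully connected layer: one weight per input/output pair plus one bias per output."""
    return n_in * n_out + n_out


def simple_rnn_params(n_in: int, n_hidden: int) -> int:
    """Vanilla RNN cell: input-to-hidden and hidden-to-hidden weights plus one bias per hidden unit."""
    return n_hidden * (n_hidden + n_in) + n_hidden


def conv2d_params(kernel_h: int, kernel_w: int, c_in: int, c_out: int) -> int:
    """2D convolution: each output filter has kernel_h * kernel_w * c_in weights plus one bias."""
    return (kernel_h * kernel_w * c_in + 1) * c_out


if __name__ == "__main__":
    print(dense_params(3, 5))
    print(simple_rnn_params(3, 5))
    print(conv2d_params(3, 3, 64, 128))
```

Gated recurrent cells such as an LSTM follow the same pattern, with the vanilla RNN count multiplied by the number of gates.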

Think your Data Different

Case study: Taboola’s content recommender system gathers lots of data, some of which can be represented in a graphical manner. Let’s inspect one type of data as a case study for using node2vec. Taboola recommends articles in a widget shown in publishers’ websites. Each article has named entities: the entities described by the title. For example, …

Seamlessly Integrated Deep Learning Environment with Terraform, Google cloud, Gitlab and Docker

When you are starting with some serious deep learning projects, you usually have the problem that you need a proper GPU. Buying reasonable workstations which are suitable for deep learning workloads can easily become very expensive. Luckily there are some options in the cloud. One that I tried out was using the wonderful Google Compute …

PU Learning

Dealing with a negative class hidden in unlabelled data. PU Learning: finding a needle in a haystack. A challenge that keeps presenting itself at work is one of not having a labelled negative class in the context of needing to train a binary classifier. Typically, the issue is paired with horribly imbalanced data sets and pressed for …

A.I. Demilitarisation Won’t Happen

Artificial Intelligence is already being integrated in next-generation defence systems, and its demilitarisation is highly unlikely. Restricting it from military use is probably not the smartest strategy to pursue anyway. Photo by Rostislav Kralik on Public Domain Pictures. This year’s World Economic Forum annual meeting is about to start. While browsing through this year’s agenda, I …

Sentiment of the Union: Analyzing Presidential State of the Union Addresses with Python

Analyzing Presidential State of the Union Addresses using Sentiment Analysis and Python tools. Photo from 271277 on Pixabay. In Article II, Section 3 of the Constitution, the President of the United States is directed to “give to the Congress information of the State of the Union, and recommend to their consideration such measures as he shall judge necessary …

Key Steps for Building an Effective AI Organization

Recently, I got fascinated by the impact of Artificial Intelligence on any business from any sector (tech, banking, manufacturing, etc.). This led me to explore the subject further while trying to understand what a corporation should do to transform its processes using AI. In this article, I would love to summarize my observations into a …

Visualizing Principal Component Analysis with Matrix Transforms

A guide to understanding eigenvalues, eigenvectors, and principal components. Principal Component Analysis (PCA) is a method of decomposing data into uncorrelated components by identifying eigenvalues and eigenvectors. The following is meant to help visualize what these different values represent and how they’re calculated. First I’ll show how matrices can be used to transform data, then …

3 steps to a clean dataset with Pandas

Data Science isn’t all fancy charts! It’s a set of tools that we use to clean, explore, and model data in order to extract real-world, meaningful information. Getting real-world information first requires real-world data, and that real-world data is dirty. Think of how companies big and small would collect their data. It’s usually done by a non-expert; …

The Poisson Distribution and Poisson Process Explained

Waiting Time: An intriguing part of a Poisson process involves figuring out how long we have to wait until the next event (this is sometimes called the interarrival time). Consider the situation: meteors appear once every 12 minutes on average. If we arrive at a random time, how long can we expect to wait to …
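The meteor example can be sketched numerically. In a Poisson process the waiting time until the next event is exponentially distributed, and because the process is memoryless the expected wait from a random arrival time equals the mean interval (12 minutes here). The snippet below is a minimal sketch under that model. The function names are hypothetical, and the simulation is only a sanity check, not the article's own code.

```python
import math
import random


def prob_wait_longer_than(t_minutes: float, mean_interval_minutes: float = 12.0) -> float:
    """P(T > t) for an exponential waiting time with the given mean interval."""
    if t_minutes < 0:
        raise ValueError("t_minutes must be non-negative")
    return math.exp(-t_minutes / mean_interval_minutes)


def simulate_mean_wait(n_samples: int, mean_interval_minutes: float = 12.0, seed: int = 0) -> float:
    """Average of n_samples exponential waiting times drawn with rate 1 / mean_interval_minutes."""
    rng = random.Random(seed)
    rate = 1.0 / mean_interval_minutes
    return sum(rng.expovariate(rate) for _ in range(n_samples)) / n_samples


if __name__ == "__main__":
    print(prob_wait_longer_than(12.0))   # chance of waiting more than 12 minutes
    print(simulate_mean_wait(100_000))   # should land close to 12
```

With a 12-minute mean, the chance of waiting longer than 12 minutes is exp(-1), roughly 37%.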

The basics of deploying Logstash pipelines to Kubernetes

Now that we’ve walked through the config of our pipeline, we can move on to Kubernetes. What we have to do first of all is create a ConfigMap. A ConfigMap allows us to store key-value pairs of configuration data that is accessible by our Pods. So we could have a ConfigMap that would store a directory …

How to visualize convolutional features in 40 lines of code

Feature visualizations: Below you find feature visualizations for filters in several layers of a VGG-16 network. While looking at them, I would like you to observe how the complexity of the generated patterns increases the deeper we get into the network. Layer 7: Conv2d(64, 128) filters 12, 16, 86, 110 (top left to bottom right, …

Flask: An Easy Access Door to API development

Photo by Chris Ried on Unsplash. The world has gone through a huge transition: from separating pieces of code as functions in procedural languages to the development of libraries; from RPC calls to Web Service specifications in Service Oriented Architecture (SOA) like SOAP and REST. This has paved the way to Web APIs and microservices, …

Deep Learning Approach for Separating Fast and Slow Components

Some Background (A slide deck for this work can be found at https://speakerdeck.com/jchin/decomposing-dynamics-from-different-time-scale-for-time-lapse-image-sequences-with-a-deep-cnn) I left my job as a Scientific Fellow at PacBio after a 9-year venture helping to make single-molecule sequencing useful for the scientific community (see my story about the first couple of years at PacBio). Most of my technical/scientific work had something to …

Combining supervised learning and unsupervised learning to improve word vectors

To achieve state-of-the-art results in NLP tasks, researchers try many ways to let machines understand language and solve downstream tasks such as textual entailment and semantic classification. OpenAI released a new model named Generative Pre-Training (GPT). After reading this article, you will understand: Finetuned Transformer LM Design, Architecture, Experiments, Implementation, Take Away. Finetuned Transformer …

How to deploy your website to a custom domain

This blog documents the steps needed to deploy a website written in Python with the Flask framework to a custom domain using Heroku and NameCheap. Flask is a micro-framework that allows us to use Python in the back-end to interact with our front-end code in HTML/CSS or JavaScript to build web sites. People also use other …

How to do Deep Learning on Graphs with Graph Convolutional Networks

Part 2: Semi-Supervised Learning with Spectral Graph Convolutions. Machine learning on graphs is a difficult task due to the highly complex, but also informative graph structure. This post is the second in a series on how to do deep learning on graphs with Graph Convolutional Networks (GCNs), a powerful type of neural network designed to …

Machine Learning Project: Predicting Boston House Prices With Regression

Introduction: In this project, we will develop and evaluate the performance and the predictive power of a model trained and tested on data collected from houses in Boston’s suburbs. Once we obtain a good fit, we will use this model to predict the monetary value of a house in that location. A …

Lessons Learned from Kaggle’s Airbus Challenge.

Over the last three months, I have participated in the Airbus Ship Detection Kaggle challenge. As evident from the title, it is a computer vision detection (segmentation, to be more precise) competition proposed by Airbus (its satellite data division) that consists of detecting ships in satellite images. Before I started this challenge, …

I wrote a program that speaks like the collective hive-mind of The Straits Times Forum

Results: I very diligently studied thousands of the Straits Times Forum Letters and was able to create a second-order Markov chain capturing the “style” of the forum letters. I then generated my own articles using the above-mentioned second-order Markov chain; you can play with it here: Straits Times Forum Letter Generator. Here are some of my …
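To illustrate what a second-order Markov chain over text looks like, here is a minimal sketch. It assumes whitespace tokenization and uniform sampling among observed followers. The function names are hypothetical, and this is not the author's implementation.

```python
import random
from collections import defaultdict


def build_chain(words):
    """Map each pair of consecutive words to the list of words observed right after that pair."""
    chain = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        chain[(w1, w2)].append(w3)
    return dict(chain)


def generate(chain, start, max_words=30, rng=None):
    """Extend the two-word start state by repeatedly sampling a follower of the last two words."""
    rng = rng or random.Random()
    out = list(start)
    while len(out) < max_words:
        followers = chain.get((out[-2], out[-1]))
        if not followers:
            break  # no observed continuation for this state
        out.append(rng.choice(followers))
    return " ".join(out)


if __name__ == "__main__":
    corpus = "the letter said the letter was late and the letter was long".split()
    chain = build_chain(corpus)
    print(generate(chain, ("the", "letter"), rng=random.Random(0)))
```

"Second-order" means the next word depends on the previous two words rather than one, which is why the dictionary keys are word pairs.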

Statistics is the Grammar of Data Science — Part 1

Data Types: We cannot go more basic than this: data is split into three categories, based on which a Data Scientist chooses how to further analyse and process it. #1. Numerical data represents some quantifiable information that is measurable and is further divided into two subcategories: Discrete data, which is integer based (e.g. number of …

A Common Data Science Mistake: Prediction/Recommendation by Manipulating Model Inputs

“We trained a machine learning model with high performance. However, it did not work and was not useful in practice.” I have heard this sentence several times, and each time I was eager to find out the reason. There could be different reasons that a model failed to work in practice. As these issues are …

Welcome to the Forest. London Borough of Culture 2019 Twitter Analysis

Welcome to the Forest. We’ve got fun and games! Last weekend, from Friday 11th January to Sunday 13th January 2019, Waltham Forest, a Borough of London, threw a huge three-day event to celebrate being chosen as the first ever Mayor’s London Borough of Culture. The event was called Welcome to the Forest and was described as …

AI or marketing hype? (My first lunch and learn at work)

I’m the only data scientist at my company. It allows me to have a huge amount of breadth in my work, which is great, but it leaves me few people to really nerd out with. I mean the type of nerding out that’s specific to data science; there’s definitely a lot of nerding out that …

Roadmap for multi-class sentiment analysis with deep learning

A practical guide to create incrementally better models. Sentiment analysis quickly gets difficult as we increase the number of classes. For this blog, we’ll have a look at what difficulties you might face and how to get around them when you try to solve such a problem. Instead of prioritizing theoretical rigor, I’ll focus on …

Ridesharing my way — Uber

USA: Uber only provides you with the trip start and end coordinates. I calculated the haversine distance between the coordinates. This provided me with a lower-bound estimate for the ride distance. Haversine distance is basically Euclidean distance but on a sphere. It takes into consideration the latitude and longitude to calculate the straight line …
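As a rough illustration of the distance calculation described above, here is a minimal haversine sketch. It assumes coordinates in decimal degrees and a spherical Earth with a mean radius of about 6371 km. The function name and the radius constant are illustrative choices, not taken from the article.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius; the real Earth is not a perfect sphere


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


if __name__ == "__main__":
    # Hypothetical trip start and end coordinates.
    print(haversine_km(40.7128, -74.0060, 40.6413, -73.7781))
```

Because roads rarely follow the great-circle path, this value is a lower bound on the distance actually driven, which matches how the excerpt uses it.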

Rat City: Visualizing New York City’s Rat Problem

Is Your Neighborhood a Rat Hotspot too? Check out the interactive rat sighting map here: https://nbviewer.jupyter.org/github/lksfr/rats_nyc/blob/master/rats_for_nbviewer_only.ipynb Introduction: If you have ever spent a significant amount of time in New York City, you have very likely come across rats. Regardless of whether you are waiting for the subway or strolling through Washington Square Park, your chances of running …

Simply deep learning: an effortless introduction

Conquer artificial neural network basics in less than 15 minutes. This article is part of the Intro to Deep Learning: Neural Networks for Novices, Newbies, and Neophytes Series. Photo by ibjennyjenny on Pixabay. What is an artificial neural network, how does it work, and what does it have to do with deep learning? Let’s start with a …

Startup Funding, Investments, and Acquisitions

Exploratory Data Analysis (EDA): Funding. I am going to jump straight in and figure out whether we can answer our first question. Well, we can break it down a bit since there are a number of parts to this question. Let’s first look at the average amount funded, total funding and the number of …

Gentle Introduction of XGBoost Library

If things don’t go your way in predictive modeling, use XGBoost. The XGBoost algorithm has become the ultimate weapon of many data scientists. It’s a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities of data. In this article, you will discover XGBoost and get a gentle introduction to what it is, where …

From FaceApp to Deepfakes

Thoughts on appropriation and AI. Considering my background in both photography and Gender Studies, perhaps it’s no surprise that I became interested in the works of people like Yasumasa Morimura and Cindy Sherman. Both artists used self-portraiture to explore the performance of identity, often referencing other media. Sherman became known for her series Untitled Film Stills, …

Prediction task with Multivariate TimeSeries and VAR model.

Time Series data can be confusing, but very interesting to explore. The reason this sort of data grabbed my attention is that it can be found in almost every business (sales, deliveries, weather conditions etc.). For instance: how to explore weather effects on NYC using Google BigQuery (link). The main steps in the task: Problem …

Computer Designed Humans — The AI Revolution in the Test Tube

Forget self-driving cars and voice-controlled speakers: the most dramatic effects of artificial intelligence will be seen in a very different area in the coming years. These days there are always reports from the world of science whose cross connections and consequences are not immediately obvious. A current example can be found in the latest edition …

Pricing diamonds using scatterplots and predictive models

My last post railed against the bad visualizations that people often use to plot quantitative data by groups, and pitted pie charts, bar charts and dot plots against each other for two visualization tasks. Dot plots came out on top. I argued that this is because humans are good at the cognitive task of comparing …

Implementing a Corporate AI Strategy

There is a cost to moving too slowly, almost as much as moving too fast. In the wake of this generation’s digital transformation, machine learning and the greater promise of artificial intelligence create wonder in people’s minds and effervescence within organizations. And the attraction to the field is justified: troves of process improvements are announced every day, …

A Crash course on proving the Halting Problem

Explained in an informally rigorous way. A plan for Charles Babbage’s Analytical Engine circa 1840, which would have been a Turing-complete mechanical computer had it ever been built (CC BY 4.0). Suppose Jeff Bezos announced over Twitter: “I will offer $1 Billion to the person who can write a program that can test any and all …