Time Series of Price Anomaly Detection

Anomaly detection finds data points that do not fit well with the rest of the data. Also known as outlier detection, anomaly detection is a data mining process used to determine types of anomalies found in a data set and to determine details about their occurrences. Automatic anomaly detection is critical in … Read more
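As a rough illustration of what "not fitting with the rest of the data" can mean for a price series, here is a minimal sketch that flags points by their modified z-score (distance from the median, scaled by the median absolute deviation). The price list, the function name and the 3.5 threshold are assumptions chosen for illustration, not taken from the article.

```python
from statistics import median


def find_anomalies(prices, threshold=3.5):
    """Return indices of points whose modified z-score exceeds the threshold."""
    center = median(prices)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(p - center) for p in prices)
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [
        i for i, p in enumerate(prices)
        if abs(0.6745 * (p - center) / mad) > threshold
    ]


# Hypothetical daily prices with one obvious spike (assumed data).
prices = [100, 101, 99, 102, 100, 250, 98, 101]
anomalies = find_anomalies(prices)
print(anomalies)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme price would otherwise inflate the very spread it is measured against.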

Get Started with Support Vector Machines (SVM)

A hands-on tutorial with 4 examples on how to implement support vector machines for classification. In a previous post, I introduced the theory of support vector machines (SVMs). Now, I will further explain how SVMs work with four different exercises! The first part will show how to perform classification with … Read more

AI: Why it Actually Makes a Difference

Simple Human Decisions are Hard for Computers: Think of all the decisions you make in a single day. From what you eat in the morning to how you get home from work at night. Many things you do right are like second nature by now, but they’re actually really hard to do. For instance, how … Read more

From prediction to decision making

Why your predictions might be falling short (opinion). “There are a number of gaps between making a prediction and making a decision” (Susan Athey [1]). Correlation does not imply causation: this is one of the most repeated phrases in statistical testing. It’s repeated for a reason, I believe, and that … Read more

Information Flows in You — And Your Friends

Upper limits of predictability using social media information, even if a person has deleted their social media presence. You’ve had enough. Of baby pictures, of political rants by ‘friends’, even of cute cat pictures! Of fearing for your privacy and future career security. You decide to delete your accounts on Facebook, Twitter and Instagram. And you’re … Read more

Introducing Feast

Google’s New Feature Store for Machine Learning Applications. Feature extraction and storage is one of the most important and often overlooked aspects of machine learning solutions. Features play a key role in helping machine learning models process and understand datasets for training and production. If you are building a single machine learning model, feature extraction … Read more

Simple Soybean Price Regression with Fast.ai Random Forests

As a student in the fast.ai Machine Learning for Coders MOOC¹ with an interest in agriculture, the first application of the fast.ai random forest regression library that came to mind was prediction of soybean prices from historical data. Soybeans are a global commodity and their price-per-bushel has varied a great deal over the past decade. … Read more

Level up your Data Visualizations with quick plot

[Image: K-Means plot for Spotify] Data visualization is an essential part of a data scientist’s workflow. It allows us to visually understand our problem, analyze our models, and provide deep, meaningful understanding to communities. As data scientists, we always look for new ways of improving our data science workflow. Why should I use this over … Read more

Geo Experiments (Part 1)

What Is It and How Will It Help You In Marketing? This will be a three-part series discussing the topic of Geo Experimentation and its use in marketing. Part 1: What Is It and How Will It Help You In Marketing? Part 2: Understanding the mathematics behind Geo Experiments. Part 3: Application of Geo Experiments … Read more

EEG Motor Imagery Classification in Node.js with BCI.js

Detecting brainwaves associated with imagined movements. Brain-computer interfaces (BCIs) allow for the control of computers and other devices using only your thoughts. A popular way to achieve this is with motor imagery detected with electroencephalography (EEG). This tutorial will serve as an introduction to the detection and classification of motor imagery. I’ve broken it down … Read more

Mass Shootings and Terrorism

Our obsession with small probabilities and rare events. I started considering this article last month around the anniversary of the death of my father. Even with Christmas, the weeks leading up to and after the holiday are always a little somber. Thoughts of death and mortality intermingle with my children’s innocent excitement for Santa’s arrival and … Read more

Getting Creative with Algorithms

How to stop being mechanical and keep your innovative edge always sharp in data science. In April 1972, The New York Times published an article, “Workers Increasingly Rebel Against Boredom on Assembly Line”. Though the car industry was considered very innovative, the type of work was very mechanical and repetitive. The reason was that … Read more

Graph Databases. What’s the Big Deal?

Continuing the analysis on semantics and data science, it’s time to talk about graph databases and what they have to offer us. Introduction: Should we invest our precious time in learning a new way of ingesting, storing and analyzing data? With a touch of the mathematics of graphs? For me the answer was unclear when I started … Read more

Experiment sample size calculation using power analysis

If you use experiments to evaluate a product feature, and I hope you do, the question of the minimum sample size required to get statistically significant results often comes up. In this article, we explain how we apply mathematical statistics and power analysis to calculate A/B testing sample size. Before launching an experiment, it … Read more
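To make the idea of a power-analysis sample size concrete, here is a minimal sketch using the common normal-approximation formula for comparing two conversion rates, n = (z_{1-α/2} + z_{power})² · (p₁(1-p₁) + p₂(1-p₂)) / (p₂ - p₁)². The function name, the baseline and target rates, and the default α and power are assumptions for illustration; the article may use a different formula or tool.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    # Sum of the Bernoulli variances of the two groups.
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control                 # minimum detectable difference
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: 10% baseline conversion, hoping to detect a lift to 12%.
n = sample_size_per_group(0.10, 0.12)
print(n)
```

Because the effect size is squared in the denominator, halving the detectable lift roughly quadruples the required sample, which is usually the main trade-off to settle before launch.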

Is the Difference in Work Hours the Real Reason for the Gender Wage Gap? [Interactive Infographic]

Every year, the Department of Labor issues a report on the pay gap between women and men. Women earn a median of $30,000¹ per year, while men earn $40,000 per year. In other words, working women earn 75% of what men earn. But this gap doesn’t take into account the fact that on average, men … Read more

Scrape Reddit data using Python and Google BigQuery

Let’s get started with data collection from Reddit. Reddit API: While web scraping is one of the famous (or infamous!) ways of collecting data from websites, a lot of websites offer APIs to access the public data that they host. This is to avoid the unnecessary traffic that scraping bots create, often crashing their websites … Read more

Why you should care about Docker?

The Dockerfile: where it all begins. Docker is a powerful tool, but its power is harnessed through the use of things called Dockerfiles (as mentioned above). A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build, users can create an automated … Read more

Introduction to BigQuery ML

Evaluate the model: Once we have the trained model, we need to assess its predictive performance. This always has to be done on a test set different from the training set so that overfitting can be detected. Overfitting occurs when our model memorizes the patterns of our training data and consequently is very precise on our training set … Read more
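The reason for a separate test set can be seen with a deliberately "memorizing" model. The sketch below is not BigQuery ML; it is a plain-Python stand-in using a 1-nearest-neighbour classifier on synthetic data, and the data-generating rule, noise rate and split sizes are all assumptions made for illustration.

```python
import random


def nearest_neighbor_predict(train, x):
    # "Memorizing" model: copy the label of the closest training point.
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, rows):
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in rows)
    return hits / len(rows)


random.seed(0)
# Assumed synthetic data: label is 1 when x > 0.5, with 20% of labels flipped.
data = []
for _ in range(200):
    x = random.random()
    label = int(x > 0.5)
    if random.random() < 0.2:
        label = 1 - label
    data.append((x, label))

train, test = data[:150], data[150:]

# Each training point is its own nearest neighbour, so this score is perfect.
train_accuracy = accuracy(train, train)
# Held-out points reveal how much of that score was memorized noise.
test_accuracy = accuracy(train, test)
print(train_accuracy, test_accuracy)
```

Scoring only on the training rows would suggest a flawless model; the held-out rows give the estimate that matters for new data.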

Python for Data Science: From Scratch (Part II)

2.2 Pandas: Pandas is an open source library for Python that was created specifically for data manipulation and analysis of huge chunks of data. Pandas offers robust data structures and functions for manipulating data easily. But wait, that’s what lists, dicts and NumPy’s ndarrays could do too, so why Pandas? … Read more

Getting Started with Recommender Systems and TensorRec

System Overview: TensorRec is a Python package for building recommender systems. A TensorRec recommender system consumes three pieces of input data: user features, item features, and interactions. Based on the user/item features, the system will predict which items to recommend. The interactions are used when fitting the model: predictions are compared to the interactions and … Read more

Introduction to Logistic Regression

Introduction: In this blog, we will discuss the basic concepts of logistic regression and what kinds of problems it can help us solve. Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Some examples of classification problems are: email spam or not … Read more
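As a small illustration of assigning observations to one of two classes, here is a minimal sketch of one-feature logistic regression trained with batch gradient descent. The toy data, learning rate and epoch count are assumptions for illustration and are not taken from the article.

```python
from math import exp


def sigmoid(z):
    # Squashes any real number into a probability between 0 and 1.
    return 1.0 / (1.0 + exp(-z))


def fit(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the average log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    # Assign class 1 when the predicted probability is at least 0.5.
    return int(sigmoid(w * x + b) >= 0.5)


# Assumed toy data: small feature values belong to class 0, large ones to class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit(xs, ys)
print(predict(w, b, 0.5), predict(w, b, 4.0))
```

The model outputs a probability; the 0.5 cut-off that turns it into a class label is a convention and can be moved when one kind of mistake is costlier than the other.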

Sudokus and Schedules

[Image: Pan Am’s Reservation Center in the 1950s] Using Constraint Programming and Tree Search. Machine learning is quite the rage these days, so much so that it is easy to lose sight of the fact that there are other algorithms in the “AI” space. As a matter of fact, these algorithms can be so crucial that it can be negligent to … Read more

Predicting Customer Churn with Spark

For many companies, churn is a major concern. It is natural that some people stop using the service, but if this proportion becomes too large it can hinder growth, regardless of revenue sources (ad sales, subscriptions or a mix of both). With that in mind, the ability for firms to predict churn by identifying customers … Read more

A Comprehensive List of Handy R Packages

Stuff I have found super useful for work and life. Gang Su, Jan 21. Whether Python or R is superior for Data Science / Machine Learning is an open debate. Despite its quirkiness and not-so-true-but-generally-perceived slowness, R really shines in exploratory data analysis (EDA), in terms of data wrangling, visualizations, dashboards, myriad choices of … Read more

Detecting malaria using deep learning.

Set-up: First, create a folder/directory to store the project. Then, create a directory inside that called malaria, download the dataset into the directory and open it up.

    $ cd whatever-you-named-your-directory
    $ mkdir malaria
    $ cd malaria
    $ wget https://ceb.nlm.nih.gov/proj/malaria/cell_images.zip
    $ unzip cell_images.zip

We’re going to switch back to our parent directory and make another directory called cnn where we … Read more


The next chart, generated from this R code,

    library(dplyr)
    library(ggplot2)

    difficulty %>%
      group_by(block.bin) %>%
      # na.rm belongs inside sum(), not as an argument to summarize()
      summarize(sum.diff.delta = sum(diff.delta, na.rm = TRUE)) %>%
      ggplot(aes(x = block.bin, y = sum.diff.delta)) +
      geom_line()

shows the accumulated sum of the diff.delta values. You can clearly see the battle waged by the pre-Byzantium difficulty bomb. Up, down, up, down. The fact that the difficulty hovers around a target is exactly what the difficulty … Read more

Quality over quantity: building the perfect data science project

In startup lingo, a “vanity metric” is a number that companies keep track of in order to convince the world (and sometimes themselves) that they’re doing better than they actually are. To pick on a prominent example, about eight years ago Twitter announced that 200 million tweets per day were being sent on its app. … Read more

3 Methods for Parallelization in Spark

Scaling data science tasks for speed. Spark is great for scaling up data science tasks and workloads! As long as you’re using Spark data frames and libraries that operate on these data structures, you can scale to massive data sets that distribute across a cluster. However, there are some scenarios where libraries may … Read more

Artificial Intelligence is just a Tool

There are scenarios in which AI applications deliver better results, but no general superiority can be derived from this. More importantly, IT managers have to check very carefully what they want to use for each project. To answer this question, decision-makers must consider AI in connection with other concepts. AI does not replace, AI supplements … Read more

Building an interactive computer vision demo in a few hours on AWS DeepLens

A couple of months ago, I posted an article about explaining my job as a technology consultant to my daughter’s preschool class of 3-year-olds. One of the more understandable parts of what I’m doing these days is working on computer vision problems. People (even the toddler crowd) inherently understand the idea of recognizing what is in … Read more

Think your Data Different

Case study: Taboola’s content recommender system gathers lots of data, some of which can be represented as a graph. Let’s inspect one type of data as a case study for using node2vec. Taboola recommends articles in a widget shown on publishers’ websites. Each article has named entities, the entities described by the title. For example, … Read more

Seamlessly Integrated Deep Learning Environment with Terraform, Google Cloud, GitLab and Docker

When you are starting with some serious deep learning projects, you usually have the problem that you need a proper GPU. Buying reasonable workstations which are suitable for deep learning workloads can easily become very expensive. Luckily there are some options in the cloud. One that I tried out was using the wonderful Google Compute … Read more

PU Learning

Dealing with a negative class hidden in unlabelled data. PU Learning: finding a needle in a haystack. A challenge that keeps presenting itself at work is one of not having a labelled negative class in the context of needing to train a binary classifier. Typically, the issue is paired with horribly imbalanced data sets and pressed for … Read more

A.I. Demilitarisation Won’t Happen

Artificial Intelligence is already being integrated in next-generation defence systems, and its demilitarisation is highly unlikely. Restricting it from military use is probably not the smartest strategy to pursue anyway. This year’s World Economic Forum annual meeting is about to start. While browsing through this year’s agenda, I … Read more

Automation Best Practices

Automation isn’t always about automated cars and drones that will deliver our purchases to our doorsteps. The goal of automation is to make people’s lives easier and not have them come in to work on Saturdays. There are a lot of tasks in the workplace that can still be automated to avoid having late nights … Read more

Sentiment of the Union: Analyzing Presidential State of the Union Addresses with Python

Analyzing Presidential State of the Union Addresses using Sentiment Analysis and Python tools. In Article II, Section 3 of the Constitution, the President of the United States is directed to “give to the Congress information of the State of the Union, and recommend to their consideration such measures as he shall judge necessary … Read more

Key Steps for Building an Effective AI Organization

Recently, I became fascinated by the impact of Artificial Intelligence on businesses in every sector (tech, banking, manufacturing, etc.). This led me to explore the subject further while trying to understand what a corporation should do to transform its processes using AI. In this article, I would love to summarize my observations into a … Read more

Visualizing Principal Component Analysis with Matrix Transforms

A guide to understanding eigenvalues, eigenvectors, and principal components. Principal Component Analysis (PCA) is a method of decomposing data into uncorrelated components by identifying eigenvalues and eigenvectors. The following is meant to help visualize what these different values represent and how they’re calculated. First I’ll show how matrices can be used to transform data, then … Read more
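To show how eigenvalues and eigenvectors relate to principal components, here is a minimal sketch that computes them in closed form for two-dimensional data from the 2×2 covariance matrix. The function names and sample points are assumptions for illustration; a real analysis would normally rely on a linear algebra library.

```python
from math import sqrt


def covariance_matrix(points):
    """Return (a, b, c) for the sample covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return a, b, c


def principal_components(points):
    """Return eigenvalues (largest first) and matching unit eigenvectors."""
    a, b, c = covariance_matrix(points)
    mid = (a + c) / 2
    spread = sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = (mid + spread, mid - spread)
    if abs(b) < 1e-12:
        # Already uncorrelated: the components are the coordinate axes.
        first = (1.0, 0.0) if a >= c else (0.0, 1.0)
        return eigenvalues, (first, (first[1], first[0]))
    components = []
    for lam in eigenvalues:
        vx, vy = b, lam - a  # solves (A - lam * I) v = 0 when b is non-zero
        norm = sqrt(vx * vx + vy * vy)
        components.append((vx / norm, vy / norm))
    return eigenvalues, tuple(components)


# Assumed sample: points lying exactly on the line y = x.
points = [(1, 1), (2, 2), (3, 3), (4, 4)]
eigenvalues, components = principal_components(points)
print(eigenvalues, components)
```

Each eigenvalue is the variance of the data along its eigenvector, so for points on a single line all of the variance lands on the first component.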

3 steps to a clean dataset with Pandas

Data Science isn’t all fancy charts! It’s a set of tools that we use to clean, explore, and model data in order to extract real-world, meaningful information. Getting real-world information first requires real-world data — that real-world data is dirty. Think of how companies big and small would collect their data. It’s usually done by a non-expert; … Read more

Flask: An Easy Access Door to API development

The world has gone through a huge transition: from separating pieces of code into functions in procedural languages to the development of libraries; from RPC calls to Web Service specifications in Service Oriented Architecture (SOA) like SOAP and REST. This has paved the way for Web APIs and microservices, … Read more