Statistics is the Grammar of Data Science — Part 2

Probability Distribution Functions: A probability distribution is a function that describes the likelihood of an event or outcome. We will now delve into the different types of distributions, according to whether the data are continuous or discrete. Probability Density Function (PDF): When we see a graph like the one in the figure below, we think that …

“Data Science” Has Become Too Vague

Let’s Specialize and Break it Up! I would not be opposed to downplaying the term “data science” and breaking it up into specialized disciplines. Do not misunderstand, I think the global “data science” movement was necessary and had a positive impact on the curmudgeonly corporate world. But the campaign has been won and everybody is bought …

Monte Carlo Simulations with Python (Part 1)

Monte Carlo simulations can be used to simulate games at a casino (Pic courtesy of Pawel Biernacki). This is the first of a three-part series on learning to do Monte Carlo simulations with Python. This first tutorial will teach you how to do a basic “crude” Monte Carlo, and it will teach you how to …
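As a rough illustration of what a “crude” Monte Carlo estimate usually means, the sketch below averages a function at uniformly random points to approximate an integral. The function name `crude_monte_carlo`, the integrand, and the sample count are assumptions chosen for this example, not taken from the tutorial itself.

```python
import math
import random


def crude_monte_carlo(f, a, b, n_samples=100_000, seed=0):
    """Estimate the integral of f over [a, b] by uniform random sampling.

    Hypothetical helper for illustration: it averages f at n_samples
    uniform draws and scales the mean by the interval width (b - a).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(n_samples):
        x = rng.uniform(a, b)
        total += f(x)
    return (b - a) * total / n_samples


# Example: the integral of sin(x) over [0, pi] is exactly 2.
estimate = crude_monte_carlo(math.sin, 0.0, math.pi)
print(f"estimate = {estimate:.4f}, exact = 2.0000")
```

The error of such an estimate shrinks roughly with the square root of the number of samples, so the estimate improves slowly as the sample count grows.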

Text to Image

This article will explain the experiments and theory behind an interesting paper that converts natural language text descriptions such as “A small bird has a short, point orange beak and white belly” into 64×64 RGB images. The paper is “Generative Adversarial Text to Image Synthesis” by Reed et al. Article Outline …

The Simple Yet Practical Data Visualization Codes

In the previous article I shared my little toolbox for data cleaning, after realizing that some code applies to most common scenarios of messy data. In other words, there is a pattern (or an approach) that is commonly used in data science for data cleaning, and I compiled these snippets into functions for reusability …

A Great Public Health Conspiracy?

The Facts on Public Water Fluoridation: With any health topic, especially one that has attracted controversy, we must be careful about where we get our data. Even studies in peer-reviewed journals can have biases, intentional or not. Therefore, the best practice for reviewing medical evidence is to look at meta-analyses, reviews that evaluate results from dozens …

Canny Edge Detection Step by Step in Python — Computer Vision

Noise Reduction: Since the mathematics behind the scenes is mainly based on derivatives (cf. Step 2: Gradient calculation), edge detection results are highly sensitive to image noise. One way to get rid of the noise in the image is to apply a Gaussian blur to smooth it. To do so, the image convolution technique is applied …
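The noise-reduction step described above can be sketched in plain Python as follows. This is a minimal illustration, assuming a 5×5 kernel, sigma = 1, and edge replication at the image border; the helper names `gaussian_kernel` and `blur` are hypothetical and the article's own implementation may differ.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized size x size Gaussian kernel (size should be odd)."""
    k = size // 2
    kernel = [
        [math.exp(-(x * x + y * y) / (2 * sigma ** 2)) for x in range(-k, k + 1)]
        for y in range(-k, k + 1)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def blur(image, kernel):
    """Smooth a 2D list of pixel values with the kernel.

    Border pixels are replicated (an assumption for this sketch). Because
    the Gaussian kernel is symmetric, this sliding weighted sum equals a
    true convolution.
    """
    h, w = len(image), len(image[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-k, k + 1):
                for dj in range(-k, k + 1):
                    ii = min(max(i + di, 0), h - 1)  # clamp row index
                    jj = min(max(j + dj, 0), w - 1)  # clamp column index
                    acc += image[ii][jj] * kernel[di + k][dj + k]
            out[i][j] = acc
    return out


# A single bright pixel (noise spike) is spread out and damped by the blur.
spike = [[0.0] * 7 for _ in range(7)]
spike[3][3] = 1.0
blurred = blur(spike, gaussian_kernel())
print(f"centre before: {spike[3][3]:.3f}, after: {blurred[3][3]:.3f}")
```

A larger sigma smooths more aggressively, which suppresses more noise but also softens the edges that the later steps try to detect.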

Weekly Selection — Jan 25, 2019

Attn: Illustrated Attention. By Raimi Karim, 12 min read. For decades, Statistical Machine Translation has been the dominant translation model, until the birth of Neural Machine Translation (NMT). NMT is an emerging approach to machine translation that attempts to build and train a single, large neural network that reads an input text and outputs a translation …

Artificial Intelligence, Music, and the Human Sublime

Hannah Fry, in her book “Hello World”, talks of how computers can be programmed to mimic music, nearly perfectly. A program was written that perfectly mimicked Bach’s musical lexicon, right down to the notes, words, and phrases he used in all his body of work. Even the most astute musician could never get this “data’s eye …

How Frequently Do People Use Different Drugs?

One of the most frustrating things about Tell Your Children, the anti-marijuana tract from former New York Times reporter (and spy novelist!) Alex Berenson, is that its most interesting points are almost entirely unrelated to its thesis. That central idea (roughly, that marijuana causes psychosis and schizophrenia, and psychosis causes violence, therefore marijuana causes violence) is compelling …

How to develop data products and not die trying

Authors: David Flórez Fernández, Data and AI Solution Architect @ Microsoft; Pablo Peris, Digital Architect @ Microsoft. Companies struggle to thrive with Analytics projects: In the present days of data accumulation there is a global craving for the innovative and business use of AI at all levels. Maybe it’s time to stop and reflect on that burning …

Understanding Machine Learning on Point Clouds through PointNet++

Introduction: Data can take on a variety of forms. For processing visual information, images are extremely common. Images store a two-dimensional grid of pixels that often represent our three-dimensional world. Some of the most successful advances in machine learning have come from problems involving images. However, for capturing data in 3D directly, it is less …

A Neural Algorithm of Artistic Style: A Modern Form of Creation

Understanding Convolutional Neural Networks: Seeing as convolutional neural networks are the underlying concept for the entirety of N.A.A.S., it is important to have a clear idea of what they do. If you already know about CNNs, that’s great. Move on to the next section. Conv nets are a type of artificial neural network which lever a …

Data Science vs Decision Science

What’s the difference between a Data Scientist and a Decision Scientist? At Instagram, we have many different job roles that analyze data. A few of the ‘data’ job titles include: Data Scientists, Analysts, Researchers and Growth marketing. But there’s often a lot of confusion between the roles of Data Scientist vs Decision Scientist. We have …

Data science unicorns might be right under your nose

Our society produces data at an astounding rate. By some estimates, as many as 2.5 million terabytes of new information appear on servers around the world every day. That’s as much data as could fit on a billion iPhones, a quantity of zeros and ones so large you need eighteen zeros just to count it. …

Neural Networks Intuitions: 2. Dot product, Gram Matrix and Neural Style Transfer

Problem 2. Generate Style: The problem is to produce an image which contains the style of the style image. Solution: To extract the style of an image (or more specifically, to compute the style loss), we need something called a Gram matrix. Wait, what is a Gram matrix? Before talking about how to compute the style …
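For readers wondering what a Gram matrix is in practice, here is a minimal sketch. It assumes each feature map has already been flattened into a vector, and it uses a hypothetical helper name `gram_matrix`; entry (i, j) is simply the dot product of feature maps i and j.

```python
def gram_matrix(features):
    """Return the Gram matrix of a list of flattened feature maps.

    `features` is a list of equal-length vectors, one per channel.
    Entry [i][j] is the dot product of channel i with channel j, so the
    result is a symmetric (channels x channels) matrix.
    """
    return [
        [sum(a * b for a, b in zip(fi, fj)) for fj in features]
        for fi in features
    ]


# Two toy "feature maps", each flattened to three values.
features = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
gram = gram_matrix(features)
print(gram)  # [[14.0, 32.0], [32.0, 77.0]]
```

Because each entry sums over all spatial positions, the Gram matrix records which features tend to activate together while discarding where they occur, which is why it is commonly used as a summary of style.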

DeepTraffic – DQN Tuning for Traffic Navigation (75.01 MPH Solution)

Crowdsourced Hyperparameter Tuning Competition: In today’s article, we are going to approach a traffic navigation problem with Reinforcement Learning (RL). In order to do so, we will revise our RL skills and participate in the DeepTraffic competition hosted by MIT Deep Learning. Americans spend 8 billion hours stuck in traffic every year. Deep neural networks can help! …

3 Tips to Improving Your Data Science Workflow

3. Optimising Parameters Efficiently: When I first started learning to apply machine learning, I would manually change the parameter inputs one by one and take a note of the results for my final output. Although this helped my understanding of the parameters, it was time-consuming and inefficient. As time has gone on, I have …

Reinforcement Learning with Exploration by Random Network Distillation

Ever since the seminal DQN work by DeepMind in 2013, in which an agent successfully learned to play Atari games at a level higher than that of an average human, Reinforcement Learning (RL) has been making headlines frequently. From Atari games to robotics, and the amazing defeat of world Go champion Lee Sedol by AlphaGo, it …

Introduction to ResNets

‘We need to go Deeper’ meme: classical CNNs do not perform well as the depth of the network grows past a certain threshold. ResNets allow for the training of deeper networks. This article is based on Deep Residual Learning for Image Recognition from He et al. [2] (Microsoft Research): https://arxiv.org/pdf/1512.03385.pdf In 2012, Krizhevsky et al. …

What is AI bias?

The AI bias trouble starts, but doesn’t end, with definition. “Bias” is an overloaded term which means remarkably different things in different contexts. Here are just a few definitions of bias for your perusal. In statistics: Bias is the difference between the expected value of an estimator and its estimand. That’s awfully technical, so allow …

How to beat Google’s AutoML – Hyperparameter Optimisation with Flair

This is a follow-up to our previous post about State of the Art Text Classification. We explain how to do hyperparameter optimisation using Flair to achieve optimal results in text classification outperforming Google’s AutoML Natural Language. What is hyperparameter optimisation and why can’t we simply do it by hand? Hyperparameter optimisation (or tuning) is the process …

Python’s Collections Module — High-performance container data types.

Let us now hop over to the actual objective of this article, which is to get to know Python’s collections module. This is just an overview; for detailed explanations and examples, please refer to the official Python documentation. Collections Module: collections is a built-in Python module that implements specialized container datatypes providing …
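To give a flavour of the specialized containers the module provides, here is a short sketch using four of them from the standard library. The word list and variable names are made up for illustration.

```python
from collections import Counter, defaultdict, deque, namedtuple

words = ["spam", "egg", "spam", "ham", "spam", "egg"]

# Counter: tally hashable items.
counts = Counter(words)
top = counts.most_common(1)  # [('spam', 3)]

# defaultdict: missing keys get a default value (here, an empty list).
by_length = defaultdict(list)
for word in words:
    by_length[len(word)].append(word)

# deque: double-ended queue; maxlen keeps only the most recent items.
recent = deque(maxlen=3)
for word in words:
    recent.append(word)

# namedtuple: lightweight tuple subclass with named fields.
Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)

print(top, dict(by_length), list(recent), p.x + p.y)
```

Each of these replaces a common hand-rolled pattern (manual counting dictionaries, key-existence checks, list slicing for a sliding window, index-based tuples) with a purpose-built type.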

Time Series of Price Anomaly Detection

Photo credit: Pixabay. Anomaly detection detects data points that do not fit well with the rest of the data. Also known as outlier detection, anomaly detection is a data mining process used to determine types of anomalies found in a data set and to determine details about their occurrences. Automatic anomaly detection is critical in …

Tel Aviv artists: build yourself a mapping app

tl;dr: I went from experimenting with mapping libraries to building a reusable mapping app. This is how I did it and how you can re-use it. Intro: As a data scientist, most of my work stays behind the scenes. When training models, the farthest I reach in exposure is deploying a simple Flask web-app as REST …

Get Started with Support Vector Machines (SVM)

A hands-on tutorial with 4 examples on how to implement support vector machines for classification. Photo by Randy Fath on Unsplash. In a previous post, I introduced the theory of support vector machine (SVM). Now, I will further explain how SVMs work with four different exercises! The first part will show how to perform classification with …

AI Thinks Rachel Maddow Is A Man (and this is a problem for all of us)

A data-driven review of AI bias in production systems. In 2011, IBM Watson made headlines when it beat Jeopardy legends Ken Jennings and Brad Rutter in a $1M match. In Final Jeopardy, Jennings admitted defeat by writing “I, for one, welcome our new computer overlords.” That was in 2011, when a good score in a …

From prediction to decision making

Why your predictions might be falling short (opinion). Photo by Mika Baumeister on Unsplash. “There are a number of gaps between making a prediction and making a decision” (Susan Athey [1]). Correlation does not imply causation: This is one of the most repeated phrases in statistical testing. It’s done so for a reason, I believe, and that …

Information Flows in You — And Your Friends

Upper limits of predictability using social media information, even if a person has deleted their social media presence. You’ve had enough. Of baby pictures, of political rants by ‘friends’, even of cute cat pictures! Of fearing for your privacy and future career security. You decide to delete your accounts on Facebook, Twitter and Instagram. And you’re …

Introducing Feast

Google’s New Feature Store for Machine Learning Applications: Feature extraction and storage is one of the most important and often overlooked aspects of machine learning solutions. Features play a key role in helping machine learning models to process and understand datasets for training and production. If you are building a single machine learning model, feature extraction …

Simple Soybean Price Regression with Fast.ai Random Forests

As a student in the fast.ai Machine Learning for Coders MOOC¹ with an interest in agriculture, the first application of the fast.ai random forest regression library that came to mind was prediction of soybean prices from historical data. Soybeans are a global commodity and their price-per-bushel has varied a great deal over the past decade. …

Level up your Data Visualizations with quick plot

K-Means plot for Spotify. Data Visualization is an essential part of a Data Scientist’s workflow. It allows us to visually understand our problem, analyze our models, and provide deep, meaningful understanding to communities. As Data Scientists, we always look for new ways of improving our data science workflow. Why should I use this over …

Playlist Classification on Spotify using KNN and Naive Bayes Classification

https://unsplash.com/@usefulcollective One day, I thought it would be cool if Spotify helped me pick a playlist when I like a song. The idea is to touch the plus button when my phone is locked and have Spotify add the song to one of my playlists rather than my library, so that I don’t go into the app …

EEG Motor Imagery Classification in Node.js with BCI.js

Detecting brainwaves associated with imagined movements. Brain-computer interfaces (BCIs) allow for the control of computers and other devices using only your thoughts. A popular way to achieve this is with motor imagery detected with electroencephalography (EEG). This tutorial will serve as an introduction to the detection and classification of motor imagery. I’ve broken it down …

Mass Shootings and Terrorism

Our obsession with small probabilities and rare events. I started considering this article last month around the anniversary of the death of my father. Even with Christmas, the weeks leading up to and after the holiday are always a little somber. Thoughts of death and mortality intermingle with my children’s innocent excitement for Santa’s arrival and …

Getting Creative with Algorithms

How to stop being mechanical and keep your innovative edge always sharp in data science. In April 1972, The New York Times published an article, “Workers Increasingly Rebel Against Boredom on Assembly Line”. Though the car industry was considered very innovative, the type of work was very mechanical and repetitive. The reason was that …

Graph Databases. What’s the Big Deal?

Continuing the analysis of semantics and data science, it’s time to talk about graph databases and what they have to offer us. Introduction: Should we invest our precious time in learning a new way of ingesting, storing and analyzing data? With a touch of mathematics on graphs? For me the answer was unclear when I started …

Experiment sample size calculation using power analysis

If you use experiments to evaluate a product feature, and I hope you do, the question of the minimum required sample size to get statistically significant results is often brought up. In this article, we explain how we apply mathematical statistics and power analysis to calculate AB testing sample size. Before launching an experiment, it …
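As a rough sketch of what such a calculation can look like, the snippet below applies the textbook normal-approximation formula for comparing two group means with a two-sided test: n per group = 2 · ((z_{1-α/2} + z_{power}) · σ / δ)². The function name, the default α and power, and the example numbers are assumptions for illustration; the article's own method may differ.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample z-test.

    effect: smallest difference in means worth detecting (delta)
    sigma:  assumed common standard deviation of the metric
    alpha:  significance level; power: desired probability of detection
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the power
    n = 2 * ((z_alpha + z_power) * sigma / effect) ** 2
    return math.ceil(n)  # round up to a whole observation


# Detect a difference of half a standard deviation at 5% / 80%.
n = sample_size_per_group(effect=0.5, sigma=1.0)
print(n)  # 63
```

Halving the detectable effect quadruples the required sample size, which is why the minimum effect of interest is usually the most consequential input.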

Is the Difference in Work Hours the Real Reason for the Gender Wage Gap? [Interactive Infographic]

Every year, the Department of Labor issues a report on the pay gap between women and men. Women earn a median of $30,000 per year, while men earn $40,000 per year. In other words, working women earn 75% of what men earn. But this gap doesn’t take into account the fact that on average, men …

Scrape Reddit data using Python and Google BigQuery

Let’s get started with data collection from Reddit. Reddit API: While web scraping is one of the famous (or infamous!) ways of collecting data from websites, a lot of websites offer APIs to access the public data that they host. This is to avoid the unnecessary traffic that scraping bots create, often crashing their websites …

Creating AI for GameBoy Part 1: Coding a Controller

Released in 2003, Fire Emblem, The Blazing Sword is a strategy game so successful that its characters are featured in Super Smash Bros and the 15th installment of the series will be released in early 2019. The game is played by selecting characters (aka units), making decisions on where to move them, and then deciding …

Why you should care about Docker?

The Dockerfile, where it all begins: Docker is a powerful tool, but its power is harnessed through the use of things called Dockerfiles (as mentioned above). A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build, users can create an automated …

Python for Data Science: From Scratch(Part II)

2.2 Pandas: Pandas is an open source library for Python that was created specifically for data manipulation and analysis of huge chunks of data. Pandas offers robust data structures and functions for manipulating data easily. Photo by Debbie Molle on Unsplash. But wait, that’s what lists, dicts and NumPy’s ndarrays could do too, so why Pandas? …

Getting Started with Recommender Systems and TensorRec

System Overview: TensorRec is a Python package for building recommender systems. A TensorRec recommender system consumes three pieces of input data: user features, item features, and interactions. Based on the user/item features, the system will predict which items to recommend. The interactions are used when fitting the model: predictions are compared to the interactions and …

Introduction to Logistic Regression

Introduction: In this blog, we will discuss the basic concepts of Logistic Regression and what kinds of problems it can help us to solve. GIF: University of Toronto. Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Some examples of classification problems are Email spam or not …