What follows AlphaStar for Academic AI Researchers?

DeepMind continues making progress, but the path forward for AI researchers in academia is unclear. Ten years ago I challenged AI researchers across the globe to build a professional-level bot for StarCraft 1. The Brood War API was recently released, and for the first time academics and professionals could test out AI systems on a highly-competitive … Read more What follows AlphaStar for Academic AI Researchers?

Using AI For Good

How to Help Developing Countries with Artificial Intelligence. CE Kan, Jan 27. Recently, I have come across quite a few articles stating how artificial intelligence may threaten the developing world by eliminating the need for repetitive, labor-intensive manufacturing roles. Automation of factories can potentially lead to higher unemployment rates in poorer nations, thereby disrupting local … Read more Using AI For Good

Hierarchical Bayesian Modeling for Ford GoBike Ridership with PyMC3 — Part II

Photo by sabina fratila on Unsplash. In the first part of this series, we explored the basics of using a Bayesian-based machine learning model framework, PyMC3, to construct a simple Linear Regression model on Ford GoBike data. In this example problem, we aimed to forecast the number of riders that would use the bike share tomorrow … Read more Hierarchical Bayesian Modeling for Ford GoBike Ridership with PyMC3 — Part II

Mathematical Notation in Online R/exams

Many R/exams exercises employ mathematical notation that needs to be converted and rendered suitably for inclusion in online exams. While R/exams attempts to set suitable defaults, an overview is provided of possible adjustments and when these might be useful or even necessary. Overview: A popular use case of the R/exams package is the generation of … Read more Mathematical Notation in Online R/exams

Handling imbalanced datasets in machine learning

Reworking the problem is better. Up to now the conclusion is pretty disappointing: if the dataset is representative of the true data, if we can’t get any additional features, and if we target a classifier with the best possible accuracy, then a “naive behaviour” (always answering the same class) is not necessarily a problem and should … Read more Handling imbalanced datasets in machine learning

Interactive Controls for Jupyter Notebooks

How to use interactive IPython widgets to enhance data exploration and analysis. There are few actions less efficient in data exploration than re-running the same cell over and over again, each time slightly changing the input parameters. Despite knowing this, I still find myself repeatedly executing cells just to make the slightest change, for example, choosing … Read more Interactive Controls for Jupyter Notebooks

Building Big Shiny Apps — A Workflow (1/2)

During rstudio::conf(2019L), I presented an e-poster called “Building Big Shiny Apps — A Workflow”. You can find the poster here, and this blog post is an attempt at a transcription of what I talked about while presenting the poster. As this is a rather long topic, I’ve divided this post into two parts: … Read more Building Big Shiny Apps — A Workflow (1/2)

Understanding Entity Embeddings and It’s Application

As of late I’ve been reading a lot on entity embeddings after being tasked to work on a forecasting problem. The task at hand was to predict the salary of a given job title, given the historical job ads data that we have in our data warehouse. Naturally, I just had to seek out how … Read more Understanding Entity Embeddings and It’s Application

Mario vs. Wario — round 2: CNNs in PyTorch and Google Colab

For quite some time I had been meaning to play with Google Colab (yes, free access to GPU…). I think this is a really awesome initiative, which enables people with no GPU on their personal computers to play around with Deep Learning and train models they would not be able to train otherwise. Basically we … Read more Mario vs. Wario — round 2: CNNs in PyTorch and Google Colab

10 Tips for Choosing the Optimal Number of Clusters

Matt.0, Jan 27. Photo by Pakata Goh on Unsplash. Clustering is one of the most common unsupervised machine learning problems. Similarity between observations is defined using some inter-observation distance measures or correlation-based distance measures. There are 5 classes of clustering methods: Hierarchical Clustering, Partitioning Methods (k-means, PAM, CLARA), Density-Based Clustering, Model-based Clustering, and Fuzzy Clustering. My … Read more 10 Tips for Choosing the Optimal Number of Clusters

R tips and tricks – higher-order functions

A higher-order function is a function that takes one or more functions as arguments, and/or returns a function as its result. This can be super handy in programming when you want to tilt your code towards readability and still keep it concise. Consider the following code: # Generate some fake data > eps <- rnorm(10, sd= … Read more R tips and tricks – higher-order functions

A Gentle Introduction to Deep Learning : Part 3

PCA & Linear Algebra (Advanced). Photo by Antoine Dautry. “You can’t build a great building on a weak foundation.” This quote justifies what I am trying to do here: you cannot truly learn machine learning or deep learning until you know some of the important mathematical concepts like linear algebra … Read more A Gentle Introduction to Deep Learning : Part 3

Data Augmentation for Natural Language Processing

Lessons learned from a hate speech detection task to improve supervised NLP models. Note: this post is mainly targeted at an audience unfamiliar with Natural Language Processing and will hence cover some basic concepts before moving on to data augmentation. Source: Harvard Political Review. Natural Language Processing (NLP) has become increasingly popular in both academia and … Read more Data Augmentation for Natural Language Processing

Statistics Sunday: Creating a Stacked Bar Chart for Rank Data

At work on Friday, I was trying to figure out the best way to display some rank data. What I had were rankings from 1-5 for 10 factors considered most important in a job (such as Salary, Insurance Benefits, and the Opportunity to Learn), meaning each respondent chose and ranked the top 5 from those … Read more Statistics Sunday: Creating a Stacked Bar Chart for Rank Data

Learning to Drive Smoothly in Minutes

Learning to Drive in Minutes: The Updated Approach. Although the Wayve.ai technique may work in principle, it has some issues that need to be addressed to apply it to a self-driving RC car. First, because the feature extractor (VAE) is trained after each episode, the distribution of features is not stationary. That is to say, the features are … Read more Learning to Drive Smoothly in Minutes

The New Dawn of AI: Federated Learning

The emerging AI market model is dominated by tech giants such as Google, Amazon and Microsoft, who offer cloud-based AI solutions and APIs. This model offers users little control over the usage of AI products and their own data that is collected from their devices, locations etc. In the long run, such a centralized model … Read more The New Dawn of AI: Federated Learning

Analytics Building Blocks: Regression

A modularized notebook to tune and compare 11 regression algorithms with minimal coding in a control panel fashion. This article summarizes and explains key modules of my regression block (one of the simple modularized notebooks I am developing to execute common analysis tasks). The notebook is intended to facilitate quicker experimentation for the users with … Read more Analytics Building Blocks: Regression

Generative Adversarial Networks — Learning to Create

A peek into the design, training, loss functions and arithmetic behind GANs. Let’s say we have a dataset of images of bedrooms and an image classifier CNN that was trained on this dataset to tell us if a given input image is a bedroom or not. Let’s say the images are of size 16 * 16. … Read more Generative Adversarial Networks — Learning to Create

Machine Learning from First Principles

Machine Learning ~ Applied Mathematics (https://bit.ly/2Wns7eN). Roadmap. Goal: First and foremost, machine learning carries the connotation that it is extremely complex. While it is mathematically rigorous, it is really simple when you break it down into mathematical terms, and even simpler to grasp once you see a real-world example of how … Read more Machine Learning from First Principles

Tensorflow — The core concepts

[source: https://tensorflow.org] Like most machine learning libraries, TensorFlow is “concept-heavy and code-lite”. The syntax is not very difficult to learn, but it is very important to understand its concepts. What is a Tensor? According to Wikipedia, “A tensor is a geometric object that maps in a multi-linear manner geometric vectors, scalars, and other tensors to … Read more Tensorflow — The core concepts

Summarizing rstudio::conf 2019 Summaries with Tidy Text Techniques

To be honest, I planned on writing a review of this past weekend’s rstudio::conf 2019, but several other people have already done a great job of doing that: just check out Karl Broman’s aggregation of reviews at the bottom of the page here! (More on this in a second.) In short, my thoughts on the whole experience are captured perfectly by Nick Strayer’s … Read more Summarizing rstudio::conf 2019 Summaries with Tidy Text Techniques

Understanding Markov Decision Processes

At a high level, a Markov Decision Process (MDP) is a type of mathematical model that is very useful for machine learning, reinforcement learning to be specific. The model allows machines and agents to determine the ideal behavior within a specific environment, in order to maximize the model’s ability to achieve a certain state in … Read more Understanding Markov Decision Processes

How to store financial market data for backtesting

I am working on moderately large financial price data sets. By moderately large I mean less than 4 million rows per asset. 4 million rows can cover the last 20 years of minute price bars for a regular asset without extended trading hours (such as index futures contracts or regular cash stocks). When dealing with … Read more How to store financial market data for backtesting

Learning NLP Language Models with Real Data

Part 2: Applying Language Models to Real Data. Data Source and Pre-Processing: For this demonstration, we will be using the IMDB large movie review dataset made available by Stanford. The data contains the rating given by the reviewer, the polarity and the full comment. For example, the first negative comment here in full is the following: … Read more Learning NLP Language Models with Real Data

How Twitter does it? Challenges in implementing recommender systems at scale

A summarized view of the challenges in implementing recommender systems from an industry point of view. Most of the time, data science projects stop at achieving some satisfactory accuracy based on a subset of data. This is also the case with recommender systems. In a controlled environment and with a limited dataset, it might be possible … Read more How Twitter does it? Challenges in implementing recommender systems at scale

Analyzing and Predicting Starbucks’ Location Strategy

Logistic Regression Prediction: A basic logistic regression using demographic variables can correctly predict about 60% of zip codes that have a Starbucks and 90% of those that don’t. Given the unbalanced nature of the data set (31K observations and ~5,500 with a Starbucks), a 60% prediction rate should be sufficient for the purposes of this exercise. Our … Read more Analyzing and Predicting Starbucks’ Location Strategy
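As a rough check on the figures quoted in this excerpt, the two per-class rates can be combined into an overall accuracy and compared with a classifier that always answers "no Starbucks". The sketch below assumes exactly 31,000 zip codes, of which 5,500 have a Starbucks (the excerpt gives only approximate counts). The function names are hypothetical and not taken from the article.

```python
def overall_accuracy(n_total, n_positive, true_positive_rate, true_negative_rate):
    """Combine per-class hit rates into a single overall accuracy."""
    n_negative = n_total - n_positive
    correct = true_positive_rate * n_positive + true_negative_rate * n_negative
    return correct / n_total


def majority_baseline(n_total, n_positive):
    """Accuracy of always predicting whichever class is larger."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total


# Assumed round numbers based on the excerpt's approximate figures.
N_TOTAL = 31_000
N_WITH_STARBUCKS = 5_500

model_acc = overall_accuracy(N_TOTAL, N_WITH_STARBUCKS, 0.60, 0.90)
baseline_acc = majority_baseline(N_TOTAL, N_WITH_STARBUCKS)
print(f"model: {model_acc:.3f}, always-no baseline: {baseline_acc:.3f}")
```

Under these assumed counts the model's overall accuracy (about 0.85) is only slightly above the always-no baseline (about 0.82), which suggests the 60% rate on zip codes that do have a Starbucks is the more informative figure.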

Hypothesis Testing Glossary for the Weary Reader

From “alpha” to “z-score”. Why So Weary? When I try to read about statistics I get mired in the jargon. Even just moving past the phrase, “For a given parameterized distribution,” requires that I think about what it means for something to be “parameterized” and what a “distribution” is. I wind up reading in … Read more Hypothesis Testing Glossary for the Weary Reader

Artificial Neural Network Implementation using NumPy and Classification of the Fruits360 Image…

This tutorial builds an artificial neural network in Python from scratch using NumPy in order to classify images from the Fruits360 dataset. Everything (i.e. images and source code) used in this tutorial, other than the color Fruits360 images, is the exclusive right of my book, cited as “Ahmed Fawzy Gad ‘Practical Computer Vision Applications … Read more Artificial Neural Network Implementation using NumPy and Classification of the Fruits360 Image…

Quick guide to run your Python scripts on Google Colaboratory

If you are looking for an interactive way to run your Python script, say you want to start a machine learning project with a couple of friends, look no further — Google Colab is the best solution for you. You can work online and save your code on your local Google Drive, and it allows you to … Read more Quick guide to run your Python scripts on Google Colaboratory

How to Learn More in Less Time with Natural Language Processing (Part 2)

And how to create your own bag of words classifier. With the nifty extractive text summarizer we created in Part 1, we were able to take news articles and cut them down to half their size or more! Now it is time to take these articles and classify them by subject. In this part we … Read more How to Learn More in Less Time with Natural Language Processing (Part 2)

How to Learn More in Less Time with Natural Language Processing (Part 1)

And how to create your own extractive text summarizer. Imagine you are given an assignment from school or work that involves A LOT of research. You spend all night grinding it out, so you can acquire the knowledge you need for a high-quality end product. Now imagine you are given the exact same assignment and … Read more How to Learn More in Less Time with Natural Language Processing (Part 1)

User guide to My First Data Product: Medium Post Metric Displayer

Know Your Medium Post Better with Data. Origin: As a regular writer on Medium as well as a data geek, after the busy year of 2018, I’d like to reflect on what I have achieved on my Medium blog. Furthermore, based on the performance in 2018, I plan to make a more aggressive writing plan in the year … Read more User guide to My First Data Product: Medium Post Metric Displayer

An Rstudio Addin for Network Analysis and Visualization

The ggraph package provides a ggplot-like grammar for plotting graphs and as such you can produce very neat network visualizations. But as with ggplot, it takes a while to get used to the grammar. There are already a few amazing Rstudio Addins that assist you with ggplot (for example ggplotAssist and ggThemeAssist), but there have not been any equivalent tools … Read more An Rstudio Addin for Network Analysis and Visualization

EMPOWERING A CITIZEN DATA SCIENTIST FOR HARDWARE DESIGN & MANUFACTURING

Improving productivity of a hardware design and manufacturing professional with an advanced AI tool. Authors: Partha Deka and Rohit Mittal. What is a citizen data scientist? Expert data scientists rely on custom coding to make sense out of data. The use case could be data cleansing, data imputation, creating segments, finding patterns in the data, … Read more EMPOWERING A CITIZEN DATA SCIENTIST FOR HARDWARE DESIGN & MANUFACTURING

How to do Bayesian hyper-parameter tuning on a blackbox model

Optimization of arbitrary functions on Cloud ML Engine. Google Cloud ML Engine offers a hyper-parameter tuning service that uses Bayesian methods. It is not restricted to TensorFlow or scikit-learn. In fact, it is not even limited to machine learning. You can use the Bayesian approach to tune pretty much any blackbox model. To demonstrate, I’ll tune … Read more How to do Bayesian hyper-parameter tuning on a blackbox model

Creating AI for GameBoy Part 2: Collecting Data From the Screen

Welcome to part 2 of Creating an AI for GameBoy! If you missed Part 1: Coding a Controller, click here to catch up. In this edition, I will be going over how to intelligently get information out of the game through various image processing and classification techniques. This is important to any game AI, but … Read more Creating AI for GameBoy Part 2: Collecting Data From the Screen

Statistics is the Grammar of Data Science — Part 2

Probability Distribution Functions: A probability distribution is a function that describes the likelihood of an event or outcome. We will now delve into the different types of distributions, in terms of the dataset being continuous or discrete. Probability Density Function (PDF): When we see a graph like the one in the figure below, we think that … Read more Statistics is the Grammar of Data Science — Part 2

“Data Science” Has Become Too Vague

Let’s Specialize and Break it Up! I would not be opposed to downplaying the term “data science” and breaking it up into specialized disciplines. Do not misunderstand, I think the global “data science” movement was necessary and had a positive impact on the curmudgeon corporate world. But the campaign has been won and everybody is bought … Read more “Data Science” Has Become Too Vague

Monte Carlo Simulations with Python (Part 1)

Monte Carlo simulations can be used to simulate games at a casino (Pic courtesy of Pawel Biernacki). This is the first of a three-part series on learning to do Monte Carlo simulations with Python. This first tutorial will teach you how to do a basic “crude” Monte Carlo, and it will teach you how to … Read more Monte Carlo Simulations with Python (Part 1)
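The excerpt does not show the tutorial's own code. "Crude" Monte Carlo is commonly understood as estimating an integral by averaging the integrand at uniformly random points, so the following is a minimal sketch under that assumption. The function name `crude_monte_carlo` and the example integrand are illustrative choices, not taken from the article.

```python
import random


def crude_monte_carlo(f, a, b, n_samples, seed=None):
    """Estimate the integral of f over [a, b] by averaging f at uniform random points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.uniform(a, b)
        total += f(x)
    # (b - a) times the sample mean of f approximates the integral.
    return (b - a) * total / n_samples


# Example: the integral of x^2 over [0, 1] is exactly 1/3.
estimate = crude_monte_carlo(lambda x: x * x, 0.0, 1.0, n_samples=100_000, seed=0)
print(estimate)
```

The estimate's error shrinks roughly with the square root of the number of samples, so many samples are needed for each extra digit of precision.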

Text to Image

This article will explain the experiments and theory behind an interesting paper that converts natural language text descriptions such as “A small bird has a short, pointy orange beak and white belly” into 64×64 RGB images. Following is a link to the paper “Generative Adversarial Text to Image Synthesis” by Reed et al. Article Outline … Read more Text to Image

satRdays Newcastle 2019 Conference is Here!

We are pleased to announce the very first satRdays event in Newcastle upon Tyne (and England). satRdays Newcastle is a one-day, low-cost, community organised R conference in the heart of Newcastle City Centre. Where? The event will be held at Newcastle University. Getting to Newcastle is really easy. Train: 90 minutes from Edinburgh or 3 … Read more satRdays Newcastle 2019 Conference is Here!

The Simple Yet Practical Data Visualization Codes

In the previous article I shared my little toolbox for data cleaning after realizing that some code is applicable to most common scenarios of messy data. In other words, there is a pattern (or an approach) that is commonly used in data science for data cleaning and I compiled them into functions for reusability … Read more The Simple Yet Practical Data Visualization Codes

A Great Public Health Conspiracy?

The Facts on Public Water Fluoridation. With any health topic, especially one that has attracted controversy, we must be careful about where we get our data. Even studies in peer-reviewed journals can have biases, intentional or not. Therefore, the best practice for reviewing medical evidence is to look at meta-analyses, reviews that evaluate results from dozens … Read more A Great Public Health Conspiracy?

Canny Edge Detection Step by Step in Python — Computer Vision

Noise Reduction: Since the mathematics behind the scenes is mainly based on derivatives (cf. Step 2: Gradient calculation), edge detection results are highly sensitive to image noise. One way to get rid of the noise in the image is to apply a Gaussian blur to smooth it. To do so, an image convolution technique is applied … Read more Canny Edge Detection Step by Step in Python — Computer Vision
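The excerpt cuts off before showing how the blur is applied. As a minimal sketch of the idea, the code below builds a normalized Gaussian kernel and slides it over a grayscale image stored as nested lists. The function names, the edge-replication border handling, and the pure-Python representation are assumptions made for illustration; the article itself may use a different kernel size, library, or border rule.

```python
import math


def gaussian_kernel(size, sigma=1.0):
    """Return a size x size Gaussian kernel (nested lists) normalized to sum to 1."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    kernel = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def blur(image, kernel):
    """Apply the kernel to a 2D grayscale image, replicating edge pixels at the border.

    The Gaussian kernel is symmetric, so this sliding-window sum equals a convolution.
    """
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    output = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for ki, kernel_row in enumerate(kernel):
                for kj, weight in enumerate(kernel_row):
                    # Clamp coordinates so border pixels reuse the nearest valid pixel.
                    y = min(max(i + ki - half, 0), height - 1)
                    x = min(max(j + kj - half, 0), width - 1)
                    acc += image[y][x] * weight
            output[i][j] = acc
    return output


# A single bright pixel (noise) on a dark background gets spread out and dimmed.
noisy = [[0.0] * 5 for _ in range(5)]
noisy[2][2] = 1.0
smoothed = blur(noisy, gaussian_kernel(3, sigma=1.0))
```

A larger kernel or a larger `sigma` smooths more aggressively, at the cost of also softening the real edges that the later steps try to detect.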