K-Means Clustering

Data set and Code As I mentioned before, we are going to be using text data and in particular, we will be taking a look at the Enron email data set which is available on Kaggle. For those of you that don’t know the story/scandal surrounding Enron, I would suggest checking out the smartest guys in … Read more

AnzoGraph: A W3C Standards-Based Graph Database

Introduction In this interview, I’m catching up with Barry Zane, Vice President at Cambridge Semantics. Barry is the creator of AnzoGraph™, a native, massively parallel processing (MPP) distributed graph database. Barry has had quite a journey in the database world. He served as Vice President of Technology of Netezza Corporation from 2000 to 2005, and was responsible … Read more

Real Net Profit: 150% in just 4 Months

Developing a post-commission profitable currency trading model using Pivot Billions and R. Needle, meet haystack. Searching for the right combination of features to make a consistent trading model can be quite difficult and takes many, many iterations. By incorporating Pivot Billions and R into my research process, I was able to dramatically improve the efficiency … Read more

Categories R Tags ExcerptFavorite

Benchmarking cast in R from long data frame to wide matrix

In my daily work I often have to transform a long table to a wide matrix to accommodate some function. At some stage in my life I came across the reshape2 package, and I have been with that philosophy ever since – I find it makes data wrangling easy and straightforward. I particularly like … Read more

Deploying an R Shiny App With Docker

If you haven’t heard of Docker, it is a system that allows projects to be split into discrete units (i.e. containers) that each operate within their own virtual environment. Each container has a blueprint written in its Dockerfile that describes all of the operating parameters including operating system and package dependencies/requirements. Docker images are easily … Read more

Superhuman “cell-sight” with Deep Learning

Using “in silico labeling” to predict fluorescent labels in unlabeled images and cell morphology, components, and structures. An analysis of the paper In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images published in Cell. Fluorescently tagged neuronal cell culture. Source Take a look at this image, and tell me what you see. Figure 1. Source: Finkbeiner … Read more

NSERC – Discovery Grants Program, over the past 5 years

    library(XML)
    library(stringr)
    url="http://www.nserc-crsng.gc.ca/NSERC-CRSNG/FundingDecisions-DecisionsFinancement/ResearchGrants-SubventionsDeRecherche/ResultsGSC-ResultatsCSS_eng.asp"
    download.file(url,destfile = "GSC.html")
    library(XML)
    tables=readHTMLTable("GSC.html")
    GSC=tables[[1]]$V1
    GSC=as.character(GSC[-(1:2)])
    namesGSC=tables[[1]]$V2
    namesGSC=as.character(namesGSC[-(1:2)])
    Correction = function(x) as.numeric(gsub('[$,]', '', x))
    YEAR=2013:2018
    for(i in 1:length(YEAR)){
    … Read more

Launching codecentric.AI Bootcamp course!

Today, I am happy to announce the launch of our codecentric.AI Bootcamp! This bootcamp is a free online course for everyone who wants to learn hands-on machine learning and AI techniques, from basic algorithms to deep learning, computer vision and NLP. However, the course language is German only, but for every chapter I did, you … Read more

Liverpool is the Most Popular City in the World (relative to use as password per inhabitant)

The API of pwnedpasswords.com is quite remarkable. It not only allows you to fetch the results generally obtained by typing in your e-mail into the browser interface and finding out whether or not you’ve been pwned from the comfort of your shell. It further allows you to very simply check whether a certain password has … Read more
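As a rough illustration of how such a password check can work without sending the password itself, here is a minimal Python sketch. It assumes the service's k-anonymity range endpoint (`https://api.pwnedpasswords.com/range/<first five SHA-1 hex characters>`), which is not named in the excerpt, and the helper names `range_query` and `count_in_response` are hypothetical. The sketch only prepares the request and parses a response body; it makes no network call.

```python
import hashlib

# Assumed endpoint: the range API takes the first 5 hex chars of the SHA-1 hash.
RANGE_URL = "https://api.pwnedpasswords.com/range/{prefix}"


def range_query(password):
    """Split the SHA-1 of a password into the 5-char prefix that is sent
    to the service and the 35-char suffix that is matched locally."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = digest[:5], digest[5:]
    return RANGE_URL.format(prefix=prefix), suffix


def count_in_response(suffix, response_text):
    """Look up the suffix in a response body made of 'SUFFIX:COUNT' lines."""
    for line in response_text.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0  # suffix not listed in this range


url, suffix = range_query("password")
```

Only the five-character prefix ever leaves your machine; the match against the returned suffixes happens locally.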

Artificial Intelligence and Business Value

Digital technologies are pervasive. Nearly 5 billion people in the world now have a mobile phone connection and more than 7 billion mobile phones are in use (some people have more than one phone). Approximately 2.5 billion of the phones are smartphones. Cell phone penetration is now approaching that of electricity — about 88% of the world’s … Read more

Introducing olsrr

I am pleased to announce the olsrr package, a set of tools for improved output from linear regression models, designed keeping in mind beginner/intermediate R users. The package includes: comprehensive regression output; variable selection procedures; heteroskedasticity, collinearity diagnostics and measures of influence; various plots and underlying data. If you know how to build models using lm(), you … Read more

“Correlation is not causation”. So what is?

Machine learning applications have been growing rapidly in volume and scope over the last few years. What is causal inference, how is it different from plain good ole’ ML, and when should you consider using it? In this report I try to give a short and concrete answer by using an example. Imagine we’re tasked by the … Read more

NLP Learning Series: Part 2 — Conventional Methods for Text Classification

NLP Learning Series (Part 2) Teaching Machines to Learn Text This is the second post of the NLP Text classification series. To give you a recap, recently I started up with an NLP text classification competition on Kaggle called Quora Question insincerity challenge. And I thought to share the knowledge via a series of blog posts on … Read more

Review: YOLOv3 — You Only Look Once (Object Detection)

Improved YOLOv2, Comparable Performance with RetinaNet, 3.8× Faster! YOLOv3 In this story, YOLOv3 (You Only Look Once v3), by University of Washington, is reviewed. YOLO is a very famous object detector. I think everybody must know it. Below is the demo by authors: YOLOv3 As author was busy on Twitter and GAN, and also helped out … Read more

Data Science with Optimus. Part 1: Intro.

Breaking down data science with Python, Spark and Optimus. Don’t worry if you don’t know what these logos are, I’ll explain them in the next articles 🙂 Data science has reached new levels of complexity and of course awesomeness. I’ve been doing this for years now, and what I want for people is to have a clear and … Read more

Web scraping with Python — A to Z

Handling BeautifulSoup, avoiding blocks, enriching with API, storing in a DB and visualizing the data. Introduction What is web scraping and when would you want to use it? The act of going through web pages and extracting selected text or images. An excellent tool for getting new data or enriching your … Read more

Naive Bayes: Intuition and Implementation

Introduction: What Are Naive Bayes Models? In a broad sense, Naive Bayes models are a special kind of classification machine learning algorithm. They are based on a statistical classification technique called ‘Bayes Theorem’. Naive Bayes models are called ‘naive’ algorithms because they assume that the predictor variables are independent of each other. In other … Read more
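To make the independence assumption concrete, here is a minimal from-scratch sketch in Python. It is not the article's implementation: the toy data, the function names `train` and `predict`, and the use of add-one smoothing are all assumptions made for illustration. Each feature contributes its own factor P(value | class), and the factors are simply multiplied (added in log space).

```python
from collections import Counter, defaultdict
from math import log


def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(label, feature_index)][value] -> count
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def predict(row, class_counts, value_counts, n_values=2):
    """Return the class with the highest log-posterior score."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = log(count / total)  # log prior P(label)
        for i, value in enumerate(row):
            # Independence assumption: each feature contributes its own
            # factor P(value | label); add-one smoothing avoids log(0).
            seen = value_counts[(label, i)][value]
            score += log((seen + 1) / (count + n_values))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up data: (contains "free", contains "meeting") -> label
rows = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = train(rows, labels)
prediction = predict((1, 0), *model)
```

The `n_values=2` default reflects the binary features of this toy example; with more values per feature, the smoothing denominator would need to change accordingly.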

Create data visualizations like BBC News with the BBC’s R Cookbook

If you’re looking for a guide to making publication-ready data visualizations in R, check out the BBC Visual and Data Journalism cookbook for R graphics. Announced in a BBC blog post this week, it provides scripts for making line charts, bar charts, and other visualizations like those below used in the BBC’s data journalism. The cookbook … Read more

Clustered Globe

Setting Constraints & Variables First, we’re gonna set the boundaries of what detail we are going to cluster. At this stage I want to keep countries separated and only cluster activities within a single country. Therefore, by the nature of clustering, small countries will probably become a single cluster. And although there could be cross-border … Read more


I am stuck at home sick today, so I decided to provide a relational analysis of the Stats Package Wars that have been bubbling away for the past week. True in all its details. If you want something slightly more constructive, consider The Plain Person’s Guide to Plain-Text Social Science. … Read more

Using NLP to build a search & discovery app for Regulators

Regulations need to be updated constantly in this era of rapid socio-economic and technological change. Regulators spend a substantial amount of time assessing the current stock of Acts to identify inconsistent use of language or markers that don’t support innovation and create a burden for businesses. Given the large number of Acts and their complex … Read more

Hybrid Humans and Conscious Robots

Musings on the intersection of Artificial Intelligence, Consciousness, and Reinforcement Learning. At what level are you conscious? Staring into the eyes of a comatose loved one, many of us have agonized over whether the patient was conscious of caresses received or whispered prayers. Increasingly, we will have answers to such questions, thanks in a large … Read more

People Tracking using Deep Learning

Doing cool things with data! Introduction Object Tracking is an important domain in computer vision. It involves the process of tracking an object which could be a person, ball or a car across a series of frames. For people tracking we would start with all possible detections in a frame and give them an ID. In … Read more

Supervised Machine Learning: Model Validation, a Step by Step Approach

Model validation is the process of evaluating a trained model on a test data set. This provides an estimate of the generalization ability of a trained model. Here I provide a step by step approach to complete a first iteration of model validation in minutes. The basic steps for applying a supervised machine learning model are: Choose a class of model … Read more

BigQuery without a credit card: Discover, learn and share

If you ever had trouble signing up for BigQuery, worry no more — now it’s easier than ever to sign up and start querying. The new sandbox mode even includes free storage, no credit card required. See the official blog post “Query without a credit card: introducing BigQuery sandbox” for more details. Here we are going to … Read more

Learn Enough Python to be Useful Part 2

How to Use if __name__ == "__main__" This article is one in a series to help you become comfortable in Python scripting land. It’s for data scientists and anyone new to Python programming. if __name__ == "__main__": is one of those things you see in Python scripts that often isn’t explained. You might have … Read more
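For readers who have not seen the idiom in action, a minimal sketch follows. The function name `main` and its return value are hypothetical and not taken from the article. Code under the guard runs when the file is executed as a script, but not when the file is imported as a module.

```python
def main():
    """Hypothetical entry point; returns a value so it is easy to test."""
    return "running as a script"


# __name__ is "__main__" only when this file is executed directly
# (e.g. `python my_script.py`); on import it is the module's name instead.
if __name__ == "__main__":
    print(main())
```

Importing this file from another module defines `main` without printing anything, which is what makes the guard useful for files that serve as both a script and a library.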

Intuitive Deep Learning Part 1a: Introduction to Neural Networks

As mentioned above, Deep Learning is simply a subset of the architectures (or templates) that employ “neural networks”, which we can specify during Step 1. “Neural networks” (more specifically, artificial neural networks) are loosely based on how our human brain works, and the basic unit of a neural network is a neuron. At the basic … Read more

Fashion Science takes on Seasonal Color Analysis

Turns out ‘seasons’ just aren’t found in the data Wear the right color clothes and be more attractive! That’s the allure of seasonal color analysis. By appropriately placing you into one of four seasons — spring, summer, autumn, and winter — each has its own palette of colors appropriate for you. This paper applies Fashion Science to explore a simple … Read more

Image Classification for E-commerce [Part I]

System Requirements: Download or clone the ResNet model from Facebook’s GitHub link. Install the Torch ResNet dependencies on Ubuntu 14.04+: install Torch on a machine with a CUDA GPU (NVIDIA GPU with compute capability 3.5 or above), then install cuDNN v4 or v5 and the Torch cuDNN bindings. See the installation instructions for a step-by-step guide. Let’s … Read more

Are you leaking h2o? Call plumber!

Create a predictive model with the h2o package. H2o is a fantastic open source machine learning platform with many different algorithms. There is a graphical user interface, a Python interface and an R interface. Suppose you want to create a predictive model and you are lazy: then just run automl. Let’s say we have both train … Read more

Investigating words distribution with R – Zipf’s law

Hello again! Typically I would start by describing a complicated problem that can be solved using machine or deep learning methods, but today I want to do something different, I want to show you some interesting probabilistic phenomena! Have you heard of Zipf’s law? I hadn’t until recently. Zipf’s law is an empirical law that … Read more

Perfume Recommendations using Natural Language Processing

Introduction Natural Language Processing (NLP) has many intriguing applications to Recommender Systems and Information Retrieval. As a perfume lover and a Data Scientist, the unusual and highly descriptive language used in the niche perfume community inspired me to use NLP to create a model to help me discover perfumes I might want to purchase. “Niche” perfumes … Read more

Le Monde puzzle [#1083]

A Le Monde mathematical puzzle that seems hard to solve without the backup of a computer (and just simple enough to code on a flight to Montpellier): Given the number N=2,019, find a decomposition of N as a sum of non-trivial powers of integers such that (a) the number of integers in the sum is … Read more

Hand Keypoints Detection

Detect the keypoint positions on hand images with a small training data set. How many labelled images are needed to train a network to accurately predict finger and palm line locations? I was inspired by this blog post where the author reported 97.5% classification accuracy in classifying whether a human was wearing glasses or not with … Read more

PDSwR2: New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! The newly available chapters cover: Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and … Read more

Your Data is Using a lot of Energy

All the data collected on users and saved in the cloud is having a big impact on the environment. If you begin to look into the amount of data created each year, you’ll quickly find a statistic that at first look seems overused and outdated: 90% of data ever created was created in the … Read more

Algorithms for Text Classification — Part 1: Naive Bayes

Next, let’s see how to run this algorithm using Python with real data:

    import pandas as pd
    import numpy as np

    spam_data = pd.read_csv('spam.csv')
    spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
    print(spam_data.shape)
    spam_data.head(10)

    from sklearn.model_selection import train_test_split
    #Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(spam_data['text'],spam_data['target'],random_state=0)

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import roc_auc_score

    #Train and evaluate … Read more

How Do I Write About Data Science On Medium

5 Core Principles to Write about Data Science, and Beyond (Source) 1. Be conversational Your articles are always read by individual readers — one reader at any given time. What this means is that readers mostly read your articles individually without anyone beside them. Therefore, to really attract and engage with readers, your writing should be in a … Read more

Community Forums Meets Data Science

Analysis of forum members’ activity, posts, and behavior. Summary: As a community builder and strategist with a passion for data science, I have found that the use of data science techniques has deepened my understanding of the communities I manage, allowing me to make better strategic and operational decisions. In this article, I aim to exemplify how … Read more

AI: The Future of Technology and the World

Artificial intelligence (AI) has now become a bigger topic of controversy than ever before. Many people are worried about robots taking over the world. The concept of AI scares people because they fear that we are creating bots without any idea of how they work. But what if I … Read more

Community detection of survey responses based on Pearson correlation coefficient with Neo4j

Just a few days ago a new version of the Neo4j graph algorithms plugin was released. With the new release come new algorithms, and the Pearson correlation algorithm is one of them. To demonstrate how to use the Pearson correlation algorithm in Neo4j, we will use the data from the “Young People Survey” Kaggle dataset made available by Miroslav … Read more
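For reference, the Pearson correlation coefficient itself is easy to compute by hand. The Python sketch below uses the textbook formula, not the Neo4j plugin's code, and the respondent names and answer values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance numerator and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical survey answers (1-5 scale) from two respondents
alice = [5, 3, 4, 1, 2]
bob = [4, 3, 5, 1, 2]
similarity = pearson(alice, bob)
```

Values close to 1 mean two respondents answer in a similar pattern, which is what makes the coefficient usable as a similarity weight between survey participants.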

Toronto on Fire in Data, Part 1

Fire Incidents Analysis by Segmentation & Poisson in Practice Each year the Toronto Fire Services (TFS) are dispatched to between 9,000 and 10,000 fires in the city of 2.7 million inhabitants. The severity ranges from minor fires in grass or rubbish to major fires in warehouses or residential high-rises. In this study, the first of a … Read more

Introducing DoWhy

Microsoft’s Framework for Causal Inference The human mind has a remarkable ability to associate causes with a specific event. From the outcome of an election to an object dropping on the floor, we are constantly associating chains of events that cause a specific effect. Neuropsychology refers to this cognitive ability as causal reasoning. Computer science … Read more