Optimizing Jupyter Notebooks — A Comprehensive Guide

While we all know that premature micro-optimizations are the root of all evil, thanks to Donald Knuth’s paper “Structured Programming With Go To Statements” [1], at some point in your data exploration process you reach for more than just the current “working” solution. The heuristic approach we usually follow considers: Make it work. …

10 Data Science Tools I Explored in 2018

Source: geralt (pixabay) New Languages, Libraries, and Services Dec 31, 2018 In 2018, I invested a good amount of time in learning and writing about data science methods and technologies. In the first half of 2018, I wrote a blog series on data science for startups, which I turned into a book. In the second half, …

Use Unsupervised Machine Learning To Find Potential Buyers of Your Products

Let the algorithm learn by itself… Halong Bay in Vietnam What Is Unsupervised Machine Learning Welcome to my third post about Data Science! In my previous post I discussed how I used supervised machine learning to find donors for a charity. Recall that in supervised machine learning you have input variables (X) and an output variable (Y) …

Deep Dive into Support Vector Machine

Mathematical Modelling Let us crack the mathematics behind the Support Vector Machine. Given a training set {(Xᵢ,Yᵢ) where i=1,2,3,…,n}, Xᵢ ∈ ℜᵐ, Yᵢ ∈ {+1,-1}. Here, Xᵢ is the feature vector for the iᵗʰ data point and Yᵢ is the label for the iᵗʰ data point. The label can be either ‘+1’ for positive class or ‘-1’ …

Problem solving with “AI Challenger Global AI Contest”

Experience from participating in a computer vision competition hosted by the Chinese data science platform AI Challenger Dec 31, 2018 In this article I will share my experience solving a video classification problem in a Chinese machine learning competition. There are a lot of data science platforms for competitors. We used to think about Kaggle – the …

Introducing RcppDynProg

RcppDynProg is a new Rcpp based R package that implements simple, but powerful, table-based dynamic programming. This package can be used to optimally solve the minimum cost partition into intervals problem (described below) and is useful in building piecewise estimates of functions (shown in this note). The abstract problem The primary problem RcppDynProg::solve_dynamic_program() is designed …

Super SloMo fun. Or how you can make awesome YouTube videos with AI

Dec 31, 2018 Have you ever wondered how YouTubers or National Geographic make those super slow-motion videos? They’re so crazy cool it almost seems like magic! It used to be very expensive to create such a video. The only way to do it cleanly was with a super high-fps camera. Lucky for us, that’s no …

Predicting Crash Severity for NZ Road Accidents

Feature Exploration (a.k.a. EDA) The dataset has an initial total of 655,697 samples, each with 89 different features, both numerical and categorical. Simply by reading the definition of each feature from the PDF mentioned above, we can see that there are some features whose values are derived from other features. These derived features don’t add intrinsic …

How to use machine learning for anomaly detection and condition monitoring

Technical section: It is hard to cover the topics of machine learning and statistical analysis for anomaly detection without also going into some of the more technical aspects. I will still avoid going too deep into the theoretical background (but provide some links to more detailed descriptions). If you are more interested in the practical …

Numpy Guide for People In a Hurry

Dec 31, 2018 Photo by Chris Ried on Unsplash The NumPy library is an important Python library for Data Scientists and it is one that you should be familiar with. NumPy arrays are like Python lists, but much better! It’s much easier to manipulate a NumPy array than a Python list. You can use one NumPy …

Supercharging word vectors

A simple technique to boost fastText and other word vectors in your NLP projects Over the last few years, word vectors have been transformative in their ability to create semantic linkages between words. It is now the norm for these to be fed into deep learning models for tasks such as classification or sentiment analysis. Despite …

Exploring 2018 R-bloggers & R Weekly Posts with Feedly & the ‘seymour’ package

Well, 2018 has flown by and today seems like an appropriate time to take a look at the landscape of R bloggerdom as seen through the eyes of readers of R-bloggers and R Weekly. We’ll do this via a new package designed to make it easier to treat Feedly as a data source: seymour [GL …

Agile Data Science – Rethink

A revisit of agile practice inside the data science team Yi Jin Dec 31, 2018 Agile to Success within Data Science Nowadays, Agile has become very popular in software development. Data science is another trending discipline that more and more companies try to build and benefit from. Scrum as a methodology tries to help build good software …

An NLP View on Holiday Movies — Part II: Text Generation using LSTM’s in Keras

Dec 30, 2018 Photo by rawpixel on Unsplash Continuing on the first part of this blog post, let’s see if we can train an RNN with the input sequences, and use that to generate some new ones. The code for this part can be found here. The full code and the data (in a pandas dataframe) …

An NLP View on Holiday Movies — Part I: Topic Modeling using Gensim and SKlearn

Dec 30, 2018 Photo by Tom Coomer on Unsplash Holidays are a time for family, friends, snow and as far as my wife is concerned: corny holiday movies. To try and help her never-ending Holiday movie appetite, we’re going to check if we can create a new Christmas movie. The blog post is structured in two …

117 Days Of Tinder In Data

Brayden Dec 30, 2018 Nearly 4 months ago, after my relationship of 3 years ended, I decided to create a Tinder profile. The following is the story of that profile (in data). If you’re lazy and don’t care about the details, skip to the bottom and there is a Sankey diagram that sums up a lot. Let’s …

Build Log Analytics Application using Apache Spark

Step-by-step process of developing a real-world application using Apache Spark, with the main focus on explaining the architecture of Spark. Image Source: techgyo.com Why Apache Spark Architecture if we have Hadoop? The Hadoop Distributed File System (HDFS), which stores files in a Hadoop-native format and parallelizes them across a cluster, and applies …

Activation Functions in Neural Networks

The motive, use cases, advantages and limitations tl;dr The post discusses the various linear and non-linear activation functions used in deep learning and neural networks. We also take a look into how each function performs in different situations and the advantages and disadvantages of each, and finally conclude with one last activation function that out-performs the …

Leaf Plant Classification: Statistical Learning Model – Part 2

In this post, I am going to build a statistical learning model based upon the plant leaf datasets introduced in part one of this tutorial. We have available three datasets, each one providing sixteen samples for each of one hundred plant species. The features are: shape …

Small Steps: An Experimental Case for Compound Prediction

An Experimental Case for Compound Prediction Uncovering the abstract non-linear relationship between variables is the utility which machine learning provides. However, does jumping from known inputs directly to the abstractly related desired values yield the most accurate results? What happens in cases where a series of more closely related variables can first be predicted and …

K-nearest Neighbors Algorithm with Examples in R (Simply Explained knn)

In this post I am going to explain what the k-nearest neighbor algorithm is and how it helps us. But before we move ahead, be aware that my target audience is readers who want an intuitive understanding of the concept rather than a very in-depth one, which is why I have avoided being …

Running Fast.ai course notebooks on Kaggle Kernel

Kaggle Kernels offer an ML-optimized Docker environment, a Tesla K80 GPU, internet access and uninterrupted 6-hour sessions. Any Clouderizer project can now be run seamlessly on Kaggle Kernels. What this means is that our community project for Fast.ai can now be run on Kaggle Kernels just as easily. Below are the steps. Following pre-requisite, one time, steps …

Modality tests and kernel density estimations

Dec 30, 2018 When processing a large number of datasets which can potentially have different data distributions, we are confronted with the following considerations: Is the data distribution unimodal and, if so, which model best approximates it (uniform distribution, t-distribution, chi-square distribution, Cauchy distribution, etc.)? If the data distribution is multimodal, can …

This dance, it’s like a weapon: Radiohead’s and Beck’s danceability, valence, popularity, and more from the LastFM and Spotify APIs

When it comes to surreal lyrics and videos, I’m always thinking of Beck. Above, I cited the beginning of the song “Wow” from his latest album “Colors” which has received rather mixed reviews. In this post, I want to show you what I have done with Spotify’s API. We will also see how “Colors” compares to all …

Deep Learning for Classical Japanese Literature

Overview The paper introduces 3 new benchmark datasets for Machine Learning, namely:
– Kuzushiji-MNIST: A drop-in replacement for the MNIST dataset (28×28)
– Kuzushiji-49: A much larger but imbalanced dataset containing 48 Hiragana characters and 1 Hiragana iteration mark (28×28)
– Kuzushiji-Kanji: An imbalanced dataset of 3832 Kanji characters, including rare characters with very few samples (64×64)
Fig 2. An example of …

Master Python through building real-world applications (Part 4)

Build and Deploy a Website using Flask and Heroku App Every once in a while, there comes a new programming language and, along with it, a great community to support it. Python has been around for a while now so it is safe for me to say that Python is not a language, it is a religion. …

Monty Hall Problem using Python

Understanding mathematical proofs with the help of programming We have all heard the probability brain teaser about the three-door game show. Each contestant guesses what’s behind the door, the show host reveals one of the three doors that didn’t have the prize and gives an opportunity to the contestant to switch doors. It is …

How to use corporate e-mail analysis to reveal hidden stars and ensure equal opportunities!

mgm.com Article Series on Organizational Network Analysis and Communication Content Analysis Dec 30, 2018 In 1736, Leonhard Euler wrote a paper on the Seven Bridges of Königsberg. This paper is regarded as the first in the history of Graph Theory. Lately, Graph Theory has been utilized in the field of organizational communications and it is called …

Extract features of Music

Different types of audio features and how to extract them. Dec 30, 2018 MFCC feature extraction Extraction of features is a very important part of analyzing and finding relations between different things. Audio data cannot be understood by models directly; feature extraction is used to convert it into an understandable format. It …

Flower classification with Convolutional Neural Networks.

Agenda. Since I began to study deep learning on FastAI, this is my first attempt to implement an image classifier. I’m going to tell you (and understand better) how to create a simple and more or less accurate flower recognition model using the FastAI library. Data acquisition. We will use a public flowers dataset which is stored on Kaggle. It …

2. Machine Learning 101 – Problem solving workflow

How do you go from raw data to a fully working machine learning solution? Dec 30, 2018 If you are a software engineer, I’m sure at some point you wanted to do ‘some machine learning’, crack the secrets of the Universe and find the ultimate answer to life, the Universe and everything. However, machine learning …

Total Least Squares in comparison with OLS and ODR

Total least squares (aka TLS) is one of the methods of regression analysis that minimize the sum of squared errors between a response variable (or, an observation) and an estimated variable (often called a fitted value). The most popular and standard method for the same purpose is ordinary least squares (aka OLS), and TLS is one of other …

Explained: A Style-Based Generator Architecture for GANs – Generating and Tuning Realistic…

NVIDIA’s novel architecture for Generative Adversarial Networks Generative Adversarial Networks (GANs) are a relatively new concept in Machine Learning, introduced for the first time in 2014. Their goal is to synthesize artificial samples, such as images, that are indistinguishable from authentic images. A common example of a GAN application is to generate artificial face images …

R or Python? Why not both? Using Anaconda Python within R with {reticulate}

This short blog post illustrates how easy it is to use R and Python in the same R Notebook thanks to the {reticulate} package. For this to work, you might need to upgrade RStudio to the current preview version. Let’s start by importing {reticulate}: library(reticulate) {reticulate} is an RStudio package that provides “a comprehensive set of tools …

Road to Revolution: Socialism vs Communism

Subreddit Classification via PushShift API and Natural Language Processing I have always found the ideology behind Socialism and Communism to be very compelling during an era where socio-economic inequity continues to plague society and inhibits true progression as humankind. And while no one political / economic system is perfect, one must acknowledge the shortcomings …

Spotify ReWrapped

Spotify surprises us every December with their cool end-of-the-year specials. Nevertheless, this year some of the reports smelled fishy. This humble Medium account decided to investigate. Photoshop skills: level 9000 Every year I expect Spotify’s summary. 2016 came with the usual top 5 of songs, artists and genres. The amount of minutes spent listening to music, …

Adversarial Training: Creating Realistic Fakes With Machine Learning

Dec 29, 2018 At least 5 times throughout my childhood I had to draw a self-portrait in school, and every time, I kid you not, it turned out not much better than this: But get this, with a laptop and a couple hundred lines of code, a 15-year-old like me can create these impressively realistic …

Understanding Compositional Pattern Producing Networks

Details of Evolving CPPNs At the beginning of evolution, the population of CPPNs is initialized with simple structures containing no hidden nodes, but, due to the augmenting topology feature of CPPNs, the population becomes increasingly more complex as evolution continues and topological mutations are applied. Extra nodes and connections are added into network structures during each …

The Copernican Principle and How to Use Statistics to Figure Out How Long Anything Will Last

Being Right, Atomic Bombs, and Takeaways You might object that the answers from this equation are ridiculously wide, a point I’ll concede. However, the objective is not to get a single number (there are almost no situations, even when using the best algorithm, in which we can find the one number guaranteed to be spot on) but to find …

Leaf Plant Classification: An Exploratory Analysis – Part 1

In this post, I am going to run an exploratory analysis of the plant leaf dataset as made available by UCI Machine Learning repository at this link. The dataset is expected to comprise sixteen samples for each of one hundred plant species. Its analysis was …

AirBnB in two cities: Seattle vs Boston

An analysis of pricing, availability and reviews of AirBnB listings AirBnB has become an increasingly popular alternative to regular hotel booking sites for travelers. It helps directly link those who have an extra room or apartment with travelers in need of short-term accommodation. In today’s post, we will delve into AirBnB datasets provided by …

Web Scraping using Selenium and BeautifulSoup

Scrape data using Selenium Selenium is able to simulate the browser, and so we can make it wait until the page has finished loading before we get the data. First we will import the libraries needed for scraping and processing the web data. We will also define the url of the website we want to scrape the …

Be Resourceful — One Of The Most Important Skills To Succeed In Data Science

1. Being Resourceful is a Mindset Resourcefulness is a mindset. Period. This is especially relevant when the goals (or the problems) you have set are difficult to achieve or you cannot envision a clear path to get to where you desire to go. And this is perfectly fine. Many of us (including me) very often don’t have a …

Part 5: Code corrections to optimism corrected bootstrapping series

The truth is out there R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model’s predictive power) with increasing p (completely random features). There were 2 implementations of the method …

Immutability in public/private blockchains — Part 1

Dec 28, 2018 Photo by Donnie Rosie on Unsplash Immutability is a core value of a blockchain implementation. Before talking about technical details of this matter in public and private blockchains, it may be useful to step back and think about what it means and why it’s important. What does immutability mean? Being as literal …