Data Science in Algorithmic Trading

Dec 31, 2018 In this article I plan to give you a glimpse into an asset model for algorithmic trading. This model of the world should allow us to make predictions about what will happen, based upon what happened in the past, and to make money by trading on this information. The model and trading … Read more

Optimizing Jupyter Notebooks — A Comprehensive Guide

While we all know that premature micro optimizations are the root of all evil, thanks to Donald Knuth’s paper “Structured Programming With Go To Statements” [1], eventually at some point in your data exploration process you grasp for more than just the current “working” solution. The heuristic approach we usually follow considers: Make it work. … Read more

10 Data Science Tools I Explored in 2018

New Languages, Libraries, and Services Dec 31, 2018 In 2018, I invested a good amount of time in learning and writing about data science methods and technologies. In the first half of 2018, I wrote a blog series on data science for startups, which I turned into a book. In the second half, … Read more

Use Unsupervised Machine Learning To Find Potential Buyers of Your Products

Let the algorithm learn by itself… What Is Unsupervised Machine Learning Welcome to my third post about Data Science! In my previous post I discussed how I used supervised machine learning to find donors for a charity. Recall that in supervised machine learning you have input variables (X) and an output variable (Y) … Read more

Deep Dive into Support Vector Machine

Mathematical Modelling Let us crack the mathematics behind Support Vector Machine. Given a training set {(Xᵢ,Yᵢ) where i=1,2,3,…,n}, Xᵢ ∈ ℜᵐ, Yᵢ ∈ {+1,-1}. Here, Xᵢ is the feature vector for the iᵗʰ data point and Yᵢ is the label for the iᵗʰ data point. The label can be either ‘+1’ for positive class or ‘-1’ … Read more

Problem solving with “AI Challenger Global AI Contest”

Experience from participating in a computer vision competition hosted by the Chinese data science platform AI Challenger Dec 31, 2018 In this article I will share my experience solving a video classification problem in a Chinese machine learning competition. There are a lot of data science platforms for competitors. We used to think about Kaggle – the … Read more

Introducing RcppDynProg

RcppDynProg is a new Rcpp based R package that implements simple, but powerful, table-based dynamic programming. This package can be used to optimally solve the minimum cost partition into intervals problem (described below) and is useful in building piecewise estimates of functions (shown in this note). The abstract problem The primary problem RcppDynProg::solve_dynamic_program() is designed … Read more

Categories R Tags ExcerptFavorite

Predicting Crash Severity for NZ Road Accidents

Feature Exploration (a.k.a. EDA) The dataset has an initial total of 655,697 samples, each with 89 different features, both numerical and categorical. Simply by reading the definition of each feature from the pdf mentioned above, we can see that there are some features whose values are derived from other features. These derived features don’t add intrinsic … Read more

How to use machine learning for anomaly detection and condition monitoring

Technical section: It is hard to cover the topics of machine learning and statistical analysis for anomaly detection without also going into some of the more technical aspects. I will still avoid going too deep into the theoretical background (but provide some links to more detailed descriptions). If you are more interested in the practical … Read more

Numpy Guide for People In a Hurry

Dec 31, 2018 The NumPy library is an important Python library for Data Scientists and it is one that you should be familiar with. NumPy arrays are like Python lists, but much better! It’s much easier to manipulate a NumPy array than a Python list. You can use one NumPy … Read more

Supercharging word vectors

A simple technique to boost fastText and other word vectors in your NLP projects Over the last few years, word vectors have been transformative in their ability to create semantic linkages between words. It is now the norm for these to be fed into deep learning models for tasks such as classification or sentiment analysis. Despite … Read more

AI Predictions for 2019

Dec 31, 2018 Artificial Intelligence, specifically, machine learning and deep learning, have been fashionable keywords of 2018 and we don’t expect the hype to die down in the next few years. In the long run, AI will eventually become normal everyday news, being yet another technology powering our lives, just like what happened with the … Read more

Agile Data Science – Rethink

A revisit of agile practice inside the data science team Yi Jin Dec 31, 2018 Agile to Success within Data Science Nowadays, Agile has become very popular in software development. Data science is another trending discipline that more and more companies try to build and benefit from. Scrum as a methodology tries to help build good software … Read more

117 Days Of Tinder In Data

Brayden Dec 30, 2018 Nearly 4 months ago, after my relationship of 3 years ended, I decided to create a Tinder profile. The following is the story of that profile (in data). If you’re lazy and don’t care about the details, skip to the bottom and there is a Sankey diagram that sums up a lot. Let’s … Read more

Build Log Analytics Application using Apache Spark

Step-by-step process of developing a real-world application using Apache Spark, with the main focus on explaining the architecture of Spark. Why Apache Spark Architecture if we have Hadoop? The Hadoop Distributed File System (HDFS), which stores files in a Hadoop-native format and parallelizes them across a cluster, and applies … Read more

Activation Functions in Neural Networks

The motive, use cases, advantages and limitations tl;dr The post discusses the various linear and non-linear activation functions used in deep learning and neural networks. We also take a look into how each function performs in different situations, the advantages and disadvantages of each then finally concluding with one last activation function that out-performs the … Read more

Leaf Plant Classification: Statistical Learning Model – Part 2

Categories Advanced Modeling Tags Linear Regression Principal Component Analysis R Programming In this post, I am going to build a statistical learning model based upon the plant leaf datasets introduced in part one of this tutorial. We have available three datasets, each one providing sixteen samples each of one-hundred plant species. The features are: shape … Read more

Small Steps: An Experimental Case for Compound Prediction

An Experimental Case for Compound Prediction Uncovering the abstract non-linear relationship between variables is the utility which machine learning provides. However, does jumping from known inputs directly to the abstractly related desired values yield the most accurate results? What happens in cases where a series of more closely related variables can first be predicted and … Read more

Running course notebooks on Kaggle Kernel

Kaggle Kernels offer an ML-optimized Docker environment, a Tesla K80 GPU, internet access and uninterrupted 6-hour sessions. Any Clouderizer project can now be run seamlessly on Kaggle Kernels. What this means is our community project for, can now be run on Kaggle Kernels, just as easily. Below are the steps. Following pre-requisite, one time, steps … Read more

Modality tests and kernel density estimations

Dec 30, 2018 When processing a large number of datasets which can potentially have different data distributions, we are confronted with the following considerations: Is the data distribution unimodal and, if so, which model best approximates it (uniform distribution, t-distribution, chi-square distribution, Cauchy distribution, etc.)? If the data distribution is multimodal, can … Read more

This dance, it’s like a weapon: Radiohead’s and Beck’s danceability, valence, popularity, and more from the LastFM and Spotify APIs

When it comes to surreal lyrics and videos, I’m always thinking of Beck. Above, I cited the beginning of the song “Wow” from his latest album “Colors” which has received rather mixed reviews. In this post, I want to show you what I have done with Spotify’s API. We will also see how “Colors” compares to all … Read more

Deep Learning for Classical Japanese Literature

Overview The paper introduces 3 new benchmark datasets for Machine Learning, namely:
– Kuzushiji-MNIST: a drop-in replacement for the MNIST dataset (28×28)
– Kuzushiji-49: a much larger but imbalanced dataset containing 48 Hiragana characters and 1 Hiragana iteration mark (28×28)
– Kuzushiji-Kanji: an imbalanced dataset of 3832 Kanji characters, including rare characters with very few samples (64×64)
Fig 2. An example of … Read more

Monty Hall Problem using Python

Understanding mathematical proofs with the help of programming We have all heard the probability brain teaser for the three door game show. Each contestant guesses what’s behind the door, the show host reveals one of the three doors that didn’t have the prize and gives an opportunity to the contestant to switch doors. It is … Read more
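
The game described in this excerpt can be checked with a short simulation. The following is a minimal sketch, not the post's own code: the function names `play` and `win_rate`, the trial count, and the fixed seed are all assumptions made for illustration.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends on the prize door."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the single door that is still closed and not currently picked.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the win probability of a strategy over many simulated rounds."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps runs repeatable
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

With enough rounds, the estimate for staying should settle near 1/3 and the estimate for switching near 2/3, which matches the standard analysis of the puzzle.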

How to use corporate e-mail analysis to reveal hidden stars and ensure equal opportunities!

Article Series on Organizational Network Analysis and Communication Content Analysis Dec 30, 2018 In 1736, Leonhard Euler wrote a paper on the Seven Bridges of Königsberg. This paper is regarded as the first in the history of Graph Theory. Lately, Graph Theory has been utilized in the field of organizational communications and it is called … Read more

Extract features of Music

Different types of audio features and how to extract them. Dec 30, 2018 MFCC feature extraction Feature extraction is a very important part of analyzing and finding relations between different things. Audio data cannot be understood by models directly; feature extraction is used to convert it into an understandable format. It … Read more

Flower classification with Convolutional Neural Networks.

Agenda. Since I began to study deep learning on FastAI, this is my first attempt to implement an image classifier. I’m going to tell you (and understand better) how to create a simple and more or less accurate flower recognition model using the FastAI library. Data acquisition. We will use a public flowers dataset which is stored on Kaggle. It … Read more

Using pix2pix to create SnapChat lenses

Approach Today we will use the above-mentioned model to create a SnapChat filter. I will use the image below to show the results. All the images are 256×256 because I trained on the same size (Yes, I don’t have enough GPU power to train HD) Now, let’s gather training data! For that I will use my … Read more

Total Least Squares in comparison with OLS and ODR

Total least squares (aka TLS) is a regression analysis method that minimizes the sum of squared errors between the response variable (an observation) and the estimated variable (often called a fitted value). The most popular and standard method for the same purpose is ordinary least squares (aka OLS), and TLS is one of other … Read more

Explained: A Style-Based Generator Architecture for GANs – Generating and Tuning Realistic…

NVIDIA’s novel architecture for Generative Adversarial Networks Generative Adversarial Networks (GAN) are a relatively new concept in Machine Learning, introduced for the first time in 2014. Their goal is to synthesize artificial samples, such as images, that are indistinguishable from authentic images. A common example of a GAN application is to generate artificial face images … Read more

A War Amongst Siblings… Part II

My sister Kelly and I started a game of War while on vacation together in British Columbia. We ended up playing over multiple days of downtime without ever finishing it. Kelly felt like the game was going to go on forever, but I didn’t believe it was possible. I wrote a simulator in Python to … Read more

R or Python? Why not both? Using Anaconda Python within R with {reticulate}

This short blog post illustrates how easy it is to use R and Python in the same R Notebook thanks to the {reticulate} package. For this to work, you might need to upgrade RStudio to the current preview version. Let’s start by importing {reticulate}: library(reticulate) {reticulate} is an RStudio package that provides “a comprehensive set of tools … Read more

Grid Search for model tuning

A model hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from data. The value of the hyperparameter has to be set before the learning process begins. For example, c in Support Vector Machines, k in k-Nearest Neighbors, the number of hidden layers in Neural … Read more
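
To make the idea concrete, here is a minimal sketch of a grid search over k for a tiny 1-D k-Nearest Neighbors classifier, using only the standard library. Everything in it is an illustrative assumption rather than the post's code: the helper names (`knn_predict`, `accuracy`, `grid_search`), the made-up `TRAIN` and `VALID` points, and the candidate values in `GRID`.

```python
from itertools import product

# Made-up 1-D data for illustration only: (feature, label) pairs.
TRAIN = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (5.0, 1)]
VALID = [(0.5, 0), (1.5, 0), (3.5, 1), (4.5, 1)]
GRID = {"k": [1, 3, 5]}  # hypothetical candidate values, fixed before training


def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to the query."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, valid, k):
    """Fraction of validation points that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)


def grid_search(train, valid, grid):
    """Score every combination in the grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = accuracy(train, valid, **params)
        if score > best_score:  # strict '>' keeps the first of any tied settings
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    print(grid_search(TRAIN, VALID, GRID))
```

The search is exhaustive: each hyperparameter setting is evaluated on held-out data and the best-scoring one is kept. Libraries usually combine this loop with cross-validation instead of a single validation split.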

Road to Revolution: Socialism vs Communism

Subreddit Classification via PushShift API and Natural Language Processing I have always found the ideology behind Socialism and Communism to be very compelling during an era where socio-economic inequity continues to plague society and inhibits true progression as humankind. And while no one political / economic system is perfect, one must acknowledge the shortcomings … Read more

Spotify ReWrapped

Spotify surprises us every December with their cool end-of-the-year specials. Nevertheless, this year some of the reports smelled fishy. This humble Medium account decided to investigate. Photoshop skills: level 9000 Every year I expect Spotify’s summary. 2016 came with the usual top 5 of songs, artists and genres. The amount of minutes spent listening to music, … Read more

Understanding Compositional Pattern Producing Networks

Details of Evolving CPPNs At the beginning of evolution, the population of CPPNs is initialized with simple structures containing no hidden nodes, but, due to the augmenting topology feature of CPPNs, the population becomes increasingly more complex as evolution continues and topological mutations are applied. Extra nodes and connections are added into network structures during each … Read more

The Copernican Principle and How to Use Statistics to Figure Out How Long Anything Will Last

Being Right, Atomic Bombs, and Takeaways You might object that the answers from this equation are ridiculously wide, a point I’ll concede. However, the objective is not to get a single number (there are almost no situations, even when using the best algorithm, in which we can find the one number guaranteed to be spot on) but to find … Read more

Leaf Plant Classification: An Exploratory Analysis – Part 1

Categories Getting Data Tags Data Management Data Visualisation Exploratory Analysis R Programming In this post, I am going to run an exploratory analysis of the plant leaf dataset as made available by UCI Machine Learning repository at this link. The dataset is expected to comprise sixteen samples each of one-hundred plant species. Its analysis was … Read more

AirBnB in two cities: Seattle vs Boston

An analysis of pricing, availability and reviews of AirBnB listings AirBnB has become an increasingly popular alternative to regular hotel booking sites for travelers. It helps directly link those who have an extra room or apartment with travelers in need of a short term accommodation. In today’s post, we will delve into AirBnB datasets provided by … Read more

Be Resourceful — One Of The Most Important Skills To Succeed In Data Science

1. Being Resourceful is a Mindset Resourcefulness is a mindset. Period. This is especially relevant when the goals — or the problems — you have set are difficult to achieve or you cannot envision a clear path to get to where you desire to go. And this is perfectly fine. Many of us (including me) very often don’t have a … Read more

Part 5: Code corrections to optimism corrected bootstrapping series

The truth is out there R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model’s predictive power) with increasing p (completely random features). There were 2 implementations of the method … Read more

Immutability in public/private blockchains — Part 1

Dec 28, 2018 Immutability is a core value of a blockchain implementation. Before talking about the technical details of this matter in public and private blockchains, it may be useful to step back and think about what it means and why it’s important. What does immutability mean? Being as literal … Read more