Data Apocalypse!

The future of data storage What is Data? How is it stored, processed, transferred? What is the cloud? Will we eventually run out of space?! These are the questions that populated my fatigued mind as I tried to relax after a long day at the Flatiron School. [Disclaimer: an immersive program will do that you]. As … Read more

Quantum advantage

Quantum computing is becoming visible in the tech world. There are over a dozen of hardware companies, each trying to build their own quantum computer, from small startups like Xanadu through medium-sized ones like D-Wave or Rigetti to large enterprises like Google, Microsoft or IBM. On top of that there are couple of dozens software … Read more

Installing Hadoop 3.1.0 multi-node cluster on Ubuntu 16.04 Step by Step

Image Source: There are many links on the web about install Hadoop 3. Many of them are not working well or need improvements. This article is taken from the official documentation and other articles in addition of many answers from Note: All prerequisites must be applied on name node and data nodes First, … Read more

PyTorch 101 for Dummies like Me

Nov 5, 2018 What is PyTorch? It’s a Python-based package to serve as a replacement for Numpy arrays and to provide a flexible library forDeep Learning Development Platform. As for the why I prefer PyTorch over TensorFLow can be learned from this Fast AI’s blog post for the reason to switch to PyTorch. Or simply put, … Read more

The Austrian Quant: My Machine Learning Trading Algorithm Outperformed the SP500 For 10 Years

Austrian Quant The Austrian Quant is named after the Austrian School of Economics which serves as the inspiration for how I structured the portfolio. I designed a trading strategy composed of 3 different investment funds to gain a better understanding of investments, machine learning and programming and how they all combine together in the world … Read more

Building a Sentiment Detection Bot with Google Cloud, a Chat Client, and Ruby.

Introduction In this series, I’ll explain how to create a chat bot that is capable of detecting sentiment, analyzing images, and finally having the basis of a evolving personality. This is part 1 of that series. The Pieces Ruby Sinatra Google Cloud APIs Line (a chat client) Since I live in Japan: I’ll be using … Read more

A line-by-line layman’s guide to Linear Regression using TensorFlow

Computing the Graph With generate_dataset() and linear_regression(), we are now ready to run the program and begin finding our optimal gradient W and bias b! [line 2, 3] x_batch, y_batch = generate_dataset()x, y, y_pred, loss = linear_regression() In this run() function, we start off by calling generate_dataset() and linear_regression() to get x_batch, y_batch, x, y, y_pred … Read more

Perplexity Intuition (and Derivation)

The perplexity of a discrete probability distribution is defined as: from where H(p) is the entropy of the distribution p(x) and x is a random variable over all possible events. In the previous post, we derived H(p) from scratch and intuitively showed why entropy is the average number of bits that we need to … Read more

Neural Nets: From Linear Regression to Deep Nets

Neural networks, especially deep neural networks, have received a lot of attention over the last couple of years. They perform remarkably well on image and speech recognition and form the backbone of the technology used for self-driving cars. What many people find hard to believe is that the mathematics of neural networks have been around … Read more

SQL Server

Columnstore A columnstore index can provide a very high level of data compression, typically by 10 times, to significantly reduce your data warehouse storage cost. For analytics, a columnstore index offers an order of magnitude better performance than a btree index. Columnstore indexes are the preferred data storage format for data warehousing and analytics workloads. … Read more

The future of data visualization

Tools to shape the future In many product announcements from Google, Apple and BMW, more and more data will be overlaid in our physical environments through augmented reality or projection. That means not only will data be visualized more, but the visual reality around us will be turned into data. Data visualization of a new AR … Read more

The Best Public Datasets for Machine Learning

First, a couple of pointers to keep in mind when searching for datasets. According to Carnegie Mellon University: 1.- A high-quality dataset should not be messy, because you do not want to spend a lot of time cleaning data. 2.- A high-quality dataset should not have too many rows or columns, so it is easy … Read more

The intuition behind Shannon’s Entropy

Now, back to our formula 3.49: The definition of Entropy for a probability distribution (from The Deep Learning Book) I(x) is the information content of X. I(x) itself is a random variable. In our example, the possible outcomes of the War. Thus, H(x) is the expected value of every possible information. Using the definition of expected … Read more

Image Processing Class (EGBE443) #2 -Histogram

Computing the histogram In this section, the histogram was calculated by implementation of python programming code (Python 3.6). For python 3.6, There are a lot of common modules using in image processing such as Pillow, Numpy, OpenCV, etc. but in this program Pillow and Numpy module was used. To import the image from your computer, … Read more

Lazy Neural Networks

Before I get into solutions I think it is important to discuss some overarching themes of deep learning. Training Objectives Remember that when we create a neural network, what we are effectively doing is designing an experiment. We have data, a model architecture and a training objective. The data you provide is the models universe … Read more

How to get fbprophet working on AWS Lambda

Solving package size issues of fbprophet serverless deployment Adi Goldstein / Unsplash I assume you’re reading this post because you’re looking for ways to use the awesome fbprophet (Facebook open source forecasting) library on AWS Lambda and you’re already familiar with the various issues around getting it done. I will be using a python 3.6 … Read more

Multi-Layer perceptron using Tensorflow

Sep 11, 2018 In this blog, we are going to build a neural network(multilayer perceptron) using TensorFlow and successfully train it to recognize digits in the image. Tensorflow is a very popular deep learning framework released by, and this notebook will guide for build a neural network with this library. If you want to understand … Read more

Diving into K-Means…

Sep 9, 2018 We have completed our first basic supervised learning model i.e. Linear Regression model in the last post here. Thus in this post we get started with the most basic unsupervised learning algorithm- K-means Clustering. Let’s get started without further ado! Background: K-means clustering as the name itself suggests, is a clustering algorithm, … Read more

3 approaches for backtesting historical data

Reading and processing data for statistical and quantitative analysis in trading Sep 8, 2018 Anyone interested in the statistical analysis of financial markets has the need to process historical data. Historical data is needed in order to backtest or train: Quantitative trading. Statistical trading. Price action replay/walkthrough. Each need comes from different goals. 3 examples on … Read more

Microsoft Big Data Overview Block 1 – Data Fundamentals Learn data science basics. Explore topics like data queries, data analysis, data visualization and how statistics informs data science practices. Please choose from Course 2a or Course 2b to complete the unit. Course 1: Microsoft Professional Program: Introduction to Big Data Course 2a: Analyzing and Visualizing Data with Power … Read more

Box Cox Transformation

When we do time series analysis, we are usually interested either in uncovering causal relationships (Does \(X_t\) influence \(Y_{t+1}\)?) or in getting the most accurate forecast possible. Especially in the second case it can be beneficial to transform our historical data to make it easier to extract a signal. A very common transformation is to … Read more

Doing XGBoost hyper-parameter tuning the smart way — Part 1 of 2

Aug 29, 2018 Picture taken from Pixabay In this post and the next, we will look at one of the trickiest and most critical problems in Machine Learning (ML): Hyper-parameter tuning. After reviewing what hyper-parameters, or hyper-params for short, are and how they differ from plain vanilla learnable parameters, we introduce three general purpose discrete optimization … Read more

Introduction to stochastic control theory

I had my first contact with stochastic control theory in one of my Master’s courses about Continuous Time Finance. I found the subject really interesting and decided to write my thesis about optimal dividend policy which is mainly about solving stochastic control problems. In this post I want to give you a brief overview of … Read more

Automatic Image Quality Assessment in Python

Aug 28, 2018 Image quality is a notion that highly depends on observers. Generally, it is linked to the conditions in which it is viewed; therefore, it is a highly subjective topic. Image quality assessment aims to quantitatively represent the human perception of quality. These metrics are commonly used to analyze the performance of algorithms in … Read more

Neural Processes: Probabilistic Gaussian Process+Deep Learning

Neural Processes (NPs) caught my attention as they essentially are a neural network (NN) based probabilistic model which can represent a distribution over stochastic processes. So NPs combine elements from two worlds:

Deep Learning – neural networks are flexible non-linear functions which are straightforward to train
Gaussian Processes – GPs offer a probabilistic framework for learning a distribution over a wide class of non-linear functions

Categories Featured ExcerptFavorite

Opensourcing TransmogrifAI: Automated ML for Structured Data

Despite huge progress in machine learning over the past decade, building production-ready machine learning systems is still hard. Three years ago when we set out to build machine learning capabilities into the Salesforce platform, we learned that building enterprise-scale machine learning systems is even harder.

Categories Featured ExcerptFavorite

Program Sythesis: Can We Teach Computers to Write Code?

Can we teach computers to write code? This is the question that brings out an entire branch of research specialized in program synthesis. Programming is a demanding task that requires extensive knowledge, experience and not a frivolous degree of creativity.

Categories Featured ExcerptFavorite

The One Probability Review That You Need

Probability and statistics are everywhere: from finance and demographic projections to casino games, these disciplines help us make sense of the world. They also underlie much of the machine learning apparatus that is the rage nowadays. What resources should we turn to, if we were to dust off our knowledge of them? (Disclaimer: I received … Read more

Differentiable Rendering

Sounds cool, but … what is it? As I’ve started to pay more attention to machine learning, differentiable rendering is one topic that caught my attention and has been popping up with some frequency. My first thought was, “cooooool is this a new system for generating pixels that somehow can leverage machine learning?” After digging … Read more

Mapping the UK’s Traffic Accident Hotspots

While looking for some interesting geographical data to work with, I came across the Road Safety Data published by the UK government. This is a very comprehensive road accident data set that includes the incident’s geographical coordinates, as well as other related data such as the local weather conditions, visibility, police attendance and more. There … Read more

R Functions for Bayesian Stats and Summaries

A new update of my sjstats-package just arrived at CRAN. This blog post demontrates those functions of the sjstats-package that deal especially with Bayesian models. The update contains some new and some revised functions to compute summary statistics of Bayesian models, which are now described in more detail.

Categories R ExcerptFavorite

Google’s AutoML Killer: Auto-Keras Opensource Automated ML

Auto-Keras is an open source software library for automated machine learning (AutoML). It is developed by DATA Lab at Texas A&M University and community contributors. The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background. Auto-Keras provides functions to automatically search … Read more

Categories Python ExcerptFavorite

Multiplicative RNN-LSTM for Sequence-based Recommenders

Recommender Systems support the decision making processes of customers with personalized suggestions. They are widely used and influence the daily life of almost everyone in different domains like e-commerce, social media, or entertainment. Quite often the dimension of time plays a dominant role in the generation of a relevant recommendation. Which user interaction occurred just before … Read more

Categories Python ExcerptFavorite

A Guide to Restricted Boltzmann Machines Using Pytorch

A Boltzmann machine defines a probability distribution over binary-valued patterns. What makes Boltzmann machine models different from other deep learning models is that they’re undirected and don’t have an output layer. The other key difference is that all the hidden and visible nodes are all connected with each other. Due to this interconnection, Boltzmann machines … Read more

Categories Python ExcerptFavorite

Data Science Austria

The last few months I set out to build up to build a news and event aggregator. You can see the work in progress here: WordPress Plugins Here is a list of plugins that I use for the site grouped by the general overall purpose. The first one is a collection that I would … Read more

What Does It Really Mean to Operationalize a Predictive Model?

It is not enough to just stand up a web service that can make predictions. Aug 13, 2018 Original Image Source — Meme overlay by Imgflip In a 2017 SAS survey, 83% of organizations have made moderate-to- significant investments in big data, but only 33% say they have derived value from their investments. Other more recent surveys have … Read more

Practical tips for class imbalance in binary classification

4. Class weighted / cost sensitive learning Without resampling the data, one can also make the classifier aware of the imbalanced data by incorporating the weights of the classes into the cost function (aka objective function). Intuitively, we want to give higher weight to minority class and lower weight to majority class. scikit-learn has a … Read more