I had my first contact with stochastic control theory in one of my Master’s courses about Continuous Time Finance. I found the subject really interesting and decided to write my thesis about optimal dividend policy which is mainly about solving stochastic control problems. In this post I want to give you a brief overview of … Read more Introduction to stochastic control theory
Aug 28, 2018 Image quality is a notion that highly depends on observers. Generally, it is linked to the conditions in which it is viewed; therefore, it is a highly subjective topic. Image quality assessment aims to quantitatively represent the human perception of quality. These metrics are commonly used to analyze the performance of algorithms in … Read more Automatic Image Quality Assessment in Python
Neural Processes (NPs) caught my attention as they essentially are a neural network (NN) based probabilistic model which can represent a distribution over stochastic processes. So NPs combine elements from two worlds:
Deep Learning – neural networks are flexible non-linear functions which are straightforward to train
Gaussian Processes – GPs offer a probabilistic framework for learning a distribution over a wide class of non-linear functions
Despite huge progress in machine learning over the past decade, building production-ready machine learning systems is still hard. Three years ago when we set out to build machine learning capabilities into the Salesforce platform, we learned that building enterprise-scale machine learning systems is even harder.
Can we teach computers to write code? This is the question that drives an entire branch of research specialized in program synthesis. Programming is a demanding task that requires extensive knowledge, experience, and no small degree of creativity.
Probability and statistics are everywhere: from finance and demographic projections to casino games, these disciplines help us make sense of the world. They also underlie much of the machine learning apparatus that is the rage nowadays. What resources should we turn to, if we were to dust off our knowledge of them? (Disclaimer: I received … Read more The One Probability Review That You Need
Sounds cool, but … what is it? As I’ve started to pay more attention to machine learning, differentiable rendering is one topic that caught my attention and has been popping up with some frequency. My first thought was, “cooooool is this a new system for generating pixels that somehow can leverage machine learning?” After digging … Read more Differentiable Rendering
While looking for some interesting geographical data to work with, I came across the Road Safety Data published by the UK government. This is a very comprehensive road accident data set that includes the incident’s geographical coordinates, as well as other related data such as the local weather conditions, visibility, police attendance and more. There … Read more Mapping the UK’s Traffic Accident Hotspots
A new update of my sjstats-package just arrived at CRAN. This blog post demonstrates those functions of the sjstats-package that deal especially with Bayesian models. The update contains some new and some revised functions to compute summary statistics of Bayesian models, which are now described in more detail.
TDAstats is an R pipeline for topological data analysis, specifically, the use of persistent homology in Vietoris-Rips simplicial complexes to study the shape of data.
Auto-Keras is an open source software library for automated machine learning (AutoML). It is developed by DATA Lab at Texas A&M University and community contributors. The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background. Auto-Keras provides functions to automatically search … Read more Google’s AutoML Killer: Auto-Keras Opensource Automated ML
Recommender Systems support the decision making processes of customers with personalized suggestions. They are widely used and influence the daily life of almost everyone in different domains like e-commerce, social media, or entertainment. Quite often the dimension of time plays a dominant role in the generation of a relevant recommendation. Which user interaction occurred just before … Read more Multiplicative RNN-LSTM for Sequence-based Recommenders
A Boltzmann machine defines a probability distribution over binary-valued patterns. What makes Boltzmann machine models different from other deep learning models is that they’re undirected and don’t have an output layer. The other key difference is that all the hidden and visible nodes are all connected with each other. Due to this interconnection, Boltzmann machines … Read more A Guide to Restricted Boltzmann Machines Using Pytorch
Over the last few months I set out to build a news and event aggregator. You can see the work in progress here: data-science-austria.at WordPress Plugins Here is a list of plugins that I use for the site, grouped by general purpose. The first one is a collection that I would … Read more Data Science Austria
It is not enough to just stand up a web service that can make predictions. Aug 13, 2018 Original Image Source — Meme overlay by Imgflip In a 2017 SAS survey, 83% of organizations have made moderate-to-significant investments in big data, but only 33% say they have derived value from their investments. Other more recent surveys have … Read more What Does It Really Mean to Operationalize a Predictive Model?
4. Class weighted / cost sensitive learning Without resampling the data, one can also make the classifier aware of the imbalanced data by incorporating the weights of the classes into the cost function (aka objective function). Intuitively, we want to give higher weight to minority class and lower weight to majority class. scikit-learn has a … Read more Practical tips for class imbalance in binary classification
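The weighting idea in this excerpt can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the common "balanced" heuristic in which each class weight is `n_samples / (n_classes * class_count)`; the function name `balanced_class_weights` and the toy labels are hypothetical and not taken from the original post.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Rare classes get large weights, frequent classes get small ones.
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy imbalanced data set: 90 majority samples (0) and 10 minority samples (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these toy labels the minority class receives a weight of 5.0 and the majority class roughly 0.56, so each misclassified minority sample contributes about nine times as much to a weighted loss.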
The nature of the problem: medical fraud and abuse The U.S. department of health and human services in a pamphlet Avoiding Medicare Fraud and Abuse: A Roadmap for Physicians states “most physicians strive to work ethically, render high-quality medical care to their patients, and submit proper claims for payment,” yet “the presence of some dishonest … Read more Feature Engineering for Healthcare Fraud Detection
There are a multitude of options when it comes to storing and processing data. In this post I want to give you a brief overview of Azure SQL Data Warehouse, Microsoft’s data warehouse solution for the Azure cloud and its answer to Amazon Redshift on AWS. I will start off by talking briefly about its technical architecture … Read more Azure SQL DWH – Overview
Aug 2, 2018 Photo by JESHOOTS.COM on Unsplash Look at this equation: Value function of Reinforcement Learning If it does not intimidate you, then you are mathematically savvy and there is no point in reading this article 🙂 This article is not about teaching Reinforcement Learning (RL) but about explaining the math behind it. So it … Read more Math Behind Reinforcement Learning, the Easy Way
Recently I came across this cooking recipes data set in Kaggle, and it inspired me to combine 2 of my main interests in life. Food and machine learning. What makes this data set special is that it contains recipes from 20 different cuisines, 6714 different ingredients, but only 26648 samples. Some cuisines have way fewer … Read more Cooking with Machine Learning: Dimension Reduction
So you’ve seen the recent news about how artificial intelligence (AI) is changing everything. However, the idea of AI has been around for a long time. Machines that think and talk like humans have been the inspiration for movies and stories for decades. But what’s the deal? Why has AI been getting better and better … Read more An In-depth Review of Andrew Ng’s deeplearning.ai Specialization
Estimators were introduced in version 1.3 of the Tensorflow API, and are used to abstract and simplify training, evaluation and prediction. If you haven’t worked with Estimators before, I suggest you start by reading this article to gain some familiarity, as I won’t be covering all of the basics of using Estimators. In no means … Read more An Advanced Example of Tensorflow Estimators Part (1/3)
Jul 19, 2018 Hypothesis analysis is a widely known concept and is used extensively by researchers, statisticians and quantitative analysts. It allows them to follow a set of formal steps to perform calculated analysis on their data. It is also widely used in machine learning and artificial intelligence. In this article, I will be explaining core concepts of … Read more Hypothesis Analysis Explained
Docker is a tool which helps developers build and ship high quality applications, faster, anywhere. Source Why Docker With Docker, developers can build any app in any language using any toolchain. Dockerized apps are completely portable and can run anywhere. Developers can get going by just spinning up any container from the list on Docker Hub. … Read more Docker Basics
Jul 8, 2018 In this tutorial we will discuss integrating PySpark and XGBoost using a standard machine learning pipeline. We will use data from Titanic: Machine learning from disaster, one of the many Kaggle competitions. Before getting started, please note that you should be familiar with Apache Spark, XGBoost and Python. The … Read more PySpark ML and XGBoost full integration tested on the Kaggle Titanic dataset
In the previous post I covered the basics you need to know to work with SQL Server. In this post, I want to show you some more advanced techniques that I found pretty helpful. The topics I will cover include: How to speed up your queries with indices and using columnstore Using Views and Table … Read more More advanced SQL Server for Data Scientists
DIY Noise-Cancellation System prototype made with TensorFlow. Jun 25, 2018 Image by TheDigitalArtist on Pixabay In this post I describe how I built an active noise cancellation system by means of neural networks on my own. I’ve just got my first results which I am sharing, but the system looks like a ravel of scripts, binaries, … Read more Acoustic Noise Cancellation by Machine Learning
After getting the scrum.org PSM I certification, I wanted to capture the relevant content. The complete guide can be downloaded here: scrumguides.org 1. What is Scrum? Scrum is a framework for developing and sustaining complex products. A framework in which complex adaptive problems can be addressed. It is lightweight, simple to understand and yet difficult to … Read more Scrum PSM I
Introduction The learning rate might be the most important hyperparameter in deep learning, as it decides how much of the gradient is back-propagated. This in turn decides how far we move towards the minimum at each step. A small learning rate makes the model converge slowly, while a large learning rate makes the model diverge. So, the learning rate … Read more Finding Good Learning Rate and The One Cycle Policy.
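The slow-versus-diverging behaviour described in this excerpt can be seen on a toy problem. The following is a minimal sketch, assuming plain gradient descent on the one-dimensional function f(x) = x², whose gradient is 2x; the function name `gradient_descent` and the three learning rates are illustrative choices, not values from the original post.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Run plain gradient descent on f(x) = x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x   # derivative of x**2
        x -= lr * grad   # step against the gradient, scaled by the learning rate
    return x


# The minimum is at x = 0; compare how close each learning rate gets.
for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: x after 50 steps = {gradient_descent(lr):.4g}")
```

Each step multiplies x by (1 - 2·lr), so a small rate such as 0.01 shrinks x only slowly, a moderate rate such as 0.4 reaches the minimum quickly, and any rate above 1.0 makes |x| grow at every step.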
I’ve been involved in building several different types of recommendation systems, and one thing I’ve noticed is that each use case is different from the next, as each aims to solve a different business problem. Let’s consider a few examples: Movie/Book/News Recommendations — Suggest new content that increases user engagement. The aim is to introduce users to … Read more Recommendation Systems — Models and Evaluation
Many data professionals are strict on the language to be used for ANN models limiting their dev. environment exclusively to Python. I decided to test performance of Python vs. R in terms of time required to train a convolutional neural network based model for image recognition. As the starting point, I took the blog post … Read more R vs Python: Image Classification with Keras
Jun 14, 2018 This post is about implementing a simple linear regression model for ML beginners, step by step and with detailed explanation. If you are new to machine learning, check this post to get a clear idea of Machine Learning and its basics. What is the logic behind simple linear regression model? As the … Read more Linear Regression Model
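For readers who want to see the logic in code, here is a minimal sketch of simple linear regression, assuming the standard closed-form least-squares solution (slope = covariance of x and y divided by variance of x). The function name `fit_simple_linear_regression` and the toy data are hypothetical; the original post may take a different route, such as gradient descent.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data generated from y = 2x + 1 without noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```

Because the toy data lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1; with noisy data the same formulas give the line that minimizes the squared vertical errors.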
Using MQTT protocol, we will get captured data from sensors, logging them to an IoT service, ThingSpeak.com and to a mobile App, Thingsview. 1. Introduction In my previous article, MicroPython on ESP using Jupyter, we learned how to install and run MicroPython on an ESP device. Using Jupyter Notebook as our development environment, we also … Read more IoT Made Easy: ESP-MicroPython-MQTT-ThingSpeak
When you are using Google’s Colaboratory (Colab) for running your Deep Learning models, the most obvious way to access large datasets is by storing them on Google Drive and then mounting Drive onto the Colab environment. But a lot of open-source large datasets that are available for research purposes are hosted on Github/Gitlab. … Read more From Git to Colab, via SSH
Since R is mostly a functional language and data science work lends itself to being expressed in a functional form, you can get by just fine without learning about object-oriented programming. Personally, I mostly follow a functional programming style (although often not a pure one, i.e. w/o side-effects, because of limited RAM). Expressing mathematical concepts in … Read more Object Oriented Programming in Data Science with R
Over the past few decades, four key change initiatives have been taking place in organizations: strategic planning, re-engineering, total quality management and downsizing. The aim of these initiatives was to achieve economic effectiveness, but around 75% of them failed or created problems that were serious enough to threaten the organization’s survival (1). It has been … Read more DevOps: To do or not to do?
Measuring the effect of an intervention on some metric is an important problem in many areas of business and academia. Imagine you want to know the effect of a recently launched advertising campaign on product sales. In an ideal setting, you would have a treatment and a control group so that you can measure the … Read more Estimating Intervention Effects using Bayesian Models in R
One of the things I particularly like about working in data science, is the science part: Figuring out the right questions to ask, how to frame a problem correctly and finally trying to solve it. While there are many problems that you can simply solve by library(caret) or from sklearn import * and dumping your … Read more A Framework to tackle tough Data Science Problems
data.table is an awesome R package, but there are a few things you need to watch out for when using it. R usually does not modify objects in place (e.g. by reference), but makes a copy when you change a value and saves this copy. This can be a problem if you work with large datasets … Read more Using data.table deep copy
When you work for a large corporation you often have little choice in picking a specific operating system for your company laptop. This post is a collection of random problems I ran into in the past mostly with R and Python on Windows 10 and how to resolve them. I plan to update this post … Read more Data Science with Windows 10 – Quick Fixes
This post attempts to consolidate information on tree algorithms and their implementations in Scikit-learn and Spark. In particular, it was written to provide clarification on how feature importance is calculated. There are many great resources online discussing how decision trees and random forests are created and this post is not intended to be that. Although … Read more The Mathematics of Decision Trees, Random Forest and Feature Importance in Scikit-learn and Spark
Here is a great tutorial on how to host hugo on netlify Other examples using the exact same theme: Creating the hugo site In order to create a new hugo site simply run: hugo new site [path] [flags] Create a new repository via git: init the git repo and push it to the GitHub repo: … Read more Blogging with hugo & netlify
SQL is not the sexiest language on the block and many/most data scientists I know prefer to stick to R and/or Python. Some common complaints I hear about SQL are: It is hard to read and as a consequence large SQL statements are hard to debug. Version control with databases often requires additional tooling to … Read more SQL Server for Data Scientists
Creating an R package is as easy as typing: package.skeleton(name = “YourPackageName”) As you might have guessed, this function creates the basic file and folder structure you need to create an R package. You will get: YourPackageName/ DESCRIPTION man/ NAMESPACE R/ You can also use RStudio to create a package with File > New Project … Read more Package development in R – Overview
Many data scientists are former academics who are used to working on specific and often quite narrow research problems for long periods of time, often years. With data science being in high demand at the moment in nearly all industries, more and more researchers switch from an academic career to one in the private … Read more Agile Project Management for Data Science
Apr 15, 2018 In this post, we will tackle one of the most challenging yet interesting problems in Natural Language Processing, aka Question Answering. We will implement Google’s QANet in Tensorflow. Just like its machine translation counterpart Transformer network, QANet doesn’t use RNNs at all which makes it faster to train / test. I’m assuming … Read more Implementing QANet (Question Answering Network) with CNNs and self attentions
On centralizing siloed data Apr 12, 2018 I still get nostalgic looking at the very first Pebbles. (Photo courtesy of Pebble’s first Kickstarter) In 2014, I joined Pebble, the smartwatch maker later acquired by Fitbit, to lead their data science & analytics team. I was interested in the challenges of managing a data organization at a … Read more What I wish I’d done differently as a data science manager
Greg Lamp, previous co-founder of the data science startup Yhat, and current co-founder & CTO of Waldo shares his thoughts on Machine Learning for those of us who just don’t care about Machine Learning. What is Machine Learning? The definition I have come up with for Machine Learning is as follows… machine learning is using … Read more Machine Learning for People Who Don’t Care About Machine Learning
I am an absolute fan of adapting your work environment to your needs. Spending an hour to set up some shortcuts is virtually always a good time investment. Then you can easily drag your most used commands into a new bar. You should be able to save a lot of time on, e.g. aligning objects … Read more Office Ribbons
Dissimilarity Matrix Arguably, this is the backbone of your clustering. A dissimilarity matrix is a mathematical expression of how different, or distant, the points in a data set are from each other, so you can later group the closest ones together or separate the furthest ones, which is a core idea of clustering. This is the step where … Read more Hierarchical Clustering on Categorical Data in R
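To make the idea of a dissimilarity matrix concrete, here is a minimal sketch for categorical data. It assumes the simple matching measure (the fraction of attributes on which two records differ), which is only one possible choice and not necessarily the one used in the original post; the function names and the toy records are hypothetical.

```python
def simple_matching_dissimilarity(a, b):
    """Fraction of attributes on which two categorical records differ."""
    mismatches = sum(1 for u, v in zip(a, b) if u != v)
    return mismatches / len(a)


def dissimilarity_matrix(rows):
    """Symmetric matrix of pairwise dissimilarities with zeros on the diagonal."""
    n = len(rows)
    return [
        [simple_matching_dissimilarity(rows[i], rows[j]) for j in range(n)]
        for i in range(n)
    ]


# Three toy records with three categorical attributes each.
rows = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
matrix = dissimilarity_matrix(rows)
for line in matrix:
    print([round(d, 2) for d in line])
```

The first two records differ in one of three attributes, so their dissimilarity is 1/3, while the first and third share nothing and get 1.0. A hierarchical clustering routine would merge the closest pair first.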