Data Passing & Races

We already covered different ways to pass data to threads during thread construction. Now we will take a look at sending data from the worker to the parent thread as well. To do this, threads follow a very strict synchronization protocol implemented in the C++ standard. C++ includes a single-use channel to pass data (and … Read more

Mutexes and Locks

To avoid data races, C++ has a concept called mutex, short for MUTual EXclusion. A mutex manages access to a shared resource and ensures that only one thread at a time is able to access the resource: thread 1 locks the resource through the mutex and accesses shared variable x, while access is blocked for thread … Read more

Intro to C++ Concurrency

Let’s start with the very basics before we work through a simple example in C++. Overview: When we want to perform multiple pieces of work simultaneously, we can choose to run them in either a synchronous or an asynchronous manner: [timeline diagrams: Synchronous (main(), func(), return(), thread()); Asynchronous (main(), func(), …)] … Read more

CallBy: Pointer/Reference/Const Reference

By default C++ uses call-by-value, but copying large arguments can be expensive, so we can also use call by pointer, reference, or const reference. Since as a user we cannot see whether a function that takes an argument actually modifies it, we can use const references to have the compiler guarantee that an argument that … Read more

Pointers – Syntax Overview

This post is a short reminder to my future self summarising pointer syntax: Pointer Basics #include <iostream> int main() { int a = 5; int b = 15; int* p1 = &a; //initialize int pointer p1 and assign address of a to pointer p1 int* p2 = &b; //initialize int pointer p2 and assign address … Read more

Intro Memory Management in C++

Most people’s dislike of C++ probably comes from the fact that when you write in C++, you have to manage memory yourself. While this can be frustrating at times (hello there, segfault and bad_alloc 🙂), it is actually immensely powerful and interesting. So let’s get right to it :) Memory Addresses and Hexadecimal Numbers: Computers store information … Read more

From Basic to Advanced OOP in C++

C++ is also commonly called C with classes, and in this post I want to give you an overview of basic and advanced OOP concepts in C++. Let’s start from the very beginning with structures: Structures: Structs allow you to define your own custom types to complement the built-in types (also called ‘primitives’ or fundamental … Read more

C++ Foundations

When I left my old job at Austrian Post, my colleagues got me an amazing parting gift: an Amazon Deep Racer 🙂 While it is quite easy to work with Deep Racer using AWS SageMaker, I wanted to program Deep Racer directly. Deep Racer runs on Ubuntu Linux with the Robot Operating System (ROS). ROS is … Read more

A* Search in C++

In this post I want to discuss the A* algorithm (pronounced ‘a-star’), how it is used for motion planning, and how we can implement it in C++. The process of determining how to get from a start point to an end point is called ‘planning’, and for a robot it’s called ‘robot motion planning’. There are two … Read more

Scikit-learn Complete Walk Through (for R-Users)

Python is one of the most used languages, if not the most used, among data scientists interested in machine learning and deep learning techniques. One reason for Python’s popularity is without a doubt scikit-learn. scikit-learn does not need an introduction, but in case you are new to the machine learning space in Python, scikit-learn is an all-in-one machine learning … Read more

PCA, Eigenvectors and the Covariance Matrix

Almost every data science course will at some point (usually sooner rather than later) cover PCA, i.e. Principal Component Analysis. PCA is an important tool used in exploratory data analysis for dimensionality reduction. In this post I want to show you (hopefully in an intuitive way) how PCA works its mathematical magic. Let’s start with a short … Read more

Linear Algebra Refresher

Since quite some time has passed since I took my linear algebra courses, I thought I could comb through my old course notes and write a small post about the linear algebra material that I find most useful to remember. Let’s start with some computation rules for working with matrices: Some Matrix Calculation Rules … Read more

Linux on Razer Blade 15 (mid 2019)

As a long-time MacBook user I recently made the switch to a Windows laptop, specifically a Razer Blade 15 (mid 2019) model. While Apple has thankfully started to use proper keyboards again in their recent lineup, the lack of Nvidia GPUs is somewhat sub-optimal if you want to use your laptop for deep learning … Read more

Operations Research 101 with Python

Recently my sister asked me to help her with some linear programming problems in Excel. I have not used Excel Solver for what feels like decades, and I was pleasantly surprised by how easy it was to set up and solve a linear programming problem with it. But since I am not really an Excel fan, I … Read more

An Introduction to Statistical Learning

In this blog post I want to give a brief overview of the core ideas behind the statistical learning framework and show you how to implement a few simple models in this framework. This post is based on the excellent work “Tree Boosting With XGBoost – Why Does XGBoost Win ‘Every’ Machine Learning Competition” by … Read more

Creating Azure Logic Apps from R using httr

Logic Apps is a serverless framework in Azure, quite similar to IFTTT (if this, then that) and Zapier, that allows you to connect different services and create workflows. You can define different types of triggers, based on time and events (e.g. HTTP requests, messages received, …), to start workflows. Logic Apps can be created using a … Read more

OAuth 2.0 in R

Many APIs require some form of authentication. A very common form used with many cloud providers and commercial APIs is OAuth 2.0. In this post, I want to give you an introduction to how OAuth 2.0 works and use it to authenticate with Microsoft Azure services. OAuth 2.0 – Overview The OAuth 2.0 specification is … Read more

Working with REST APIs for Data Scientists in R

With the growing importance of cloud computing, more and more services are exposed as REST APIs. In this post, I want to give a hands-on introduction for data scientists from non-software-engineering backgrounds on how to work with REST APIs. But before we dive straight into the code, let’s start with some background information: A (short) … Read more

Introduction – Analysing Customer Churn

At first glance, analysing customer churn seems pretty easy. All we have to know is how many customers we have at a certain point in time and how many customers chose to leave our business over a given period in order to calculate a churn rate. We could simply define customer churn rate as: \[ … Read more
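The naive calculation described in the excerpt above can be sketched in a few lines of Python. This is a minimal illustration, assuming the rate is the number of customers who left during a period divided by the number of customers at the start of that period; the function name `churn_rate` and the example figures are hypothetical and not taken from the post.

```python
def churn_rate(customers_at_start, customers_lost):
    """Naive churn rate: share of the starting customers lost over a period.

    Assumption: the denominator is the customer count at the start of the
    period; other definitions (e.g. the average customer count) are possible.
    """
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return customers_lost / customers_at_start


# Hypothetical example: 1000 customers at the start, 50 of them left.
rate = churn_rate(1000, 50)
print(f"churn rate: {rate:.1%}")
```

The input checks are there only to keep the sketch from silently returning a meaningless ratio.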

Git & SSH with PowerShell Core

In this post I want to give a quick outline of how to set up PowerShell Core (Microsoft’s cross-platform version of PowerShell) to work with Git and SSH. While you can simply install Git for Windows and work with Git Bash, personally I quite like PowerShell Core, because it is more tightly integrated with Windows and … Read more

Pandas for data.table Users

R and Python are both great languages for data analysis. While they are remarkably similar in some aspects, they are drastically different in others. In this post, I will focus on the similarities and differences between Pandas and data.table, two of the most prominent data manipulation packages in Python/R. There is already an excellent post … Read more

The Perceptron Algorithm

In my blog post Neural Nets: From Linear Regression to Deep Nets I talked about how a deep neural net is simply a sequence of simple building blocks of the form: \[\sigma(\underbrace{w^T}_{weights}x + \overbrace{b}^{bias}) = a\] and that a linear regression model is one of the most basic neural networks where the activation function \(\sigma\) … Read more
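The building block \(\sigma(w^T x + b) = a\) from the excerpt above can be written out as a small Python sketch. The activation is passed in as a parameter; the sigmoid used in the example is only an assumed choice for illustration, and the names `neuron` and `sigmoid` as well as the numbers are hypothetical.

```python
import math


def neuron(weights, x, bias, activation):
    """Compute a = activation(w^T x + b) for a single building block."""
    if len(weights) != len(x):
        raise ValueError("weights and x must have the same length")
    # Weighted sum w^T x plus the bias term b.
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
    return activation(z)


def sigmoid(z):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs: w^T x + b = 0.5*2.0 - 0.25*4.0 + 0.0 = 0.0
a = neuron([0.5, -0.25], [2.0, 4.0], 0.0, sigmoid)
print(a)
```

Swapping the `activation` argument for another function changes the type of unit without touching the weighted-sum part.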

Blogging with Hugo and Jupyter

I really love blogging with Hugo+Blogdown, but unfortunately Blogdown is still mostly restricted to R (although Python is now also possible using the reticulate package). Jupyter offers a great literate programming environment for multiple languages and so being able to publish Jupyter notebooks as Hugo blogposts would be a huge plus. I have been looking … Read more

Neural Nets: From Linear Regression to Deep Nets

Neural networks, especially deep neural networks, have received a lot of attention over the last couple of years. They perform remarkably well on image and speech recognition and form the backbone of the technology used for self-driving cars. What many people find hard to believe is that the mathematics of neural networks have been around … Read more

Box Cox Transformation

When we do time series analysis, we are usually interested either in uncovering causal relationships (Does \(X_t\) influence \(Y_{t+1}\)?) or in getting the most accurate forecast possible. Especially in the second case it can be beneficial to transform our historical data to make it easier to extract a signal. A very common transformation is to … Read more

Introduction to stochastic control theory

I had my first contact with stochastic control theory in one of my Master’s courses about Continuous Time Finance. I found the subject really interesting and decided to write my thesis about optimal dividend policy which is mainly about solving stochastic control problems. In this post I want to give you a brief overview of … Read more

Azure SQL DWH – Overview

There is a multitude of options when it comes to storing and processing data. In this post I want to give you a brief overview of Azure SQL Data Warehouse, Microsoft’s data warehouse solution for the Azure cloud and its answer to Amazon Redshift on AWS. I will start off by talking briefly about its technical architecture … Read more

More advanced SQL Server for Data Scientists

In the previous post I covered the basics you need to know to work with SQL Server. In this post, I want to show you some more advanced techniques that I found pretty helpful. The topics I will cover include: How to speed up your queries with indices and using columnstore Using Views and Table … Read more

Object Oriented Programming in Data Science with R

Since R is mostly a functional language and data science work lends itself to being expressed in a functional form, you can get by just fine without learning about object-oriented programming. Personally, I mostly follow a functional programming style (although often not a pure one, i.e. w/o side-effects, because of limited RAM). Expressing mathematical concepts in … Read more

Estimating Intervention Effects using Bayesian Models in R

Measuring the effect of an intervention on some metric is an important problem in many areas of business and academia. Imagine you want to know the effect of a recently launched advertising campaign on product sales. In an ideal setting, you would have a treatment and a control group so that you can measure the … Read more

Using data.table deep copy

data.table is an awesome R package, but there are a few things you need to watch out for when using it. R usually does not modify objects in place (i.e. by reference), but makes a copy when you change a value and saves this copy. This can be a problem if you work with large datasets … Read more

SQL Server for Data Scientists

SQL is not the sexiest language on the block and many/most data scientists I know prefer to stick to R and/or Python. Some common complaints I hear about SQL are: It is hard to read and, as a consequence, large SQL statements are hard to debug. Version control with databases often requires additional tooling to … Read more

Package development in R – Overview

Creating an R package is as easy as typing: package.skeleton(name = "YourPackageName") As you might have guessed, this function creates the basic file and folder structure you need to create an R package. You will get: YourPackageName/ DESCRIPTION man/ NAMESPACE R/ You can also use RStudio to create a package with File > New Project … Read more

Agile Project Management for Data Science

Many data scientists are former academics who are used to working on specific and often quite narrow research problems for long periods of time, often years. With data science being in high demand at the moment in nearly all industries, more and more researchers switch from an academic career to one in the private … Read more

Parallel processing in R using Azure Batch and Docker

While (personal) computers have become increasingly powerful over the last few years, there are still lots of workloads that easily bring even the best workstation to its knees. Running huge Monte-Carlo simulations or training thousands of models takes hours, if not days, even on very beefy machines. Now enter Azure Batch processing. Azure Batch is a … Read more

Azure Container Registry – Quick Start Guide

Azure Container Registry is the Microsoft equivalent to private Docker Hub repositories. First, I will show you how to quickly push an image to Azure Container Registry. In a second step, I will cover how to manage your registries and repositories using the AzureRM PowerShell module as well as the Azure CLI. Quick start: To push … Read more

Azure Machine Learning Services – Overview

We rely heavily on Microsoft’s cloud platform Azure for our analytics workloads at the Austrian Postal Service. Azure has grown rapidly over the past few years and is adding features at a very fast pace, so it is easy to lose track of which services are (still) offered and which services one should use. … Read more

About

Hi, I am Christoph, the Lead Data Scientist in the BI Competence Center at the Austrian Postal Service. I am responsible for designing the data science architecture, building the data science team and for coding up predictive models. Prior to joining the Austrian Post, I worked as a financial consultant at KPMG. I have a … Read more