Python Iterables, Iterators and Generators

Python is very popular for data processing and munging and has some powerful tools for data streaming and lazy evaluation that allow us to work with datasets that do not fit into memory. In this post I want to talk about the tools that Python has built in to make stream processing possible, namely: Iterables, Iterators and … Read more
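
As a rough illustration of the three concepts named in this teaser, here is a minimal sketch. The names `Countdown`, `CountdownIterator` and `squares` are made up for this example and do not come from the post itself.

```python
from typing import Iterable, Iterator


class Countdown:
    """An iterable: every call to iter() hands out a fresh iterator."""

    def __init__(self, start: int) -> None:
        self.start = start

    def __iter__(self) -> Iterator[int]:
        return CountdownIterator(self.start)


class CountdownIterator:
    """An iterator: keeps its own state and yields one value per next()."""

    def __init__(self, current: int) -> None:
        self.current = current

    def __iter__(self) -> "CountdownIterator":
        return self

    def __next__(self) -> int:
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


def squares(numbers: Iterable[int]) -> Iterator[int]:
    """A generator function: each square is computed only when requested."""
    for n in numbers:
        yield n * n


# Nothing is computed until list() starts pulling values through the chain.
lazy_squares = squares(Countdown(3))
result = list(lazy_squares)
```

Because the generator only holds one value at a time, the same pattern should scale to inputs that are far larger than memory, for example lines read from a big file.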

Data Passing & Races

We already covered different ways to pass data to threads during thread construction. Now we will take a look at sending data from the worker to the parent thread as well. To do this, threads follow a very strict sync protocol implemented in the C++ standard. C++ includes a single-use channel to pass data (and … Read more

Mutexes and Locks

To avoid data races, C++ has a concept called a mutex, short for Mutual EXclusion. A mutex manages access to a shared resource and ensures that only one thread at a time is able to access the resource: thread 1 ————– | locks resource | mutex —> access –> shared variable x is blocked | thread … Read more
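
The post itself works in C++. Purely as a language-neutral sketch of the mutual-exclusion idea, the snippet below uses Python's `threading.Lock` as a stand-in for a mutex. The names `shared`, `x_lock` and `add_many` are hypothetical and only serve this illustration.

```python
import threading

shared = {"x": 0}            # the shared variable x
x_lock = threading.Lock()    # plays the role of the mutex


def add_many(n: int) -> None:
    """Increment the shared variable n times, one locked step at a time."""
    for _ in range(n):
        with x_lock:         # only one thread at a time holds the lock
            shared["x"] += 1


# Four worker threads compete for the same variable.
threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = shared["x"]
```

While one thread holds the lock, the others block at the `with` statement, which is the behaviour the diagram in the teaser hints at.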

Intro to C++ Concurrency

Let’s start with the very basics before we work through a simple example in C++: Overview When we want to perform multiple pieces of work simultaneously, we can either choose to run them in a synchronous or asynchronous manner: Synchronous time main() ——— ————————-> func() | | return() thread() ——-> Asynchronous time main() ——————————————> func() … Read more

CallBy: Pointer/Reference/Const Reference

By default C++ uses call-by-value, but copying large arguments can be expensive, so we can also use call by pointer, reference, or const reference. Since as a user we do not see if a function that takes an argument actually modifies it, we can use const references to have the compiler guarantee that an argument that … Read more

Pointers – Syntax Overview

This post is a short reminder to my future self summarising pointer syntax: Pointer Basics #include <iostream> int main() { int a = 5; int b = 15; int* p1 = &a; //initialize int pointer p1 and assign address of a to pointer p1 int* p2 = &b; //initialize int pointer p2 and assign address … Read more

Intro Memory Management in C++

Most people’s dislike of C++ probably comes from the fact that when you write in C++, you have to manage memory yourself. While this can be frustrating at times (hello there, segfault and bad_alloc 🙂), it is actually immensely powerful and interesting. So let’s get right to it :) Memory Addresses and Hexadecimal Numbers Computers store information … Read more

From Basic to Advanced OOP in C++

C++ is also commonly called C with classes, and in this post I want to give you an overview of basic and advanced OOP concepts in C++. Let’s start from the very beginning with structures: Structures Structs allow you to define your own custom types to complement the built-in types (also called ‘primitives’ or fundamental … Read more

C++ Foundations

When I left my old job at Austrian Post, my colleagues got me an amazing parting-gift: An Amazon Deep Racer 🙂 While it is quite easy to work with Deep Racer using AWS Sagemaker, I wanted to program Deep Racer directly. Deep Racer runs on Ubuntu Linux with the Robot Operating System (ROS). ROS is … Read more

A* Search in C++

In this post I want to discuss the A* algorithm (pronounced ‘a-star’), how it is used for motion planning and how we can implement it in C++. The process of determining how to get from a start to an endpoint is called ‘planning’ and for a robot it’s called ‘robot motion planning’. There are two … Read more

Scikit-learn Complete Walk Through (for R-Users)

Python is one of the most used languages, if not the most used, among data scientists interested in machine learning and deep learning techniques. One reason for Python’s popularity is without a doubt scikit-learn. scikit-learn does not need an introduction, but in case you are new to the machine learning space in Python, scikit-learn is an all-in-one machine learning … Read more

PCA, Eigenvectors and the Covariance Matrix

Almost every data science course will at some point (usually sooner rather than later) cover PCA, i.e. Principal Component Analysis. PCA is an important tool used in exploratory data analysis for dimensionality reduction. In this post I want to show you (hopefully in an intuitive way) how PCA works its mathematical magic. Let’s start with a short … Read more

Linear Algebra Refresher

Since quite some time has passed since I took my linear algebra courses, I thought I could comb through my old course notes and write a small post about the linear algebra topics that I find most useful to remember. Let’s start with some computation rules for working with matrices: Some Matrix Calculation Rules … Read more

Linux on Razer Blade 15 (mid 2019)

As a long-time MacBook user I recently made the switch to a Windows laptop, specifically a Razer Blade 15 (mid 2019) model. While Apple has thankfully started to use proper keyboards again in their recent lineup, the lack of Nvidia GPUs is somewhat sub-optimal if you want to use your laptop for deep learning … Read more

Operations Research 101 with Python

Recently my sister asked me to help her with some linear programming problems in Excel. I have not used Excel Solver for what feels like decades and I was pleasantly surprised how easy it was to set up and solve a linear programming problem with it. But since I am not really an Excel fan, I … Read more

An Introduction to Statistical Learning

In this blog post I want to give a brief overview of the core ideas behind the statistical learning framework and show you how to implement a few simple models in this framework. This post is based on the excellent work “Tree Boosting With XGBoost – Why Does XGBoost Win ‘Every’ Machine Learning Competition” by … Read more

Test-Driven Development (TDD)

“Test-driven development (TDD) is a software development process that relies on the repetition of a very short development cycle: requirements are turned into very specific test cases, then the software is improved to pass the new tests, only. This is opposed to software development that allows software to be added that is not proven to … Read more

REST Calls with Postman

How to set up Postman In order to test REST calls, one tool has emerged over the last few years: Postman. The following 2-minute video is a great summary of how to set it up quickly: [embedded content] I do not want to get into the details. If you need them, you can find them here: … Read more

SQL Server Advanced

Sometimes you have a series of stored procedures that themselves are managed by another master stored procedure like so. This usually just means a series of EXEC statements after each other. It is quite handy to create a logging event after each stored procedure call in order to check its progress. Logging with try-catch CREATE … Read more

Creating Azure Logic Apps from R using httr

Logic Apps is a serverless framework in Azure quite similar to IFTTT (if this, then that) and Zapier that allows you to connect different services and create workflows. You can define different types of triggers based on: time and events (e.g. http requests, messages received, …) to start workflows. Logic Apps can be created using a … Read more

Azure Functions

This blogpost will demonstrate how to set up Azure Functions with some Python code. More precisely, it will show how to call an Azure Function, add a parameter that specifies the name of the file that we want to read from and store that information in a database. Between reading and storing we have the … Read more

RStudio Addin

If you want to create your own RStudio addins, all you need to do is: Create an R package Create some R functions Create a file at inst/rstudio/addins.dcf Links 1. Create an R Package Set up tools for package development library(devtools) library(roxygen2) # getwd() # setwd("path/to/repo") Create Package I am mainly following: create("rstudio_addin") This … Read more

OAuth 2.0 in R

Many APIs require some form of authentication. A very common form used with many cloud providers and commercial APIs is OAuth 2.0. In this post, I want to give you an introduction to how OAuth 2.0 works and use it to authenticate with Microsoft Azure services. OAuth 2.0 – Overview The OAuth 2.0 specification is … Read more

Working with REST APIs for Data Scientists in R

With the growing importance of cloud computing, more and more services are exposed as REST APIs. In this post, I want to give a hands-on introduction for data scientists from non-software-engineering backgrounds on how to work with REST APIs. But before we dive straight into the code, let’s start with some background information: A (short) … Read more

R Travis

In this post we will explore how to set up R package development on GitHub, focusing on implementing an automatic Travis and code coverage check. I set up a sample repo that will include a very basic configuration: TravisR Travis is a great You can easily sign up by connecting your GitHub account: You will need … Read more

Introduction – Analysing Customer Churn

At first glance, analysing customer churn seems pretty easy. All we have to know is how many customers we have at a certain point in time and how many customers chose to leave our business over a given period in order to calculate a churn rate. We could simply define customer churn rate as: \[ … Read more

Git & SSH with PowerShell Core

In this post I want to give a quick outline of how to set up PowerShell Core (Microsoft’s cross-platform version of PowerShell) to work with git and ssh. While you can simply install Git for Windows and work with Git Bash, personally I quite like PowerShell Core, because it is more tightly integrated with Windows and … Read more

Pandas for data.table Users

R and Python are both great languages for data analysis. While they are remarkably similar in some aspects, they are drastically different in others. In this post, I will focus on the similarities and differences between Pandas and data.table, two of the most prominent data manipulation packages in Python/R. There is already an excellent post … Read more

The Perceptron Algorithm

In my blog post Neural Nets: From Linear Regression to Deep Nets I talked about how a deep neural net is simply a sequence of simple building blocks of the form: \[\sigma(\underbrace{w^T}_{weights}x + \overbrace{b}^{bias}) = a\] and that a linear regression model is one of the most basic neural networks where the activation function \(\sigma\) … Read more
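
The building block \(\sigma(w^T x + b) = a\) from this teaser can be sketched in a few lines of plain Python. The function names and the example numbers below are invented for illustration. The identity activation for the linear regression case and the step activation for a perceptron-style unit are stated here as assumptions, since the teaser is cut off before it specifies them.

```python
import math
from typing import Callable, Sequence


def neuron(
    x: Sequence[float],
    w: Sequence[float],
    b: float,
    activation: Callable[[float], float],
) -> float:
    """Compute a = sigma(w^T x + b) for a single unit."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return activation(z)


def identity(z: float) -> float:
    """Assumed activation for the linear regression case."""
    return z


def sigmoid(z: float) -> float:
    """A smooth activation that squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def step(z: float) -> float:
    """Assumed perceptron-style activation: fire if z is non-negative."""
    return 1.0 if z >= 0 else 0.0


x = [1.0, 2.0]      # hypothetical input
w = [0.5, -0.25]    # hypothetical weights
b = 0.1             # hypothetical bias

linear_output = neuron(x, w, b, identity)
perceptron_output = neuron(x, w, b, step)
```

Swapping the `activation` argument is the only change needed to move between the variants, which mirrors the idea that they share the same building block.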

Blogging with Hugo and Jupyter

I really love blogging with Hugo+Blogdown, but unfortunately Blogdown is still mostly restricted to R (although Python is now also possible using the reticulate package). Jupyter offers a great literate programming environment for multiple languages and so being able to publish Jupyter notebooks as Hugo blogposts would be a huge plus. I have been looking … Read more

Neural Nets: From Linear Regression to Deep Nets

Neural networks, especially deep neural networks, have received a lot of attention over the last couple of years. They perform remarkably well on image and speech recognition and form the backbone of the technology used for self-driving cars. What many people find hard to believe is that the mathematics of neural networks have been around … Read more

SQL Server

Columnstore A columnstore index can provide a very high level of data compression, typically by 10 times, to significantly reduce your data warehouse storage cost. For analytics, a columnstore index offers an order of magnitude better performance than a btree index. Columnstore indexes are the preferred data storage format for data warehousing and analytics workloads. … Read more

Box Cox Transformation

When we do time series analysis, we are usually interested either in uncovering causal relationships (Does \(X_t\) influence \(Y_{t+1}\)?) or in getting the most accurate forecast possible. Especially in the second case it can be beneficial to transform our historical data to make it easier to extract a signal. A very common transformation is to … Read more

Introduction to stochastic control theory

I had my first contact with stochastic control theory in one of my Master’s courses about Continuous Time Finance. I found the subject really interesting and decided to write my thesis about optimal dividend policy which is mainly about solving stochastic control problems. In this post I want to give you a brief overview of … Read more

Azure SQL DWH – Overview

There are a multitude of options when it comes to storing and processing data. In this post I want to give you a brief overview of Azure SQL Data Warehouse, Microsoft’s data warehouse solution for the Azure cloud and its answer to Amazon Redshift on AWS. I will start off by talking briefly about its technical architecture … Read more

Docker Basics

Docker is a tool which helps developers build and ship high quality applications, faster, anywhere. Source Why Docker With Docker, developers can build any app in any language using any toolchain. Dockerized apps are completely portable and can run anywhere. Developers can get going by just spinning any container out of list on Docker Hub. … Read more

More advanced SQL Server for Data Scientists

In the previous post I covered the basics you need to know to work with SQL Server. In this post, I want to show you some more advanced techniques that I found pretty helpful. The topics I will cover include: How to speed up your queries with indices and using columnstore Using Views and Table … Read more

Object Oriented Programming in Data Science with R

Since R is mostly a functional language and data science work lends itself to being expressed in a functional form, you can get by just fine without learning about object-oriented programming. Personally, I mostly follow a functional programming style (although often not a pure one, i.e. w/o side-effects, because of limited RAM). Expressing mathematical concepts in … Read more

Estimating Intervention Effects using Bayesian Models in R

Measuring the effect of an intervention on some metric is an important problem in many areas of business and academia. Imagine you want to know the effect of a recently launched advertising campaign on product sales. In an ideal setting, you would have a treatment and a control group so that you can measure the … Read more

Using data.table deep copy

data.table is an awesome R package, but there are a few things you need to watch out for when using it. R usually does not modify objects in place (i.e. by reference), but makes a copy when you change a value and saves this copy. This can be a problem if you work with large datasets … Read more

SQL Server for Data Scientists

SQL is not the sexiest language on the block and many/most data scientists I know prefer to stick to R and/or Python. Some common complaints I hear about SQL are: It is hard to read and as a consequence large SQL statements are hard to debug. Version control with databases often requires additional tooling to … Read more

Package development in R – Overview

Creating an R package is as easy as typing: package.skeleton(name = "YourPackageName") As you might have guessed, this function creates the basic file and folder structure you need to create an R package. You will get: YourPackageName/ DESCRIPTION man/ NAMESPACE R/ You can also use RStudio to create a package with File > New Project … Read more

Agile Project Management for Data Science

Many data scientists are former academics who are used to working on specific and often quite narrow research problems for long periods of time, often years. With data science being in high demand at the moment in nearly all industries, more and more researchers switch from an academic career to one in the private … Read more

Office Ribbons

I am an absolute fan of adapting your work environment to your needs. Spending an hour to set up some shortcuts is virtually always a good time investment. Then you can easily drag your most used commands into a new bar. You should be able to save a lot of time on, e.g. aligning objects … Read more