March 2019 – Page 2 – Data Science Austria

## Explore your Researcher Degrees of Freedom

I am an applied economist working in the area of accounting and corporate transparency. I work with observational data a lot, meaning data that is already available and not under my control. Whenever I set sail to design a test, there are a lot of decisions to make: Which sample should I use? What …

## Le Monde puzzle [#1088]

A board (Ising!) Le Monde mathematical puzzle in the optimisation mode, again: On a 7×7 board, what is the maximal number of locations that one can occupy when imposing at least two empty neighbours? Which I tried to solve by brute force and simulated annealing (what else?!), first defining a target `targ=function(tabz){ sum(tabz[-c(1,9),-c(1,9)]-1.2*(tabz[-c(1,9),-c(1,9)]*tabz[-c(8,9),-c(1,9)] +tabz[-c(1,9),-c(1,9)]*tabz[-c(1,2),-c(1,9)]` …

## Paper Summary. Stiffness: A New Perspective on Generalization in Neural Networks

This is a summary of Stiffness: A New Perspective on Generalization in Neural Networks (01/2019). Stiffness? This paper aims at improving our understanding of how neural networks generalize from the point of view of stiffness. The intuition behind stiffness is how a gradient update on one point affects another: [it] characterizes the amount of …

## Predicting geographic origin of fish samples using Random Forest models

How machine learning concepts can support fishery management. The Problem: I was trying to show the utility of a type of analysis that groups the origin of fish samples from a particular species given the shape of the fish’s ear bone. The basic concept is that fish in distinct groups for a specific species, say …

## Causal Inference using Difference in Differences, Causal Impact, and Synthetic Control

Correlation is not causation. Then what is causation? How can it be measured? Causation is the real impact on Y that is due to X. For example, what is the effect of ad campaigns on the sales of a product? It is critical to precisely understand the causal effects of these interventions on the subject. One of …

## Conversational AI: Design & Build a Contextual AI Assistant

Installation and Setup: Now, let’s install Rasa and start creating the initial set of training data for our travel assistant. Rasa can be set up in two ways. You can either install the Rasa stack using Python/pip on your local machine, or you can use Docker to set up the Rasa stack using preconfigured Docker images. We’re going to …

## Organizing data driven experiments with PLynx

“An easy way to build reproducible data science workflows” Continuous improvement, promoting innovative solutions and understanding complex domains are essential parts of the challenges data scientists face today. On top of that, they deal with various engineering problems, from data collection and transformation to deployment and monitoring. Engineers and data scientists developed various tools and …

## Classification algorithm for non-time series data

One of the critical problems of “identification”, be it NLP (speech/text) or solving an image puzzle from pieces like a jigsaw, is to understand the words, or pieces of data, and the context. The words or pieces individually don’t carry any meaning, but tying them together gives an idea about the context. Now the …

## Using RStudio and LaTeX

This post will explain how to integrate RStudio and LaTeX, especially the inclusion of well-formatted tables and nice-looking graphs and figures produced in RStudio and imported into LaTeX. To follow along you will need RStudio, MS Excel and LaTeX. Using tikzdevice to insert R Graphs into LaTeX: I am a very visual thinker. If I …

## I trained an AI to imitate my own art style. This is what happened.

Alright, let’s talk about artificial intelligence. AI is everywhere today, and if you think we just DO NOT NEED another AI article, you’re probably right. But before you close this tab, please hear me out. This one is different. First of all, I’m not a developer or software engineer. I did not create another AI. …

## #21: A Third and Final (?) Post on Stripping R Libraries

Welcome to the 21st post in the reasonably relevant R ramblings series, or R4 for short. Back in August of 2017, we wrote two posts, #9: Compacting your Shared Libraries and #10: Compacting your Shared Libraries, After The Build, about “stripping” shared libraries. This involves removing auxiliary information (such as debug symbols and more) from …

## Introduction – Analysing Customer Churn

At first glance, analysing customer churn seems pretty easy. All we have to know is how many customers we have at a certain point in time and how many customers chose to leave our business over a given period in order to calculate a churn rate. We could simply define customer churn rate as: \[ …
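
The excerpt's own formula is cut off, so the following is only a minimal sketch of the calculation the paragraph describes. It assumes the simplest possible definition (customers who left during the period divided by customers at the start of the period); the function name `churn_rate` and its parameters are illustrative and not taken from the original post.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Share of the starting customer base that left during the period.

    Assumed definition: customers_lost / customers_at_start.
    """
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return customers_lost / customers_at_start


# Hypothetical numbers: 1,000 customers at the start, 50 left during the period.
rate = churn_rate(customers_at_start=1000, customers_lost=50)
print(f"{rate:.1%}")  # 5.0%
```

Even this simple version forces a choice about which customer count goes in the denominator, which is one reason the "pretty easy" first impression may not hold up.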

## Super Dark IDE Theme, R-Studio, Inverted Color

A dark IDE theme may increase visual comfort and productivity for those spending extended amounts of time coding, writing, and reading at a computer terminal. Why? If your 9 to 5 has you chained to a computer, you’ve likely experienced eye strain. Typical symptoms include soreness, irritation, and difficulty focusing your vision. All of which …

## Jazz & Bossa Nova: Siblings (?)

1. Importing Libraries: First off, let’s import the required libraries: Matplotlib and Seaborn will be imported for data visualization; Pandas, for data analysis; Bs4 and Requests, for web scraping. `import matplotlib.pyplot as plt`, `import seaborn as sns`, `import pandas as pd`, `import requests`, `from bs4 import BeautifulSoup`. 2. Web Scraping: Now that the libraries have been imported, which …

## What are you optimizing for?

Fifteen months ago I joined an early-stage startup to lead its product management. Without getting into too much detail, we develop an AI-based software solution for customer success teams in B2B companies. In this post, I will discuss the relationship between user experience and Artificial Intelligence (AI) and what’s in it for a …

## A bit more understanding of Cronbach’s alpha

Cronbach’s alpha reliability coefficient is one of the most widely used indicators of scale reliability. It is often used without concern for the data (this will be a different text) because it is simple to calculate and it requires only one implementation of a single scale. The aim of this article is to provide some more insight into the …

## Significance of ACF and PACF Plots In Time Series Analysis

This article is for folks who want to know the intuition behind determining the order of auto-regressive (AR) and moving average (MA) series using ACF and PACF plots. Most of us know how to use ACF and PACF plots to obtain the values of p and q to feed into the AR-I-MA model, but we …

## An Introduction to Logistic Regression

Yang S, Mar 26. This blog will cover five questions: 1. What is Logistic Regression? 2. Why not use Linear Regression? 3. Maximum Likelihood Estimation (MLE) 4. Gradient Descent 5. Implement Gradient Descent in Python. What is Logistic Regression? Logistic regression is a traditional and classic statistical model, which has been widely used in the …

## Deployment of Binning Outcomes in Production

In my previous post (https://statcompute.wordpress.com/2019/03/10/a-summary-of-my-home-brew-binning-algorithms-for-scorecard-development), I showed the different monotonic binning algorithms that I developed over time. However, these binning functions are all useless without a deployment vehicle in production. During the weekend, I finally had time to draft an R function (https://github.com/statcompute/MonotonicBinning/blob/master/code/calc_woe.R) that can be used to deploy the binning outcome and to apply the WoE …

## Tuning a Multi-Task Fate Grand Order Trained Pytorch Network

In a previous post I did some multi-task learning in Keras, and after finishing that one I wanted to do a follow-up post on multi-task learning in PyTorch. This was mostly because I thought it would be a good exercise for me to build it in another framework; however, in this …

## The 3 questions you need to ask to get a data science job

About a decade ago, people would ask me what I did for a living and would pull blank, puzzled faces when I told them I was a data scientist. I’d always try to explain what I did in terms of the tools I used or the people I was around. …

## Pruned Cross Validation for hyperparameter optimization

Speed benchmarking: The main advantage of pruned cross-validation is faster search. If a hyperparameter set yields poor results, the cross-validation is pruned, which saves time and computational resources. Below you can find a comparison between standard grid search and pruned grid search: Search speed benchmarking Grid …
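
To make the pruning idea concrete, here is a minimal sketch in plain Python. It is not the implementation from the post: the pruning rule (stop once the running mean falls more than `tolerance` below the best completed score), the function name `pruned_grid_search`, and the toy scoring function are all assumptions chosen for illustration.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, Optional, Tuple

Params = Dict[str, float]


def pruned_grid_search(
    param_grid: Iterable[Params],
    fold_score: Callable[[Params, int], float],
    n_folds: int = 5,
    min_folds: int = 2,
    tolerance: float = 0.05,
) -> Tuple[Optional[Params], float, int]:
    """Grid search that abandons a candidate once it is clearly worse.

    Returns the best parameters, their mean score, and the number of
    folds that were actually evaluated.
    """
    best_params: Optional[Params] = None
    best_score = float("-inf")
    folds_evaluated = 0

    for params in param_grid:
        scores = []
        pruned = False
        for fold in range(n_folds):
            scores.append(fold_score(params, fold))
            folds_evaluated += 1
            # Assumed pruning rule: after min_folds folds, give up if the
            # running mean is more than `tolerance` below the best score.
            if min_folds <= len(scores) < n_folds and mean(scores) < best_score - tolerance:
                pruned = True
                break
        if not pruned and mean(scores) > best_score:
            best_params, best_score = params, mean(scores)

    return best_params, best_score, folds_evaluated


def toy_fold_score(params: Params, fold: int) -> float:
    """Hypothetical stand-in for fitting and scoring a model on one fold."""
    return 1.0 - abs(params["alpha"] - 0.3) - 0.01 * fold


grid = [{"alpha": a} for a in (0.3, 0.9, 0.1, 0.5)]
best, score, evaluated = pruned_grid_search(grid, toy_fold_score)
print(best, evaluated, "of", len(grid) * 5, "folds evaluated")
```

In this toy run only the first candidate is evaluated on all five folds; the three weaker candidates are dropped after two folds each, which is where the speed-up comes from. The order of the grid matters: a strong candidate early on makes later pruning more aggressive.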

## Bio7 3.0 Released

27.03.2019 A new release of Bio7 is available which is built upon Eclipse 4.11 and the latest Java OpenJDK. This new version comes bundled with OpenJDK 12, supports the dynamic compilation of Java 11 and fixes several annoying bugs on MacOSX (e.g., shutdown crashes). The R interface has been improved and the R-Shell now updates …

## R Studio Shortcuts and Tips

How can you work faster in R Studio? Do you really want to know? In this article, I would like to share with you some of my favorite productivity features of R Studio along with their respective shortcuts. I will also provide information about some other useful tools and techniques. I also prepared …

## Rome Was Not Built In A Day But widgetcard Was!

I saw a second post on turning htmlwidgets into interactive Twitter Player cards and felt somewhat compelled to make creating said entities a bit easier, so I posited the following: Wld this be useful packaged up, #rstats? https://t.co/sfqlWnEeJV https://t.co/troKzmzTNv (TLDR/V: Single function to turn an HTML widget into a deployable interactive Twitter card) pic.twitter.com/uahB52YfE2 - boB Rudis (@hrbrmstr) …

## Could you be the next graduate Mango?

At Mango, we firmly believe that any decision can be better made using analytics and data. We also know that a company’s success is increasingly dependent on becoming data-driven. That’s where we come in. Our mission is to empower organisations to make informed decisions using data science and advanced analytics to drive bigger gains, lower …

## Society Desperately Needs An Alternative Web

Is it too late to steer a path towards an internet intended to free information, preserve our privacy and be accountable to the needs of humanity? I see a society that is crumbling. The rampant technology is simultaneously capsizing industries that were previously the bread and butter of economic growth. The …

## A step-by-step guide for creating advanced Python data visualizations with Seaborn / Matplotlib

Although there are tons of great visualization tools in Python, Matplotlib + Seaborn still stands out for its capability to create and customize all sorts of plots. In this article, I will go through a few sections first to prepare background knowledge for some readers who are new to Matplotlib: Understand the …

## Data Clustering Using Hamiltonian Dynamics

A brief introduction to a flexible clustering algorithm for data on flat and curved surfaces. Imagine you finally land a data science job, and it entails manually checking, labelling and classifying every new datum added to the dataset. Such a job would be dull and tedious! Furthermore, with the volume of data being …

## The data is in: Ethiopia has the best coffee

Building each country’s coffee profile: The dataset records each coffee sample’s country of origin, and that allows me to aggregate the grades for each country and build that country’s coffee profile. Figure 1 shows the results for a few countries and one thing that becomes pretty apparent is that there is actually not much variation between …

## Inverse Statistics – and how to create Gain-Loss Asymmetry plots in R

Asset returns have certain statistical properties, also called stylized facts. Important ones are: Absence of autocorrelation: basically the direction of the return of one day doesn’t tell you anything useful about the direction of the next day. Fat tails: returns are not normal, i.e. there are many more extreme events than there would be if …

## “GANs” vs “ODEs”: the end of mathematical modeling?

Hi everyone! In this article, I would like to make a connection between classical mathematical modeling, which we study in school and college, and machine learning, which also models objects and processes around us, but in a totally different manner. While mathematicians create models based on their expertise and understanding of the …

## Koning Filip lijkt op … (King Filip looks like …)

Last call for the course on Text Mining with R, held next week in Leuven, Belgium on April 1-2. You can view the course description and subscribe at https://lstat.kuleuven.be/training/coursedescriptions/text-mining-with-r Some things you’ll learn … is that King Filip of Belgium is similar to public expenses if we just look at open data …

## Use RStudio Server in a Virtual Environment with Docker in Minutes!

A fundamental aspect of the reproducible research framework is that (statistical) analysis can be reproduced; that is, given a set of instructions (or a script file) the exact results can be achieved by another analyst with the same raw data. This idea may seem intuitive, but in practice it can be difficult to achieve in …

## February 2019: “Top 40” New CRAN Packages

One hundred and fifty-one new packages arrived at CRAN in February. Here are my “Top 40” picks organized into eight categories: Bioinformatics, Data, Machine Learning, Medicine, Statistics, Time Series, Utilities and Visualization. Bioinformatics. Cascade v1.7: Implements a modeling tool allowing gene selection, reverse engineering, and prediction in cascade networks. See Jung et al. (2014) for …

## Statistical Learning and Knowledge Engineering All the Way Down

A path to combining machine learning and knowledge bases. Doug Lenat, CEO of Cycorp, Inc. and AAAI Fellow, gave an interesting keynote talk at the AAAI Spring Symposium at Stanford University during the AAAI-Make session. Current trends in society contrast with the common perception that people are becoming more and …

## Constructivist Machine Learning

A vision towards bringing machine learning closer to humans. Is there a way to re-interpret machine learning in a constructivist way? And more importantly, why should we do it? The answers to both questions are quite straightforward. Yes, we can do it, and the motivation for that may address one …

## Feature Selection and Dimensionality Reduction

Remove features with missing values: Checking for missing values is a good first step in any machine learning problem. We can then remove columns exceeding a threshold we define. Checking for missing values with `train.isnull().any().any()` returns `False`. Unfortunately for our dimensionality reduction efforts, this dataset has zero missing values. Remove features with low variance: In sklearn’s feature selection module we …
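
Since the dataset in the excerpt has no missing values, the threshold step is never shown. The following is a minimal sketch of what "remove columns exceeding a threshold" could look like, written in plain Python so it runs without dependencies. The column-dictionary layout, the use of `None` for missing entries, and the names `missing_fraction` and `drop_sparse_columns` are assumptions for illustration; with a pandas DataFrame, the per-column missing share would typically come from `train.isnull().mean()`.

```python
from typing import Dict, List


def missing_fraction(column: List[object]) -> float:
    """Fraction of entries in a column that are missing (None)."""
    if not column:
        return 0.0
    return sum(value is None for value in column) / len(column)


def drop_sparse_columns(table: Dict[str, List[object]], threshold: float) -> Dict[str, List[object]]:
    """Keep only columns whose missing fraction does not exceed threshold."""
    return {
        name: values
        for name, values in table.items()
        if missing_fraction(values) <= threshold
    }


# Hypothetical toy data; None marks a missing value.
train = {
    "age": [34, None, 51, 29],            # 25% missing
    "income": [None, None, None, 42000],  # 75% missing
    "city": ["Vienna", "Graz", "Linz", "Wien"],
}

reduced = drop_sparse_columns(train, threshold=0.5)
print(list(reduced))  # ['age', 'city']
```

The threshold is a judgment call: a low value discards more columns, while a high value keeps sparse columns that then need imputation.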

## Why Retailers Want to Fill More Data Scientist Positions

The role of data scientist is one of the most in-demand positions overall, but there’s particularly an effort in the retail sector to make use of the skills data scientists offer. Whether you’re currently a data scientist or aspire to become one, there are various reasons why retailers may want to hire you. Here are …

## Data science productionization: trust

I devote most of the posts in this series to the more technological aspects of productionization, although even those aspects are heavily dependent upon some very human processes. But let’s say our code is all packaged, containerized, and version-controlled; that our workflows have all been automated; that all of the processes have technical and non-technical …

## Neural Style Transfer Series : Part 2

TensorFlow and PyTorch Implementation of Neural Style Transfer: This article follows from what we discussed in the first article. While we spoke about the intuition and the theory of how Neural Style Transfer works, we will now move on to implementing the original paper. If this is the first article you’ve read of this series, I would …

## Machine Learning Algorithms In Layman’s Terms, Part 2

With that, onto the data science! Now that we have covered gradient descent, linear regression, and logistic regression in Part 1, let’s get to Decision Trees and Random Forest models. Decision Trees: A decision tree is a super simple structure we use in our heads every day. It’s just a representation of how we make decisions, …

## Markov chain Monte Carlo doesn’t “explore the posterior”

First some background, then the bad news, and finally the good news. Spoiler alert: The bad news is that exploring the posterior is intractable; the good news is that we don’t need to explore all of it. Sampling to characterize the posterior: There’s a misconception among Markov chain Monte Carlo (MCMC) practitioners that the purpose …

## Generate Piano Instrumental Music by Using Deep Learning

Hello everyone! Finally, I can write again on my Medium and have free time to do some experiments on Artificial Intelligence (AI). This time, I am going to write and share about how to generate music notes by using Deep Learning. Unlike my previous article about generating lyrics, this time we will generate the notes …

## What is the interpretation of the diagonal for a ROC curve

Last Friday, we discussed the use of ROC curves to describe the goodness of a classifier. I did say that I would post a brief paragraph on the interpretation of the diagonal. If you look around, some say that it describes the “strategy of randomly guessing a class”, that it is obtained with “a diagnostic …
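
The "randomly guessing a class" reading quoted above can be checked with a small simulation. This is only a sketch of that one interpretation, not of the argument the post goes on to make: a guesser that says "positive" with probability p, regardless of the true class, should land near the point (FPR, TPR) = (p, p), which lies on the diagonal. The function name `random_guess_rates` and the sample size are illustrative assumptions.

```python
import random
from typing import Tuple


def random_guess_rates(p_positive: float, n: int = 20000, seed: int = 0) -> Tuple[float, float]:
    """Simulate a guesser that ignores the data and predicts 'positive'
    with probability p_positive. Returns (TPR, FPR)."""
    rng = random.Random(seed)
    true_pos = false_pos = positives = negatives = 0
    for i in range(n):
        truth = i % 2 == 0                  # half positives, half negatives
        guess = rng.random() < p_positive   # independent of the true class
        if truth:
            positives += 1
            true_pos += guess
        else:
            negatives += 1
            false_pos += guess
    return true_pos / positives, false_pos / negatives


for p in (0.1, 0.5, 0.9):
    tpr, fpr = random_guess_rates(p)
    print(f"p={p}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Because the guess does not depend on the true class, positives and negatives are flagged at the same rate, so TPR and FPR agree up to sampling noise; sweeping p from 0 to 1 traces out the diagonal.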

## Human Pose Estimation : Simplified

Take a peek into the world of Human Pose Estimation. What is Human Pose Estimation anyway? Human pose estimation is an important problem in the field of Computer Vision. Imagine being able to track a person’s every small movement and do a bio-mechanical analysis in real time. The technology will have huge implications. Applications may …

## Automation, Risk and Robust Artificial Intelligence

An interview with Professor Thomas Dietterich on the need for high reliability in socio-technical systems involving AI. The ways in which artificial intelligence (AI) is woven into our everyday lives can hardly be overstated. Powerful deep machine-learning algorithms increasingly predict what movies we want to watch, which ads we’ll respond …

## Operator Notation for Data Transforms

As of version 1.0.8, cdata implements an operator notation for data transforms. The idea is simple, yet powerful. First let’s start with some data. `d <- wrapr::build_frame( "id", "measure", "value" | 1 , "AUC" , 0.7 | 1 , "R2" , 0.4 | 2 , "AUC" , 0.8 | 2 , "R2" , 0.5` …

## Critical Thinking in Data Science

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Debbie Berebichez, a physicist, TV host and data scientist who is currently the Chief Data Scientist at Metis in NY. Hugo: Hi there, Debbie, and welcome to DataFramed. Debbie: Hi, Hugo. It’s a pleasure of mine to be here. Hugo: It is such a …

## Tinkering with Tensors and Other Great Adventures

Motivations: Why implement a research paper? And why NLP? Let’s start with the latter. Assuming you want to create beneficial AI for the sustained good of humanity (and I mean, who wouldn’t?), you’d necessarily have to create a system capable of advanced reasoning, and which could preferably explain the contents of its consciousness to a …