Data Science Austria

A short guide to using Docker for your data science environment

Why: One of the most time-consuming parts of starting work on a new system, starting a new job, or simply sharing your work is the variation in available tools (or the lack thereof) caused by differences in hardware, software, security policies, and so on. Containerization has risen up in recent years …

Data network effects for an artificial intelligence startup

As the artificial intelligence (AI) ecosystem matures, it is becoming increasingly difficult to impress customers, investors, and potential acquirers by just attaching an .ai domain to whatever you are doing. The significance of building a business model that is defensible in the long run therefore becomes obvious. In this post, I explore how an …

R some blog 2018-12-08 04:19:00

Motivation: The dplyr functions select and mutate are now commonly used for data.frame column operations, frequently combined with magrittr's forward pipe %>%. While they work well interactively, these methods often require additional checking when used in “serious” code, for example to catch column name clashes. In principle, the …

Feel discouraged on the sparse data in your hand? Give Factorization Machine a shot (2)

With a solid foundation in matrix factorization, exploring the advanced models derived from it, such as LDA, LSI, PLSA, and tensor factorization, will be much smoother. The models derived from the concept of matrix factorization: In the last session, …

“Increase sample size until statistical significance is reached” is not a valid adaptive trial design; but it’s fixable.

TLDR: Begin with N of 10, increase by 10 until p < 0.05 or max N reached. This design has inflated type-I error. Lower p-value threshold needed to ensure specified type-I error rate. The number of interim analyses and max N affect the type-I error rate. Threshold can be identified …
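The inflation described in this excerpt can be checked with a small simulation. The sketch below is not the original post's code: it assumes a two-sided z-test with known standard deviation 1, a hypothetical maximum N of 100, and data generated under the null hypothesis (true mean 0). The function names are illustrative only.

```python
import math
import random


def two_sided_p(sample_sum, n):
    """Two-sided z-test p-value for H0: mean = 0 (assumes known sigma = 1)."""
    z = sample_sum / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def trial_rejects(rng, start_n=10, step=10, max_n=100, alpha=0.05):
    """Simulate one trial under H0, testing after every batch of observations."""
    total, n = 0.0, 0
    batch = start_n
    while n < max_n:
        batch = min(batch, max_n - n)  # never exceed the maximum sample size
        total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
        n += batch
        if two_sided_p(total, n) < alpha:
            return True  # stopped early and declared significance
        batch = step
    return False


def type_one_error(n_sims=2000, seed=0, **design):
    """Fraction of null trials that (wrongly) reach p < alpha."""
    rng = random.Random(seed)
    return sum(trial_rejects(rng, **design) for _ in range(n_sims)) / n_sims


# Design from the excerpt: start at 10, add 10 per look, stop at max N (assumed 100).
sequential_rate = type_one_error()
# Comparison: a single analysis at N = 100 with the same threshold.
fixed_rate = type_one_error(start_n=100, max_n=100)

print(f"sequential looks: {sequential_rate:.3f}")
print(f"single look:      {fixed_rate:.3f}")
```

Under these assumptions the repeated-look design should reject a true null noticeably more often than the nominal 5%, while the single-look design stays near it. Passing a smaller `alpha` to `type_one_error` is one way to explore how far the threshold would have to drop.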

Shortcoming of Under-sampling Algorithms: CCMUT and E-CCMUT

What, Why, Possible Solution and Ultimate Utility: In one of my previous articles, “Under-sampling : A Performance Booster on Imbalanced Data”, I applied the Cluster Centroid based Majority Under-sampling Technique (CCMUT) to the Adult Census Data and showed a model performance improvement over the state-of-the-art model, “A Statistical Approach to Adult Census Income …

“Artist” in Matplotlib — something I wanted to know before spending tremendous hours on googling…

Originally published at dev.to and modified a bit to fit Medium’s editing system. It’s true that matplotlib is a fantastic visualization tool in Python. But it’s also true that tweaking details in matplotlib is a real pain. You can easily lose hours finding out how to change a small …