An Introduction to Knowledge Graphs

Working with knowledge graphs lets data scientists extract interrelated facts and assumptions from massive collections of data. It also helps them form contextual connections in that data through linking and semantic metadata, which provides a unified approach to data analytics … Read more

Building bespoke stochastic process models using Rcpp

Advanced model building in R with Rcpp The scientific literature contains a vast “zoo” of elegant mathematical models for analyzing complex biological systems. Implementing them in R or Python can lead to some interesting, or indeed unexpected, numerical challenges. Fig1. Spatial (1-D) logistic growth model estimated using Rcpp. Original application was for advantageous alleles spreading … Read more

Conversational AI: Trends and Predictions for 2022

In this article, I propose 6 trends and predictions for the market evolution in 2022. The digital transition is accelerating rapidly due to new personal and professional lifestyles brought about by the current pandemic situation. Conversational assistants are part of this transformation by enabling the automation of support and self-service requests. Photo by Mathew Schwartz … Read more

Comparing Kruve Coffee Sifters: New and Old

Immediately, I noticed they felt a bit different in terms of the texture of the screens. I compared a few shots and saw an immediate shift in the distribution. All images by author So I first used a microscope to take a look at the screens. 400um, Top/Bottom: Wide/Zoomed In; Left/Right: Old/New 200um, Top/Bottom: Wide/Zoomed … Read more

Discovering The Matrix Determinant

In the previous section, we learned some basic facts about the determinant and how to interpret it. But how do we calculate it? The procedure to calculate the determinant is quite tedious. Fortunately, there exist some shortcuts, which are only applicable to small matrices (2×2, 3×3). We will talk about the shortcuts first, before talking … Read more
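A minimal Python sketch of what such small-matrix shortcuts usually look like. It assumes the article means the familiar ad − bc formula for 2×2 matrices and the rule of Sarrus for 3×3 matrices; the excerpt does not name them, and the function names `det2` and `det3` are illustrative.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists: ad - bc."""
    (a, b), (c, d) = m
    return a * d - b * c


def det3(m):
    """Determinant of a 3x3 matrix via the rule of Sarrus.

    Sum the products along the three "down-right" diagonals and subtract
    the products along the three "down-left" diagonals.
    """
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * e * i + b * f * g + c * d * h - c * e * g - b * d * i - a * f * h


print(det2([[1, 2], [3, 4]]))                    # -2
print(det3([[2, 0, 1], [1, 3, 2], [1, 1, 4]]))   # 18
```

Neither shortcut generalizes to 4×4 or larger matrices, which is why a general procedure is still needed.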

Insights From Visualizing Communities and Connections on YouTube

Interactive map available at Each bubble is a YouTube channel. The size of each bubble is determined by the channel’s subscriber count (in August 2021). A line between two bubbles shows that those two channels share at least 350 commenters of the 20,000 collected. The color of each bubble indicates the community of YouTube channels … Read more

Comparing Decision Trees

[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. In the last article of the current year, we will examine and compare … Read more

Categories R Tags ExcerptFavorite

Trends that shaped the Modern Data Stack in 2021

1. Democratization: both the data and the data stack. By definition, democratization is the action of making something accessible to everyone. As companies strive to become more data-driven, they have made considerable efforts to ensure relevant data is accessible to everyone in the organization. Emphasis on “relevant”. Democratization starts with setting up the right processes … Read more

How to find a Trimmed Mean in R

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers]. Visit finnstats for the most up-to-date information on Data Science, employment, … Read more

Merit Order and Marginal Abatement Cost Curve in Python

An electricity authority or a utility in a country could have competing power plants of different types in its portfolio to offer electricity output to retailers. This is known as a wholesale electricity market (also known as spot market). The cost of generating electricity differs according to the type of power plant. For example, there … Read more

Cosine Similarity Explained using Python — Machine Learning — PyShark

Cosine similarity is a measure of similarity between two non-zero vectors. It is calculated as the cosine of the angle between these vectors (which is the same as their inner product after each vector is normalized to unit length). Well that sounded like a lot of technical information that may be new or difficult to the learner. We will break it down by part along … Read more
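As a small illustration of the definition, here is a standard-library Python sketch. Cosine similarity is the dot product of the two vectors divided by the product of their lengths, which equals the cosine of the angle between them. The function name `cosine_similarity` is an illustrative choice, not necessarily the one used in the article.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors -> 0.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction -> close to 1
```

The result depends only on direction, not magnitude, so scaling either vector by a positive constant leaves it unchanged.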

How To Build An AutoML API

Simple guide to building reusable ML classes Image from Unsplash by Scott Graham There’s been a lot of interest in AutoML recently. Ranging from open-source projects to scalable algorithms in the cloud, there’s been a surge in projects that make ML more accessible for non-technical users. Examples of AutoML in the Cloud include SageMaker Canvas … Read more

An End-to-End Machine Learning Project — Heart Failure Prediction Part 1

Data exploration, model training, validation and storage In this series, I will be walking through an end-to-end machine learning project which will cover everything from data exploration to model deployment via a web application. My goal is to provide general insight into the different components involved in getting a model to production; this series is … Read more

22 predictions about the Software Development trends in 2022

Like the few giant supermarkets that replaced the local shops in the western world, the public Cloud will continue to replace the regional Data Center. In the coming years, the public Cloud will also be the go-to Infrastructure for Enterprises, Government, and Startups. The public Cloud is now the hotbed of digital innovation, and … Read more

10 Features Your Streamlit ML App Can’t Do Without — Implemented

Add Jupyter lab, session management, multi-page, file explorer, conda envs, parallel processing, and deployment to your app Much has been written about Streamlit killer data apps, and it is no surprise to see Streamlit is the fastest growing platform in this field: Image by Star history, edited by author However, developing an object segmentation app … Read more

Bank Customer Churn with Tidymodels — Part 1 Model Development

Load Packages

library(tidymodels)
library(themis) # Recipe functions to deal with class imbalances
library(tidyposterior) # Bayesian Resampling Comparisons
library(baguette) # Bagging Model Specifications
library(corrr) # Correlation Plots
library(readr) # Read .csv Files
library(magrittr) # Pipe Operators
library(stringr) # String Manipulation
library(forcats) # Handling Factors
library(skimr) # Quick Statistical EDA
library(patchwork) # Create ggplot Patchworks
library(GGally) # Pair Plots
options(yardstick.event_first = FALSE) # Evaluate second factor level as factor of interest for yardstick metrics

Load Data

Data is taken from … Read more

Seasonal Adjustment of Daily Time Series

Deseasonalization Introducing the novel DSA procedure by the German Central Bank Photo by Federico Beccari on Unsplash With the advent of Big Data, there has been quite a push concerning time series that are available on a daily basis. Unfortunately, daily data tends to be noisy. For example, the time series could spike on weekends … Read more

The History and Future of Artificial Intelligence through the lenses of Computer Chess and Legal A.I

Photo by Damiano Lingauri on Unsplash An in-depth interview with Professor Jaap van den Herik Although we might only be at the dawn of the A.I. revolution, a world without A.I. around us is already unthinkable. This was not the case forty years ago; at that time, barely anyone knew the term. During the last … Read more

Decoding the Top 10 Data Science Jargons For Beginners (Commonly Asked In Interviews)

This article is about decoding some of the popular jargon used in data science. It is important to understand these concepts better. They are commonly asked in data science job interviews. Let’s get into the topics. A dependent variable (target variable) is driven by the independent variables in the study. For example, the revenue for … Read more

Battery Storage ROI Analysis

[This article was first published on Commodity Stat Arb, and kindly contributed to R-bloggers]. It’s been a while since my last post and I’m taking … Read more

Intro to Comparing and Analyzing Multiple Unevenly Spaced Time-Series Signals

Methods to analyze multiple time-series signals that occur over the same time period but have different timestamps and time spacings Photo by Nathan Dumlao on Unsplash Say we have the following scenario — we have two different sensors that are measuring the current and voltage across a battery pack. Now, we want to do some … Read more
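One common way to compare two signals with different timestamps is to linearly interpolate one onto the time base of the other. The sketch below shows that idea in standard-library Python. The sample readings, the helper name `interpolate_at`, and the choice to compute power as voltage times current are all illustrative assumptions; the excerpt does not say which method or computation the article uses.

```python
from bisect import bisect_left


def interpolate_at(times, values, t):
    """Linearly interpolate the series (times, values) at time t.

    `times` must be sorted in ascending order, and t must lie inside
    the sampled range (no extrapolation in this sketch).
    """
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the sampled range")
    i = bisect_left(times, t)      # first index with times[i] >= t
    if times[i] == t:
        return values[i]
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical readings: the two sensors sample at different instants.
current_t = [0.0, 1.0, 2.5, 4.0]   # seconds
current_a = [1.0, 1.2, 1.5, 1.1]   # amperes
voltage_t = [0.5, 2.0, 3.5]        # seconds
voltage_v = [12.0, 11.8, 11.9]     # volts

# Resample the current onto the voltage timestamps, then combine the signals.
current_on_v = [interpolate_at(current_t, current_a, t) for t in voltage_t]
power_w = [v * a for v, a in zip(voltage_v, current_on_v)]
print(list(zip(voltage_t, power_w)))
```

Linear interpolation is only reasonable when both signals change slowly relative to the sampling gaps; for noisy or fast-changing data, resampling both series onto a common grid with smoothing may be a better fit.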

From Chemist to ML Researcher

I applied for ~100 jobs, both on LinkedIn & Indeed. I discovered that it costs employers money to post jobs/do recruitment on LinkedIn so I may be missing out on opportunities if I just apply via LinkedIn. All The Fringe Stuff I Did What else did I do this month… Hm… Oh yeah, I wrote … Read more

Counterfactuals for Reinforcement Learning I: “What if… ?”

Introduction to the POMDP framework and counterfactuals In philosophy, a counterfactual thought experiment asks: “What would have happened if A had happened instead of B?”. Gaining insights into the real world from such hypothetical considerations is an important aspect of human intelligence. This two-part series will explore how counterfactual thinking can be modeled within the … Read more

More player analysis with gganimate()

This post continues the analysis of IPL and T20 (men) batsmen and bowlers through animated charts. In my last post Analyzing player performance with animated charts! I had used animated horizontal bars to display the totalRuns or totalWickets over a 3 year ‘sliding window’. While that was cool, the only drawback of that animation was the … Read more

Bayesian Linear Regression with Bambi

Leverage Bayesian inference to get a distribution of your predictions When fitting a regression line to sample data, you might get a regression line like below: Image by Author Instead of getting one single regression line, wouldn’t it be nice if you can get a distribution of predictions instead? Image by Author That is when … Read more

8 Guidelines to Create Professional Data Science Notebooks

Sometimes during the analysis, you add code to cells and execute them, and then, after that, you modify and run another cell that comes before them. This may obviously cause some inconsistencies. For example, using variables defined in cells below the current cell will produce errors. See the straightforward example below, where we create a … Read more
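The inconsistency described above can be reproduced outside a notebook. In this hedged Python sketch, each "cell" is a string of code run in a shared namespace. The cell contents and the helper `run_cells` are hypothetical and are not the article's own example.

```python
# Two notebook "cells" as they appear on the page, top to bottom.
cells = [
    "total = price * quantity",   # cell 1 uses names that cell 2 defines
    "price, quantity = 4, 3",     # cell 2 was written and run first
]


def run_cells(cells, order):
    """Execute the cells in the given order inside one fresh namespace."""
    namespace = {}
    for index in order:
        exec(cells[index], namespace)
    return namespace


# The order in which the author ran the cells interactively: it works.
interactive = run_cells(cells, order=[1, 0])
print(interactive["total"])  # 12

# "Restart and run all" executes top to bottom and fails.
try:
    run_cells(cells, order=[0, 1])
    error = None
except NameError as exc:
    error = exc
print(error)
```

Restarting the kernel and running every cell from top to bottom before sharing a notebook catches this class of problem.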

My Data Science Journey in 2021

I was in the interview loop with multiple tech companies in the following few months. At first, it didn’t go as expected, and I bombed a few interviews for various reasons. First, I treated technical interviews as academic discussions rather than a structured way of evaluating candidates’ competency, which was a rookie mistake. Fortunately, I … Read more

Forecasting Chess Elo On A Time Series

Using the Glicko rating system to make predictions about your future chess rating. Photo by Hassan Pasha on Unsplash Not long ago, I came across this video[1] by 1littlecoder showing how you can use berserk, the Python client for the Lichess API, to extract information on your chess games. As a regular player on Lichess, … Read more

How to Deploy Machine Learning Models

The easiest way to deploy machine learning models on the web Introduction I will introduce the easiest way to deploy machine learning applications on the web. In the previous notebooks, I have built machine learning models using linear and tree-based models. It turned out that the hyperparameter-tuned XGBoost model performed best. That is why today … Read more

Leveling up your Machine Learning Projects

Project Configuration One important objective for writing good code for machine learning is to change the models without having to change any code. This is why we need a config file, to store important parameters and variables in a single place that allows us to modify them quickly. There are many different ways of writing … Read more
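As one possible shape for such a config file, the sketch below keeps the parameters in JSON and loads them into a typed, read-only object. The field names, the `TrainConfig` class, and the inline `CONFIG_TEXT` string (standing in for a file on disk) are assumptions made for illustration; the article may use a different format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Parameters the training code reads instead of hard-coding them."""
    model_name: str
    learning_rate: float
    n_estimators: int
    random_seed: int = 42   # default used when the config omits it


def load_config(text):
    """Parse JSON text into a TrainConfig; unknown keys raise TypeError."""
    return TrainConfig(**json.loads(text))


# In a real project this text would live in a file such as config.json.
CONFIG_TEXT = """
{
    "model_name": "gradient_boosting",
    "learning_rate": 0.1,
    "n_estimators": 200
}
"""

config = load_config(CONFIG_TEXT)
print(config.model_name, config.learning_rate, config.n_estimators, config.random_seed)
```

Switching models or hyperparameters then means editing the JSON, not the training code, and the frozen dataclass guards against accidental changes at runtime.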

Artificial Intelligence in Magic The Gathering

The tasks in this project followed a standard data science workflow and can be summarized in the list below. I’m planning on writing a specific article about each of them in great detail: Getting the data (5% of the effort) Transforming the data (25% of the effort) Feature Engineering … Read more

Simulating dice bingo

Note: This post was inspired by the “Classroom Bingo” probability puzzle in the Royal Statistical Society’s Significance magazine (Dec 2021 edition). Set-up Imagine that we are playing bingo, but where the numbers are generated by the roll of two 6-sided dice with faces 1, 2, …, 6. Each round, the two dice are rolled. If … Read more

Nearest Neighbor Analysis for Large Datasets in Helsinki Region

In the last few months, I have been part of the well-known Automating GIS course at the University of Helsinki as a Research Assistant. My experience has been remarkable while giving my tips for automating GIS processes to students during their tasks. I am glad to see how geographers are taking over the GIS automation … Read more

Reinventing adversarial machine learning: adversarial ML from scratch

Bear with me! I think this might be a half-decent motivation! I want to explain why I think adversarial ML is so interesting. To give it context, let’s start with a ludicrous party question: is a Pop-Tart a ravioli? … The metaphorical question Let’s unpack why the question makes for a fun debate among friends. … Read more

Applying data science in the life insurance industry — a perspective from a qualified actuary

How data science is changing the traditional landscape for life insurance actuaries (explained with a few use cases) Photo by Andrew Neel on Unsplash I’m a qualified actuary with 10 years of experience practising in the life insurance industry in Australia. For those of you who may not be familiar with what an actuary does … Read more

Get and Set working directory (setwd / getwd) in R

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers]. Visit for the most up-to-date information on Data Science, employment, and … Read more

Introduction to Geospatial Visualization with the tmap package

[This article was first published on Rami Krispin, and kindly contributed to R-bloggers]. I had super fun exploring the tmap package functionality while preparing a … Read more
