Performance Metrics in ML - Part 3: Clustering

In the first two parts of this series, we explored the main types of performance metrics used to evaluate Machine Learning models. These covered the two major types of ML tasks, Classification and Regression. While these types of tasks make up most of the usual applications, another key category exists: Clustering. To read the … Read more

Simple SVD algorithms

The aim of this post is to show some simple, educational examples of how to calculate the singular value decomposition using simple methods. If you are interested in industry-strength implementations, you might find this useful. Singular value decomposition (SVD) is a matrix factorization method that generalizes the eigendecomposition of a square matrix (n x n) to … Read more
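
As a rough illustration of a "simple method", the sketch below uses power iteration on AᵀA to estimate the largest singular value and its singular vectors. It is not taken from the post: the function name dominant_singular_triplet, the start vector, and the iteration count are assumptions chosen for the example, and the method assumes the start vector is not orthogonal to the dominant right singular vector.

```python
import math


def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]


def norm(x):
    """Euclidean norm of vector x."""
    return math.sqrt(sum(xi * xi for xi in x))


def dominant_singular_triplet(a, iterations=200):
    """Estimate (u, sigma, v) for the largest singular value of a.

    Hypothetical helper: runs power iteration on A^T A, whose dominant
    eigenvector is the dominant right singular vector of A.
    """
    at = transpose(a)
    # Assumed start vector; it must not be orthogonal to the answer.
    v = [1.0] + [0.0] * (len(a[0]) - 1)
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))  # w = A^T A v
        n = norm(w)
        v = [wi / n for wi in w]
    av = matvec(a, v)
    sigma = norm(av)                  # sigma = ||A v||
    u = [x / sigma for x in av]       # u = A v / sigma
    return u, sigma, v


a = [[4.0, 0.0], [3.0, -5.0]]
u, sigma, v = dominant_singular_triplet(a)
```

Power iteration only recovers one singular triplet at a time; the remaining ones would need deflation or a different method, which this sketch does not attempt.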

Functional Modeling and Quantitative System Analysis in Python

We restrict ourselves to the case of a unidirectional functional chain without feedback loops. Feedback loops introduce dynamic effects requiring an extension of the pattern; this will be covered in a follow-up article. Consider the following illustrative functional block diagram. (Figure: illustrative functional block diagram; image by author.) It shows a transformation chain from polar coordinates … Read more

Simple GPS data visualization using Python and Open Street Maps

The image below shows the goal of this method. There are three main elements that need to be included: Map image — map in some image format like .png, .jpg, etc. GPS records — records that consist of (latitude, longitude) pairs. Geographical coordinates — conversion from pixels to geographical coordinates. The final result of the … Read more
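
One way to relate GPS records to positions on a map image is plain linear interpolation over the image's bounding box. The sketch below is an assumption for illustration, not the post's code: the function name geo_to_pixel, the bbox layout, and the example coordinates are all hypothetical, and the linear mapping is only reasonable for small areas.

```python
def geo_to_pixel(lat, lon, bbox, image_size):
    """Map a (lat, lon) pair to an (x, y) pixel position.

    bbox is (lat_min, lat_max, lon_min, lon_max) of the map image and
    image_size is (width, height) in pixels. Both layouts are assumed.
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    width, height = image_size
    # Longitude grows left to right, like the pixel x axis.
    x = (lon - lon_min) / (lon_max - lon_min) * (width - 1)
    # Latitude grows upward but pixel rows grow downward, so flip y.
    y = (lat_max - lat) / (lat_max - lat_min) * (height - 1)
    return round(x), round(y)


# Hypothetical bounding box and image size, for illustration only.
bbox = (45.0, 46.0, 15.0, 16.0)
size = (101, 101)
top_left = geo_to_pixel(46.0, 15.0, bbox, size)
center = geo_to_pixel(45.5, 15.5, bbox, size)
```

The inverse mapping (pixels to geographical coordinates) follows by solving the same two linear equations for lat and lon.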

When Did the US Senate Best Reflect the US Population?

The data for this analysis will come from two primary sources. Information on the US Senators will come from the same ProPublica Congress API as the original visualization. Information on the US Population Age Distribution will come from a variety of sources from the US Census Bureau. Setting up the libraries: While the workhorse functions … Read more

Shiny 1.6

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. We are thrilled to announce that Shiny 1.6.0 is now on CRAN! … Read more

Enjoy More Rstudio::global(2021)

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. It’s been just over a week since we wrapped up the final … Read more

All You Need To Know About Merging (Joining) Datasets in R

Merging (also known as joining) two datasets by one or more common ID variables (keys) is a common task for any data scientist. If you get the merge wrong you can cause serious damage to your downstream analysis, so you’d better make sure you’re doing the right thing! In order to do so, I’ll walk you … Read more

HAPPY KAGGLING | Data Scientist’s Competition

Kaggle, a popular platform for data science competitions, can be intimidating for beginners to get into. After all, some of the listed competitions have over $1,000,000 prize pools and hundreds of competitors. Top teams boast decades of combined experience, tackling ambitious problems such as improving airport security or analyzing satellite data. It’s no surprise that … Read more

Creating beautiful maps with Python

I always liked city maps, and a few weeks ago I decided to build my own artistic versions of them. After googling a little bit I discovered this incredible tutorial written by Frank Ceballos. It is a fascinating and handy tutorial, but I prefer more detailed, realistic blueprint maps. Because of that, I decided to … Read more

Everything You Ever Wanted to Know About Decision Trees in Python

From first principles to deployment in a production environment, including worked examples and easy-to-follow explanations. (Photo by Shahadat Rahman on Unsplash.) I have come across many articles on decision tree machine learning algorithms in Python across various mediums but they have always left me wanting more. They either seem to leap in part-way through the … Read more

Displaying Logging While Drilling (LWD) Image Logs in Python

Utilizing the power of matplotlib to display wellbore image data. (Figure: Logging While Drilling image data displayed using matplotlib in Python. Image created by the author.) Borehole image logs are false-color pseudo images of the borehole wall generated from different logging measurements/tools. How borehole images are acquired differs between wireline logging and logging while drilling (LWD). … Read more

Getting data from the Canada Covid-19 Tracker using R

Last semester (Fall 2020) I taught a new course in healthcare data science for the Johnson Shoyama Graduate School of Public Policy. One of the final topics of the course was querying application programming interfaces (APIs) from within R. The example we used was querying data on the Covid-19 pandemic from the Covid-19 Tracker … Read more

Implementing Random Forest

If you don’t consider runtime, building more trees might help solve this problem a bit, as each tree has another set of randomly selected features. However, this method is not recommended, because Random Forest is prone to sparsity/density by design. If you prefer using tree algorithms, XGBoost is insensitive to sparse/dense data and worth trying, … Read more

How to Speed up Your K-Means Clustering by up to 10x Over Scikit-Learn

Using the Faiss library. (Image: Chire, CC BY-SA 4.0, via Wikimedia Commons.) K-Means Clustering is one of the most well-known and commonly used clustering algorithms in Machine Learning. Specifically, it is an unsupervised Machine Learning algorithm, meaning that it is trained without the need for ground-truth labels. Indeed, all you have to do to use it … Read more

Use R To Pull Energy Data From The Department of Energy’s EIA API

Now that we have our API key and the Series IDs, we can write the R code to access the data. First, import the necessary libraries. We need to use the httr and jsonlite libraries.
    # Import libraries
    install.packages(c("httr", "jsonlite"))
    library(httr)
    library(jsonlite)
Now, paste your API key into the code. Then paste in the series IDs you want to … Read more

Must-read Guide to Hypothesis Tests You Will Never Use

Hypothesis Testing Pipeline. So far, we have talked about the first two steps of hypothesis testing: (1) setting up the null and alternative hypotheses; (2) identifying error types and setting a significance threshold. Now, we will look at a simple scenario using Python code. Below, we have the tips dataset from Seaborn which contains 244 records of clients … Read more

BachGAN: Using GANs to generate original Baroque Music

With playable audio files to listen to the generated music. (Photo by Marius Masalar on Unsplash.) GANs are highly versatile, allowing for the generation of anything that can be synthesized into images. By utilizing this feature of GANs, it is possible to generate very unorthodox content, at least from the perspective of machine learning. This … Read more

Message queues for data UI

IPC to simplify medium and large application development. (Image prepared by the author. All rights reserved.) Introduction: Data Science visualisation normally gravitates around displaying individual charts and graphs. Therefore, there is a lot of material covering graphic libraries and frameworks to generate charts. Most advanced charting and plotting libraries allow interaction with data, but normally … Read more

Time Series Demystified

Components of Time Series. Time series can be decomposed into four components, each expressing a particular aspect of the movement of the values of the time series. They are: trend, seasonality, cycles, and irregularities. (Figure: time series components; image by author.) Seasonal and cyclic variations are the short-term fluctuations, whereas the trend is long-term movements and … Read more

Productivity Tip: Adding Jupyter and Anaconda prompts to Windows’ right-click context menu

After searching for the possibility online, I discovered that it’s not so cut and dried, but it’s not that complicated either, so bear with me! The first thing to do is discover the PATH to your Anaconda installation. If you used the default location during the installation process it should be located somewhere … Read more

What is Data Condensation?

Data-efficient learning is an important topic in Data Science and an active area of research. Training large models on big data can take a lot of time and resources, so the question is whether we can replace a large data set with a smaller one that nevertheless contains all useful information from … Read more

Use R to Exploit Unexplored Data Territories!

(Source: https://unsplash.com/@andrewtneel) A case study on how to use R to collect data from outside sources. Let’s assume that for a project you need data about your customers’ socioeconomic background such as the average income of the neighborhood where they live, education level, employment level, and so on. Typically such data are made available by some … Read more

Data science won’t improve your business decisions. Here’s what will.

Decision Intelligence is the missing link in most projects. Digital transformation is the flavor of the season. Every company has accelerated its efforts to digitize operations, gather intelligence, and rapidly respond to a changing market. McKinsey senior partner Kate Smaje says that organizations are now accomplishing in 10 days what used to take them 10 … Read more

Eyes on RT-PCR tests with echarts and french open data — COVID-19

[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers]. Data on COVID-19 screening tests for France; a shared dashboard … Read more

LogicGamesSolver— How to solve logic games using Computer Vision and Artificial Intelligence

So, here we are! We have all the elements to solve the game. Like many other logic puzzle games, Sudoku, Star Battle and Skyscrapers can be described as Constraint Satisfaction Problems (CSPs). A CSP consists of three elements: a set of variables for which we want to find the right value; a domain of possible values for … Read more

Parsing portfolio optimization

[This article was first published on R on OSM, and kindly contributed to R-bloggers]. Our last few posts on risk factor models haven’t discussed how … Read more

Value-based Methods in Deep Reinforcement Learning

Deep reinforcement learning has been a rising field in the last few years. A good approach to start with is the value-based method, where the state (or state-action) values are learned. In this post, a comprehensive review is provided where we focus on Q-learning and its extensions. There are three types of common machine … Read more
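
To make "learning state-action values" concrete, here is a minimal sketch of the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + α (r + γ maxₐ′ Q(s′, a′) − Q(s, a)). It is not taken from the post: the function name q_update, the state and action labels, and the default values of alpha and gamma are assumptions for illustration, and the deep variants the post covers replace the table with a neural network.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update and return the new value.

    q maps (state, action) pairs to values; alpha is the learning rate
    and gamma the discount factor (both defaults are assumed).
    """
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-state example.
q = defaultdict(float)
actions = ["left", "right"]
q[("s1", "right")] = 1.0
new_value = q_update(q, "s0", "right", reward=0.5,
                     next_state="s1", actions=actions)
```

The update moves the current estimate a fraction alpha of the way toward the bootstrapped target, which is why repeated visits are needed before values settle.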

Rolling Regression and Pairs Trading in R

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. In a previous post, we provided an example of … Read more

Controlling Gradient Descent

…is like driving an old car. (Photo by Ralph (Ravi) Kayden on Unsplash.) The algorithm to be discussed will allow you to routinely reach cost function values that are lower by a factor of about 10¹⁰ compared to the Adam optimization algorithm. Keep on reading to see why it’s like “driving an old car”! There are many … Read more

Replicate Avro Messages To Target, Conflicting Schema Exists On Target Schema Registry under same subject

This is a follow-up article based on this, where we discussed what to expect when Replicator tries to copy a topic with Avro messages to a target whose schema registry already has the same schema ID (which is embedded in the messages) residing under a different subject, and that schema object is completely different from what it had on … Read more

Practical experimentation tips using the Robinhood/GME fiasco

tl;dr You don’t need (and probably can’t use) an A/B test to know that Robinhood churned its user base by restricting GME trading. I’m going to use the recent Robinhood/GME fiasco as a hypothetical example in sharing a couple of practical ideas around experimentation I’ve picked up over the years. Disclaimer: this is obviously hypothetical. … Read more

Two new versions of gratia released

While the Covid-19 pandemic and teaching a new course in the fall put paid to most of my development time last year, some time off work this January allowed me time to work on gratia 📦 again. I released 0.5.0 to CRAN in part to fix an issue with tests not running on the new … Read more

4 Must-Know Properties of Databases

How to make a database ACID-compliant. (Photo by Code Mnml on Unsplash.) Everything about data science starts with data. Without proper and accurate data, data science is like a luxury car with no gas. A well-maintained, easily accessible, scalable, and hard-to-fail database is essential to provide access to data. In order to make sure a … Read more
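
As a small illustration of the "A" in ACID (atomicity), the sketch below uses Python's built-in sqlite3 module to show a two-step transfer that either applies fully or not at all. It is not from the post: the accounts table, the transfer helper, and the balances are hypothetical, and it demonstrates only atomicity and a consistency check, not isolation or durability.

```python
import sqlite3

# In-memory database with a CHECK constraint as a consistency rule.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("alice", 100), ("bob", 50)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; return True if it was committed."""
    try:
        # The connection context manager commits on success and
        # rolls back both updates if any statement raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # fits within the balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer the credit to bob runs first and is then undone by the rollback, so no partial transfer is ever visible after the call returns.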

Why Data Analysts Should Apply to Data Scientist Jobs

Trends versus predictions. Data analysts use data at an aggregate level to find trends and provide recommendations to improve business performance. Data scientists use data in machine learning models to predict an event, typically at the customer level. Data analysts look at the past to find trends while data scientists use the past to … Read more

Large-Scale Analysis of On-line Conversation about Vaccines before COVID-19

Twitter and news sources played a role in the pre-pandemic world. Do you remember the old days of anti-vaccination debates? Will they affect today’s attitudes? Data can expose it all. (Photo by Mehmet Turgut Kirkgoz on Unsplash.) Discussion over the role and need for vaccines has never been so strong. The COVID-19 pandemic has changed … Read more