My First Month as a Data Scientist

Photo by Antoine Dautry on Unsplash. With the hype surrounding machine learning these days, it’s quite easy to fall into the trap of overcomplicating a problem that could be solved with a simple linear or logistic regression. In some cases, the required infrastructure for a complex machine learning pipeline might not even be available. Most data … Read more

Simple KNN Classifier with Four Lines of Code for Beginners: Machine Learning

Photo by Annie Spratt on Unsplash. A clear explanation of some core machine learning concepts with a project. The KNN classifier is a very popular and well-known supervised machine learning technique. This article will explain the KNN classifier with an example. What is a supervised learning model? I will explain it in detail. But here … Read more

Combining logistic regression and decision tree

Making logistic regression less linear. Logistic regression is one of the most used machine learning techniques. Its main advantages are clarity of results and its ability to explain the relationship between dependent and independent features in a simple manner. It requires comparatively less processing power, and is, in general, faster than Random Forest or Gradient … Read more

Imbalanced Data, What Can You Do?

When finding that one trustworthy politician in your dataset is like looking for a needle in the haystack, here is what you can do. Illustration of standard techniques, Image by Author. When we speak of imbalanced data, what we mean is that at least one class is underrepresented. For example, when considering the problem of … Read more

Course sequence: Data analytics for the liberal arts

[This article was first published on George J. Mount, and kindly contributed to R-bloggers]. I’m a proud liberal arts graduate myself who, with some fumbling, ended up in the world of data analytics. It may sound odd, but I never fancied myself much of a “math person,” and I still love to explore the world through … Read more

Categories: R

Why I’m Starting Data Science Over.

Introducing #66DaysOfData. I’m feeling stuck. In my current work and in the content I create (videos and blog posts), I feel like I’ve begun to stall out. Most of the consumers of my content are at the start of their data science journey. The longer I’m in the field, the less I feel I can … Read more

Classification Framework for Imbalanced Data.

Understanding and utilizing imbalanced data. Classification is a type of supervised learning in Machine Learning that deals with categorizing data into classes. Supervised learning trains a model on inputs paired with their known outputs, so that the model can later make useful predictions on new data whose output is unknown. Some examples of … Read more

Quick Tips for Data Cleaning in R

[This article was first published on Exploring Data, and kindly contributed to R-bloggers]. Exploring-Data is a place where I share easily digestible content aimed at … Read more

Categories: R

100 Time Series Data Mining Questions (with answers!) – Part 1

I decided to start this series of Time Series Data Mining based on Eamonn’s presentation, so that’s why the title is “100”. That’s the idea, but for now, we only have 19 questions ready to go. I’ll use the datasets available at so you can try this at home. The original code (MATLAB) and … Read more

Categories: R

Survey categorical variables with KableExtra

For my in-progress thesis, I decided to analyze my survey results in something other than the SPSS we learned in undergrad, which eventually led me to begin using R. The time came and I started analyzing my pilot survey data from Qualtrics. In this post I’ll address how I used {KableExtra} to nicely print a frequency … Read more

Categories: R

Data Professionals: The Arbiters of Truth

Source: Unsplash. Coming out of college with a background in mathematics, I fell upward into the rapidly growing field of data analytics. It wasn’t until years later that I realized the incredible power that comes with the position. As Uncle Ben told Peter Parker (aka Spider-Man), “With great power comes great responsibility”. The proverb echoed … Read more

Custom Coloring Dendrogram Ends in R

As a graduate student studying microbial community data, most of the projects I work on involve some sort of clustering analysis. For one of them, I wanted to color the ends of a dendrogram by a variable from my metadata, to visualize whether that variable followed the clustering as part of another figure. There exist … Read more

TabNet on AI Platform: High-performance, Explainable Tabular Learning

Research Scientist, Google Cloud AI; Software Engineer, Google Cloud AI

Today, we’re making TabNet available as a built-in algorithm on Google Cloud AI Platform, creating an integrated tool chain that makes it easier to run training jobs on your data without writing any code.  TabNet combines the best of two worlds: it is explainable (similar to simpler tree-based models) while benefiting from high performance (similar … Read more

Exploratory Data Analysis on Anime Data

Looking at this, we can see that comedy is actually the most common genre, unlike hentai, which appeared to be the most common before we did the preprocessing and plotting. When we have a categorical variable and a numerical variable, we can resort to using bar charts. Depending on whether the categorical column is on … Read more

Data manipulation in r using data frames – an extensive article of basics part2 – aggregation and sorting

[This article was first published on dataENQ, and kindly contributed to R-bloggers]. Welcome to the second part of this two-part series on data manipulation in … Read more

Categories: R

Farewell RNNs, Welcome TCNs

4.2 Model Architecture Overview. The fundamental TCN model architecture mentioned here is derived from Section 3 above: a generic TCN architecture consisting of causal convolutions, residual connections and dilated convolutions. The overview of the KDTCN architecture is shown below. Illustration of the KDTCN framework. Original model inputs are price values X, news corpus N, and … Read more

Explore Harry Potter via a dynamic social network of characters

Description: In Harry Potter, we obtain four different communities. One involves Harry’s family, including his true parents and the Dursleys, as well as their entourage. A second one is formed by Dumbledore’s friends, which are a bit outside of the scope of the book’s main plot and are mentioned during one chapter. The members of … Read more

Advancing a culture of reliability at the pace of Azure

“Customers value cloud services because they are agile and adaptable, scaling and transforming to meet the changing needs of business. Since the velocity of change can work against the tenets of reliability, our Azure engineering teams have evolved their culture, processes, and frameworks to balance the pace of innovation with assurance of performance and quality. … Read more

An Interesting Aspect of the Omitted Variable Bias

Econometrics does not cease to surprise me. I just now realized an interesting feature of the omitted variable bias. Consider the following model: Assume we want to estimate the causal effect beta of x on y. However, we have an unobserved confounder z that affects both x and y. If we don’t add the confounder … Read more

Categories: R

Course Launch: High-Performance Time Series Forecasting in 7 Days!

[This article was first published on, and kindly contributed to R-bloggers]. High-Performance Time Series Forecasting Course is an amazing course designed to teach Data … Read more

Categories: R

130,780-point Quantum Classification

Quantum Classification Using a Real-World Dataset Dr. Vishal Sharma, a Postdoctoral Research Fellow, recently asked me via LinkedIn messaging about quantum classification. I won’t reveal anything about his 5-column, 26,156-row dataset other than its size — 130,780 total data points — but I will reveal the modified algorithm that I shared with him. Introduction Quantum … Read more

Why is AI So Smart and Yet So Dumb?

The reason behind Moravec’s Paradox. Photo by Stephen Andrews on Unsplash. At the most basic level, the reason for Moravec’s Paradox is simple: We don’t know how to program general intelligence (yet). We’re already good at getting AI to do specific things, but most toddler-level skills require learning new things and transferring them into … Read more

How to Become a Better Storyteller: 3 Key Points

Great movies, speeches, and business presentations all share one thing: an emotional arc. How can it spice up your story? You’re launching a digital transformation initiative in the middle of the ongoing pandemic. You are pretty excited about this big-ticket investment, which has the potential to solve remote-work challenges that your organization faces. However, the … Read more

Supervised learning — the What, When, Why, Good and Bad (Part 1)

Before I explore the above regression techniques further, let’s gain an understanding of the assumptions associated with regression techniques. Regression Assumptions. Residuals of the regression line are normally distributed: to ensure that the results from the regression model are valid, the residuals (difference between observed and predicted values) should follow a normal distribution (with … Read more

Detecting Pneumonia from Chest X-Rays with Deep Learning

Now that we have finished examining and analyzing our data, we can go ahead and build some machine learning models. We will begin simple, using the K-nearest neighbours and logistic regression classifiers. K-Nearest Neighbours: The key concept of K-nearest neighbours is that, when we see an unknown example, we will look at what the unknown is … Read more
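
As a rough illustration of the K-nearest neighbours idea mentioned in the excerpt above, here is a minimal pure-Python sketch. The function name knn_predict, the toy points, the labels, and the choice of Euclidean distance with majority voting are all assumptions made for illustration; this is not the article's code.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters of 2-D feature vectors.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels = ["normal", "normal", "normal", "pneumonia", "pneumonia", "pneumonia"]

prediction = knn_predict(points, labels, (4.9, 5.0), k=3)
print(prediction)
```

With real image data the feature vectors would be much longer (for example, flattened pixel values), but the distance-then-vote logic stays the same.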

Text Summarization for Clustering documents

spaCy isn’t great at named entity recognition on healthcare documents. See below:

    doc = nlp(notes_data["TEXT"][178])
    text_label_df = pd.DataFrame({"label": [ent.label_ for ent in doc.ents], "text": [ent.text for ent in doc.ents]})
    display(HTML(text_label_df.head(10).to_html()))

Image by Author: Poor job at POS tagging in healthcare jargon. But that does not mean it cannot be used to summarize our text. It … Read more

Sampling Distribution — sample mean

With Python simulation and examples. One of the most important concepts discussed in the context of inferential data analysis is the idea of sampling distributions. Understanding sampling distributions helps us better comprehend and interpret results from our descriptive as well as predictive data analysis investigations. Sampling distributions are also frequently used in decision making under … Read more
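
To make the idea of a sampling distribution of the sample mean concrete, the following is a small simulation sketch using only the standard library. The uniform population, the sample size of 30, and the number of repetitions are assumptions chosen for illustration, not values from the article.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 20,000 draws from a uniform distribution on [0, 1).
population = [random.random() for _ in range(20_000)]

sample_size = 30
num_samples = 2_000

# Draw many samples (without replacement) and record each sample's mean.
sample_means = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(num_samples)
]

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# The sample means should centre on the population mean ...
mean_of_means = statistics.fmean(sample_means)
# ... and their spread should be close to sigma / sqrt(n), the standard error.
sd_of_means = statistics.stdev(sample_means)
expected_se = pop_sd / sample_size ** 0.5

print(f"population mean {pop_mean:.4f}, mean of sample means {mean_of_means:.4f}")
print(f"expected standard error {expected_se:.4f}, observed {sd_of_means:.4f}")
```

Plotting a histogram of sample_means would also show the roughly bell-shaped curve predicted by the central limit theorem, even though the population itself is uniform.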

Task 1 – Retail Strategy and Analytics

To inspect whether certain columns are in their specified format, e.g. that the date column is in date format. We saw that the date column is stored in numeric format, which is wrong, so we convert it to the date format as shown below. Examine PROD_NAME: generating a summary of the PROD_NAME column. #head(transaction_data$PROD_NAME) transaction_data[, .N, PROD_NAME] … Read more

Categories: R

Crowd Counting Consortium Crowd Data and Shiny Dashboard

[This article was first published on R Views, and kindly contributed to R-bloggers]. Jay Ulfelder, PhD, serves as Program Manager for the Nonviolent Action Lab, … Read more

Categories: R

Handling errors using purrr’s possibly() and safely()

One topic I haven’t discussed in my previous posts about automating tasks with loops or doing simulations is how to deal with errors. If we have unanticipated errors, a map() or lapply() loop will come to a screeching halt with no output to show for the time spent. When your task is time-consuming, this can … Read more

Categories: R

3 Ways to Handle Args in Python

If you want to have a proper command-line interface for your application, argparse is the module to go with. This is a full-fledged module that offers out-of-the-box argument parsing, help messages and automated error throwing when arguments get misused. This module comes pre-installed with Python. To fully utilize the functionalities provided by argparse, it … Read more
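
As a small sketch of what an argparse-based command-line interface can look like, the example below defines one positional argument and two optional ones. The greeting program, the argument names, and the helper functions build_parser and greet are hypothetical and chosen only for illustration.

```python
import argparse


def build_parser():
    """Create a parser with one positional and two optional arguments."""
    parser = argparse.ArgumentParser(description="Greet someone from the command line.")
    parser.add_argument("name", help="name of the person to greet")
    parser.add_argument("--times", type=int, default=1,
                        help="how many times to repeat the greeting")
    parser.add_argument("--shout", action="store_true",
                        help="print the greeting in upper case")
    return parser


def greet(args):
    """Build the list of greeting lines described by the parsed arguments."""
    message = f"Hello, {args.name}!"
    if args.shout:
        message = message.upper()
    return [message] * args.times


# Passing an explicit list keeps this sketch self-contained; in a real script,
# calling parse_args() with no arguments reads sys.argv instead.
args = build_parser().parse_args(["Ada", "--times", "2", "--shout"])
for line in greet(args):
    print(line)
```

Running a script built this way with -h prints an automatically generated help message, and passing a non-integer to --times triggers argparse's built-in error reporting.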

Predicting Adoption Speed for PetFinder

Our team is currently acting as a consulting agency working on behalf of PetFinder, a non-profit organization that maintains a database of animals and aims to improve animal welfare through collaborations with related parties. The core task of this project is to predict how long it will take for a pet to be adopted. … Read more

5 Data Science Interview Mistakes I’ve Made

Contents: Introduction; Discussing the same past project; Not asking enough questions; Assuming interviewers know my past experiences; Not considering the business impact; Not overviewing the whole Data Science process; Summary; References. I have interviewed with several companies, making some repeated, key mistakes along the way. As a result, I have learned from them and have … Read more

Transitioning a Soil Mechanics Course to R/exams

[This article was first published on R/exams, and kindly contributed to R-bloggers]. Experiences with transitioning to R/exams for multiple automated and randomized online exams in … Read more

Categories: R

8 Determining Factors for the Selection of the Model Approach

What you need to know: Corporations have defined project processes, including stage-gate or steering committee meetings where outcomes must be presented. Presentations have to be submitted a few days in advance and must contain certain expected information. Also, corporates are always under pressure to deliver financial results. That leads to consistently tight … Read more

Why you are throwing money away if your cloud data warehouse doesn’t separate storage and compute

What you should consider before migrating to the cloud to make your data warehouse and data lake future-proof, and how the separation of storage and compute was approached by Snowflake, Amazon, Google, SAP, and IBM. Photo by John Schnobrich on Unsplash. Not so long ago, establishing an enterprise data warehouse involved a project that would … Read more

Exploratory Data Analysis with 1 line of Python code

Overview of the Pandas-Profiling library. Image by Peggy und Marco Lachmann-Anke from Pixabay. Exploratory data analysis (EDA) is an approach to analyzing data and summarizing its main characteristics. One spends a lot of time doing EDA to get an understanding of data. EDA involves a lot of steps including some statistical tests, visualization of data … Read more

An introduction to weather forecasting with deep learning

[This article was first published on RStudio AI Blog, and kindly contributed to R-bloggers]. Same with weekly climatology: Looking back at how warm it was, … Read more

Categories: R

From R Hub – JavaScript for the R package developer

[This article was first published on R Consortium, and kindly contributed to R-bloggers]. Originally posted on the R Hub blog. JS and R, what a … Read more

Categories: R

Drawing a Map using Python and Word2vec

Word2vec is definitely the most playful concept I’ve met during my Natural Language Processing studies so far. Imagine an algorithm that can really successfully mimic understanding meanings of words and their functions in the language, that can measure the closeness of words along the lines of hundreds of different topics, that can answer more complicated … Read more

Green is the new Black: Saving Amazon Rainforests using AI!

In Amazonia, fire is associated with several land-practices. Slash-and-Burn is one of the most used practices in Brazilian agriculture (as part of a seasonal cycle called “queimada”). Whether for opening and cleaning agricultural areas or renewing pastures, its importance in the agricultural chain is undeniable. Unfortunately, this is often the cause of wildfires in forests. … Read more

Collinearity Measures

Another approach to identifying multicollinearity is via the Variance Inflation Factor (VIF). VIF measures the factor by which the variance of each variable’s coefficient is inflated by collinearity. Beginning at a value of 1 (no collinearity), a VIF between 1–5 indicates moderate collinearity while values above 5 indicate high collinearity. Some cases where high VIF would be acceptable include the … Read more
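
The VIF for predictor j is commonly defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below covers only the special case of two predictors, where R_j^2 reduces to the squared Pearson correlation between them, so it can be written with the standard library alone. The function names and the simulated data are assumptions for illustration; a model with more predictors would need a full auxiliary regression per predictor.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical data: one predictor, an unrelated predictor, and a noisy near-copy.
x1 = [random.gauss(0, 1) for _ in range(500)]
independent = [random.gauss(0, 1) for _ in range(500)]
near_copy = [a + random.gauss(0, 0.1) for a in x1]

vif_low = vif_two_predictors(x1, independent)   # close to 1: little collinearity
vif_high = vif_two_predictors(x1, near_copy)    # far above 5: high collinearity

print(f"independent predictors: VIF = {vif_low:.2f}")
print(f"nearly collinear predictors: VIF = {vif_high:.2f}")
```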