My First Month as a Data Scientist

Photo by Antoine Dautry on Unsplash With the hype surrounding machine learning these days, it’s quite easy to fall into the trap of overcomplicating a problem that could be solved with a simple linear or logistic regression. In some cases, the required infrastructure for a complex machine learning pipeline might not even be available. Most data … Read more My First Month as a Data Scientist

Simple KNN Classifier with Four Lines of Code for Beginners: Machine Learning

Photo by Annie Spratt on Unsplash A clear explanation of Some Core Machine Learning Concepts with a Project The KNN classifier is a very popular and well-known supervised machine learning technique. This article will explain the KNN classifier with an example. What is a supervised learning model? I will explain it in detail. But here … Read more Simple KNN Classifier with Four Lines of Code for Beginners: Machine Learning

Combining logistic regression and decision tree

Making logistic regression less linear Logistic regression is one of the most used machine learning techniques. Its main advantages are clarity of results and its ability to explain the relationship between dependent and independent features in a simple manner. It requires comparatively less processing power, and is, in general, faster than Random Forest or Gradient … Read more Combining logistic regression and decision tree

Imbalanced Data, What Can You Do?

When finding that one trustworthy politician in your dataset is like looking for a needle in the haystack, here is what you can do. Illustration of standard techniques, Image by Author When we speak of imbalanced data, what we mean is that at least one class is underrepresented. For example, when considering the problem of … Read more Imbalanced Data, What Can You Do?

Course sequence: Data analytics for the liberal arts

[This article was first published on George J. Mount, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t. I’m a proud liberal arts graduate myself who, with some fumbling, ended up in the world of data analytics. It may sound odd, but I never fancied myself much of a “math person,” and I still love to explore the world through … Read more Course sequence: Data analytics for the liberal arts

5 Python Tricks To Make Your Life More Productive

How many times were you forced to switch from the terminal to an internet browser just to view recent emails? It happens a few times to every one of us. You could accelerate your workflow and boost your productivity by getting rid of this context switch. Simply use the Python imaplib module that lets … Read more 5 Python Tricks To Make Your Life More Productive

Time Series Pattern Recognition with Air Quality Sensor Data

A real-world client-facing project with real sensor data Photo by Marcin Jozwiak on Unsplash Note: The detailed project report and the datasets used in this post can be found in my GitHub Page. This project was assigned to me by a client. There is no non-disclosure agreement required and the project does not contain any … Read more Time Series Pattern Recognition with Air Quality Sensor Data

Classification Framework for Imbalanced Data.

Understanding and utilizing imbalanced data. Classification is a type of supervised learning in Machine Learning that deals with categorizing data into classes. Supervised learning implies models that take input with their matched output to train a model which can later make useful predictions on a new set of data with no output. Some examples of … Read more Classification Framework for Imbalanced Data.

Survey categorical variables with KableExtra

In my in-progress thesis I decided I’ll analyze my survey results in something other than SPSS we learned in undergrad, which eventually led me to begin using R. The time came and I started analyzing my pilot survey data from Qualtrics. In this post I’ll address how I used {KableExtra} to nicely print a frequency … Read more Survey categorical variables with KableExtra

100 Time Series Data Mining Questions (with answers!) – Part 1

I decided to start this series of Time Series Data Mining based on Eamonn’s presentation, so that’s why the title is “100”. That’s the idea, but for now, we only have 19 questions ready to go. I’ll use the datasets available at https://github.com/matrix-profile-foundation/mpf-datasets so you can try this at home. The original code (MATLAB) and … Read more 100 Time Series Data Mining Questions (with answers!) – Part 1

Diagnosing and dealing with degenerate estimation in a Bayesian meta-analysis

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. The federal government recently granted emergency approval for the use of … Read more Diagnosing and dealing with degenerate estimation in a Bayesian meta-analysis

Data Professionals: The Arbiters of Truth

Source: Unsplash Coming out of college with a background in mathematics, I fell upward into the rapidly growing field of data analytics. It wasn’t until years later that I realized the incredible power that comes with the position. As Uncle Ben told Peter Parker (aka Spiderman), “With great power, comes great responsibility”. The proverb echoed … Read more Data Professionals: The Arbiters of Truth

Custom Coloring Dendrogram Ends in R

As a graduate student studying microbial community data, most of the projects I work on involve some sort of clustering analysis. For one of them, I wanted to color the ends of a dendrogram by a variable from my metadata, to visualize whether that variable followed the clustering as part of another figure. There exist … Read more Custom Coloring Dendrogram Ends in R

TabNet on AI Platform: High-performance, Explainable Tabular Learning

Today, we’re making TabNet available as a built-in algorithm on Google Cloud AI Platform, creating an integrated tool chain that makes it easier to run training jobs on your data without writing any code. TabNet combines the best of two worlds: it is explainable (similar to simpler tree-based models) while benefiting from high performance (similar … Read more TabNet on AI Platform: High-performance, Explainable Tabular Learning

Simulations Comparing Interaction for Adjusted Risk Ratios versus Adjusted Odds Ratios

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. Introduction Risk ratios (RR) and odds ratios (OR) are both used to analyze … Read more Simulations Comparing Interaction for Adjusted Risk Ratios versus Adjusted Odds Ratios

Exploratory Data Analysis on Anime Data

Looking at this, we can see that comedy is actually the most common genre, unlike hentai, which appeared to be the most common before we did the preprocessing and plotting. When we have a categorical variable and a numerical variable, we can resort to using bar charts. Depending on whether the categorical column is on … Read more Exploratory Data Analysis on Anime Data

Data manipulation in r using data frames – an extensive article of basics part2 – aggregation and sorting

[This article was first published on dataENQ, and kindly contributed to R-bloggers]. Welcome to the second part of this two-part series on data manipulation in … Read more Data manipulation in r using data frames – an extensive article of basics part2 – aggregation and sorting

Farewell RNNs, Welcome TCNs

4.2 Model Architecture Overview The fundamental TCN model architecture mentioned here is derived from Section 3 above: a generic TCN architecture consisting of causal convolutions, residual connections and dilated convolutions. The overview of KDTCN architecture is shown below: Illustration of the KDTCN framework Original model inputs are price values X, news corpus N, and … Read more Farewell RNNs, Welcome TCNs
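To make the "causal" and "dilated" building blocks named in the excerpt above a little more concrete, here is a minimal pure-Python sketch of a 1-D causal dilated convolution. It is only an illustration of the general operation, not the KDTCN model from the article; the function name `causal_dilated_conv` and the toy inputs are assumptions made for this example.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """1-D causal convolution: y[t] = sum_k weights[k] * x[t - k * dilation].

    Positions before the start of the sequence are treated as zeros,
    so y[t] never depends on x[t + 1] or any later value.
    """
    y = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # look back k * dilation steps
            if idx >= 0:            # implicit zero padding on the left
                total += w * x[idx]
        y.append(total)
    return y


x = [1, 2, 3, 4, 5]  # hypothetical toy sequence
print(causal_dilated_conv(x, weights=[1, 1], dilation=1))  # [1.0, 3.0, 5.0, 7.0, 9.0]
print(causal_dilated_conv(x, weights=[1, 1], dilation=2))  # [1.0, 2.0, 4.0, 6.0, 8.0]
```

With a dilation of 2 each output combines the current value with the one two steps back, which is how stacking layers with growing dilation lets a network see far into the past with few parameters.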

Explore Harry Potter via a dynamic social network of characters

Description: In Harry Potter, we obtain four different communities. One involves Harry’s family, including his true parents and the Dudleys, as well as their entourage. A second one is formed by Dumbledore’s friends, which are a bit outside of the scope of the book’s main plot and are mentioned during one chapter. The members of … Read more Explore Harry Potter via a dynamic social network of characters

Advancing a culture of reliability at the pace of Azure

“Customers value cloud services because they are agile and adaptable, scaling and transforming to meet the changing needs of business. Since the velocity of change can work against the tenets of reliability, our Azure engineering teams have evolved their culture, processes, and frameworks to balance the pace of innovation with assurance of performance and quality. … Read more Advancing a culture of reliability at the pace of Azure

Azure Cost Management + Billing updates – August 2020

Whether you’re a new student, thriving startup, or the largest enterprise, you have financial constraints, and you need to know what you’re spending, where, and how to plan for the future. Nobody wants a surprise when it comes to the bill, and this is where Azure Cost Management + Billing comes in. We’re always looking … Read more Azure Cost Management + Billing updates – August 2020

An Interesting Aspect of the Omitted Variable Bias

Econometrics does not cease to surprise me. I just now realized an interesting feature of the omitted variable bias. Consider the following model: Assume we want to estimate the causal effect beta of x on y. However, we have an unobserved confounder z that affects both x and y. If we don’t add the confounder … Read more An Interesting Aspect of the Omitted Variable Bias
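The excerpt above cuts off before the model is shown, so the following is only a small simulation sketch of omitted variable bias in general, not the author's derivation. The data-generating process (the coefficients `beta` and `gamma`, standard normal noise, the sample size) is entirely hypothetical. It compares the slope from regressing y on x alone with the slope obtained after partialling the confounder z out of x (the Frisch-Waugh-Lovell approach).

```python
import random


def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)
n = 20_000
beta, gamma = 1.0, 2.0  # hypothetical true effects of x and z on y

z = [rng.gauss(0, 1) for _ in range(n)]                 # unobserved confounder
x = [zi + rng.gauss(0, 1) for zi in z]                  # z affects x
y = [beta * xi + gamma * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

# Short regression: z is omitted, so its effect loads onto x.
naive = slope(x, y)

# Partial z out of x, then regress y on the remaining variation in x.
b_xz = slope(z, x)
x_resid = [xi - b_xz * zi for xi, zi in zip(x, z)]
adjusted = slope(x_resid, y)

print(f"true beta:        {beta:.2f}")
print(f"omitting z:       {naive:.2f}")
print(f"controlling for z: {adjusted:.2f}")
```

Under these assumptions the naive slope should land near beta + gamma * cov(x, z) / var(x) = 1 + 2 * 0.5 = 2, while the adjusted slope should sit close to the true value of 1.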

Course Launch: High-Performance Time Series Forecasting in 7 Days!

[This article was first published on business-science.io, and kindly contributed to R-bloggers]. High-Performance Time Series Forecasting Course is an amazing course designed to teach Data … Read more Course Launch: High-Performance Time Series Forecasting in 7 Days!

A layman’s derivative on Deep Q Learning | Reinforcement Learning

Photo by Morning Brew on Unsplash This article is an attempt to establish the intuition behind Deep Q Networks. Before we get to that, let’s spend a good amount of time understanding what Q Learning is. Too long, do read! In simple terms, Q Learning is a model-free Reinforcement Learning approach to enable the … Read more A layman’s derivative on Deep Q Learning | Reinforcement Learning

130,780-point Quantum Classification

Quantum Classification Using a Real-World Dataset Dr. Vishal Sharma, a Postdoctoral Research Fellow, recently asked me via LinkedIn messaging about quantum classification. I won’t reveal anything about his 5-column, 26,156-row dataset other than its size — 130,780 total data points — but I will reveal the modified algorithm that I shared with him. Introduction Quantum … Read more 130,780-point Quantum Classification

Why is AI So Smart and Yet So Dumb?

The reason behind Moravec’s Paradox. Photo by Stephen Andrews on Unsplash At the most basic level, the reason for Moravec’s Paradox is simple: We don’t know how to program general intelligence (yet). We’re already good at getting AI to do specific things, but most toddler level skills require learning new things and transferring them into … Read more Why is AI So Smart and Yet So Dumb?

How to Become a Better Storyteller: 3 Key Points

Great movies, speeches, and business presentations all share one thing: an emotional arc. How can it spice up your story? You’re launching a digital transformation initiative in the middle of the ongoing pandemic. You are pretty excited about this big-ticket investment, which has the potential to solve remote-work challenges that your organization faces. However, the … Read more How to Become a Better Storyteller: 3 Key Points

Supervised learning — the What, When, Why, Good and Bad (Part 1)

Before I explore the above regression techniques further, let’s gain an understanding of the assumptions associated with regression techniques. Regression Assumptions Residuals of the regression line are normally distributed: to ensure that the results from the regression model are valid, the residuals (difference between observed and predicted values) should follow a normal distribution (with … Read more Supervised learning — the What, When, Why, Good and Bad (Part 1)

Detecting Pneumonia from Chest X-Rays with Deep Learning

Now that we have finished examining and analyzing our data, we can go ahead and build some machine learning models. We will start simple, using the K-nearest neighbours and logistic regression classifiers. K-Nearest Neighbours The key concept of K-nearest neighbours is that, when we see an unknown example, we will look at what the unknown is … Read more Detecting Pneumonia from Chest X-Rays with Deep Learning
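As a rough illustration of the K-nearest neighbours idea mentioned in the excerpt above, here is a minimal sketch that classifies a point by majority vote among its k closest training points. It is not the article's X-ray pipeline: the 2-D points, the labels, and the helper name `knn_predict` are all hypothetical stand-ins.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Vote among the k closest neighbours.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["normal", "normal", "normal", "pneumonia", "pneumonia", "pneumonia"]

print(knn_predict(points, labels, query=(1.5, 1.5)))  # normal
print(knn_predict(points, labels, query=(6.5, 6.5)))  # pneumonia
```

For real image data the points would be feature vectors rather than 2-D coordinates, and a library implementation would usually be preferable to this brute-force search.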

Text Summarization for Clustering documents

spaCy isn’t great at Named Entity Recognition on healthcare documents. See below:

    doc = nlp(notes_data["TEXT"][178])
    text_label_df = pd.DataFrame({"label": [ent.label_ for ent in doc.ents], "text": [ent.text for ent in doc.ents]})
    display(HTML(text_label_df.head(10).to_html()))

Image by Author: Poor job at POS tagging in healthcare jargon

But that does not mean it cannot be used to summarize our text. It … Read more Text Summarization for Clustering documents

Sampling Distribution — sample mean

with Python simulation and examples One of the most important concepts discussed in the context of inferential data analysis is the idea of sampling distributions. Understanding sampling distributions helps us better comprehend and interpret results from our descriptive as well as predictive data analysis investigations. Sampling distributions are also frequently used in decision making under … Read more Sampling Distribution — sample mean
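To give a flavour of the kind of simulation the excerpt above refers to, here is a minimal sketch using only the standard library. It is not the article's own code: the exponential population, the sample size of 50, and the helper name `sample_means` are assumptions chosen for illustration.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (with replacement) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical skewed population: exponential with mean 1.
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1.0) for _ in range(10_000)]

means = sample_means(population, sample_size=50, n_samples=2_000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

print(f"population mean:             {pop_mean:.3f}")
print(f"mean of sample means:        {statistics.fmean(means):.3f}")
print(f"theoretical standard error:  {pop_sd / 50 ** 0.5:.3f}")
print(f"observed SD of sample means: {statistics.stdev(means):.3f}")
```

The mean of the sample means should sit close to the population mean, and their spread should be close to the population standard deviation divided by the square root of the sample size, even though the population itself is far from normal.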

Gather Your Data: The “Not-So-Spooky” APIs!

When Python plays with the internet files. A data analytics cycle starts with gathering and extraction. I hope my previous blog gave an idea about how data from common file formats are gathered using Python. In this blog, I’ll focus on extracting the data from files that are not so common but have the most … Read more Gather Your Data: The “Not-So-Spooky” APIs!

Crowd Counting Consortium Crowd Data and Shiny Dashboard

[This article was first published on R Views, and kindly contributed to R-bloggers]. Jay Ulfelder, PhD, serves as Program Manager for the Nonviolent Action Lab, … Read more Crowd Counting Consortium Crowd Data and Shiny Dashboard

Task 1 – Retail Strategy and Analytics

To inspect whether certain columns are in their specified format, e.g. that the date column is in date format. We saw that the date column is stored in numeric format, which is wrong, so we convert it to date format as shown below. Examine PROD_NAME: generating a summary of the PROD_NAME column:

    # head(transaction_data$PROD_NAME)
    transaction_data[, .N, PROD_NAME]

… Read more Task 1 – Retail Strategy and Analytics

Handling errors using purrr’s possibly() and safely()

One topic I haven’t discussed in my previous posts about automating tasks with loops or doing simulations is how to deal with errors. If we hit unanticipated errors, a map() or lapply() loop will come to a screeching halt with no output to show for the time spent. When your task is time-consuming, this can … Read more Handling errors using purrr’s possibly() and safely()

3 Ways to Handle Args in Python

If you want to have a proper command-line interface for your application, argparse is the module to go with. This is a full-fledged module that offers out-of-the-box argument parsing, help messages and automated error throwing when arguments get misused. This module comes pre-installed with Python. To fully utilize the functionalities provided by argparse, it … Read more 3 Ways to Handle Args in Python
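As a quick taste of the argparse module described in the excerpt above, here is a minimal sketch. The argument names (`path`, `--count`, `--verbose`) and the helper `build_parser` are hypothetical and not taken from the article; an explicit argument list is passed to `parse_args` so the snippet runs without a real command line.

```python
import argparse


def build_parser():
    """Build a small, hypothetical command-line interface."""
    parser = argparse.ArgumentParser(description="Hypothetical demo CLI.")
    parser.add_argument("path", help="input file to process")
    parser.add_argument("-n", "--count", type=int, default=1,
                        help="number of times to repeat (default: 1)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print extra information")
    return parser


# Normally parse_args() reads sys.argv; a list is passed here for illustration.
args = build_parser().parse_args(["data.csv", "--count", "3", "-v"])
print(args.path, args.count, args.verbose)  # data.csv 3 True
```

Running a script built this way with `-h` prints the generated help message, and passing a non-integer to `--count` makes argparse exit with an error instead of raising deep inside your code.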

Predicting Adoption Speed for PetFinder

Our team is currently acting as a consulting agency working on behalf of PetFinder, a non-profit organization that maintains a database of animals and aims to improve animal welfare through collaborations with related parties. The core task of this project is to predict how long it will take for a pet to be adopted. … Read more Predicting Adoption Speed for PetFinder

5 Data Science Interview Mistakes I’ve Made

Introduction Discussing the same past project Not asking enough questions Assuming interviewers know my past experiences Not considering the business impact Not overviewing the whole Data Science process Summary References I have interviewed with several companies, making some repeated, key mistakes along the way. As a result, I have learned from them and have … Read more 5 Data Science Interview Mistakes I’ve Made

Transitioning a Soil Mechanics Course to R/exams

[This article was first published on R/exams, and kindly contributed to R-bloggers]. Experiences with transitioning to R/exams for multiple automated and randomized online exams in … Read more Transitioning a Soil Mechanics Course to R/exams

How to Create the most Awesome Development Setup for Data Science using Atom

Github + Markdown + Stack Overflow + Autocomplete + Jupyter Before I even begin this article, let me just say that I love iPython Notebooks, and Atom is not an alternative to Jupyter in any way. Notebooks provide me an interface where I have to think of “Coding one code block at a time,” as … Read more How to Create the most Awesome Development Setup for Data Science using Atom

8 Determining Factors for the Selection of the Model Approach

What you need to know Corporations have defined project processes. Stage-gate or steering committee meetings, where outcomes must be presented, are part of that. Presentations have to be submitted a few days in advance and must contain certain expected information. Also, corporates are always under pressure to deliver financial results. That leads to consistently tight … Read more 8 Determining Factors for the Selection of the Model Approach

Why you are throwing money away if your cloud data warehouse doesn’t separate storage and compute

What you should consider before migrating to the cloud to make your data warehouse and data lake future-proof & how the separation of storage and compute was approached by Snowflake, Amazon, Google, SAP, and IBM Photo by John Schnobrich on Unsplash Not so long ago, establishing an enterprise data warehouse involved a project that would … Read more Why you are throwing money away if your cloud data warehouse doesn’t separate storage and compute

Exploratory Data Analysis with 1 line of Python code

Overview of Pandas-Profiling library Image by Peggy und Marco Lachmann-Anke from Pixabay Exploratory data analysis (EDA) is an approach to analyze the data and summarize its main characteristics. One spends a lot of time doing EDA to get an understanding of data. EDA involves a lot of steps including some statistical tests, visualization of data … Read more Exploratory Data Analysis with 1 line of Python code

An introduction to weather forecasting with deep learning

[This article was first published on RStudio AI Blog, and kindly contributed to R-bloggers]. Same with weekly climatology: Looking back at how warm it was, … Read more An introduction to weather forecasting with deep learning

Drawing a Map using Python and Word2vec

Word2vec is definitely the most playful concept I’ve met during my Natural Language Processing studies so far. Imagine an algorithm that can really successfully mimic understanding meanings of words and their functions in the language, that can measure the closeness of words along the lines of hundreds of different topics, that can answer more complicated … Read more Drawing a Map using Python and Word2vec

Green is the new Black: Saving Amazon Rainforests using AI!

In Amazonia, fire is associated with several land-practices. Slash-and-Burn is one of the most used practices in Brazilian agriculture (as part of a seasonal cycle called “queimada”). Whether for opening and cleaning agricultural areas or renewing pastures, its importance in the agricultural chain is undeniable. Unfortunately, this is often the cause of wildfires in forests. … Read more Green is the new Black: Saving Amazon Rainforests using AI!

Collinearity Measures

Another approach to identifying multicollinearity is via the Variance Inflation Factor. VIF indicates the factor by which the variance of each variable’s coefficient is inflated by collinearity. Beginning at a value of 1 (no collinearity), a VIF between 1 and 5 indicates moderate collinearity while values above 5 indicate high collinearity. Some cases where high VIF would be acceptable include the … Read more Collinearity Measures
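To make the Variance Inflation Factor from the excerpt above concrete, here is a minimal sketch for the simplest case of two predictors. In general the VIF of predictor j is 1 / (1 - R_j²), where R_j² comes from regressing predictor j on all the other predictors; with only two predictors, R_j² reduces to the squared Pearson correlation between them. The helper names and the toy data are assumptions made for this example, not the article's code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical toy predictors.
x1 = [1, 2, 3, 4, 5]
uncorrelated = [2, -1, -2, -1, 2]        # zero correlation with x1
nearly_copy = [2.1, 3.9, 6.2, 7.8, 10.1]  # almost a linear function of x1

print(vif_two_predictors(x1, uncorrelated))     # 1.0, no collinearity
print(vif_two_predictors(x1, nearly_copy) > 5)  # True, high collinearity
```

With more than two predictors the same formula applies, but R_j² has to come from a full auxiliary regression, for which a statistics library is the more practical choice.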