Animations in the time of Coronavirus

The first four months of 2020 have been dominated by the Coronavirus pandemic (aka COVID-19), which has transformed global life in an unprecedented way. Societies and economies struggle to adapt to the new conditions and necessary constraints. A reassuringly large fraction of governments around the world continue to take evidence-based approaches to this crisis that … Read more

Talking about Data Science Topics with Business-Minded Executives

“Our company is using cutting edge machine learning technology.” Machine Learning vs. Data Analysis or Statistics From Dan Shewan on One big difference between data analysis and machine learning is the questions they seek to answer. In data analysis, you want to know something like what happened to sales at this point in time … Read more

Rendering your README with GitHub Actions

There’s one thing that has bugged me for a while about developing R packages. We have all these nice, modern tools for tracking our code, producing web sites from the roxygen documentation, and so on. Yet for every code commit I make to the master branch of a package repo, there’s often two … Read more

Amazon EKS Improves Cluster Creation and Management in the AWS Console

The new console design includes a wizard that simplifies cluster creation and provides additional information about cluster components. Additionally, the cluster management page has been updated with a tabbed layout, which makes it easier to understand and modify cluster components. See the new console design here.

Targeting Users In Specific Area Using Geofence API

MyReminderRepository.kt You can think of this class as a local server, which means it serves us everything we need to perform any action on the map, such as adding a reminder, removing a reminder, and maintaining all reminders. The following code snippet creates an object of GeofencingClient which helps you to manipulate the geofences. private … Read more

Data preprocessing for Machine Learning in Python

Data preprocessing is a crucial step in machine learning and it is very important for the accuracy of the model. Data often contains noise and missing values, may be incomplete, and is sometimes in a format that cannot be used directly by machine learning models. But what if we use questionable and dirty data? What … Read more

Automated Programmatic Website Screenshots in R with {webshot} [Video Tutorial]

In this video tutorial, we explore the R package {webshot} by Winston Chang. This package internally uses PhantomJS to capture screenshots of web pages, Shiny applications, and R Markdown documents. {webshot} also lets you take a screenshot of a particular viewport or of a section of a website selected by a CSS selector. Please … Read more

AI vs COVID-19. Does it really work?

Figuring out what we can do with the data available and what we can’t. Photo by Alissa Eckert, MS, and Dan Higgins, MAMS, on PHIL. Contents: 1. Introduction 2. Making our Chest X-ray COVID-19 classifier 2.1. Data preparation 2.2. Training 2.3. Results 3. Does it really work? 3.1. Further analysis 3.2. Discussion and takeaways. Today everyone knows about the pandemic. Professionals do their … Read more

Discover, understand and manage your data with Data Catalog, now GA

Technical metadata vs. business metadata: Technical metadata refers to metadata that is available in the source system. Technical metadata for a BigQuery table includes table name, table description, column names, column types, column descriptions, creation date, last modification date, and more. For Pub/Sub, technical metadata refers to Pub/Sub topic names and date created. For Cloud Storage … Read more

Which Technology Should I Learn?

Knowing where to start can be challenging, but we’re here to help. Read on to learn more about where to begin on your data science and analytics journey. Data science and analytics languages If you’re new to data science and analytics, or your organization is, you’ll need to pick a language to analyze your data … Read more

Churn Prediction: A Case study of Sparkify using Apache Spark

Null Elements: Since some users will not have values for certain features, they were captured as NULL. In practice, these null elements are zero values, because the features are aggregates. Hence, these null elements were replaced with zeros for all columns. Scaling: The range of values for all the features in the dataset show quite … Read more
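The null-replacement step described in this excerpt can be illustrated with a minimal plain-Python sketch. It is not the article's Spark code: the function name `fill_nulls_with_zero` and the feature names `songs_played` and `thumbs_up` are hypothetical, chosen only for illustration.

```python
def fill_nulls_with_zero(rows):
    """Return a copy of rows with every None value replaced by 0.

    Assumes each feature is an aggregate (for example a count), so a
    missing value means "no events" and can safely be read as zero.
    """
    return [
        {name: (0 if value is None else value) for name, value in row.items()}
        for row in rows
    ]


# Hypothetical per-user aggregates; None marks a user with no such events.
users = [
    {"songs_played": 120, "thumbs_up": None},
    {"songs_played": None, "thumbs_up": 4},
]

cleaned = fill_nulls_with_zero(users)
print(cleaned)
```

The original rows are left untouched, which makes it easy to compare the data before and after cleaning.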

Z is for Additional Axes

Here we are at the last post in Blogging A to Z!

How to learn machine learning and improve your health at the same time

Notice how what we’ve gone through can be applied to learning almost anything. The most important takeaways being instead of focusing on what’s right at any given moment (impossible to predict), you’re concentrating on the trend. You’re building the habit of learning (using courses as a foundation) along with the habit of creating (building your … Read more

Using analytics to drive informed intuition

I truly believe that a Data and Analytics function has the mandate to enable better decision making within an organization. Very few practitioners would disagree with this argument; however, for us to truly drive our vision, it is important to understand how people make decisions. The human decision-making process is ambiguous, to the extent of … Read more

Azure Container Registry: Mitigating data exfiltration with dedicated data endpoints

Azure Container Registry announces dedicated data endpoints, enabling tightly scoped client firewall rules to specific registries, minimizing data exfiltration concerns. Pulling content from a registry involves two endpoints: Registry endpoint, often referred to as the login URL, used for authentication and content discovery. A command like docker pull makes a REST request which authenticates and … Read more

Cross Region Restore (CRR) for Azure Virtual Machines using Azure Backup

Today we’re introducing the preview of Cross Region Restore (CRR) for Microsoft Azure Virtual Machines (VMs) support using Microsoft Azure Backup. Azure Backup uses Recovery Services vault to hold customers’ backup data which offers both local and geographic redundancy. To ensure high availability of backed up data, Azure Backup defaults storage settings to geo-redundancy. By … Read more

How to Consume News More Intelligently Using Bayes’ theorem

Base rates, marginal probabilities, sensitivity, and specificity Photo by Markus Spiske on Unsplash When it comes to updating beliefs and making decisions under uncertainty, Bayes’ theorem is just about the best tool available. And yet it is so often relegated to academic textbooks and machine learning applications when it should be bringing us value in … Read more
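As a rough illustration of how the four quantities named in this excerpt fit together, here is a minimal sketch of Bayes' theorem in Python. The function name `posterior` and the numbers are assumptions made up for the example, not figures from the article.

```python
def posterior(base_rate, sensitivity, specificity):
    """Probability that a claim is true given a positive report.

    base_rate:   prior probability that the claim is true
    sensitivity: P(positive report | claim is true)
    specificity: P(negative report | claim is false)
    """
    true_positive = sensitivity * base_rate
    false_positive = (1.0 - specificity) * (1.0 - base_rate)
    # The marginal probability of seeing a positive report at all.
    marginal = true_positive + false_positive
    return true_positive / marginal


# Illustrative values only: a rare event reported by a fairly reliable source.
p = posterior(base_rate=0.01, sensitivity=0.9, specificity=0.95)
print(round(p, 3))  # 0.154
```

Even with a fairly reliable source, the low base rate keeps the posterior near 15%, which is why the base rate is worth checking first.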

Looking Beyond Feature Importance

Read in and split the Data For this analysis, I’ll be doing a random forest regression using the Boston Housing Dataset in the scikit-learn package. There are 13 features in the Boston Housing Dataset, you can read about them here. After we do some preliminary feature selection I’ll break down what the more important features … Read more


Or which forecast accuracy metrics to use? Source: Many CPG brands across the world would be focusing on keeping a tab on their sales and demand numbers during the Covid-19 pandemic. In my previous article, I had covered points on doing Marketing Mix modeling during these testing times. The brands might have already forecasted … Read more

Tutorial: Poisson regression with CatBoost

How to use Poisson regression and CatBoost to achieve better accuracy on count-based data… and predict the number of likes that a tweet gets. The concept of count-based data What is Poisson regression and why it is suitable for count-based data How to build a Poisson regression model with CatBoost package How to predict the … Read more

Deploying Deep Learning Models using TensorFlow Serving with Docker and Flask

Generally, the life-cycle of any data science project is comprised of defining the problem statement, collecting and pre-processing data, followed by data analysis and predictive modelling, but the trickiest part of any data science project is the model deployment where we want our model to be consumed by the end users. There are a lot … Read more

Deploying Panel (Holoviz) dashboards using Heroku Container Registry

Deployment Now comes the crux of this post. This was the part of my journey where I couldn’t find much direct help on the internet. The Panel library (part of Holoviz) provides an excellent toolkit for managing data interactions, setting up pipelines, using widgets and deploying dynamic dashboards. Deployment is as easy as marking your … Read more

Fighting COVID-19 with Open Access and AI

The CORD-19 resource attempts to accelerate scientific discovery and save lives. The network of diseases and chemicals associated with Chloroquine, an example of the kinds of insights that can be extracted from CORD-19 (this visualization was produced with the CoViz tool from AI2). The urgent phone call from Michael Kratsios (whose august title … Read more

A Practical Guide for Exploratory Data Analysis

Listen to the data, curiously and carefully! Photo by Emma Frances Logan on Unsplash. The fuel of each and every machine learning or deep learning model is data. Without data, the models are useless. Before building and training a model, we should try to explore and understand the data at hand. By understanding, I … Read more

Highlights of Hugo Code Highlighting

Thanks to a quite overdue update of Hugo on our build system, our website can now harness the full power of Hugo code highlighting for Markdown-based content. What’s code highlighting apart from the reason behind a tongue-twister in this post title? In this post we shall explain how Hugo’s code highlighter, Chroma, helps you prettify your code … Read more

Expert opinion (again)

This is the second video I was mentioning here —

Vignette: Simulating a minimal SPSS dataset from R

What this is about 📖 I will simulate a minimal labelled survey dataset that can be exported as a SPSS (.SAV) file (with full variable and value labels) in R. I will also attempt to fabricate ‘meaningful patterns’ to the dataset such that it can be more effectively used for creating demo examples. image from … Read more

Address class imbalance easily with Pytorch

Data augmentation in computer vision. Credits for the picture to fastai. What can you do when your model is overfitting your data? This problem often occurs when we are dealing with an imbalanced dataset. If your dataset represents several classes, one of which is much less represented than the others, then it is difficult to … Read more

90 second setup challenge: Jupyter + TensorFlow in Google Cloud

Is it possible for data science beginners to get up and running in under 2 minutes? Data science enthusiasts, how fast can you go from zero to Google Cloud Jupyter notebook? Let’s find out! Image: SOURCE. If you’re in the mood to ultra-customize your setup, Google Cloud gives you dizzying granularity. That’s a fabulous thing … Read more

Nina and John Speaking at Why R? Webinar Thursday, May 7, 2020

Nina Zumel and John Mount will be speaking on advanced

Voice Classification with Python

The idea for this project came from trying to transcribe meetings with an app that could differentiate voices and register who said what. For this usage, 115 people might be excessive. Meetings do not usually have that many people speaking. I felt that 10 was a more realistic number. However, I wanted to be broader … Read more

Which Face is Real?

A generative model aims to learn and understand a dataset’s true distribution and create new data from it using unsupervised learning. These models (such as StyleGAN) have had mixed success, as it is quite difficult to understand the complexities of certain probability distributions. In order to sidestep these roadblocks, the Adversarial Nets Framework was created … Read more

How to prepare for AI

At the level of company-owned ANIs, an individual could best prepare themselves by being aware of how their data is collected (eg: cookies when you visit websites), used (eg: targeted advertising), stored (eg: Snapchat stores data on its servers), and biased at various levels (eg: biased training data, resulting in an inability for the service … Read more

What Has Changed?

SOLUTIONS FOR MICROSOFT POWER PLATFORM A step-by-step guide on the continuous delivery of AI Models, Power Apps, and Flows with Microsoft’s Power Platform and Azure DevOps. Photo by jesse ramirez on Unsplash In one of my recent stories, I’ve explained how to create a no-code AI prediction model using the Microsoft Power Platform to forecast … Read more

Causal Inference cheat sheet for data scientists

Being able to make causal claims is a key business

You Need ModelOps To Scale

As companies, particularly large organizations, scale up their models as a part of building an enterprise-wide pipeline, there’s an increasing need to operationalize the model development process. Similar to DevOps, models need to be developed, integrated, deployed and monitored. Often, with Enterprise AI initiatives, there are a host of governance considerations such as data integrity, … Read more

Optimize Dataproc costs using VM machine type

Dataproc is a fast, easy-to-use, fully managed cloud service for running managed open source, such as Apache Spark, Apache Presto, and Apache Hadoop clusters, in a simpler, more cost-efficient way. We hear that enterprises are migrating their big data workloads to the cloud to gain cost advantages with per-second pricing, idle cluster deletion, autoscaling, and … Read more

Interactive COVID-19 visualizations using Plotly with 4 lines of code

Doing cool things with data! In this age of technology, data is the new oil. Organizations all over the world are transforming their environments, processes and infrastructures to become more data-driven. A major reason is that data analytics and machine learning give organizations visibility into how to run their business better. The push to remote … Read more

Y is for scale_y

Yesterday, I talked about scale_x. Today, I'll continue on that topic, focusing

