Understanding Precision, Recall, and Accuracy with COVID tests

Working with imbalanced datasets and understanding the confusion matrix through a simulated example of COVID-19 tests. While revisiting my knowledge of basic statistics recently, I started delving deeper into the idea of a confusion matrix, this time with an example at hand that has, unfortunately, become very relevant over the last 4 months …
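
As a rough illustration of the three metrics named in the title, here is a minimal sketch that computes them from a confusion matrix. The counts and variable names are hypothetical, chosen only to mimic an imbalanced testing scenario; they do not come from the article.

```python
# Hypothetical confusion-matrix counts for 1,000 simulated tests (illustrative only).
true_positives = 90    # infected, test says positive
false_negatives = 10   # infected, test says negative
false_positives = 50   # healthy, test says positive
true_negatives = 850   # healthy, test says negative

total = true_positives + false_negatives + false_positives + true_negatives

# Precision: of all positive results, how many are truly infected?
precision = true_positives / (true_positives + false_positives)

# Recall: of all infected people, how many did the test catch?
recall = true_positives / (true_positives + false_negatives)

# Accuracy: share of all results that are correct.
accuracy = (true_positives + true_negatives) / total

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

With these made-up counts, accuracy is 0.94 even though more than a third of the positive results are false alarms, which is why accuracy alone can mislead on imbalanced data.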

Demystifying the Binomial Distribution

The objective of this article is to introduce some of the important statistical concepts behind the binomial distribution and to illustrate one of its primary applications in epidemiology and healthcare data science. Bernoulli trial: Before we jump into binomial distributions, it is important to understand Bernoulli trials. A Bernoulli trial is a term that is used …

Artificial Intelligence for Internal Audit and Risk Management

AUDITMAP.AI TEXT ANALYSIS PLATFORM: Dragging Assessments Into the Modern Era. Table of Contents: 1. Abstract: Why AI for Internal Audit and Risk Management? 2. Introduction 3. Contemporary Internal Audit Challenges 4. AuditMap.ai: A Platform for Audit Enhancement 5. Limitations and the Way Forward 6. References. 1. Abstract: Why AI for Internal Audit and Risk Management? …

Building an Investing Model with Python

Building a model based on financial ratios with Python. In this post, we are going to build an investment model to find attractive stocks based on financial ratios using Python. We will screen all technology-related stocks on the Nasdaq exchange. Then, we will get the main financial ratios for each stock. Finally, based …

Text Classification: Supervised & Unsupervised Learning Approaches

Model Deployment: Heeding instructors’ encouragement to try model deployment, I jumped right into it. Model deployment can be an exciting venture and a hairy business at the same time, especially for a beginner like myself. For model deployment, I used Heroku. It is a Platform as a Service (PaaS) that enables developers to build, run, …

Why You Should Wrap Decorators in Python

The Problems: During a debugging process, we sometimes need to inspect particular objects to understand the implementation details better. Let’s consider the following inspection of the above-defined decorated function. String Representation of Decorated Function: As you can see, it doesn’t really tell us what it is. Instead, it’s telling us that this function is the …
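
The article's own code is not part of this excerpt, so the following is only a minimal sketch of the problem it describes, using `functools.wraps` from the standard library as one common way to wrap a decorator. All function names here are hypothetical.

```python
import functools


def plain_decorator(func):
    """Decorator that does NOT preserve the metadata of the decorated function."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def wrapped_decorator(func):
    """Decorator that copies name, docstring, etc. from func onto wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain_decorator
def greet_plain(name):
    """Return a greeting."""
    return f"Hello, {name}"


@wrapped_decorator
def greet_wrapped(name):
    """Return a greeting."""
    return f"Hello, {name}"


# Inspecting the two functions shows the difference.
print(greet_plain.__name__, greet_plain.__doc__)
print(greet_wrapped.__name__, greet_wrapped.__doc__)
```

Without `functools.wraps`, the decorated function reports the inner function's name and loses its docstring; with it, the original metadata is preserved, which makes inspection during debugging more informative.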

Uncanny X-Men: Bayesian take on Dr. Silge’s analysis

[This article was first published on Posts | Joshua Cook, and kindly contributed to R-bloggers]. The other day, Dr. Silge from RStudio posted the screencast and blog post …

Take the Pain Out of Data Cleaning for Machine Learning

…with these four Python libraries. According to Anaconda’s recent 2020 State of Data Science survey (a great read, by the way), 66% of data scientists’ time is still spent on data loading, cleaning and visualisation. Real-world data sets are invariably untidy, containing missing and …

Understanding the Importance of First Serve in Tennis with Data Analysis

Can we judge the performance of a tennis player based on his first serve? Tennis is a dynamic and complex sport. There are several shots involved in a single point, but only one of them is played without the opponent’s influence: the serve. Indeed, the serve gives players the chance …

Fake news detector with deep learning approach (Part-I) EDA

Exploratory Data Analysis for Text Data. In this series of articles, I would like to show how we can use a deep learning algorithm for fake news detection and compare some neural network architectures. This is the first part of the series, in which I would like to perform exploratory data analysis on text. The …

How to implement Prioritized Experience Replay for a Deep Q-Network

Great, we are now sure that our approach is valid. Let’s dig into the details of the implementation. We will focus on the class `ReplayBuffer` as it contains most of the implementation related to the Prioritized Experience Replay, but the rest of the code is available on GitHub. The goal that we will set is …

Deep Learning in Healthcare — X-Ray Imaging (Part 4-The Class Imbalance problem)

import numpy as np
import pandas as pd
import cv2 as cv
import matplotlib.pyplot as plt
import os
import random
from sklearn.model_selection import train_test_split

We have seen all of these libraries before, except sklearn. sklearn: Scikit-learn (also known as sklearn) is a machine learning library for Python. It contains many well-known machine learning algorithms such as classification, regression, support vector machines, …

Model with TensorFlow and Serve on Google Cloud Platform

A practical guide to serving TensorFlow models on a scalable cloud platform. In this guide, we learn how to develop a TensorFlow model and serve it on the Google Cloud Platform (GCP). We consider a regression problem of predicting the earnings of products using a three-layer neural network implemented with TensorFlow and Keras APIs. The key …

Creating smart ETL data pipelines in python for financial and economic data

The first step is to download and install SQLite on your local machine. The next step is to create a database. You can create it from the command line using SQLite commands or via SQLiteStudio, which will serve as the database management tool. Building Blocks: Project imports

import pandas as pd
import sqlite, …

My 10 favorite resources for learning data science online

These websites will help you keep up to date with the latest trends in data science. I think you will not argue with me when I state that data science is becoming one of the most popular fields to work in, especially given that Harvard Business Review named “data …

What is a Full Stack Data Scientist?

The scope of the role and the skills required. A full-stack data scientist is a jack-of-all-trades who engineers and works on each stage in the data science lifecycle, from beginning to end. The scope of a full-stack data scientist covers every component of a data science business initiative, from …

Data Science and Machine Learning with Scala and Spark (Episode 02/03)

SCALA SPARK MACHINE LEARNING: Spark with Scala API. Spark’s inventors chose Scala to write the low-level modules. In Data Science and Machine Learning with Scala and Spark (Episode 01/03), we covered the basics of the Scala programming language while using a Google Colab environment. In this article, we learn about the Spark ecosystem and its higher-level …

Raspberry Pi: Tutorial on hosting a Jupyter Notebook that you can access anywhere

This series is mainly about setting up your Raspberry Pi with a Jupyter Notebook server that you can access from anywhere with an internet connection, and this is the final chapter. In the previous sessions, we covered how to set up port forwarding or a cloud proxy server in order for you to connect to …

Labeling Data with Pandas

Data labeling is the process of assigning informative tags to subsets of data. There are many examples of labeled data sets. Data containing x-ray images of cancerous and healthy lungs along with their respective tags is an example of labeled data. Another example is consumer credit data that specifies whether or not a consumer has …

The Multi-Channel Neural Network

Neural Networks are widely used across multiple domains, such as Computer Vision, Audio Classification, Natural Language Processing, etc. In most cases, they are considered in each of these domains individually. However, in real-life settings, it is rarely the case that this is the optimal configuration. It is much more common to have multiple channels, meaning …

Rcpp now used by 2000 CRAN packages–and one in eight!

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers]. As of yesterday, Rcpp stands at exactly 2000 reverse-dependencies …

How Floating Point Numbers Work

With Applications to Deep Learning and Digital Photography. It is a pesky fact that computers work in binary approximations while humans tend to think in terms of exact values. This is why, in your high school physics class, you may have experienced “rounding error” when computing intermediate numerical values in your solutions and why, if …

5 Python Code Smells You Should Be Wary Of

Firstly, let’s put it straight: loops aren’t bad. But when you’re applying transformations inside them, they can lead to long, bloated conditional code. In such cases, it’s important not to ignore functions like map(), filter() and reduce() (the last of which lives in functools in Python 3) that are already at our disposal. More importantly, Python provides list comprehensions, which are easily the most …
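
To make the contrast concrete, here is a small sketch under assumed inputs: the same transformation (squaring the even numbers in a list) written as an explicit loop, as a list comprehension, and with `map()`/`filter()`, followed by a `functools.reduce()` sum. The data and variable names are hypothetical.

```python
from functools import reduce  # reduce is not a built-in in Python 3

numbers = [1, 2, 3, 4, 5, 6]  # hypothetical input

# Explicit loop with a condition and a transformation inside it.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

# The same logic as a list comprehension.
squares_comp = [n * n for n in numbers if n % 2 == 0]

# The same logic with filter() and map().
squares_map = list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, numbers)))

# reduce() folds a sequence into a single value (here, a sum).
total = reduce(lambda acc, n: acc + n, squares_comp, 0)

print(squares_loop, squares_comp, squares_map, total)
```

All three approaches produce the same list; the comprehension keeps the filter and the transformation on one readable line.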

Machine Learning Basics: Polynomial Regression

As we can see, linear regression always leaves some error, however hard it tries to fit the data. On the other hand, the polynomial regression curve fits the data points more accurately. In this example, we will go through the implementation of Polynomial Regression, in which …

Matching: Koalas On Fire — Part 2

What is the optimal matching ratio: 1:1 matching or 1:multiple matching? Check out Part 1, where we tell you all about the tool called (multiple) matching and introduce our example dataset: koalas! TL;DR: Matching is supposedly a gold-nugget tool for causal inference. We agree it’s wonderful, but most …

Quantum parallelism — where quantum computers get their mojo from

How quantum computers harness quantum superposition to execute many computational paths simultaneously. Quantum computers were proposed in the 1980s. Since then, physicists have been laboriously working to harness the power of nature to meet computing demands. There is no single best method of physically realising a quantum computer; the field is fragmented into several competing …

Your Ultimate Data Science Statistics & Mathematics Cheat Sheet

4 main data science statistical measures. Correlation is a statistical measure of how well two variables fluctuate together. Positive correlations mean that two variables fluctuate together (a positive change in one accompanies a positive change in the other), whereas negative correlations mean that two variables change in opposite directions (a positive change in one is a …
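
As a minimal sketch of the idea, the Pearson correlation coefficient can be computed in plain Python. The function name `pearson` and the sample data are hypothetical, chosen so that one pair of variables moves together perfectly and the other moves in exactly opposite directions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: `rising` increases with x, `falling` decreases with x.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

print(pearson(x, rising))   # close to +1: the variables move together
print(pearson(x, falling))  # close to -1: the variables move in opposite directions
```

Values near +1 indicate a strong positive correlation, values near -1 a strong negative correlation, and values near 0 little linear relationship.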

Basic AI Algorithms

Search Algorithms for the Traveling Salesman Problem. To solve a problem with a computer, it is necessary to represent the problem in numerical or symbolic form and offer a specific procedure using a programming language. However, when working on problem-solving in the artificial intelligence (AI) field, it is difficult to specify a formulation of a problem from …

Exploring and plotting positional ice hockey data on goals, penalties and more from R with the {nhlapi} package

[This article was first published on Jozef’s Rblog, and kindly contributed to R-bloggers]. The National Hockey League (NHL) is considered to be the premier professional …

A Data Scientist Approach: Running Postgres SQL using Docker

A Docker application for data scientists: a Docker container for Postgres. In this short tutorial, I explain the steps to set up a local PostgreSQL instance running in Docker and use Python to interact with the database. The following steps set up the process: 1. Setting up Docker for PostgreSQL 2. Connecting …

Interesting AI/ML Articles You Should Read This Week (July 4)

One of the most interesting articles I read this week. Luca Rossi has written a piece that will send most readers down the path of self and environmental awareness. After reading this article, I found myself questioning the impact of my actions and contributions that can lead to the imaginary worlds created within this …

A taste of ACL2020: 6 new Datasets & Benchmarks

Datasets and benchmarks are at the core of progress in Natural Language Understanding (NLU): in leaderboard-driven research, progress is upper-bounded by the quality of our evaluations. While datasets for Machine Learning used to last (MNIST, for example, didn’t reach human performance until more than a decade after it was introduced), the latest benchmarks for …

The Problem with Data Science Competition Platforms like Kaggle

Do Kaggle and other data science competitions violate Open Source? A case for open idea transfer and collaboration. As any data scientist probably knows, there exists a vast world of predictive modelling competitions on the internet, the most well-known likely being those from Kaggle. Some of these competitions are incentivized financially, …

Day 111 of #NLP365: NLP Papers Summary — The Risk of Racial Bias in Hate Speech Detection

The paper investigates how racial bias has been introduced by annotators into datasets for hate speech detection, increasing the harm against minority races, and proposes a method of priming annotators on dialect and race to reduce racial bias in annotation. The contributions of the paper are as follows: Found unexpected correlation between surface markers of African American English …

Beginner’s Guide to SQL: Disney Princess Edition

I recently finished a SQL course on Coursera and was looking for project ideas when I came across Amanda’s post. I reached out to her and she suggested that I read Okoh Anita’s post on Basic SQL, which had inspired her. Undoubtedly, I too was inspired by how she described simple queries using a very …

NGBoost algorithm: solving probabilistic prediction problems

Predict a distribution of the target variable, not just a point estimate. While looking through the ICML 2020 accepted papers, I found an interesting paper, “NGBoost: Natural Gradient Boosting for Probabilistic Prediction” (arxiv.org): “We present Natural Gradient Boosting (NGBoost), an algorithm for generic probabilistic prediction via gradient…” You may ask, …

This Model is for the Birds

In the interest of contributing to research in fine-grained vision classification, the Cornell Lab of Ornithology has released the NABirds data set consisting of 48,562 images of 404 bird species. Many of these species are further subdivided into categories such as Male/Female, Adult/Juvenile, Breeding/Non-breeding, which leads to 555 total classes. My goal is to determine …

Improving the Performance of Machine Learning Model using Bagging

Understand the working of Bootstrap Aggregation (Bagging) ensemble learning and implement a Random Forest Bagging model using the sklearn library. The performance of a machine learning model tells us how the model performs for unseen data points. There are various strategies and hacks to improve the performance of an ML …

The Sound of Places — Visualizing audio with R

Using R and soundgen to visualize the spectrogram and loudness of places I’ve visited. At the risk of sounding like a broken record (that’s an audio pun right there), I have to start this piece by saying that for the last year I’ve been backpacking. During this adventure, I’ve seen wonderful places, tasted extravagant flavors, …

Streaming real-time data into Snowflake with Amazon Kinesis Firehose

Businesses today can benefit in real-time from the data they continuously generate at massive scale and speed from various data sources. Whether it is clickstream data from websites, telemetry data from IoT devices or log data from applications, continuously analysing that data can help businesses learn what their customers, …

Medium writers you should follow as an aspiring Data Scientist

Introduction: This is my list of 10 Data Science writers/influencers that I follow. And yes, I follow and read the work of other people as well, but if I had to narrow it down to the top 10 only, this would be the list. I am sharing this because I believe that all aspiring Data Scientists would …

Eigenvalues and Eigenvectors

Computing and Visualizing. Animation available here: https://bdshaff.github.io/bdshaff.github.io/blog/2020-03-23-computing-eigenvalues-and-eigenvectors/ First, I’ll talk about what made me curious about how eigenvalues are actually computed. Then I’ll share my implementation of the simplest algorithm (QR method) and do some benchmarking. Lastly, I’ll share how you can animate the steps the algorithm goes through to find the eigenvectors! Eigenvalues and …

Natural Language Processing Pipeline

If we were asked to build an NLP application, think about how we would approach doing so at an organization. We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them. Since language processing is involved, we would also list all …

Ensemble Methods: Comparing Scikit Learn’s Voting Classifier to The Stacking Classifier

Using the Titanic dataset to compare scikit-learn’s voting classifier and stacking classifier. Two heads, they say, are better than one. In many machine learning projects, we want to harness the power of synergy using ensemble methods. The voting and stacking classifiers bring us …

Job Fairs Are Now Going Virtual

WomenHack Networking Event Goes Virtual. Attending networking events in person once required careful planning: printing out business cards and resumes, dressing to impress, and allowing enough time to get to and from the event. With many tech companies continuing to hire during COVID-19, these companies are finding ways to recruit employees …