Deep Learning in Healthcare — X-Ray Imaging (Part 4-The Class Imbalance problem)

import numpy as np
import pandas as pd
import cv2 as cv
import matplotlib.pyplot as plt
import os
import random
from sklearn.model_selection import train_test_split

We have seen all the libraries before, except sklearn. Scikit-learn (also known as sklearn) is a machine learning library for Python. It contains many well-known machine learning algorithms, such as classification, regression, support vector machines, … Read more Deep Learning in Healthcare — X-Ray Imaging (Part 4-The Class Imbalance problem)

Model with TensorFlow and Serve on Google Cloud Platform

A Practical Guide. Serving TensorFlow Models on a scalable cloud platform. In this guide, we learn how to develop a TensorFlow model and serve it on the Google Cloud Platform (GCP). We consider a regression problem of predicting the earnings of products using a three-layer neural network implemented with TensorFlow and Keras APIs. The key … Read more Model with TensorFlow and Serve on Google Cloud Platform

Creating smart ETL data pipelines in python for financial and economic data

The first step is to download and install SQLite on your local machine. The next step is to create a database. You can create it either from the command line using SQLite commands or via SQLiteStudio, which will serve as the database management tool. Building Blocks: Project imports: import pandas as pd; import sqlite, … Read more Creating smart ETL data pipelines in python for financial and economic data

My 10 favorite resources for learning data science online

Photo by Ivo Rainha on Unsplash. These websites will help you keep up to date with the latest trends in data science. I think you will not argue with me when I state that data science is becoming one of the most popular fields to work in, especially given that Harvard Business Review named “data … Read more My 10 favorite resources for learning data science online

What is a Full Stack Data Scientist?

The scope of the role and skills required. Photo by freestocks on Unsplash. A full-stack data scientist is a jack-of-all-trades who engineers and works on each stage in the data science lifecycle, from beginning to end. The scope of a full-stack data scientist covers every component of a data science business initiative, from … Read more What is a Full Stack Data Scientist?

Data Science and Machine Learning with Scala and Spark (Episode 02/03)

SCALA SPARK MACHINE LEARNING Spark with Scala API Spark’s inventors chose Scala to write the low-level modules. In Data Science and Machine Learning with Scala and Spark (Episode 01/03), we covered the basics of Scala programming language while using a Google Colab environment. In this article, we learn about the Spark ecosystem and its higher-level … Read more Data Science and Machine Learning with Scala and Spark (Episode 02/03)

Raspberry Pi: Tutorial on hosting a Jupyter Notebook that you can access anywhere

This series is mainly about setting up your Raspberry Pi with a Jupyter Notebook server that you can access anywhere with open internet. And we are at the final chapter. In the previous sessions, we have covered how to set up port forwarding or a cloud proxy server in order for you to connect to … Read more Raspberry Pi: Tutorial on hosting a Jupyter Notebook that you can access anywhere

Labeling Data with Pandas

Data labeling is the process of assigning informative tags to subsets of data. There are many examples of labeled data sets. Data containing x-ray images of cancerous and healthy lungs along with their respective tags is an example of labeled data. Another example is consumer credit data that specifies whether or not a consumer has … Read more Labeling Data with Pandas

The Multi-Channel Neural Network

Neural Networks are widely used across multiple domains, such as Computer Vision, Audio Classification, Natural Language Processing, etc. In most cases, they are considered in each of these domains individually. However, in real-life settings, it is rarely the case that this is the optimal configuration. It is much more common to have multiple channels, meaning … Read more The Multi-Channel Neural Network

How Floating Point Numbers Work

With Applications to Deep Learning and Digital Photography. It is a pesky fact that computers work in binary approximations while humans tend to think in terms of exact values. This is why, in your high school physics class, you may have experienced “rounding error” when computing intermediate numerical values in your solutions and why, if … Read more How Floating Point Numbers Work

5 Python Code Smells You Should Be Wary Of

Firstly, let’s put it straight: loops aren’t bad. But when you’re applying transformations inside them, it can lead to long, bloated conditional code. In such cases, it’s important not to ignore built-in functions like map() and filter(), along with functools.reduce(), that are already at our disposal. More importantly, Python provides list comprehensions, which are easily the most … Read more 5 Python Code Smells You Should Be Wary Of
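
To make the point in the excerpt above concrete, here is a small sketch that writes the same transformation three ways: as an explicit loop, with map() and filter(), and as a list comprehension. The sample list and all function names are made up for illustration and do not come from the original article.

```python
from functools import reduce


def squares_of_evens_loop(numbers):
    """Explicit loop: the condition and the transformation sit inside the loop body."""
    result = []
    for n in numbers:
        if n % 2 == 0:
            result.append(n * n)
    return result


def squares_of_evens_functional(numbers):
    """Same logic with filter() (keep evens) and map() (square them)."""
    return list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, numbers)))


def squares_of_evens_comprehension(numbers):
    """Same logic as a list comprehension."""
    return [n * n for n in numbers if n % 2 == 0]


def total(numbers):
    """Sum a list with functools.reduce (in Python 3, reduce is not a builtin)."""
    return reduce(lambda acc, n: acc + n, numbers, 0)


numbers = [1, 2, 3, 4, 5, 6]  # hypothetical sample data
print(squares_of_evens_comprehension(numbers))  # [4, 16, 36]
```

All three functions return the same list; which one reads best is partly a matter of taste, although the comprehension keeps the filter and the transformation on a single line.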

Machine Learning Basics: Polynomial Regression

As we can see, linear regression always leaves some error, however hard it tries to fit the data. On the other hand, the polynomial regression curve fits the data points more accurately. In this example, we will go through the implementation of Polynomial Regression, in which … Read more Machine Learning Basics: Polynomial Regression

Matching: Koalas On Fire — Part 2

What is the optimal matching ratio? 1:1 matching or 1:multiple matching? Check out Part 1 here. We tell you all about the tool called (multiple) matching and introduce our example dataset — koalas! Image by David Clode on Unsplash TL;DR: Matching is supposedly a gold-nugget tool for causal inference. We agree it’s wonderful, but most … Read more Matching: Koalas On Fire — Part 2

Quantum parallelism — where quantum computers get their mojo from

How quantum computers harness quantum superposition to execute many computational paths simultaneously. Quantum computers were proposed in the 1980s. Since then, physicists have been laboriously working to harness the power of nature to meet computing demands. There is no single best method of physically realising a quantum computer; the field is fragmented into several competing … Read more Quantum parallelism — where quantum computers get their mojo from

Your Ultimate Data Science Statistics & Mathematics Cheat Sheet

4 main data science statistical measures. Correlation is a statistical measure of how closely two variables fluctuate together. A positive correlation means that two variables move in the same direction (a positive change in one comes with a positive change in the other), whereas a negative correlation means that two variables change in opposite directions (a positive change in one is a … Read more Your Ultimate Data Science Statistics & Mathematics Cheat Sheet
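
As a rough illustration of the definition above, the sketch below computes a correlation coefficient by hand. The excerpt does not say which coefficient it has in mind, so this assumes the Pearson coefficient; the function name and the sample data are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed measure)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]        # hypothetical sample data
rising = [2, 4, 6, 8, 10]      # moves with hours
falling = [10, 8, 6, 4, 2]     # moves against hours

print(pearson_correlation(hours, rising))   # close to +1: positive correlation
print(pearson_correlation(hours, falling))  # close to -1: negative correlation
```

A value near +1 means the two variables rise together, a value near -1 means one rises as the other falls, and a value near 0 means there is little linear relationship.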

Basic AI Algorithms

Search Algorithms for Traveling Salesman Problem. To solve a problem with a computer, it is necessary to represent the problem in numerical or symbolic form and offer a specific procedure using a programming language. However, when working on problem-solving in the artificial intelligence (AI) field, it is difficult to specify a formulation of a problem from … Read more Basic AI Algorithms

A Data Scientist Approach: Running Postgres SQL using Docker

A Docker application for Data Scientists. Docker container for Postgres for Data Scientists. In this short tutorial, I explain the steps to set up a local PostgreSQL instance running in Docker and to use Python to interact with the database. The following steps were taken to set up the process: 1. Setting up Docker for PostgreSQL 2. Connecting … Read more A Data Scientist Approach: Running Postgres SQL using Docker

Interesting AI/ML Articles You Should Read This Week (July 4)

One of the most interesting articles I read this week. Luca Rossi has written a piece that will send most readers down the path of self and environmental awareness. After reading this article, I found myself questioning the impact of my actions and contributions that can lead to the imaginary worlds created within this … Read more Interesting AI/ML Articles You Should Read This Week (July 4)

A taste of ACL2020: 6 new Datasets & Benchmarks

Datasets and benchmarks are at the core of progress in Natural Language Understanding (NLU): in leaderboard-driven research, progress is upper-bounded by the quality of our evaluations. While datasets for Machine Learning used to last — i.e. MNIST didn’t reach human performance until more than a decade after it was introduced — the latest benchmarks for … Read more A taste of ACL2020: 6 new Datasets & Benchmarks

The Problem with Data Science Competition Platforms like Kaggle

Do Kaggle and other data science competitions violate Open Source? A case for open idea transfer and collaboration. As any data scientist probably knows, there exists a vast world of predictive modelling competitions on the internet; the most well-known competitions are likely those from Kaggle. Some of these competitions are incentivized financially, … Read more The Problem with Data Science Competition Platforms like Kaggle

Day 111 of #NLP365: NLP Papers Summary — The Risk of Racial Bias in Hate Speech Detection

Investigates how racial bias has been introduced by annotators into the datasets for hate speech detection, increasing the harm against minority races, and proposes a method to prime dialect and race to reduce racial bias in annotation. The contributions of the paper are as follows: Found an unexpected correlation between surface markers of African American English … Read more Day 111 of #NLP365: NLP Papers Summary — The Risk of Racial Bias in Hate Speech Detection

Beginner’s Guide to SQL: Disney Princess Edition

I recently finished a SQL course on Coursera and was looking for project ideas when I came across Amanda’s post. I reached out to her and she suggested I read Okoh Anita’s post on Basic SQL, by which she was inspired. Undoubtedly, I too was inspired by how she described simple queries using a very … Read more Beginner’s Guide to SQL: Disney Princess Edition

NGBoost algorithm: solving probabilistic prediction problems

Predict a distribution of the target variable, not just a point estimate. Photo by mohammad alizade on Unsplash. While looking through the ICML 2020 accepted papers, I found an interesting paper: NGBoost: Natural Gradient Boosting for Probabilistic Prediction. We present Natural Gradient Boosting (NGBoost), an algorithm for generic probabilistic prediction via gradient… arxiv.org You may ask, … Read more NGBoost algorithm: solving probabilistic prediction problems

This Model is for the Birds

In the interest of contributing to research in fine-grained vision classification, the Cornell Lab of Ornithology has released the NABirds data set consisting of 48,562 images of 404 bird species. Many of these species are further subdivided into categories such as Male/Female, Adult/Juvenile, Breeding/Non-breeding, which leads to 555 total classes. My goal is to determine … Read more This Model is for the Birds

Improving the Performance of Machine Learning Model using Bagging

Understand the working of Bootstrap Aggregation (Bagging) ensemble learning and implement a Random Forest Bagging model using the sklearn library. Photo by Carlos Muza on Unsplash. The performance of a machine learning model tells us how the model performs on unseen data points. There are various strategies and hacks to improve the performance of an ML … Read more Improving the Performance of Machine Learning Model using Bagging

The Sound of Places — Visualizing audio with R

Using R and soundgen to visualize the spectrogram and loudness of places I’ve visited At the risk of sounding like a broken record (that’s an audio pun right there), I have to start this piece by saying that for the last year I’ve been backpacking. During this adventure, I’ve seen wonderful places, tasted extravagant flavors, … Read more The Sound of Places — Visualizing audio with R

Streaming real-time data into Snowflake with Amazon Kinesis Firehose

Photo by Joao Branco on Unsplash Businesses today can benefit in real-time from the data they continuously generate at massive scale and speed from various data sources. Whether it is clickstream data from websites, telemetry data from IoT devices or log data from applications, continuously analysing that data can help businesses learn what their customers, … Read more Streaming real-time data into Snowflake with Amazon Kinesis Firehose

Medium writers you should follow as an aspiring Data Scientist

Introduction. This is my list of 10 Data Science writers/influencers that I follow. And yes, I follow and read the work of other people as well, but if I had to narrow it down to a top 10 only, this would be the list. I am sharing this because I believe that all aspiring Data Scientists would … Read more Medium writers you should follow as an aspiring Data Scientist

Eigenvalues and Eigenvectors

Computing and Visualizing. Animation available here: https://bdshaff.github.io/bdshaff.github.io/blog/2020-03-23-computing-eigenvalues-and-eigenvectors/ First I’ll talk about what made me curious about how Eigenvalues are actually computed. Then I’ll share my implementation of the simplest algorithm (QR method) and do some benchmarking. Lastly, I’ll share how you can animate the steps the algorithm goes through to find the Eigenvectors! Eigenvalues and … Read more Eigenvalues and Eigenvectors

Natural Language Processing Pipeline

If we were asked to build an NLP application, think about how we would approach doing so at an organization. We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them. Since language processing is involved, we would also list all … Read more Natural Language Processing Pipeline

Ensemble Methods: Comparing Scikit Learn’s Voting Classifier to The Stacking Classifier

Using the Titanic Dataset to compare the scikit-learn voting classifier and the stacking classifier. Photo by Perry Grone on Unsplash. Two heads, they say, are better than one. In many Machine Learning projects we want to make use of the power of synergy using ensemble methods. The voting and the stacking classifiers bring us … Read more Ensemble Methods: Comparing Scikit Learn’s Voting Classifier to The Stacking Classifier

Job Fairs Are Now Going Virtual

WomenHack Networking Event Goes Virtual Photo from Shutterstock Attending networking events in person once required careful planning by printing out business cards and resumes, dressing to impress, and allowing enough time to get to and from the event. With many tech companies continuing to hire during COVID-19, these companies are finding ways to recruit employees … Read more Job Fairs Are Now Going Virtual

Pareidolia — Teaching Art to AI

Pareidolia is our first AI & Art project under the Alien Intelligence umbrella. At Alien Intelligence, we explore our ability to teach art to AI; have it generate some evidence of its understanding, and then analyse and interpret its response. We start with a simple “lesson,” and plan to gradually develop its content and complexity … Read more Pareidolia — Teaching Art to AI

Mastering Query Plans in Spark 3.0

Spark query plans in a nutshell. In Spark SQL the query plan is the entry point for understanding the details about the query execution. It carries lots of useful information and provides insights about how the query will be executed. This is very important especially in heavy workloads or whenever the execution takes too long … Read more Mastering Query Plans in Spark 3.0

Why Should You Use Kotlin For Machine Learning on Android?

Developing ML algorithms will sound thrilling if you’re a programming enthusiast. First things first. Imagine you are about to create a Decision Tree Classifier in Python. What would you do? Most probably you’ll use NumPy for array manipulations and Pandas for processing data. Some might use scikit-learn’s DecisionTreeClassifier. If you are about to create a … Read more Why Should You Use Kotlin For Machine Learning on Android?

Probabilistic Programming and Bayesian Inference for Time Series Analysis and Forecasting

A Bayesian Method for Time Series Data Analysis and Forecasting Photo by Author As described in [1][2], time series data includes many kinds of real experimental data taken from various domains such as finance, medicine, scientific research (e.g., global warming, speech analysis, earthquakes), etc. Time series forecasting has many real applications in various areas such … Read more Probabilistic Programming and Bayesian Inference for Time Series Analysis and Forecasting

How does project management work in data science?

Why is it so hard to squeeze data science into traditional project management approaches? A foolish consistency is the hobgoblin of little minds. Ralph Waldo Emerson (1803–1882) Data science does not normally fit very well into standard project management approaches that have been long established in other disciplines. Why is this? Data science projects traditionally … Read more How does project management work in data science?

How to Create and Beautify Venn Diagrams in Python

Empowered by matplotlib-venn. The Venn diagram is one of the most common diagrams in scientific research articles and can be used to represent the relationship between multiple data sets. From a Venn diagram, you can easily detect the commonalities and differences among those datasets. This tutorial will show you three different ways to create Venn diagrams in Python and … Read more How to Create and Beautify Venn Diagrams in Python

A Potential Data Science Foundation for Math Backgrounds

A course guide to help you get started with Data Science. Photo by Mirko Blicke on Unsplash. You decided to join the data science field! Congrats! Maybe you aren’t sure yet what you want to do in the field, but you want to get in there yesterday. Many advocate that you should dive into … Read more A Potential Data Science Foundation for Math Backgrounds

Generating Synthetic Seismogram in Python

Seven steps to generate a seismogram from well logs. One of the most fundamental concepts in geophysics is convolution. The seismic data that we record in geophysical operations is the energy reflected from the Earth’s internal surfaces that have different physical rock properties from adjacent layers. In fact, seismic signature results from the convolution of reflectivity of … Read more Generating Synthetic Seismogram in Python

Load a Large spaCy model on AWS Lambda

Serverless NLP using spaCy and AWS Lambda Paul Cézanne / Public domain from Wikimedia spaCy is a useful tool that allows us to perform many natural language processing tasks. When integrating spaCy into an existing application, it is convenient to provide it as an API using AWS Lambda and API Gateway. However, due to Lambda’s … Read more Load a Large spaCy model on AWS Lambda

What makes Logistic Regression a Classification Algorithm?

In the above equation, the terms are as follows: g is the logit function, and the equation for g(p(x)) shows that the logit is equivalent to a linear regression expression; ln denotes the natural logarithm; p(x) is the probability that the dependent variable falls in one of the two classes 0 or 1, given some linear … Read more What makes Logistic Regression a Classification Algorithm?
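
The terms described above can be sketched in a few lines of Python. This uses the standard definition of the logit, g(p) = ln(p / (1 - p)), and its inverse, the sigmoid. The intercept, slope, and 0.5 threshold are hypothetical values chosen for illustration and are not taken from the original article.

```python
import math


def logit(p):
    """g(p) = ln(p / (1 - p)): maps a probability in (0, 1) to the real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of the logit: maps a linear score back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, intercept, slope):
    """p(x) for a one-feature model where logit(p(x)) = intercept + slope * x."""
    return sigmoid(intercept + slope * x)


def classify(x, intercept, slope, threshold=0.5):
    """Turn the probability into a class label, 0 or 1 (threshold is an assumption)."""
    return 1 if predict_probability(x, intercept, slope) >= threshold else 0


# Hypothetical coefficients, for illustration only.
intercept, slope = -2.0, 1.0
for x in (1.0, 3.0):
    print(x, round(predict_probability(x, intercept, slope), 3), classify(x, intercept, slope))
```

The model itself outputs a continuous probability; it only becomes a classifier once a threshold is applied to that probability.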

Our Machine Learning Algorithms are Magnifying Bias and Perpetuating Social Disparities

AI Ethics and Considerations For machine learning engineers, the companies that hire them, and the users who are impacted by the algorithms they’ve tuned: Photo by Kevin Ku on Unsplash Shortly after I began my machine learning courses, it dawned on me that there is an absurd exaggeration in the media concerning the state of … Read more Our Machine Learning Algorithms are Magnifying Bias and Perpetuating Social Disparities

A Review of Synthetic Tabular Data Tools and Models

Anonymization methods that are revolutionizing how we share data. Picture by Mika Baumeister @mbaumi. https://unsplash.com/photos/Wpnoqo2plFA We live in a data-driven generation where big data, data mining and artificial intelligence (and other buzzwords) are revolutionizing the ways we obtain value from data. The challenge is that both private companies and public entities have no … Read more A Review of Synthetic Tabular Data Tools and Models

Machine Learning Model Regularization in Practice: an example with Keras and TensorFlow 2.0

Next, let’s create X and y. Keras and TensorFlow 2.0 only take in NumPy arrays as inputs, so we will have to convert the DataFrame back to a NumPy array.

# Creating X and y
X = df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
# Convert DataFrame into np array
X = np.asarray(X)
y = df[['label_setosa', … Read more Machine Learning Model Regularization in Practice: an example with Keras and TensorFlow 2.0

Taking a Machine Learning White Paper to Production

Four things to consider. Image from Wikipedia. As I read white papers, I usually have a purpose in mind, i.e., a potential challenge at hand, usually a big one. Often, they’ll really fit the bill. And sometimes, they’ll even have source code on GitHub. So much the better. Image by Author. This … Read more Taking a Machine Learning White Paper to Production

Updating Partitioned Tables in BigQuery

BigQuery has come a long way, but some great features, such as the wildcard search, still lack functionality that would be relatively straightforward in SQL Server. One such functionality is the ability to update partitioned tables. It’s quite easy to update one; however, if you have 100s of partitions, it can be quite cumbersome to … Read more Updating Partitioned Tables in BigQuery