spaCy Basics

A guide for getting started. NLP and spaCy. A major challenge of text data is extracting meaningful patterns and using those patterns to find actionable insights. NLP can be thought of as a two-part problem. Processing: converting the text data from its original form into a form the computer can understand. This includes data … Read more spaCy Basics

Predicting movie revenue with AdaBoost, XGBoost and LightGBM

What determines movie success? Marvel’s Avengers: Endgame recently dethroned Avatar as the highest grossing movie in history and while there was no doubt about this movie becoming very successful, I want to understand what makes any given movie a success. The questions I will answer are: Which variables are particularly predictive of absolute revenue figures? … Read more Predicting movie revenue with AdaBoost, XGBoost and LightGBM

Simulate Images for ML in PyBullet — The Quick & Easy Way

When applying deep Reinforcement Learning (RL) to robotics, we are faced with a conundrum: how do we train a robot to do a task when deep learning requires hundreds of thousands, even millions, of examples? To achieve 96% grasp success on never-before-seen objects, researchers at Google and Berkeley trained a robotic agent through 580,000 real-world … Read more Simulate Images for ML in PyBullet — The Quick & Easy Way

Run Amazon SageMaker Notebook locally with Docker container

The main aim of the local Docker container is to preserve as many of the most important features of the AWS-hosted instance as possible while adding the ability to run locally. The following features have been replicated: Jupyter Notebook and Jupyter Lab. This is simply taken from Jupyter’s official Docker images with a … Read more Run Amazon SageMaker Notebook locally with Docker container

Skip the heavy lifting: Moving Redshift to BigQuery easily

Enterprise data warehouses are getting more expensive to maintain. Traditional data warehouses are hard to scale and often involve lots of data silos. Business teams need data insights quickly, but technology teams have to grapple with managing and providing that data using old tools that aren’t keeping up with demand. Increasingly, enterprises are migrating their … Read more Skip the heavy lifting: Moving Redshift to BigQuery easily

Meta-Transfer Learning for Few-shot Learning

Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend … Read more Meta-Transfer Learning for Few-shot Learning

Can you have your groceries delivered in under 15 minutes?

A quick simulation and optimization study for rapid delivery. Instacart, Amazon Prime Now, Farmstead and many more startups in the delivery space are tackling very interesting supply, demand, simulation and logistics optimization problems. Back in 2018 my cofounder Ricky Wong and I wanted to validate a radical grocery delivery idea: get groceries delivered under … Read more Can you have your groceries delivered in under 15 minutes?

What does a modern analytics platform need to offer companies real added value?

Currently, new, innovative platforms are sprouting up on the market again and again – implemented with technical competence and ideally suited to the respective analytical approaches. But the question arises: Is that enough? Is it enough to develop software that allows reliable … Read more What does a modern analytics platform need to offer companies real added value?

Build your own custom hotword detector with zero training data and $0!

TLDR: Google TTS -> Noise augment -> {wav files} -> SnowBoy -> {.pmdl models} -> Raspberry Pi. OK, so it’s that time of the year again. You know there’s *that* thing in the desert. Last time around, I rigged up a Google AIY vision kit and added espeak on Chip and Terra, the art installations of … Read more Build your own custom hotword detector with zero training data and $0!

An easy introduction to unsupervised learning with 4 basic techniques

Deep Learning has gotten a lot of love from both the AI community and the general public. But recently, researchers have started to question whether deep learning is really the future of AI. The prominent deep learning techniques used today all rely on supervised learning, yet we see quite clearly that humans … Read more An easy introduction to unsupervised learning with 4 basic techniques

Regular Sequences

[This article was first published on R-exercises, and kindly contributed to R-bloggers]. So far in this series, we used vectors from built-in datasets (rivers, women … Read more Regular Sequences

Announcing the general availability of Python support in Azure Functions

Python support for Azure Functions is now generally available and ready to host your production workloads across data science and machine learning, automated resource management, and more. You can now develop Python 3.6 apps to run on the cross-platform, open-source Functions 2.0 runtime. These can be published as code or Docker containers to a Linux-based … Read more Announcing the general availability of Python support in Azure Functions

Why Machine Learning is more Practical than Econometrics in the Real World

[This article was first published on R – Remix Institute, and kindly contributed to R-bloggers]. Motivation. I’ve read several studies and articles that claim Econometric … Read more Why Machine Learning is more Practical than Econometrics in the Real World

Anomalies in Global Suicide Data

Mental Health Search Interest on Google Trends. Every Mental Health Awareness Day (October 10), there is a peak in search interest for “mental health” on Google Trends. However, this past October saw the highest search interest ever recorded. Mental health in the United States is growing as a part of the global conversation – … Read more Anomalies in Global Suicide Data

Ridge Regression Python Example

Overfitting, where a model performs well on training samples but fails to generalize, is one of the main challenges in machine learning. In this article, we’ll cover how we can use regularization to help prevent overfitting. To be specific, we’ll talk about Ridge Regression, a distant cousin of Linear Regression, and … Read more Ridge Regression Python Example

Defining A Data Science Problem

The most important non-technical skill for a Data Scientist. According to Cameron Warren, in his Towards Data Science article Don’t Do Data Science, Solve Business Problems, “…the number one most important skill for a Data Scientist above any technical expertise — [is] the ability to clearly evaluate and define a problem.” As a data scientist … Read more Defining A Data Science Problem

Best Investment Portfolio Via Monte-Carlo Simulation In Python

The risk-free rate is the rate that an investor earns on his/her investment without taking any risk, for example by buying government treasury bills. There is a tradeoff between risk and return. If an investor chooses an investment option riskier than the risk-free one, then he/she is expecting … Read more Best Investment Portfolio Via Monte-Carlo Simulation In Python

Improving the Stack Overflow search algorithm using Semantic Search and NLP

4. Training Word Embeddings using Word2Vec. In order for our model to understand the raw text data, we need to vectorize it. Bag of Words (BOW) and TF-IDF are very common approaches for vectorizing. However, since I would be using an artificial neural network as my model (LSTM), the sparse nature of BOW and TF-IDF would pose … Read more Improving the Stack Overflow search algorithm using Semantic Search and NLP

Autonomous Driving: Intro into SLAM

SLAM is the process by which a robot/vehicle builds a global map of its current environment and uses this map to navigate or deduce its location at any point in time [1–3]. SLAM is commonly used in autonomous navigation, especially to assist navigation in areas where global positioning systems (GPS) fail or in previously unseen areas. … Read more Autonomous Driving: Intro into SLAM

Searching for Food Deserts in Los Angeles County

For a recent data science project, I collaborated with several other Lambda School students to search for food deserts in L.A. County. A general definition for what qualifies as a food desert is an area that does not have access, within one mile, to a grocery store/market providing fresh, healthy food options, … Read more Searching for Food Deserts in Los Angeles County

A Detailed, Step-by-Step Guide to Linear Regression using MATLAB

Prediction of Housing Prices. The aim is to draw statistical inferences from the data given in the paper “Narula and Wellington (1977), Prediction, Linear Regression and the Minimum Sum of Relative Errors, Technometrics”, using linear regression for prediction purposes. In the data, 28 observations are given for each predictor (11 different predictors) … Read more A Detailed, Step-by-Step Guide to Linear Regression using MATLAB

Data Science in Production

Building Scalable Model Pipelines with Python. One of my biggest regrets as a data scientist is that I avoided learning Python for too long. I always figured that other languages provided parity in terms of accomplishing data science tasks, but now that I’ve made the leap to Python there is no looking back. … Read more Data Science in Production

5 Tips To Create A More Reliable Web Crawler

Boost your web crawler’s efficiency! When crawling websites, having your crawler blocked is about the most annoying situation there is. To become really good at web crawling, you should not only be able to write XPath or CSS selectors quickly; how you design your crawlers also matters a … Read more 5 Tips To Create A More Reliable Web Crawler

Sentiment Analysis of Economic Reports Using Logistic Regression

Sentiment analysis is a hot topic in NLP, and this technology is increasingly relevant in the financial markets, which are in large part driven by investor sentiment. With so many reports and economic bulletins being generated on a daily basis, one of the big challenges for policymakers is to extract meaningful information in a … Read more Sentiment Analysis of Economic Reports Using Logistic Regression

Why You Should Double Down On Serverless Infrastructure

When you double-down on serverless architecture you begin to reap amazing rewards. Serverless has been around for a few years. It is not a brand new idea, but it is a new way of thinking about building applications. I always tend to think about why I am doing something before I think about how I … Read more Why You Should Double Down On Serverless Infrastructure

Regression — explained in simple terms!!

In this article, I wish to present regression in as simple terms as possible, so that you remember it not as a statistical concept but as a more relatable experience. Regression, as fancy as it sounds, can be thought of as a “relationship” between any two things. For example, imagine you stay on … Read more Regression — explained in simple terms!!

Scholarly Network Analysis

References. [1] Feng Xia, Wei Wang, Teshome Megersa Bekele, and Huan Liu. Big scholarly data: A survey. IEEE Transactions on Big Data, 3(1):18–35, 2017. [2] Tze-Haw Huang and Mao Lin Huang. Analysis and visualization of co-authorship networks for understanding academic collaboration and knowledge domain of individual researchers. In Computer Graphics, Imaging and Visualisation, 2006 International Conference on, … Read more Scholarly Network Analysis

Dynamic Speed Optimization

REGRESSION MODELING. Modeling Ship Performance Curves to Reduce Fuel Consumption. [Photo: Container ships can consume over 350 tons of fuel per day. Photo by Anker Crew Insurance.] Total fuel costs for the global commercial maritime shipping industry were approximately $100 billion in 2018. Emissions regulations, imposed by the International Maritime Organization, are expected to increase fuel … Read more Dynamic Speed Optimization

Bayesian Strategy for Modeling Retail Price with PyStan

Statistical modeling, partial pooling, multilevel modeling, hierarchical modeling. Pricing is a common problem faced by any e-commerce business, and one that can be addressed effectively by Bayesian statistical methods. The Mercari Price Suggestion data set from Kaggle seems to be a good candidate for the Bayesian models I wanted to learn. If you remember, the … Read more Bayesian Strategy for Modeling Retail Price with PyStan

Tune: fast hyperparameter tuning at any scale

Let’s now dive into a concrete example that shows how to leverage a state-of-the-art early stopping algorithm (ASHA). We will start by running Tune across all of the cores on your workstation. We’ll then scale out the same experiment on the cloud with about 10 lines of code. We’ll be using PyTorch in this … Read more Tune: fast hyperparameter tuning at any scale

NIPS 2018 paper on “Robust Classification of Financial Risk” — Summary

In this short article, I would like to give an overview of a research paper called “Robust Classification of Financial Risk”. The paper was accepted for the NIPS 2018 Workshop on “Challenges and Opportunities for AI in Financial Services” and aims to solve a very interesting and unique problem occurring in credit lending done with deep learning … Read more NIPS 2018 paper on “Robust Classification of Financial Risk” — Summary

Local Model Interpretation: An Introduction

Concept and Theory. LIME (Local Interpretable Model-agnostic Explanations) is a local model interpretation technique that uses local surrogate models to approximate the predictions of the underlying black-box model. Local surrogate models are interpretable models, such as a linear regression or a decision tree, that are used to explain individual predictions of a black-box model. LIME trains a surrogate model … Read more Local Model Interpretation: An Introduction

Measures of Proximity in Data Mining & Machine Learning

Moving forward, we are going to talk about Similarity and Dissimilarity between data objects separately. Without further ado, let’s dive into it. Dissimilarities between Data Objects. We begin with a discussion of distances, which are dissimilarities with certain properties. Euclidean Distance. The Euclidean distance, d, between two points, x and y, in one, two, three, or higher- … Read more Measures of Proximity in Data Mining & Machine Learning
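
As a minimal illustration of the Euclidean distance named in the excerpt above, here is a short Python sketch. It is not taken from the linked article. It assumes the standard definition, the square root of the summed squared coordinate differences, and the function name euclidean_distance is a hypothetical one chosen for this example.

```python
import math
from typing import Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Return the Euclidean distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance([0.0], [3.0]))                      # one dimension: 3.0
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))            # two dimensions: 5.0
print(euclidean_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical points: 0.0
```

The same function works for any number of dimensions, and the result is symmetric in x and y, which is one of the properties a distance is expected to have.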

Converting a Deep Learning Model with Multiple Outputs from PyTorch to TensorFlow

Generating and preparing the data. The main difference in the data is that there are now two different sets of actual outputs, one as a continuous variable and the other in binary form. Also, I defined two functions to generate two different types of outputs for the data. The snippet below illustrates the process of … Read more Converting a Deep Learning Model with Multiple Outputs from PyTorch to TensorFlow

Missing Values In Dataframes With Inspectdf

[This article was first published on Alastair Rushworth, and kindly contributed to R-bloggers]. Summarising NA by column in dataframes. Exploring the number of records containing … Read more Missing Values In Dataframes With Inspectdf

How Markets Fool the Models, and Us

Curve Fit Market Data with Caution. Data science used to be the domain of statisticians, scientists and Wall Street quants, but thanks to the ubiquity of data and open source libraries, all of us can now develop powerful, predictive models. Of course, these models also have the power to breed overconfidence, especially in the stock … Read more How Markets Fool the Models, and Us

Probability of an Approaching AI Winter

This article addresses the question of whether the field of Artificial Intelligence (AI) is approaching another AI winter. Motivation. Industry and governments alike have invested significantly in the AI field, with many AI-related startups established in the last 5 years. If another AI winter were to come about, many people could lose … Read more Probability of an Approaching AI Winter

Sugar, Flower, Fish or Gravel — Now a Kaggle competition

I am very happy to announce the launch of our Kaggle competition “Understanding Clouds from Satellite Images”. This competition is the culmination of literally hundreds of hours of human labor from dozens of scientists. The challenge is to segment satellite images into one of four classes. Typically, when we think about different cloud types we … Read more Sugar, Flower, Fish or Gravel — Now a Kaggle competition

6 lessons learned as a new data science lead

If you have worked in a data science team already, probably you are not entirely unfamiliar with uncertainty. Most probably you have worked on some greenfield projects in your past. Maybe you have even led some of them. And some of them might have succeeded, while some others might not have reached … Read more 6 lessons learned as a new data science lead