End-to-End Time Series Interpolation in PySpark — Filling the Gap

Photo by Steve Halama on Unsplash. Anyone working with data knows that real-world data is often patchy and that cleaning it takes up a considerable amount of your time (80/20 rule, anyone?). Having recently moved from Pandas to PySpark, I was used to the conveniences that Pandas offers and that PySpark sometimes lacks due to its distributed … Read more

Follow & Learn: Experiment Size With Python

Photo by Crissy Jarvis on Unsplash. You want to change your website layout to get more clicks. You decide to run an experiment where a control group sees the usual page and an experimental group sees a new layout. Let’s suppose your current website click-through rate (CTR) is p_null=10% and you want to increase … Read more

Audio to Guitar Tab with Deep Learning

Using Convolutional Neural Networks to expedite learning music. Photo by Jacek Dylag on Unsplash. This story outlines the implementation of automatic guitar transcription from audio files using Python, TensorFlow, and Keras, and details the surface-level methods used. For training, the GuitarSet data set is employed for its large quantity of isolated guitar recordings with … Read more

Distributed Deep Learning Pipelines with PySpark and Keras

Step 4: Create the Spark Data Pipeline. Now we create the pipeline using PySpark. This essentially takes your data and, per the feature lists you pass, does the transformations and vectorizing so it is ready for modeling. I referenced the “Extracting, transforming and selecting features” Apache Spark documentation a lot for this pipeline and … Read more

Choose the Right Transformer Framework for You

Compare different Transformer implementation frameworks and choose the best framework for your own needs. Image credit: © Flynt, Bigstockphoto.com. TL;DR: Depending on whether you prefer PyTorch or TensorFlow, I recommend using Fairseq or Tensor2Tensor, respectively. If you are a researcher, Fairseq is flexible enough for customization. But if you are working on a real application and considering deployment, … Read more

What makes a movie hit a “jackpot”? Learning from data with Multiple Linear Regression

Exploratory Data Analysis: Feature selection. Obviously, we don’t have to consider all 32 variables for our model; it doesn’t make sense to include some of them in the statistical analysis, as they were provided for informational purposes only. The right question is: What variables should I consider in my model? First, let’s look at our dataset. Codebook … Read more

Beginner’s Guide to Sentiment Analysis for Simplified Chinese using SnowNLP

Image made by a colleague from Yoozoo Games. By reading this article, you will be exposed to a technique for analyzing the sentiment of any text in Simplified Chinese. This tutorial is based on Simplified Chinese, but it can be used on Traditional Chinese as well because SnowNLP is capable of … Read more

Data Visualization With Matplotlib Using Python

Data visualization using Python. I feel that in today’s digital world, data has become as important as air. People are consuming and generating huge volumes of data, knowingly and unknowingly, on a daily basis. It is this bombardment of digital information that businesses are trying to tap and harness to sell to and engage their customers … Read more

Too Close For Comfort

Why Target and Walmart locate across the street from each other. Hotelling’s Law: If you’ve ever been to a mall, you’ve often seen a surprising situation: stores like Target, Walmart, JCPenney and Kohl’s right next to each other, often within walking distance. It’s a strange phenomenon. Wouldn’t competitors choose to locate themselves farther from similar stores, to … Read more

Top 10 Statistics Mistakes Made by Data Scientists

A data scientist is a “person who is better at statistics than any software engineer and better at software engineering than any statistician”. In “Top 10 Coding Mistakes Made by Data Scientists” we discussed how statisticians can become better coders. Here we discuss how coders can become better statisticians. Detailed output and code for … Read more

Spice Up Your Python Visualizations with Matplotlib Animations

Animating the Board. The part that we’ve been waiting for: animation! First, we need to get some formalities out of the way. The following lines of code create the matplotlib figure that will display our animation.

# Required line for plotting the animation
%matplotlib notebook
# Initialize the plot of the board that will be used for animation
fig = … Read more

Backtesting Your First Trading Strategy

Backtesting is a fundamental step in testing the viability of your trading ideas and strategies. Here is a simple backtesting implementation in Python. This article showcases a simple implementation for backtesting your first trading strategy in Python. Backtesting is a vital step when building out trading strategies. The core idea here is to develop a strategy … Read more

An “Equation-to-Code” Machine Learning Project Walk-Through — Part 3 SGD

A detailed explanation of how to implement Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent from scratch in Python. Image from Shutterstock. Hi, everyone! This is part 3 of the “Equation-to-Code” walk-through. In the previous articles, we covered the linearly separable problem in part 1 and the non-linearly separable problem in part 2. This time we will implement stochastic gradient descent (SGD) … Read more

Smart Prosthetics with Object Detection using Tensorflow

During my time at NC State’s Active Robotics Sensing (ARoS) Lab, I had the opportunity to work on a project for smarter control of upper limb prostheses using computer vision techniques. A prosthetic arm would detect what kind of object it was trying to interact with, and adapt its movements accordingly. Source: Newcastle University. Similar … Read more

Text Classification in Python

An end-to-end Machine Learning project. Learn to build a text classification model in Python. This article is the first of a series in which I will cover the whole process of developing a machine learning project. In this article we focus on training a supervised learning text classification model in Python. The motivation behind writing … Read more

Getting Started with Plot.ly

A Guided Walkthrough for Powerful Visualizations in Python. Authors: Elyse Lee and Ishaan Dey. Matplotlib is alright, Seaborn is great, but Plot.ly? That’s on an entirely new level. Plot.ly offers more than your average graph by providing options for full interactivity and many editing tools. It is differentiated from the others by having options to have graphs in … Read more

Brilliant Jerks, Crazy Hotties, and Other Artifacts of Range Restriction

When people write about Steve Jobs, they mention that he was brilliant but caustic: he could instantly solve design problems that had bedeviled his team for months, but he’d summarily fire people for minor mistakes. Since a lot of people want to be like Steve Jobs, and since being a genius is hard, some ambitious … Read more

The Prosecutor’s Fallacy

Conditional Probability in the Courtroom. The lasso of truth. Imagine you have been arrested for murder. You know that you are innocent, but physical evidence at the scene of the crime matches your description. The prosecutor argues that you are guilty because the odds of finding this evidence given that you are innocent are so small that … Read more
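
The gap between "the evidence is unlikely if you are innocent" and "you are unlikely to be innocent" can be made concrete with Bayes' theorem. The sketch below is only an illustration: the function name and every number in it (a 1-in-10,000 chance match, a certain match for the guilty party, one culprit among a million possible suspects) are hypothetical and do not come from the article.

```python
def posterior_innocence(p_evidence_given_innocent, p_evidence_given_guilty, prior_guilt):
    """Return P(innocent | evidence) using Bayes' theorem."""
    prior_innocent = 1.0 - prior_guilt
    # Joint probability of being innocent AND matching the evidence.
    innocent_and_evidence = p_evidence_given_innocent * prior_innocent
    # Total probability of matching the evidence (innocent or guilty).
    p_evidence = innocent_and_evidence + p_evidence_given_guilty * prior_guilt
    return innocent_and_evidence / p_evidence


# Hypothetical inputs, chosen only for illustration.
p_innocent = posterior_innocence(
    p_evidence_given_innocent=1e-4,  # P(evidence | innocent): what the prosecutor quotes
    p_evidence_given_guilty=1.0,     # the true culprit always matches
    prior_guilt=1 / 1_000_000,       # one culprit among a million possible suspects
)
print(f"P(innocent | evidence) = {p_innocent:.3f}")
```

With these made-up inputs the posterior probability of innocence is roughly 0.99, even though the evidence would match an innocent person only once in ten thousand times. The prior does most of the work, which is exactly what the fallacy ignores.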

Beyond Bar Graphs and Pie Charts

A BEGINNER’S GUIDE: Using Python, R, Tableau, and RawGraphs to effectively and beautifully communicate your data. I understand. Maybe you forgot about your presentation this afternoon. Maybe you have 5 minutes to throw together the 3 visuals your boss wants on his desk by the end of the day. Maybe you’re just tired of dealing … Read more

Artificial Intelligence Made Easy

Photo Source: ShutterStock. A Comprehensive Guide to Modeling with H2O.ai in Python. By Ishaan Dey & Elyse Lee. If you’re anything like my dad, you’ve worked in IT for decades but have only tangentially touched data science. Now, your new C-something-O wants you to fire up a data analytics team and work with a new set … Read more

AutoML — A Tool to Improve Your Workflow

Photo by Alex Knight on Unsplash. A look at H2O AutoML in binary classification. Recently, demand for data science skills has grown faster than the current supply can keep up with. Today it’s difficult to imagine a business that wouldn’t benefit from the detailed analysis data scientists and machine learning algorithms perform. … Read more

Text Processing Is Coming

Tokenization: In order to analyze a text, its words must be pulled out and analyzed. One way to do this is to split each text by spaces so that individual words are returned. However, this doesn’t take into account punctuation or other symbols that you might want to remove. This process of breaking sentences, paragraphs, … Read more
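
The difference between splitting on spaces and stripping punctuation is easy to see on a short string. This is a minimal sketch using only the Python standard library; the sample sentence is made up, and the regular expression is one simple choice among many (for instance, it splits contractions such as "don't" into two tokens).

```python
import re

text = "Winter is coming. Brace yourselves!"  # hypothetical sample sentence

# Naive approach: split on whitespace. Punctuation stays glued to the words.
space_tokens = text.split()
print(space_tokens)  # ['Winter', 'is', 'coming.', 'Brace', 'yourselves!']

# Regex approach: lowercase, then keep only runs of word characters.
word_tokens = re.findall(r"\w+", text.lower())
print(word_tokens)   # ['winter', 'is', 'coming', 'brace', 'yourselves']
```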

Feature Selection Why & How Explained

Feature selection algorithm implementations in Python. In the last article, I explained the problems with including irrelevant or correlated features in model building. In this article, I’ll show you several neat implementations of selection algorithms that can be easily integrated into your project pipeline. Before diving into the detailed implementation, let’s go through the dataset … Read more

Regression or Classification? Linear or Logistic?

Understanding the differences & the various models for each. Regression vs Classification: In order to decide whether to use a regression or classification model, the first question you should ask yourself is: Is your target variable a quantity, a probability of a binary category, or a label? If it’s one of the first two options, then you … Read more

Beginner’s Guide to BERT for Multi-classification Task

Original photo by David Pisnoy on Unsplash, later modified to include some inspiring quotes. The purpose of this article is to provide a step-by-step tutorial on how to use BERT for a multi-classification task. BERT (Bidirectional Encoder Representations from Transformers) is a new method of pre-training language representations by Google that aimed to … Read more

A Gentle Implementation of Reinforcement Learning in Pairs Trading

Back to Part 2: Code Design. Illustration of the code structure. 2.1 Config: The execution is governed by the config (a dictionary). This component allows us to encapsulate a lot of executions and tidy up the code. It can also be used as a carrier of additional parameters. For instance, in the previous section, the instantiation of … Read more

Effective Data Visualization for other Humans

Choosing the Right Visual Matters. Credit: Harry Quan, Unsplash. Irrespective of the accuracy of its content, choosing the wrong chart can convey misleading and potentially harmful messages to your audience. We have the responsibility to uphold the integrity of the messages we send. Additionally, using the wrong visual can misclassify otherwise accurate information. We can minimize these … Read more

A Simple Introduction to K-Nearest Neighbors Algorithm

What is KNN? K Nearest Neighbour is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure. It is mostly used to classify a data point based on how its neighbours are classified. Let’s take the wine example below, with two chemical components called Rutime and Myricetin. … Read more
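
The idea of storing all cases and letting the nearest neighbours vote can be sketched in a few lines. The code below is a minimal illustration, not the article's implementation: the function name, the measurements and the "red"/"white" labels are all hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every stored case with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest cases and return the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical measurements of two chemical components per wine.
points = [(1.0, 1.2), (1.5, 1.8), (1.1, 0.9), (5.0, 5.2), (5.5, 4.8), (4.9, 5.1)]
labels = ["red", "red", "red", "white", "white", "white"]

print(knn_predict(points, labels, query=(1.3, 1.0)))  # red
print(knn_predict(points, labels, query=(5.2, 5.0)))  # white
```

An odd `k` avoids tied votes when there are two classes. In practice the features are usually scaled first so that no single component dominates the distance.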

Using FIPS to Visualize in Plotly

Recently I have done two visualization projects with Plotly, visualizing average rent per square foot in California and average data scientist salary across the United States. I used two different approaches to visualize data on the map: Scatter plot on Map and Choropleth Map. Why Plotly? Plotly is one of the powerful visualization packages available in Python. … Read more

Let’s Talk About Machine Learning Ensemble Learning In Python

Build Better Predictive Models By Efficiently Combining Classifiers Into A Meta-Classifier. Learning about ensembles is important for anyone who wants to gain an advanced understanding of machine learning concepts. This article will focus on the techniques and methods for combining a set of classifiers to improve the performance of your machine learning solution. Ensemble … Read more

Training a Convolutional Neural Network from scratch

A simple walkthrough of deriving backpropagation for CNNs and implementing it from scratch in Python. In this post, we’re going to do a deep-dive on something most introductions to Convolutional Neural Networks (CNNs) lack: how to train a CNN, including deriving gradients, implementing backprop from scratch (using only numpy), and ultimately building a full training pipeline! … Read more

Beginner’s Guide to Building Neural Networks in TensorFlow

Detailed Walkthrough of the TensorFlow 2.0 Beginner Notebook. If you’re reading this, you’ve probably had some exposure to neural networks and TensorFlow, but you might feel somewhat daunted by the various terms associated with deep learning that are often glossed over or left unexplained in many introductions to the technology. This article will shine a light … Read more

Beginner’s Recommendation Systems with Python

Building our own recommendation systems with the TMDB 5000 movies dataset. Objectives of this Tutorial: Here are some objectives for you: (1) learn what recommendation systems are, how they work, and some of their different flavors; (2) implement a few recommendation systems using Python and the TMDB 5000 movies dataset. What are Recommendation Systems? A recommendation system (also commonly … Read more

Hyper-Parameter Tuning and Model Selection, Like a Movie Star

Coding, analyzing, selecting, and tuning like you really know what you’re doing. Photo by Markus Spiske on Unsplash. “Hyper-parameter tuning for random forest classifier optimization” is one of those phrases which would sound just as at ease in a movie scene where hackers are aggressively typing to “gain access to the mainframe” as it does in a … Read more

What’s in the Black Box?

Individual Conditional Expectation plots: But what if we are unsure of the causal direction of our features? One tool that can help is an Individual Conditional Expectation (ICE) plot. Instead of plotting the average prediction based on the value of a feature, it plots a line for each observation across possible values of the feature. … Read more
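
The "one line per observation" idea can be shown without any plotting library. The following is a minimal sketch under stated assumptions: `ice_curves` and `toy_model` are hypothetical names, the toy model simply multiplies its two features, and the two observations are made up. Each curve holds the model's predictions for one observation as a single feature is swept over a grid while the other feature stays fixed.

```python
def ice_curves(predict, rows, feature_index, grid):
    """Return one prediction curve per row, varying a single feature over `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)             # copy so the original row is untouched
            modified[feature_index] = value  # overwrite only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def toy_model(row):
    # Hypothetical model with an interaction between its two features.
    x0, x1 = row
    return x0 * x1


rows = [(1.0, -1.0), (2.0, 1.0)]  # two made-up observations
grid = [1.0, 2.0, 3.0]            # values to sweep for feature 0

curves = ice_curves(toy_model, rows, feature_index=0, grid=grid)
# Averaging the curves point by point gives the partial dependence curve.
average_curve = [sum(column) / len(column) for column in zip(*curves)]

print(curves)         # [[-1.0, -2.0, -3.0], [1.0, 2.0, 3.0]]
print(average_curve)  # [0.0, 0.0, 0.0]
```

In this toy case the averaged curve is flat even though each individual curve has a clear slope, which is the kind of structure an ICE plot exposes and an average-only plot hides.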

A simple Monte-Carlo simulation to solve a Putnam Competition math problem

An overview of the Monte-Carlo method in Python. Photo by Jonathan Petersson on Unsplash. Introduction: There are two things common to high-stakes gambling and high-speed racing: a huge degree of uncertainty and the city of Monte-Carlo. This connection between the two led to the use of the term “Monte-Carlo simulation” for computational methods that predict the probabilities of … Read more
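
The core of a Monte-Carlo simulation is to repeat a random experiment many times and report the fraction of runs in which the event of interest occurs. The sketch below uses a toy event (two fair dice summing to 7), not the Putnam problem from the article; the function names, the seed and the number of trials are all illustrative assumptions.

```python
import random


def estimate_probability(event, trials, seed=0):
    """Estimate P(event) as the fraction of random trials in which it occurs."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(trials) if event(rng))
    return hits / trials


def two_dice_sum_to_seven(rng):
    # Toy event for illustration only: two fair dice sum to 7.
    return rng.randint(1, 6) + rng.randint(1, 6) == 7


estimate = estimate_probability(two_dice_sum_to_seven, trials=100_000)
exact = 6 / 36  # six of the 36 equally likely outcomes sum to 7

print(f"estimate = {estimate:.4f}, exact = {exact:.4f}")
```

The estimate's error shrinks roughly with the square root of the number of trials, so gaining one more digit of accuracy costs about a hundred times more simulation.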

How do you check the quality of your regression model in Python?

Linear regression is rooted strongly in the field of statistical learning, and therefore the model must be checked for ‘goodness of fit’. This article shows you the essential steps of this task in the Python ecosystem. Why it is important (and why you might be missing it): For all the talk and hair-splitting on the … Read more

Getting set up in PostgreSQL

Using PSQL and Python Pandas together. In the I SPY children’s series, the author asks us to search a large image for the smaller images we want to find. It is not easy (by design), which is the opposite of what we hope for with our SQL queries. One of the big challenges I faced after … Read more

Summarizing Economic Bulletin Documents with tf-idf

A key strength of NLP (natural language processing) is being able to process large amounts of text and then summarise it to extract meaningful insights. In this example, a selection of economic bulletins in PDF format from 2018 to 2019 is analysed in order to gauge economic sentiment. The bulletins in question are sourced from … Read more

Anomaly detection in time series with Prophet library

First of all, let’s define what an anomaly in a time series is. The anomaly detection problem for time series can be formulated as finding outlier data points relative to some standard or usual signal. While there are plenty of anomaly types, we’ll focus only on the most important ones from a business perspective, such as unexpected … Read more