Getting Started with Plot.ly

A Guided Walkthrough for Powerful Visualizations in Python. Authors: Elyse Lee and Ishaan Dey. Matplotlib is alright, Seaborn is great, but Plot.ly? That’s on an entirely new level. Plot.ly offers more than your average graph by providing options for full interactivity and many editing tools. Differentiated from the others by having options to have graphs in … Read more

Brilliant Jerks, Crazy Hotties, and Other Artifacts of Range Restriction

When people write about Steve Jobs, they mention that he was brilliant but caustic: he could instantly solve design problems that had bedeviled his team for months, but he’d summarily fire people for minor mistakes. Since a lot of people want to be like Steve Jobs, and since being a genius is hard, some ambitious … Read more

The Prosecutor’s Fallacy

Conditional Probability in the Courtroom The lasso of truth Imagine you have been arrested for murder. You know that you are innocent, but physical evidence at the scene of the crime matches your description. The prosecutor argues that you are guilty because the odds of finding this evidence given that you are innocent are so small that … Read more

Beyond Bar Graphs and Pie Charts

A BEGINNER’S GUIDE Using Python, R, Tableau, and RawGraphs to effectively and beautifully communicate your data I understand. Maybe you forgot about your presentation this afternoon. Maybe you have 5 minutes to throw together the 3 visuals your boss wants on his desk by the end of the day. Maybe you’re just tired of dealing … Read more

Artificial Intelligence Made Easy

Photo Source: ShutterStock A Comprehensive Guide to Modeling with H2O.ai in Python By Ishaan Dey & Elyse Lee If you’re anything like my dad, you’ve worked in IT for decades but have only tangentially touched data science. Now, your new C-something-O wants you to fire up a data analytics team and work with a new set … Read more

AutoML — A Tool to Improve Your Workflow

Photo by Alex Knight on Unsplash A look at H2O AutoML in binary classification Recently, the demand for data science skills has grown faster than the current supply can keep up with. Today it’s difficult to imagine a business that wouldn’t benefit from the detailed analysis data scientists and machine learning algorithms perform. … Read more

Text Processing Is Coming

Tokenization In order to analyze a text, its words must be pulled out and analyzed. One way to do this is to split each text by spaces so that individual words are returned. However, this doesn’t take into account punctuation or other symbols that you might want to remove. This process of breaking sentences, paragraphs, … Read more

Feature Selection Why & How Explained

Feature selection algorithm implementations in Python In the last article, I explained the problems with including irrelevant or correlated features in model building. In this article, I’ll show you several neat implementations of selection algorithms that can be easily integrated into your project pipeline. Before diving into the detailed implementation, let’s go through the dataset … Read more

Regression or Classification? Linear or Logistic?

Understanding the differences & the various models for each Regression vs Classification In order to decide whether to use a regression or classification model, the first question you should ask yourself is: Is your target variable a quantity, a probability of a binary category, or a label? If it’s one of the former options, then you … Read more

Beginner’s Guide to BERT for Multi-classification Task

Original Photo by David Pisnoy on Unsplash. It was later modified to include some inspiring quotes. The purpose of this article is to provide a step-by-step tutorial on how to use BERT for a multi-classification task. BERT (Bidirectional Encoder Representations from Transformers) is a new method of pre-training language representation by Google that aimed to … Read more

A Gentle Implementation of Reinforcement Learning in Pairs Trading

Back to Part 2: Code Design Illustration of the code structure 2.1 Config The execution is governed by the config (dictionary). This component allows us to encapsulate a lot of executions and tidy up the code. It can also be used as a carrier of additional parameters. For instance, in the previous section, the instantiation of … Read more

Effective Data Visualization for other Humans

Choosing the Right Visual Matters Credit: Harry Quan — Unsplash Irrespective of the accuracy of its content, choosing the wrong chart can convey misleading and potentially harmful messages to your audience. We have the responsibility to uphold the integrity of the messages we send. Additionally, using the wrong visual can misclassify otherwise accurate information. We can minimize these … Read more

A Simple Introduction to K-Nearest Neighbors Algorithm

What is KNN? K Nearest Neighbour is a simple algorithm that stores all the available cases and classifies the new data or case based on a similarity measure. It is mostly used to classify a data point based on how its neighbours are classified. Let’s take the wine example below, with two chemical components called Rutime and Myricetin. … Read more

Using FIPS to Visualize in Plotly

Recently I have done two visualization projects with Plotly, visualizing average rent per square foot in California and average data scientist salary across the United States. I used two different approaches to visualize data on the map: a scatter plot on a map and a choropleth map. Why Plotly? Plotly is one of the powerful visualization packages available in Python. … Read more

Let’s Talk About Machine Learning Ensemble Learning In Python

Build Better Predictive Models By Efficiently Combining Classifiers Into A Meta-Classifier Learning about ensembles is important for anyone who wants to get an advanced-level understanding of machine learning concepts. This article will focus on the techniques and methods for combining a set of classifiers to improve the performance of your machine learning solution. Ensemble … Read more

Training a Convolutional Neural Network from scratch

A simple walkthrough of deriving backpropagation for CNNs and implementing it from scratch in Python. In this post, we’re going to do a deep-dive on something most introductions to Convolutional Neural Networks (CNNs) lack: how to train a CNN, including deriving gradients, implementing backprop from scratch (using only numpy), and ultimately building a full training pipeline! … Read more

Beginner’s Guide to Building Neural Networks in TensorFlow

Detailed Walkthrough of the TensorFlow 2.0 Beginner Notebook If you’re reading this you’ve probably had some exposure to neural networks and TensorFlow, but you might feel somewhat daunted by the various terms associated with deep learning that are often glossed over or left unexplained in many introductions to the technology. This article will shine a light … Read more

Beginner’s Recommendation Systems with Python

Building our own recommendation systems with the TMDB 5000 movies dataset Objectives of this Tutorial Here are some objectives for you: Learn what recommendation systems are, how they work, and some of their different flavors Implement a few recommendation systems using Python and the TMDB 5000 movies dataset What are Recommendation Systems? A recommendation system (also commonly … Read more

Hyper-Parameter Tuning and Model Selection, Like a Movie Star

Coding, analyzing, selecting, and tuning like you really know what you’re doing. Photo by Markus Spiske on Unsplash “Hyper-parameter tuning for random forest classifier optimization” is one of those phrases which would sound just as at ease in a movie scene where hackers are aggressively typing to “gain access to the mainframe” as it does in a … Read more

What’s in the Black Box?

Individual Conditional Expectation plots But what if we are unsure of the causal direction of our features? One tool that can help is an Individual Conditional Expectation (ICE) plot. Instead of plotting the average prediction based on the value of a feature, it plots a line for each observation across possible values of the feature. … Read more

A simple Monte-Carlo simulation to solve a Putnam Competition math problem

An overview of the Monte-Carlo method in Python Photo by Jonathan Petersson on Unsplash Introduction There are two things common to high-stakes gambling and high-speed racing — a huge degree of uncertainty and the city of Monte-Carlo. This connection between the two led to the use of the term “Monte-Carlo simulation” for computational methods that predict the probabilities of … Read more

How do you check the quality of your regression model in Python?

Linear regression is rooted strongly in the field of statistical learning and therefore the model must be checked for the ‘goodness of fit’. This article shows you the essential steps of this task in a Python ecosystem. Why it is important (and why you might be missing it) For all the talk and hair-splitting on the … Read more

Getting set up in PostgreSQL

Using PSQL and Python Pandas together In the I SPY children’s series, the author asks us to look at a global image for the smaller images we want to find. It is not easy (by design) which is the opposite of what we hope for with our SQL queries. One of the big challenges I faced after … Read more

Summarizing Economic Bulletin Documents with tf-idf

A key strength of NLP (natural language processing) is being able to process large amounts of texts and then summarise them to extract meaningful insights. In this example, a selection of economic bulletins in PDF format from 2018 to 2019 are analysed in order to gauge economic sentiment. The bulletins in question are sourced from … Read more

Anomaly detection in time series with Prophet library

First of all, let’s define what an anomaly in a time series is. The anomaly detection problem for time series can be formulated as finding outlier data points relative to some standard or usual signal. While there are plenty of anomaly types, we’ll focus only on the most important ones from a business perspective, such as unexpected … Read more

Optimization with SciPy and application ideas to machine learning

Optimization is often the final frontier that needs to be conquered to deliver real value for a large variety of business and technological processes. We show how to perform optimization with the most popular scientific analysis package in Python, SciPy, and discuss unique applications in the machine learning space. Introduction You may remember a simple calculus problem … Read more

An Overview of Python’s Datatable package

Data Manipulation. Data tables, like dataframes, are columnar data structures. In datatable, the primary vehicle for all these operations is the square-bracket notation inspired by traditional matrix indexing but with more functionalities. datatable’s square-bracket notation The same DT[i, j] notation is used in mathematics when indexing matrices, in C/C++, in R, in pandas, in numpy, … Read more

What is Wavelet and How We Use It for Data Science

source: https://ak6.picdn.net/shutterstock/videos/28682146/thumb/1.jpg Hello, this is my second post on the signal processing topic. For now, I’m interested in learning more about signal processing to understand a certain paper. And to be honest, for me this wavelet thing is harder to understand than the Fourier Transform. After I felt I understood this topic reasonably well, I realized something. … Read more

Why you should Double-DIP for Natural Image Decomposition

“Double-DIP”: Unsupervised Image Decomposition via Coupled Deep-Image-Priors The key aspect of Double-DIP is inherent in the fact that the distribution of small patches within each decomposed layer is “simpler” (more uniform) than in the original mixed image. Let’s simplify it with an example. Observe the illustrative example in Figure 3a. Two different textures, X … Read more

K-Means Clustering with scikit-learn

Fundamentals of K-Means Clustering As we will see, the k-means algorithm is extremely easy to implement and is also computationally very efficient compared to other clustering algorithms, which might explain its popularity. The k-means algorithm belongs to the category of prototype-based clustering. Prototype-based clustering means that each cluster is represented by a prototype, which can … Read more

How to use ggplot2 in Python

Introduction Thanks to its strict implementation of the grammar of graphics, ggplot2 provides an extremely intuitive and consistent way of plotting your data. Not only does ggplot2’s approach to plotting ensure that each plot comprises certain basic elements but it also simplifies the readability of your code to a great extent. However, if you are … Read more

Machine Learning for Radiology — Where to Begin

Anaconda Anaconda is an open-source platform that is perhaps the easiest way to get started with Python machine learning on Linux, Mac OS X and Windows. It helps you manage programming environments, and includes common Python packages used in data science. You can download the distribution for your platform at https://www.anaconda.com/distribution/ . Once you install … Read more

Machine Learning Model for Recommending the Crew Size for Cruise Ship Buyers

In this tutorial, we build a regression model using the cruise_ship_info.csv dataset for recommending the crew size for potential cruise ship buyers. This tutorial will highlight important data science and machine learning concepts such as: data preprocessing and variable selection; basic regression model building; hyper-parameter tuning; model evaluation; and techniques for dimensionality reduction. The GitHub … Read more

Know Thyself: Using Data Science to Explore Your Own Genome

DNA analysis with pandas and Selenium “Nosce te ipsum”, (“know thyself”), a well-known ancient maxim, frequently associated with anatomical knowledge. Image from the University of Cambridge. 23andme once offered me a free DNA and ancestry test kit if I participated in one of their clinical studies. In exchange for a cheek swab and baring my guts … Read more

8 Reasons Why Python is Good for Artificial Intelligence and Machine Learning

This article about why Python is good for ML and AI was originally posted on the Django Stars blog. Artificial Intelligence (AI) and Machine Learning (ML) are the new black of the IT industry. While discussions over the safety of its development keep escalating, developers expand the abilities and capacity of artificial intellect. Today Artificial Intelligence went … Read more

A Basic Python Tweet Class

Simple strategies for processing tweet data Photo by Ray Hennessy on Unsplash Motivations Twitter is an amazing source of data with all kinds of opportunities for analysis. NLTK, spaCy, and other Python NLP tools have many powerful, applicable features, and pandas makes it easy to wrangle tabular data. Still, there are some challenges. Tweets, while short, often … Read more

Get Started With TensorFlow 2.0 and Linear Regression

TensorFlow 2.0 has been a major breakthrough in the TensorFlow family. It’s completely new and refurbished and also less creepy! We’ll create a simple Linear Regression model in TensorFlow 2.0 to explore some new changes. So, open up your code editors and let’s get started! Also, open up this notebook for an interactive learning experience. … Read more

A Deep Dive Into Imbalanced Data: Over-Sampling

Learn how to use imbalanced-learn to improve your performance https://unsplash.com/photos/nvDJfbFv0pI When implementing classification algorithms, the structure of your data is of great significance. Specifically, the balance between the number of observations for each potential output heavily influences your prediction’s performance (I intentionally avoided using the word “accuracy” for reasons I will later elaborate on in … Read more