Creating Reproducible Data Science Projects

A Nightmare Scenario Imagine you completed a one-off analysis a few months ago, creating a fairly complex data pipeline, machine learning model and visualisations. Fast forward to today and you have Emily, a senior executive at your company, asking you to reuse that work to help solve a similar, time-critical business problem. She looks stressed. … Read more

Python Basics — Classes and Objects

Inheritance refers to defining a new class with little or no modification to an existing class. A sub-class is derived from a base class, inheriting its behaviour and adding behaviour specific to the sub-class. Syntax: # Base class class BaseClass: Body of base class # Derived class class DerivedClass(BaseClass): Body of derived class Why Inheritance? Inheritance allows a derived class to inherit … Read more
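As a minimal sketch of the base-class/derived-class syntax in this excerpt: the class names `Animal` and `Dog` and their methods are hypothetical, chosen only for illustration and not taken from the original post. The derived class reuses one method unchanged and overrides another.

```python
class Animal:
    """Base class: behaviour shared by every animal."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is an animal"

    def speak(self):
        return "..."


class Dog(Animal):
    """Derived class: inherits describe() and overrides speak()."""

    def speak(self):
        return "Woof"


dog = Dog("Rex")
inherited = dog.describe()   # defined only on Animal, reused by Dog
overridden = dog.speak()     # Dog's own version replaces Animal's
```

Because `Dog` does not define `__init__` or `describe`, both are looked up on `Animal`, which is the "little or no modification" case the excerpt describes.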

An Introduction to Recurrent Neural Networks for Beginners

A simple walkthrough of what RNNs are, how they work, and how to build one from scratch in Python. Recurrent Neural Networks (RNNs) are a kind of neural network that specialize in processing sequences. They’re often used in Natural Language Processing (NLP) tasks because of their effectiveness in handling text. In this post, we’ll explore … Read more

Classification of unbalanced datasets

How to properly do a classification analysis using sklearn when your dataset is unbalanced and improve its results. Photo by Brett Jordan on Unsplash Let’s imagine you have a dataset with a dozen features and need to classify each observation. It can be either a two-class problem (your output is either 1 or 0; true … Read more

How to use airflow-style DAGs for highly effective data science workflows

Airflow and Luigi are great for data engineering but not optimized for data science. d6tflow brings airflow-style DAGs to data science. Data science workflows typically look like this. This workflow is similar to data engineering workflows. It involves chaining together parameterized tasks which pass multiple inputs and outputs between each other. See 4 Reasons Why … Read more

A better EDA with Pandas-profiling

The conceptual approach: To ensure that our datasets are useful, a good practice is Exploratory Data Analysis (EDA). EDA is a way to familiarize yourself with the dataset. This groundwork helps ensure that you are working with interesting, coherent and clean data. This step is very visual and is based on summary … Read more

Tweet analytics using NLP

Introduction: With the recent explosion of Big Data, there is a large demand for organisations and data scientists to perform information extraction using non-traditional sources of data. Research has shown that nearly 80% of data exists as unstructured text data, hence text analytics is fundamental in order to analyse the wealth of information … Read more

Building a Bayesian Logistic Regression with Python and PyMC3

How likely am I to subscribe to a term deposit? Posterior probability, credible interval, odds ratio, WAIC. In this post, we will explore using Bayesian Logistic Regression to predict whether or not a customer will subscribe to a term deposit after the marketing campaign the bank performed. We want to be able to accomplish: How … Read more

Accessing Google Calendar Events Data using Python

Here comes the interesting part: I’m trying to create a method called create_event to create Google Calendar events: def create_event(start_time_str, summary, duration=1, attendees=None, description=None, location=None): Now I want you to go here and read the Calendar Event insertion reference; simply scroll down to the code section, which includes a Python code sample, and that’ll do. The second thing is, … Read more
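The excerpt shows only the signature of `create_event`, so the following is a hedged sketch of how the event body for such a call might be assembled, not the author's implementation. The helper name `build_event_body`, the `timezone` parameter, the ISO 8601 input format, and the reading of `duration` as hours are all assumptions; the dictionary keys follow the Calendar API event resource (`summary`, `start`, `end`, `attendees`, `description`, `location`). Authentication and the actual insert call are omitted.

```python
from datetime import datetime, timedelta


def build_event_body(start_time_str, summary, duration=1,
                     attendees=None, description=None, location=None,
                     timezone="UTC"):
    """Build the dictionary that would be sent as the event body.

    Assumptions (not from the original post): start_time_str is an
    ISO 8601 string, duration is in hours, timezone defaults to UTC.
    """
    start = datetime.fromisoformat(start_time_str)
    end = start + timedelta(hours=duration)
    event = {
        "summary": summary,
        "start": {"dateTime": start.isoformat(), "timeZone": timezone},
        "end": {"dateTime": end.isoformat(), "timeZone": timezone},
    }
    # Optional fields are only included when a value was supplied.
    if description is not None:
        event["description"] = description
    if location is not None:
        event["location"] = location
    if attendees:
        event["attendees"] = [{"email": email} for email in attendees]
    return event


body = build_event_body("2024-05-01T09:00:00", "Project sync",
                        duration=2, attendees=["a@example.com"])
```

Keeping the body construction separate from the API call makes it easy to check the payload without network access or credentials.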

Torch vs Keras for CNN Image Classification: Thoughts on the Rock Paper Scissor dataset

Last week I wrote an article on using a CNN to classify images of Rock, Paper or Scissor hand gestures, built on the PyTorch framework using a package called ‘torchvision’. See: Rock Paper Scissor Image Classifier using Torch Vision and a CNN. I’ve been exploring the use of PyTorch’s frameworks over the … Read more

Setting up Python platform for Machine Learning projects

Different projects that you work on will require different resources and packages with different version requirements. So, it is always recommended that you use a separate Python virtual environment for each project. This also makes sure that you don’t accidentally overwrite any of the existing working versions of certain packages with other versions … Read more
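One way to get a separate environment per project is the standard-library `venv` module; the excerpt does not say which tool the original post uses, so treat this as an illustrative sketch. The directory names are hypothetical, and `with_pip=False` is chosen only to keep the example fast and offline.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical location: one environment directory per project.
project_dir = Path(tempfile.mkdtemp(prefix="ml_project_"))
env_dir = project_dir / ".venv"

# with_pip=False keeps the sketch fast and offline; pass with_pip=True
# in real use so that packages can be installed into the environment.
venv.create(env_dir, with_pip=False)

# Every environment records its configuration in pyvenv.cfg.
config_file = env_dir / "pyvenv.cfg"
```

Packages installed after activating this environment stay inside `env_dir`, so versions pinned for one project cannot overwrite those of another.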

How to Quickly Compare Data Sets

How to get a quick summary of any differences between two data sets. Photo by Joshua Sortino on Unsplash. Every now and again, you will need to compare two data sets, either to prove that there are no differences or to highlight the exact differences between them. Depending on the … Read more

Stacking Classifiers for Higher Predictive Performance

Using the Wisdom of Multiple Classifiers to Boost Performance Purpose: The purpose of this article is to provide the reader with the necessary tools to implement the ensemble learning technique known as stacking. Materials and methods: Using Scikit-learn, we generate a Madelon-like data set for a classification task. Then a Support Vector classifier (SVC), Nu-Support … Read more

Basics of SQL in Python for Data Scientists

The ultimate beginner’s guide for using SQL in a Python environment. This article provides an overview of the basic SQL statements for data scientists, and explains how a SQL engine can be instantiated in Python and used for querying data from a database. As a data scientist using Python, you often need to get your data … Read more
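As a rough illustration of running basic SQL statements from Python, the sketch below uses the standard-library `sqlite3` module with an in-memory database. The original post may use a different engine or library; the `customers` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE and INSERT: set up a small hypothetical table.
cur.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York"), ("Alan", "London")],
)
conn.commit()

# SELECT with a parameterised WHERE clause and ORDER BY.
cur.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("London",)
)
london_customers = [row[0] for row in cur.fetchall()]

conn.close()
```

Passing values through `?` placeholders rather than string formatting keeps the query safe from SQL injection.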

Forecasting Electricity Price Time Series Data in Python using a VAR Model

A Crash Course in Time Series Decomposition and Forecasting with a Vector Autoregression (VAR) Model I’ve dealt with projects that involved time series analysis and forecasting on-and-off for the past several years, but I’ve always found the topic somewhat inaccessible for beginners due to the lack of comprehensive Python tutorials available. So, in an effort … Read more

My Capstone Project: Real Estate Prices & Venues Data Analysis of London

This article was written as part of the final capstone project for the IBM Data Science Professional Certification on Coursera. In this article I will share the difficulties I faced and also some concepts that I implemented. This article will contain the following steps that are necessary for any Data Science project: Problem statement, Data Collection, Data … Read more

Explorations in Named Entity Recognition, and was Eleanor Roosevelt right?

Using the spaCy Natural Language Processing lib to gain insight from news articles Image credit: unsplash Eleanor Roosevelt is alleged to have said: Great minds discuss ideas; average minds discuss events; small minds discuss people. And although this might be a misattribution, the statement as such seems to resonate with a lot of people’s intuition, … Read more

Scrape and Summarize News Articles in 5 Lines of Python Code

Install the package: $ pip install newspaper3k Now, let’s ask newspaper3k to scrape the article, extract information and summarize it for us. >>> from newspaper import Article >>> article = Article('https://www.npr.org/2019/07/10/740387601/university-of-texas-austin-promises-free-tuition-for-low-income-students-in-2020') >>> article.download() >>> article.parse() >>> article.nlp() That’s all, folks: 5 lines of code, including the package import. If you followed all the previous steps and did not get an error, … Read more

How to use the Split-Apply-Combine strategy in Pandas groupby

Master the Split-Apply-Combine pattern in Python with this visual guide to Pandas groupby-apply. TL;DR Pandas groupby-apply is an invaluable tool in a Python data scientist’s toolkit. You can go pretty far with it without fully understanding all of its internal intricacies. However, sometimes that can manifest itself in unexpected behavior and errors. Ever had one … Read more
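To make the three stages of the pattern concrete, here is a minimal sketch in plain Python rather than Pandas; it mirrors the idea behind groupby-apply but is not how Pandas implements it. The `rows` data is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group key, value).
rows = [("a", 1), ("b", 10), ("a", 3), ("b", 20), ("a", 5)]

# Split: bucket the values by key.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# Apply: run a function on each group independently.
applied = {key: mean(values) for key, values in groups.items()}

# Combine: collect the per-group results into one sorted structure.
combined = sorted(applied.items())
```

Each stage is independent of the others, which is why the applied function can be swapped without touching the split or combine steps.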

Unhappy Endings

Let’s take our dataset for a test drive by first looking at something very simple. In our database of ‘finished’ TV shows, which ones have the highest overall series rating? Even if its ending did sour things, Game of Thrones still enjoys a healthy series rating of 9.4, surrounded by the kind of shows you’d … Read more

Modeling customer churn for an e-commerce business with Python

It’s more cost effective to retain existing customers than to acquire new ones, which is why it’s important to track customers at high risk of turnover (churn) and target them with retention strategies. In this project, I’ll build a customer churn model based off of data from Olist, a Brazilian e-commerce site. I’ll use that … Read more

6 Reasons I Love Bokeh for Data Exploration with Python

Bokeh has been around for years, but I only recently really discovered it, and it didn’t take long to become my favorite Python visualization library. Here are six reasons why. Bokeh is a Browser Based Visualization Library. Quickly, before jumping into it, let’s do the obligatory introduction paragraph where I introduce you to the topic. Remember … Read more

Surprising Sorting Tips for Data Scientists

Python, Numpy, Pandas, PyTorch, TensorFlow & SQL Sorting data is a basic task for data scientists and data engineers. Python users have a number of libraries to choose from with built-in, optimized sorting options. Some even work in parallel on GPUs. Surprisingly some sort methods don’t use the stated algorithm types and others don’t perform … Read more
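As a small illustration for the built-in Python case only: `sorted` is stable, so records with equal keys keep their original relative order, including when `reverse=True` is used. The `records` list is hypothetical, and the sketch says nothing about the other libraries listed.

```python
# Hypothetical (name, score) records with tied scores.
records = [("alice", 3), ("bob", 1), ("carol", 3), ("dave", 1)]

# Ascending by score: ties keep their input order (bob before dave).
by_score = sorted(records, key=lambda r: r[1])

# Descending by score: stability is preserved, so alice still
# precedes carol even though the overall order is reversed.
by_score_desc = sorted(records, key=lambda r: r[1], reverse=True)
```

Stability is what makes multi-key sorting by successive passes work: sort by the secondary key first, then by the primary key.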

Using Publicly Available FracFocus Data and Python’s Matplotlib Function to Visualize Oil and Gas…

I recently wrote a script that automated data pulls from the publicly available FracFocus database, a government-operated data source which provides a comprehensive listing of hydraulic fracturing chemicals pumped in unconventional oil and gas completions jobs in the United States. This database is a great resource — not only for the public, but also for … Read more

Supercharging Jupyter Notebooks

Jupyter Notebooks are currently the hottest programming environment for Pythonistas the world over, especially those who are into Machine Learning and Data Science. I discovered Jupyter Notebooks when I first started to get serious about Machine Learning a few months ago. Initially, I was simply amazed and loved how everything ran inside my browser. However, I … Read more

Easily Scrape and Summarize News Articles Using Python

Webscraping: Now let’s scrape! First, we’ll turn the page content into a BeautifulSoup object, which will allow us to parse the HTML tags. soup = BeautifulSoup(page)  # Turn page into BeautifulSoup object to access HTML tags Then, we’ll need to figure out which HTML tags contain the headline and the main text of the article. For … Read more

Maximizing group happiness in White Elephants using the Hungarian optimal assignment algorithm

Let’s consider a simple scenario in which four players (Alex, Brad, Chloe, and Daisy) are participating in a White Elephant. After opening the presents in order, everyone feels like the distribution of presents is suboptimal. They feel that if they only knew how much each person liked each present, they could redistribute the presents to … Read more
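With only four players, the optimal redistribution can be found by checking all 24 possible assignments. The sketch below does exactly that by brute force; it is not the Hungarian algorithm, which reaches the same optimum far more efficiently for larger groups. The happiness scores are hypothetical.

```python
from itertools import permutations

players = ["Alex", "Brad", "Chloe", "Daisy"]
# Hypothetical happiness scores: happiness[i][j] is how much
# players[i] likes present j (higher is better).
happiness = [
    [9, 2, 7, 1],
    [8, 4, 3, 7],
    [5, 8, 1, 8],
    [7, 6, 9, 4],
]

best_total = -1
best_assignment = None
# Try every one-to-one assignment of presents to players.
for presents in permutations(range(len(players))):
    total = sum(happiness[i][j] for i, j in enumerate(presents))
    if total > best_total:
        best_total = total
        best_assignment = presents

# Map each player to the index of the present they receive.
result = dict(zip(players, best_assignment))
```

Alex and Brad both rate present 0 highest, so they cannot both get their favourite; the search resolves the conflict in favour of the larger group total.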

Advanced Histogram Using Python

A histogram to delight business users and data scientists. Python has excellent support for generating histograms. But in Data Science it is very useful to display bar/bin counts, bin ranges, colour the bars to separate percentiles and generate custom legends to provide more meaningful insights to business users. There is no built-in direct … Read more

Visualizing Support Vector Machine Decision Boundary

Pipeline, GridSearchCV and Contour Plot. Decision Boundary (Picture: Author’s Own Work, Saitama, Japan). In a previous post I described principal component analysis (PCA) in detail, and in another the mathematics behind the support vector machine (SVM) algorithm. Here, I will combine SVM, PCA, and grid-search cross-validation to create a pipeline to find the best parameters … Read more

Machine Learning Pipelines: Feature Engineering Numbers

A really important part of any machine learning model is the data, especially the features used. In this article, we will go over where feature engineering falls in the machine learning pipeline, and how to do some feature engineering on numbers using binning, transformations, and normalization. The real benefit of feature engineering is being able … Read more
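The three techniques named in this excerpt (binning, transformations, normalization) can be sketched with the standard library alone. The income values, bin thresholds, and the choice of a base-10 log are hypothetical, picked only to show the mechanics.

```python
import math

# Hypothetical raw feature: annual incomes with a long right tail.
incomes = [20_000, 35_000, 50_000, 120_000, 1_000_000]


def income_bin(value):
    """Binning: map a number to a coarse bucket via fixed thresholds."""
    if value < 30_000:
        return "low"
    if value < 100_000:
        return "medium"
    return "high"


binned = [income_bin(v) for v in incomes]

# Transformation: log10 compresses the long tail.
logged = [math.log10(v) for v in incomes]

# Normalization: min-max scaling of the logged values to [0, 1].
lo, hi = min(logged), max(logged)
normalized = [(v - lo) / (hi - lo) for v in logged]
```

Applying the log before scaling stops the single very large income from squeezing every other value towards zero.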

Basics of graph plotting

Most of us data scientists go into the industry because we love data (whatever that means? No, I don’t know either!). The ability to create easily readable plots is often an afterthought. Most job descriptions will mention that being able to visualise data is important but I have never had a sensible conversation with anyone … Read more

Web Scraping news articles in Python

Building a web scraping application in Python made simple Source This article is the second of a series in which I will cover the whole process of developing a machine learning project. If you have not read the first one, I strongly encourage you to do it here. The project involves the creation of a … Read more

“What Should I Watch Next?” — Exploring Movie Recommender Systems, part 1: Popularity

If I were starting a theoretical website to recommend movies, I’d have to have somewhere to start while I gathered internal user data, whether explicit (votes/ratings) or implicit (links clicked on, minutes watched, purchases made, etc). A place to start is a popularity filter. This returns ‘top hits’. On Reddit, it’s their front page. The … Read more

Evaluating Machine Learning Classification Problems in Python: 6+1 Metrics That Matter

Usually in both regression and classification models, the dataset is split into train and test datasets. The model is then trained and fitted on the “train dataset” and used to predict based on a “test dataset” to evaluate the performance. The reason for this train/test split is to mimic future datasets and also to avoid … Read more
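The train/test split described here can be sketched without any library. The helper name `simple_train_test_split`, the 25% test fraction, and the seed are assumptions for illustration; in practice a library routine would normally be used instead.

```python
import random


def simple_train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)      # seeded for a repeatable split
    shuffled = list(rows)          # leave the caller's list untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 20 observations.
data = list(range(20))
train, test = simple_train_test_split(data, test_fraction=0.25, seed=42)
```

The model would be fitted on `train` only, and `test` is held back so that the evaluation uses observations the model has never seen.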

Scraping and Sentiment Analysis of 1.5 Million Audible Reviews

While we’ve actually scraped three ratings for each review, ‘overall’, ‘story’ and ‘performance’, we’re going to focus on the ‘overall’ column as our labels. The two other columns we’ll save for a future analysis. In order to get our data and targets into the form we will need to train a model, we need to … Read more

Nailing The Basics of Pairs Trading with Python

Generating Fake Securities Let’s actually implement the concept of pairs trading with some Python code! We’ll first start by getting some intuition on how the strategy actually works with some fake time series data. Importing Libraries Let’s get the necessary Python libraries. For the sake of following along with this guide, it’s best to set … Read more

Linear Regression using Flavor of Python

Univariate linear regression focuses on determining the relationship between one independent variable and one dependent variable. Regression comes in handy mainly in situations where the relationship between two features is not obvious to the naked eye. Suppose we wish to analyse the relationship between a vehicle’s weight and fuel economy or the price of a slice of … Read more
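A univariate fit can be computed directly from the closed-form least-squares formulas, slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below assumes made-up vehicle weights and fuel-economy figures that lie exactly on a line, so the recovered coefficients are easy to check; the function name `fit_line` is hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: vehicle weight (tonnes) vs fuel economy
# (km per litre), constructed to satisfy y = 24 - 6x exactly.
weights = [1.0, 1.5, 2.0, 2.5]
economy = [18.0, 15.0, 12.0, 9.0]
slope, intercept = fit_line(weights, economy)
```

The negative slope says that, in this toy data, each extra tonne of weight costs six kilometres per litre.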

Detecting communities in a language co-occurrence network

Implementing community detection algorithms in Igraph with Python Photo by Perry Grone on Unsplash In this post, we are going to undertake community detection in the python package Igraph, to attempt to detect communities within a language co-occurrence network. This will be implemented using two popular community detection algorithms: Walktrap, and Label Propagation Background Global … Read more

Hacker’s Guide to Quantitative Trading (Quantopian Python) Part 2

Algorithmic Trading using Python. Quantopian provides the required API functions, data, a helpful community, and a batteries-included web-based dashboard for experimenting with algorithmic trading, creating your own trading strategies, and launching your trading model in the live market. Here I will only talk about code and how it should be written to create your own trading strategy. There are basically … Read more