Automatically Storing Data from Analyzed Data Sets

How to Store Data Analysis Results to Facilitate Later Regression Analysis Figure 1: Example Folder Hierarchy This is the fifth article in a series teaching you how to write programs that automatically analyze scientific data. The first presented the concept and motivation, then laid out the high-level steps. The second taught you how … Read more

Gaussian Mixture Modelling (GMM)

GMM estimation Figure 3 below illustrates what GMM is doing. It clearly shows three clusters modelled by three different Gaussian distributions. I have used a toy data set here just to illustrate this clearly as it is less clear with the Enron data set. As you can see, compared to Figure 2 modelled using spherical … Read more
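To make the idea of clusters modelled by separate Gaussians concrete, here is a minimal sketch in one dimension using only the standard library. The three weights, means and standard deviations are invented for illustration and do not come from the article's toy data set or the Enron data; a real GMM would estimate them from the data.

```python
from statistics import NormalDist

# Hypothetical 1-D mixture: (weight, Gaussian) pairs. All values are illustrative only.
COMPONENTS = [
    (0.5, NormalDist(mu=0.0, sigma=1.0)),
    (0.3, NormalDist(mu=5.0, sigma=1.5)),
    (0.2, NormalDist(mu=10.0, sigma=0.5)),
]


def responsibilities(x):
    """Return the posterior probability of each component given the point x."""
    weighted = [weight * dist.pdf(x) for weight, dist in COMPONENTS]
    total = sum(weighted)
    return [value / total for value in weighted]


def assign(x):
    """Return the index of the most probable component for x."""
    probs = responsibilities(x)
    return probs.index(max(probs))


if __name__ == "__main__":
    for point in (0.1, 5.2, 10.0):
        print(point, assign(point), responsibilities(point))
```

Unlike a hard clustering, each point gets a probability for every component, so points between clusters are assigned softly.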

SQL and Pandas

Where and how should these tools be used? As I mentioned in my previous post, my technical experience has almost exclusively been in SQL. While SQL is awesome and can do some really cool things, it has its limitations — these limitations are in large part why I decided to acquire Data Science superpowers at Lambda School. In … Read more

Artificial Neural Networks Optimization using Genetic Algorithm with Python

Main Project File Implementation The third file is the main file because it connects all functions. It reads the features and the class labels files, filters features based on the standard deviation, creates the ANN architecture, generates the initial solutions, loops through a number of generations by calculating the fitness values for all solutions, selecting … Read more

Checking Automated Data Analysis for Errors

How to Check for Errors, both Manually and Automatically, when Automating Data Analysis This is the fourth article in a series teaching you how to write programs that automatically analyze scientific data. The first presented the concept and motivation, then laid out the high-level steps. The second taught you how to structure data sets … Read more

10 Steps to Set Up Your Python Project for Success

In this guide we’ll walk through adding tests and integrations to speed development and improve code quality and consistency. If you don’t have a basic working Python package, check out my guide to building one and then meet right back here. Cool. Here’s our ten-step plan for this article: Install Black Create .pycache Install pytest Create Tests … Read more

How to Perform Explainable Machine Learning Classification — Without Any Trees

Credit: Pixabay Strict and clear rules… appear to us as something in the background — hidden in the medium of the understanding. – Ludwig Wittgenstein Decision trees are a popular technique for classification. They’re intuitive, easy to interpret, and often perform well out-of-the-box. Tree models are paths of rules that humans can understand. In certain contexts, being able … Read more

Master Python through building real-world applications (Part 9)

Endnotes As we all know, we learn from visualizations far better than we learn from raw data. Building visualizations from data is really rewarding, and with the help of external libraries like Bokeh, Python’s visualization game is stronger than ever. In this post, you learned about stock market data, how to download it, what are candlestick … Read more

A “full-stack” data science project

2. Data exploration The notebook exploring the data is available on GitHub here. Regardless of the data analysis you’re performing, or how well you think you know your data, it is always a good idea to take a look at it and be aware of the various characteristics before starting to work on a specific … Read more

Machine Learning for Beginners: An Introduction to Neural Networks

A simple explanation of how they work and how to implement one from scratch in Python. Here’s something that might surprise you: neural networks aren’t that complicated! The term “neural network” gets used as a buzzword a lot, but in reality they’re often much simpler than people imagine. This post is intended for complete beginners and … Read more

Replacing Excel with Python

Importing Excel Files into a Pandas DataFrame The initial step is to import Excel files into a DataFrame so we can perform all our tasks on it. I will be demonstrating the read_excel method of Pandas, which supports the xls and xlsx file extensions. Using read_csv is much the same as using read_excel; we won’t go in depth but I will share … Read more

Data Science With No Data

Building an AI/ML model with no access to a dataset In this article, we will demonstrate how to generate a dataset to build a machine learning model. According to this, Medicare fraud and abuse cost taxpayers $60 billion per year. AI/ML could significantly help identify and prevent fraud and abuse, but since privacy is of utmost … Read more

Use Google and Tweepy to Build a Dataset of Twitter Users

With ever-increasing value being placed on the effectiveness of social media in marketing, mining data from social platforms is a critical piece of the ad-tech puzzle. Free developer API access to social data is becoming more and more restrictive, and so easily accessing the right data can be a challenge. Twitter is an exception to … Read more

Building a Flask API to Automatically Extract Named Entities Using SpaCy

How to use the Named Entity Recognition module in spaCy to identify people, organizations, or locations in text, then deploy a Python API with Flask The overwhelming amount of unstructured text data available today provides a rich source of information if the data can be structured. Named-entity Recognition (NER) (also known as Named-entity Extraction) is one of … Read more

Extracting faces using OpenCV Face Detection Neural Network

Recently, I came across the website which has some of the greatest tutorials on OpenCV. While reading through its numerous articles, I found that OpenCV has its own Face Detection Neural Network with really high accuracy. So I decided to work on a project using this Neural Network from OpenCV and extract faces from … Read more

Real-time face liveness detection with Python, Keras and OpenCV

Most facial recognition algorithms you find on the internet and research papers suffer from photo attacks. These methods work really well at detecting and recognizing faces on images, videos and video streams from webcam. However they can’t distinguish between real life faces and faces on a photo. This inability to recognize faces is due to … Read more

CASM = Fractals

Using a simple equation, we can see exactly how the iteration occurs. We first substitute a value for x. Solve the equation for y. Then take the value of y and make it our new x. The best way to illustrate this is to actually use real values. Iteration Our first value was 1 for … Read more
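The feedback loop described here (compute y from x, then reuse y as the next x) can be sketched in a few lines. The equation y = x² − 1 below is an assumption chosen purely for illustration, since the excerpt does not show the article's actual equation; the helper name `iterate` is likewise hypothetical.

```python
def iterate(f, x0, steps):
    """Feed each output of f back in as the next input; return every value visited."""
    values = [x0]
    for _ in range(steps):
        values.append(f(values[-1]))
    return values


def example_equation(x):
    """Hypothetical equation y = x**2 - 1, used only to demonstrate the loop."""
    return x * x - 1


# Start from x = 1 and apply the equation five times.
orbit = iterate(example_equation, 1, 5)
print(orbit)
```

With this particular equation the sequence starting at 1 quickly settles into a repeating cycle, which is the kind of behaviour that iteration-based fractals are built on.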

Random thoughts on my first ML deployment

5 things I didn’t know six months ago and that’s better not to forget in the months to come A little bit of context: I’m currently working for a fast growing yet still medium-sized company that after having built a robust and widely used product has decided to start leveraging the data generated during the years … Read more

Building Blocks: Text Pre-Processing

Morphological Normalization Morphology, in general, is the study of the way words are built up from smaller meaning-bearing units, morphemes. For example, dogs consists of two morphemes: dog and s. Two commonly used techniques for text normalization are: Stemming: The procedure aims to identify the stem of a word and use it in lieu of … Read more
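As a rough illustration of stemming, the sketch below strips a handful of suffixes to recover a stem such as dog from dogs. The rule set and the name `toy_stem` are hypothetical and deliberately tiny; this is not the Porter stemmer or any library's algorithm, only a demonstration of the idea.

```python
# Hypothetical, deliberately small suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "es", "s")
MIN_STEM_LENGTH = 3  # assumption: avoid reducing short words like "is" to a single letter


def toy_stem(word):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for token in ("dogs", "walking", "boxes", "is"):
        print(token, "->", toy_stem(token))
```

Crude rules like these over-strip and under-strip many words, which is why production stemmers use much larger, ordered rule sets.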

Finding Lane Lines — Simple Pipeline For Lane Detection.

Identifying the lanes of the road is a very common task that human drivers perform. It is important for keeping the vehicle within the constraints of the lane. It is also a critical task for an autonomous vehicle to perform, and a very simple lane detection pipeline is possible with simple Computer Vision techniques. This article will describe … Read more

Set Your Jupyter Notebook up Right with this Extension

Solution: The Setup Jupyter Notebook Extension Rather than just complaining about the problem (it’s easy to be a critic but a lot harder to do something positive) I decided to see what could be done with Jupyter Notebook extensions. The result is an extension that on opening a new notebook automatically: Creates a template to … Read more

Climate Heatmaps Made Easy

Investigating Paleoclimate Data with Pandas and Seaborn Some time ago Dr. Ed Hawkins, who happens to be the creator of the Climate Spirals, released to the world the Warming Stripes graph for Annual Global Temperature ranging from 1850–2017. The concept is simple but also very informative: each stripe represents the temperature for a single year and … Read more

The Python Dreamteam

As a Data Scientist, I code almost entirely in Python. I also get easily scared by configuring stuff. I don’t really know what a PATH is. I have no clue what lies within the /bin directory on my laptop. These are all things that you seemingly have to get familiar with to not have Python … Read more

Boosting: Is It Always The Best Option?

Gradient boosting has become quite a popular technique in the area of machine learning. Given its reputation for achieving potentially higher accuracy than other models, it has become particularly popular as a “go-to” model for Kaggle competitions. However, use of gradient boosting raises two questions: Does this technique really outperform others consistently irrespective of the … Read more

How to make your model awesome with Optuna

Example walk-through Jason and the Argonauts source Data I used the 20 newsgroups dataset from Scikit-Learn to prepare the experiment. You can find the data import below: Model It’s a Natural Language Processing problem, and the model’s pipeline contains a feature extraction step and a classifier. The code for the pipeline looks as follows: Optimization … Read more

Machine Learning for Particle Data When You are Not a Physicist

How an H2O deep learning model can be used to do supervised classification with Python This article introduces Deep Learning with H2O, the open source machine learning package, and shows how an H2O Deep Learning model can be used to solve a supervised classification problem, that is, use the ATLAS experiment to identify the Higgs … Read more

How to use Google Speech to Text API to transcribe long audio files?

Credit: Pixabay Speech recognition is a fun task. A lot of API resources are available on the market today, which makes it easier for users to opt for one or another. However, when it comes to audio files, especially call center data, the task becomes a little challenging. Let’s make an assumption that a call center conversation … Read more

Build Your First Open Source Python Project

A step-by-step guide to a working package Every software developer and data scientist should go through the exercise of making a package. You’ll learn so much along the way. Making an open source Python package may sound daunting, but you don’t need to be a grizzled veteran. You also don’t need an elaborate product idea. You … Read more

A beginner’s guide to Linear Regression in Python with Scikit-Learn

Simple Linear Regression Linear Regression While exploring the Aerial Bombing Operations of World War Two dataset and recalling that the D-Day landings were nearly postponed due to poor weather, I downloaded these weather reports from the period to compare with missions in the bombing operations dataset. You can download the dataset from here. The dataset … Read more

How to Practice Python with Google Colab?

Automatic setting-up, getting help effectively, collaborative programming, and version control. A one-stop solution to the pain points in Python beginners’ practice. Pain Points This semester, I started to teach the course “INFO 5731: Computational Methods for Information Systems” at University of North Texas (UNT), which includes the foundation of Python, Natural Language Processing and Machine … Read more

Web Scraping Using BeautifulSoup

“Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching and modifying the parse tree. It commonly saves programmers hours or days of work.” You can use the pip package manager to install BeautifulSoup. $ pip install … Read more

Jupyter Lab: Evolution of the Jupyter Notebook

8. Extensions JupyterLab has been designed as an essentially extensible environment. Extensions are powerful tools that can really enhance a person’s productivity. JupyterLab extensions are npm packages (the standard package format in JavaScript development). There are many community-developed extensions being built on GitHub. You can search for the GitHub topic jupyterlab-extension to find … Read more

Lots of JSON

Getting Started We can use %%bash magic to print a sample of our file: %%bash head ../input/roam_prescription_based_prediction.jsonl {“cms_prescription_counts”: {“CEPHALEXIN”: 23, “AMOXICILLIN”: 52, “HYDROCODONE-ACETAMINOPHEN”: 28},”provider_variables”: {“settlement_type”: “non-urban”, “generic_rx_count”: 103, “specialty”: “General Practice”, “years_practicing”: 7, “gender”: “M”, “region”: “South”, “brand_name_rx_count”: 0}, “npi”: “1992715205”} From this we can see the JSON data looks like a Python dictionary. That’s … Read more
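The observation that each JSON line looks like a Python dictionary can be checked directly with the standard library's `json` module. The sketch below parses the sample record shown above, typed with plain ASCII quotes; the helper `parse_jsonl` is a hypothetical name, not something from the article.

```python
import json

# The sample record from the excerpt, written with plain ASCII quotes.
SAMPLE_LINE = (
    '{"cms_prescription_counts": {"CEPHALEXIN": 23, "AMOXICILLIN": 52, '
    '"HYDROCODONE-ACETAMINOPHEN": 28}, '
    '"provider_variables": {"settlement_type": "non-urban", "generic_rx_count": 103, '
    '"specialty": "General Practice", "years_practicing": 7, "gender": "M", '
    '"region": "South", "brand_name_rx_count": 0}, '
    '"npi": "1992715205"}'
)


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


records = parse_jsonl([SAMPLE_LINE])
record = records[0]

# Nested objects become nested dicts, so normal dictionary access works.
total_prescriptions = sum(record["cms_prescription_counts"].values())
print(type(record).__name__, total_prescriptions, record["provider_variables"]["specialty"])
```

The same function works on an open file handle, since iterating over a text file yields one line at a time.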

Web Scraping Using Selenium

Scrape Image Page Links The following code launches Chrome browser with the provided url using Selenium, scrolls to the bottom of the page (apparently magically), extracts the links for the image display pages and saves them in a csv file. Lines 5–10: import the necessary packages required for this code to work. The selenium webdriver will … Read more

Understanding Logistic Regression step by step

Logistic Regression is a popular statistical model used for binary classification, that is for predictions of the type this or that, yes or no, A or B, etc. Logistic regression can, however, be used for multiclass classification, but here we will focus on its simplest application. As an example, consider the task of predicting someone’s … Read more
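For the binary case, the prediction step can be sketched as a weighted sum passed through the sigmoid function and compared with a threshold. The weights, bias and threshold below are made-up values for illustration only; in practice they are learned from data, and the function names are assumptions rather than the article's code.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict(features, weights, bias, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, otherwise 0."""
    return int(predict_proba(features, weights, bias) >= threshold)


if __name__ == "__main__":
    # Hypothetical model with a single feature.
    weights, bias = [1.5], -1.0
    for value in (0.0, 2.0):
        print(value, predict_proba([value], weights, bias), predict([value], weights, bias))
```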

Intro to Statistics — Looking at Data (1)

There are many free learning courses and material about Statistics. Statistics can be effectively used to analyse, estimate, and sometimes predict real-world events. When used correctly, statistics will lead us to take better and safer decisions based on data observations. It is the basic pillar in Data Science and an extremely useful tool in many … Read more

Modeling Price with Regularized Linear Model & Xgboost

Developing statistical models for predicting individual house prices We would like to model the price of a house. We know that the price depends on the location of the house, its square footage, year built, year renovated, number of bedrooms, number of garages, etc. So those factors contribute to the pattern — premium location would typically … Read more

Jupytext 1.0 highlights

In version 1.0 the jupytext command was extended with new modes: --sync to synchronize the multiple representations of a notebook; --set-formats (and optionally, --sync), to set or change the pairing of a notebook or a text file; --pipe to pipe the text representation of a notebook into another program. Perhaps you would like to reformat … Read more