4 Rarely-Used Yet Very Useful Pandas Tricks

The groupby function is commonly used in exploratory data analysis. Combined with the agg function, it lets us apply different aggregation functions to different columns. NamedAgg allows us to name the aggregated columns inside the agg function. Let’s first create a dataframe.

    import numpy as np
    import pandas as pd
    cats = pd.Series(list('abc')*3).sample(n=9).reset_index(drop=True)
    df = …
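
As a minimal sketch of the groupby, agg and NamedAgg combination described in this excerpt: the dataframe, its column names (cat, price, qty) and the output names (avg_price, total_qty) are made up for illustration and are not the article's data.

```python
import pandas as pd

# Hypothetical data: one categorical column and two numeric columns.
df = pd.DataFrame({
    "cat": ["a", "b", "a", "b", "c", "c"],
    "price": [10, 20, 30, 40, 50, 60],
    "qty": [1, 2, 3, 4, 5, 6],
})

# Each keyword becomes an output column name; NamedAgg states which
# input column to aggregate and which function to apply to it.
summary = df.groupby("cat").agg(
    avg_price=pd.NamedAgg(column="price", aggfunc="mean"),
    total_qty=pd.NamedAgg(column="qty", aggfunc="sum"),
)
print(summary)
```

The result has one row per category and exactly the two named columns, so no renaming step is needed afterwards.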

Deriving Patterns of Fraud from the Enron Dataset

The Enron email and financial datasets are big, messy treasure troves of information, which become much more useful once you know your way around them a bit. Enron’s complete data may be downloaded from this link, and the refined pickle files may be downloaded from the following GitHub repository along with the complete code …

Disjoint Set and Tarjan’s Off-line Lowest Common Ancestor Algorithm

Let’s apply the algorithm to an example binary tree. [Figure: Binary Tree (image by author)] Iterating through the binary tree in post-order, we first start at node 7. Nothing much happens here: we make a set for node 7, with parent = 7, ancestor = 7, and size = 1, and we mark …
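
The "make a set" step in this excerpt can be sketched as follows. This is an illustrative disjoint-set structure under my own assumptions (dictionary storage, union by size, path compression); the class and method names are hypothetical and the article's implementation may differ.

```python
class DisjointSet:
    """Disjoint-set forest that also tracks an 'ancestor' per node."""

    def __init__(self):
        self.parent = {}
        self.ancestor = {}
        self.size = {}

    def make_set(self, node):
        # A new node starts as its own parent and ancestor, in a set of size 1.
        self.parent[node] = node
        self.ancestor[node] = node
        self.size[node] = 1

    def find(self, node):
        # Walk up to the root of the set.
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[node] != root:
            next_node = self.parent[node]
            self.parent[node] = root
            node = next_node
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        # Union by size: attach the smaller set under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return root_a


ds = DisjointSet()
ds.make_set(7)   # parent = 7, ancestor = 7, size = 1
ds.make_set(3)   # hypothetical parent node of 7 in the tree
ds.union(3, 7)
ds.ancestor[ds.find(3)] = 3  # after merging a child, the set's ancestor is the current node
```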

How to Set Up Automated Tasks in Linux Using Cron

Getting Started: A Step-by-Step Guide to Setting Up a Cron Job. Have you ever found yourself doing repetitive tasks on a regular basis? For example, deleting temporary files every week to conserve disk space, scraping data from a site every week to gather new information, or sending recurring …

Fourier Convolutions in PyTorch

Math and code for efficiently computing large convolutions with FFTs. Note: Complete methods for 1D, 2D, and 3D Fourier convolutions are provided in this GitHub repo. I also provide PyTorch modules for easily adding Fourier convolutions to a trainable model. Convolutions: Convolutions are ubiquitous in data analysis. For decades, …

Fine-Tuning Pre-trained Model VGG-16

After importing the necessary libraries and our train/test set, and preprocessing the data (described here), we dive into modeling:

1. First, import VGG16 and pass the necessary arguments:

    from keras.applications import VGG16
    vgg_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

2. Next, we freeze some layers. I decided to unfreeze the last block so that its weights get …

How to Code Linear Regression from Scratch

Think back to your first algebra class: do you remember the equation for a line? If you said “y = mx + b”, you’re absolutely right. I think it’s also helpful to start in two dimensions, because without using any matrices or vectors, we can already see that given inputs x and outputs y, we …
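
To make the two-dimensional case concrete, here is a small sketch that fits m and b in y = mx + b with the closed-form least-squares formulas, using plain Python and no matrices. The function name fit_line and the sample points are assumptions for illustration; the article itself may take a different route (for example gradient descent).

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = fit_line(xs, ys)
print(m, b)
```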

All the Pandas shift() you should know for data analysis

Pandas shift() shifts the index by the desired number of periods. The simplest call takes one argument, periods (it defaults to 1), which is the number of positions to shift along the desired axis. By default, values are shifted vertically along axis 0. NaN is filled in for the missing values introduced as …
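
A short sketch of the behaviour described in this excerpt, using a made-up two-column dataframe (the column names and values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# periods defaults to 1 and axis to 0: values move down one row,
# and the first row is filled with NaN.
down = df.shift(periods=1)

# A negative period shifts values up; the last row becomes NaN.
up = df.shift(periods=-1)

# fill_value replaces the NaN that the shift introduces.
filled = df.shift(periods=1, fill_value=0)

print(down, up, filled, sep="\n\n")
```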

Autonomous Driving Dataset Visualization with Python and VizViewer

Disclosure: The author is involved in VizViewer’s development. As part of a recently published paper and Kaggle competition, Lyft has made public a dataset for building autonomous driving path prediction algorithms. The dataset includes a semantic map, ego vehicle data, and dynamic observational data for moving objects in the vehicle’s vicinity. The challenge presented by …

Top 5 AI Terms to Know in 2020 for Data Scientists

Deep Learning: Deep learning has been around for what feels like a while, but in reality it is only in the past 5 years that interest in it exploded, because of the gains that DeepMind made. Pre-2015, yes, it was interesting, but there wasn’t that element of explosive interest. Don’t believe me? Just check Google Trends: …

Writing a multi-file-upload Python-web app with user authentication

When building a web server, we often wish to present an idea or topic. In the case of a static website, this can be done by including the relevant information within the source files. In more dynamic examples, an API (application programming interface) can be used to pre-process information before returning it to the user. In …

Complete Introduction to PySpark-Part 4

Performing data visualization using PySpark. Data visualization plays an important role in data analysis because as soon as human eyes see a chart or graph, they try to find the patterns in it. Data visualization means representing data visually, using different plots, graphs, and charts, to find the pattern, …

How to Read Data Files on S3 from Amazon SageMaker

Keeping your data science workflow in the cloud. Amazon SageMaker is a powerful, cloud-hosted Jupyter Notebook service offered by Amazon Web Services (AWS). It’s used to create, train, and deploy machine learning models, but it’s also great for doing exploratory data analysis and prototyping. While it may not be …

Top 5 Reasons Starting with Python Will Make You a Better Data Scientist

As the demand for people with a data science skillset has soared, companies have looked for ways to fill that demand. One way is for companies to go out and recruit people who encapsulate all things data science, which usually includes proficiency in a coding language, probably Python. This …

What are the 10 most popular standard libraries in python?

3. What are the 10 most popular standard libraries based on GitHub commits in Python repositories? So far we have collected a sample dataset of 5 famous Python repos on GitHub and built a class to collect library names from Python code. Now we need to apply this function to our sample data from GitHub and …

Identify Variables High On Memory Consumption

Optimizing Python code. Sometimes, when executing Python scripts, we encounter memory errors. These errors are primarily due to variables with high memory consumption. In this tutorial, we will focus on profiling Python code to optimize memory consumption. Memory profiling is a process by which we can dissect …

How to Highlight Cells in Matplotlib Tables

An annotated example of drawing attention to cell data. [Figure: How to Highlight Cells (image by author)] When presenting tabular data, we often want to bring attention to a particular cell. This focus improves the visual support we want from our tables and keeps the audience from feeling overwhelmed by the breadth of the data contained …

Types of Structured Data Every Data Science Enthusiast should Know

Data analysis has evolved beyond its originally expected extent because of the rapid development of technology, the generation of more and bigger data, and the aggressive use of quantitative analysis across a variety of disciplines. As a result of this infrastructure development, we could reach a stage where we had multiple …

Understand O.O.P. in Python with One Article

To begin understanding the intuition behind this programming technique, let’s look at an initial example. Imagine that you have to describe a car to someone who has never seen one before. How would you do it? You might want to start by saying that it’s a wheeled …

Predict House Prices with Machine Learning

We’re going to train four tried-and-true regression models:

- regularised linear regression (Ridge, Lasso & Elastic Net)
- random forests
- gradient-boosted trees

First, let’s split our analytical base table.

    y = df.tx_price
    X = df.drop('tx_price', axis=1)

We’ll then split into training and test sets.

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

We’ll set up a pipeline …

Data visualisation: 3 secret tips on Python to make interactive graphs and impress your boss.

Tip 1: Adding a range slider. The first tip I am going to share with you is the range slider selector. Believe it or not, with one line of code you can already bring substantial interactivity to your graph. Instead of having a static graph, the user will be able to choose and zoom …

Dictionary Comprehensions in Python

How to use dictionary comprehensions to create dictionaries in Python. Creating a Dictionary: Let’s say that we want to create a dictionary in Python from another iterable object or sequence, such as a list. For example, we have a list of numbers, and we want to create a dictionary …
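
As a small illustration of building a dictionary from a list with a comprehension: the list and the choice of mapping each number to its square are assumptions of this sketch, since the excerpt is cut off before saying what the dictionary should contain.

```python
numbers = [1, 2, 3, 4]

# {key_expression: value_expression for item in iterable}
squares = {n: n ** 2 for n in numbers}

# An optional condition filters which items become entries.
even_squares = {n: n ** 2 for n in numbers if n % 2 == 0}

print(squares)
print(even_squares)
```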

How to change semi-structured text into a Pandas dataframe

Using Python and Pandas, I converted a text document meant for human readers into a machine-readable dataframe. [Figure: Semi-structured data on the left, Pandas dataframe and graph on the right (image by author)] These days much of the data you find on the internet is nicely formatted as JSON, Excel files, or CSV. But …

Multiple regression as a machine learning algorithm

You now have the intuition, from a simplified example, of how multiple regression predicts the price of a used car based on two features: horsepower and highway mpg. But the world we live in is complex and messy. Each of the steps I’ve shown above will need to be branched out further. For example, …

Construct a Decision Tree and How to Deal with Overfitting

A decision tree is an algorithm for supervised learning. It uses a tree structure in which there are two types of nodes: decision nodes and leaf nodes. A decision node splits the data into two branches by asking a boolean question on a feature. A leaf node represents a class. The training process is about …
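
The two node types described in this excerpt can be sketched as below. This only shows the structure and how a prediction walks the tree, not the training process; the class names, the "feature <= threshold" form of the boolean question, and the feature names and thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # the class this leaf represents


@dataclass
class Decision:
    feature: str        # feature the boolean question is asked on
    threshold: float    # question: sample[feature] <= threshold ?
    if_true: "Node"     # branch taken when the answer is yes
    if_false: "Node"    # branch taken when the answer is no


Node = Union[Leaf, Decision]


def predict(node, sample):
    """Follow decision nodes until a leaf is reached, then return its class."""
    while isinstance(node, Decision):
        answer = sample[node.feature] <= node.threshold
        node = node.if_true if answer else node.if_false
    return node.label


# Hand-built tree with made-up thresholds, for illustration only.
tree = Decision(
    "petal_length", 2.5,
    Leaf("setosa"),
    Decision("petal_width", 1.7, Leaf("versicolor"), Leaf("virginica")),
)
print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))
```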

Building Custom Layers on AWS Lambda

How to build custom Python layers for your serverless application. Many developers face issues when importing custom modules on AWS Lambda: you see errors like “No module named pandas” or “No module named numpy”. Most of the time, the easiest way to solve this is to bundle your Lambda function code with the module and deploy …

5 Stories Data Tell us About Data Scientists

Data scientists tell stories through data. But what stories can data tell about data scientists? It may sound like the revenge of structured data, but it’s actually just a survey conducted by the Kaggle platform. The results of the 2019 Kaggle Machine Learning and Data Science Survey were made available here and …

DON’T underestimate these Python dunder methods!

Suppose you are testing different supervised learning techniques for a classification problem. For simplicity, I’ll assume you are familiar with these terms. Generally, you would have a set of raw features to be used as input variables for training the algorithm. However, important transformations such as filling in missing values, standardizing variables and so …

Data preprocessing with Python Pandas — Part 1 Missing Data

Part 1: Missing Data. This tutorial explains how to preprocess data using the pandas library. Preprocessing is the process of doing a pre-analysis of data in order to transform it into a standard, normalized format. Preprocessing involves the following aspects:

- missing values
- data standardization
- data normalization
- data …

Zonal Statistics Algorithm with Python in 4 Steps

How to summarize raster data for polygon zones. It is a common need to summarize information from a gridded dataset within an irregularly shaped area. While at first glance this may seem simple, reconciling differences between raster (gridded) and vector (polygon) datatypes can quickly become complicated. This article shows …

Diving into NumPy

NumPy is one of the most important and basic libraries used in data science and machine learning. It provides functionality for multidimensional arrays and high-level mathematical functions such as:

- linear algebra operations
- Fourier transforms
- random generators

The NumPy array also forms the fundamental data structure for scikit-learn. The core of …

All the Pandas merge() you should know for combining datasets

Below is what the Venn diagram looks like for our test dataset:

    df_customer = pd.DataFrame({'id': [1,2,3,4], 'name': ['Tom', 'Jenny', 'James', 'Dan']})
    df_info = pd.DataFrame({'id': [2,3,4,5], 'age': [31,20,40,70], 'sex': ['F', 'M', 'M', 'F']})
    pd.merge(df_customer, df_info, on='id', how=?)

[Figure: Venn diagram (image by author)]

4.1 inner join

By default, Pandas merge() performs an inner join, which produces only the set …
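
As a runnable sketch of the inner join on the two test dataframes from this excerpt: the dataframes are the ones shown above, while the outer-join comparison is added here only for contrast and is not taken from the article.

```python
import pandas as pd

df_customer = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Tom", "Jenny", "James", "Dan"],
})
df_info = pd.DataFrame({
    "id": [2, 3, 4, 5],
    "age": [31, 20, 40, 70],
    "sex": ["F", "M", "M", "F"],
})

# how="inner" is the default: only ids present in both frames survive.
inner = pd.merge(df_customer, df_info, on="id", how="inner")

# For contrast, an outer join keeps every id from either frame.
outer = pd.merge(df_customer, df_info, on="id", how="outer")

print(inner)
print(outer)
```

Here ids 2, 3 and 4 appear in both frames, so the inner join keeps three rows, while the outer join keeps all five ids and fills the gaps with NaN.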

7 Easy Ways for Improving Your Data Science Workflow

1. Organise your project directory. Nothing is worse than having a messy folder with Untitled.ipynb up to Untitled9999.ipynb, data CSVs scattered everywhere in the folder, and a bunch of cache like .ipynb_checkpoints. A good first step in setting up your project folder would make your daily workflow immensely smoother. Some common tools that help set up …

How to deal with imbalanced data in Python

What are imbalanced data precisely? Imbalanced data sets are a special case of classification problems where the class distribution is not uniform among the classes. Typically, they are composed of two classes: the majority (negative) class and the minority (positive) class [1]. Normally the minority class is what we hope the ML model would be …

How to Visualize Interactive 3D Network with Python Plotly

In this guide, I will be using Google Colab to demonstrate how to set up and create your own Python script from scratch so you can visualize your own 3D network. This guide is adapted from Plotly’s official guide. Changes I’ve made from that guide:

- Fixed some code that gave me errors
- Removed some unnecessarily …

Simple way to find a suitable algorithm for your data in scikit-learn (Python)

If you want to follow along with the code on your computer, make sure you have numpy, pandas, seaborn, sklearn and xgboost installed. Let’s imagine we want to find a suitable machine learning algorithm for a classification problem. For our example, we will use a subset of features from the Titanic dataset. Let’s import relevant packages …

Docker Example for BI/Data Science Development

I have found Docker to be a powerful tool for both BI and data science development workflows. Mitigating the “it works on my machine” issues is phenomenal in itself, so why not start from a development standpoint for robust business solutions? Finding a balance between scalability, workflow efficiency, and minimization of software dependency/shared library conflicts …

Getting Started with Bokeh for Python | Emile Gill

Effortlessly elegant interactive data visualisations in Python. In this article I aim to give you an introduction to Bokeh, detailing what it is, why you should be using it, and how you can easily get started! Bokeh is a neat Python library that allows us to quickly and easily …

How to design more informative visualizations

5 tips to spread your message more effectively across the audience with static visualizations. Data visualization is a must-have skill you need to become a data scientist. Most of the time, we focus fully on learning how to use visualization tools and do not stop to think about the design principles and good practices …

Synthetic Data Vault (SDV): A Python Library for Dataset Modeling

Python library: a tool to generate complex datasets using statistical and machine-learning models. In data science, you usually need a realistic dataset to test your proof of concept. Creating fake data that captures the behavior of the actual data can sometimes be a rather tricky task. Several Python packages try to achieve …

Using Dynamic Planning to Help Trump Win the Elections

Dynamic planning in Python for optimising election promotion. Disclaimer: The 2020 US election is merely used as background in this article. This story is meant to show you a way of thinking in computer programming. I am NEITHER expressing my political viewpoints NOR persuading anyone of any ideas. All the data used in this article …

Design of experiment basics: if you build them, they will come

There are already thousands of sources on the web explaining the p-value and concepts related to statistical significance testing. Why write another one? Over and over again, I see statistical tests used as a “black box” to obtain a p-value and apply the “golden rule”: if the p-value is less than 0.05, then the results are significant. Very often …

4 Steps to Easily Allocate Resources with Python & Bin Packing

Although these words may seem to border on the incomprehensible (I swear there are no typing errors in the previous sentences), the Bin Packing Problem often occurs in everyday life. Here are some examples: 🛒 You are at the supermarket. You have just paid and you have to put all the m products …

12 Examples to Master Python Dictionaries

A comprehensive practical guide for learning dictionaries. Data structures are crucial parts of any programming language. To create robust and well-performing products, one must know the data structures very well. In this post, we will work on an important data structure of the Python programming language, and that …