Utilize Your Self-Imposed Deadlines | Punch Today in The Face

The art of creating self-imposed deadlines is crucial not only for going above and beyond meeting requirements, but also for making our progress toward small and big goals a lot smoother. Love them or hate them, deadlines are incredibly motivational! However, here are two things we should try to avoid when dealing …

Steps to basic modern NN model from scratch

After we have defined the matrix multiplication strategy, it's time to define the ReLU function and the forward pass for the neural network. I would ask readers to go through Part 1 of the series to get the background on the data used below. The neural network is defined as below: output …
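For readers skimming this digest, a minimal NumPy sketch of the idea (not the article's own code; the shapes and variable names are made up here) looks like this:

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x)
    return np.clip(x, 0, None)

def forward(x, w1, b1, w2, b2):
    # Linear layer -> ReLU -> linear output layer
    l1 = x @ w1 + b1
    return relu(l1) @ w2 + b2

# Toy shapes: 5 samples, 10 input features, 50 hidden units, 1 output
x = np.random.randn(5, 10)
w1, b1 = np.random.randn(10, 50), np.zeros(50)
w2, b2 = np.random.randn(50, 1), np.zeros(1)
output = forward(x, w1, b1, w2, b2)  # shape (5, 1)
```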

A Quick Short Look Into Bootstrapping

Big Questions: After an A/B test, to what extent can we trust that our small sample represents the entire population of our customers? If we repeatedly draw samples of the same size, how would our estimates vary? If we obtain different estimates after repeated sampling, can we gauge the distribution of the population? If we don't know …
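As a rough illustration of the idea behind the post, here is a small NumPy sketch of bootstrapping a mean; the sample itself is simulated and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, size=200)        # stand-in for an A/B test sample

# Resample with replacement many times and record the statistic of interest
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# The spread of the bootstrap means approximates the sampling variability
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```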

Central Limit & Large Numbers

If you're into math equations, let us now turn to formal representations of the theorems in order to understand their claims and the relationship between the two a bit more precisely. Let X₁, X₂, …, Xₙ be independent and identically distributed random variables with expected value μ and finite variance σ². Then the standardized sample mean √n(X̄ₙ − μ)/σ converges towards the standard normal distribution in …
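A quick simulation makes the claim concrete; this sketch (not from the article) standardizes many sample means drawn from a uniform distribution and checks that they behave like a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, np.sqrt(1 / 12), 1_000   # uniform(0,1): mean 1/2, variance 1/12

# Draw many samples of size n and standardize their means
sample_means = rng.uniform(0, 1, size=(5_000, n)).mean(axis=1)
z = np.sqrt(n) * (sample_means - mu) / sigma

print(z.mean(), z.std())   # close to 0 and 1, as the CLT predicts
```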

How to understand Numpy documentation

When we start to learn Data Science, Machine Learning, Deep Learning or any other exciting field that uses Python as the programming language, most probably all of us will be using NumPy as well. In this post, I will cover NumPy basics and how to read the documentation properly, based on my experience of using …

Web Scrape Twitter by Python Selenium (Part 1)

Beginning of the tutorial. PS: If you are a beginner, I would suggest you work in a Jupyter Notebook first, because you will face more errors than ever before. By using a Jupyter Notebook you can run the script step by step so that you know where the problem is. Accessing the Twitter front page: the first step is to …
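That first step might look roughly like the following Selenium sketch; it assumes Chrome and a matching chromedriver are available, and it is not the tutorial's exact code:

```python
from selenium import webdriver

# Launch a browser and open the Twitter front page (chromedriver must be installed)
driver = webdriver.Chrome()
driver.get("https://twitter.com")
print(driver.title)   # confirm the page loaded before scraping further
driver.quit()
```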

Predicting Heart Disease Mortality

Building a machine learning model that can identify high-risk states in 2019. According to the Centers for Disease Control, “About 610,000 people die of heart disease in the United States every year–that’s 1 in every 4 deaths.” It is unlikely anyone reading this hasn’t been affected by this disease in some way. I, myself, lost …

Reduce Memory Usage and Make Your Python Code Faster Using Generators

A hands-on guide to creating iterators in a very Pythonic manner. (Photo by Createria on Unsplash.) When I started learning about Python generators, I had no idea how important they would turn out to be. They have helped me immensely while writing custom functions throughout my machine learning journey. Generator functions allow you to …
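A tiny sketch of the contrast the article draws (illustrative, not the author's code):

```python
import sys

# A list comprehension materializes every element in memory at once
squares_list = [n * n for n in range(1_000_000)]

# A generator expression produces values lazily, one at a time
squares_gen = (n * n for n in range(1_000_000))

print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # a few hundred bytes

# A generator function works the same way, using yield
def squares(limit):
    for n in range(limit):
        yield n * n
```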

5 Minute Guide to Detecting Holidays in Python

With Pandas, it's fairly straightforward to construct a list of dates, let's say for the whole year of 2019: Great. Now we can construct a DataFrame object from those dates — let's put them into a Dates column. Now here comes a slight problem: the dates appear to be stored in a string format, just like …
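The steps described above might look roughly like this Pandas sketch; the holiday-detection part uses pandas' built-in US federal holiday calendar, which may differ from the approach the article takes:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Dates for the whole year of 2019, stored in a "Dates" column
dates = pd.date_range(start="2019-01-01", end="2019-12-31")
df = pd.DataFrame({"Dates": dates})

# Make sure the column is a real datetime type, not strings
df["Dates"] = pd.to_datetime(df["Dates"])

# Flag US federal holidays
holidays = USFederalHolidayCalendar().holidays(start="2019-01-01", end="2019-12-31")
df["Holiday"] = df["Dates"].isin(holidays)
print(df[df["Holiday"]].head())
```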

A Quick Primer on Databricks Koalas

Interact with Spark DataFrames using Pandas vocabulary. (Photo by Jordan Whitt on Unsplash.) In a project of mine, I extensively used Spark to manage working with some large data files. Though it is best known for its benefits on large distributed systems, it works equally well locally for projects working with large …
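A minimal sketch of that pandas-style workflow with Koalas; the file name and column are placeholders, not taken from the article:

```python
import databricks.koalas as ks

# Read a CSV into a Koalas DataFrame, which is backed by Spark under the hood
kdf = ks.read_csv("large_file.csv")

# Familiar pandas vocabulary, executed by Spark
print(kdf.head())
print(kdf.groupby("some_column").count())   # "some_column" is a placeholder

# Convert to pandas only when the result is small enough to fit in memory
pdf = kdf.to_pandas()
```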

How to code effectively without dying in the attempt

1. Find a comfortable working space. Most programming and coding jobs are flexible enough to allow working from home, a common space, a library or even a coffee shop, without having to be at an office 8 hours per day, 5 days per week. However, the working environment will always have a highly significant …

Design of Experiments for Your Change Management

A step-by-step guide to Design of Experiments. Data science professionals, have you ever faced any of the following challenges? Story 1: Machine learning does not mean experimental design. You are asked to design an experiment because of your statistical expertise, but realize that your machine learning tools do not help you design one. Story 2: …

Let’s calculate Z-scores for Airbnb prices in New York

Z-score, also called standard score, according to Wikipedia: “In statistics, the standard score is the signed fractional number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.” Translation: a measure of how far a value is from its population …
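The definition translates into a one-liner; here is an illustrative sketch with made-up prices rather than the article's Airbnb data:

```python
import pandas as pd

# Hypothetical column of nightly prices
prices = pd.Series([80, 120, 95, 300, 150, 60, 210])

# z = (x - mean) / standard deviation
z_scores = (prices - prices.mean()) / prices.std()
print(z_scores)

# Values with |z| > 3 are often flagged as outliers
print(prices[z_scores.abs() > 3])
```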

Why Companies Are Using Data Science and Analytics to Inform Benefits Packages

Employee benefits packages can sway candidates toward taking a job offer or looking elsewhere. They can also factor into how long a worker stays at a company and how happy they are while there. If workers realize that other companies offer better benefits and they’re frustrated with their job already, they may decide it’s not …

Integrating Python & Tableau

Bring your analyses to life with engaging data visualizations. When performing in-depth analyses on large and unstructured datasets, the power of Python and relevant machine learning libraries cannot be overstated. Matplotlib serves as a great tool to help us visualize results, but its stylization options are not always optimal for use in presentations and dashboards. …

Why your AI might be racist and what to do about it

Individually reasonable correlations can cause an AI to gain a racial bias. Even well-designed AI systems can still end up with a bias. This bias can cause the AI to exhibit racism, sexism, or other types of discrimination, entirely by accident. This is usually considered a political problem and ignored by scientists. The result is …

An Alternative To Batch Normalization

The development of Batch Normalization (BN) as a normalization technique was a turning point in the development of deep learning models; it enabled various networks to train and converge. Despite its great success, BN exhibits drawbacks that are caused by its distinct behavior of normalizing along the batch dimension. One of the major disadvantages of BN …
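A small NumPy sketch (not from the article) of the batch-dimension behaviour being criticized, showing how the normalized output of one example changes with the batch it happens to be in:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Statistics are computed across the batch axis (axis 0), so the output
    # for one example depends on the other examples in its batch.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(32, 64)          # batch of 32 examples, 64 features

full = batch_norm(x)[0]              # first example, normalized with the full batch
small = batch_norm(x[:2])[0]         # same example, normalized with a batch of 2
print(np.abs(full - small).mean())   # noticeably different: the batch-size dependence
```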

Managing virtual environments with pyenv

Most Python developers and data scientists have already heard of virtual environments. However, managing tens of environments created for different projects can be daunting. pyenv will help you streamline creating, managing and activating virtual environments. In the old days, before virtualenv became popular, I would keep a single global workspace for all …

Introduction to Spark NLP: Foundations and Basic Components

As a native extension of the Spark ML API, the library offers the capability to train, customize and save models so they can run on a cluster or other machines, or be saved for later use. It is also easy to extend and customize models and pipelines, as we'll see in detail during this article series. Spark NLP …
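A minimal Spark NLP pipeline sketch along those lines might look like this; the toy DataFrame and stage choices are illustrative, not the series' exact code:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])
df = spark.createDataFrame([("Spark NLP runs on a cluster.",)], ["text"])

model = pipeline.fit(df)        # the fitted PipelineModel can be saved and reloaded later
model.transform(df).show()
```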

Best Practices for NLP Classification in TensorFlow 2.0

Use Data Pipelines, Transfer Learning and BERT to achieve 85% accuracy in Sentiment Analysis. (Photo by Jirsak, courtesy of Shutterstock.) When I first started working with Deep Learning, I went through the Coursera and fast.ai courses, but afterwards I wondered where to go from there. I started asking questions like “How do I develop a data …
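A bare-bones tf.data input pipeline of the kind referred to above might look like this sketch (toy data, not the article's):

```python
import tensorflow as tf

texts = ["great movie", "terrible plot", "loved it", "boring"]
labels = [1, 0, 1, 0]

# A tf.data pipeline: shuffle, batch, and prefetch for throughput
dataset = (
    tf.data.Dataset.from_tensor_slices((texts, labels))
    .shuffle(buffer_size=4)
    .batch(2)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

for batch_texts, batch_labels in dataset:
    print(batch_texts.numpy(), batch_labels.numpy())
```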

Using K-Means Clustering Algorithm to Redefine NBA Positions and Explore Roster Construction

Conventional positions within the NBA do not accurately reflect the playing style or functional role a player provides to their team. The overall style of play has changed drastically, and the various eras within the NBA indicate that. Similarly, a player’s style of play is also reflective of this change. Currently the league is fast-paced …
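A minimal sketch of the clustering step with scikit-learn; the stats below are invented placeholders, not the article's NBA feature set:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-game stats; the article's real feature set will be richer
stats = pd.DataFrame({
    "pts": [25.1, 8.3, 14.7, 21.9],
    "reb": [5.2, 10.8, 3.1, 7.4],
    "ast": [7.6, 1.2, 5.9, 2.3],
    "3pa": [8.1, 0.4, 6.2, 3.3],
})

X = StandardScaler().fit_transform(stats)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
stats["role"] = kmeans.labels_   # data-driven "positions" instead of PG/SG/SF/PF/C
print(stats)
```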

The City of the Homeless: Humanitarian Crisis on the Streets of Los Angeles

The dominant narrative around who is living on the street — and why it is so difficult to help them — is that people experiencing homelessness are all drug addicts and/or severely mentally ill. This damaging narrative dehumanizes people experiencing homelessness in a cynical attempt to justify inaction. But it is also factually incorrect: according …

Cloud Risk Assessment through Data Log Analysis in AWS

https://aws.amazon.com/getting-started/projects/analyze-big-data/ These are the high-level steps. (Note: an AWS account is a prerequisite. If you try this out, ensure that clusters and buckets are deleted after use to avoid additional charges.) Sample data is loaded; in real-life projects the relevant dataset would replace this. Launch a Hadoop cluster using Amazon EMR [Elastic MapReduce], …
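For readers who prefer code over the console, a hedged boto3 sketch of the cluster-launch step might look like this; the instance types, release label and role names are assumptions, and the cluster must still be terminated afterwards to avoid charges:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small Hadoop cluster; remember to terminate it to avoid charges
response = emr.run_job_flow(
    Name="log-analysis-demo",
    ReleaseLabel="emr-5.29.0",                      # assumed release label
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```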

Machine Learning and Data Analysis — Inha University (Part-2)

Welcome to the second part of the Machine Learning and Data Analysis series, based on a graduate course offered by Inha University, Rep. of Korea. In this part, we will discuss data structures in Python. However, if you are viewing this for the first time, we encourage you to start with the first part, where …
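As a small taste of that topic, a sketch of Python's core built-in data structures (not taken from the course material):

```python
# Python's core built-in data structures
scores = [88, 92, 75]                    # list: ordered, mutable
point = (3, 4)                           # tuple: ordered, immutable
student = {"name": "Min", "age": 24}     # dict: key-value mapping
labels = {"cat", "dog", "cat"}           # set: unique elements only

scores.append(100)
student["major"] = "Data Science"
print(scores, point, student, labels)
```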

Automatic Speech Recognition as a Microservice on AWS

Let’s quickly get back to our lab work and implement this highly complex piece of work in a few easy steps. At this point, you should have your EC2 instance up and be SSHed into it. Please refer to the GitHub repository for any missing resources/links. In your home directory [/home/ec2-user], maintain the following directory structure: D -> …

How to Write Python Command-Line Interfaces like a Pro

(Photo by Kelly Sikkema on Unsplash.) As data scientists, we face many repetitive and similar tasks. That includes creating weekly reports, executing extract, transform, load (ETL) jobs, or training models using different parameter sets. Often, we end up having a bunch of Python scripts, where we change parameters in code every time we run …
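One common way to stop editing parameters in code is a standard-library argparse interface; this is an illustrative sketch, not necessarily the tool the article settles on:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="Train a model from the command line.")
    parser.add_argument("--data-path", required=True, help="Path to the training data")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()

    # Parameters now come from the CLI instead of being edited in the script
    print(f"Training on {args.data_path} for {args.epochs} epochs (lr={args.learning_rate})")

if __name__ == "__main__":
    main()
```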

Let’s build an Intelligent chatbot

Modern chatbots do not rely solely on text, and will often show useful cards, images, links, and forms, providing an app-like experience. Depending on the way bots are programmed, we can categorize them into two variants: rule-based (dumb bots) and self-learning (smart bots). Rule-based chatbots: this variety of bot answers questions based on …
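A toy sketch of the rule-based variant (purely illustrative, with made-up rules):

```python
# A toy rule-based bot: answers come from hand-written keyword -> response rules
RULES = {
    "hello": "Hi there! How can I help you?",
    "hours": "We are open 9am to 5pm, Monday to Friday.",
    "bye": "Goodbye!",
}

def respond(message: str) -> str:
    text = message.lower()
    for keyword, reply in RULES.items():
        if keyword in text:
            return reply
    return "Sorry, I don't understand that yet."

print(respond("Hello, what are your hours?"))
```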

Transfer Learning in NLP

Neural Transfer Learning for Natural Language Processing by Sebastian Ruder. Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359. Xia, R., Zong, C., Hu, X., and Cambria, E. (2015). Feature Ensemble plus Sample Selection: A Comprehensive Approach to Domain Adaptation for Sentiment Classification. Proceedings …

How Data Creates a Collective Storytelling Voice on a Global Issue

20.6k upvotes. 1.9k comments. Cross-posted on 10 other subreddits. Not super impressive. But is there more to these figures which supposedly reflect the success (or popularity) of the bar chart I posted on the subreddit /r/dataisbeautiful? Recently I posted a race bar chart showing the evolution of the top 10 countries of origin of international …

Onboarding a New Data Scientist

Efficiently onboard your new data champions. Your new employee will be ready to save the world in no time! Onboarding is hard: onboarding is so important but very difficult. Most traditional jobs have clear expectations, documentation, and processes for onboarding. Data science roles are completely different! This, of course, is a result of being a …

Kalman Filter (2) — Grid World Localisation

Apply the basics to 2-dimensional space. In the last post, we applied basic Bayes' rule and total probability to localise a moving car in a 1-dimensional world. Let's reinforce our understanding and apply them to a 2-dimensional world. Consider a 2-dimensional world in which the robot can move only left, right, up, or down. It cannot …
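A compact sketch of a discrete Bayes filter on such a grid (illustrative only; it simplifies by letting moves wrap around the edges, unlike the article's world):

```python
import numpy as np

# Uniform prior over a 4x5 grid world
belief = np.full((4, 5), 1 / 20)

def move(belief, dy, dx, p_move=0.8):
    # The robot moves one cell (up/down/left/right) with probability p_move,
    # otherwise it stays put; beliefs are shifted accordingly (wrapping at edges).
    shifted = np.roll(belief, shift=(dy, dx), axis=(0, 1))
    return p_move * shifted + (1 - p_move) * belief

def sense(belief, world, measurement, p_hit=0.9, p_miss=0.1):
    # Bayes update: cells matching the measurement are weighted up, then normalized
    likelihood = np.where(world == measurement, p_hit, p_miss)
    posterior = belief * likelihood
    return posterior / posterior.sum()

world = np.random.choice(["red", "green"], size=(4, 5))
belief = sense(move(belief, 0, 1), world, "red")
print(belief.round(3))
```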

Beat The Heat with Machine Learning Cheat Sheet

Make the next-to-last mistake. Supervised Learning: supervised learning algorithms involve direct supervision of the operation. We teach or train the machine using data, which means that the data is labelled with the right answer. We use an algorithm to analyse the training data and learn the function that maps inputs to their outputs. The function can …
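In code, that description boils down to the familiar fit/predict pattern; here is a generic scikit-learn sketch, not taken from the cheat sheet:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labelled data: features X paired with the "right answer" y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The algorithm learns the function mapping inputs to outputs
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy on unseen data:", model.score(X_test, y_test))
```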

Custom Transformers in Python — Part II

Data cleaning is the most important part of any machine learning project. The fact that your data may be in multiple formats and spread across different systems makes it imperative that the data is properly massaged before it’s fed to an ML model. Data preparation is one of the most tedious and time-consuming steps in …
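A minimal scikit-learn custom transformer, as a sketch of the pattern the series covers (the transformer itself is a made-up example):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Toy custom transformer: drops the listed columns inside a Pipeline."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self                      # nothing to learn

    def transform(self, X):
        return X.drop(columns=self.columns)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "id": [10, 11]})
print(ColumnDropper(columns=["id"]).fit_transform(df))
```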

Elizabeth Warren is Leading the 2020 Presidential Race: An analysis in Python

In this post we will use pytrends, the Python Google Trends API, to analyze which of the leading Democratic candidates are being searched for the most. In order to install pytrends, open up a command line and type: pip install pytrends. Next open up an IDE (I use Spyder) and import pytrends: from pytrends.request import TrendReq. Next …
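Continuing in that vein, a hedged sketch of how the query might be built with pytrends; the keyword list, timeframe and geo are illustrative choices, not necessarily the article's:

```python
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)

# Compare search interest for several candidates over the last 12 months
candidates = ["Elizabeth Warren", "Joe Biden", "Bernie Sanders"]
pytrends.build_payload(candidates, timeframe="today 12-m", geo="US")

interest = pytrends.interest_over_time()
print(interest[candidates].mean().sort_values(ascending=False))
```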

600X t-SNE speedup with RAPIDS

GPU acceleration is commonly associated with deep learning. GPUs power Convolutional Neural Networks for computer vision and Transformers for natural language processing. They do this through parallel computation, making them much faster than CPUs for certain tasks. RAPIDS is expanding the utilization of GPUs by bringing traditional machine learning and data science algorithms, such …
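A minimal RAPIDS sketch of GPU t-SNE; it assumes a RAPIDS install and an NVIDIA GPU, and the file name is a placeholder:

```python
import cudf
from cuml.manifold import TSNE

# Load features onto the GPU and run t-SNE there
gdf = cudf.read_csv("features.csv")   # placeholder file
tsne = TSNE(n_components=2, perplexity=30)
embedding = tsne.fit_transform(gdf)
print(embedding.shape)
```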

Process Mapping with R

R for Industrial Engineers: creating process maps using R packages. (Image by Matteson Ellis, available at FCPAmericas.) "A problem well stated is a problem half solved." – Charles Franklin Kettering. Process mapping is a great tool for retrieving information about a process during the Define phase of the DMAIC (Define, Measure, Analyze, Improve, Control) cycle. …

Hey Model, Why Do You Say This Is Spam?

Shapley values are used in machine learning to explain the predictions of a complex predictive model, aka “black box”. In this post, I will use Shapley values to identify YouTube comment key terms that explain why a comment was predicted as either spam or legitimate by a predictive model. The “coalitions” of specific key terms …
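A toy sketch of computing Shapley values for a text classifier with the shap package (tiny invented comments, not the article's YouTube data):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

comments = ["check out my channel", "great song", "free subscribers here", "love this video"]
labels = [1, 0, 1, 0]                     # 1 = spam, 0 = legitimate

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments).toarray()
model = RandomForestClassifier(random_state=0).fit(X, labels)

# Shapley values: how much each term pushed a prediction toward spam or legitimate
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(vectorizer.get_feature_names_out())
print(np.shape(shap_values))   # one value per sample, per feature (and per class)
```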

Lessons Learned Using Google Cloud BigQuery ML

CLOUD AUTOMATIC MACHINE LEARNING TOOLS (PART 1): a start-to-finish ML demo using the German Credit Data. Motivation: a new buzzword I hear often is “democratize AI for the masses”. Usually what follows is a suggested cloud machine learning tool. The umbrella term for these tools seems to be AML, or Automatic Machine Learning. As a Data Scientist …
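In BigQuery ML the training step is a SQL statement; a hedged sketch from Python, with placeholder dataset, table and label names, might look like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

# BigQuery ML trains a model with a SQL statement; names below are placeholders
train_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.credit_risk_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['default']) AS
SELECT * FROM `my_dataset.german_credit_data`
"""
client.query(train_model_sql).result()   # waits for training to finish
```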

Just Keep Guessing: The Power of the Monte Carlo Method

The Monte Carlo method is an incredibly powerful tool used in a wide variety of fields. From mathematics to science to finance, the Monte Carlo method can be used to solve a variety of unique and interesting problems. The idea of the Monte Carlo method is fairly straightforward: it relies on a large amount …
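The classic example is estimating π by guessing random points; a short sketch (not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Guess repeatedly: random points in the unit square, count those inside the quarter circle
x, y = rng.random(n), rng.random(n)
inside = (x**2 + y**2) <= 1.0

pi_estimate = 4 * inside.mean()
print(pi_estimate)   # approaches 3.14159... as n grows
```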

Meaningful Metrics: Cumulative Gains and Lift Charts

Nowadays, all major companies rely heavily on their data science capabilities. Business data units are becoming larger and more sophisticated in terms of the complexity and diversity of their analyses. However, the success of delivering data science solutions into business reality largely depends on the interpretability of findings. Even if the developed models provide outstanding accuracy …
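A small sketch of how a cumulative gains curve can be computed from scores and labels (illustrative helper, not the article's code):

```python
import numpy as np

def cumulative_gains(y_true, y_score, n_bins=10):
    # Sort customers by predicted score, then see what share of all positives
    # is captured in each cumulative top slice (deciles by default).
    order = np.argsort(y_score)[::-1]
    y_sorted = np.asarray(y_true)[order]
    cuts = np.linspace(0, len(y_sorted), n_bins + 1).astype(int)[1:]
    return [y_sorted[:c].sum() / y_sorted.sum() for c in cuts]

y_true = np.random.binomial(1, 0.2, size=1000)
y_score = np.random.rand(1000)
print(cumulative_gains(y_true, y_score))   # random scores give roughly a diagonal line
```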

Isolation Forest and Spark

Results: in both examples we use a very small and simple dataset, just to demonstrate the process. data = [{'feature1': 1., 'feature2': 0., 'feature3': 0.3, 'feature4': 0.01}, {'feature1': 10., 'feature2': 3., 'feature3': 0.9, 'feature4': 0.1}, {'feature1': 101., 'feature2': 13., 'feature3': 0.9, 'feature4': 0.91}, {'feature1': 111., 'feature2': 11., 'feature3': 1.2, 'feature4': 1.91}] Both algorithms conclude that the first sample …
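For comparison, the same toy data run through scikit-learn's IsolationForest might look like this sketch; the parameters are illustrative and not necessarily those used in the article:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

data = [
    {"feature1": 1.,   "feature2": 0.,  "feature3": 0.3, "feature4": 0.01},
    {"feature1": 10.,  "feature2": 3.,  "feature3": 0.9, "feature4": 0.1},
    {"feature1": 101., "feature2": 13., "feature3": 0.9, "feature4": 0.91},
    {"feature1": 111., "feature2": 11., "feature3": 1.2, "feature4": 1.91},
]
df = pd.DataFrame(data)

clf = IsolationForest(contamination=0.5, random_state=0).fit(df)
print(clf.predict(df))   # -1 marks samples flagged as anomalies, 1 marks inliers
```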

Uncommon Data Cleaners for your Real-World Machine or Deep Learning Project

Cleaning Tools: data cleaning is a subject that is only lightly touched on in brick-and-mortar or online classes. However, in your work as a Data Engineer or Data Scientist you will spend a great deal of your time getting your data ready (pre-processing it) so that it can be input into your model. Data cleaning is critical …

Animating gAnime with StyleGAN: The Tool

An in-depth tutorial for an open-source GAN research tool. (Visualization of feature map 158 at a layer with resolution 64×64.) This is a tutorial/technical blog for a research tool I’ve been working on as a personal project. While a significant portion of the blog assumes you have access to the tool while reading it, I attempted …

How to Test Your Hypothesis Using P-Value Uniformity Test

How do we interpret the shape of a p-value distribution? Example 1: Uniform p-value distribution. Suppose the null hypothesis says a random variable follows a normal distribution with mean 0 and variance 1. As depicted above, the p-value distribution will closely resemble a uniform distribution if the sample follows the null distribution. This is because …
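A quick simulation sketch (not from the article) of why that happens: testing a true null many times and checking the p-values for uniformity:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Repeatedly sample from the null distribution N(0, 1) and test that null
p_values = [
    stats.ttest_1samp(rng.normal(0, 1, size=50), popmean=0).pvalue
    for _ in range(2000)
]

# Under the null, p-values should be uniform on [0, 1]
ks_stat, ks_p = stats.kstest(p_values, "uniform")
print(ks_stat, ks_p)   # a large KS p-value is consistent with uniformity
```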

How to Implement Machine Learning For Predictive Maintenance

As Industry 4.0 continues to generate media attention, many companies are struggling with the realities of AI implementation. Indeed, the benefits of predictive maintenance, such as helping determine the condition of equipment and predicting when maintenance should be performed, are extremely strategic. Needless to say, the implementation of ML-based solutions can lead to major …