Probability Distributions in Data Science

An introduction to some of the most commonly used Probability Distributions in Data Science with real-life examples. Having a sound statistical background can be greatly beneficial in the daily life of a Data Scientist. Every time we start exploring a new dataset, we need to first do an Exploratory Data Analysis (EDA) in order to …

Top Down View at Reinforcement Learning

Stitch together the different parts and branches of Reinforcement Learning. When you are new to Reinforcement Learning you will no doubt be bombarded with weird terms, like Model-Based, Model-Free, On Policy, Off Policy, etc. Soon you will find it exhausting to keep track of this terminology that seems to appear all over the place, without …

The Data Processing Error in the Most Prominent Fair Machine Learning Dataset (short version)

While ProPublica’s COMPAS data are used in an increasing number of studies, researchers have generally taken the datasets created by ProPublica as they are and do not appear to have scrutinized them for data processing issues. Instead of testing a novel fairness definition or procedure, I take a closer look at the actual datasets put …

Working as a Data Scientist in Cybersport

Data science for a Dota team. Photo by Fredrick Tendong on Unsplash. In football, there have long been teams that use big data analysis for their transfers and for analysis of their players’ games. There is even a football club that for years bought players only by looking at their statistics and what data scientists …

Very Short Introduction to Data Science Terminology

Photo by Franki Chamaki on Unsplash. An attribute is a property of an object. An attribute is also known as a variable or feature. A collection of attributes describes an object. An object is also known as a sample, entity, or instance. Data can often be represented or abstracted as an n×d data matrix. “n” rows correspond to …

The geometric interpretation of 3D lines and planes

Linear Algebra is the branch of mathematics whose objects live beyond ℝ. Those objects might be coordinates in space (hence points) or combinations of points in the form of multivariate equations. Whenever we work with more than 3 dimensions, it is not physically possible to visualize our objects. Hence, in this article I’m …

Six Challenges Every Data Scientist Will Face and How to Overcome Them

Supplements for thought? Image made by author. The age of information has bestowed upon humanity one of the biggest explosions of tech-focused jobs ever. While much of the power behind big successful companies such as Uber, Facebook, AirBnB, and Amazon is their ingenuity and convenience for consumers, their success can also be attributed to …

Five ways ML helps broadcasters achieve new efficiencies and reinvent CX

Source: unsplash.com. With broadcasting behemoths like Netflix and Hulu dominating the market, winning eyeballs and making the audience stay tuned to your video content is no walk in the park. But not impossible, either. AI and ML development experts recommend creating a coherent marketing strategy and using a winning combination of video and image analysis …

LineFlow: Simple NLP Dataset Handler for PyTorch or Any Framework

Smaller Code, Less Pain. For an NLP task, you might need to tokenize text or build the vocabulary during pre-processing. And you have probably experienced pre-processing code that is as messy as your desk. Forgive me if your desk is clean 🙂 I have had that experience too. That’s why I created LineFlow to …

A High-Level Guide to Autoencoders

An autoencoder toolbox from most basic to most fancy. In the wonderful world of machine learning and artificial intelligence, there exists a structure called an autoencoder. Autoencoders are a type of neural network that is part of unsupervised learning (or, to some, semi-unsupervised learning). There are many different types of autoencoders used for many purposes, some …

Sensing the Air Quality

A low-cost IoT Air Quality Monitor based on Raspberry Pi 4. Santiago, Chile during a winter environmental emergency. I have the privilege of living in one of the most beautiful countries in the world, but unfortunately, not “all are flowers”. During the winter season, Chile suffers a lot from air pollution, mainly due to particulate matter, as …

The case against the jupyter notebook

Joel Grus on the TDS podcast. Editor’s note: This is the first episode of the Towards Data Science podcast’s “Climbing the Data Science Ladder” series, hosted by Jeremie Harris, Edouard Harris and Russell Pollari. Together, they run a data science mentorship startup called SharpestMinds. You can listen to the podcast below: To most data scientists, …

Helping a Reader with Python Web Scraping Refactored.

Bhargava Reddy Morampalli, a microbiologist from India, read my first post on web scraping from my old blog. If you didn’t get a chance to check out that post you can read it here. Python Web Scraping Refactored: My first article on my old blog was on a web scraping example. Web scraping is one …

View from the Top: 3 Takeaways from the Chief Data Officer Symposium

What the World’s Most Innovative CDOs are Doing Today. Photo by Skye Studios on Unsplash. Earlier this month, I had the pleasure of attending the Chief Data Officer Symposium at MIT for the first time. More than 60 CDOs were there, hailing from across the United States, Canada, Germany, the Netherlands, and more. There was representation …

AI Powered Search for Extra-terrestrial Intelligence — Signal Classification with Deep Learning

AI FOR SOCIAL GOOD SERIES, PART 2.2. Classifying Radio-Telescope Signals from SETI with Deep Learning. Welcome (or welcome back!) to the AI for social good series! In the second part of this two-part series of articles, we will look at how artificial intelligence (AI) coupled with the power of open-source tools and techniques like …

Learn faster with smarter data labeling

So, besides what transfer learning offers, can we further reduce the amount of labeling work? Actually, the answer is yes, and a couple of techniques exist. One of the most well-studied is active learning. The principle is straightforward: only label what is useful for your current model. Formally, the active learning …

Data Scientists, The five Graph Algorithms that you should know

A graph with 3 connected components. We all know how clustering works. You can think of Connected Components, in layman’s terms, as a sort of hard clustering algorithm that finds clusters/islands in related/connected data. As a concrete example: say you have data about roads joining any two cities in the world. And you …
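
As a rough illustration of the idea in this teaser, here is a minimal sketch of connected components using a depth-first search over an undirected graph. The function name `connected_components` and the road list are made up for this example and are not taken from the original article.

```python
from collections import defaultdict


def connected_components(edges):
    """Return a list of sets, one per connected component of an undirected graph."""
    # Build an adjacency list; every endpoint becomes a key.
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # A depth-first search from an unvisited node collects one component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Hypothetical road data: each pair is a road joining two cities.
roads = [("Delhi", "Mumbai"), ("Mumbai", "Pune"), ("Paris", "Lyon"), ("Tokyo", "Osaka")]
islands = connected_components(roads)
print(len(islands), "groups of mutually reachable cities")
```

Each returned set is one "island": a group of cities that can reach each other by road but cannot reach any city outside the set.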

How I Got Started With Kaggle Competitions (It’s Not That Hard)

Most people in the data science community know Kaggle as a place to learn and grow your skills. One popular way for practitioners to improve is to compete in prediction challenges. For newcomers, it can be overwhelming to jump in and compete on the site in an actual challenge. At least, that’s how I always …

Bayesian Basketball: were the Toronto Raptors really the best team during NBA 2019 season?

Let’s go back in time and see if we can end up with a different winner for the NBA 2019 title. How? By using Bayesian simulations. Credit: NYTimes. [This article was inspired by the work of Baio and Blangiardo (2010), Daniel Weitzenfeld’s great blog article, and Peadar Coyle’s tutorial on Hierarchical models.] Bayesian simulation …

Rank the Features, now rank again

Machine Learning and Biomarkers. 40 ways to rank your features and my experience selecting biomarker candidates (features) using few observations. Here I discuss methods to rank features in 40 ways and a difficult case with an unstable model, caused by the data having many more variables than samples. As a bonus, you will know a …

How Data Science can help solve Climate Change

Data-driven solutions will lead the Transition to Clean Energy. Photo by Bogdan Pasca on Unsplash. Climate Change is real. And even though many scientists agree that we are already too late, people are only now becoming conscious of this problem. And with the people comes politics, and with politics comes the money. That’s …

A Text Analytics Primer: Key Factors in a Text Analytics Strategy

A Venn diagram of the subfields of text analytics and how they relate (Miner, 2012). Introduction: About 90% of all the data in the world was created in the last 24 months, averaging 2.5 quintillion bytes per day, and about 90% of that is unstructured data, such as texts, Tweets, …

Maximum Likelihood Estimation Explained – Normal Distribution

Wikipedia defines Maximum Likelihood Estimation (MLE) as follows: “A method of estimating the parameters of a distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable.” To get a handle on this definition, let’s look at a simple example. Let’s say we have some data and …
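
To make the quoted definition concrete, here is a minimal sketch for the normal distribution. It relies on the standard closed-form result that the likelihood is maximized by the sample mean and the population (divide-by-n) standard deviation. The sample values and function names are made up for illustration and are not from the original article.

```python
import math
import statistics


def normal_mle(data):
    """Closed-form maximum likelihood estimates (mu, sigma) for a normal model."""
    mu = statistics.fmean(data)
    # The MLE of the variance divides by n (population variance), not n - 1.
    sigma = math.sqrt(statistics.pvariance(data, mu))
    return mu, sigma


def log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. normal data under the parameters (mu, sigma)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared_error / (2 * sigma ** 2)


sample = [4.0, 5.0, 6.0, 5.0, 5.0]  # made-up observations
mu_hat, sigma_hat = normal_mle(sample)
print(mu_hat, sigma_hat)
print(log_likelihood(sample, mu_hat, sigma_hat))
```

Evaluating `log_likelihood` at any other pair of parameters, for example a shifted mean, gives a lower value, which is what "the observed data is most probable" means in practice.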

Reduce Dimensions for Single Cell

Compare dimension reductions for single cell genomics. From Becht et al., Nature Biotechnology 2019, image source. This is the eighth article in the column Mathematical Statistics and Machine Learning for Life Sciences where I try to cover analytical techniques common for Bioinformatics, Biomedicine, Genetics, Evolutionary Science etc. Today we are going to talk about dimension …

One-tailed or two-tailed test, that is the question

Source: pixabay. Learn the difference between two variants of statistical tests and how to implement them in Python. In data science/econometrics we see statistical tests in many places: correlation analysis, ANOVA, A/B testing, linear regression results, etc. Therefore, it is very important for practitioners to thoroughly understand their meaning and know why a given …
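
As a small illustration of the distinction, the sketch below computes both p-values for a z statistic using only the standard library's `statistics.NormalDist`. The statistic 1.8 is a made-up value, and the one-tailed alternative is assumed to be "the effect is greater than zero"; the original article may use a different test or library.

```python
from statistics import NormalDist


def z_test_p_values(z):
    """Return (one-tailed, two-tailed) p-values for a z statistic.

    The one-tailed value tests the alternative "the effect is greater than zero".
    """
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    one_tailed = 1 - standard_normal.cdf(z)
    two_tailed = 2 * (1 - standard_normal.cdf(abs(z)))
    return one_tailed, two_tailed


p_one, p_two = z_test_p_values(1.8)  # 1.8 is a made-up test statistic
print(f"one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```

With this statistic the result is significant at the 5% level under the one-tailed test but not under the two-tailed one, which is why the choice of variant has to be made before looking at the data.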

Reinforcement Learning (DDPG and TD3) for News Recommendation

In the next section, we will try to compare and, primarily, evaluate different reinforcement learning algorithms. But how do we tell if the results are good or not? The critic network assigns values to our actions; however, how can you be sure that those values are meaningful? Well, they are based on critic loss. If critic …

AI is transforming politics — for both good and bad

BIG TECH, BIG DATA, BIG MONEY. Big Data powering Big Money, the return of direct democracy, and the tyranny of the minority. Source: Pixabay. Nowadays, artificial intelligence (AI) is one of the most widely discussed phenomena. AI is poised to fundamentally alter almost every dimension of human life, from healthcare and social interactions to …

Advanced Topics in Neural Networks

As you have likely come to realize from your own adventures with neural networks, and possibly from other articles and research literature, the learning rate is a very important part of neural network training. The learning rate essentially determines how ‘fast’ the network will learn: it determines the step size of the movement. A higher …
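
The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which is a stand-in chosen for this illustration rather than a neural network, with three made-up learning rates.

```python
def gradient_descent(learning_rate, start=5.0, steps=20):
    """Minimize f(x) = x**2 with fixed-step gradient descent; return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x               # derivative of x**2
        x -= learning_rate * gradient  # step size is set by the learning rate
    return x


small = gradient_descent(0.01)     # slow progress toward the minimum at 0
moderate = gradient_descent(0.1)   # much closer to 0 after the same number of steps
too_large = gradient_descent(1.1)  # each step overshoots, so the iterates diverge
print(small, moderate, too_large)
```

On this function each update multiplies x by (1 - 2 × learning rate), so any rate above 1.0 makes the magnitude grow instead of shrink.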

Perceptron Algorithms for Linear Classification

Learn how the perceptron algorithms work and the intuition behind them. The basic perceptron algorithm was first introduced by Ref 1 in the late 1950s. It is a binary linear classifier for supervised learning. The idea behind the binary linear classifier can be described as follows. where x is the feature vector, θ is the …
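
The equation referred to in this teaser is not preserved here, so the sketch below assumes the common textbook formulation: predict the sign of θ·x + θ₀, and whenever a training point is misclassified, add y·x to θ and y to θ₀. The names `theta` and `theta_0` and the four training points are assumptions made for illustration.

```python
def train_perceptron(points, labels, epochs=10):
    """Basic perceptron: update the weights only on misclassified points."""
    theta = [0.0] * len(points[0])  # weight vector
    theta_0 = 0.0                   # offset (bias) term
    for _ in range(epochs):
        for x, y in zip(points, labels):
            score = sum(t * xi for t, xi in zip(theta, x)) + theta_0
            if y * score <= 0:  # wrong side of, or exactly on, the boundary
                theta = [t + y * xi for t, xi in zip(theta, x)]
                theta_0 += y
    return theta, theta_0


def predict(theta, theta_0, x):
    """Classify x as +1 or -1 by the sign of the linear score."""
    score = sum(t * xi for t, xi in zip(theta, x)) + theta_0
    return 1 if score > 0 else -1


# Made-up, linearly separable toy data with labels in {+1, -1}.
points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
theta, theta_0 = train_perceptron(points, labels)
print(theta, theta_0)
```

Because the toy data are linearly separable, the updates stop once every point is on the correct side of the boundary.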

TARGET HK: A Quick Dive Into China’s Disinformation Campaign On Twitter

This is a quick dive into the trove of Chinese state troll tweets released by Twitter on Aug 19. More to come in the coming days and weeks. An example of a Chinese state troll tweet exposed by Twitter on Aug 19. On August 19, Twitter dropped a new trove of state troll tweets that the …

Boston Job Market for Data Analysts and Scientists : August 2019 Update

Most Hiring Companies, Top Tools & Tech, and More. Introduction: This is an August 2019 update of my original project where I simply aim to explore the job market for data analysts and data scientists in the Greater Boston Area. These visuals were produced only from job listings posted on Indeed with the search term …

GeoVec: word embeddings for geosciences

We can see that it is organised by layers and contains details such as colour, presence of roots, descriptions of the pores, textural class (estimated proportion of clay, silt and sand), etc. Most of the time the descriptions follow some recommended format but they might contain more or less free-form text depending on the study. …

KL Divergence Python Example

As you progress in your career as a data scientist, you will inevitably come across the Kullback–Leibler (KL) divergence. We can think of the KL divergence as a kind of distance measure (although it isn’t symmetric) that quantifies the difference between two probability distributions. One common scenario where this is useful is when we are working with a …
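
For two discrete distributions the KL divergence is the sum of p(i) · log(p(i) / q(i)) over all outcomes. The sketch below implements that formula directly with the standard library. The two distributions are made up, and the code assumes q is non-zero wherever p is non-zero.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]  # made-up distribution, e.g. a fair coin
q = [0.9, 0.1]  # made-up distribution, e.g. a heavily biased coin

print(kl_divergence(p, q))  # divergence of q from p
print(kl_divergence(q, p))  # a different number: the measure is not symmetric
print(kl_divergence(p, p))  # identical distributions give zero
```

Swapping the arguments changes the result, which is the asymmetry mentioned in the teaser.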

Detecting and modeling outliers with PyOD

As the name suggests, outliers are data points that differ significantly from the rest of your observations. In other words, they are far away from the average path of your data. In statistics and Machine Learning, detecting outliers is a pivotal step, since they might affect the performance of your model. For instance, imagine you want to …

Deep Learning and Momentum Investing

V. Test Set Results and Interpretability of Predictions. A. Out-of-Sample Results. First, to gauge the model’s ability to generalize on unseen data, let’s have a look at the test set loss. Figure 5 plots the ensemble loss relative to its validation loss (dashed black line normalized to 1). The red line draws the average loss …

Making PATE Bidirectionally Private

This guide is based on this repo. Some sections of the code will be skipped or modified for readability of the article. Initial Setup: First, we need to import the necessary libraries. This guide assumes all libraries are already installed locally. We’re declaring the necessary libraries and hooking Syft with Torch. To demonstrate how PATE …

Data Science Roles: A Classification Problem

Netflix relies on data to deliver personalized experiences for 130 million Netflix members worldwide. According to the Netflix Tech Blog: Every day more than 1 trillion events are written into a streaming ingestion pipeline, which is processed and written to a 100PB cloud-native data warehouse. And every day, our users run more than 150,000 jobs …

How to Prepare for Your Data Engineering Interview

You should feel very accomplished if you get to the on-site interview, but the hardest part is yet to come! On-sites can be grueling affairs of interviewing with 4–10 people in 3–6 hours, especially if you’re not prepared. Knowing what to expect and doing realistic preparation beforehand go a long way toward reducing fear and …

You Don’t Have to be Struck by Lightning to Win the Lottery

A few weeks ago while in lecture, I was asked the following question: “What’s the likelihood of making a living by playing the lottery?” Not very high, you think? Well, in 2005, a group of MIT students got together and formed a betting syndicate. They had found the game they wanted to bet on, calculated …

Intelligent Loan Selection for Peer-to-Peer Lending

Automatic Investing on Lending Club Using a Neural Network while Controlling Risk in Loan Selection. In this article I describe how to train a neural network to evaluate loans that are offered on the crowd lending platform Lending Club. I also cover how to test the model, how to adjust the risk in loan selection, …

The Ultimate Guide to using the Python regex module

The first thing we need to learn while using regex is how to create patterns. I will go through some of the most commonly used patterns one by one. As you would think, the simplest pattern is a simple string.

    import re

    # Count how many times the literal pattern occurs in the string
    pattern = r'times'
    string = "It was the best of times, it was the worst of times."
    print(len(re.findall(pattern, string)))

But …

spaCy Basics

A guide for getting started with NLP and spaCy. A major challenge of text data is extracting meaningful patterns and using those patterns to find actionable insights. NLP can be thought of as a two-part problem: Processing. Converting the text data from its original form into a form the computer can understand. This includes data …

Simulate Images for ML in PyBullet — The Quick & Easy Way

When applying deep Reinforcement Learning (RL) to robotics, we are faced with a conundrum: how do we train a robot to do a task when deep learning requires hundreds of thousands, even millions, of examples? To achieve 96% grasp success on never-before-seen objects, researchers at Google and Berkeley trained a robotic agent through 580,000 real-world …

Run Amazon SageMaker Notebook locally with Docker container

The main aim of the local Docker container is to preserve as many as possible of the most important features of the AWS-hosted instance while enhancing the experience with the ability to run locally. The following features have been replicated: Jupyter Notebook and Jupyter Lab. This is simply taken from Jupyter’s official Docker images with a …