Virtual, Headless, and Distributed (Oh My!)

Fearless Web Scraping with Python in DataLab Notebooks This post empowers the Pythonista with a complete framework to explore the world of data on the internet, all behind randomized proxy servers in a fast parallelized sequence, while protecting your company’s immutable IP from curious eyes and other potential trolls. With this new outlet, the reader is … Read more

We Must Prevent Data Pseudoscience Before It’s Too Late

A Beacon of Hope: The Hypatic Oath Hypatia at the Haymarket Theatre, H. M. Paget via Wikimedia Commons Hypatia was a philosopher, mathematician, and astronomer who lived in Alexandria, Egypt around 400 CE. While much of her legend is most likely apocryphal, it is believed that Hypatia’s death resulted in the burning of the Library of … Read more

3 Real Life Machine Learning Examples

Photo by Adeolu Eletu on Unsplash No matter if I’m speaking to a client, a student, or a distant family member, people always ask me for examples of how I’ve applied Machine Learning in the real world. It seems that even though we’re being bombarded by articles and tutorials, some context is missing. In this … Read more

An Orwellian Approach to the Litter Problem

Photo by Paweł Czerwiński on Unsplash Using computer vision to detect someone missing the garbage Anyone who has lived in an urban environment knows how filthy it can be. No matter the effort exerted by municipalities, trash finds a way to roll through cities like tumbleweeds. Simple solutions involve sending individuals with trash pickers to decontaminate … Read more

It’s OK to use spreadsheets in data science

Because they’re great in a bunch of messy sub-optimal data science contexts. With all the great sophisticated data tools that exist out there these days, it’s easy to think that spreadsheets are too primitive for use in serious data science work. The fact that there’s literally 20+ years of literature cautioning people about the evils … Read more

Can Artificial Intelligence Help Medical Decision Making?

Data Collection As an independent high school researcher, I do not have the legal and professional qualifications to obtain real medical data. To overcome this, I created a chemotherapy treatment simulation based on a mathematical model that represents the change in a patient’s cancer progression given current physiological state and the applied chemotherapy dosage. The … Read more

Data Types for Data Sciences

Big Data and Data Science are now on everyone’s mind. But not everyone clearly understands that not all data is the same, or has a clear vision of the types of applications and technologies available from Data Science. Data Science, Artificial Intelligence and Machine Learning are often considered roughly equivalent. It is critical to … Read more

Deep Learning and Doughnuts

Manifold learning Under the manifold assumption, real-world high-dimensional data concentrates close to a non-linear low-dimensional manifold [2]. In other words, data lies approximately on a manifold of much lower dimension than the input space, a manifold that can be retrieved/learned [8]. The manifold assumption is crucial in order to deal with the curse of dimensionality: … Read more

Fiction Today Is Reality Tomorrow

My concept of reality has always had boundaries. Moving outside of those boundaries in the past has been classified as “science fiction”. I have noticed during my lifetime those boundaries, considered absolute, have been eroded away more than once. I see these constraints as being softer now, where what many consider impossible today will be … Read more

Is Data Science a BI on steroids?

I have been in the Business Intelligence industry for 10 years and have worn many hats, from SAP BI Developer to SAP BI Solution Architect, before becoming a Data Scientist. I have first-hand experience in both BI and Data Science, and I have gone through the transformation from one to the other. In these articles I’d like to … Read more

How Does Linear Regression Actually Work?

Training The Linear Regressor To get the technicalities out of the way: what I described in the previous section is referred to as Univariate Linear Regression, because we are trying to map one independent variable (x-value) to one dependent variable (y-value). This is in contrast to Multivariate Linear Regression, where we try to map multiple … Read more

A Complete Exploratory Data Analysis and Visualization for Text Data

How to combine visualization and NLP in order to generate insights in an intuitive way Visually representing the content of a text document is one of the most important tasks in the field of text mining. As a data scientist or NLP specialist, we not only explore the content of documents from different aspects and … Read more

TensorFlow Dev Summit 2019 wrap-up

For the second consecutive year, I was lucky enough to attend the TensorFlow Dev Summit on March 6th to 7th at the Google Event Center in Sunnyvale, California, USA. Last year, I was impressed by the organization, the venue, the schedule, and the speakers, and I think this year was even better. So once again, congratulations to the … Read more

Tree-Based Methods: Regression Trees

This article gives a detailed review of the Decision Tree algorithm as used for regression tasks. At their core, Decision Tree models are nested if-else conditions. The result is much easier to interpret than that of the Least Squares approach, but there is a considerable loss of accuracy involved. To overcome that, we use strategies like Bagging, Boosting … Read more

Artificial Intelligence and Society

A look at the impact of AI within society. Introduction Press the pause button! Artificial Intelligence (AI) continues to be a growing focus in the media. An agenda gathering momentum like the cloud did, particularly in the business world. On a global path of technology innovation, AI may seem the next logical step towards progress. Computing … Read more

Deep learning for Arabic part-of-speech tagging

Introduction In this post, I will explain the Long Short-Term Memory network (aka LSTM) and how it’s used in natural language processing to solve the sequence modeling task, while building an Arabic part-of-speech tagger based on the Universal Dependency Tree Bank. This post is part of a series on building a Python package for Arabic natural language … Read more

Understanding the Mathematics behind Gradient Descent.

Derivatives Machine learning uses derivatives in optimization problems. Optimization algorithms like gradient descent use derivatives to decide whether to increase or decrease the weights in order to increase or decrease an objective function. If we are able to compute the derivative of a function, we know in which direction to proceed to minimize it. Primarily … Read more
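
To make the idea in this teaser concrete, here is a minimal sketch (not the article's own code) of gradient descent on a single weight. The objective f(w) = (w - 3)^2, the learning rate, and the names `gradient_descent` and `grad_f` are all assumptions chosen for illustration. The sign of the derivative tells us which way to move the weight: we step against it to reduce the objective.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Minimize a 1-D function by repeatedly stepping against its derivative."""
    w = w0
    for _ in range(steps):
        # Positive derivative -> decrease w; negative derivative -> increase w.
        w -= learning_rate * grad(w)
    return w


def grad_f(w):
    """Derivative of the hypothetical objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_min = gradient_descent(grad_f, w0=0.0)
print(w_min)  # should land very close to the true minimizer, w = 3
```

With this step size each update shrinks the distance to the minimizer by a constant factor, so the result does not depend much on the starting point; a learning rate that is too large would instead overshoot and diverge.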

Speeding Up and Perfecting Your Work Using Parallel Computing

A detailed guide to Python multiprocessing vs. PySpark mapPartition In science, behind every achievement is grinding, rigorous work. And success is unlikely to happen with one attempt. As a data scientist, you probably deal with huge amounts of data and computation, and perform repeated tests and experiments in your day-to-day work. Though you don’t want to … Read more

Clearing air around “Boosting”

3) AdaBoost Photo by Mehrshad Rajabi on Unsplash This is the first Boosting algorithm that made a huge mark in the ML world. It was developed by Freund and Schapire (1997), and here is the paper. In addition to sequentially adding models’ predictions (i.e. Boosting) it adds weights to each prediction. It was originally designed for … Read more

This New Technique Helps Build Autonomous, Self-Learning AI Agents that Passed the Pommerman…

The emergence of trends such as self-driving cars or drones has helped to popularize an area of artificial intelligence (AI) research known as autonomous agents. Conceptually, autonomous agents are AI that build knowledge in real time based on the characteristics of their surrounding environment as well as other agents. If we use the example of self-driving vehicles, … Read more

Introducing Mercury-ML: an open-source “messenger of the machine learning gods”

A messenger of the gods for machine learning workflows These are some of the very real problems that we at Alexander Thamm GmbH are frequently faced with when developing machine learning solutions for our clients. In recent times it became quite clear to us that we needed a library that could break down machine learning … Read more

Neural Network with TensorFlow: How to stop training using callback?

Photo by Samuel Zeller on Unsplash A useful hack with TensorFlow and Keras Introduction Often, when training a very deep neural network, we want to stop training once the training accuracy reaches a certain desired threshold. Thus, we can achieve what we want (optimal model weights) and avoid wasting resources (time and computation power). In this … Read more

Optimizing Jupyter Notebook: Tips, Tricks, and nbextensions

nbextensions The benefits of this extension are that it changes the defaults. To install nbextensions, execute the commands below in Anaconda Prompt: conda install -c conda-forge jupyter_contrib_nbextensions, then conda install -c conda-forge jupyter_nbextensions_configurator. Alternatively, you can also install nbextensions using pip: pip install jupyter_contrib_nbextensions. Run pip show jupyter_contrib_nbextensions to find where notebook extensions are installed. Run jupyter contrib … Read more

Data Scientist’s Guide to Summarization

A text summarization tutorial for beginners Team Members: Richa Bathija, Abhinaya Ananthakrishnan, Akhilesh Reddy (@akhilesh.narapareddy), Preetika Srivastava (@preetikasrivastava30) Did you ever face a situation where you had to scroll through a 400-word article only to realize that there are only 4 key points in the article? All of us have been there. In this age … Read more

Financial Machine Learning Part 1: Labels

Setting up a supervised learning problem Introduction In the previous post, we explored several approaches for aggregating raw data for a financial instrument to create observations called bars. In this post, we will focus on the next crucial stage of the machine learning pipeline: labeling observations. As a reminder, labels in machine learning denote the outcomes of … Read more

Finding similar images using Deep learning and Locality Sensitive Hashing

A simple walkthrough on finding similar images through image embedding by a ResNet-34 using FastAI & PyTorch. Also doing fast semantic similarity search in huge collections of image embeddings. Final output with similar images given an input image in Caltech 101 In this post, we are trying to achieve the above result, i.e., given an image, … Read more

Bayesian Modeling of Pro Overwatch Matches with PyMC3

Photo by AC De Leon on Unsplash Professional eSports are becoming increasingly popular, and the industry is growing rapidly. Many of these professional game leagues are based on games that have two teams that battle it out. Call of Duty, League of Legends, and Overwatch are all examples. Although these are comparable to traditional team sports, … Read more

The path to being the best data analyst: Help, Build, then Do.

The core competency of a data analyst is “Speed to Insight”. A data team often consists of many people, with many skills, using potentially overlapping techniques. This focus on speed distinguishes this role from data scientists or statisticians. Today I’m focused on answering questions about the business or about how users behave. I’ll refer to … Read more

Six Recommendations for Aspiring Data Scientists

Building experience before landing a job Data science is a field with a huge demand, in part because it seems to require experience as a data scientist to be hired as a data scientist. But many of the best data scientists I’ve worked with have diverse backgrounds ranging from humanities to neuroscience, and it … Read more

Taking Google Sheets to (a) Class.

I am currently building a Flask app for teachers. Since Google Drive has been adopted by teachers, Google Sheets is used by them as well. One of my app’s features is to let teachers easily copy and paste the sheet link into the app and submit it through a form. It will then convert it … Read more

Machine Learning Models as Micro Services in Docker

One of the biggest underrated challenges in machine learning development is deploying trained models in production, and doing so in a scalable way. One joke I have read about it is “Most common way, Machine Learning gets deployed today is powerpoint slides :)”. Why Docker? Docker is a containerization platform which packages an application … Read more

How to setup the PySpark environment for development, with good software engineering practices

In this article we will discuss how to set up our development environment in order to create good quality Python code and how to automate some of the tedious tasks to speed up deployments. We will go over the following steps: set up our dependencies in an isolated virtual environment with pipenv; how to set up … Read more

Convolutional Neural Network: A Step By Step Guide

“Artificial Intelligence, deep learning, machine learning — whatever you’re doing if you don’t understand it — learn it. Because otherwise, you’re going to be a dinosaur within three years” — Mark Cuban, a Serial Entrepreneur Hello and welcome, aspirant! If you are reading this and interested in the topic, I’m assuming that you are familiar with the basic concepts of deep … Read more

Let’s build an Article Recommender using LDA

Due to keen interest in learning new topics, I decided to work on a project where a Latent Dirichlet Allocation (LDA) model can recommend Wikipedia articles based on a search phrase. This article explains my approach towards building the project in Python. Check out the project on GitHub below. Structure Photo by Ricardo Cruz on Unsplash … Read more

Object Detection On Aerial Imagery Using RetinaNet

ESRI Data Science Challenge 2019 3rd place solution (Left) the original image. (Right) Car detections using RetinaNet, marked in green boxes Detecting cars and swimming pools using RetinaNet Introduction For tax assessment purposes, surveys are usually conducted manually on the ground. These surveys are important to calculate the true value of properties. For example, having a swimming … Read more

Light on Math ML: Attention with Keras

Why Keras? With the unveiling of TensorFlow 2.0 it is hard to ignore the conspicuous attention (no pun intended!) given to Keras. There was greater focus on advocating Keras for implementing deep networks. Keras in TensorFlow 2.0 will come with three powerful APIs for implementing deep networks. Sequential API — This is the simplest API where you … Read more

Why you should be a Generalist first, Specialist later as a Data Scientist?

So what’s a Generalist and a Specialist? Before going any further, let’s first understand what we mean when we talk about being a generalist and a specialist in data science. A generalist is someone that has knowledge in many areas whereas a specialist knows a lot in one area. Simple as that. Particularly in data … Read more

Who are Independent Voters?

The differences between people who identify with a party “not very strongly” and those who identify as independent but “are closer to” a party. The data we are using is polling conducted by YouGov Blue and from the progressive data organization Data For Progress. It consists of 3,215 voters and is then weighted by “age, sex, … Read more

Data Scientist Knowledge and Skills

A data scientist creates knowledge from data and has skills in statistics, programming, and the domain under study. A data scientist creates knowledge from data through quantitative and programming methods and knowledge of the domain under study. Data science is the field in which data scientists work. A data scientist should have skills and knowledge in … Read more

Robotic Control with Graph Networks

Exploiting relational inductive bias to improve generalization and control Machine learning is helping to transform many fields across diverse industries, as anyone interested in technology undoubtedly knows. Things like computer vision and natural language processing were changed dramatically due to deep learning algorithms in the past few years, and the effects of that change are … Read more

PCA and SVD explained with numpy

How exactly principal component analysis and singular value decomposition are related, and how to implement them using numpy. Principal component analysis (PCA) and singular value decomposition (SVD) are commonly used dimensionality reduction approaches in exploratory data analysis (EDA) and Machine Learning. They are both classical linear dimensionality reduction methods that attempt to find linear combinations of … Read more
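
As a rough illustration of the relationship this teaser describes (a sketch under stated assumptions, not the article's own implementation), the snippet below computes PCA in two ways with numpy: from the eigen-decomposition of the covariance matrix, and from the SVD of the centered data. The synthetic data, the mixing matrix, and the helper name `pca_two_ways` are hypothetical. For centered data with n rows, the covariance eigenvalues should equal the squared singular values divided by n - 1, and the principal directions should match the right singular vectors up to sign.

```python
import numpy as np


def pca_two_ways(X):
    """Return PCA variances and directions via covariance eigh and via SVD."""
    Xc = X - X.mean(axis=0)           # center each feature
    n = Xc.shape[0]

    # Route 1: eigen-decomposition of the sample covariance matrix.
    cov = Xc.T @ Xc / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending, like SVD
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Route 2: singular value decomposition of the centered data.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    svd_variances = S ** 2 / (n - 1)

    return eigvals, eigvecs, svd_variances, Vt.T


# Hypothetical correlated 3-D data for the comparison.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

eigvals, eigvecs, svd_variances, svd_dirs = pca_two_ways(X)
print(np.allclose(eigvals, svd_variances))
print(np.allclose(np.abs(eigvecs), np.abs(svd_dirs), atol=1e-6))
```

The sign ambiguity is why the directions are compared through absolute values: each eigenvector and singular vector is only defined up to a factor of -1.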