Book Titles Are Getting Longer

A Data Analysis of Trends in the World of Print. Photo from Pixabay. A few weeks ago I noticed popular literary agent DongWon Song express what I had suspected for a while: book titles are getting longer. This trend seemed interesting since the rest of media seems to be going in the opposite direction. We know for … Read more

Data Mining for Sustainable Data Management

In today’s rapidly expanding technological world, where smartphones, tablets, and PCs have become an inseparable part of human life, the power of information and data is finally being realized. Today, as we live in the ‘information age’, data volumes are exploding; more data has been created in the … Read more

How Artificial Intelligence can Revolutionise Agriculture

Photo by Joao Marcelo Marques on Unsplash. The agricultural sector accounts for approximately 12% of global emissions, but in lower income countries it can account for over 50% of national emissions. Thankfully, the agricultural sector’s emissions decreased by 20% between 1990 and 2015. The majority of these reductions have come from technological adaptations to … Read more

The Misuse of Big Data Algorithms in the United States Criminal Justice System

America has the largest prison population in the world. With more than 2.3 million inmates, the US incarcerates more than 25% of the world’s prison population, even as its general population only accounts for 5%. And while about 1% of the American population is incarcerated, no group is more targeted than black men. 1 in … Read more

What is Out of Bag (OOB) score in Random Forest?

This blog attempts to explain the internal functioning of oob_score when it is set to True in the “RandomForestClassifier” of the scikit-learn framework. It describes the intuition behind the Out of Bag (OOB) score in Random Forest, how it is calculated, and where it is useful. In the applications that require good interpretability of … Read more
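The teaser is cut off before any calculation is shown, so here is a minimal standard-library sketch of the idea behind out-of-bag samples. It is not scikit-learn's implementation. Each tree in a random forest trains on a bootstrap sample, and the rows that sample never draws can serve as a held-out set for that tree. The function name `bootstrap_split` and the sample size of 1000 are assumptions made for illustration.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample of row indices and return (in_bag, out_of_bag)."""
    # Sample n_samples indices with replacement, as bagging does for each tree.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Rows never drawn are "out of bag" for this tree.
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the run is repeatable
in_bag, out_of_bag = bootstrap_split(1000, rng)
oob_fraction = len(out_of_bag) / 1000
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

For large samples the out-of-bag fraction approaches 1/e, roughly 37% of the rows. An OOB score scores each row using only the trees that did not see it during training.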

Nvidia’s new Data Science Workstation — a review and benchmark

Data science is hot. The past several years have seen a massive surge in interest for Data Science, so much so that many companies are reorienting their business strategy and branding themselves as “data driven”. There’s a good reason for this and it’s no secret: more data gives us bigger opportunities to extract insights and create … Read more

The Intuition Behind Correlation

Source: Global Shark Attack File & The World Bank. What does it really mean for two variables to be correlated? We’ll answer that question in this article. We’ll also develop an intuitive feel for the equation for Pearson’s correlation coefficient. When you dive into the sea of knowledge that is data science, one of the first … Read more
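The teaser mentions the equation for Pearson's correlation coefficient without showing it. The following sketch uses the standard definition, the covariance of the two variables divided by the product of their standard deviations. It is not taken from the article, and the function name `pearson_r` is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations (numerator) and sums of squared deviations (denominator).
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    if sq_dev_x == 0 or sq_dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


# Two perfectly linear toy series: one rising together, one moving oppositely.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```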

Optimizing Customers’ Journey in Omni-Channel Retailing

Shoppers are increasingly omni-channel. Image from Pixels. Hi, I am a Marketing Science Analyst at a global beauty-retail company. The company that I work for is one of the most sophisticated omni-channel retailers in the region. With the advent of data analytics, omni-channel retailing is at one of its most exciting frontiers in history. The … Read more

Quantifying Chatroom Toxicity

Natural Language Preprocessing in a Nutshell. Once we have our data, we need to go through the cleaning process. For text data sets in general, this often includes: removal of punctuation; removal of stop words (things like “the”, “this”, “what”); and stemming/lemmatization (reducing words down to their base form by removing suffixes like “-ed”, “-ing”). In addition, … Read more
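The three cleaning steps named above can be sketched with the standard library alone. This is a toy illustration, not the article's pipeline. The stop-word set, the suffix list, and the function names are all assumptions, and the suffix stripping is far cruder than a real stemmer or lemmatizer.

```python
import string

# Tiny hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "this", "what", "a", "is", "of"}
# Suffixes removed by the crude stemmer below (assumption for illustration).
SUFFIXES = ("ing", "ed")


def crude_stem(word):
    """Strip one known suffix if enough of the word would remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation, drop stop words, then stem each token."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("What is this? The player jumped, laughing!"))
# -> ['player', 'jump', 'laugh']
```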

7 Tips for Dealing With Small Data

Because more often than not, that’s what you’re gonna get. We often hear that Big Data is the key to building successful machine learning projects. This is a major problem: Many organizations won’t have the data you need. How can we prototype and validate machine learning ideas without the most essential raw material? How can we … Read more

Political Python

Scraping Congressional documents with Scrapy. No, not that kind of python. The 2020 Democratic candidates for President will face off in debates starting Wednesday. Many of them are current or former members of Congress. All of them are vying to lead the country. As voters, wouldn’t it be extraordinary if we had a record of everything that … Read more

Neural Networks with push button, AI for all

Neural Architecture Search with NASBench from Google Research: can we design network architectures automatically, instead of relying on expert experience and knowledge? Motivation: Recent advances in neural architecture search (NAS) demand tremendous computational resources, which makes it difficult to reproduce experiments and imposes restrictions on researchers who do not have access to large-scale computation. Neural … Read more

Bested by AI: What Happens When AI Wins?

AI is already better than you at certain tasks. Here are some thoughts on how to cope. Photo by JESHOOTS.COM on Unsplash A few months ago, I sent my dad the article 20 Top Lawyers Were Beaten by Legal AI in a Controlled Study, which (as the title suggests) discusses a study on how AI … Read more

Visualization of Information from Raw Twitter Data — Part 1

Let’s explore what kind of information we can easily retrieve from raw Twitter data! Hello everybody! In the previous posts we explored how to efficiently download data using Python and both of the Twitter APIs: the Streaming API to download tweets produced in real time, and the REST API to download historical information like user timelines … Read more

Limericking part 1: context and haikus.

Watson: Jeopardy champ and future poet laureate? One of the most exciting fields within machine learning and data science is natural language processing. Having a machine be able to parse and generate plausibly human sounding language is both of enormous practical value and also notoriously difficult. Human language is messy, filled with the sort of … Read more

The best laid plans

This piece is adapted from a talk I gave at Visualising Data London, Microsoft Reactor on 20th June, 2019. I had an odd journey into data journalism. It goes something like this — Five years ago I was surrounded by boxes and wrapping up my role as a design consultant for Embarq, a sustainable transport and research … Read more

Simple Web Scraping with Python’s Selenium

Let’s get started! To begin we need to install a Webdriver, which we will control through Python using the Selenium module. The Webdriver that we will be using during this tutorial will be the Chromedriver, which can be downloaded by navigating to this link. Next, we need to install Selenium. In my Windows command prompt, I … Read more

Alternating Least Square for Implicit Dataset with code

In the era of big data and analytics, the power of a data-driven business has been increasing exponentially. Greater integration of AI and machine learning has played a vital role in the development of systems that can benefit both the business and the business user. Moreover, recommendation systems add an edge to digital businesses. Below … Read more

Monte Carlo Simulation in R with focus on Financial Data

Generating Random Distributions. The only missing piece in the previous cases is how one would generate uniform random and normal random distributions. We therefore look to cover algorithms that generate such uniform random distributions, and also methods to transform these into other distributions such as normal distributions. The numbers that we will be generating in … Read more
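The teaser does not say which method the article uses to turn uniform numbers into normal ones. One well-known option is the Box-Muller transform, sketched below in Python rather than the article's R. The helper names are assumptions, and the uniforms come from Python's own generator rather than a hand-written one.

```python
import math
import random


def box_muller(rng):
    """Turn two independent uniform draws into two independent N(0, 1) draws."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so log(u1) is always defined
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normal_samples(n, seed=0):
    """Return n standard-normal samples built from uniform random numbers."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        samples.extend(box_muller(rng))
    return samples[:n]


draws = normal_samples(20000, seed=1)
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(f"mean={mean:.3f} variance={variance:.3f}")
```

With enough draws, the sample mean should sit near 0 and the sample variance near 1.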

Exploring New York City water tank inspection data.

I spy over 12 water tanks in this photo. Photo by author. I have always been interested in building water tanks. Despite all the advances in science, these wooden barrels which hold a building’s water supply have remained largely unchanged for the last century. In New York City they are necessary to maintain water pressure for … Read more

What is Robustness in Statistics? A Brief Intro to Robust Estimators

In the presence of outliers, traditional methods are not efficient in determining process parameters, due to an increase in bias and variance; therefore, outlier-resistant statistics are employed to remove outliers before estimating parameters. How do we measure robustness? The robustness of an estimator can be measured by its local stability, assessed via the influence function (IF), and global … Read more
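A small comparison makes the point about outlier resistance concrete. The sketch below contrasts the mean and standard deviation with two common robust alternatives, the median and the median absolute deviation (MAD). The teaser does not say which estimators the article covers, so these are assumed examples, and the measurements are invented toy data.

```python
import statistics


def mad(values):
    """Median absolute deviation: median distance of the values from their median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


# Invented toy measurements; the last list adds one gross outlier.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [100.0]

for name, data in (("clean", clean), ("contaminated", contaminated)):
    print(
        f"{name:>12}: mean={statistics.mean(data):.2f} "
        f"stdev={statistics.stdev(data):.2f} "
        f"median={statistics.median(data):.2f} mad={mad(data):.2f}"
    )
```

A single bad reading drags the mean and standard deviation far from the bulk of the data, while the median and MAD barely move.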

Fashion product image classification using Neural Networks | Machine Learning from Scratch (Part…

TL;DR: Build a Neural Network in Python from scratch. Use the model to classify images of fashion products into 1 of 10 classes. We live in the age of Instagram, YouTube, and Twitter. Images and video (a sequence of images) dominate the way millennials and other weirdos consume information. Having models that understand what images show can … Read more

Tolerance Stackups

And How to Use Monte Carlo Simulations Instead. What is a Tolerance Stackup? Imagine you have 2 pucks that you want to fit tightly in an opening. If you just needed these three parts to fit together once, you could measure how tall the pucks were and then make your cut to that size and … Read more
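The two-puck setup lends itself to a short Monte Carlo sketch: draw each dimension from a distribution, add up the stack, and count how often the parts fit. Every number below is invented for illustration, including the nominal sizes, the spreads, and the acceptable gap. Treating manufacturing variation as normally distributed is also an assumption.

```python
import random


def fit_fraction(n_trials=100_000, seed=0):
    """Estimate how often two pucks fit an opening with a small positive gap."""
    rng = random.Random(seed)
    fits = 0
    for _ in range(n_trials):
        # Hypothetical dimensions in mm: (nominal value, standard deviation).
        puck_a = rng.gauss(10.0, 0.05)
        puck_b = rng.gauss(10.0, 0.05)
        opening = rng.gauss(20.1, 0.05)
        gap = opening - (puck_a + puck_b)
        # "Fits tightly" is taken to mean a gap between 0 and 0.2 mm (assumption).
        if 0.0 <= gap <= 0.2:
            fits += 1
    return fits / n_trials


print(f"estimated fraction of assemblies that fit: {fit_fraction():.3f}")
```

A worst-case stackup would add all three tolerances at their extremes. The simulation instead estimates how often those extremes coincide.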

Learning from Machines: The Data Supply Chain

One of the most useful and intuitive concepts I learned from lean manufacturing is to reconsider your cookie-cutter processes. Being strategic at a more granular level (i.e. at the raw material or parts level) helps organizations to better manage product inventory whilst reducing overhead cost. When your product is built from data, the central idea … Read more

American Labor is in Free Fall Without a Parachute, and We Think We’re Still Safely in the Plane

Americans love to work. It provides an income to support the things we love, and it gives us a reason to wake up in the morning. It gives us a sense of purpose, and many of us define ourselves by what we do. We are Sales Associates, Marketing Managers, Business Owners, Tailors, Accountants, Carpenters, Engineers, … Read more

Collaborative filtering to “predict” the efficacy of a drug

Yu Liu, Jun 24. Motivation: In drug development, screening is a significant part of the early discovery process. The purpose of this screening is to find molecules that bind the target strongly and specifically enough to have clinical benefit. Scientists have toiled in the lab to synthesize all kinds of molecules, hoping to improve their … Read more

Neural Networks for Music Generation

Maia approach for Music Generation Maia is a research project I developed at UC Berkeley, along with Edward T. and Louis R., to give one possible solution to this broad challenge. Background We started out with the intention of creating an AI that could complete Mozart’s unfinished composition Lacrimosa — the eighth sequence of the Requiem — which was … Read more

Everyday Life and Microprediction

B2C Prediction and Safety Tools With Artificial Intelligence. Do we have the right as individuals to use artificial intelligence to attempt to predict behaviour in everyday life? As an example, should you be able to predict the risk of hiring a specific babysitter? In 2018 Predictim advertised a service that promised to vet possible babysitters … Read more

CycleGAN: How Machine Learning Learns Unpaired Image-To-Image Translation

I recently read the CycleGAN paper, which I found very interesting because CycleGAN models have the incredible ability to accurately change images into something they’re not (e.g. changing a picture of a horse into a picture of a zebra). Very cool. Let’s dive into how it works. Some of CycleGAN’s applications (left to right): … Read more

Importance of Exhaust Data in Data Science

Image Credit: NASA/JPL-Caltech. I was fascinated when I first heard the term Exhaust Data. There are a lot of definitions floating around on the same topic, and I wanted to dig an inch deeper. In simple words, exhaust data is data that is generated without a specific purpose in mind and immediately might not … Read more

Marketing A/B Testing at Zalando

Zalando Office Tamara-Danz-Straße, Berlin-Friedrichshain. In-Depth Analysis: Enabling Location Based A/B Tests Using Cluster Analysis. Contributors: Carsten Rasch, Thomas Perl, Martin Kasten, Jean de Bressy. The goal of marketing A/B test analysis at Zalando is to derive the incremental impact of marketing actions. The results of these analyses form the basis for an optimal allocation of budget … Read more

Portable Computer Vision: Tensorflow 2.0 on a Raspberry Pi

Part 8: Deploy Pre-trained Model (MobileNetV2). Live Demo (using TensorFlow 2.0). I used this code to sanity-check the TensorFlow 2.0-beta0 wheel that I cross-compiled for my Raspberry Pi 3.
1. SSH into your Raspberry Pi: $ ssh raspberrypi.local
2. Start a new tmux session: pi@raspberryi:~ $ tmux new-session -s mobilenetv2
3. Split the tmux session vertically by … Read more

Predicting Micronutrients using Neural Networks and Random Forest (Part 2)

UNICEF wants you to help them predict important nutrients in foods using the power of machine learning. Photo by Julien R on Unsplash. Welcome back! Glad you can join me in part 2 of this series of “Predicting Micronutrients using Neural Networks and Random Forest.” In the previous blog post, I mentioned that UNICEF has … Read more

Can Machine Learning Read Chest X-rays like Radiologists? (Part 2)

Using adversarial networks to achieve human-level performance for chest x-ray organ segmentation. This is Part 2 of a two-part series. See Part 1 for challenges and clinical applications of chest x-ray (CXR) segmentation, and how medical imaging, and CXRs specifically, critically need AI to scale. Recap from Part 1: The task of chest X-ray (CXR) … Read more

People Analytics

Human Bias in recruitment selection: Case Study I. Credits: Jupiterimages. INTRODUCTION: The startup XYZ has dramatically increased its workforce in the last year. Now, it wants to incorporate a People Analyst as part of its new strategy to keep on scaling its business while maintaining an accurate hiring process that will positively contribute to future company decisions. As … Read more

Kruskal’s Minimum Spanning Tree Implementation

A graph is a non-linear data structure that has nodes and edges. A minimum spanning tree is a set of edges in an undirected weighted graph that connects all the vertices with no cycles and minimum total edge weight. For finding the minimum spanning tree, Kruskal’s algorithm is the simplest one. This content is about implementing the … Read more
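Kruskal's algorithm itself is compact: sort the edges by weight, then add each edge unless it would close a cycle. The sketch below is one common way to write it, using a union-find structure to detect cycles. It is not the article's implementation. The `(weight, u, v)` edge format and the example graph are assumptions.

```python
def kruskal(n_vertices, edges):
    """Return (total_weight, mst_edges) for edges given as (weight, u, v) tuples."""
    parent = list(range(n_vertices))  # union-find forest: each vertex is its own root

    def find(x):
        # Walk up to the root, shortening the path along the way (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst_edges, total_weight = [], 0
    for weight, u, v in sorted(edges):  # consider the cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # different components, so no cycle is created
            parent[root_u] = root_v
            mst_edges.append((u, v, weight))
            total_weight += weight
    return total_weight, mst_edges


# Hypothetical 4-vertex graph.
example_edges = [(1, 0, 1), (3, 0, 2), (2, 1, 2), (4, 2, 3), (5, 1, 3)]
print(kruskal(4, example_edges))
# -> (7, [(0, 1, 1), (1, 2, 2), (2, 3, 4)])
```

If the graph is disconnected, the function returns a minimum spanning forest, so the result has fewer than `n_vertices - 1` edges.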

Mobility Data, Feature Engineering and Hierarchical Clustering

The United States has one of the world’s largest automobile markets, second only to China. With 270.4 million registered vehicles on American roads as of 2017, there are millions of crashes every year. According to the National Highway Traffic Safety Administration, there were an estimated 7 million police-reported motor vehicle crashes in the US in … Read more

Artificial Intelligence in Supply Chain Management: Predictive Analytics for Demand Forecasting

Utilizing data to drive operational performance. Supply chain management (SCM) is critical in almost every industry today. Still, despite its importance, it hasn’t received the same amount of focus from AI startups and vendors as many other domains. However, given the vast amounts of data collected by industrial logistics, transportation and warehousing, this is an … Read more

Quickly Navigating Python Libraries With Ctags

A tutorial for using ctags to efficiently navigate Python libraries for data scientists. As a machine learning practitioner, I often use open-source machine learning libraries, such as fastai and scikit-learn. After working with these libraries for a while, you may reach the point where you want to do something that is not currently supported by a … Read more

Stand Up for Best Practices:

Misuse of Deep Learning in Nature’s Earthquake Aftershock Paper. Source: Yuriy Guts selection from Shutterstock. The Dangers of Machine Learning Hype: The number of practitioners of AI, machine learning, predictive modeling, and data science has grown enormously over the last few years. What was once a niche field defined by its blend of knowledge is becoming a rapidly growing … Read more

Leveraging the Present to Anticipate the Future in Videos

Predict future action labels instead of predicting pixel-level information. Summarization of a research paper by Facebook AI. Motivation: Anticipating actions before they are executed serves a wide range of practical applications, including autonomous driving and robotics. Prior work in this field requires partial observation of executed actions. In contrast, this blog concentrates on anticipating actions seconds before … Read more

Brilliant! SIM Cards & Traffic Management

Analysis and Results: After many cleansing and feature engineering steps, we end up with 11 columns and 9805 rows (IDs). The main features were the travel period of each vehicle and the distance, and thus the speed. The main challenge was dealing with noise in the coordinates, or GPS drifting! However, here I will display some plots … Read more

Naive Bayes Document Classification in Python

How well can I classify a philosophy paper based on its abstract? Naive Bayes is a reasonably effective strategy for document classification tasks even though it is, as the name indicates, “naive.” Naive Bayes classification makes use of Bayes’ theorem to determine how probable it is that an item is a member of a category. … Read more
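As a rough illustration of the idea, here is a minimal multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, written with the standard library only. It is not the article's code. The function names, the whitespace tokenizer, and the four toy "abstracts" are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count documents per class and words per class from (label, text) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(class)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set.
toy_docs = [
    ("ethics", "moral duty and virtue"),
    ("ethics", "virtue and moral obligation"),
    ("logic", "proof of valid inference"),
    ("logic", "valid proof and modal logic"),
]
model = train(toy_docs)
print(predict(model, "moral virtue"))  # -> ethics
print(predict(model, "valid proof"))   # -> logic
```

The "naive" part is the inner loop: each word contributes its log likelihood independently, as if words occurred without regard to one another.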