Smarter Pricing for Airbnb Using Machine Learning

Increasing host revenue with regression and time series analysis [This project was done as part of an immersive data science program called Metis. You can find the files for this project at my GitHub and the slides here. The final project is accessible here (interactive web app).] I recently designed a new approach to automatic … Read more Smarter Pricing for Airbnb Using Machine Learning

Getting started with Pandas time-series functionality

3 techniques to make your data analysis faster Pandas has exceptional features for analyzing time-series data, including automatic datetime parsing, advanced filtering capabilities, and several datetime-specific plotting functions. I find myself using those features almost every day, but it took me a long time to discover them: many of Pandas’ datetime capabilities are not immediately … Read more Getting started with Pandas time-series functionality

Should I buy a lottery ticket?

Lottery Ticket Analysis I analyzed past lottery data to decide whether to buy a lottery ticket, using statistics and probability. Photo by dylan nolte on Unsplash I often find myself deciding whether or not to buy a lottery ticket, especially for the Powerball New Year’s Eve draw. The reason why I feel hesitant in those moments … Read more Should I buy a lottery ticket?

Predicting Movie Profitability and Risk at the Pre-production Phase

Movie Data and Box Office Numbers In order to build my prediction algorithm, I gathered movie data from a couple of online sources. I obtained the bulk of my data from the Internet Movie Database (IMDb), which provides a set of files for free download. However, the IMDb files do not contain data on estimated movie … Read more Predicting Movie Profitability and Risk at the Pre-production Phase

How Do Conversational Agents Answer Questions?

NLP, Knowledge Graphs, and the Three Pillars of Intelligence Jibo, Echo/Alexa, Google Home The Three Pillars of Intelligence To Amazon, the reception for its voice agent, Alexa, was a big surprise. Apple’s Siri had put voice input onto smartphones. But here was a new class of device that you could shout at across the kitchen … Read more How Do Conversational Agents Answer Questions?

Oxford (Real) Farming Conference 2020

NLP: Sentiment Analysis, Word Embeddings and Topic Modelling of 3.8K tweets Last week, from the 7th to the 9th of January, Oxford hosted the well-established, traditional and businessy Oxford Farming Conference (OFC) and its antidote, the Oxford Real Farming Conference (ORFC). Both aim to connect actors involved in the agricultural and food sector to tackle the challenges … Read more Oxford (Real) Farming Conference 2020

Exploring the Future of Cloud Computing in 2020 and Beyond

Cloud computing has become a fundamental requirement for most organizations. With this in mind, cloud computing is massively on the rise in the current day and age. In fact, 81 percent of companies with 1,000 employees or more have a multi-platform strategy. The number is to rise to more than 90 percent by 2024. Between … Read more Exploring the Future of Cloud Computing in 2020 and Beyond

Site Planning for Market Coverage Optimization with Mobility Data

Commercial Activity Data: Points of Interest (POIs) This dataset provides information on POIs. Cells are enriched with the following data from this source: Number of competitors within a 250-meter buffer from the cell’s centroid. Number of POIs within a 250-meter buffer from the cell’s centroid. Here we use POIs as a proxy for commercial activity. … Read more Site Planning for Market Coverage Optimization with Mobility Data

The truth about the martingale betting system

I swear by the name of Science that the evidence I shall give shall be the truth, the whole truth, and nothing but the truth. About the simulation

    from random import *

    def roll():
        result = randint(1,36)
        results.append(result)

    results = []
    for i in range(1000000):
        roll()

The script simulates 1000000 roulette outcomes within a second. At each simulation, a random whole … Read more The truth about the martingale betting system
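As a minimal sketch, the same simulation can be written without a global list and with a seeded generator so that runs are repeatable. The function name `simulate_spins` and the `seed` parameter are hypothetical additions for illustration; the 1 to 36 range is kept exactly as in the script above.

```python
import random
from collections import Counter


def simulate_spins(n_spins, seed=None):
    """Return a list of n_spins random whole numbers between 1 and 36 inclusive."""
    rng = random.Random(seed)  # seeded generator: an assumption, for reproducibility
    return [rng.randint(1, 36) for _ in range(n_spins)]


results = simulate_spins(1_000_000, seed=42)
counts = Counter(results)  # how often each number came up
print(len(results), len(counts))
```

Returning the list rather than appending to a module-level `results` makes the function easier to reuse and test.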

How MonetDB/X100 Exploits Modern CPU Performance

Modern CPUs have undergone significant development. But how does MonetDB exploit this development to maximize its performance? Computer processors have developed significantly in the last three decades. This development involves not only the increasing number of transistors they hold but also the evolution of the architecture. Hence, an application needs to adapt to how the … Read more How MonetDB/X100 Exploits Modern CPU Performance

Preventing the Death of the Dataframe

Source: Disney A definition to save the dataframe from extinction Dataframes emerged from a specific need, but because so many diverse systems now call themselves dataframes, the term is on the verge of meaning nothing. In an effort to preserve the dataframe, we formalized the definition based on the original data model in our recent … Read more Preventing the Death of the Dataframe

Visualising spending behaviour through open banking and GIS

Financial habits have historically been something that people keep at the back of their minds, but with the rising amount of information and tools available, a new attitude to financial control is making transparent, digital banking increasingly popular. A new breed of financial institutions (such as Monzo, Starling, Revolut and N26) are leveraging digital products to … Read more Visualising spending behaviour through open banking and GIS

Download Email Attachment from Microsoft Exchange Web Services Automatically

Automating The Dull Routine With Python Learn to Handle Email Attachment Using Python Library Exchangelib Photo by Webaroo.com.au on Unsplash Do you need to download email attachments regularly? Do you want to automate this boring process? I know that feel bro. When I first came to my job, I was assigned a daily task: download … Read more Download Email Attachment from Microsoft Exchange Web Services Automatically

How to use the power of “WHY” to achieve what you want

The data science path is not easy. If you’re a data scientist reading this now, you’ll know what I mean. It’s tough. It’s an ever-changing field. It’s dynamic. It’s moving fast. In other words, you need to learn fast and adapt quickly to keep up-to-date with the latest trends and technology being used in the industry. The … Read more How to use the power of “WHY” to achieve what you want

The Galactic Island Hypothesis

New research provides a quantitative solution to Fermi’s paradox Simulated settlement trajectories, showing how civilizations could spread through the Galaxy (source) Compared to the age of the Milky Way Galaxy, our 200,000-year-old human species has been around for just the blink of an eye. The Milky Way is at least 10 billion years … Read more The Galactic Island Hypothesis

Hands-on Web Scraping: Building your Twitter dataset with python and scrapy

This assumes that you have some basic knowledge of python and scrapy. If you are interested only in generating your dataset, skip this section and go to the sample crawl section on the GitHub repo. Gathering tweet URLs by searching through hashtags To search for tweets, we will be using the legacy Twitter website. Let’s … Read more Hands-on Web Scraping: Building your Twitter dataset with python and scrapy

Guide to Dimensionality Reduction in single cell RNA-seq analysis

Image Source: Unsplash A major breakthrough in the omics area came in early 2000 with the single cell RNA sequencing (scRNA-seq) technology. The ability to isolate and sequence the genetic material of single cells allows researchers to identify which genes are active in each cell. This provides unprecedented opportunities over bulk RNA sequencing technologies, that … Read more Guide to Dimensionality Reduction in single cell RNA-seq analysis

Introduction to ggplot2 in R

Boxplots are another excellent tool for visualizing descriptive statistics. If you want to learn more about boxplots, check out this article from fellow Towards Data Science writer — Michael Galarnyk. Below is a boxplot that shows the spread for all the rating sites.

    ggplot(data=reviews) +
      aes(x=Rating_Site, y = Rating, color = Rating_Site) +
      geom_boxplot() +
      labs(title="Comparison of Movie Ratings")

… Read more Introduction to ggplot2 in R

A Tale of Two Cities — A mystery solved with Pandas

Could Perth really be wetter than Melbourne? Photo by Ricardo Resende on Unsplash Having recently moved from Melbourne to Perth I found it natural to make comparisons between the two cities. Which one has better coffee? OK, that one is easy — Melbourne hands down! Which one has more rain — well to answer that … Read more A Tale of Two Cities — A mystery solved with Pandas

Basic Statistics You NEED to Know for Data Science

Numerical: data expressed with digits; is measurable. It can either be discrete (finite number of values) or continuous (infinite number of values). Categorical: qualitative data classified into categories. It can be nominal (no order) or ordinal (ordered data). Mean: the average of a dataset. Median: the middle of an ordered dataset; less susceptible to outliers. Mode: the … Read more Basic Statistics You NEED to Know for Data Science
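A small sketch of the three measures of central tendency named above, using Python's standard `statistics` module. The sample values are made up for illustration; the single large value is there to show why the median is less susceptible to outliers than the mean.

```python
import statistics

# Hypothetical sample with one extreme value (100)
data = [1, 2, 2, 3, 100]

mean_value = statistics.mean(data)      # sum of values divided by their count
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

The outlier pulls the mean far above most of the values, while the median and mode stay at 2.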

Enterprise AI/Machine Learning: Lessons Learned

I recently had the privilege of participating on a panel with several AI/Machine Learning experts. There were many great questions, but most were related to how to most effectively establish AI/Machine Learning (AI/ML) in a large organization. This gave me an opportunity to reflect on my own experiences helping large enterprises accelerate their AI/Machine … Read more Enterprise AI/Machine Learning: Lessons Learned

Why Python is better than R for Data Science careers

Most companies require their data scientists to do more than predictive modeling (i.e., machine learning). At the least, you’ll probably be required to maintain the data pipelines that feed your models, and those data pipelines will likely be built in Python. The industry standard for pipelines today is the Python-based Airflow, and at Facebook we … Read more Why Python is better than R for Data Science careers

Generalizing data load processes with Airflow

Data load processes should not be written twice; they should be generalized Photo by Max Nelson on Unsplash We use Airflow as our data pipeline orchestrator to easily orchestrate and monitor data processes. In particular, we’ve been working on data load processes to make them easier. These processes allow us to extract … Read more Generalizing data load processes with Airflow

Real-Time Fingers Detector Over an Object — A Working Example

Photo by Priscilla Du Preez on Unsplash Recently, I had the opportunity to build a PoC (Proof of Concept — a demo) to resolve a specific computer vision problem and it was a cool experience, so why not share it? The goal of the demo was to detect in real-time, having as input a video … Read more Real-Time Fingers Detector Over an Object — A Working Example

What if more young people had voted in 2016?

Data sources and assumptions For this analysis, I assembled exit poll data, along with voter participation rates and population statistics, all broken down by age. The exit poll data was taken from Edison Research, a firm which does exit polling for a media consortium including ABC, The Associated Press, CBS, CNN, Fox and NBC. Voter … Read more What if more young people had voted in 2016?

Pandas tips that will save you hours of head scratching

Making your Data Analysis experiments reproducible saves time for you and others in the long term Revisiting a problem you’ve worked on in the past and finding out that the code doesn’t work is frustrating. These tips will … Read more Pandas tips that will save you hours of head scratching

Spoiler Alert: Conor McGregor vs Cowboy? Who Will Win?

Elo rating system The Elo rating system is a method for calculating the relative skill levels of players in zero-sum games. It has been mostly used in chess. But many have applied the same algorithm in other competitive games such as soccer, basketball and scrabble. In this article, we will predict the winner of this … Read more Spoiler Alert: Conor McGregor vs Cowboy? Who Will Win?
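For orientation, here is a minimal sketch of the standard Elo formulas: the expected score of player A against player B, and the rating update after a result. The K-factor of 32 and the function names are assumptions for illustration; the article may use different values.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_rating(rating, expected, actual, k=32):
    """New rating after a game; actual is 1 for a win, 0.5 for a draw, 0 for a loss.

    k=32 is an assumed K-factor, not a value taken from the article.
    """
    return rating + k * (actual - expected)


# Two equally rated players: each is expected to score 0.5
e_a = expected_score(1500, 1500)
new_a = update_rating(1500, e_a, actual=1)  # A wins
new_b = update_rating(1500, 1 - e_a, actual=0)  # B loses
print(e_a, new_a, new_b)
```

Because one player's gain equals the other's loss when both use the same K-factor, the total rating in the pool stays constant.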

Understanding the Central Limit Theorem

A Practical Guide to understanding one of the most important concepts in statistics Photo by Verne Ho on Unsplash Central Limit Theorem (CLT for short) is one of the most important concepts in the field of statistics. In this post, I will try to explain this concept in a simple and non-technical manner. Introduction Let’s … Read more Understanding the Central Limit Theorem
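The theorem can be illustrated with a small simulation sketch: draw many samples from a non-normal population, take each sample's mean, and look at how those means are distributed. The uniform population, the sample size of 30, and the function name `sample_means` are assumptions chosen for illustration.

```python
import random
import statistics


def sample_means(sample_size, n_samples, seed=0):
    """Means of n_samples samples, each of sample_size draws from Uniform(0, 1)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [
        statistics.fmean([rng.random() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


means = sample_means(sample_size=30, n_samples=2000)

# Uniform(0, 1) has mean 0.5 and standard deviation sqrt(1/12).
# The CLT predicts the sample means cluster around 0.5 with a
# standard deviation close to sqrt(1/12) / sqrt(30).
print(statistics.fmean(means), statistics.stdev(means))
```

Plotting a histogram of `means` would show the familiar bell shape even though the underlying population is flat.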

68–95–99.7 — The Three-Sigma Rule of Thumb Used in Power BI.

DATA SCIENCE WITH MICROSOFT POWER BI Detecting outliers and anomalies using the three-sigma rule of thumb in Power BI with no code Photo by timJ on Unsplash Even in the smallest of all data projects, one of the most important steps is detecting abnormal values, outliers or anomalies within your data structure. In this brief … Read more 68–95–99.7 — The Three-Sigma Rule of Thumb Used in Power BI.
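The article applies the rule in Power BI without code. Purely as an illustration of the rule itself, the sketch below flags values lying more than three standard deviations from the mean in plain Python. The data and the function name `three_sigma_outliers` are hypothetical.

```python
import statistics


def three_sigma_outliers(values):
    """Return the values lying more than 3 sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - 3 * sd, mean + 3 * sd
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings: twenty values near 10 and one obvious anomaly
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 50]
print(three_sigma_outliers(readings))
```

Note that the rule needs a reasonable number of observations: in a very small dataset a single extreme value inflates the standard deviation so much that it can never fall outside three sigma.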

Evaluate your Recommendation Engine using NDCG

How to best evaluate a recommender system is a topic of debate. Let us see how we can use the NDCG measure to evaluate a recommendation engine. We are in an era of personalization. The user wants personalized content and businesses are capitalizing on the same. Recommendation Engines, usually built using Machine Learning techniques, have become … Read more Evaluate your Recommendation Engine using NDCG
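As a rough sketch of the measure, NDCG divides the discounted cumulative gain (DCG) of a ranked list by the DCG of the ideal ordering of the same items. The version below uses the linear-gain form, rel / log2(rank + 1); an exponential-gain variant, (2^rel - 1) / log2(rank + 1), is also common, and the article may use either. The relevance scores are made up for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain; ranks start at 1."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of six recommended items, in ranked order
ranked_relevance = [3, 2, 3, 0, 1, 2]
print(ndcg(ranked_relevance))
```

A list that is already sorted by relevance scores 1.0, and the score drops as highly relevant items appear further down the ranking.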

The Best book to Start your Data Science Journey

Data Science Here’s the book you should read to learn Data Science from scratch. Data Science. It’s the sexiest job of the 21st century and everyone is talking about it. Companies are eager to hire the best talents and people are enthusiastic to jump in the data science boat. With data growing exponentially and our … Read more The Best book to Start your Data Science Journey

From Linear to Logistic Regression Explained Step by Step

Congrats~you have gone through all the theoretical concepts of the regression model. Feel bored?! Here’s a real case to get your hands dirty! picture from APPLE Imagine that you are a store manager at the APPLE store, and your goal this month is to increase sales revenue by 10%. Therefore, you need to know who the … Read more From Linear to Logistic Regression Explained Step by Step

Intuitive explanation of Neural Machine Translation

A simple explanation of Sequence to Sequence model for Neural Machine Translation(NMT) What is Neural Machine Translation? Neural machine translation is a technique to translate one language to another language. An example could be converting English language to Hindi language. Let’s consider if you were in an Indian village where most of the people do … Read more Intuitive explanation of Neural Machine Translation

Neural Networks Training with Approximate Logarithmic Computations

This section contains a detailed description of the training experiments performed on our Log-domain MLP, which we designed using the mathematics outlined thus far and neural network MLP architecture knowledge. Fixed-Point Bitwidth analysis in Log-domain All our neural network training and inference experiments are conducted using fixed-point arithmetic. But how do bit-widths scale when we … Read more Neural Networks Training with Approximate Logarithmic Computations

Plotly Python: Scatter Plots

    fig = go.Figure(data=go.Scatter(
        x=steamdf['price'],
        y=steamdf['average_playtime'],
        mode='markers',
        marker_size=steamdf['ratio'],
        hovertext=steamdf['name'],
        hoverlabel=dict(namelength=0),
        hovertemplate='%{hovertext}<br>Price: %{x:$}<br>Avg. Playtime: %{y:,} min',
        marker=dict(color='rgb(255, 178, 102)', size=10,
                    line=dict(color='DarkSlateGrey', width=1))))
    fig.update_layout(
        title='Price vs. Average Playtime',
        xaxis_title='Price (GBP)',
        yaxis_title='Average Playtime (Minutes)',
        plot_bgcolor='white',
        paper_bgcolor='whitesmoke',
        font=dict(family='Verdana', size=16, color='black'))
    fig.update_xaxes(
        showline=True, linewidth=2, linecolor='black', mirror=True,
        showgrid=False,
        zerolinecolor='black', zerolinewidth=1,
        range=[-1, 65])
    fig.update_yaxes(
        showline=True, linewidth=2, linecolor='black', mirror=True,
        showgrid=True, gridwidth=1, gridcolor='grey',
        zerolinecolor='black', zerolinewidth=1,
        range=[-2000, 40000])
    fig.show()

I hope this covers enough to get you feeling confident with creating and customizing scatter plots with Plotly!

The exponential adoption of Tesla in the Netherlands

Exploring the RDW license plate data set with BigQuery, Cloud Storage and Data Studio Last year, driving on the Dutch highway, I noticed that the number of Teslas on the road has been increasing rapidly. There are a few good reasons for this. The adoption of full-electric (lease) cars is actively supported by the Dutch government … Read more The exponential adoption of Tesla in the Netherlands

Helping Kids Play With Artificial Intelligence

How zines can teach state-of-the-art skills Photo by Giulia Bertelli on Unsplash Every day, our kids are swept through the world by algorithms. YouTube algorithms decide what videos they watch, GPS algorithms map what route they take to school, Spotify algorithms select what songs they hear, and personal assistants like Siri and Alexa advise them … Read more Helping Kids Play With Artificial Intelligence

Most Data Science Jobs Descriptions Should Stop Requiring a PhD

I just searched on Indeed for “Data Scientist.” Of the 20 jobs on the first page, 6 did not mention a PhD, 3 said it would be nice to have, and 11 had it as a requirement. I didn’t spend hours looking at job boards or automate a system to scrape them all … Read more Most Data Science Jobs Descriptions Should Stop Requiring a PhD

Democratize Data like Zynga, Facebook and Ebay Do

Zynga and the data law Zynga, founded in 2007, is the company behind FarmVille and lots of other very successful mobile games; in 2018 they had close to 1 billion USD in revenue, 15 million USD net income and close to 2,000 employees. In their 12 years of company history, they introduced a data-driven cult … Read more Democratize Data like Zynga, Facebook and Ebay Do

Bulk Mapping Attributes to Dataframes using Python Pandas

When it comes to iterating through large volumes of rows in Pandas dataframes, many of us have impatiently waited for our program to finish looping, sometimes row by row for painstakingly long periods of time. This was one of my main struggles when loading high-volume transactional data as a Pandas dataframe, and then enriching the … Read more Bulk Mapping Attributes to Dataframes using Python Pandas

How does Airbnb impact housing in San Francisco? Analysis and data.

First, one needs to define the “short term rental market” in SF. This analysis uses Dec 4, 2019 data from Inside Airbnb so these results only apply to properties on Airbnb listed at that time [6]. With that sample in mind, this essay excludes hotels/hostels (8% of listings). Furthermore, as the SF Airbnb market has … Read more How does Airbnb impact housing in San Francisco? Analysis and data.

Training a GAN to Sample from the Normal Distribution

Visualizing the Very Basics of Generative Adversarial Networks In the original Generative Adversarial Network paper, Ian Goodfellow describes a simple GAN which, when trained, is able to generate samples indiscernible from those sampled from the normal distribution. This process is illustrated here: Figure 1: the first figure ever published in a GAN paper, illustrating a … Read more Training a GAN to Sample from the Normal Distribution