The Galactic Island Hypothesis

New research provides a quantitative solution to Fermi’s paradox. Figure: Simulated settlement trajectories, showing how civilizations could spread through the Galaxy. Compared to the age of the Milky Way Galaxy, our 200,000-year-old human species has been around for just the blink of an eye. The Milky Way is at least 10 billion years … Read more The Galactic Island Hypothesis

Hands-on Web Scraping: Building your Twitter dataset with python and scrapy

This assumes that you have some basic knowledge of Python and Scrapy. If you are interested only in generating your dataset, skip this section and go to the sample crawl section on the GitHub repo. Gathering tweet URLs by searching through hashtags: To search for tweets, we will be using the legacy Twitter website. Let’s … Read more Hands-on Web Scraping: Building your Twitter dataset with python and scrapy

Guide to Dimensionality Reduction in single cell RNA-seq analysis

A major breakthrough in the omics area came in early 2000 with the single cell RNA sequencing (scRNA-seq) technology. The ability to isolate and sequence the genetic material of single cells allows researchers to identify which genes are active in each cell. This provides unprecedented opportunities over bulk RNA sequencing technologies, that … Read more Guide to Dimensionality Reduction in single cell RNA-seq analysis

Introduction to ggplot2 in R

Boxplots are another excellent tool for visualizing descriptive statistics. If you want to learn more about boxplots, check out this article from fellow Towards Data Science writer Michael Galarnyk. Below is a boxplot that shows the spread for all the rating sites. ggplot(data=reviews) + aes(x=Rating_Site, y=Rating, color=Rating_Site) + geom_boxplot() + labs(title="Comparison of Movie Ratings") … Read more Introduction to ggplot2 in R

A Tale of Two Cities — A mystery solved with Pandas

Could Perth really be wetter than Melbourne? Having recently moved from Melbourne to Perth, I found it natural to make comparisons between the two cities. Which one has better coffee? OK, that one is easy: Melbourne, hands down! Which one has more rain? Well, to answer that … Read more A Tale of Two Cities — A mystery solved with Pandas

Basic Statistics You NEED to Know for Data Science

Numerical: data expressed with digits; is measurable. It can either be discrete (finite number of values) or continuous (infinite number of values). Categorical: qualitative data classified into categories. It can be nominal (no order) or ordinal (ordered data). Mean: the average of a dataset. Median: the middle of an ordered dataset; less susceptible to outliers. Mode: the … Read more Basic Statistics You NEED to Know for Data Science
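
As a minimal sketch of the measures named in this excerpt, the snippet below uses Python's standard `statistics` module. The sample values are hypothetical and are not taken from the article; the value 40 is included on purpose to show why the median is less susceptible to outliers than the mean.

```python
import statistics

# Hypothetical sample (not from the article); 40 is a deliberate outlier.
values = [2, 3, 3, 5, 7, 40]

mean = statistics.mean(values)      # sum divided by count, pulled upward by the outlier
median = statistics.median(values)  # middle of the sorted data (average of the two middle values here)
mode = statistics.mode(values)      # most frequently occurring value

print(mean, median, mode)
```

Here the single large value moves the mean well above most of the data, while the median stays close to the bulk of the values.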

Enterprise AI/Machine Learning: Lessons Learned

I recently had the privilege of participating on a panel with several AI/Machine Learning experts. There were many great questions, but most were related to how to most effectively establish AI/Machine Learning (AI/ML) in a large organization. This gave me an opportunity to reflect on my own experiences helping large enterprises accelerate their AI/Machine … Read more Enterprise AI/Machine Learning: Lessons Learned

Why Python is better than R for Data Science careers

Most companies require their data scientists to do more than predictive modeling (i.e., machine learning). At the least, you’ll probably be required to maintain the data pipelines that feed your models, and those data pipelines will likely be built in Python. The industry standard for pipelines today is the Python-based Airflow, and at Facebook we … Read more Why Python is better than R for Data Science careers

Generalizing data load processes with Airflow

Data load processes should not be written twice; they should be generalized. We use Airflow as our data pipeline orchestrator in order to easily orchestrate and monitor data processes. In particular, we’ve been working on data load processes in order to make them easier. These processes allow us to extract … Read more Generalizing data load processes with Airflow

Real-Time Fingers Detector Over an Object — A Working Example

Recently, I had the opportunity to build a PoC (Proof of Concept, a demo) to solve a specific computer vision problem, and it was a cool experience, so why not share it? The goal of the demo was to detect in real time, having as input a video … Read more Real-Time Fingers Detector Over an Object — A Working Example

What if more young people had voted in 2016?

Data sources and assumptions: For this analysis, I assembled exit poll data, along with voter participation rates and population statistics, all broken down by age. The exit poll data was taken from Edison Research, a firm which does exit polling for a media consortium including ABC, The Associated Press, CBS, CNN, Fox and NBC. Voter … Read more What if more young people had voted in 2016?

Pandas tips that will save you hours of head scratching

Making your Data Analysis experiments reproducible saves time for you and others in the long term. Revisiting a problem you’ve worked on in the past and finding out that the code doesn’t work is frustrating. These tips will … Read more Pandas tips that will save you hours of head scratching

Spoiler Alert: Conor McGregor vs Cowboy? Who Will Win?

Elo rating system: The Elo rating system is a method for calculating the relative skill levels of players in zero-sum games. It has been mostly used in chess, but many have applied the same algorithm in other competitive games such as soccer, basketball and Scrabble. In this article, we will predict the winner of this … Read more Spoiler Alert: Conor McGregor vs Cowboy? Who Will Win?
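
To make the Elo idea concrete, here is a minimal sketch of the standard Elo update as it is commonly stated for chess. The function names, the K-factor of 32, and the 400-point scale are assumptions chosen for illustration; the excerpt does not say which variant or parameters the article uses.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value taken from the article.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    # The game is zero-sum, so B's change is the mirror image of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two equally rated players: the winner gains exactly what the loser drops.
new_a, new_b = update_ratings(1500, 1500, score_a=1.0)
print(new_a, new_b)
```

A larger K makes ratings react faster to each result, which is one of the design choices any Elo-based prediction has to make.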

Understanding the Central Limit Theorem

A Practical Guide to understanding one of the most important concepts in statistics. The Central Limit Theorem (CLT for short) is one of the most important concepts in the field of statistics. In this post, I will try to explain this concept in a simple and non-technical manner. Introduction: Let’s … Read more Understanding the Central Limit Theorem

68–95–99.7 — The Three-Sigma Rule of Thumb Used in Power BI.

DATA SCIENCE WITH MICROSOFT POWER BI: Detecting outliers and anomalies using the three-sigma rule of thumb in Power BI with no code. Even in the smallest of all data projects, one of the most important steps is detecting abnormal values, outliers or anomalies within your data structure. In this brief … Read more 68–95–99.7 — The Three-Sigma Rule of Thumb Used in Power BI.
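
The article applies the rule in Power BI without code; purely as an illustration of the same rule of thumb, the following Python sketch flags values that lie more than three standard deviations from the mean. The function name and the sample readings are hypothetical, and the sample standard deviation is an assumed choice.

```python
import statistics


def three_sigma_outliers(values):
    """Return the values lying more than 3 standard deviations from the mean."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumed choice)
    return [v for v in values if abs(v - mean) > 3 * sigma]


# Hypothetical readings: twenty ordinary values and one extreme value.
readings = [10] * 20 + [100]
flagged = three_sigma_outliers(readings)
print(flagged)
```

Note that an extreme value inflates the standard deviation itself, so on very small samples the rule can fail to flag anything.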

Evaluate your Recommendation Engine using NDCG

How to best evaluate a recommender system is a topic of debate. Let us see how we can use the NDCG measure to evaluate a recommendation engine. We are in an era of personalization. The user wants personalized content and businesses are capitalizing on the same. Recommendation Engines, usually built using Machine Learning techniques, have become … Read more Evaluate your Recommendation Engine using NDCG
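
As a rough sketch of the measure named here, the snippet below computes NDCG (normalized discounted cumulative gain) for a ranked list of graded relevance scores. It assumes the linear-gain form, rel / log2(rank + 1); some formulations use 2^rel − 1 as the gain instead, and the excerpt does not say which one the article adopts. The function names and sample scores are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if nothing is relevant."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance scores in the order the recommender ranked the items.
print(ndcg([3, 2, 1]))  # already in the ideal order
print(ndcg([1, 2, 3]))  # most relevant item ranked last, so the score drops
```

Because the score is normalized by the ideal ordering, it is comparable across users whose lists contain different numbers of relevant items.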

The Best book to Start your Data Science Journey

Data Science: Here’s the book you should read to learn Data Science from scratch. Data Science. It’s the sexiest job of the 21st century and everyone is talking about it. Companies are eager to hire the best talent and people are enthusiastic to jump into the data science boat. With data growing exponentially and our … Read more The Best book to Start your Data Science Journey

From Linear to Logistic Regression Explained Step by Step

Congrats, you have gone through all the theoretical concepts of the regression model. Feeling bored? Here’s a real case to get your hands dirty! Imagine that you are a store manager at the Apple store, and your goal this month is to increase sales revenue by 10%. Therefore, you need to know who the … Read more From Linear to Logistic Regression Explained Step by Step

Intuitive explanation of Neural Machine Translation

A simple explanation of the Sequence to Sequence model for Neural Machine Translation (NMT). What is Neural Machine Translation? Neural machine translation is a technique for translating one language into another. An example could be converting English to Hindi. Suppose you were in an Indian village where most of the people do … Read more Intuitive explanation of Neural Machine Translation

Neural Networks Training with Approximate Logarithmic Computations

This section contains a detailed description of the training experiments performed on our Log-domain MLP, which we designed using the mathematics outlined thus far and neural network MLP architecture knowledge. Fixed-Point Bitwidth analysis in Log-domain: All our neural network training and inference experiments are conducted using fixed-point arithmetic. But how do bit-widths scale when we … Read more Neural Networks Training with Approximate Logarithmic Computations

Plotly Python: Scatter Plots

import plotly.graph_objects as go

# steamdf is the article's DataFrame (defined outside this excerpt)
fig = go.Figure(data=go.Scatter(
    x=steamdf['price'],
    y=steamdf['average_playtime'],
    mode='markers',
    marker_size=steamdf['ratio'],
    hovertext=steamdf['name'],
    hoverlabel=dict(namelength=0),
    hovertemplate='%{hovertext}<br>Price: %{x:$}<br>Avg. Playtime: %{y:,} min',
    marker=dict(color='rgb(255, 178, 102)', size=10,
                line=dict(color='DarkSlateGrey', width=1))))
fig.update_layout(title='Price vs. Average Playtime',
                  xaxis_title='Price (GBP)',
                  yaxis_title='Average Playtime (Minutes)',
                  plot_bgcolor='white', paper_bgcolor='whitesmoke',
                  font=dict(family='Verdana', size=16, color='black'))
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True,
                 showgrid=False, zerolinecolor='black', zerolinewidth=1,
                 range=[-1, 65])
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True,
                 showgrid=True, gridwidth=1, gridcolor='grey',
                 zerolinecolor='black', zerolinewidth=1,
                 range=[-2000, 40000])
fig.show()

I hope this covers enough to get you feeling confident with creating and customizing scatter plots with Plotly!

The exponential adoption of Tesla in the Netherlands

Exploring the RDW license plate data set with BigQuery, Cloud Storage and Data Studio. Last year, driving on the Dutch highway, I noticed that the number of Teslas on the road has been increasing rapidly. There are a few good reasons for this. The adoption of full-electric (lease) cars is actively supported by the Dutch government … Read more The exponential adoption of Tesla in the Netherlands

Helping Kids Play With Artificial Intelligence

How zines can teach state-of-the-art skills. Every day, our kids are swept through the world by algorithms. YouTube algorithms decide what videos they watch, GPS algorithms map what route they take to school, Spotify algorithms select what songs they hear, and personal assistants like Siri and Alexa advise them … Read more Helping Kids Play With Artificial Intelligence

Most Data Science Jobs Descriptions Should Stop Requiring a PhD

I just searched on Indeed for “Data Scientist.” Of the 20 jobs on the first page, 6 did not mention a PhD, 3 said it would be nice to have, and 11 had it as a requirement. I didn’t spend hours looking at job boards or automate a system to scrape them all … Read more Most Data Science Jobs Descriptions Should Stop Requiring a PhD

Democratize Data like Zynga, Facebook and Ebay Do

Zynga and the data law: Zynga, founded in 2007, is the company behind FarmVille and lots of other very successful mobile games; in 2018 they had close to 1 billion USD in revenue, 15 million USD net income and close to 2,000 employees. In their 12 years of company history, they introduced a data-driven cult … Read more Democratize Data like Zynga, Facebook and Ebay Do

Bulk Mapping Attributes to Dataframes using Python Pandas

When it comes to iterating through large volumes of rows in Pandas dataframes, many of us have impatiently waited for our program to finish looping, sometimes row by row for painstakingly long periods of time. This was one of my main struggles when loading high-volume transactional data as a Pandas dataframe, and then enriching the … Read more Bulk Mapping Attributes to Dataframes using Python Pandas

How does Airbnb impact housing in San Francisco? Analysis and data.

First, one needs to define the “short term rental market” in SF. This analysis uses Dec 4, 2019 data from Inside Airbnb so these results only apply to properties on Airbnb listed at that time [6]. With that sample in mind, this essay excludes hotels/hostels (8% of listings). Furthermore, as the SF Airbnb market has … Read more How does Airbnb impact housing in San Francisco? Analysis and data.

Training a GAN to Sample from the Normal Distribution

Visualizing the Very Basics of Generative Adversarial Networks. In the original Generative Adversarial Network paper, Ian Goodfellow describes a simple GAN which, when trained, is able to generate samples indiscernible from those sampled from the normal distribution. This process is illustrated here: Figure 1: the first figure ever published in a GAN paper, illustrating a … Read more Training a GAN to Sample from the Normal Distribution

Would You Pass the Airbnb Psychopath Test?

I hope you don’t drink or smoke. Dear reader, I write to you from an Airbnb in Seville. As such, I can proudly, triumphantly claim that I must have passed the Airbnb psychopath test. As such, some unkind observers might claim that said test must be mighty forgiving in its definition of ‘psychopath’. Yes, this … Read more Would You Pass the Airbnb Psychopath Test?

The Bio-Medicine Singularity

Refresher: how are new drugs created now? Right now it costs over $1 billion to get a drug approved by the U.S. Food and Drug Administration. This cost covers layers of discovery and hard science phases with clinical trials. The mechanisms for bio-therapeutics are often patched over once efficacy is shown, so many early clinical … Read more The Bio-Medicine Singularity

A Tale about a Giant, a Machine Learning pill, and the Automotive Industry

Today, cars are driven by humans to get from A to B. Machine Learning is a key technology in allowing people or goods to be driven autonomously to the target destination. This is called Autonomous Driving (AD). AD allows the creation of manifold applications. RoboTaxis spring to mind immediately, which transport people in urban … Read more A Tale about a Giant, a Machine Learning pill, and the Automotive Industry

Modelling time series relationships with R: S&P 500 vs. oil prices

Here are some examples of time series applications in R, which are used to investigate potential relationships between the S&P 500 and oil prices. In this example, an OLS regression model is constructed in an attempt to forecast future S&P 500 levels based on the price of Brent crude oil. However, since this OLS regression … Read more Modelling time series relationships with R: S&P 500 vs. oil prices

An Extensive Guide to Exploratory Data Analysis

To me, there are three main components of exploring data: understanding your variables, cleaning your dataset, and analyzing relationships between variables. In this article, we’ll take a look at the first two components. You don’t know what you don’t know. And if you don’t know what you don’t know, then how are you supposed to know whether … Read more An Extensive Guide to Exploratory Data Analysis

The What, Why, and When of Apache Spark

Before-you-code Spark basics. Spark has been called a “general purpose distributed data processing engine”¹ and “a lightning fast unified analytics engine for big data and machine learning”². It lets you process big data sets faster by splitting the work up into chunks and assigning those chunks across computational resources. … Read more The What, Why, and When of Apache Spark

Top 7 Mobile Apps for learning and Practicing Data Science

Mobile apps have become an integral part of human life. Most people spend their time on messaging apps or listening to music, not realizing that they could use that time on apps for learning and practicing the latest technologies and new coding languages, which can benefit them in the long run. … Read more Top 7 Mobile Apps for learning and Practicing Data Science

Automate Kaggle Competition with the help of Google Colab

Using the Kaggle IEEE-CIS Fraud Detection competition as an example, I’ll now break down, step by step, a typical Kaggle machine learning pipeline in Colab. 1. Downloading the datasets from API calls: First download your API token by going to your Kaggle My Account (https://www.kaggle.com/*Your-Username*/account), going to section ‘API’ and clicking on ‘create new API token’. You … Read more Automate Kaggle Competition with the help of Google Colab

Data Visualization Rules to Keep You Ahead of The Game

Whether you’re trying to break into the world of data analytics or data science, or you’re a product manager, sales leader, or anybody seeking to understand their business, being able to utilize data in a meaningful way is key. Whether you’re using data visualization software like Tableau, Domo, PowerBI, etc. or you’re using a language … Read more Data Visualization Rules to Keep You Ahead of The Game

How to generate data science projects ideas

What kind of project should I present so that I could stand out during the interview? Having a great project to present is important in an interview. Although it might not be the most vital element to succeed in an interview, it is definitely a plus to stand out among other candidates. However, before getting … Read more How to generate data science projects ideas

SVD in Machine Learning: Ridge Regression and Multicollinearity

In this section, we will look at multicollinearity and how it can compromise least squares. Consider a matrix X of shape n × p. For its columns X₁, X₂, …, Xₚ ∈ ℝⁿ, we say they are linearly independent when ∑αᵢXᵢ = 0 if and only if αᵢ = 0 for i = 1, 2, … Read more SVD in Machine Learning: Ridge Regression and Multicollinearity
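
The definition above can be illustrated with a small sketch in plain Python. The columns below are hypothetical: the third is built as an exact sum of the first two, so a nonzero choice of αᵢ sends ∑αᵢXᵢ to zero and the columns are not linearly independent. The determinant of XᵀX is then zero, which is why the least-squares normal equations lose their unique solution. The helper names are made up for this example, and the article itself works with the SVD rather than with determinants.

```python
def combine(columns, alphas):
    """Return the vector sum(alpha_i * X_i) for columns given as equal-length lists."""
    n = len(columns[0])
    return [sum(a * col[r] for a, col in zip(alphas, columns)) for r in range(n)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Hypothetical columns of X (n = 4, p = 3); x3 is an exact linear combination.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [0.0, 1.0, 0.0, 1.0]
x3 = [a + b for a, b in zip(x1, x2)]
columns = [x1, x2, x3]

# Nonzero alphas (1, 1, -1) give the zero vector, so the columns are dependent.
residual = combine(columns, [1.0, 1.0, -1.0])

# Gram matrix X^T X; it is singular exactly when the columns are dependent.
gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
gram_det = det3(gram)

print(residual, gram_det)
```

In real data the dependence is usually approximate rather than exact, so the determinant is tiny rather than zero; that near-singular case is the multicollinearity the article goes on to discuss.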

Understanding data engineering jargon: schema and master/branch

Technical engineering terms explained for non-engineers. “If one does not understand a person, one tends to regard him as a fool” (Carl Jung). Well said, Mister Jung. It happens to a lot of us. When an error occurs during a cross-functional project, we tend to blame the party that … Read more Understanding data engineering jargon: schema and master/branch

Do You Even Lift? Predicting Weight Change with Workouts and Nutrition

Weight change: I considered using fat % change as the variable of interest because, after all, burning fat and building muscle are the desired results of this project. However, many sources pointed out the inaccuracy of bathroom body composition scales, although it is nice to see certain metrics related to your body. I tried using … Read more Do You Even Lift? Predicting Weight Change with Workouts and Nutrition

How to deal with REJECTION as aspiring Data Scientist

Every Data Scientist who’s working in industry went through the same tough job process that you are going through now. We all failed in the same crushing way but also succeeded in the same triumphant way. The next few tips will delve into how you can deal with your lowest moments. The first tip is … Read more How to deal with REJECTION as aspiring Data Scientist