ICLR 19 highlights (and all that Jazz)

A joint post with Ofri Mann. We went to ICLR to present our work on debugging ML models using uncertainty and attention. Between cocktail parties and jazz shows in the wonderful New Orleans (can we do all conferences in NOLA please?) we also saw a lot of interesting talks and posters. Below are our main … Read more

Interview Coding Problems: 1.

1. Return whether any two numbers from a list add up to a given number. 2. Transform a list so that each element becomes the product of all the other numbers in the original list. 3. Serialise and deserialise a binary tree. A great way to improve your coding skills is by solving coding challenges. Solving … Read more
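The first two problems listed above can be sketched in a few lines of Python. This is only one possible approach, not necessarily the one taken in the full post, and the function names `has_pair_with_sum` and `products_of_others` are made up for illustration.

```python
def has_pair_with_sum(numbers, target):
    """Return True if any two entries of `numbers` add up to `target`."""
    seen = set()  # values visited so far
    for n in numbers:
        # If the complement of n was seen earlier, the two form a valid pair.
        if target - n in seen:
            return True
        seen.add(n)
    return False


def products_of_others(numbers):
    """Return a list whose entry i is the product of every entry except numbers[i]."""
    result = [1] * len(numbers)

    # First pass: result[i] holds the product of everything left of i.
    prefix = 1
    for i, n in enumerate(numbers):
        result[i] = prefix
        prefix *= n

    # Second pass: multiply in the product of everything right of i.
    suffix = 1
    for i in range(len(numbers) - 1, -1, -1):
        result[i] *= suffix
        suffix *= numbers[i]

    return result
```

For example, `has_pair_with_sum([10, 15, 3, 7], 17)` returns `True` because 10 + 7 = 17, and `products_of_others([1, 2, 3, 4, 5])` returns `[120, 60, 40, 30, 24]`. The two-pass version avoids division, so it also works when the list contains zeros.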

Recurrence in biological and artificial neural networks

Similarities, differences, and why it matters. Recurrence is an overloaded term in the context of neural networks, with disparate colloquial meanings in the machine learning and the neuroscience communities. The difference is narrowing, however, as the artificial neural networks (ANNs) used for practical applications are increasingly sophisticated and more like biological neural networks (BNNs) in some … Read more

Predicting Airbnb prices with deep learning, part 1: how to clean up Airbnb data

Project aims and background. Airbnb is a home-sharing platform that allows home-owners and renters (‘hosts’) to put their properties (‘listings’) online, so that guests can pay to stay in them. Hosts are expected to set their own prices for their listings. Although Airbnb and other sites provide some general guidance, there are currently no free … Read more

A.I. with Behaviors

What do Rumors, Fashions/Fads, and doing the Wave at sports games have in common? They are all forms of Collective Behavior. Origin of Collective Behavior: The U.S. sociologist Robert E. Park, who coined the term collective behaviour, defined it as “the behavior of individuals under the influence of an impulse that is common and collective, an … Read more

Scalable Python Code with Pandas UDFs: A Data Science Application

Source: https://pxhere.com/en/photo/1417846 Making Python code run at massive scale in the cloud. PySpark is a really powerful tool, because it enables writing Python code that can scale from a single machine to a large cluster. While libraries such as MLlib provide good coverage of the standard tasks that a data scientist may want to perform in … Read more

Computational Biology

(Image reproduced from: https://blog.f1000.com/2017/02/01/f1000prime-f1000prime-faculty-launch-bioinformatics-biomedical-informatics-computational-biology/) Computational biology is the combined application of math, statistics and computer science to solve biology-based problems. Examples of biology problems are: genetics, evolution, cell biology, biochemistry. [1] Introduction: Recent advancements in technology are enabling us to store an incredible amount of data. Initially, “Big Data” was perceived as a problem to be … Read more

Why Data should be Normalized before Training a Neural Network

And Why Tanh Generally Performs Better Than Sigmoid Photo by Clint Adair on Unsplash Among the best practices for training a Neural Network is to normalize your data to obtain a mean close to 0. Normalizing the data generally speeds up learning and leads to faster convergence. Also, the (logistic) sigmoid function is hardly ever used anymore … Read more
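As a rough illustration of normalizing data to a mean close to 0, here is a minimal standardization sketch using only the Python standard library. The helper name `standardize` is hypothetical, and scaling by the standard deviation is an assumption on top of the zero-mean advice above; the full post may normalize differently.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift values to zero mean and scale them to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

After calling `standardize` on a feature column, its mean is 0 up to floating-point error, which is the property the excerpt recommends before training.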

AI Thinks Men Are Shallow

The data doesn’t lie. We started noticing this several years ago, first with some of the attraction data that was available. Check out the raw data for men/women attraction as a function of age: According to the “human” labeled data, most likely labeled by men, the attraction for a woman steadily declines with age. It … Read more

Anomaly Detection on Donald Trump’s Wikipedia Page

Anomaly detection (aka outlier detection) is a data mining technique that identifies rare observations from a dataset. This technique is widely used in analytics operations, such as web analytics and fraud analytics. In web analytics, anomaly detection is used to analyze web traffic and identify periods when unusual events occur. For instance, Nike received massive … Read more

Predicting Hotel Cancellations with ExtraTreesClassifier and Logistic Regression

Hotel cancellations can cause issues for managers. Not only is there the lost revenue as a result of the customer cancelling, but this can also cause difficulty in coordinating bookings and adjusting revenue management practices. Data analytics can help to overcome this issue, in terms of identifying the customers who are most likely to cancel — allowing … Read more

Research of Influence in Offline and Online Social Networks

It’s a Small World — classic research findings for offline social networks Taken from Milgram (1967) In his 1967 experiment Milgram asked randomly chosen U.S. citizens to pass on a letter to random targets using only friends and acquaintances they knew on a first-name basis. The resulting median path length was 5, meaning that people are generally found to … Read more

Applying Deep-Learning for fashion e-commerce

When I was learning about unsupervised learning methods, I came across different clustering methods like KMeans and Hierarchical Clustering. During the learning phase I wanted to implement these methods on real-world problems. Also, e-commerce systems have been keeping my mind occupied for a while, and it was very engaging to learn how the system works. So … Read more

The Power of Visualization in Data Science

Good and Bad Visualizations. Humans have been creating visualizations for thousands of years, and whilst the drawings of cavemen are slightly less spectacular than what we have nowadays, it is still good to appreciate just how powerful some of the early visualizations were, as well as how impactful they have been on the modern world. … Read more

Tutorial for Using Confidence Intervals & Bootstrapping

A different way to validate hypothesis testing In this tutorial I will attempt to show how the use of bootstrapping and confidence intervals can help with highlighting statistically significant differences between sample distributions. First 5 Rows of albums_data To start off, imagine we have a dataset called albums_data that has album reviewer names, the scores … Read more
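To make the idea of bootstrapped confidence intervals concrete, here is a minimal percentile-bootstrap sketch for the difference in means between two samples. It is an assumption-laden illustration rather than the tutorial's own code: the function name `bootstrap_mean_diff_ci`, its parameters, and the use of the percentile method are all hypothetical choices.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(a, b, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original sizes.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    # Read the interval bounds off the sorted bootstrap distribution.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval excludes 0, the two samples plausibly differ in their means at roughly the chosen confidence level. With the scores of two reviewers as `a` and `b`, the call would be `bootstrap_mean_diff_ci(a, b)`.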

Machine Learning Boosting Algorithms — AdaBoost Explained

Photo by Franki Chamaki on Unsplash. The general idea behind boosting methods is to train predictors sequentially, each trying to correct its predecessor. The two most commonly used boosting algorithms are AdaBoost and Gradient Boosting. In the following article, we’ll cover AdaBoost. At a high level, AdaBoost is similar to Random Forest in that they both … Read more

ICLR 2019: Overcoming limited data

Summaries of papers that address learning from few examples. Last week (5/6/19) marked the start of the International Conference on Learning Representations (ICLR). As such I thought I would dive into some of the ICLR papers that I found the most interesting. Most of these papers are related to areas of personal interest for me (unsupervised … Read more

Citizen Data Science

Using Excel with iNaturalist and eBird Steven Wright @stevenwright via Unsplash I recently gave an informal talk to a class of botany students at Gavilan College. The original topic was nature photography, but I also talked about the data science techniques that I used to create my recently completed photo book, Portraits of Birds: Shoreline Park. The … Read more

The Quiet Semi-Supervised Revolution

Time to dust off that unlabeled data? One of the most familiar settings for a machine learning engineer is having access to a lot of data, but modest resources to annotate it. Everyone in that predicament eventually goes through the logical steps of asking themselves what to do when they have limited supervised data, but … Read more

Classification and Regression Analysis with Decision Trees

The Fundamentals of Decision Trees. A decision tree is constructed by recursive partitioning: starting from the root node (known as the first parent), each node can be split into left and right child nodes. These nodes can then be further split, and they themselves become parent nodes of their resulting child nodes. For example, looking at the … Read more

The Central Limit Theorem and its Implications

The central limit theorem… Mother to all of us machine learning researchers and practitioners. We make use of it every day, yet we don’t appreciate it. Why do those pesky neural networks even converge on all sorts of stuff? Images, time series, etc. We can learn quite a lot, all thanks to this, at first glance … Read more

I’m Not A Robot!

Authentication Schemes Before we get into the schemes, let’s talk about the flow. In a REST API, sending in the credentials once and logging in is not enough. REST is stateless as we discussed in this article. Being stateless, the REST API can’t remember your credentials. So you have to tell it who you are … Read more

Uber datasets in BigQuery: Driving times around SF and your city too

Interactive travel times dashboard around the SF Bay Area. Powered by BigQuery, Data Studio, and Uber’s worldwide movement data. Uber keeps adding new cities to their public data program — let’s load them into BigQuery. We’ll take advantage of the latest new features: Native GIS functions, partitioning, clustering, and fast dashboards with BI Engine. First let’s play with the … Read more

Combining Satellite Imagery and machine learning to predict poverty

Photo credit: NASA. This is a review, under 5 minutes, of the paper with the same name by Neal Jean et al. This is the video version of this article: https://youtu.be/bW_-I2qYmEQ . Poverty estimation in the developing world influences how governments of these countries allocate limited resources to create policies and conduct research. Neal Jean et al. … Read more

Building Mario Levels with Machine Learning

Super Mario Makers: Building Levels with Machine Learning (AI and Games YouTube Channel) It’s been ten years since the inception of the Mario AI research community, but work in this space is still as engaging and exciting as it’s ever been. Today I’m going to look at a variety of research using machine learning to … Read more

Creating sea routes from the sea of AIS data.

Maritime routes are important characteristics of maritime transportation. Sometimes they are clearly defined by the official guidelines, “traffic separation schemes”; sometimes they are more like recommendations. The International Maritime Organization is responsible for the routeing systems, including traffic separation schemes, and they are published in the IMO Publication, Ships’ Routeing (currently the 2013 Edition). Unfortunately, they are … Read more

Learning Parameters, Part 2: Momentum-Based And Nesterov Accelerated Gradient Descent

Learning Parameters. Let’s look at two simple, yet very useful variants of gradient descent. In this post, we look at how the gentle-surface limitation of Gradient Descent can be overcome, to some extent, using the concept of momentum. Make sure you check out my blog post, Learning Parameters, Part-1: Gradient Descent, if you are unclear on what … Read more
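The two variants named in the title can be sketched as follows for a single scalar parameter. This is a minimal illustration using the commonly cited update rules, not the post's own code; the function names, the default step size `eta`, and the momentum factor `gamma` are assumptions.

```python
def momentum_gradient_descent(grad, w0, eta=0.1, gamma=0.9, steps=200):
    """Gradient descent where each step also carries a fraction of the previous step."""
    w = w0
    update = 0.0  # running "velocity" built from past gradients
    for _ in range(steps):
        update = gamma * update + eta * grad(w)
        w = w - update
    return w


def nesterov_gradient_descent(grad, w0, eta=0.1, gamma=0.9, steps=200):
    """Nesterov variant: evaluate the gradient at a look-ahead point before stepping."""
    w = w0
    update = 0.0
    for _ in range(steps):
        lookahead = w - gamma * update  # where momentum alone would take us
        update = gamma * update + eta * grad(lookahead)
        w = w - update
    return w
```

As a usage example, minimizing the hypothetical loss (w - 3)^2 means passing its gradient, `lambda w: 2 * (w - 3)`, as `grad`; both functions then move `w` from a starting point such as 0.0 towards 3. The accumulated `update` term is what lets the optimizer keep moving quickly across gentle regions where the raw gradient is small.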

Create deep learning models with Flowpoints

An intuitive way to build and share deep learning models. I often find myself explaining how my models work. Sometimes I try to put it in layman’s terms, sometimes I just explain it as a black box, and sometimes I draw interconnected nodes representing parts of a neural net. Introducing Flowpoints Flowpoints is an open-sourced online … Read more

Automated movie Tagging- A Multiclass classification problem

Tagging of movies reveals a wide range of heterogeneous information about movies, like the genre, plot structure, soundtracks, metadata, visual and emotional experiences. That information can be valuable in building automatic systems to create tags for movies. Automatic tagging systems also help recommendation engines to improve the retrieval of similar movies as well as help … Read more

Exploring Reddit’s ‘Ask Me Anything’ Using the PRAW API Wrapper

Step 7: Exploring Data. The simplest way to help us quickly gauge the most meaningful AMAs is to plot the leaders by most commented (num_comments), most upvoted (num_upvotes), and most positive (upvote ratio), then do the same after grouping the rows by Topic/Category (‘link_flair’), taking the mean of these same stats for each group. Most Engaging … Read more

Kubernetes, The Open and Scalable Approach to ML Pipelines

Still waiting for ML training to be over? Tired of running experiments manually? Not sure how to reproduce results? Wasting too much of your time on devops and data wrangling? It’s okay if you’re a hobbyist, but data science models are meant to be incorporated into real business applications. Businesses won’t invest in data science … Read more

3 simple ways to handle large data with Pandas

Pandas love eating data. Pandas has become one of the most popular Data Science libraries out there. It’s easy to use, the documentation is fantastic, and its capabilities are powerful. Yet regardless of what library one uses, large datasets always present an extra challenge that needs to be handled with care. You start … Read more

Predicting 2018–19 NBA’s Most Valuable Player using Machine Learning

We are already quite deep into playoff basketball, but the results of the award voting haven’t been released yet, which gave me the idea to try to predict the results of the MVP award. James Harden and Giannis Antetokounmpo, top MVP candidates. Collecting The Data: I found the data on basketball-reference for each season all the way … Read more

Practical Statistics & Visualization With Python & Plotly

Photo credit: Pixabay. How to use Python and Plotly for statistical visualization, inference, and modeling. One day last week, I was googling “statistics with Python”, and the results were somewhat unfruitful. Most literature, tutorials and articles focus on statistics with R, because R is a language dedicated to statistics and has more statistical analysis features than Python. In … Read more

Predictive Analytics: Predicting Consumer Behavior with Data Analytics

Atif M., May 14. It’s no surprise that businesses spend millions of dollars in carrying out market research before coming up with a new service or product. Despite that, it is important to recognize that the final product doesn’t sell itself and actually requires the right marketing tools to make itself visible to the potential … Read more

The Unhidden Mysteries of the post-digital era

Image Credit: thriving in a post-digital-world Flashback to some 15 years ago when it was intriguing to have our phones talk back to us or trust the famous female voice on Google Maps to help us navigate our way through an unfamiliar journey. Fast forward to 2019, Alexa and many more digital technologies are part … Read more

Should We Code in English

My background, before entering the world of data science, was as a linguist. I studied everything from Bahasa Melayu to Zulu, so I well know the rich variety of grammars and syntaxes that exist in human languages. Yet although the capacity for linguistic fluency is our shared human birthright, one of … Read more

Graph Convolutional Networks for Geometric Deep Learning

Types of Graph Convolutions. There are 2 types of graph convolutions: spatial methods, which don’t require the use of eigen-stuff, and spectral methods, which do. Both methods are built on different mathematical principles, and it’s easy to notice similarities between approaches within each method. However, it’s probably not very intuitive as to why … Read more

Finding Bayesian Legos

Photo credit: Frédérique Voisin-Demery/Flickr (CC BY 2.0) Joe, a good family friend, dropped by earlier this week. As we do often, we discussed the weather (seems to be hotter than normal already here in the Pacific Northwest), the news (mostly about how we are both taking actions to avoid the news), and our kids. Both of … Read more

Get started with Object Oriented Programming in Python: Classes and Instances

New to OOP? Learn how to write a class and create instances in Python There are a lot of articles popping up on object-oriented programming in Python at the moment. Many data scientists, myself included, find ourselves in roles that focus on writing functional code, often in small scripts or prototypes. I’ve been working as a … Read more

Cancer and AI in a Single Frame

Cancer: One Uncontrollable Beast. Cancer is a complex and uncontrollable beast that mutates and changes continuously even before you get into the politics and the economics surrounding the issue. There are actually more than 100 different forms of verified cancerous diseases. We all have cancer cells in our bodies, but our immune system can fight … Read more

The hard things about being a data scientist in marketing

Hint: Not the code. Most Data Scientist job descriptions are increasingly loaded with requirements of technical skills spanning areas of machine learning, programming, tools and statistical knowledge. Candidates are constantly playing catch up with these requirements by loading their resumes with every known term in the data science vocabulary to improve their chances of being matched, … Read more

Full Stack Deep Learning Steps and Tools

Codebase Development. When we do the project, expect to write code for every step. Reproducibility is one thing we must be concerned with when writing the code: we need to make sure that our codebase is reproducible. Before we dive into tools, we need to choose the language and framework of our codebase. … Read more

Dark Side of Data Science Hackathons

I described several reasons to participate in hackathons in the previous part of this trilogy. The motivation to learn a lot and win valuable awards attracts almost everyone, but rather often the event fails and the participants leave dissatisfied due to the organizers’ or sponsoring companies’ mistakes. I wrote the current post to help avoid such unpleasant … Read more

Evolving Deep Neural Networks

Photo by Johannes Plenio on Unsplash. Many of us have seen deep learning achieve huge success in a variety of fields in recent years, with most of it coming from its ability to automate the frequently tedious and difficult feature engineering phase by learning “hierarchical feature extractors” from data [2]. Also, as architecture design (i.e. the … Read more