Talks

I had the pleasure of presenting at the following events and conferences: Upcoming: useR 2019 – Toulouse: ‘Serverless Computing in R’; PyDays Vienna 2019: ‘Hydrogen & Pweave – A better Jupyter Notebook?’; Vienna Applied AI Meetup by AI Austria Meetup ‘Serverless computing: AWS Lambda with R and Docker as a Service’; Vienna-R Meetup ‘Serverless computing …

Wilmington’s crime rate has soared — so has its police spending

(Illustration: Jared Whalen; photo: Creative Commons) Policing has taken up a greater and greater share of government spending in Wilmington over the last three decades and today makes up a larger portion of government expenditures in Wilmington than in any other large U.S. city, according to data on local government finances. Out of $516 million spent …

All you want to know about preprocessing: Data preparation

This is the introductory part, in which we discuss how to check and prepare your data for further preprocessing. Nowadays, almost all ML/data mining project workflows follow the standard CRISP-DM (Cross-Industry Standard Process for Data Mining) or its IBM enhancement, ASUM-DM (Analytics Solutions Unified Method for Data Mining/Predictive Analytics). The longest and …

Step-by-Step Setup for Your Automated Home Trading System

This article walks you through the step-by-step setup of your automated home trading system, built in Python. Disclaimer: Nothing herein is financial advice, and NOT a recommendation to trade real money. Many platforms exist for simulated trading (paper trading) which can be used for building and developing the methods discussed. Please use common sense and always …

How To Be Confident In Your Neural Network Confidence

These notes are based on the research paper “On Calibration of Modern Neural Networks” (Guo et al., 2017). Very large and deep models, such as ResNet, are far more accurate than their older counterparts, such as LeNet, on computer vision datasets such as CIFAR100. However, while they are better at classifying images, we are less …

Reverse Engineering the Walk Score Algorithm

Using Machine Learning to Build a Walkability Score. Heatmap of Predicted Walk Scores throughout Seattle, WA. I live in Seattle and recently moved to a different neighborhood. According to Walk Score’s proprietary algorithms, I moved from the 9th most walkable Seattle neighborhood to the 30th. I can still easily walk to a local coffee shop and …

Performance Lawn Equipment: An exercise in addressing business efficiency

Business Operations Efficiency. Figure 9: Ratio of Deliveries Made On Time by Month. We can see the rate of on-time deliveries made across a 4-year span in figure 9 (above). A low was reached during March 2010, when 27 of 1,116 deliveries were not delivered on time, resulting in a drop from 98.1% to 97.6%. A …

A New Release of rIP (v1.2.0) for Detecting Fraud in Online Surveys

We are excited to announce the latest major release of rIP (v1.2.0), which is an R package that detects fraud in online surveys by tracing, scoring, and visualizing IP addresses. Essentially, rIP takes an array of IP addresses, which are always captured in online surveys (e.g., MTurk), and the keys for the services the user …

Machine Learning for Radiology — Where to Begin

Anaconda. Anaconda is an open-source platform that is perhaps the easiest way to get started with Python machine learning on Linux, Mac OS X and Windows. It helps you manage the programming environments, and includes common Python packages used in data science. You can download the distribution for your platform at https://www.anaconda.com/distribution/. Once you install …

Machine Learning Model for Recommending the Crew Size for Cruise Ship Buyers

In this tutorial, we build a regression model using the cruise_ship_info.csv dataset for recommending the crew size for potential cruise ship buyers. This tutorial will highlight important data science and machine learning concepts such as: data preprocessing and variable selection; basic regression model building; hyperparameter tuning; model evaluation; and techniques for dimensionality reduction. The GitHub …

The Definite Guide For Creating An Academic-Level Dataset With Industry Requirements And…

Guidelines For Creating Your Own Data, Accompanied by Valuable Information To Aid You When Making Key Decisions. Teenagers playing football, Ipanema beach, Rio De Janeiro, Brazil. Ektar 100 Film, by Ori Cohen. In the following article, I will talk about the process of starting a research project in which an academic-level dataset, such as those shared …

End to End Recipe Cuisine Classification

Who should read this? If you are interested in learning about a high-level overview of a Machine Learning system from scratch including: Data Collection (web scraping); Processing and cleaning the data; Modeling, Training and Testing; Deployment as a cloud service; Scheduling to re-run the system, get any new recipes, …

Know Thyself: Using Data Science to Explore Your Own Genome

DNA analysis with pandas and Selenium. “Nosce te ipsum” (“know thyself”), a well-known ancient maxim, frequently associated with anatomical knowledge. Image from the University of Cambridge. 23andMe once offered me a free DNA and ancestry test kit if I participated in one of their clinical studies. In exchange for a cheek swab and baring my guts …

Yuval Noah Harari and Fei-Fei Li on AI

Outsourcing Self-Awareness to AI. “What does it mean to live in a world in which you learn about something so important about yourself from an algorithm?” (Yuval Noah Harari) For millennia, humans have been outsourcing some of the things that our brains do. Writing allows us to keep precise records instead of relying on our memory. Navigation …

AI, Machine Learning and Data Science Roundup: May 2019

A monthly roundup of news about Artificial Intelligence, Machine Learning and Data Science. This is an eclectic collection of interesting blog posts, software announcements and data applications from Microsoft and elsewhere that I’ve noted over the past month or so. Open Source AI, ML & Data Science News. PyTorch 1.1 is now available, with new …

Meta-learning of Adversarial Generative models

Motivation. Convolutional neural networks have been successful in generating realistic human head images by training neural networks on a large dataset of images of a single person. However, in many practical scenarios, such personalized talking head models need to be learnt from a few image views of a person, sometimes limited to a single image. …

8 Reasons Why Python is Good for Artificial Intelligence and Machine Learning

This article about why Python is good for ML and AI was originally posted on the Django Stars blog. Artificial Intelligence (AI) and Machine Learning (ML) are the new black of the IT industry. While discussions over the safety of its development keep escalating, developers expand the abilities and capacity of artificial intellect. Today Artificial Intelligence went …

Using Reinforcement Learning to play Super Mario Bros on NES using TensorFlow

Reinforcement learning is currently one of the hottest topics in machine learning. For a recent conference we attended (the awesome Data Festival in Munich), we developed a reinforcement learning model that learns to play Super Mario Bros on NES so that visitors who come to our booth can compete against the agent in terms of …

How to keep up with CRAN policies and processes?

CRAN, the Comprehensive R Archive Network, changes its rules and workflow every so often: see for instance the new encoding setting of one of its check flavors. As a package developer, you’d better keep up with CRAN policies and processes to be able to safely retain your package(s) on CRAN and to prepare your next …

Employee flight risk modeling behavior

An analytical model for predicting employee flight risk behaviour. “People are the nucleus of any organization. So, how can you find, engage and retain top performers who’ll contribute to your goals, your future?” There is no dearth of Enterprise Resource Planning (ERP) systems utilized by human resource companies; however, the inclusion of machine learning to …

Interactive charts with chartbookR

“There is no such thing as information overload. There is only bad design.” (Edward Tufte). There is nothing worse than charts overloaded with information. One solution is interactive charts that let users select the time series they’re interested in, zoom in on them, and focus on individual data points. The chartbookR package …

Data Science Jobs Report 2019: Python Way Up, Tensorflow Growing Rapidly, R Use Double SAS

In my ongoing quest to track The Popularity of Data Science Software, I’ve just updated my analysis of the job market. To save you from reading the entire tome, I’m reproducing that section here. Job Advertisements. One of the best ways to measure the popularity or market share of software for data science is to …

Basic Principles to Create a Time Series Forecast

Explaining the basic steps to create time series forecasts. We are surrounded by patterns that can be found everywhere: one can notice patterns in the four seasons in relation to the weather; patterns in peak hours when it comes to the volume of traffic; in your heartbeats, as well as in the shares of …

Using Dimensionality Reduction to Visualize Job Polarization

PC1 and PC2 extracted from the MDS Embedding using 2003 data. Each point represents a job, and each color represents a job zone. The smaller the job zone, the less education/experience it requires. In this post, we illustrate how dimensionality reduction techniques including principal component analysis (PCA) and multidimensional scaling (MDS) can be used …

Using Random Forest to tell if you have a representative Validation Set

This is a quick check that one of your most important machine learning tasks is correctly set up. Photo by João Silas on Unsplash. When running a predictive model, be that during a Kaggle competition or in the real world, you need a representative validation set to check whether the model you are training generalises well, that is, the model can …

How I Built a System to Track 15,000,000+ Prices a Day

Giving Waldo a Brain. We started to build scrapers for each domain we wanted Waldo to support. While the HTML might differ from domain to domain, our approach was the same: get a list of all the categories for each domain; get all the products within each category. This meant the scrapers we built were pretty …

What single step does with relationship

We had a journal club about the single step GBLUP method for genomic evaluation a few weeks ago. In this post, we’ll make a few graphs of how the single step method models relatedness between individuals. Imagine you want to use genomic selection in a breeding program that already has a bunch of historical pedigree …

Free Will, Clairvoyant Demons, and Determinism

The Tao of Data Science. Determinism, generative machine learning, and whether or not free will in humans (or machines) is possible. Laplace provided an interesting insight into generative machine learning. Laplace’s Demon. Pierre-Simon Laplace supposed that everything is composed of atoms and that Newtonian physics governs the motions of atoms. As a thought experiment, Laplace imagined a kind …

Classifying Hate Speech: an overview

A brief look at label classification and hate speech. By Jacob Crabb, Sherry Yang, and Anna Zubova. What is hate speech? The challenge of wrangling hate speech is an ancient one, but the scale, personalization, and velocity of today’s hate speech is a uniquely modern dilemma. While there is no exact definition of hate speech, in general, it …

Can neural networks create new knowledge? Unboxing a neural net

Bottom-up construction of an XOR NN. XOR is a Boolean function defined by the mapping XOR(0,0) = XOR(1,1) = 0 and XOR(1,0) = XOR(0,1) = 1. To construct a neural net for XOR, we remember or google the identity XOR(x,y) = AND(NAND(x,y), OR(x,y)). This helps because the …

Probability will only break your heart — Or —  Trust the Process, Doubt the Procedure: NBA playoff…

Data Collection & Preprocessing. Finding the data. A short search for the best data to settle this question led to 538’s expertly curated Historical NBA Elo dataset (under CC BY license). (Of course the eminent basketball-reference.com has the data, but not in as convenient a format, that I could tell. Only later did I learn of …

The Whole Data Science World in Your Hands

Testing MatrixDS capabilities on different languages and tools. If you work with data you have to check this out. I’ve been looking for years for a platform where I can run my data science projects without the pain of installations and filling my computer with dozens of different tools and environments. Luckily I found that MatrixDS …

Artificial Intelligence and The Trader

Trading used to be an art form. Now, it’s different. Jun Wu, May 28. @mikofilm, unsplash.com. “Artificial Intelligence is to trading what fire was to the cavemen.” (an industry player) When I was working on the trading floor of some of the largest investment banks, I met some unbelievably talented traders. They were characters, to say …

Automate Your KPI Forecasts With Only 1 Line of R Code Using AutoTS

If you are having the following symptoms at your company when it comes to business KPI forecasting, then maybe you need to look at automated forecasting: ugly Excel spreadsheets with multiple tabs and 2000s-style pastel formatting; business unit managers, store managers, operations managers, sales teams, and finance teams who give convoluted and indirect answers …

Intelligent Digital Robots or RPA 2.0

We live in unprecedented times of exponential growth of technology. With AI solutions knocking on every door, it is time to think about how they will influence the nature of the jobs we do. In the late 18th century, the Western world went through the Industrial Revolution, changing from hand production methods to machines. Since then the world has …

Build your own Recommender System within 5 minutes!

The most successful and widespread application of machine learning technologies in business is the Recommendation System. You are browsing through Spotify to listen to a song but cannot decide which one. You are surfing through YouTube to watch some videos but are not able to decide which video to look at. There are so many …

A Basic Python Tweet Class

Simple strategies for processing tweet data. Photo by Ray Hennessy on Unsplash. Motivations. Twitter is an amazing source of data with all kinds of opportunities for analysis. NLTK, spaCy, and other Python NLP tools have many powerful, applicable features, and pandas makes it easy to wrangle tabular data. Still, there are some challenges. Tweets, while short, often …

Giving Some Tips For Data Science Interviews, After Interviewing 60 Candidates at Expedia

During the past year, I interviewed many people for data science positions at Expedia Group, from entry level to senior, and thought I’d share my experience here in case it is useful for people applying for data science positions, and give you guys some tips on the kind of questions you may get. Interviewing …

Epileptic Seizure Classification ML Algorithms

Data Exploration. The dataset contains a hashed patient ID column, 178 EEG readings over one second, and a y output variable describing the status of the patient at that second. When a patient is having a seizure, y is denoted as 1, while all other numbers are other statuses we aren’t interested in. So when …

simstudy update – stepped-wedge design treatment assignment

simstudy has just been updated (version 0.1.13 on CRAN), and includes one interesting addition (and a couple of bug fixes). I am working on a post (or two) about intra-cluster correlations (ICCs) and stepped-wedge study designs (which I’ve written about before), and I was getting tired of going through the convoluted process of generating data …

ramlegacy: a package for RAM Legacy Database

Introduction. ramlegacy is a new R package to download, cache and read in all the different versions of the RAM Legacy Stock Assessment Database, a public database containing stock assessment results of commercially exploited marine populations from around the world. The package accomplishes all this by: Providing a function, download_ramlegacy(), to download all the available …

Job @ Oxford

Boby Mihaylova has two exciting posts available at the Health Economics Research Centre at the University of Oxford. In particular, she is looking for two R-minded researchers/analysts to develop work on disease modelling/cost-effectiveness using large individual-patient databases. In fact, I think it’s really good that they are explicitly including knowledge of R as part of …