I had the pleasure of presenting at the following events and conferences: Upcoming: useR 2019 – Toulouse: ‘Serverless Computing in R’; PyDays Vienna 2019: ‘Hydrogen & Pweave – A better Jupyter Notebook?’; Vienna Applied AI Meetup by AI Austria Meetup: ‘Serverless computing: AWS Lambda with R and Docker as a Service’; Vienna-R Meetup: ‘Serverless computing … Read more

Categories: Featured

Wilmington’s crime rate has soared — so has its police spending

(Illustration: Jared Whalen; photo: Creative Commons) Policing has taken up a growing share of government spending in Wilmington over the last three decades and today makes up a larger portion of government expenditures in Wilmington than in any other large U.S. city, according to data on local government finances. Out of $516 million spent … Read more

All you want to know about preprocessing: Data preparation

This is an introductory part, where we discuss how to check and prepare your data for further preprocessing. Nowadays, almost all ML/data mining project workflows follow the standard CRISP-DM (Cross-Industry Standard Process for Data Mining) or its IBM enhancement, ASUM-DM (Analytics Solutions Unified Method for Data Mining/Predictive Analytics). The longest and … Read more

Step-by-Step Setup for Your Automated Home Trading System

This article walks you through the step-by-step setup of your automated home trading system, built in Python. Disclaimer: Nothing herein is financial advice, and NOT a recommendation to trade real money. Many platforms exist for simulated trading (paper trading) which can be used for building and developing the methods discussed. Please use common sense and always … Read more

The year of voice

According to a Wall Street Journal article, tech companies now have their eyes on the next billion internet users, mostly from the developing world. But the new users are going to be different from the first billion in the sense that they are more likely to favor voice and video over typing and text. Averaging … Read more

Reverse Engineering the Walk Score Algorithm

Using Machine Learning to Build a Walkability Score. Heatmap of predicted Walk Scores throughout Seattle, WA. I live in Seattle and recently moved to a different neighborhood. According to Walk Score’s proprietary algorithms, I moved from the 9th most walkable Seattle neighborhood to the 30th. I can still easily walk to a local coffee shop and … Read more

A New Release of rIP (v1.2.0) for Detecting Fraud in Online Surveys

We are excited to announce the latest major release of rIP (v1.2.0), which is an R package that detects fraud in online surveys by tracing, scoring, and visualizing IP addresses. Essentially, rIP takes an array of IP addresses, which are always captured in online surveys (e.g., MTurk), and the keys for the services the user … Read more

Categories: R

Machine Learning for Radiology — Where to Begin

Anaconda Anaconda is an open-source platform that is perhaps the easiest way to get started with Python machine learning on Linux, Mac OS X and Windows. It helps you manage programming environments and includes common Python packages used in data science. You can download the distribution for your platform at . Once you install … Read more

Machine Learning Model for Recommending the Crew Size for Cruise Ship Buyers

In this tutorial, we build a regression model using the cruise_ship_info.csv dataset for recommending the crew size for potential cruise ship buyers. This tutorial will highlight important data science and machine learning concepts such as: data preprocessing and variable selection; basic regression model building; hyperparameter tuning; model evaluation; and techniques for dimensionality reduction. The GitHub … Read more

The Definite Guide For Creating An Academic-Level Dataset With Industry Requirements And…

Guidelines For Creating Your Own Data, Accompanied by Valuable Information To Aid You When Making Key Decisions. Teenagers playing football, Ipanema beach, Rio De Janeiro, Brazil. Ektar 100 Film, by Ori Cohen. In the following article, I will talk about the process of starting a research project in which an academic-level dataset, such as those shared … Read more

End to End Recipe Cuisine Classification

Who should read this? Anyone interested in a high-level overview of building a Machine Learning system from scratch, including: data collection (web scraping); processing and cleaning the data; modeling, training and testing; deployment as a cloud service; scheduling to re-run the system, get any new recipes, … Read more

Know Thyself: Using Data Science to Explore Your Own Genome

DNA analysis with pandas and Selenium. “Nosce te ipsum” (“know thyself”), a well-known ancient maxim, frequently associated with anatomical knowledge. Image from the University of Cambridge. 23andMe once offered me a free DNA and ancestry test kit if I participated in one of their clinical studies. In exchange for a cheek swab and baring my guts … Read more

Yuval Noah Harari and Fei-Fei Li on AI

Outsourcing Self-Awareness to AI “What does it mean to live in a world in which you learn about something so important about yourself from an algorithm?” — Yuval Noah Harari For millennia humans have been outsourcing some of the things that our brains do. Writing allows us to keep precise records instead of relying on our memory. Navigation … Read more

AI, Machine Learning and Data Science Roundup: May 2019

A monthly roundup of news about Artificial Intelligence, Machine Learning and Data Science. This is an eclectic collection of interesting blog posts, software announcements and data applications from Microsoft and elsewhere that I’ve noted over the past month or so. Open Source AI, ML & Data Science News PyTorch 1.1 is now available, with new … Read more

Exploring Exploratory Data Analysis

The whole point of Exploratory Data Analysis (EDA) is to just take a step back and look at the dataset before doing anything with it. EDA is just as important as any part of a data project because real datasets are really messy and lots of things can go wrong. If you don’t know your … Read more

Meta-learning of Adversarial Generative models

Motivation Convolutional neural networks have been successful in generating realistic human head images by training neural networks on a large dataset of images of a single person. However, in many practical scenarios, such personalized talking head models need to be learnt from a few image views of a person, sometimes limited to a single image. … Read more

8 Reasons Why Python is Good for Artificial Intelligence and Machine Learning

This article about why Python is good for ML and AI was originally posted on the Django Stars blog. Artificial Intelligence (AI) and Machine Learning (ML) are the new black of the IT industry. While discussions over the safety of its development keep escalating, developers keep expanding the abilities and capacity of artificial intelligence. Today Artificial Intelligence went … Read more

Using Reinforcement Learning to play Super Mario Bros on NES using TensorFlow

Reinforcement learning is currently one of the hottest topics in machine learning. For a recent conference we attended (the awesome Data Festival in Munich), we developed a reinforcement learning model that learns to play Super Mario Bros on NES so that visitors who come to our booth can compete against the agent in terms of … Read more

How to keep up with CRAN policies and processes?

CRAN, the Comprehensive R Archive Network, changes its rules and workflow every so often: see for instance the new encoding setting of one of its check flavors. As a package developer, you’d better keep up with CRAN policies and processes to be able to safely retain your package(s) on CRAN and to prepare your next … Read more

Categories: R

Employee flight risk modeling behavior

An analytical model for predicting employee flight risk behaviour. “People are the nucleus of any organization. So, how can you find, engage and retain top performers who’ll contribute to your goals, your future?” There is no dearth of Enterprise Resource Planning (ERP) systems utilized by human resource companies; however, the inclusion of machine learning to … Read more

Categories: R

Momentum Investing with R

After an extended hiatus, Reproducible Finance is back! We’ll celebrate by changing focus a bit and coding up an investment strategy called Momentum. Before we even tiptoe in that direction, please note that this is not intended as investment advice and it’s not intended to be a script that can be implemented for trading. The … Read more

Categories: R

Interactive charts with chartbookR

“There is no such thing as information overload. There is only bad design.” (Edward Tufte) There is nothing worse than charts overloaded with information. One solution is interactive charts that let users select the time series they’re interested in, zoom in on them, and focus on individual data points. The chartbookR package … Read more

Categories: R

Data Science Jobs Report 2019: Python Way Up, Tensorflow Growing Rapidly, R Use Double SAS

In my ongoing quest to track The Popularity of Data Science Software, I’ve just updated my analysis of the job market. To save you from reading the entire tome, I’m reproducing that section here. Job Advertisements One of the best ways to measure the popularity or market share of software for data science is to … Read more

Categories: R

Using Dimensionality Reduction to Visualize Job Polarization

PC1 and PC2 extracted from the MDS embedding using 2003 data. Each point represents a job, and each color represents a job zone. The lower the job zone, the less education and experience the job requires. In this post, we illustrate how dimensionality reduction techniques including principal component analysis (PCA) and multidimensional scaling (MDS) can be used … Read more

Using Random Forest to tell if you have a representative Validation Set

This is a quick check that one of your most important machine learning tasks is correctly set up. Photo by João Silas on Unsplash. When running a predictive model, be that during a Kaggle competition or in the real world, you need a representative validation set to check whether the model you are training generalises well, that is, the model can … Read more

What single step does with relationship

We had a journal club about the single step GBLUP method for genomic evaluation a few weeks ago. In this post, we’ll make a few graphs of how the single step method models relatedness between individuals. Imagine you want to use genomic selection in a breeding program that already has a bunch of historical pedigree … Read more

Categories: R

Free Will, Clairvoyant Demons, and Determinism

The Tao of Data Science. Determinism, generative machine learning, and whether or not free will in humans (or machines) is possible. Laplace provided an interesting insight into generative machine learning. Laplace’s Demon: Pierre-Simon Laplace supposed that everything is composed of atoms and that Newtonian physics governs the motions of atoms. As a thought experiment, Laplace imagined a kind … Read more

Classifying Hate Speech: an overview

A brief look at label classification and hate speech. By Jacob Crabb, Sherry Yang, and Anna Zubova. What is hate speech? The challenge of wrangling hate speech is an ancient one, but the scale, personalization, and velocity of today’s hate speech is a uniquely modern dilemma. While there is no exact definition of hate speech, in general, it … Read more

Probability will only break your heart — Or —  Trust the Process, Doubt the Procedure: NBA playoff…

Data Collection & Preprocessing Finding the data A short search for the best data to settle this question led to 538’s expertly curated Historical NBA Elo dataset (under CC BY license). (Of course the eminent has the data, but not in as convenient a format, that I could tell. Only later did I learn of … Read more

Moving from Keras to PyTorch

The Classy way to write your network? OOPs: Object-Oriented Programming. Let us first create an example network in Keras, which we will then try to port to PyTorch. Here I would like to give a piece of advice too. When you try to move from Keras to PyTorch, take any network you have and try porting it … Read more

The Whole Data Science World in Your Hands

Testing MatrixDS capabilities on different languages and tools. If you work with data you have to check this out. I’ve been looking for years for a platform where I can run my data science projects without the pain of installations and filling my computer with dozens of different tools and environments. Luckily I found that MatrixDS … Read more

Books for sale!

I’m clearing out some books. You can buy them! They are all good ones, it’s just that I don’t need to have hard copies filling up my (already bulging) shelves. They are all cheaper than Amazon, although shipping beyond the EU is more expensive as I don’t have my own logistics empire. Logistic regression yes, … Read more

Categories: R

Artificial Intelligence and The Trader

Trading used to be an art form. Now, it’s different. Jun Wu, May 28. @mikofilm “Artificial Intelligence is to trading what fire was to the cavemen.” (an industry player) When I was working on the trading floor of some of the largest investment banks, I met some unbelievably talented traders. They were characters, to say … Read more

Automate Your KPI Forecasts With Only 1 Line of R Code Using AutoTS

If you are having the following symptoms at your company when it comes to business KPI forecasting, then maybe you need to look at automated forecasting: ugly Excel spreadsheets with multiple tabs and 2000s-style pastel formatting; business unit managers, store managers, operations managers, sales teams, and finance teams who give convoluted and indirect answers … Read more

Categories: R

Intelligent Digital Robots or RPA 2.0

We live in unprecedented times of exponential growth in technology. With AI solutions knocking on every door, it is time to think about how they will influence the nature of the jobs we do. In the late 18th century the Western world went through the Industrial Revolution, changing from hand production methods to machines. Since then the world has … Read more

Package Spotlight: anim.plots

The package anim.plots behaves like a sort of user-friendly shell on top of animate that makes animations of some of the most common types of plots in base R in a more intuitive fashion than animate. This package depends on two other important packages: – magick, which is an R implementation of ImageMagick, which itself … Read more

Categories: R

Learning R: Painting with Fire

A few months ago I published a post on recursion: To understand Recursion you have to understand Recursion…. In this post we will see how to use recursion to fill free areas of an image with colour, the caveats of recursion and how to transform a recursive algorithm into a loop-based version using a queue … Read more

Categories: R
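The "Painting with Fire" excerpt above mentions turning a recursive fill into a loop-based version that uses a queue. The original post works in R; the following is only a minimal Python sketch of that general idea, not the author's code. The function name `flood_fill`, the list-of-lists image representation, and the choice of 4-connectivity are all assumptions made for illustration.

```python
from collections import deque


def flood_fill(grid, start, new_colour):
    """Fill the 4-connected region containing `start` with `new_colour`, in place.

    `grid` is a list of equal-length lists (a hypothetical stand-in for an image).
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = start
    old_colour = grid[r0][c0]
    if old_colour == new_colour:
        return grid  # nothing to change; also prevents an endless loop

    # The queue replaces the call stack that a recursive version would use.
    queue = deque([(r0, c0)])
    grid[r0][c0] = new_colour
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == old_colour:
                grid[nr][nc] = new_colour  # colour before queueing, so no cell is queued twice
                queue.append((nr, nc))
    return grid


image = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
flood_fill(image, (0, 0), 2)
```

Because pending cells live in an explicit queue rather than on the call stack, large regions do not run into a recursion depth limit, which is one of the caveats of recursion that the excerpt alludes to.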

A Basic Python Tweet Class

Simple strategies for processing tweet data. Photo by Ray Hennessy on Unsplash. Motivations: Twitter is an amazing source of data with all kinds of opportunities for analysis. NLTK, spaCy, and other Python NLP tools have many powerful, applicable features, and pandas makes it easy to wrangle tabular data. Still, there are some challenges. Tweets, while short, often … Read more

Giving Some Tips For Data Science Interviews, After Interviewing 60 Candidates at Expedia

During the past year, I interviewed many people for data science positions at Expedia Group, from entry level to senior, and thought to share my experience here in case it can be useful for people applying for data science positions, and give you guys some tips on the kind of questions you may get. Interviewing … Read more

Epileptic Seizure Classification ML Algorithms

Data Exploration The dataset contains a hashed patient ID column, 178 EEG readings over one second, and a Y output variable describing the status of the patient at that second. When a patient is having a seizure, y is denoted as 1 while all other numbers are other statuses we aren’t interested in. So when … Read more
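The labelling described above (1 for a seizure, other numbers for statuses that are not of interest) is often collapsed into a binary target. The excerpt is cut off before it says what the post actually does, so the following is only a hedged sketch of one common approach. The helper name `to_binary_labels` and the sample values are hypothetical.

```python
def to_binary_labels(y_values, positive_class=1):
    """Map the seizure class to 1 and every other status to 0."""
    return [1 if y == positive_class else 0 for y in y_values]


# Hypothetical sample of raw y values, not taken from the dataset itself.
raw_y = [1, 2, 3, 1, 5]
binary_y = to_binary_labels(raw_y)
```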

Instagram Data Analysis

picture credits to Background This project is built on top of the data challenge that Panoply released in April 2019. Panoply is a cloud data warehouse that lets you easily gather data from different sources (e.g. AWS S3, Google Analytics, etc.) into one place and then connect to different Business Intelligence … Read more

simstudy update – stepped-wedge design treatment assignment

simstudy has just been updated (version 0.1.13 on CRAN), and includes one interesting addition (and a couple of bug fixes). I am working on a post (or two) about intra-cluster correlations (ICCs) and stepped-wedge study designs (which I’ve written about before), and I was getting tired of going through the convoluted process of generating data … Read more

Categories: R

ramlegacy: a package for RAM Legacy Database

Introduction ramlegacy is a new R package to download, cache and read in all the different versions of the RAM Legacy Stock Assessment Database, a public database containing stock assessment results of commercially exploited marine populations from around the world. The package accomplishes all this by: Providing a function download_ramlegacy(), to download all the available … Read more

Categories: R

Job @ Oxford

Boby Mihaylova has two exciting posts available at the Health Economics Research Centre at the University of Oxford. In particular, she is looking for two R-minded researchers/analysts to develop work on disease modelling/cost-effectiveness using large individual-patient databases. In fact, I think it’s really good that they are explicitly including knowledge of R as part of … Read more

Categories: R