Is the #10YearChallenge A Sign of the AI Apocalypse?

Viral social media “challenges,” memes, and gimmicks have taken over our feeds in recent years. The term “challenge” is used loosely, though, since these viral sensations aren’t so much challenging as they are unique ways to spice up your social media presence. But are they also signs of the impending AI apocalypse? Let’s look … Read more

How It Feels to Learn Data Science in 2019

Seeing the (Random) Forest Through the (Decision) Trees The following is inspired by the article How it Feels to Learn JavaScript in 2016. Do not take this too seriously. This piece is just an opinion, much like people’s definition of data science. I heard you are the one to go to. Thank you for meeting … Read more

Maximum Likelihood Estimation

Coin Flip MLE Let’s derive the MLE estimator for our coin flip model from before. I’ll cover the MLE estimator for our linear model in a later post on linear regression. Recall that we’re modeling the outcome of a coin flip by a Bernoulli distribution, where the parameter p represents the probability of getting a heads. … Read more
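
As a rough illustration of the coin flip MLE described in the excerpt above, the sketch below computes the closed-form estimate (the fraction of heads) and checks it against a brute-force search over the Bernoulli log-likelihood. The flip data and the function names are made up for this example and are not taken from the original post.

```python
import math

def bernoulli_log_likelihood(p, flips):
    """Log-likelihood of heads-probability p for a list of 0/1 flips (1 = heads)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return heads * math.log(p) + tails * math.log(1 - p)

def bernoulli_mle(flips):
    """Closed-form MLE for a Bernoulli parameter: the observed fraction of heads."""
    return sum(flips) / len(flips)

# Hypothetical data: 7 heads out of 10 flips.
flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
p_hat = bernoulli_mle(flips)

# Sanity check: search a grid of candidate values for the one that
# maximizes the log-likelihood; it should agree with the closed form.
grid = [i / 100 for i in range(1, 100)]
best_on_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, flips))

print(p_hat, best_on_grid)
```

The grid search is only there as a check; in practice the closed-form estimate is all that is needed for this model.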

Review: DRRN — Deep Recursive Residual Network (Super Resolution)

Up to 52 Convolutional Layers, With Global and Local Residual Learnings, Outperforms SRCNN, FSRCNN, ESPCN, VDSR, DRCN, and RED-Net. Digital Image Enlargement, The Need of Super Resolution In this story, DRRN (Deep Recursive Residual Network) is reviewed. With Global Residual Learning (GRL) and Multi-path mode Local Residual Learning (LRL), plus the recursive learning to control the … Read more

Python Basics: Mutable vs Immutable Objects

Source: After reading this blog post you’ll know: What are an object’s identity, type, and value What are mutable and immutable objects Introduction (Objects, Values, and Types) All the data in a Python code is represented by objects or by relations between objects. Every object has an identity, a type, and a value. Identity An … Read more
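
A minimal sketch of the identity, type, and value ideas mentioned in the excerpt above, using only built-ins. The variable names are chosen for illustration and do not come from the original post.

```python
# Every object has an identity (id), a type, and a value.
numbers = [1, 2, 3]
identity_before = id(numbers)

# Lists are mutable: the value changes in place, the identity does not.
numbers.append(4)
list_kept_identity = id(numbers) == identity_before

# Strings are immutable: item assignment raises TypeError.
text = "abc"
try:
    text[0] = "x"
    str_is_mutable = True
except TypeError:
    str_is_mutable = False

# Two names bound to the same mutable object see each other's changes.
alias = numbers
alias.append(5)

print(type(numbers), list_kept_identity, str_is_mutable, numbers)
```

The aliasing step at the end is the practical reason the distinction matters: a change made through one name is visible through every other name bound to the same mutable object.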

Tweets Data Visualization with Circles and User Interaction

Adding Interactivity: Tweet Info by Click After plotting and packing all the circles, we can make each circle work like a button. To achieve this, we can use the function fig.canvas.mpl_connect. The function takes two arguments: the first is a string that corresponds to the type of interaction (in our case … Read more

A little trick for debugging Shiny

This is gonna be a short post about a little trick I’ve been using while developing Shiny Apps. (Spoiler: nothing revolutionary) A browser anywhere, anytime The first thing to do is to insert an action button, and a browser() in the observeEvent() watching this button. This is a standard approach: at any time, you just … Read more

Categories: R. Tags: Excerpt

Send UDP Probes (with payloads) and Receive/Process Responses in R

We worked pretty hard over at $DAYJOB on helping to quantify and remediate a fairly significant configuration weakness in Ubiquiti network gear attached to the internet. Ubiquiti network gear (routers, switches, wireless access points, etc.) are enterprise-grade components and are a joy to work with. Our home network is liberally populated … Read more

Categories: R. Tags: Excerpt

Function Objects and Pipelines in R

Composing functions and sequencing operations are core programming concepts. Some notable realizations of sequencing or pipelining operations include: The idea is: many important calculations can be considered as a sequence of transforms applied to a data set. Each step may be a function taking many arguments. It is often the case that only one of … Read more

Categories: R. Tags: Excerpt

Retail Data Visualization with R and Shiny

Introduction Because of my marketing background, finding information hiding within a marketing dataset is always an interesting topic to me. I feel a sense of accomplishment when I clean up a very messy, large dataset and finally discover some insights from it. Therefore, I’ve decided to practice my skills of data cleaning and … Read more

Categories: R. Tags: Excerpt

Understanding Studies of Racial Demarcations

Studies of racial demarcations are typically implemented in the context of what are referred to as regression analyses. Simply put, a regression enables assessments of relations between some variable of interest, say students’ test scores, and variables that define said students, such as race, family income, parents’ professions, parents’ education etc. Pictorially, with x’s denoting variables … Read more

Learning aggregate functions

Machine Learning with relational data This article is inspired by the Kaggle competition. While I did not participate in the competition, I used the data to explore another problem that often arises when working with realistic data. All machine learning algorithms work great with tabular data, but in reality a lot of data … Read more

These are the Easiest Data Augmentation Techniques in Natural Language Processing you can think of…

Augmentation operations for NLP proposed in [this paper]. SR=synonym replacement, RI=random insertion, RS=random swap, RD=random deletion. The Github repository for these techniques can be found [here]. Data augmentation is commonly used in computer vision. In vision, you can almost certainly flip, rotate, or mirror an image without risk of changing the original label. However, in natural … Read more

Building Our Own Open Source Supercomputer with R and AWS

How to build a scalable computing cluster on AWS and run hundreds or thousands of models in a short amount of time. We will rely completely on R and open source R packages. This is post 1 out of 2. Introduction An ever-increasing number of businesses is moving to the cloud and using platforms such as Amazon Web … Read more

Categories: R. Tags: Excerpt

Transfer Learning using ELMO Embedding

Last year, the major developments in “Natural Language Processing” were about Transfer Learning. Basically, Transfer Learning is the process of training a model on a large-scale dataset and then using that pre-trained model to process learning for another target task. Transfer Learning became popular in the field of NLP thanks to the state-of-the-art performance of … Read more

Model-Free Prediction: Reinforcement Learning

Part 4: Model-Free Predictions with Monte-Carlo Learning, Temporal-Difference Learning and TD(λ) Previously, we looked at planning by dynamic programming to solve a known MDP. In this post, we will use model-free prediction to estimate the value function of an unknown MDP. That is, we will look at policy evaluation of an unknown MDP. This series of … Read more

Matplotlib Tutorial: Learn basics of Python’s powerful Plotting library

What is Matplotlib To make statistical inferences, it becomes necessary to visualize your data, and Matplotlib is one such solution for Python users. It is a very powerful plotting library useful for those working with Python and NumPy. The most used module of Matplotlib is Pyplot, which provides an interface like MATLAB but … Read more

Introduction to TWO approaches of Content-based Recommendation System

A complete guide to resolve the confusion Content-based filtering is one of the common methods in building recommendation systems. While I tried to do some research in understanding the detail, it is interesting to see that there are 2 approaches that claim to be “Content-based”. Below I will share my findings and hope it can … Read more

R Package Update: urlscan

The urlscan package (an interface to the API) is now at version 0.2.0 and supports the service’s authentication requirement when submitting a link for analysis. The service is handy if you want to learn about the details, all the gory technical details, for a website. For instance, say you wanted to check on … Read more

Categories: R. Tags: Excerpt

Synthesising Multiple Linked Data Sets and Sequences in R

In my last post I looked at generating synthetic data sets with the ‘synthpop’ package, some of the challenges and neat things the package can do. It is simple to use which is great when you have a single data set with independent features. This post will build on the last post by tackling other … Read more

Categories: R. Tags: Excerpt

Machine Learning and Particle Motion in Liquids: An Elegant Link

The gradient descent algorithm is one of the most popular optimization techniques in machine learning. It comes in three flavors: batch or “vanilla” gradient descent (GD), stochastic gradient descent (SGD), and mini-batch gradient descent which differ in the amount of data used to compute the gradient of the loss function at each iteration. The goal … Read more

Three steps for a successful machine learning project

Less technical considerations to make for all ML projects As people and companies venture into machine learning (ML), it is common for some to expect to dive right into building models and generating useful output. And while some parts of ML feel like this technical wizardry with magical predictions, there are other aspects that are less … Read more

Contextual Embeddings for NLP Sequence Labeling

Text representation (aka text embeddings) is a breakthrough in solving NLP tasks. In the beginning, a single word vector represented a word even when it carried different meanings across contexts. For example, “Washington” can be a location, name, or state. “University of Washington” Zalando released an amazing NLP library, flair, which makes our lives easier. It already implements … Read more

Espresso Filters: An Analysis

Data Analysis 1. Hole Diameter First, we will look at hole size per filter. Originally, hole size was calculated by determining the area of pixels above threshold per each hole and determining the diameter. However, the results did not show the detail I was looking for because it was based on whole pixels. This is a … Read more

Multiple Data (Time Series) Streams Clustering

To leave a comment for the author, please follow the link and comment on their blog: Peter Laurinec. … Read more

Categories: R. Tags: Excerpt

Navigate through Decennial Census and American Community Survey

Finding the right content in census data can be daunting. Just to give you an idea of how complex the census data are, there are 1127 tables and 25070 columns of table contents in the 2012-2017 ACS 5-year summary file alone. 2010 decennial census summary file 1: 333 tables, 8959 columns. 2012-2017 5-year ACS summary file: 1127 tables, 25070 columns. 2017 … Read more

Categories: R. Tags: Excerpt

The power of tapping into your community for support

This week the owner of my favorite Mexican restaurant in Baltimore, Rosalyn Vera, got death and arson threats. I could have been a bystander, but I tapped into my network and asked for help and she has received it. It’s been great to see the power of the community in action. The backstory So, I … Read more

Categories: R. Tags: Excerpt

Deep Learning with Satellite Data

“The rockets and the satellites, spaceships that we’re creating now, we’re pollinating the universe.” -Neil Young Overview - Satellite Data - Data Collection - Model - Results Overview While at the University of Sannio in Benevento, Italy this January, my friend Tuomas Oikarinen and I created a (semi-automated) pipeline for downloading publicly available images, and trained a 3-D Convolutional Neural Network on … Read more

Making Programming Easier with Keyboard Macros — Video

A recent video from Linus Tech Tips introduced how one of their editors uses macros for video editing. This got me thinking; can macros be easily created to improve my programming? This video demonstrates how creating code macros can be achieved and how useful it can be: Background Source: Linus Tech Tips — Can your Keyboard do … Read more

Unsupervised Feature Learning

Deep Convolutional Networks on Image tasks take in Image Matrices of the form (height x width x channels) and process them into low-dimensional features through a series of parametric functions. Supervised and Unsupervised Learning tasks both aim to learn a semantically meaningful representation of features from raw data. Training Deep Supervised Learning models requires a … Read more

Predicting Kickstarter Campaign Success with Gradient Boosted Decision Trees: A Machine Learning…

Fitting the models, evaluating performance, choosing a final model, and predicting on a new (totally real) campaign Another common thing in the data science workflow is trying out multiple models. There are ways to minimize the effort in this stage based on what you want to accomplish or what the dataset is/what the problem is (you … Read more

Best practices in Ads Search

Big Data is the process of collecting and analyzing large amounts of information. The complexity and large volume of data that our society currently generates has made it impossible to capture, manage, process or analyse with the technologies we know so far. Big Data embraces five features: volume (manages terabytes or petabytes of information), variety … Read more

Modeling cumulative impact — Part I

Create simple features of cumulative impact, predict sports performance with the fitness-fatigue model “Little by little, a little becomes a lot.” -Tanzanian proverb Welcome to Modeling cumulative impact, a series that views the cumulative impact of athletic training on sports performance through a variety of modeling lenses. The journey starts here in Part I with … Read more

Homebrew 2.0.0 Released == homebrewanalytics package updated

A major new release of Homebrew has landed and now includes support for Linux as well as Windows (via the Windows Subsystem for Linux)! There are overall stability and speed improvements baked in as well. The aforelinked notification has all the info you need to see the minutiae. Unless you’ve been super-lax in updating, brew … Read more

Categories: R. Tags: Excerpt

WTF is image classification?

Conquering convolutional neural networks for the curious and confused Photo by Micheile Henderson on Unsplash “One thing that struck me early is that you don’t put into a photograph what’s going to come out. Or, vice versa, what comes out is not what you put in.” ― Diane Arbus A notification pops up on your favorite social … Read more

Simulating the Six Nations 2019 Rugby Tournament in R

I really like running simulation models before sporting events because they can give you so much more depth of understanding compared to the ‘raw’ odds that you get from the media or bookmakers, etc. Yes, we might hear that a team has a “30% chance of winning a tournament”. But there might be another strong … Read more

Categories: R. Tags: Excerpt

Review: DCN — Deformable Convolutional Networks, 2nd Runner Up in 2017 COCO Detection (Object…

With Deformable Convolution, Improved Faster R-CNN and R-FCN, Got 2nd Runner Up in COCO Detection & 3rd Runner Up in COCO Segmentation. After reviewing STN, this time DCN (Deformable Convolutional Networks), by Microsoft Research Asia (MSRA), is reviewed. (a) Conventional Convolution, (b) Deformable Convolution, (c) Special Case of Deformable Convolution with Scaling, (d) Special Case … Read more

Statistics is the Grammar of Data Science — Part 3/5

Moments Moments describe various aspects of the nature and shape of our distribution. #1: The first moment is the mean of the data, which describes the location of the distribution. #2: The second moment is the variance, which describes the spread of the distribution. Data with a higher variance are more spread out than data with a lower variance. #3: The third moment is … Read more
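
A short sketch of the first two moments named in the excerpt above, using Python's standard statistics module. The two samples are invented for this example: they share the same mean but differ in spread, so only the variance tells them apart.

```python
import statistics

# Two made-up samples with the same location but different spread.
tight = [4.0, 5.0, 5.0, 6.0]
wide = [0.0, 5.0, 5.0, 10.0]

# First moment: the mean, which describes where the data sit.
mean_tight = statistics.fmean(tight)
mean_wide = statistics.fmean(wide)

# Second moment (about the mean): the population variance,
# which describes how far the data spread around that location.
var_tight = statistics.pvariance(tight)
var_wide = statistics.pvariance(wide)

print(mean_tight, mean_wide)  # same location
print(var_tight, var_wide)    # different spread
```

Here pvariance treats each list as a whole population; statistics.variance would instead apply the sample correction.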

Comparing Different Classification Machine Learning Models for an imbalanced dataset

A data set is called imbalanced if it contains many more samples from one class than from the rest of the classes. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while other classes make up the majority. In this scenario, … Read more

Setting up your blog with RStudio and blogdown II: Workflow

Workflow In Part I of this series of posts we set up our new blog using blogdown and Hugo. Once the blog is configured, this is the typical workflow I follow to write new posts and update my blog online: Open your blog project with RStudio Load the blogdown library and start the Hugo server and … Read more

Categories: R. Tags: Excerpt

ML Algorithms: One SD (σ)- Instance-based Algorithms

An intro to machine learning instance-based algorithms The obvious questions to ask when facing a wide variety of machine learning algorithms are “which algorithm is better for a specific task, and which one should I use?” The answers to these questions vary depending on several factors, including: (1) The size, quality, and nature of data; (2) The … Read more

The Data Driven Partier: Movie Mustache

The concept behind ‘Movie Mustache’ is simple, but revolutionary. Watch a movie with friends but with a mustache or two on the TV — whenever the mustache lines up with a character’s upper lip, everyone drinks. This game was foreign to me until a few weeks ago when I got to experience it watching the Adam Sandler … Read more

Tutorial: Sequential Pattern Mining in R for Business Recommendations

by Allison Koenecke, Data Scientist, AI & Research Group at Microsoft, with acknowledgements to Amita Gajewar and John-Mark Agosta. In this tutorial, Allison Koenecke demonstrates how Microsoft could recommend to customers the next set of services they should acquire as they expand their use of the Azure Cloud, by using a temporal extension to conventional … Read more

Categories: R. Tags: Excerpt

Comparing Python Virtual Environment tools

Thanks to Keith Smith, Alexander Mohr, Victor Kirillov and Alain SPAITE for recommending pew, venv and pipenv. I just love the community that we have on Medium. I recently published an article on using Virtual Environments for Python projects. The article was well received and the feedback from readers opened a new view for me. … Read more

Data Science and Agile

Suggested frameworks for effectiveness (Part 2 of 2) This is the second post in a 2-part sharing on Data Science and Agile. In the last post, we discussed the aspects of Agile that work, and don’t work, in the data science process. You can find the previous post here. A quick recap of what works well … Read more


One cannot escape the feeling that these mathematical formulas have an independent existence and an intelligence of their own, that they are wiser than we are, wiser even than their discoverers (Heinrich Hertz) I love spending my time doing mathematics: transforming formulas into drawings, experimenting with paradoxes, learning new techniques … and R is a perfect … Read more

Categories: R. Tags: Excerpt

PyViz: Simplifying the Data Visualisation process in Python.

Exploring Data with PyViz In this section, we will see how different libraries are effective in bringing out different insights from data and how combining them can really help to analyse data in a better way. Dataset The dataset being used pertains to the number of cases of measles and pertussis recorded per 100,000 people over time … Read more