Keeping It Classy: How Quizlet uses hierarchical classification to label content with academic…

Quizlet’s community-curated catalog of study sets is massive (300M and growing) and covers a wide range of academic subjects. Having such a large and varied content catalog empowers Quizlet users to master any subject under the sun, but it also creates interesting information retrieval problems. Quizlet’s search, navigation, and personalization systems all rely on being … Read more

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a “generative probabilistic model” of a collection of composites made up of parts. In terms of topic modeling, the composites are documents and the parts are words and/or phrases (n-grams). But you could apply LDA to DNA and nucleotides, pizzas and toppings, molecules and atoms, employees and skills, or keyboards … Read more
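To make the "generative" part concrete, here is a minimal sketch of the standard LDA generative story: each document draws its own topic mixture, then each word is drawn from one of those topics. The vocabulary, the two topic distributions, and the function names are illustrative assumptions, not taken from the original post.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocab, alpha, n_words, rng):
    """Generate one composite (document) as a list of parts (words)."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then pick a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["gene", "dna", "pizza", "cheese"]  # hypothetical vocabulary
topics = [
    [0.45, 0.45, 0.05, 0.05],  # hypothetical "biology" topic
    [0.05, 0.05, 0.45, 0.45],  # hypothetical "food" topic
]
theta, doc = generate_document(topics, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, doc)
```

Fitting LDA runs this story in reverse: given only the documents, it infers plausible topics and mixtures.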

Speech recognition is hard — Part 1

Speech is the most natural form of communication for us — it’s second nature to us. And now, our machines have started to recognize our speech and they’re getting better and better at communicating with us. Current voice assistants and devices like Amazon Alexa and Google Home are getting more and more popular each month — they are changing … Read more

Great Books for Data Science

10 non-technical books that got me excited about data science. Some pretty good data books. There is no shortage of books that promise to teach data science. Most of these books read like college textbooks with a wealth of technical material prefaced by a short conceptual introduction. While there is plenty of demand for these technical skills, … Read more

Everything you need to know about Neural Networks and Backpropagation — Machine Learning Made Easy…

A neural network explanation from the ground up, including the math behind it. I find it hard to get step-by-step, detailed explanations of neural networks in one place. Some part of the explanation was always missing in courses or in videos. So I try to gather all the information and explanations in one … Read more

R Tip: Use Inline Operators For Legibility

A Python feature I miss when working in R is the convenience of Python's inline + operator. In Python, + does the right thing for some built-in data types. It concatenates lists: [1,2] + [3] is [1, 2, 3]. It concatenates strings: 'a' + 'b' is 'ab'. … Read more
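The two Python behaviors mentioned above can be checked directly. This is only a small illustration; the variable names are made up for the example.

```python
# + concatenates lists
combined_list = [1, 2] + [3]

# + concatenates strings
combined_string = 'a' + 'b'

print(combined_list)    # [1, 2, 3]
print(combined_string)  # ab
```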

A Beginner’s Guide To Reinforcement Learning With A Mario Bros Example

Machine learning has never been so easy! Imagine a world where every computer system is customized specifically to your own personality. It learns the nuances of how you communicate and how you wish to be communicated with. Interacting with a computer system becomes more intuitive than ever and technological literacy skyrockets. These are the potential … Read more

Predicting stock market crashes with statistical machine learning techniques and neural networks

With this blog post I am introducing the design of a machine learning algorithm that aims to forecast crashes in stock markets solely based on past price information. I start with a quick background on the problem and elaborate on my approach and findings. All the code and data are available on GitHub. A stock … Read more

I tracked my happiness each day of 2018

Observing and analyzing trends in my own mental health. The Preface: I tracked my mental health each day throughout 2018. I rated my happiness on a scale of 1–5, with “1” being a really bad day, “2” being a kind of bad day, “3” being a neutral day, “4” being a kind of good day, and … Read more

Why you should care about the rising economy of algorithms

In 2016, a Reddit user made a confession. FiletOfFish1066 had automated all of the work tasks and spent around six years “doing nothing”. While the original post seems to have disappeared from Reddit, there are numerous reports about the admission. The original poster suggested that he (all the stories refer to FiletOfFish1066 as male) spent … Read more

A Review of NeurIPS 2018

Perspectives from data scientists working in finance. Sue Liu, Jan 14. Sue Liu and Boris Mitrovic, R&D Data Scientists at Mudano Ltd, Edinburgh UK. Using our personal development budgets, a great scheme offered by Mudano, my colleague Boris and I attended last year’s Neural Information Processing Systems (NeurIPS) conference in Montreal. This is the largest and … Read more

ggeffects 0.8.0 now on CRAN: marginal effects for regression models #rstats

I’m happy to announce that version 0.8.0 of my ggeffects package is on CRAN now. The update fixes some bugs from the previous version and comes with many new features and improvements. One major focus of the latest version is fixes and improvements for mixed models, especially zero-inflated mixed models (fitted … Read more

pcLasso: a new method for sparse regression

I’m excited to announce that my first package has been accepted to CRAN! The package pcLasso implements principal components lasso, a new method for sparse regression which I’ve developed with Rob Tibshirani and Jerry Friedman. In this post, I will give a brief overview of the method and some starter code. (For an in-depth description … Read more

4 Ways To Calculate A Running Total With SQL

Calculating a running total/rolling sum in SQL is a useful skill to have. It can often come in handy for reporting and even when developing applications. Sometimes your users might want to see a running total of the points they have gained or perhaps the money they have earned. Like many problems in SQL, there … Read more
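As one illustration of the idea, here is a minimal sketch that computes a running total with a correlated subquery, run through Python's built-in sqlite3 module. The `scores` table, its columns, and the sample rows are hypothetical, and this is just one possible approach, not necessarily one of the four the post covers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (day INTEGER PRIMARY KEY, points INTEGER)")
conn.executemany(
    "INSERT INTO scores (day, points) VALUES (?, ?)",
    [(1, 10), (2, 5), (3, 20)],  # hypothetical sample data
)

# For each row, sum the points of every row up to and including that day.
rows = conn.execute(
    """
    SELECT s1.day,
           s1.points,
           (SELECT SUM(s2.points)
              FROM scores AS s2
             WHERE s2.day <= s1.day) AS running_total
      FROM scores AS s1
     ORDER BY s1.day
    """
).fetchall()
conn.close()

for day, points, running_total in rows:
    print(day, points, running_total)
```

A correlated subquery is easy to read but rescans the table for each row, so on large tables a window function is usually the better choice where the database supports it.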

rOpenSci’s new Code of Conduct

We are pleased to announce the release of our new Code of Conduct. rOpenSci’s community is our best asset and it’s important that we put strong mechanisms in place before we have to act on a report. As before, our Code applies equally to members of the rOpenSci team and to anyone from the community … Read more

R Coding Style Guide

Language is a tool that allows human beings to interact and communicate with each other. The more clearly we express ourselves, the better the idea is transferred from our mind to another. The same applies to programming languages: concise, clear and consistent code is easier to read and edit. It is especially important if you … Read more

Style Transfer – Styling Images with Convolutional Neural Networks

In this project, we are going to use a pre-trained VGG16 model, which looks as follows. [Figure: VGG16 architecture] Keep in mind that we are not going to use the fully connected (blue) and softmax (yellow) layers. They act as a classifier, which we don’t need here. We are going to use only feature extractors … Read more

Why I Code & Coffee

A few thoughts on the value of code & coffee events. My history with tech meetup events started in Madison, WI while I was in my PhD program (finished that in May of 2017). At the time, Madison didn’t have a meetup where techies would go to work on individual projects. Instead, they were structured more … Read more

colorspace: New Tools for Colors and Palettes

A major update (version 1.4.0) of the R package colorspace has been released to CRAN, enhancing many of the package’s capabilities, e.g., more refined palettes, named palettes, ggplot2 color scales, visualizations for assessing palettes, shiny and Tcl/Tk apps, color vision deficiency emulation, and much more. Overview: The colorspace package provides a broad toolbox for selecting … Read more

A Product-centric View of Data

How does this mindset then change team composition and what they focus on? Data Engineering: As you’re now thinking about interfaces with other systems, interactions with users, and intelligent recommendations built by the data science team in POCs, you will start needing engineers with different skills. Good data engineers have different skills to those who can architect … Read more

Understanding Scoring Propensity: A Mixed Model Approach to Evaluating NBA Players

“Who’s the best scorer in the NBA?” is a question that comes up a lot during conversations with my friends. Names like LeBron James, James Harden, and Steph Curry always come up. It’s often difficult to come up with a single answer; the question becomes more nuanced when distinctions are made within scorers. How do … Read more

Travis CI for R — Advanced guide

Continuous integration for building an R project in Travis CI, including code coverage, pkgdown documentation, OSX and multiple R versions. Travis CI is a common tool to build R packages. It is in my opinion the best platform to use R in continuous integration. Some of the … Read more

Showing a difference in means between two groups

Visualising a difference in means between two groups isn’t as straightforward as it should be. After all, it’s probably the most common quantitative analysis in science. There are two obvious options: we can either plot the data from the two groups separately, or we can show the estimate of the difference with an interval around it. … Read more

Medium + r-bloggers — How to integrate?

Build up a PHP script that allows you to post your Medium articles on The script filters an RSS feed by item tags. Motivation: I started my blog about R on Medium. Medium is a wonderful platform with a great user interface. The idea to … Read more

Hashes power Probabilistic Data Structures

Hash functions are used all over computer science, but I want to mention their usefulness within probabilistic data structures and algorithms. We all know that data structures are the building blocks of most algorithms. A bad choice could lead to hard and inefficient solutions instead of elegant and efficient … Read more

Open Questions: Carlos A. Gomez-Uribe

Q. How did you come to work in Product at Netflix? Do you think it’s important for product managers who work on recommendation to have math skills? A. I started at Netflix as a data scientist, and worked on a wide range of projects across the still-small company. However, after talking to the engineering and product … Read more

Funderstanding competitive neural networks

Vector quantization: the general idea. Imagine you have a black-and-white image. You can think of such an image as, effectively, a list of point coordinates (x, y) for every point you want to be coloured black. You would then approach a grid, like in a square-ruled mathematics exercise book, and colour in every point on the … Read more

First neural network for beginners explained (with code)

Creating our own simple neural network. Let’s create a neural network from scratch with Python (3.x in the example below).

    import numpy, random, os
    a = 1 #learning rate
    bias = 1 #value of bias
    weights = [random.random(),random.random(),random.random()] #weights generated in a list (3 weights in total for 2 neurons and the bias)

The beginning of the program just … Read more
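To show how three such weights (two inputs plus a bias) could be used, here is a minimal perceptron sketch. The step activation, the OR-gate training data, the fixed seed, and all function names are assumptions made for illustration; the original post's actual code is cut off in the excerpt and may differ.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (assumption)

learning_rate = 1
bias = 1
# Three weights: one per input plus one for the bias term.
weights = [random.random(), random.random(), random.random()]

def predict(x1, x2):
    """Step activation on the weighted sum of the inputs and the bias."""
    total = x1 * weights[0] + x2 * weights[1] + bias * weights[2]
    return 1 if total > 0 else 0

def train_step(x1, x2, target):
    """Perceptron rule: nudge each weight by error times its input."""
    error = target - predict(x1, x2)
    weights[0] += learning_rate * error * x1
    weights[1] += learning_rate * error * x2
    weights[2] += learning_rate * error * bias

# Hypothetical training data: the logical OR gate.
or_gate = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]

for _ in range(100):
    for x1, x2, target in or_gate:
        train_step(x1, x2, target)

predictions = [predict(x1, x2) for x1, x2, _ in or_gate]
print(predictions)
```

OR is linearly separable, so the perceptron rule settles on a correct set of weights after a small number of passes.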

Setting Up AWS EC2 Instance for Beginners

Access Your EC2 Instance via SSH. EC2 instance, check; .pem key, check. Before proceeding, you need to locate your public DNS highlighted in green. Click on your newly created instance and a description box should appear like the one below. You use the ssh (secure shell) command to access your instance. Open a Terminal window and … Read more

Effect of Cambridge Analytica’s Facebook ads on the 2016 US Presidential Election

Cambridge Analytica, an advertising company and an offshoot of the SCL group, was founded in 2013 but has been defunct since May 1st, 2018. The company had a political and a commercial wing, and from their website, the political wing “combines the predictive data analytics, behavioral sciences, and innovative ad tech into one award … Read more

Web Development of NLP Model in Python & Deployed in Flask

Introduction: NLP spam architecture. Consider a system that uses machine learning to detect spam SMS text messages. Our ML system’s workflow is like this: Train offline -> Make model available as a service -> Predict online. A classifier is trained offline with spam and non-spam messages. The trained model is deployed as a … Read more

TimeSeries Data Munging — Lagging Variables that are Distributed Across Multiple Groups

2. Lag one variable across multiple groups using the “unstack” method. This method is slightly more involved because there are several groups, but manageable because only one variable needs to be lagged. Overall, we should be aware that we want to index the data first, then unstack to separate the groups before applying the lag function. Failure … Read more

Getting Started with Randomized Optimization in Python

How to use randomized optimization algorithms to solve simple optimization problems with Python’s mlrose package. mlrose provides functionality for implementing some of the most popular randomization and search algorithms, and applying them to a range of different optimization problem domains. In this tutorial, we will discuss what is meant by an optimization problem and step through … Read more

Secret of Google Web-Based OCR Service

Optical Character Recognition (OCR) is one of the ways to connect the real world and the virtual world. The first OCR system was introduced in the late 1920s. The objective of OCR is recognising text from images. However, it is very challenging to achieve very high accuracy due to many factors. In the following story, I will … Read more

The Most Intuitive and Easiest Guide for Artificial Neural Network

Demystifying neural networks for complete starters. Neural Network! Deep learning! Artificial Intelligence! Anyone living in the world of 2019 will have heard these words more than once. And you have probably seen awesome work such as image classification, computer vision, and speech recognition. So are you also interested in building those cool … Read more

Mathematics for Data Science

Motivation: Learning the theoretical background for data science or machine learning can be a daunting experience, as it involves multiple fields of mathematics and a long list of online resources. In this piece, my goal is to suggest resources to build the mathematical background necessary to get up and running in data science practical/research work. … Read more

XmR Chart | Step-by-Step Guide by Hand and with R

Is your process in control? The XmR chart is a great statistical process control (SPC) tool that can help you answer this question, reduce waste, and increase productivity. We’ll cover the concepts behind XmR charting and explain the XmR control constant with some super simple R code. Lastly, we’ll cover how to make the XmR … Read more
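As a rough sketch of the calculation behind an XmR chart: the limits for the individuals (X) chart are the mean plus or minus 2.66 times the average moving range, where 2.66 is the usual XmR control constant. The function name and the sample measurements below are hypothetical, and the example is in Python, not the R code the post refers to.

```python
def xmr_limits(values):
    """Return the center line and control limits for the individuals (X) chart."""
    mean = sum(values) / len(values)
    # Moving range: absolute difference between consecutive measurements.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_moving_range = sum(moving_ranges) / len(moving_ranges)
    return {
        "center": mean,
        "ucl": mean + 2.66 * avg_moving_range,  # upper control limit
        "lcl": mean - 2.66 * avg_moving_range,  # lower control limit
    }

measurements = [10, 12, 11, 13, 12]  # hypothetical process data
limits = xmr_limits(measurements)
print(limits)
```

Points that fall outside the lower or upper limit are the ones worth investigating as possible signs that the process is out of control.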

I Highly Recommend Scotch

Scotch isn’t exactly the cheapest of spirits one can buy; in fact, it’s probably, on average, the most expensive, with the majority of bottles priced around $60-$80 and the upper end in the thousands or even tens of thousands of dollars. That being said, most consumers of it want to know that if … Read more

How Smart is Your News Source?

Text Data Analysis of 21 Different News Outlets. I think it’s more important than ever to understand the perspectives and biases of our news sources. Unfortunately there is just so much news¹ that it is almost impossible for us to escape our tiny filter bubbles. Luckily, the same technology that got us into this mess can … Read more

Generating Synthetic Data Sets with ‘synthpop’ in R

Synthpop – A great music genre and an aptly named R package for synthesising population data. I recently came across this package while looking for an easy way to synthesise unit record data sets for public release. The goal is to generate a data set which contains no real units, therefore safe for public release … Read more

Top Sources For Machine Learning Datasets

It can be quite hard to find a specific dataset to use for a variety of machine learning problems or to even experiment on. The list below does not only contain great datasets for experimentation but also contains a description, usage examples and in some cases the algorithm code to solve the machine learning problem … Read more

Making sense of the METS and ALTO XML standards

Last week I wrote a blog post where I analyzed one year of newspaper ads from 19th century newspapers. The data is made available by the national library of Luxembourg. In this blog post, which is part 1 of a 2 part series, I extract data from the 257 GB archive, which contains 10 years of publications of the L’Union, … Read more
