R Coding Style Guide

Language is a tool that allows human beings to interact and communicate with each other. The more clearly we express ourselves, the better the idea is transferred from our mind to another's. The same applies to programming languages: concise, clear and consistent code is easier to read and edit. It is especially important if you …

Style Transfer – Styling Images with Convolutional Neural Networks

In this project, we are going to use a pre-trained VGG16 model, which looks as follows. VGG16 Architecture (source: https://medium.com/@franky07724_57962/using-keras-pre-trained-models-for-feature-extraction-in-image-clustering-a142c6cdf5b1). Keep in mind that we are not going to use the fully connected (blue) and softmax (yellow) layers. They act as a classifier, which we don’t need here. We are going to use only feature extractors …

Interpreting the coefficients of linear regression

Nowadays there is a plethora of machine learning algorithms we can try out to find the best fit for our particular problem. Some of the algorithms have a clear interpretation; others work as a black box, and we can use approaches such as LIME or SHAP to derive some interpretations. In this article I would …

colorspace: New Tools for Colors and Palettes

A major update (version 1.4.0) of the R package colorspace has been released to CRAN, enhancing many of the package’s capabilities, e.g., more refined palettes, named palettes, ggplot2 color scales, visualizations for assessing palettes, shiny and Tcl/Tk apps, color vision deficiency emulation, and much more. Overview: The colorspace package provides a broad toolbox for selecting …

A Product-centric View of Data

How does this mindset then change team composition and what they focus on? Data Engineering: As you’re now thinking about interfaces with other systems, interactions with users, and intelligent recommendations built by the data science team in POCs, you will start needing engineers with different skills. Good data engineers have different skills to those who can architect …

Useful Sentiment Analysis: Mining SEC Filings (Part 1)

Motivation and Data Collection: As an aside, another reason I like text data is because of a formative experience on my career path as a data scientist. In 2008–2009 I was a senior in high school and attended a talk by Vinton Cerf at BBN Technologies in Cambridge, MA. In his talk he went over …

Understanding Scoring Propensity: A Mixed Model Approach to Evaluating NBA Players

“Who’s the best scorer in the NBA?” is a question that comes up a lot during conversations with my friends. Names like LeBron James, James Harden, and Steph Curry always come up. It’s often difficult to come up with a single answer; the question becomes more nuanced when distinctions are made within scorers. How do …

Bayesian Modeling for Ford GoBike Ridership with PyMC3 — Part I

Bike shares are a large part of the transport equation for cities around the world. In San Francisco, one of the major players in the bike share game is Ford with its GoBike program. Conveniently, they kindly release their data for people like me to study. I wonder if it …

Identifying rooftops on low-resolution images with Masked R-CNN model

The work is done by Smriti Bahuguna and Rasika Joshi. In my previous post, I wrote about how we are using the U-Net model to identify rooftops in low-resolution satellite images. I also wrote about how we are using a community of ML enthusiasts to build the solution. In this article, I will share results from another …

Travis CI for R — Advanced guide

Continuous integration for building an R project in Travis CI, including code coverage, pkgdown documentation, osx and multiple R versions. Travis CI is a common tool to build R packages. It is, in my opinion, the best platform to use R in continuous integration. Some of the …

Showing a difference in means between two groups

Visualising a difference in mean between two groups isn’t as straightforward as it should be. After all, it’s probably the most common quantitative analysis in science. There are two obvious options: we can either plot the data from the two groups separately, or we can show the estimate of the difference with an interval around it. …

Medium + r-bloggers — How to integrate?

Build a PHP script that allows you to post your Medium articles on r-bloggers.com. The script filters an RSS feed by item tags. Motivation: I started my blog about R on Medium. Medium is a wonderful platform with a great user interface. The idea to …

Hashes power Probabilistic Data Structures

Hash functions are used all over computer science, but I want to mention their usefulness within probabilistic data structures and algorithms. We all know that data structures are the building blocks of most algorithms. A bad choice could lead to hard and inefficient solutions instead of elegant and efficient …

Open Questions: Carlos A. Gomez-Uribe

Q. How did you come to work in Product at Netflix? Do you think it’s important for product managers who work on recommendations to have math skills? A. I started at Netflix as a data scientist, and worked on a wide range of projects across the still-small company. However, after talking to the engineering and product …

Funderstanding competitive neural networks

Vector quantization: the general idea. Imagine you have a black-and-white image. You can think of such an image as, effectively, a list of point coordinates (x, y) for every point you want to be coloured black. You would then approach a grid, like in a square-ruled mathematics exercise book, and colour in every point on the …

First neural network for beginners explained (with code)

Creating our own simple neural network: Let’s create a neural network from scratch with Python (3.x in the example below).

    import numpy, random, os
    a = 1  # learning rate
    bias = 1  # value of bias
    weights = [random.random(), random.random(), random.random()]  # weights generated in a list (3 weights in total for 2 neurons and the bias)

The beginning of the program just …

Setting Up AWS EC2 Instance for Beginners

Access Your EC2 Instance via SSH: EC2 instance, check; .pem key, check. Before proceeding, you need to locate your public DNS, highlighted in green. Click on your newly created instance and a description box should appear like the one below. You use the ssh (secure shell) command to access your instance. Open a Terminal window and …

Effect of Cambridge Analytica’s Facebook ads on the 2016 US Presidential Election

Cambridge Analytica, an advertising company and an offshoot of the SCL group, was founded in 2013 but has been defunct since May 1st, 2018. The company had a political and a commercial wing, and from their website, the political wing “combines the predictive data analytics, behavioral sciences, and innovative ad tech into one award …

Web Development of NLP Model in Python & Deployed in Flask

Introduction on NLP spam architecture: Consider a system using machine learning to detect spam SMS text messages. Our ML system’s workflow is like this: Train offline -> Make model available as a service -> Predict online. A classifier is trained offline with spam and non-spam messages. The trained model is deployed as a …

Algorithms should contribute to the Happiness of Society

Data is here to help us answer questions that we deem important, so what do we want to ask? The following is the opening speech of Arjan van den Born, Academic Director of the Jheronimus Academy of Data Science (JADS), that we worked on for the Den Bosch Data Week of which I am the cofounder …

TimeSeries Data Munging — Lagging Variables that are Distributed Across Multiple Groups

2. Lag one variable across multiple groups — using “unstack” method. This method is slightly more involved because there are several groups, but manageable because only one variable needs to be lagged. Overall, we should be aware that we want to index the data first, then unstack to separate the groups before applying the lag function. Failure …
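The index, unstack, then lag sequence described in this excerpt can be sketched in pandas roughly as follows. This is a minimal illustration rather than the article's own code: the column names `date`, `group` and `value` and the toy numbers are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, group) pair.
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-01",
                            "2019-01-02", "2019-01-02",
                            "2019-01-03", "2019-01-03"]),
    "group": ["A", "B", "A", "B", "A", "B"],
    "value": [10, 100, 11, 110, 12, 120],
})

# 1. Index by date and group.
# 2. Unstack so that each group becomes its own column.
# 3. Shift down one row, which lags every group independently.
wide = df.set_index(["date", "group"])["value"].unstack("group")
lagged = wide.shift(1)
```

Shifting the long-format column directly would instead pull values from the neighbouring group's row, which is why the groups are separated into columns first.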

Getting Started with Randomized Optimization in Python

How to use randomized optimization algorithms to solve simple optimization problems with Python’s mlrose package. mlrose provides functionality for implementing some of the most popular randomization and search algorithms, and applying them to a range of different optimization problem domains. In this tutorial, we will discuss what is meant by an optimization problem and step through …

The Most Intuitive and Easiest Guide for Artificial Neural Network

Demystifying neural networks for complete starters. Neural Network! Deep learning! Artificial Intelligence! Anyone living in the world of 2019 will have heard these words more than once. And you have probably seen awesome work such as image classification, computer vision, and speech recognition. So are you also interested in building those cool …

Mathematics for Data Science

Motivation: Learning the theoretical background for data science or machine learning can be a daunting experience, as it involves multiple fields of mathematics and a long list of online resources. In this piece, my goal is to suggest resources to build the mathematical background necessary to get up and running in data science practical/research work. …

XmR Chart | Step-by-Step Guide by Hand and with R

Is your process in control? The XmR chart is a great statistical process control (SPC) tool that can help you answer this question, reduce waste, and increase productivity. We’ll cover the concepts behind XmR charting and explain the XmR control constant with some super simple R code. Lastly, we’ll cover how to make the XmR …
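As a rough sketch of the arithmetic behind an XmR chart: the limits come from the average moving range scaled by the standard control constants 2.66 (individuals chart) and 3.267 (upper limit of the moving-range chart). The example below is written in Python purely for illustration and is not the R code from the post; the function name `xmr_limits` and the measurements are made up.

```python
from statistics import mean


def xmr_limits(values):
    """Return centre lines and control limits for an XmR chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # Moving ranges: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = mean(values)
    mr_bar = mean(moving_ranges)
    return {
        "x_center": x_bar,
        "x_ucl": x_bar + 2.66 * mr_bar,   # upper limit, individuals chart
        "x_lcl": x_bar - 2.66 * mr_bar,   # lower limit, individuals chart
        "mr_center": mr_bar,
        "mr_ucl": 3.267 * mr_bar,         # upper limit, moving-range chart
    }


measurements = [10, 12, 11, 13, 12]  # hypothetical process data
limits = xmr_limits(measurements)
```

A point outside `x_lcl`..`x_ucl` is the usual first signal that the process may be out of control.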

Generating Synthetic Data Sets with ‘synthpop’ in R

Synthpop – A great music genre and an aptly named R package for synthesising population data. I recently came across this package while looking for an easy way to synthesise unit record data sets for public release. The goal is to generate a data set which contains no real units, therefore safe for public release …

Top Sources For Machine Learning Datasets

It can be quite hard to find a specific dataset to use for a variety of machine learning problems or to even experiment on. The list below does not only contain great datasets for experimentation but also contains a description, usage examples and in some cases the algorithm code to solve the machine learning problem …

Making sense of the METS and ALTO XML standards

Last week I wrote a blog post where I analyzed one year of newspaper ads from 19th century newspapers. The data is made available by the national library of Luxembourg. In this blog post, which is part 1 of a 2 part series, I extract data from the 257 GB archive, which contains 10 years of publications of the L’Union, …

Sentiment Classification with Natural Language Processing on LSTM

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. LSA is an information retrieval technique which analyzes and identifies the patterns in an unstructured collection of text and the relationships between them. LSA itself is …

A Practical Guide to Interpreting and Visualising Support Vector Machines

About the author: Hugo Dolan is an undergraduate Financial Mathematics student at University College Dublin. This is mostly based on and motivated by recent data analytics and machine learning experiences in the NFL Punt Analytics Kaggle Competition and being part of the team who won the Citadel Dublin Data Open, along with material from Stanford’s …

How to Perform Lasso and Ridge Regression in Python

A quick tutorial on how to use lasso and ridge regression to improve your linear model. Previously, I introduced the theory underlying lasso and ridge regression. We now know that they are alternate fitting methods that can greatly improve the performance of a linear model. In this quick tutorial, we revisit …

Practical Data Science with R, 2nd Edition discount!

Please help share our news and this discount. The second edition of our best-selling book Practical Data Science with R2, Zumel, Mount is featured as deal of the day at Manning. The second edition isn’t finished yet, but chapters 1 through 4 are available in the Manning Early Access Program (MEAP), and we have finished …

The Reality of Global Nuclear Weapons and How Russian Nukes Turned On Your Lights

The global nuclear stockpile peaked in 1985 and has been on a rapid decline ever since. Again, exact estimates differ, but the Federation of American Scientists states numbers went from 70,300 to 14,485 as of 2018. This represents nearly an 80% decline in total nuclear weapons! Today, there are 9 countries with confirmed nuclear weapons …

Building a Conversational Chatbot for Slack using Rasa and Python -Part 2

Deploying the Bot on Slack. Create a Python Script: Since we are done with all the requirements, it’s time to deploy our bot. For this, we will need to write a Python script called run_app.py, which will integrate our chatbot with the Slack app that we created above. We will begin by creating a slack …

My Tryst with Deep Learning — German Traffic data set with Keras

The Deep Learning course offered by New York Data Science Academy is a great way to get started on your journey with deep learning, and it also encourages you to do a full-fledged deep learning project. I decided to do an image recognition challenge using the German Traffic sign data set. I have never worked on image …

The real meaning and process of data democratization.

The scene: a small village in rural India. The whole of the village has gathered to listen as public records are being read out. A villager is listed in the public record as having rented out his plough to the government-sponsored irrigation project. “No,” he says, “I did not do that. I was away in …

I walk the (train) line – part deux – the weight loss continues

(TL;DR: author continues to use his undiagnosed OCD for good. Breadth-first search introduced on simple graph.) We learnt how to get OpenStreetMap data into R last time. And I said that we will be doing a little bit of this: So what the hell is this? This is an example of breadth-first search of a …
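For readers unfamiliar with the technique, breadth-first search visits a graph level by level: the start node first, then all of its neighbours, then their unvisited neighbours, and so on. The sketch below is a generic Python illustration, not the post's R code; the function name `bfs_order` and the four-node graph are hypothetical.

```python
from collections import deque


def bfs_order(graph, start):
    """Return nodes in the order breadth-first search first reaches them."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()  # oldest discovered node is expanded first
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical graph as an adjacency list: A links to B and C, both link to D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
order = bfs_order(graph, "A")
```

The first-in, first-out queue is what makes the traversal breadth-first; swapping it for a stack would give a depth-first order instead.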

Taming False Discoveries with Empirical Bayes

A Matter of Belief: The concepts introduced in this text broadly belong in the area of Bayesian Statistics. The key concept is that we are not only interested in the distribution of data, but also in the distribution of beliefs about true, unobserved values. The true win-ratio of each strategy is unobserved. But we can …

10 years of playback history on Last.FM: “Just sit back and listen”

Alright, seems like this is developing into a blog where I am increasingly investigating my own music listening habits. Recently, I’ve come across the analyzelastfm package by Sebastian Wolf. I used it to download my complete listening history from Last.FM for the last ten years. That’s a complete dataset from 2009 to 2018 with exactly 65,356 …

Soft Skills Will Make or Break You as a Data Scientist

As businesses gather an increasing amount of data related to various aspects of their organisation (e.g. internal business operations, customer purchases and behaviour), the demand for data-savvy employees has exploded over the last 5 years. Business leaders have woken up to the fact that data-driven decision-making can lead to making better decisions (it is not …

How to use news articles to predict BTC price changes

Bitcoin (BTC) prices are volatile for many reasons, such as widely differing public perceptions of its value and high-profile losses. In this article, we focus on one of the major factors: the influence of BTC news articles. Due to the past momentum of BTC and a huge portion of BTC market among the cryptocurrency …

How to combine Multiple ggplot Plots to make Publication-ready Plots

The life cycle of Data science can never be completed without communicating the results of the analysis/research. In fact, Data Visualization is one of the areas where R as a language for Data science has got an edge over the most-celebrated Python. With ggplot2 …

GetDFPData Ver 1.4

I just released a major update to package GetDFPData. Here are the main changes: Naming conventions for the caching system are improved so that they reflect different versions of FRE and DFP files. This means the old caching system no longer works. If you have built your own cache folder with many companies, do clean …

The most important idea in statistics

What comes to mind when you think of the discipline of statistics? Populations, samples, and hypotheses? Or perhaps you took a course that emphasized probabilities, distributions, p-values, and confidence intervals? All of these are pieces of the puzzle, but they’re downstream from the core. The real start of everything — the springboard that launches the whole tangle — is …