AzureVM update: flexible and powerful deployment and management of VMs in Azure

by Hong Ooi, senior data scientist, Microsoft Azure. I’m happy to announce version 2.0 of AzureVM, a package for deploying and managing virtual machines in Azure. This is a complete rewrite of the package, with the objective of making it a truly generic and flexible tool for working with VMs and VM scale sets (clusters). … Read more AzureVM update: flexible and powerful deployment and management of VMs in Azure

A Gentle Introduction to tidymodels

Use the metrics() function to measure the performance of the model. It will automatically choose metrics appropriate for a given type of model. The function expects a tibble that contains the actual results (truth) and what the model predicted (estimate). Because of the consistency of the new interface, measuring the same metrics against the randomForest … Read more A Gentle Introduction to tidymodels

Using Gradient Boosted Machine to Predict MPG for 2019 Vehicles

Continuing from the post below, I am going to use a gradient boosted machine model to predict combined miles per gallon for all 2019 motor vehicles. Part 1: Using Decision Trees and Random Forest to Predict MPG for 2019 Vehicles. The raw data is located on the EPA government site. The variables/features I am using … Read more Using Gradient Boosted Machine to Predict MPG for 2019 Vehicles

Cohen’s D for Experimental Planning

In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments. Let’s imagine you are testing a new weight loss program and comparing it to some existing weight loss regimen. You want to run an experiment to determine if the new program is more effective than the old one. You’ll put a … Read more Cohen’s D for Experimental Planning

Quick hit: Some ggplot2 Stat 💙 for {logspline}

I’ve become a big fan of the {logspline} package over the past ~6 months and decided to wrap up a manual ggplot2 plotting process (well, it was at least in an RStudio snippet) into a small {ggplot2} Stat to make it easier to visualize various components of the fitted model. If you’re new to logspline … Read more Quick hit: Some ggplot2 Stat 💙 for {logspline}

New Versions of R GUIs: BlueSky, JASP, jamovi

It has been only two months since I summarized my reviews of point-and-click front ends for R, and it’s already out of date! I have converted that post into a regularly-updated article and added a plot of total features, which I repeat below. It shows the total number of features in each package, including the … Read more New Versions of R GUIs: BlueSky, JASP, jamovi

How to Perform Ordinal Logistic Regression in R

In this article, we discuss the basics of ordinal logistic regression and its implementation in R. Ordinal logistic regression is a widely used classification method, with applications in a variety of domains. This method is the go-to tool when there is a natural ordering in the dependent variable. For example, a dependent variable with levels low, medium, … Read more How to Perform Ordinal Logistic Regression in R

anytime 0.3.4

A new minor release of the anytime package is arriving on CRAN. This is the fifteenth release, and first since the 0.3.3 release in November. anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date … Read more anytime 0.3.4

Understanding AdaBoost – or how to turn Weakness into Strength

Many of you might have heard of the concept “Wisdom of the Crowd”: when many people independently guess some quantity, e.g. the number of marbles in a glass jar, the average of their guesses is often pretty accurate – even though many of the guesses are totally off. The same principle is at work in … Read more Understanding AdaBoost – or how to turn Weakness into Strength

Getting from flat data a world of relationships to visualise with Gephi

by Mariluz Congosto. Network analysis offers a perspective on the data that broadens and enriches any investigation. Many times we deal with data in which the elements are related, but we have them in a tabulated format that is difficult to import into network analysis tools. Relationship data require a definition of nodes and connections. … Read more Getting from flat data a world of relationships to visualise with Gephi

Parametric survival modeling

Survival analysis is used to analyze the time until the occurrence of an event (or multiple events). Cox models—which are often referred to as semiparametric because they do not assume any particular baseline survival distribution—are perhaps the most widely used technique; however, Cox models are not without limitations and parametric approaches can be advantageous in … Read more Parametric survival modeling

Visualizing the Copa América: Historical Records, Squad Profiles, and Player Profiles with xG statistics!

Another summer and another edition of the Copa América! Along with the Africa Cup of Nations, Nations League finals, the Women’s World Cup, Under-21 European Championship AND the Gold Cup this is yet another soccer-filled season after last year’s World Cup and the Asian Cup earlier this year (I also did a blog post on these last two tournaments which … Read more Visualizing the Copa América: Historical Records, Squad Profiles, and Player Profiles with xG statistics!

On my way to Manizales (Colombia)

Next week, I will be in Manizales, Colombia, for the Third International Congress on Actuarial Science and Quantitative Finance. I will be giving a lecture on Wednesday with Jed Fress and Emilianos Valdez. I will give my course on Algorithms for Predictive Modeling on Thursday morning (after Jed and Emil’s lectures). Unfortunately, my computer locked … Read more On my way to Manizales (Colombia)

modelDown is now on CRAN!

The modelDown package turns classification or regression models into static HTML websites. With one command you can convert one or more models into a website with visual and tabular model summaries, such as model performance, feature importance, single feature response profiles and basic model audits. modelDown uses DALEX explainers, so it’s model agnostic (feel free … Read more modelDown is now on CRAN!

‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

A few weeks ago, I gave a talk at the Edinburgh R users group EdinbR on the RAGE paper. Since this is an R meetup, the talk concentrated on the mechanics of genetic data simulation, with the paper as a case study. I showed off some of what Chris Gaynor’s AlphaSimR can do, and … Read more ‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

Introducing the {ethercalc} package

I mentioned EtherCalc in a previous post and managed to scrounge some time to put together a fledgling {ethercalc} package (it’s also on GitLab, SourceHut, Bitbucket and GitUgh, just sub out the appropriate URL prefix). I’m creating a package-specific Docker image (there are a couple out there but I’m not supporting their use with the … Read more Introducing the {ethercalc} package

Stabilising transformations: how do I present my results?

ANOVA is routinely used in applied biology for data analyses, although, in some instances, the basic assumptions of normality and homoscedasticity of residuals do not hold. In those instances, most biologists would be inclined to adopt some sort of stabilising transformations (logarithm, square root, arcsin square root…), prior to ANOVA. Yes, there might be more … Read more Stabilising transformations: how do I present my results?

Exploring Categorical Data With Inspectdf

What’s inspectdf and what’s it for? I often find myself viewing and reviewing dataframes throughout the course of an analysis, and a substantial amount of time can be spent rewriting the same code to do this. inspectdf is an R package designed to make common exploratory tools a bit more useful and easy … Read more Exploring Categorical Data With Inspectdf

EARL London keynote announcement: Helen Hunter, Sainsbury’s

We are delighted to announce that Helen Hunter, Chief Data Officer at Sainsbury’s, will deliver the opening keynote address at this year’s London EARL conference. As Chief Data Officer at Sainsbury’s plc, Helen’s remit is to maximise the value of the Group’s data asset: democratising access and finding creative ways to unlock its insight potential … Read more EARL London keynote announcement: Helen Hunter, Sainsbury’s

Fixing your mistakes: sentiment analysis edition

Today tidytext 0.2.1 is available on CRAN! This new release of tidytext has a collection of nice new features: bug squashing! 🐛; improvements to error messages and documentation 📃; switching from broom to generics for lighter dependencies; and the addition of some helper plotting functions I look forward to blogging about soon. An additional change is significant … Read more Fixing your mistakes: sentiment analysis edition

#rstats adventures in the land of @rstudio shiny (apps)

Preamble: Colleagues and I had some sweet telemetry data, we did some simple models (& some relatively more complex ones too), we drew maps, and we wrote a paper. However, I thought it would be great to also provide stakeholders with the capacity to engage with the models, data, and maps. I published the data with … Read more #rstats adventures in the land of @rstudio shiny (apps)

Polygon plotting in R

# --------------------------------------------------------------- #
# What   : Plot neighborhood data on geo-map (city of Utrecht)
# Author : Wouter van Gils
# Date   : February 2019
# --------------------------------------------------------------- #
# packages required
library(geojsonio)
library(leaflet)
library(magrittr)
library(htmlwidgets)
#######################
## READ POLYGON DATA ##
#######################
# Read shapefile: Spatial Polygon DB
neighborhoods_utrecht <- … Read more Polygon plotting in R

R vs. Python

For some time, I’ve planned to write up a point-by-point comparison of R and Python. I’ve done so now! Comments welcome. To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist. … Read more R vs. Python

Working with SPSS labels in R

TL;DR 📖 This post provides an overview of R functions for dealing with survey data labels, particularly ones that I wish I’d known when I first started out analysing survey data in R (primarily stored in SPSS data files). Some of these functions come from surveytoolbox, a package I’m developing (GitHub only) which contains a … Read more Working with SPSS labels in R

Community Call – Involving Multilingual Communities

rOpenSci’s community is increasingly international and multilingual. While we have operated primarily in English, we now receive submissions of packages from authors whose primary language is not. As we expand our community in this way, we want to learn from the experience of other organizations. How can we manage our peer-review process and open-source projects … Read more Community Call – Involving Multilingual Communities

RStudio Connect 1.7.4.2 – Important Security Patch

This RStudio Connect patch release addresses an urgent security update and an important bug fix. Security Update: Password Authentication. A vulnerability has been identified for customers using RStudio Connect’s built-in password authentication. Due to the risks involved, password authentication will now require the configuration setting Server.Address for operations that will send emails. If this setting is … Read more RStudio Connect 1.7.4.2 – Important Security Patch

Equal Size kmeans

We were recently presented with a problem where the decision maker wanted to understand how their data would naturally group together. The classic technique of k-means clustering was a natural choice; it’s well known, computationally efficient, and implemented in base R via the kmeans() function. Our problem has a slight wrinkle: the decision maker wished … Read more Equal Size kmeans

Predicting MPG for 2019 Vehicles using R

I am going to use regression, decision trees, and the random forest algorithm to predict combined miles per gallon for all 2019 motor vehicles. The raw data is located on the EPA government site. After preliminary diagnostics, exploration and cleaning, I am going to start with a multiple linear regression model. The variables/features I am … Read more Predicting MPG for 2019 Vehicles using R

Beyond Bar Graphs and Pie Charts

A Beginner’s Guide: Using Python, R, Tableau, and RawGraphs to effectively and beautifully communicate your data. I understand. Maybe you forgot about your presentation this afternoon. Maybe you have 5 minutes to throw together the 3 visuals your boss wants on his desk by the end of the day. Maybe you’re just tired of dealing … Read more Beyond Bar Graphs and Pie Charts

The European election vote in Milan (Il voto per le europee a Milano)

The socio-political geography of the large cities of central and northern Italy has changed radically over the last 25 years. If we look at how votes in Milan were distributed among the parties at the 1994 European elections, held a few months after Silvio Berlusconi’s extraordinary electoral victory in March of the same year, in which Forza Italia obtained 21% … Read more The European election vote in Milan (Il voto per le europee a Milano)

RcppArmadillo 0.9.500.2.0

A new RcppArmadillo release based on a new Armadillo upstream release arrived on CRAN, and will get to Debian shortly. It brings a few upstream changes, including extended interfaces to LAPACK following the recent gcc/gfortran issue. See below for more details. Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards … Read more RcppArmadillo 0.9.500.2.0

Interactive Network Visualization with R

Networks are everywhere. We have social networks like Facebook, competitive product networks or various networks in an organisation. Also, for STATWORX it is a common task to unveil hidden structures and clusters in a network and visualize it for our customers. In the past, we used the tool Gephi to visualize our results in network analysis. … Read more Interactive Network Visualization with R

Why R? 2019 Conference – Keynotes and Workshops announced

Why R? is one of the best opportunities to learn the secrets of data science from true experts. This year we are happy to invite you to six keynote presentations, one invited workshop and eight regular workshops. We are more than happy to announce that the Why R? 2019 Conference will be organized by the Why R? Foundation. … Read more Why R? 2019 Conference – Keynotes and Workshops announced

Taking over maintenance of a software package

Software is maintained by people. While software can in theory live on indefinitely, to do so requires people. People change jobs, move locations, retire, and unfortunately die sometimes. When a software maintainer can no longer maintain a package, what happens to the software? Because of the fragility of people in software, in an ideal world … Read more Taking over maintenance of a software package

Intermittent demand, Croston and Die Hard

I have recently been confronted with a kind of data set and problem that I was not even aware existed: intermittent demand data. Intermittent demand arises when the demand for a certain good arrives sporadically. Let’s take a look at an example, by analyzing the number of downloads for the {RDieHarder} package: library(tidyverse) library(tsintermittent) library(nnfor) library(cranlogs) library(brotools) rdieharder … Read more Intermittent demand, Croston and Die Hard

Makeover Jumbalaya: Beating Dumbbells into Slopegraphs Whilst Orchestrating EtherCalc

This morning, @kairyssdal tweeted out the following graphic from @axios: Confusing, but interesting. Data shows we’re a nation of news consumption hypocrites – Axios https://t.co/O0lPSc4OV3 — Kai Ryssdal (@kairyssdal) June 11, 2019 If you’re doing the right thing and blocking evil social media javascript you can find the Axios story here and the graphic below: … Read more Makeover Jumbalaya: Beating Dumbbells into Slopegraphs Whilst Orchestrating EtherCalc

Shiny apps need more info! – our new shiny.info package

At Appsilon we thrive on supporting the open source community with open source packages. As we have already mentioned at various R conferences, the typical cycle of our work is: identification of a repeating programming problem, solving it, wrapping it into a package, testing internally and, once we decide it’s useful, happily sharing it with … Read more Shiny apps need more info! – our new shiny.info package

How to deal with outliers in a noisy population?

Defining outliers can be a straightforward task. On the other hand, deciding what to do with them always requires some deeper study. Motivation: Data can be noisy. When you have a small (relative to population size), random sample of the population, especially of a noisy population, it can be quite a challenge, if not impossible, to build … Read more How to deal with outliers in a noisy population?

Does “Sell in May, Go Away” really work?

If you follow the stock market, you’ve probably heard the expression “Sell in May, Go Away.” This expression generally refers to the perceived idea that the stock market goes up between the end of October and end of April, but one should sell at the beginning of May to avoid losses. The general recommendation according … Read more Does “Sell in May, Go Away” really work?

From Coin Tosses to p-Hacking: Make Statistics Significant Again!

One of the most notoriously difficult subjects in statistics is the concept of statistical tests. We will explain the ideas behind it step by step to give you some intuition on how to use (and misuse) it, so read on… Let us begin with some coin tosses and the question of how to find out whether … Read more From Coin Tosses to p-Hacking: Make Statistics Significant Again!

Don’t get too excited – it might just be regression to the mean

It is always exciting to find an interesting pattern in the data that seems to point to some important difference or relationship. A while ago, one of my colleagues shared a figure with me that looked something like this: It looks like something is going on. On average, low scorers in the first period increased … Read more Don’t get too excited – it might just be regression to the mean

riddles on Egyptian fractions and Bernoulli factories

Two fairly different riddles on the weekend Riddler. The first one is (in fine) about Egyptian fractions: I understand the first one as Find the Egyptian fraction decomposition of 2 into 11 distinct unit fractions that maximises the smallest fraction. And which I cannot solve despite perusing this amazing webpage on Egyptian fractions and making … Read more riddles on Egyptian fractions and Bernoulli factories