Improve GRNN by Weighting

In the post (https://statcompute.wordpress.com/2019/07/14/yet-another-r-package-for-general-regression-neural-network), several advantages of General Regression Neural Network (GRNN) have been discussed. However, as pointed out by Specht, a major weakness of GRNN is the high computational cost required for a GRNN to generate predicted values based on a new input matrix due to its unique network structure, e.g. the number of … Read more Improve GRNN by Weighting

Reversi in R – Part 1: Bare Bones

In this post, I showcase a bare-bones point-and-click implementation of the classic board game Reversi (also called Othello*) in the R programming language. R is typically used for more serious, statistical endeavors, but it works reasonably well for more playful projects. Building a classic game like this is an excellent high-school level introduction to programming, as … Read more Reversi in R – Part 1: Bare Bones

What Data Tells Us About the World’s Wealthiest

In the spring semester of my freshman year at Harvard, I took a class called “Using Big Data to Solve Economic and Social Problems.” One of the most interesting subjects we explored was equality of opportunity in the United States. We learned that children’s chances of earning more than their parents are not uniform: it … Read more What Data Tells Us About the World’s Wealthiest

Big News: Porting vtreat to Python

We at Win-Vector LLC have some big news. We are finally porting a streamlined version of our R vtreat variable preparation package to Python. vtreat is a great system for preparing messy data for supervised machine learning. The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their … Read more Big News: Porting vtreat to Python

Program Evaluation: Difference-in-differences in R

Regression analysis is one of the most in-demand machine learning methods in 2019. One family of regression analysis for measuring effects and evaluating a policy program is difference-in-differences. This method is well suited for benchmarking and finding improvements for optimization in organizations. It can, therefore, be used to design organizations so they … Read more Program Evaluation: Difference-in-differences in R

My useR! 2019 Highlights & Experience: Shiny, R Community, {packages}, and more!

The useR! Conference was held in Toulouse, France, and for me this was my second useR! after my first in Brisbane last year. This time around I wanted to write about my experiences and some highlights, similar to my post on the RStudio::Conference 2019 & Tidyverse DevDay earlier this year. This blog post will be divided into 4 sections: … Read more My useR! 2019 Highlights & Experience: Shiny, R Community, {packages}, and more!

GEDCOM Reader for the R Language: Analysing Family History

Understanding who you are is strongly related to understanding your family history. Discovering ancestors is now a popular hobby, as many archives are available on the internet. The GEDCOM format provides a standardised way to store information about ancestors. This article shows how to develop a GEDCOM reader using the R language. Download the Code … Read more GEDCOM Reader for the R Language: Analysing Family History

Impressions from useR! 2019

This year, the greater R community gathering useR! took place in sunny Toulouse in July, bringing together over 1000 practitioners from both academia and industry. The event spanned five days, including: a tidyverse day; one full day of workshops; 6 keynotes and a few sponsor talks; contributed talks and lightning talks over 6 parallel … Read more Impressions from useR! 2019

Learn Principle Component Analysis in R

Hi there! Welcome to my blog on principal component analysis in R. Purpose: PCA is a dimensionality reduction technique, where each additional variable you’re including in your modeling process represents a dimension. What does it do?: In terms of what PCA actually does, it takes a dataset with high dimensionality, and reduces it down … Read more Learn Principle Component Analysis in R

Germination data and time-to-event methods: comparing germination curves

Very often, seed scientists need to compare the germination behaviour of different seed populations, e.g., different plant species, or one single plant species submitted to different temperatures, light conditions, priming treatments and so on. How should such a comparison be performed? Let’s take a practical approach and start from an appropriate example: a few years … Read more Germination data and time-to-event methods: comparing germination curves

Generating a Gallery of Visualizations for a Static Website (using R)

While I was browsing the website of fellow R blogger Ryo Nakagawara, I was intrigued by his “Visualizations” page. The concept of creating an online “portfolio” is not novel, but I hadn’t thought to make one as a compilation of my own work (from blog posts)…until now 😄. I should state a couple of caveats/notes for anyone … Read more Generating a Gallery of Visualizations for a Static Website (using R)

Watch keynote presentations from the useR!2019 conference

The keynote presentations from last week’s useR!2019 conference in Toulouse are now available for everyone to view on YouTube. (The regular talks were also recorded and video should follow soon, and slides for most talks are available for download now at the conference website.) Here are links to the videos, indexed to the start of … Read more Watch keynote presentations from the useR!2019 conference

Time series forecast cross-validation by @ellis2013nz

Time series cross-validation is an important part of the toolkit for good evaluation of forecasting models. forecast::tsCV makes it straightforward to implement, even with different combinations of explanatory regressors in the different candidate models for evaluation. Spurious correlation between time series is a well-documented and mocked problem, with Tyler Vigen’s educational website on the topic … Read more Time series forecast cross-validation by @ellis2013nz

How to make 3D Plots in R (from 2D Plots of ggplot2)

3D plots built in the right way for the right purpose are always stunning. In this article, we’ll see how to make stunning 3D plots with R using ggplot2 and rayshader. While ggplot2 might be familiar to anyone in data science, rayshader may not. So, let’s start with a small introduction to … Read more How to make 3D Plots in R (from 2D Plots of ggplot2)

What NOT to do when building a shiny app (lessons learned the hard way)

I’ve been building R shiny apps for a while now, and ever since I started working with shiny, it has significantly increased the set of services I offer my clients. Here is a record of some of the many lessons I learned in previous projects. Hopefully, others can avoid the same mistakes in the future. Background … Read more What NOT to do when building a shiny app (lessons learned the hard way)

Dotplot – the single most useful yet largely neglected dataviz type

I have to confess that the core message of this post is not really a fresh saying. But if I were given a chance to deliver one piece of dataviz advice to every (ha-ha-ha) listening mind, I’d choose this: forget multi-category bar plots and use dotplots instead. I was converted several years ago after reading this brilliant … Read more Dotplot – the single most useful yet largely neglected dataviz type

Statistical matching, or when one single data source is not enough

I was recently asked how to go about matching several datasets where different samples of individuals were interviewed. This sounds like a big problem; say that you have datasets A and B, and that A contains one sample of individuals, and B another sample of individuals, then how could you possibly match the datasets? Matching datasets requires a … Read more Statistical matching, or when one single data source is not enough

Wordcloud of conference abstracts – FOSS4G Edinburgh

I’m helping run a conference this September – FOSS4GUK. To help promote the event I’ve created a wordcloud of conference abstracts, in R! The conference is taking place in Edinburgh, Scotland at Dynamic Earth. It’s focused on free and open source software for geospatial (FOSS4G), and as such is full stack. Everything from backend databases, ETL, … Read more Wordcloud of conference abstracts – FOSS4G Edinburgh

rOpenSci Hiring for New Position in Statistical Software Testing and Peer Review

Are you passionate about statistical methods and software? If so we would love for you to join our team to dig deep into the world of statistical software packages. You’ll develop standards for evaluating and reviewing statistical tools, publish, and work closely with an international team of experts to set up a new software review … Read more rOpenSci Hiring for New Position in Statistical Software Testing and Peer Review

Processing satellite image collections in R with the gdalcubes package

The problem: Scientists working with collections and time series of satellite imagery quickly run into some of the following problems: Images from different areas of the world have different spatial reference systems (e.g., UTM zones). The pixel size of a single image sometimes differs among its spectral bands / variables. Spatially adjacent image tiles often overlap. … Read more Processing satellite image collections in R with the gdalcubes package

Plotting Bayes Factors for multiple comparisons using ggsignif

This week my post is relatively short and very focused. What makes it interesting (at least to me) is whether it will be seen as a useful “bridge” between frequentist methods and Bayesian methods or as an abomination to both! There’s some reasonably decent code and explanation in this post but before I spend much more time on the … Read more Plotting Bayes Factors for multiple comparisons using ggsignif

RStudio Trainer Directory Launches

Several dozen people have taken part in RStudio’s instructor training and certification program since it was announced earlier this year. Since our last update, many of them have completed certification, so we are pleased to announce a preview of our trainers’ directory. Each of the people listed there has completed an exam on modern evidence-based … Read more RStudio Trainer Directory Launches

Combining momentum and value into a simple strategy to achieve higher returns

In this post I’ll introduce a simple investing strategy that is well diversified and has been shown to work across different markets. In short, buying cheap and uptrending stocks has historically led to notably higher returns. The strategy is a combination of these two different investment styles, value and momentum. In a previous post I explained … Read more Combining momentum and value into a simple strategy to achieve higher returns

An Ad-hoc Method for Calibrating Uncalibrated Models

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make … Read more An Ad-hoc Method for Calibrating Uncalibrated Models

101 Machine Learning Algorithms for Data Science with Cheat Sheets

Think of this as the one-stop-shop/dictionary/directory for your machine learning algorithms. The algorithms have been sorted into 9 groups: Anomaly Detection, Association Rule Learning, Classification, Clustering, Dimensional Reduction, Ensemble, Neural Networks, Regression, Regularization. In this post, you’ll find 101 machine learning algorithms, including useful infographics to help you know when to use each one (if … Read more 101 Machine Learning Algorithms for Data Science with Cheat Sheets

shinymeta — a revolution for reproducibility

Read more shinymeta — a revolution for reproducibility

Estimating treatment effects (and ICCs) for stepped-wedge designs

In the last two posts, I introduced the notion of time-varying intra-cluster correlations in the context of stepped-wedge study designs. (See here and here). Though I generated lots of data for those posts, I didn’t fit any models to see if I could recover the estimates and any underlying assumptions. That’s what I am doing … Read more Estimating treatment effects (and ICCs) for stepped-wedge designs

Pricing floating legs of interest rate swaps

In this post we will close the trilogy on (old style) swap pricing. In particular, we will look at how to download the data for the variable rate needed to calculate the variable leg accrual. Part 1 gave the general idea behind tidy pricing interest rate swaps using a 7-line pipe. Part 2 went much more into detail … Read more Pricing floating legs of interest rate swaps

Bojack Horseman and Tidy Data Principles (Part 1)

After reading The Life Changing Magic of Tidying Text and A tidy text analysis of Rick and Morty I wanted to do something similar for Rick and Morty and I did. Now I’m doing something similar for Bojack Horseman. In this post I’ll focus on the Tidy Data principles. However, here is the Github repo … Read more Bojack Horseman and Tidy Data Principles (Part 1)

Aggregating spatial data with the grainchanger package

The grainchanger package provides functionality for data aggregation to a coarser resolution via moving-window or direct methods. Why do we need new methods for data aggregation? As landscape ecologists and macroecologists, we often need to aggregate data in order to harmonise datasets. In doing so, we often lose a lot of information about the spatial … Read more Aggregating spatial data with the grainchanger package

Quick Hit: A Different (Diminutive) Look At Distributions With {ggeconodist}

Despite being a full-on denizen of all things digital I receive a fair number of dead-tree print magazines as there’s nothing quite like seeing an amazing, large, full-color print data-driven visualization up close and personal. I also like supporting data journalism through the subscriptions since without cash we will only have insane, extreme left/right-wing perspectives … Read more Quick Hit: A Different (Diminutive) Look At Distributions With {ggeconodist}

Is Scholarly Use of R Use Beating SPSS Already?

by Bob Muenchen & Sean Mackinnon One of us (Muenchen) has been tracking The Popularity of Data Science Software using a variety of different approaches. One approach is to use Google Scholar to count the number of scholarly articles found each year for each software. He chose Google Scholar since it searches “across many disciplines … Read more Is Scholarly Use of R Use Beating SPSS Already?

Twitter coverage of the useR! 2019 conference

Very briefly: Last week was useR! conference time again, coming to you this time from Toulouse, France. I’ve retrieved 8 318 tweets that mention #user2019 and run them through my report generator. And here are the results. Take-home message this year: the R Ladies rock! … Read more Twitter coverage of the useR! 2019 conference

Looking at flood insurance claims with choroplethr

I recently learned how to use the choroplethr package through a short tutorial by the package author Ari Lamstein (youtube link here). To cement what I learned, I thought I would use this package to visualize flood insurance claims. I am using the FIMA NFIP redacted claims dataset from FEMA, and it contains more than … Read more Looking at flood insurance claims with choroplethr

Recreating ‘Unknown Pleasures’ graphic

For some time I’ve wanted to recreate the cover art from Joy Division’s Unknown Pleasures album. The visualisation depicts successive pulses from the pulsar PSR B1919+21, discovered by Jocelyn Bell in 1967. Album art. Data The first obstacle was acquiring the data. I found a D3 visualisation by Mike Bostock. This in turn pointed me … Read more Recreating ‘Unknown Pleasures’ graphic

Distribution of Headline Sentiment

My web scraping project explored the distribution of headline sentiment by news source. To do this, I scraped the Nasdaq latest market headlines page and applied sentiment analysis to the retrieved text. It should be noted that I only scraped one web page, but this page aggregates headlines from multiple sources. I wanted to see … Read more Distribution of Headline Sentiment

rstudio::conf(2020) is open for registration!

rstudio::conf, the conference for all things R and RStudio, will take place January 29 and 30, 2020 in San Francisco, California. It will be preceded by Training Days on January 27 and 28. Early Bird registration is now open! Conference: Wednesday-Thursday, Jan 29-30 Join me, your host and Chief Scientist of RStudio, for our keynote … Read more rstudio::conf(2020) is open for registration!

Experimenting with Hierarchical Clustering in a galaxy far far away…

Introduction: This post will be taking a bit of an unexpected diversion. As I was experimenting with hierarchical clustering I ran into the issue of how many clusters to assume. From that point I went deep into the rabbit hole and found out some really useful stuff that I wish I had known when I … Read more Experimenting with Hierarchical Clustering in a galaxy far far away…

rOpenSci Announces $678K Award from the Sloan Foundation to Expand Software Peer Review

We’re delighted to announce that we have received new funding from the Alfred P. Sloan Foundation. The $678K grant, awarded through the Foundation’s Data & Computational Research program, will be used to expand our efforts in software peer review. Software peer review has become a core part of rOpenSci, helping improve scientific software quality, drive … Read more rOpenSci Announces $678K Award from the Sloan Foundation to Expand Software Peer Review

Yet Another R Package for General Regression Neural Network

Compared with other types of neural networks, General Regression Neural Network (Specht, 1991) is advantageous in several aspects. Being a universal approximation function, GRNN has only one tuning parameter to control the overall generalization. The network structure of GRNN is surprisingly simple, with only one hidden layer and the number of neurons equal to the … Read more Yet Another R Package for General Regression Neural Network