Where the Beer is

Since Prohibition ended in 1933, the American brewing industry has undergone two massive transformations. The first saw hundreds of regional breweries from across the country, often brewing beer unique to their respective regions, become consolidated into a handful of behemoths. During the period of greatest consolidation in the early 1980s the ten largest breweries produced almost … Read more

Categories R Tags ExcerptFavorite

Generating Beatles’ Lyrics with Machine Learning

The Beatles were a huge cultural phenomenon. Their timeless music still resonates with people today, both young and old. Personally, I’m a big fan. In my humble opinion, they are the greatest band to have ever lived¹. Their songs are full of interesting lyrics and deep ideas. Take these bars for example: When you’ve seen … Read more

Practical Psychology for Data Scientists

You aren’t as logical as you think. None of us are. We are susceptible to cognitive biases each and every day. If you have had the pleasure of reading Thinking, Fast and Slow by Daniel Kahneman, then you are more than familiar with this reality. We are imperfect creatures and the world is an imperfect … Read more

7 Useful Pandas Tips for Data Management

A Premier League Financial Review Example Money Ball The Premier League is big business. In fact, Premier League clubs have paid out more than £260m to football agents during 2018–19 – an increase of £49m on the previous 12 months. This statistic alone piqued my interest and drove me to delve deeper into Premier League spending … Read more

Traffic tickets and where to find them.

Dashboard: Link Introduction & Data Anyone who has ever experienced driving in New York City can attest to the fact that it may at times be a less-than-satisfactory experience. With pedestrians who rightfully assume ownership of all parts of this city, cyclists who do the same, and other drivers who, let’s face it, are no … Read more

The power of unbiased recursive partitioning

The significance tests underlying the unbiased tree algorithms CTree, MOB, and GUIDE are embedded into a unifying framework. This makes it possible to assess relative strengths and weaknesses in a variety of setups, highlighting the advantages of score-based tests (as in CTree/MOB) vs. residual-based tests (as in GUIDE). Citation Schlosser L, Hothorn T, Zeileis A (2019). “The … Read more

Deep (learning) like Jacques Cousteau – Part 7 – Matrices

(TL;DR: matrices are rectangular arrays of numbers.) LaTeX and MathJax warning for those viewing my feed: please view directly on website! I wanted to avoid a picture related to the film franchise! Me Last time, we learnt about dot products. We will now finally start talking about matrices. We know what row vectors and column vectors are from our … Read more

An Introduction to Reproducible Analyses in R

Earlier this week I had a lot of fun running a one-day workshop for the Royal Society of Biology titled “An Introduction to Reproducible Analyses in R”. It was intended to introduce researchers at all stages of their careers to using R to make their analyses and figures more reproducible. We ran the course because … Read more

Update: Finding Economic Articles With Data

An earlier post from February describes a Shiny app that lets you search among currently more than 4000 economic articles that have an accessible data and code supplement. Finally, I managed to configure an nginx reverse proxy server and now you can also access the app under a proper https link here: https://ejd.econ.mathematik.uni-ulm.de (I was … Read more

RProtoBuf 0.4.14

A new release 0.4.14 of RProtoBuf is arriving at CRAN. RProtoBuf provides R with bindings for the Google Protocol Buffers (“ProtoBuf”) data encoding and serialization library used and released by Google, and deployed very widely in numerous projects as a language and operating-system agnostic protocol. This release contains two very helpful pull requests by Jarod … Read more

Make Refreshing Segmented Column Charts with {ggchicklet}

The first U.S. Democratic debates of the 2020 election season were held over two nights this past week due to the daft number of candidates running for POTUS. The spiffy @NYTgraphics folks took the tallies of time spent blathering by each speaker/topic and made rounded rectangle segmented bar charts ordered by the time the blathering … Read more

Feature Elimination Using SVM Weights

Specifically for SVMLight, but this feature elimination methodology can be used for any linear SVM. Figure 1: a random example of accuracy based on the number of SVM features used. While working on my M.Sc thesis, circa 2005–2007, I had to calculate feature weights based on an SVM model. This was before SKlearn, which started in 2007. … Read more

What Separates Good from Great Data Scientists?

The most valuable skills in an evolving field The data science job market is changing rapidly. Being able to build machine learning models used to be an elitist skill that only a few distinguished scientists possessed. But nowadays, anyone with basic coding experience can follow the steps to train a simple scikit-learn or keras model. Recruiters … Read more

Machine Learning Clustering: DBSCAN Determine The Optimal Value For Epsilon (EPS) Python Example

Density-Based Spatial Clustering of Applications with Noise, or DBSCAN for short, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to classify unlabeled data. In other words, the samples used to train our model do not come with predefined categories. In comparison to other clustering algorithms, DBSCAN is particularly well suited for … Read more

Machine Learning At Scale With Apache Spark MLlib Python Example

For most of their history, computer processors became faster every year. Unfortunately, this trend in hardware stopped around 2005. Due to limits in heat dissipation, hardware developers stopped increasing the clock frequency of individual processors and opted for parallel CPU cores. This is fine for playing video games on a desktop computer. However, when it … Read more

What is Reinforcement Learning?

“Reinforcement Learning is, like many topics with names ending in -ing (such as Machine Learning, Deep Learning, planning and mountaineering), simultaneously a problem, a class of solution methods that work well on the class of problems, and the field that studies these problems and … Read more

Semantic similarity classifier and clustering sentences based on semantic similarity.

Recently we have been doing some experiments to cluster semantically similar messages, leveraging pre-trained models so we can get something off the ground using no labelled data. The task here: given a list of sentences, cluster them so that semantically similar sentences land in the same cluster, with the number of clusters not predetermined. … Read more

Parallel R: Socket or Fork

In the R parallel package, there are two implementations of parallelism, namely fork and socket, each with pros and cons. For the fork, each parallel thread is a complete duplication of the master process with the shared environment, including objects or variables defined prior to the kickoff of parallel threads. Therefore, it runs fast. However, the … Read more

Medal Breakdown at Comrades Marathon (2019)

A quick breakdown of the medal distribution at the 2019 edition of the Comrades Marathon. This is what the medal categories correspond to: Gold: first 10 men and women; Wally Hayward (men): 11th position to sub-6:00; Isavel Roche-Kelly (women): 11th position to sub-7:30; Silver (men): 6:00 to sub-7:30; Bill Rowan: … Read more

The Incredible Shrinking Bernoulli

Simulating Hacker News inter-arrival times distribution with the flip of a coin Joey Kyber via Pexels Bernoulli counting process Bernoulli distributions sound like a complex statistical construct, but they represent flipping a (possibly biased) coin. What I find fascinating is how this simple idea can lead to modeling more complex processes such as the probability to get a … Read more

Coping with varying `gcc` versions and capabilities in R packages

I have a package called strex which is for string manipulation. In this package, I want to take advantage of the regex capabilities of C++11. The reason for this is that in strex, I find myself needing to do a calculation like `x <- list(c("1,000", "2,000,000"), c("1", "50", "3,455"))` `lapply(x, function(x) as.numeric(stringr::str_replace_all(x, ",", "")))` #> … Read more

Determining Presidential Approval Rating Using Reddit Sentiment Analysis

The Team As mentioned, this problem was tackled by 6 Duke undergraduate students — Milan Bhat, a sophomore studying Electrical and Computer Engineering, Andrew Cuffe, a senior studying Economics and Computer Science, Catherine Dana, a junior studying Computer Science, Melanie Farfel, a senior studying Economics and Computer Science, Adam Snowden, a junior studying Biology and Computer Science, … Read more

The Truth About Open Data

I’m currently volunteering at a data journalism startup in Cali, Colombia. In the past two weeks I’ve had meetings with business owners, students, mayoral candidates, and government officials to dive deep into data. I’ve learned some interesting things. The city of Cali is one of a few places in Latin America that has really begun … Read more

Trail Secrets: An Intelligent Recommendation Engine for Finding Better Hikes

I recently went on a weekend camping trip in The Enchantments, which is just over a two-hour drive from where I live in Seattle, WA. To plan for the trip, we relied on AllTrails, which is a fantastic application with over 75,000 hand-curated hiking trails along with photos, reviews, and in-depth trail information. AllTrails … Read more

Analyzing Online Activity and Sleep Patterns

Making Data Science Fun Analyze your Facebook friends’ online activity and sleep patterns There is a ton of information publicly available on social networks, some of which we sometimes forget even exists. Information as small as the online activity of our Facebook friends can enable us to deduce things like when they sleep or when they are most active … Read more

Collaborative filtering to “predict” the efficacy of a drug (2)

Another case study and some thoughts on domain knowledge Yu Liu, Jun 29 I showed the result of using collaborative filtering to predict the interaction strength between a drug and its target in the first blog post of this series. In this sequel, I will try to work on another dataset and discuss the significance … Read more

My Favorite data.table Feature

My favorite R data.table feature is the “by” grouping notation when combined with the := notation. Let’s take a look at this powerful notation. First, let’s build an example data.frame. `d <- wrapr::build_frame( "group" , "value" | "a" , 1L | "a" , 2L | "b" , 3L | "b" , 4L )` `knitr::kable(d)` a … Read more

Defining Quality: Towards a Better Understanding of “Statistical Quality Control”

The quality of products and services plays an important role in the decision-making processes of different customer segments. Maintaining quality at the desired level, though it may be challenging, is imperative for achieving a high level of customer satisfaction, as well as maximizing revenue and market share, eliminating waste product for companies and prolongation of product … Read more

Writing a simple Flask Web Application in 80 lines

Sample tutorial for getting started with Flask Flask is a microframework for Python based on Werkzeug, Jinja 2 and good intentions. Its features: easy to use; built-in development server and debugger; integrated unit testing support; RESTful request dispatching; uses Jinja2 templating; support for secure cookies (client-side sessions); 100% WSGI 1.0 compliant; Unicode based; extensively documented. The … Read more

Modelling with Tidymodels and Parsnip

Overview I recently completed the Business Analysis With R online course focused on applied data and business science with R, which introduced me to a couple of new modelling concepts and approaches. One that especially captured my attention is parsnip and its attempt to implement a unified modelling and analysis interface (similar to Python’s … Read more

rvw 0.6.0: First release

Today Dirk Eddelbuettel, James Balamuta and Ivan Pavlov are happy to announce the first release of a reworked R interface to the Vowpal Wabbit machine learning system. Started as a GSoC 2018 project, the new rvw package was built to give R users easier access to a variety of efficient machine learning algorithms. Key features … Read more

Automated Data Quality Testing at Scale using Apache Spark

I have been working as a Technology Architect, mainly responsible for the Data Lake/Hub/Platform kind of projects. Every day we ingest data from 100+ business systems so that the data can be made available to the analytics and BI teams for their projects. Problem Statement While ingesting data, we avoid any transformations. The data is … Read more

Basics of BASH for Beginners.

Terminal The terminal is a program that is used to interact with a shell. It is just an interface to the Shell and to the other command line programs that run inside it. This is akin to how a web browser is an interface to websites. Here is how a typical terminal on Mac looks … Read more

Analysts are from Venus, Managers are from Mars

Being an analyst I sometimes get frustrated working with upper management. It’s nothing personal, we just have two different mindsets. This post is all about resolving some common misconceptions managers have about data analysis. He uses statistics as a drunken man uses lamp posts — for support rather than for illumination. ~Andrew Lang, Scottish novelist … Read more

Custom NER Model (CRFSUITE) — Automate Confidence Score Boosting on Predicted Entities

An automated approach to boost the confidence score of the NER model without manual intervention. Introduction With the advent of technology and AI, the world is moving from ‘Automation’ to ‘Smart Automation’. Companies across various verticals, be it Banking, Healthcare, Logistics, etc., receive information from various channels in varied … Read more

Spurious correlations and random walks

The number of storks and the number of human babies delivered are positively correlated (Matthews, 2000). This is a classic example of a spurious correlation which has a causal explanation: a third variable, say economic development, is likely to cause both an increase in storks and an increase in the number of human babies, hence … Read more

Descriptive Statistics Fundamentals For Data Science Aspirants

A few lines I wrote, dedicated to data engineers: Data, data everywhere, consumers are now more aware. So mine the data with utmost care, and serve them everywhere. Yes, that is how valuable it is to treat and process data with the required precision, so that you can serve your customers/consumers effectively and responsibly. In Applied statistics we … Read more

Data Visualization

In this blog we’ll try to understand what data visualization is and how it can be used for making plots using matplotlib and seaborn in Python. We will also talk about the various types of analysis along with the most common types of plots used in data visualization. What is Data Visualization? Data visualization … Read more

Data Preprocessing

With its implementation in Python Processing data is at the heart of Machine Learning. Your machine learning tools are only as good as the quality of your data. This blog deals with the various steps of cleaning data. Your data needs to go through a few steps before it can be used for making predictions. … Read more

Implementation of Data Preprocessing on Titanic Dataset

With step-by-step implementation in Python, NumPy, Pandas Kaggle Titanic dataset: https://www.kaggle.com/c/titanic-gettingStarted/data The machine learning model is supposed to predict who survived the Titanic shipwreck. Here I will show you how to apply preprocessing techniques on the Titanic dataset. For machine learning algorithms to work, it is necessary to convert the raw data … Read more

An introduction to Attention, Transformers and BERT: Part 1

The why and the what These lines from Ted Chiang’s novella “Story of Your Life” perhaps give a good sense of what differentiates attention-based architectures from the sequential nature of vanilla RNNs. Let’s take a quick look at vanilla RNNs and the encoder-decoder variation used in sequence-to-sequence tasks, understand what drawbacks these designs … Read more

Mapping NBA Shot Locations

I recently came across the article “How Mapping Shots In The NBA Changed It Forever” and although I am not a big basketball fan, I was impressed by the visualizations. I actually bought the book “Sprawlball” by Kirk Goldsberry afterwards, where this was taken from. I can only recommend it, even if you are not … Read more

Curly-Curly, the successor of Bang-Bang

Writing functions that take data frame columns as arguments is a problem that most R users have been confronted with at some point. There are different ways to tackle this issue, and this blog post will focus on the solution provided by the latest release of the {rlang} package. You can read the announcement here, which explains really well … Read more

Text Parsing and Text Analysis of a Periodic Report (with R)

Some Context Those of you non-academia folk who work in industry (like me) are probably conscious of any/all periodic reports that an independent entity publishes for your company’s industry. For example, in the insurance industry in the United States, the Federal Insurance Office of the U.S. Department of the Treasury publishes several reports on an annual basis discussing the industry at large, … Read more
