Where the Beer is

Since Prohibition ended in 1933, the American brewing industry has undergone two massive transformations. The first saw hundreds of regional breweries from across the country, often brewing beer unique to their respective regions, become consolidated into a handful of behemoths. During the period of greatest consolidation in the early 1980s, the ten largest breweries produced almost … Read more Where the Beer is

Generating Beatles’ Lyrics with Machine Learning

The Beatles were a huge cultural phenomenon. Their timeless music still resonates with people today, both young and old. Personally, I’m a big fan. In my humble opinion, they are the greatest band to have ever lived¹. Their songs are full of interesting lyrics and deep ideas. Take these bars for example: When you’ve seen … Read more Generating Beatles’ Lyrics with Machine Learning

7 Useful Pandas Tips for Data Management

A Premier League Financial Review Example Money Ball The Premier League is big business. In fact, Premier League clubs have paid out more than £260m to football agents during 2018–19 – an increase of £49m on the previous 12 months. This statistic alone piqued my interest and drove me to delve deeper into Premier League spending … Read more 7 Useful Pandas Tips for Data Management

Illustrating Predictive Models with the ROC Curve

Anatomy of a Curve The nice thing about the ROC curve is that it is an easy-to-interpret graphical tool that can be applied to any predictive model you create. Here are the basics of the curve: The Axes: Sensitivity and False Positive Rate First, we need to create the space for the plot. The ROC curve is … Read more Illustrating Predictive Models with the ROC Curve

Traffic tickets and where to find them.

Dashboard: Link Introduction & Data Anyone who has ever experienced driving in New York City can attest to the fact that it may at times be a less-than-satisfactory experience. With pedestrians who rightfully assume ownership of all parts of this city, cyclists who do the same, and other drivers who, let’s face it, are no … Read more Traffic tickets and where to find them.

Linear Algebra. Points matching with SVD in 3D space

Problem We need to find the best rotation and translation parameters between two sets of points in 3D space. This type of transformation is called Euclidean, as it preserves sizes. Our task is to solve the equation R*A + t = B. Solution There are many ways of getting things done in almost any case, but currently we will … Read more Linear Algebra. Points matching with SVD in 3D space

The power of unbiased recursive partitioning

The significance tests underlying the unbiased tree algorithms CTree, MOB, and GUIDE are embedded into a unifying framework. This makes it possible to assess relative strengths and weaknesses in a variety of setups, highlighting the advantages of score-based tests (as in CTree/MOB) vs. residual-based tests (as in GUIDE). Citation Schlosser L, Hothorn T, Zeileis A (2019). “The … Read more The power of unbiased recursive partitioning

Deep (learning) like Jacques Cousteau – Part 7 – Matrices

(TL;DR: matrices are rectangular arrays of numbers.) LaTeX and MathJax warning for those viewing my feed: please view directly on website! I wanted to avoid a picture related to the film franchise! Me Last time, we learnt about dot products. We will now finally start talking about matrices. We know what row vectors and column vectors are from our … Read more Deep (learning) like Jacques Cousteau – Part 7 – Matrices

An Introduction to Reproducible Analyses in R

Earlier this week I had a lot of fun running a one-day workshop for the Royal Society of Biology titled “An Introduction to Reproducible Analyses in R”. It was intended to introduce researchers at all stages of their careers to using R to make their analyses and figures more reproducible. We ran the course because … Read more An Introduction to Reproducible Analyses in R

Update: Finding Economic Articles With Data

An earlier post from February describes a Shiny app that lets you search among currently more than 4000 economic articles that have an accessible data and code supplement. Finally, I managed to configure an nginx reverse proxy server and now you can also access the app under a proper https link here: https://ejd.econ.mathematik.uni-ulm.de (I was … Read more Update: Finding Economic Articles With Data

RProtoBuf 0.4.14

A new release 0.4.14 of RProtoBuf is arriving at CRAN. RProtoBuf provides R with bindings for the Google Protocol Buffers (“ProtoBuf”) data encoding and serialization library used and released by Google, and deployed very widely in numerous projects as a language and operating-system agnostic protocol. This release contains two very helpful pull requests by Jarod … Read more RProtoBuf 0.4.14

Make Refreshing Segmented Column Charts with {ggchicklet}

The first U.S. Democratic debates of the 2020 election season were held over two nights this past week due to the daft number of candidates running for POTUS. The spiffy @NYTgraphics folks took the tallies of time spent blathering by each speaker/topic and made rounded rectangle segmented bar charts ordered by the time the blathering … Read more Make Refreshing Segmented Column Charts with {ggchicklet}

Feature Elimination Using SVM Weights

Specifically for SVMLight, but this feature elimination methodology can be used for any linear SVM. Figure 1: a random example of accuracy based on the number of SVM features used. While working on my M.Sc thesis, circa 2005–2007, I had to calculate feature weights based on an SVM Model. This was before SKlearn, which started in 2007. … Read more Feature Elimination Using SVM Weights

What Separates Good from Great Data Scientists?

The most valuable skills in an evolving field The data science job market is changing rapidly. Being able to build machine learning models used to be an elitist skill that only a few distinguished scientists possessed. But nowadays, anyone with basic coding experience can follow the steps to train a simple scikit-learn or keras model. Recruiters … Read more What Separates Good from Great Data Scientists?

Log Book —Guide to Excel & Outlook email Delivery Automation via Python

This article is divided into 2 parts: the first part deals with generating an image/PDF from an Excel file, and the next part with attaching it to an Outlook email and sending it out. It also has an added bonus on macros and handling different mailboxes. In our day-to-day activities we often come across tasks … Read more Log Book —Guide to Excel & Outlook email Delivery Automation via Python

Machine Learning Clustering: DBSCAN Determine The Optimal Value For Epsilon (EPS) Python Example

Density-Based Spatial Clustering of Applications with Noise, or DBSCAN for short, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to group unlabeled data. In other words, the samples used to train our model do not come with predefined categories. In comparison to other clustering algorithms, DBSCAN is particularly well suited for … Read more Machine Learning Clustering: DBSCAN Determine The Optimal Value For Epsilon (EPS) Python Example

Automate your Python Scripts with Task Scheduler

Definitive Guide for Data Professionals Windows Task Scheduler to Scrape Alternative Data Running my Python Scripts every day is too troublesome. I need a way to run my Python Scripts periodically and automatically. Imagine your manager asks you to wake up in the middle of the night to run a script. This will be … Read more Automate your Python Scripts with Task Scheduler

Machine Learning At Scale With Apache Spark MLlib Python Example

For most of their history, computer processors became faster every year. Unfortunately, this trend in hardware stopped around 2005. Due to limits in heat dissipation, hardware developers stopped increasing the clock frequency of individual processors and opted for parallel CPU cores. This is fine for playing video games on a desktop computer. However, when it … Read more Machine Learning At Scale With Apache Spark MLlib Python Example

What is Reinforcement Learning?

“Reinforcement Learning is like many topics with names ending in -ing, such as Machine Learning, Deep Learning, planning, and mountaineering, in that it is simultaneously a problem, a class of solution methods that work well on the class of problems, and the field that studies these problems and … Read more What is Reinforcement Learning?

Semantic similarity classifier and clustering sentences based on semantic similarity.

Recently we have been doing some experiments to cluster semantically similar messages, by leveraging pre-trained models so we can get something off the ground using no labelled data. The task here: given a list of sentences, we cluster them such that semantically similar sentences are in the same cluster and the number of clusters is not predetermined. … Read more Semantic similarity classifier and clustering sentences based on semantic similarity.

Parallel R: Socket or Fork

In the R parallel package, there are two implementations of parallelism, i.e. fork and socket, with pros and cons. For the fork, each parallel thread is a complete duplication of the master process with the shared environment, including objects or variables defined prior to the kickoff of parallel threads. Therefore, it runs fast. However, the … Read more Parallel R: Socket or Fork

Medal Breakdown at Comrades Marathon (2019)

A quick breakdown of the medal distribution at the 2019 edition of the Comrades Marathon. This is what the medal categories correspond to: Gold — first 10 men and women Wally Hayward (men) — 11th position to sub-6:00 Isavel Roche-Kelly (women) — 11th position to sub-7:30 Silver (men) — 6:00 to sub-7:30 Bill Rowan — … Read more Medal Breakdown at Comrades Marathon (2019)

The Incredible Shrinking Bernoulli

Simulating Hacker News inter-arrival times distribution with the flip of a coin Bernoulli counting process Bernoulli distributions sound like a complex statistical construct, but they represent flipping a coin (possibly biased). What I find fascinating is how this simple idea can lead to modeling more complex processes such as the probability to get a … Read more The Incredible Shrinking Bernoulli

Coping with varying `gcc` versions and capabilities in R packages

I have a package called strex which is for string manipulation. In this package, I want to take advantage of the regex capabilities of C++11. The reason for this is that in strex, I find myself needing to do a calculation like x <- list(c("1,000", "2,000,000"), c("1", "50", "3,455")); lapply(x, function(x) as.numeric(stringr::str_replace_all(x, ",", ""))) #> … Read more Coping with varying `gcc` versions and capabilities in R packages

Determining Presidential Approval Rating Using Reddit Sentiment Analysis

The Team As mentioned, this problem was tackled by 6 Duke undergraduate students — Milan Bhat, a sophomore studying Electrical and Computer Engineering, Andrew Cuffe, a senior studying Economics and Computer Science, Catherine Dana, a junior studying Computer Science, Melanie Farfel, a senior studying Economics and Computer Science, Adam Snowden, a junior studying Biology and Computer Science, … Read more Determining Presidential Approval Rating Using Reddit Sentiment Analysis

Impress Onlookers with your newly acquired Shell Skills

Starting with the Basic Commands Think of every command as a color in your palette It always helps to start with the basics whenever trying to learn a new language. And the shell is a new language. We will go through some basic commands one by one. 1. cat: There are a lot of times when you … Read more Impress Onlookers with your newly acquired Shell Skills

Trail Secrets: An Intelligent Recommendation Engine for Finding Better Hikes

I recently went on a weekend camping trip in The Enchantments, which is just over a two-hour drive from where I live in Seattle, WA. To plan for the trip, we relied on AllTrails, which is a fantastic application with over 75,000 hand-curated hiking trails along with photos, reviews, and in-depth trail information. AllTrails … Read more Trail Secrets: An Intelligent Recommendation Engine for Finding Better Hikes

Analyzing Online Activity and Sleep Patterns

Making Data Science Fun Analyze your Facebook friends’ online activity and sleep patterns There is a ton of publicly available information on social networks that we sometimes forget even exists. Information as little as just the online activity of our Facebook friends can enable us to deduce information like when they sleep or when they are most active … Read more Analyzing Online Activity and Sleep Patterns

Collaborative filtering to “predict” the efficacy of a drug (2)

Another case study and some thoughts on domain knowledge Yu Liu, Jun 29 I showed the result of using collaborative filtering to predict the interaction strength between a drug and its target in the first blog post of this series. In this sequel, I will try to work on another dataset and discuss the significance … Read more Collaborative filtering to “predict” the efficacy of a drug (2)

Defining Quality: Towards a Better Understanding of “Statistical Quality Control”

The quality of products and services plays an important role in the decision-making processes of different customer segments. Maintaining quality at the desired level, though it may be challenging, is imperative for achieving a high level of customer satisfaction, as well as maximizing revenue and market share, eliminating waste product for companies, and prolonging product … Read more Defining Quality: Towards a Better Understanding of “Statistical Quality Control”

Writing a simple Flask Web Application in 80 lines

Sample tutorial for getting started with Flask Flask is a microframework for Python based on Werkzeug, Jinja 2 and good intentions. Features: easy to use; built-in development server and debugger; integrated unit testing support; RESTful request dispatching; uses Jinja2 templating; support for secure cookies (client-side sessions); 100% WSGI 1.0 compliant; Unicode based; extensively documented. The … Read more Writing a simple Flask Web Application in 80 lines

Modelling with Tidymodels and Parsnip

Overview I recently completed the Business Analysis With R online course focused on applied data and business science with R, which introduced me to a couple of new modelling concepts and approaches. One that especially captured my attention is parsnip and its attempt to implement a unified modelling and analysis interface (similar to Python’s … Read more Modelling with Tidymodels and Parsnip

Automated Data Quality Testing at Scale using Apache Spark

I have been working as a Technology Architect, mainly responsible for the Data Lake/Hub/Platform kind of projects. Every day we ingest data from 100+ business systems so that the data can be made available to the analytics and BI teams for their projects. Problem Statement While ingesting data, we avoid any transformations. The data is … Read more Automated Data Quality Testing at Scale using Apache Spark

Analysts are from Venus, Managers are from Mars

Being an analyst I sometimes get frustrated working with upper management. It’s nothing personal, we just have two different mindsets. This post is all about resolving some common misconceptions managers have about data analysis. He uses statistics as a drunken man uses lamp posts — for support rather than for illumination. ~Andrew Lang, Scottish novelist … Read more Analysts are from Venus, Managers are from Mars

Custom NER Model (CRFSUITE) — Automate Confidence Score Boosting on Predicted Entities

An automated approach to boosting the confidence score of the NER model without manual intervention. Introduction With the advent of technology and AI, the world is moving from ‘Automation’ to ‘Smart Automation’. Companies across various verticals, be it Banking, Healthcare, Logistics, etc., receive information from various channels in varied … Read more Custom NER Model (CRFSUITE) — Automate Confidence Score Boosting on Predicted Entities

Spurious correlations and random walks

The number of storks and the number of human babies delivered are positively correlated (Matthews, 2000). This is a classic example of a spurious correlation which has a causal explanation: a third variable, say economic development, is likely to cause both an increase in storks and an increase in the number of human babies, hence … Read more Spurious correlations and random walks

Descriptive Statistics Fundamentals For Data Science Aspirants

A few lines I wrote, dedicated to data engineers: Data, data everywhere, / consumers are now more aware, / so mine the data with utmost care, / and serve them everywhere. Yes, that is how valuable it is to treat and process data with the required precision, so that you can serve your customers/consumers effectively and responsibly. In applied statistics we … Read more Descriptive Statistics Fundamentals For Data Science Aspirants

Predicting NBA Rookie Stats with Machine Learning

Every year, millions of basketball fans from around the world tune in to the NBA Draft with the hope that their favorite team strikes gold and discovers the next big NBA star. The people in the front offices of these NBA teams spend thousands of hours scouting and evaluating college and international talent trying to … Read more Predicting NBA Rookie Stats with Machine Learning

Implementation of Data Preprocessing on Titanic Dataset

With step-by-step implementation in Python, NumPy, Pandas Kaggle Titanic dataset: https://www.kaggle.com/c/titanic-gettingStarted/data The machine learning model is supposed to predict who survived the Titanic shipwreck. Here I will show you how to apply preprocessing techniques on the Titanic dataset. For machine learning algorithms to work, it is necessary to convert the raw data … Read more Implementation of Data Preprocessing on Titanic Dataset

An introduction to Attention, Transformers and BERT: Part 1

The why and the what These lines from Ted Chiang’s novella “Story of Your Life” perhaps give a good sense of what differentiates attention-based architectures from the sequential nature of vanilla RNNs. Let’s take a quick look at vanilla RNNs and the encoder-decoder variation used in sequence-to-sequence tasks, understand what drawbacks these designs … Read more An introduction to Attention, Transformers and BERT: Part 1

Curly-Curly, the successor of Bang-Bang

Writing functions that take data frame columns as arguments is a problem that most R users have been confronted with at some point. There are different ways to tackle this issue, and this blog post will focus on the solution provided by the latest release of the {rlang} package. You can read the announcement here, which explains really well … Read more Curly-Curly, the successor of Bang-Bang

Text Parsing and Text Analysis of a Periodic Report (with R)

Some Context Those of you non-academia folk who work in industry (like me) are probably conscious of any/all periodic reports that an independent entity publishes for your company’s industry. For example, in the insurance industry in the United States, the Federal Insurance Office of the U.S. Department of the Treasury publishes several reports on an annual basis discussing the industry at large, … Read more Text Parsing and Text Analysis of a Periodic Report (with R)