Multilevel Models in R

Regression analysis is one of the most in-demand machine learning methods in 2019. One family of regression methods for measuring hierarchical effects is multilevel models. This method is well suited to analyzing spatial differences between groups in a dataset. In this article, you will learn how to do multilevel modelling in R. Multilevel models: Many …

Machine Learning Pipelines: Nonlinear Model Stacking

Normally, we face data sets that are fairly linear or can be manipulated into a linear form. But what if the data set that we are examining really should be looked at in a nonlinear way? Step into the world of nonlinear feature engineering. First, we’ll look at examples of nonlinear data. Next, we’ll briefly discuss the …

Fooling real cars with Deep Learning

The vulnerability: Machine learning based classifiers are prone to adversarial attacks. This means that visual machine learning classifiers that perceive a certain traffic sign (50 km/h, for instance) cannot correctly deal with all images that are correctly interpreted by a human being. It is possible to intentionally create traffic sign images that will be understood …

How a Non-Data Scientist Learned R and Delivered Reports 3 Days Faster

In 2016, Chris Cardillo was a Strategist supporting the media buying team at M&C Saatchi Performance (known at the time as M&C Saatchi Mobile), a digital advertising agency with over 100 employees. His team faced a problem familiar to many: how to efficiently aggregate data for client reporting. The main culprit was multiple sources of …

Importance of data visualization to derive actionable insights

Data visualization is not itself about insight, but rather about communicating insight. Quantitative insights derived from churning through huge amounts of data are often subtle, surprising, and technically complex. This makes it more challenging to communicate these insights to any audience, and especially to a business audience who might …

Introduction to Image Segmentation with K-Means clustering

Image segmentation is an important step in image processing, and it seems to come up everywhere we want to analyze what’s inside an image. For example, if we want to find out whether there is a chair or a person in an indoor image, we may need image segmentation to separate objects and analyze each object individually to check …

Getting Started with Apache Spark

Apache Spark is described as a ‘fast and general engine for large-scale data processing.’ However, that doesn’t even begin to encapsulate the reason it has become such a prominent player in the big data space. Apache Spark is a distributed computing platform, and its adoption by big data companies has been on the rise at …

Actuarial Science and Data Science with Lifelib

One elementary concept in the actuarial profession, tested on the FM (Financial Mathematics) Exam, is evaluating bond prices by discounting bond coupons as well as the redemption value back to the date of issue of the bond. When buying a bond, the investor is essentially giving the government or company a loan, which the government …
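
The discounting idea in this excerpt can be illustrated with a minimal Python sketch. It assumes a level annual coupon, a flat effective annual yield, and a redemption value equal to the face value unless stated otherwise; the function name `bond_price` and its parameters are hypothetical and are not taken from lifelib or from the linked article.

```python
def bond_price(face_value, coupon_rate, yield_rate, periods, redemption_value=None):
    """Price at issue = present value of coupons + present value of redemption.

    Assumes one coupon per period and a flat effective yield per period.
    All names here are illustrative, not part of any library.
    """
    if redemption_value is None:
        redemption_value = face_value  # assumption: bond redeems at par

    coupon = face_value * coupon_rate      # coupon paid at the end of each period
    v = 1.0 / (1.0 + yield_rate)           # one-period discount factor

    # Discount each coupon back to the issue date.
    pv_coupons = sum(coupon * v ** t for t in range(1, periods + 1))
    # Discount the redemption value paid at maturity.
    pv_redemption = redemption_value * v ** periods
    return pv_coupons + pv_redemption


# When the coupon rate equals the yield, the bond prices at (approximately) par.
price = bond_price(face_value=1000, coupon_rate=0.05, yield_rate=0.05, periods=10)
print(round(price, 2))
```

Lowering the yield below the coupon rate pushes the price above par, and raising it pushes the price below par, which is a quick sanity check on the discounting.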

Illustrated: 10 CNN Architectures

What architecture is this? 🤔 A compiled visualisation of the common convolutional neural networks (TL;DR: jump to the illustrations here). How have you been keeping up with the different convolutional neural networks (CNNs)? In recent years, we have witnessed the birth of numerous CNNs. These networks have been getting so unforgivingly deep that it …

Creating vectors

A vector is the most elementary way to store and structure data in R. For now, think of it as a list of numbers, which can be as short as a single number, or as long as about 2 billion(!) numbers. Perhaps you were used to working with lists of numbers already in a spreadsheet …

Supercharge Your AI Research With Pytorch Lightning

How you feel when running a single model on 200 GPUs. Pytorch Lightning has all of this already coded for you, including tests to guarantee that there are no bugs in that part of the program. This means you can focus on the core of your research and not worry about all the tedious engineering …

What is Two-Stream Self-Attention in XLNet

Understand the Two-Stream Self-Attention in XLNet intuitively. In my previous post, What is XLNet and why it outperforms BERT, I mainly talked about the difference between XLNet (AR language model) and BERT (AE language model) and about Permutation Language Modeling. I believe that having an intuitive understanding of XLNet is far more important than the implementation …

Are All Explainable Models Trustworthy?

Explainable AI or Explainable Data Science is one of the top buzzwords of Data Science at the moment. Models that are explainable are seen as the … A reason frequently given for making models more explainable is that they will then be trusted more readily by users, and sometimes it appears people assume the ideas …

Uncertainty Sampling Cheatsheet

When a Supervised Machine Learning model makes a prediction, it often gives a confidence in that prediction. If the model is uncertain (low confidence), then human feedback can help. Getting human feedback when a model is uncertain is a type of Active Learning known as Uncertainty Sampling. The four types of Uncertainty Sampling covered in …
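
As a rough illustration of the idea in this excerpt, the sketch below ranks items by one common uncertainty score, least confidence (one minus the highest predicted probability), and picks the most uncertain ones for human review. This is an assumption chosen for illustration: the excerpt is truncated before it names its four methods, and the function names and sample data here are hypothetical.

```python
def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the most likely label."""
    return 1.0 - max(probs)


def most_uncertain(predictions, k=1):
    """Return the ids of the k items the model is least confident about.

    `predictions` is a list of (item_id, class_probabilities) pairs.
    """
    ranked = sorted(
        predictions,
        key=lambda item: least_confidence(item[1]),
        reverse=True,  # highest uncertainty first
    )
    return [item_id for item_id, _ in ranked[:k]]


# Hypothetical model outputs for three unlabeled items.
predictions = [
    ("a", [0.90, 0.10]),
    ("b", [0.55, 0.45]),
    ("c", [0.70, 0.30]),
]

# Items to send for human labeling, most uncertain first.
print(most_uncertain(predictions, k=2))
```

Items selected this way would be labeled by a person and added to the training data, which is the feedback loop the excerpt describes.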

Algorithmic Game Theory with Nashpy

Game Theory is a method of studying strategic situations. A ‘strategic’ situation is a setting where the outcomes which affect you depend not just on your own actions, but on the actions of others as well. Let’s think about the market of firms: if the scenario is that of Perfect Competition, all the firms are …

NLP for Beginners: Cleaning & Preprocessing Text Data

NLP is short for Natural Language Processing. As you probably know, computers are not as great at understanding words as they are numbers. This is all changing, though, as advances in NLP are happening every day. The fact that devices like Apple’s Siri and Amazon’s Alexa can (usually) comprehend when we ask about the weather, for directions, …

7 Steps to Ensure and Sustain Data Quality

Several years ago, I met a senior director from a large company. He mentioned the company he worked for was facing data quality issues that eroded customer satisfaction, and he had spent months investigating the potential causes and how to fix them. “What have you found?” I asked eagerly. “It is a tough issue. I …

An Actual Application for the MNIST Digits Classifier

Solving Sudoku Puzzles Even Faster Than Advertised. Have you ever thought to yourself, “I just made a great MNIST classifier! Now what?” While the handwritten digits dataset is a great, clean way to get into machine learning (on the classification side, anyway), it is rightly dubbed the “Hello World” of the field. You can use …

Pitfalls of Data Normalization

Mathematical Statistics and Machine Learning for Life Sciences: How normalization leads to spurious correlations in Simplex. This is the fourth article in the column Mathematical Statistics and Machine Learning for Life Sciences. In this column, as well as in Deep Learning for Life Sciences, I have been repeatedly emphasizing that the data we …

Signal Detection Theory vs. Logistic Regression

I recently came across a paper that explained the equality between the parameters of signal detection theory (SDT) and the parameters of logistic regression in which the state (“absent”/“present”) is used to predict the response (“yes”/“no”, but also applicable in scale-rating designs) (DeCarlo, 1998; DOI: 10.1037/1082-989X.3.2.186). Here is a short simulation-proof for this equality. Setup …

I didn’t mean() to ignore the median()

This week’s post follows directly from last week’s investigation of data from the 2016 US Census Bureau’s American Community Survey (ACS) Public Use Microdata Sample (PUMS). We explored mean differences in income across several different types of employment status (self-employed, private sector, government, etc.). We found, using Bayesian methods, strong evidence for differences across the categories and were able to plot them in …

Setting up your Own Little Server at your Home

Now comes the fun part: just getting it all connected. This is how it looks after connecting everything and switching on the power. All the Lego blocks in place. Here I have connected an HDMI cable to the display, a micro SD card, a keyboard, a mouse, and finally a power adapter. Once you have started up …

Introduction to backtesting trading strategies

Learn how to build and backtest trading strategies using zipline. In this article, I would like to continue the series on quantitative finance. In the first part, I described the stylized facts of asset returns. Now I would like to introduce the concept of backtesting trading strategies and how to do it using …

A Guide to Convolutional Neural Networks from Scratch

Convolutional Layers: Convolution may seem like a scary word, but it is not. In pure mathematics, it is a way to measure how the shape of one function is modified by the shape of another function. This definition can easily be extended to computer vision. If you were trying to see if an image is …

How to manage Machine Learning and Data Science projects

Whether you’re building a complex computer vision algorithm using deep learning, a learning-to-rank model with LightGBM, or even a simple linear regression, the process of building an ML model has well-defined phases. Below is how we break the model-building process into sequential phases, from the initial research all the way to the analysis of the …

Instance Selection: The myth behind Data Sampling

One of the most common and most challenging issues in any Big Data system is selecting stratified samples in a way that is representative of the characteristics of the overall data population. From data annotation to the selection of an evaluation dataset, data sampling is key to the success of every Data Science solution. Efficient sampling is …

Reinforcement Learning: let’s teach a taxi-cab how to drive

Reinforcement Learning is a subfield of Machine Learning whose tasks differ from ‘standard’ ways of learning. Indeed, rather than being provided with historical data and making predictions or inferences on them, you want your reinforcement algorithm to learn, from scratch, from the surrounding environment. Basically, you want it to behave as you would have done …

Lessons from How to Lie with Statistics

Scientists are usually limited to small samples by legitimate problems, but advertisers use small numbers of participants in their favor by conducting many tiny studies, one of which will produce a positive result. Humans are not great at adjusting for sample sizes when evaluating a study, which in practice means we treat the results of …

anytime 0.3.5

A new release of the anytime package is arriving on CRAN. This is the sixteenth release, and comes a good month after the 0.3.4 release. anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date …

The True Meaning of p-Value

And how it impacts your decision making. p-values shouldn’t make us feel lost, but give us direction. We’ve learnt about p-values many times before. First, in the statistics book we crammed at university; second, when we read some articles assuring us they were legitimate; and finally now as we decide how …

APTOS 2019 Blindness Detection

Deep Learning models require significant amounts of data and resources to train properly. As a general rule of thumb, the more data the more accurate the output. For example, the ImageNet ILSVRC model was trained on 1.2 million images over the period of 2–3 weeks across multiple GPUs. Transfer learning is a machine learning method …

10 Lessons I Learned Training GANs for a Year

Training Generative Adversarial Networks is hard: let’s make it easier. A year ago I decided to begin my journey into the world of Generative Adversarial Networks, or GANs. I’ve been intrigued by them since the beginning of my interest in Deep Learning, mainly for the incredible results that they could produce. When I think …

The Fibonacci sequence and linear algebra

Leonardo Bonacci, better known as Fibonacci, has influenced our lives profoundly. At the beginning of the $13^{th}$ century, he introduced the Hindu-Arabic numeral system to Europe. Instead of Roman numerals, where I stands for one, V for five, X for ten, and so on, the Hindu-Arabic numeral system uses position to index magnitude. This …

Program Evaluation: Interrupted Time Series in R

Regression analysis is one of the most in-demand machine learning methods in 2019. One family of regression methods for measuring effects and evaluating a policy program is interrupted time series. This method is well suited for benchmarking and finding improvements for optimization in organizations. It can, therefore, be used to design organizations …

Unsupervised Machine Learning in R: K-Means

K-Means clustering is unsupervised machine learning because there is no target variable. Clustering can be used to create a target variable, or simply to group data by certain characteristics. Here’s a great and simple way to use R to find clusters, visualize them, and then tie them back to the data source to implement a marketing strategy. …

Introduction to Principal Component Analysis

By In Visal, Yin Seng, Choung Chamnab & Buoy Rina. This article was presented to the ‘Facebook Developer Circle: Phnom Penh’ group on 20th July 2019. The original slide pack is available at https://bit.ly/2Z1pyAb. According to DataCamp, PCA can be viewed in the following ways: one of the more useful methods from applied linear algebra; non-parametric …

All Roads Lead to Rome

I was inspired by this visualisation, showing the optimal routes (by car) from the geographic centre of the USA to all counties. The proverb “All Roads Lead to Rome” immediately came to mind and I set out to hack together something along that theme. This is what was required: Find a list of major cities …

The statistics of the improbable

Bible codes, investment funds, lotteries, and the curse of the look-elsewhere effect. In 1994, researchers from the Hebrew University of Jerusalem published a paper in the journal Statistical Science, claiming to find evidence that the Book of Genesis predicts the future. The result baffled the researchers as well as …

Network Analysis and Community Clustering using Chicago Ride-Share Data

Community Clustering using Louvain Modularity: With the network built, algorithms can be used to cluster the network in order to identify locations which form communities based on the ride-share data. While a native of Chicago might be able to answer this question, I was curious to see how quickly I could learn about how transportation worked …

Explaining Predictions: Random Forest Post-hoc Analysis (permutation & impurity variable importance)

Direction of post: In the next few posts, we will look at model-specific post-hoc analysis, which involves ranking the variables according to their importance to the model. Though these interpretations can be applied to both white box and black box models, we will be explaining black box models in these posts as white box models …