How To Fetch The Exact Values From A Boxplot (Python)

An efficient way of extracting the descriptive statistics for a dataset from a matplotlib boxplot. (Image from Unsplash.) A boxplot is a type of visualization that displays the five-number summary of a dataset: the minimum and maximum (excluding outliers), the median, and the first (Q1) and third (Q3) quartiles. In Python, boxplots … Read more
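As a rough illustration of the five numbers described above, the sketch below computes them directly from the data rather than reading them off a plot object. It assumes the conventional 1.5 × IQR whisker rule and linearly interpolated quartiles; `boxplot_summary` is a hypothetical helper name, not part of matplotlib, and the sample data is made up.

```python
import statistics


def boxplot_summary(values, whis=1.5):
    """Return the statistics a conventional boxplot displays.

    Assumption: whiskers reach the most extreme data points that lie
    within `whis` * IQR of the quartiles; anything beyond is an outlier.
    """
    data = sorted(values)
    # Quartiles with linear interpolation between closest ranks.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "whisker_low": inside[0],    # minimum excluding outliers
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],  # maximum excluding outliers
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up sample with one obvious outlier.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

Computing the values from the raw data is a useful cross-check on whatever is extracted from the plot itself.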

Make a mock “real-time” stream of data with Python and Kafka

A Dockerized tutorial with everything you need to turn a .csv file of timestamped data into a Kafka stream. Abraca-Kafka! (Poorly drawn cartoon by author.) With more and more data science work moving towards real-time pipelines, data scientists need to learn to write streaming analytics. While some great, user-friendly, streaming data pipeline tools … Read more
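The core idea of a mock "real-time" stream can be sketched without Kafka at all: read the timestamped rows and wait between them for as long as their timestamps differ. The sketch below is only that replay loop, under stated assumptions: the column names `timestamp` and `value`, the sample data, and the `replay` and `read_rows` helpers are all hypothetical, and where the tutorial would hand each row to a Kafka producer this version simply prints it.

```python
import csv
import io
import time
from datetime import datetime

# Made-up sample data; a real run would read a .csv file instead.
SAMPLE_CSV = """timestamp,value
2021-08-01 00:00:00,10
2021-08-01 00:00:02,12
2021-08-01 00:00:05,15
"""


def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def replay(rows, speed=1.0, sleep=time.sleep):
    """Yield rows in order, pausing for the gap between timestamps.

    `speed` > 1 compresses time; `sleep` is injectable so the pacing
    can be inspected without actually waiting.
    """
    previous = None
    for row in rows:
        current = datetime.fromisoformat(row["timestamp"])
        if previous is not None:
            gap = (current - previous).total_seconds()
            sleep(max(gap, 0.0) / speed)
        previous = current
        yield row


# Replay 1000x faster than recorded so the demo finishes immediately.
for row in replay(read_rows(SAMPLE_CSV), speed=1000.0):
    print(row)  # a producer's send call would go here instead
```

Keeping the pacing logic separate from the transport makes it easy to test the timing before a broker is involved.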

RStudio Connect 2021.08.0 Python Updates

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. At RStudio we know that many data science teams leverage both R … Read more

RStudio Connect 2021.08.0 Custom Branding

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. Many organizations want to align RStudio Connect with their branding strategy. Whether … Read more

Analyze and Map Photo Locations with Python and Tableau

Even today, many DSLR and mirrorless cameras do not directly capture and store the GPS coordinates of a photo’s location in its image file. However, most, if not all, contemporary smartphones capture GPS coordinates. In the example shown here, the Nikon D750 did not capture GPS coordinates. However, the Nikon Z50 stored GPS coordinates only … Read more

The AI Checklist

Training Checklist: 41. Ensure interpretability is not compromised prematurely for performance during early stages of model development 42. Verify model tuning is following a scientific approach (instead of ad-hoc) 43. Verify the learning rate is not too high 44. Verify root causes are analyzed and documented if the loss-epoch graph is not converging 45. Analyzed … Read more

Interested In Deep Learning?

This section introduces the integral components of deep learning and how they operate to emulate learning commonly found amongst humans. Perceptrons And Neurons: The brain is responsible for all human cognitive functions; in short, the brain is responsible for your ability to learn, acquire knowledge, retain information and recall knowledge. Photo by Josh Riemer on … Read more

Deep understanding of the ARIMA model

Generally, a model for time-series forecasting can be written as in Eq 0.2 (definition of the time-series forecasting model), where yₜ is the variable to be forecast (the dependent, or response, variable), t is the time at which the forecast is made, h is the forecast horizon, Xₜ are the variables used at time t to … Read more


In this article, you’ll learn everything that you need to know about SMOTE. SMOTE is a machine learning technique that solves problems that occur when using an imbalanced data set. Imbalanced data sets often occur in practice, and it is crucial to master the tools needed to work with this type of data. SMOTE stands … Read more
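For orientation, the central step of SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority neighbours. This is a minimal, assumption-laden sketch, not the article's code or any library's implementation; `smote_sketch`, its parameters, and the sample points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic points by interpolating between a
    minority sample and one of its `k` nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, neighbour))
        )
    return synthetic


# Made-up two-dimensional minority class.
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote_sketch(minority_points, n_new=5, k=2)
print(new_points)
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies.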

Generate Interactive Plots in one line of Python Code

An essential guide to the Plotly Express library. (Image by Colin Behrens from Pixabay.) Exploratory data analysis is an essential component of a data science model development pipeline. A data scientist spends most of their time performing EDA to better understand the data and generate insights from it. There are various univariate, bivariate, and multivariate visualizing … Read more

Build Your Own Custom Dataset using Python

For the ID attribute, I used the uuid library to generate a random string of characters 100,000 times. Then, I assigned it to the ID attribute in the dataframe. df['id'] = [uuid.uuid4().hex for i in range(num_users)] UUID is a great library to generate unique IDs for each user because of its astronomically low chance of … Read more
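The ID generation described above can be tried on its own, without a dataframe. The sketch below uses only the standard library; the smaller `num_users` value is an assumption chosen to keep the demo quick, whereas the excerpt uses 100,000.

```python
import uuid

# Assumption: a smaller count than the article's 100,000, for a quick demo.
num_users = 1000

# uuid4() returns a random UUID; .hex is its 32-character hexadecimal form.
user_ids = [uuid.uuid4().hex for _ in range(num_users)]

print(user_ids[:3])
print("all unique:", len(set(user_ids)) == num_users)
```

The resulting list can be assigned to a dataframe column exactly as in the excerpt.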

When Do Support Vector Machines Fail?

(Photo by Tomas Sobek on Unsplash.) When To And Not Use Them, And How. You may have chanced upon my previous article on introducing support vector machines (SVMs), where key fundamental concepts were introduced at a high level. In this article, we discuss when SVMs are not appropriate for use, across the classification and regression … Read more

Hypothesis Testing: Z-Scores

Surely in some part of your training or even in your work, you have heard about hypothesis tests, but do you know what they are for or how they are implemented? If the answer is no, I invite you to stay because we will talk about the famous hypothesis tests in this blog. Get comfortable, … Read more

R : a combined usage of split, lapply and

#--------------------------------------------------
# 5) Multiple output with multiple key columns
#    while original row is preserved
#--------------------------------------------------

    # outer lapply
    lt.out <- lapply(
        # split by currency for outer lapply
        # (df is a placeholder: the data frame name is missing in the source)
        split(df, df$currency),
        function(x) {
            # inner lapply
            y <- lapply(
                # split by maturity for inner lapply
                split(x, x$maturity),
                function(x) {
                    data.frame(curr = x$currency,
                               mat  = x$maturity,
                               ws   = x$ws,
                               sum_ws = sum(x$ws))})
            # concatenate inner rows
            # (the call is lost in the source; do.call(rbind, ...) is assumed)
            do.call(rbind, y)
        })

    # concatenate outer rows
    df.out <- do.call(rbind, lt.out)
    rownames(df.out) <- NULL

    # add another group based calculation
    df.out$group_wgt <- df.out$ws/df.out$sum_ws
    print(df.out)

#--------------------------------------------------
>     print(df.out)
   curr mat         ws     sum_ws group_wgt
1   AUD  2y  106000000  320000000 0.3312500
2   AUD  2y  214000000  320000000 0.6687500
3   AUD  6m  213000000  213000000 1.0000000
4   CNY  6m   84000000  270000000 0.3111111
5   CNY  6m   42000000  270000000 0.1555556
6   CNY  6m  144000000  270000000 0.5333333
7   EUR  1y  250000000 2105000000 0.1187648
8   EUR  1y 1855000000 2105000000 0.8812352
9   EUR  3m 1785000000 1785000000 1.0000000
10  EUR  6m  200000000  200000000 1.0000000
11  USD  1y  112000000  112000000 1.0000000
12  USD  2y   56000000   56000000 1.0000000
13  USD  3m  285000000  741000000 0.3846154
14  USD  3m  456000000  741000000 0.6153846
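For readers who think in Python, the same split-apply-combine computation (a group total per currency and maturity, then each row's share of that total) can be sketched with the standard library. This is only a comparison sketch: `add_group_weights` is a hypothetical helper, and the rows are a subset of the values shown in the printed output above.

```python
from collections import defaultdict

# (currency, maturity, ws) rows taken from the printed output above.
rows = [
    ("AUD", "2y", 106_000_000),
    ("AUD", "2y", 214_000_000),
    ("AUD", "6m", 213_000_000),
    ("USD", "3m", 285_000_000),
    ("USD", "3m", 456_000_000),
]


def add_group_weights(rows):
    """Append the group total and each row's weight within its group,
    keeping every original row."""
    totals = defaultdict(int)
    for curr, mat, ws in rows:
        totals[(curr, mat)] += ws
    return [
        (curr, mat, ws, totals[(curr, mat)], ws / totals[(curr, mat)])
        for curr, mat, ws in rows
    ]


weighted = add_group_weights(rows)
for record in weighted:
    print(record)
```

Grouping on the pair of key columns at once replaces the nested split by currency and then by maturity.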

Working with Data Frames and Visualization Using Basic Python Libraries

This article aims to introduce fundamental Python approaches in interacting with structured data using a case study in the payment services industry. (Photo by Scott Graham on Unsplash.) Last year, my boss came to my place and gave me a small assignment to analyze the AS-IS online payment industry. Motivated by my passion for Python … Read more

Novel Approaches to Similarity Learning

The triplet loss has a couple of disadvantages that should be considered. First, it requires a careful selection of the anchor, positive, and negative images. The difference between the negative and the anchor images can’t be too large; if it were, the network would satisfy the loss function easily without learning anything. The anchor and … Read more
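The point about negatives that are too easy can be made concrete with a small numeric sketch. It assumes the common margin-based form of the triplet loss with Euclidean distance (some formulations use squared distance instead); the embeddings, the margin value, and the `triplet_loss` helper are all made up for illustration.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on plain coordinate tuples.

    Assumption: Euclidean distance and a hinge at zero.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


anchor = (0.0, 0.0)
positive = (0.1, 0.0)
easy_negative = (5.0, 5.0)   # far from the anchor
hard_negative = (0.2, 0.0)   # barely farther than the positive

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
print(easy_loss)  # zero: nothing left to learn from this triplet
print(hard_loss)  # positive: this triplet still produces a training signal
```

With the easy negative the hinge is already satisfied, so the triplet contributes no gradient, which is why triplet selection matters.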

Three questions for a data science manager

Diary of a data scientist. A framework for data science management. (Image from Gratisography.) Most of what I know about managing data scientists I learned on the job as a data science manager. After four years of practice, I wanted to reflect on what I have learned about data science management and what excellence in … Read more

How to Train your own TensorFlow models, and run them on shared hardware

(Photo by Alexander Sinn on Unsplash.) Before we start, let’s run a few commands to get our system ready. We install TensorFlow, TensorFlow Model Maker, NumPy and also pandas.

pip3 install tensorflow
pip3 install tflite-model-maker
pip3 install numpy~=1.19.2
pip3 install pandas

Then we open a python3 interpreter and run the following code.

import os
import numpy as np
import pandas … Read more

Merging the Theory of Mind and AI

With the rapid increase in the development of computational and deep neural networks, AI has evolved a lot in the past decade. Almost every field of life is getting involved with intelligent machines. Smart robotic systems have already covered a major portion of the automation industry. In all the industries of the modern era, the … Read more

Exploring Stock Market Listing Mortality since 1986

[This article was first published on R on Redwall Analytics, and kindly contributed to R-bloggers]. Click to see R set-up code # Libraries if(!require(“pacman”)) { … Read more

Make Your Neural Networks Smaller: Pruning

Pruning is an important tool to make neural networks more economical. Read on to find out how it works. (Photo by C D-X on Unsplash.) One problem of neural networks is their size. The neural networks you see in online tutorials are small enough to run efficiently on your computer, but many neural networks in … Read more

Binary Classification: XGBoost Hyperparameter Tuning Scenarios by Non-exhaustive Grid Search and…

Practical example of balancing model performance and computational resource limitations, with code and visualization. (Photo by d kah on Unsplash.) XGBoost or eXtreme Gradient Boosting is one of the most widely used machine learning algorithms nowadays. It is famously efficient at winning Kaggle competitions. Many articles praise it and address its advantage over alternative … Read more

Darkeras: Execute YOLOv3/YOLOv4 Object Detection on Keras with Darknet Pre-trained Weights

Everything in the universe is connected. Convolutional neural network-based object detection has become a dominant topic in computer vision as it has attracted numerous researchers in the field. Various state-of-the-art methods can be categorized into two main genres: one-stage object detector (e.g. SSD, YOLOv1-v5, EfficientDet, RetinaNet) and two-stage object detector (e.g. R-CNN, Fast R-CNN, Faster … Read more

7 Reasons Why You Should Use the Streamlit AgGrid Component

Improve displaying dataframes with the best JavaScript data grid. (Photo by MART PRODUCTION from Pexels.) I use Streamlit a lot. It’s a great library to quickly prototype visually appealing web applications that interact with data and machine learning models in a fun way. As a data scientist, I find Streamlit extremely helpful to share … Read more

Here’s How You Can Auto-Adjust Your Datatable Range in Excel with Java

Use-Case Explanation: To minimise the tedious, manual updates of Excel reports, a common request I have received from users at my workplace is to append and input incoming records into the same Excel Datatable on a regular basis. (Image by Author | An example to illustrate a Datatable rendered in Excel | Note that the … Read more

Is There Any Difference Between Scikit-Learn and Sklearn?

Is there any difference between scikit-learn and sklearn? The short answer is no: scikit-learn and sklearn both refer to the same package. However, there are a couple of things you need to be aware of. Firstly, you can install the package using either the scikit-learn or the sklearn identifier; however, it is recommended to install … Read more

Building a Residual Network with PyTorch

The Moment When Networks Become Really Deep. (Image from Unsplash.) Autonomous driving, face detection, and numerous computer applications owe their success to deep neural networks. Many may not realize, however, that the blossoming of computer vision advancements was due to a specific type of architecture: residual networks. In fact, state-of-the-art results that led to this … Read more

Build A Text Recommendation System with Python

Use NLP semantic similarity to provide the most accurate recommendations. (Denise Jans.) Natural Language Processing is one of the most exciting fields of Machine Learning. It enables our computer to understand very dense corpora, analyze them, and provide us the information we are looking for. In this article, we’ll create a recommendation system that acts … Read more

A Checklist of Basic Statistics

Suppose we knew the average weight (μ) and standard deviation (σ) of the male population in the United States. We then measure at random 100 males and find the average to be different from the population’s average. We do this a second time and find yet another average. If we keep doing this many times, … Read more
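The repeated-sampling experiment described above is easy to simulate. The sketch below assumes normally distributed weights and uses made-up values for μ and σ; `sample_means` is a hypothetical helper. It draws many samples of 100 and compares the spread of their averages with σ/√n, the standard error of the mean.

```python
import math
import random
import statistics


def sample_means(mu, sigma, n, repeats, seed=0):
    """Average of `n` random draws, repeated `repeats` times."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(repeats)
    ]


# Illustrative population parameters (assumed, not real census figures).
mu, sigma, n = 80.0, 12.0, 100

means = sample_means(mu, sigma, n, repeats=2000)

print("mean of sample means:", statistics.fmean(means))
print("spread of sample means:", statistics.stdev(means))
print("sigma / sqrt(n):", sigma / math.sqrt(n))
```

Each individual average differs from μ, but the averages cluster around μ with a spread close to σ/√n.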

Explore and understand your data with a network of significant associations.

Let’s continue with the Titanic dataset as it contains a structure that is often seen in real use cases, i.e., the presence of categorical, boolean, and continuous variables per sample. In the previous step we initialized and loaded the Titanic dataset. In this step we will pre-process the 12 input features: typing and one-hot encoding. … Read more

How to Perform Tukey HSD Test in R

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers]. Tukey HSD Test in R, When there are three or more … Read more

From Bayes’ Theorem to Bayesian Inference

To understand how Bayes’ Theorem relates to Bayesian Inference, we have to understand the theorem through probability distributions rather than just point probabilities. A probability distribution just gives the probability of all possible outcomes in any scenario, not just the most likely outcome. A probability distribution can be continuous, as in the expected IQ of … Read more
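Applying Bayes' Theorem to a whole distribution rather than a single probability can be sketched with a discrete grid of hypotheses. The example below is an assumption-based illustration, not taken from the article: it uses a coin with unknown bias, a uniform prior over eleven candidate values, and made-up data of 7 heads and 3 tails; `posterior` is a hypothetical helper.

```python
def posterior(prior, likelihood):
    """Bayes' Theorem applied to every hypothesis at once.

    `prior` maps hypothesis -> probability; `likelihood` returns the
    probability of the observed data under a given hypothesis.
    """
    unnormalised = {h: p * likelihood(h) for h, p in prior.items()}
    evidence = sum(unnormalised.values())  # P(data)
    return {h: w / evidence for h, w in unnormalised.items()}


# Candidate coin biases 0.0, 0.1, ..., 1.0 with a uniform prior.
grid = [i / 10 for i in range(11)]
prior = {p: 1 / len(grid) for p in grid}

# Made-up observations.
heads, tails = 7, 3
post = posterior(prior, lambda p: p**heads * (1 - p) ** tails)

best = max(post, key=post.get)
for bias, prob in post.items():
    print(f"bias={bias:.1f}  posterior={prob:.4f}")
print("most probable bias:", best)
```

The output is a full probability distribution over the candidate biases, with the most likely value being just one summary of it.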

Creating And Using A Quantile Normalizer

Before we use the quantile normalizer, we are going to need to understand how it works. The philosophy behind this normalizer is that values set above the third quantile and below the first quantile are likely somewhat of outliers. I am also going to make the positions editable using arguments. Then I am going … Read more

How to Use GETDATE() for Reports

No More Manual Adjustments! (Photo by Manavita S via Unsplash.) ‘Just copy and paste this query, go to the WHERE clause, change it to last week, and run it. It’s simple!’ (Some people you might know.) Manually changing values in a query on a daily/weekly/monthly cadence can be a huge pain. It also opens … Read more

Improve Linear Regression for Time Series Forecasting

Time series forecasting is a very fascinating task. However, building a machine-learning algorithm to predict future data is trickier than expected. The hardest thing to handle is the temporal dependency present in the data. By their nature, time-series data are subject to shifts. This may result in temporal drifts of various kinds which may become … Read more

Data Collection in Machine Learning Products

With examples. (Photo by Brett Jordan on Unsplash.) When I had just started my path in data science, everything was about accurate modeling for me. But I quickly realized that to provide real value, models can’t exist in a vacuum. I was missing important aspects of data to get reasonable performance, it wasn’t very clear how … Read more

Automated Marketing Mix Modeling with Facebook Experimental’s Robyn

Two big questions every marketeer has are “What’s the impact of my current marketing channels?” and “How should I allocate my budget strategically to get the optimal marketing mix?” These questions are not new. John Wanamaker (1838–1922), considered by some to be a pioneer in marketing, had the same questions and is known for his … Read more

Detecting time series outliers

[This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers]. The tsoutliers() function in the forecast package for R … Read more

Anomaly detection with TensorFlow Probability and Vertex AI

Time series anomaly detection is currently a trending topic—statisticians are scrambling to recalibrate their models for retail demand forecasting and more given the recent drastic changes in consumer behavior. As an intern, I was given the task of creating a machine-learning based solution for anomaly detection on Vertex AI to automate these laborious processes of … Read more

How I Taught Myself Tableau

1. Using available/pre-existing Tableau dashboards While trying to use existing dashboards to analyze, I often needed to understand the underlying data and add new filters to further dissect the trends. So, I downloaded a copy from the server and tried to understand how the original author had structured it. This really helped me in understanding … Read more

The Fundamentals of Data Warehouse + Data Lake = Lake House

(Photo by janer zhang on Unsplash.) As Data Warehouses and Data Lakes have evolved over the last few years, they have become more specialized yet siloed in their respective landscapes. Each of these data management technologies has its own identity and is best used for certain tasks and needs; however, they also struggle in … Read more

How to do “Limitless” Math in Python

Sounds like a catchy title? Well, what we really meant by that term is arbitrary-precision computation, i.e., breaking away from the restriction of 32-bit or 64-bit arithmetic that we are normally familiar with. Here is a quick example. This is what you will get as the value for the square-root of 2 if you just … Read more
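The contrast between 64-bit floats and arbitrary precision can be seen with the standard library alone. The sketch below uses Python's built-in `decimal` module as a stand-in; the article may well rely on a different library, and `sqrt_with_digits` is a hypothetical helper name.

```python
import math
from decimal import Decimal, localcontext


def sqrt_with_digits(value, digits):
    """Square root of `value` computed to `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits  # precision applies only inside this block
        return Decimal(value).sqrt()


# Ordinary 64-bit floating point: roughly 16 significant digits.
print(math.sqrt(2))

# The same quantity with 50 significant digits.
root = sqrt_with_digits(2, 50)
print(root)
```

Using a local context keeps the higher precision from leaking into unrelated decimal arithmetic elsewhere in a program.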