Forgotten features of R 4.0.0

R version 4.0.0 was released almost two years ago. The change in the major version, 3.x.y to 4.0.0, represented significant and potentially breaking changes. For an organisation to start using these new features, everyone in the company must have access to that version; otherwise code isn’t shareable. This naturally slows down adoption. We moved our … Read more

Categories: R

AI Geospatial Wildfire Risk Prediction

The goal of this project is to use the expansive geospatial datasets available on GEE to create a map that rates wildfire risk and danger levels across the United States. Approaching this as a pixel-wise classification problem, there are two big steps to discuss. First is the preparation of the dataset which involves: the selection … Read more

Tackling the Take-Home Challenge

The markdown file is very helpful to get a feel for what kind of data we’ll be working with. It includes data definitions and a very open-ended instruction. Because there are essentially no constraints, we will use Python and Jupyter notebooks. I have a directory on my computer, coding_interviews, that contains every take home … Read more

Modern Data Stack: which place for Spark?

I just had to type “Modern Data Stack” in Google Images to notice that all the companies in the Data market are proposing their own list of technologies composing this stack, as they generally try to include themselves in the list. But I also noticed that this Modern Data Stack is generally built completely without … Read more

Text Cleaning for NLP in Python

A Critical Step in Natural Language Processing Made Easy! One of the most common tasks in Natural Language Processing (NLP) is to clean text data. In order to maximize your results, it’s important to distill your text to the most important root words in the corpus and clean out … Read more

What’s New in txtai 4.0

When content is enabled, the entire dictionary will be stored and can be queried. In addition to similarity queries, txtai accepts SQL queries. This enables combined queries using both a similarity index and content stored in a database backend. Query with SQL: [{'text': 'The National Park Service warns against sacrificing slower friends in a bear … Read more

Boruta SHAP: an amazing tool for feature selection every data scientist should know

When building a machine learning model, we know that having too many features brings issues such as the curse of dimensionality, besides the need for more memory, processing time, and power. In our Feature Engineering pipelines we employ feature selection techniques to try to remove less useful features from our datasets. This raises a problem: … Read more

Demystifying ROC and precision-recall curves

Debunking some myths about the ROC curve / AUC and the precision-recall curve / AUPRC for binary classification with a focus on imbalanced data. The receiver operating characteristic (ROC) curve and the precision-recall (PR) curve are two visual tools for comparing binary classifiers. Related to this, the area under the ROC curve (AUC, aka AUROC) … Read more

Image Compression with PCA

Utilizing Images to Beautifully Represent Principal Component Analysis. Principal Component Analysis or PCA is a dimensionality reduction technique for data sets with many continuous (numeric) features or dimensions. It uses linear algebra to determine the most important features of a dataset. After these features have been identified, you can … Read more

Inventory Management for Retail — Periodic Review Policy

1. Inventory Management for Retail. As an Inventory Manager of a mid-size retail chain, you are in charge of setting the replenishment quantity in the ERP. Because your warehouse operational manager is complaining about the order frequencies, you start to challenge the replenishment rules implemented in the ERP, especially for the fast runners. Previously we … Read more

Rating Each Driver’s 2021 Season – 10 – 1

[This article was first published on Sport Data Science, and kindly contributed to R-bloggers]. Hello, welcome to the second part of this look at … Read more

Categories: R

How to Develop an R Shiny Dashboard In 10 Minutes or Less

Developing an R Shiny dashboard from scratch can be a time-consuming process. Luckily for you, you don’t need to start from scratch. In 2021 we released four R Shiny dashboard templates that are open to the public. The best part is – you can use and modify them free of charge! Today we’ll show you … Read more

Categories: R

A common mistake to avoid in Machine Learning projects

Machine Learning projects are usually very exciting and may provide a lot of value to a company if properly developed. However, even though some projects may seem similar, each one has its unique characteristics and must be developed with caution to avoid mistakes. The eagerness to tackle the project … Read more

Metaprogramming in Julia: A Full Overview

The first thing you need to know about Julia prior to engaging with metaprogramming in general is that everything in Julia is a symbol. That is to say, the type of everything is not Symbol(), but there is a lookup for every existing name inside of Julia. We can actually index individual scopes by symbols, … Read more

What’s in a Lambda? — Part 2

Now that you’ve learned about lambda functions in Python, I’ll walk through a data processing example. This is a follow-up to my earlier article, “What’s in a Lambda?” Be sure to check it out first; I decided to write this follow-up due to the original article’s popularity. In that … Read more

The power of Modulo in Data Analysis

You might know the Modulo Operator in different programming languages. But how can you use this operator, and for what? In short, the Modulo operator returns the remainder of a division. Many programming languages have an operator or a function to calculate Modulo. T-SQL has the % operator, and DAX … Read more
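
As a small illustration of the remainder idea described in this excerpt, here is a minimal Python sketch. The helper names `assign_bucket` and `split_seconds` are hypothetical and do not come from the article, which works with T-SQL and DAX; Python is used here only to show the arithmetic.

```python
# Modulo returns the remainder of an integer division: 17 = 3 * 5 + 2.
print(17 % 5)  # remainder of 17 divided by 5


def assign_bucket(row_number: int, n_buckets: int) -> int:
    """Cycle row numbers through a fixed number of buckets (hypothetical helper)."""
    return row_number % n_buckets


def split_seconds(total_seconds: int) -> tuple[int, int, int]:
    """Break a duration into hours, minutes and seconds.

    divmod returns the quotient and the remainder in a single call.
    """
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours, minutes, seconds


print(assign_bucket(7, 3))
print(split_seconds(3725))
```

Note that the sign of the result for negative operands differs between languages, so a pattern like this is safest with non-negative inputs.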

Fuzzy String Search: Pruning The Search Space

Phonetic keys, Locality Sensitive Hashing. This is the problem of finding approximate matches of a string in a given dictionary of strings. Let’s see an example. We want to search for Jonahtan in a dictionary of clean first names of people. What we really mean is we want to … Read more
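
To make the Jonahtan example concrete, here is a minimal brute-force sketch using Python's standard `difflib` module. It is not the article's method: it compares the query against every entry rather than pruning the search space with phonetic keys or locality sensitive hashing. The name list, the `find_close_names` helper and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical dictionary of clean first names (not from the article).
FIRST_NAMES = ["Jonathan", "John", "Joan", "Nathan", "Jonas"]


def find_close_names(query: str, names: list[str], cutoff: float = 0.8) -> list[str]:
    """Return names whose similarity ratio to the query is at least `cutoff`.

    This scans every entry in `names`, so the cost grows linearly with the
    dictionary size; pruning techniques aim to avoid exactly this full scan.
    """
    return difflib.get_close_matches(query, names, n=3, cutoff=cutoff)


matches = find_close_names("Jonahtan", FIRST_NAMES)
print(matches)
```

A full scan like this is a reasonable baseline for small dictionaries and a useful reference when checking the results of a faster, pruned search.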

5 Advanced Tips on Python Decorators

Do you want to write concise, readable, and efficient code? Well, Python decorators may help you on your journey. In chapter 7 of Fluent Python, Luciano Ramalho discusses decorators and closures. They are not super common in basic DS work; however, as you start building production models writing async … Read more

Upgrading R

[This article was first published on R – datawookie, and kindly contributed to R-bloggers]. This is the recipe I use to upgrade R on a … Read more

Categories: R

DataCamp Competition – Was a website redesign successful

[This article was first published on Blogs on Adejumo R.S, and kindly contributed to R-bloggers]. “🧑If first you don’t succeed, try two or more times … Read more

Categories: R

simstudy update: ordinal data generation that violates proportionality

Version 0.4.0 of simstudy is now available on CRAN and GitHub. This update includes two enhancements (and at least one major bug fix). genOrdCat now includes an argument to generate ordinal data without an assumption of cumulative proportional odds. And two new functions defRepeat and defRepeatAdd make it a bit easier to define multiple variables … Read more

Categories: R

The Naive Bayes classifier: How it works

Contents: Introduction 1. Bayes’ theorem 2. Naïve Bayes classifier 3. A simple binary classification problem 3.1 Prior probability computation 3.2 Class conditional probability computation 3.3 Predicting posterior probability 3.4 Treating Features with continuous data 3.5 Treating incomplete datasets 4. Naïve Bayes using Scikit Learn 4.1 Handling mixed features 5. Conclusion 6. References … Read more

A Comparative Review of the R-Instat GUI for R

[This article was first published on R |, and kindly contributed to R-bloggers]. By Robert A. Muenchen. Introduction: R-Instat is a free and open source … Read more

Categories: R

Designing Data Systems: Complexity & Modular Design

Design thinking approach for data scientists and engineers. Going from notebooks to creating machine learning systems that work in the real-world means shifting context from writing simple scripts, notebooks and visualizing data in the lab environment. Now it’s the time to think about building a system and good … Read more

Ameca Is Proof Hyper-Realistic Robots Won’t Be Long

Its expressiveness will give you goosebumps. Ameca isn’t the smartest. It isn’t the most helpful or skillful. It can’t have an engaging conversation or even move. But despite its apparent irrelevance, this handsome robot went viral recently because it looks like you and me. Conversations on AI and robotics … Read more

Machine Learning Serverlessly

How to build a complicated Machine Learning web application serverlessly on AWS. A lot of work and effort goes into productionizing a machine learning based product. We recently finished developing a web application to predict an Olympic diver’s performance. Given an Olympic diver’s video, the web app predicts what score a human … Read more

K-Means Clustering: Explain It To Me Like I’m 10

A friendly introduction to a perennially popular clustering algorithm. This is going to be the second installment (only because installment sounds fancier than article!) in the Explaining Machine Learning Algorithms to 10-year Olds series. You can find the XGBoost Classification article here. Today we’ll be explaining K-Means Clustering, a very popular clustering algorithm, to a … Read more

SSL could Avoid Supervised Learning

For select supervised tasks with self-supervised learning (SSL) models satisfying certain properties. Figure 1. Named entity recognition (NER) is solved in this post with self-supervised learning (SSL) alone, avoiding supervised learning. The approach described here addresses the challenges facing any NER model in real-world applications. A supervised model, in particular, requires sufficient labeled sentences to address … Read more

Amazon RDS for PostgreSQL supports new minor versions 13.5, 12.9, 11.14, 10.19, and 9.6.24; Amazon RDS on Outposts supports new PostgreSQL minor versions 13.5 and 12.9

Following the announcement of updates to the PostgreSQL database, we have added support in Amazon Relational Database Service (Amazon RDS) for PostgreSQL minor versions 13.5, 12.9, 11.14, 10.19, and 9.6.24. We have also added support in Amazon RDS on Outposts for PostgreSQL minor versions 13.5 and 12.9. This release closes security vulnerabilities in PostgreSQL and … Read more

Categories: AWS

Building Confidence on Explainability Methods

Explainability must be an integral part of modeling for a Data Scientist. Suppose we were to develop a credit scoring model (whether or not we are going to grant a loan to someone); explainability could provide many insights: verify that expected features (salary, debt ratio…) have a significant impact, or conversely understand why unexpected ones … Read more

Model-Based Decision-Making for Health Data

Now comes the juicy section I’d like to put my energy and attention toward: model-based learning. We’ve filled in our missing data and found our best informative features, so now let’s take a look at the shape of the dataset. We have 165 total data records, with 102 cases of patient survival and 63 cases … Read more

Create Artificial Data With SMOTE

How you can leverage a simple algorithm to compensate for lack of data. Data imbalance is ubiquitous in machine learning. Real data rarely represents every class equally. In applications such as disease diagnosis, fraud detection, and spam classification, some classes will always be underrepresented. This is a major obstacle … Read more

How I developed a fully functional Purchasing Application using Python

Purchase Order entry, send it to your supplier, and receive the product in your warehouse. Python is the most famous language when it comes to data: right from data integration to analysis to prediction. Considering it is open source, there are developers developing new libraries, bringing in new capabilities. … Read more

December 2021: “Top 40” New CRAN Packages

[This article was first published on R Views, and kindly contributed to R-bloggers]. One hundred thirty-four new packages made it to CRAN last December. Here … Read more

Categories: R

Automatic differentiation in R with Stan Math

Automatic differentiation Automatic differentiation (AD) refers to the automatic/algorithmic calculation of derivatives of a function defined as a computer program by repeated application of the chain rule. Automatic differentiation plays an important role in many statistical computing problems, such as gradient-based optimization of large-scale models, where gradient calculation by means of numeric differentiation (i.e. finite-differencing) is … Read more
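
The chain-rule idea in this excerpt can be sketched with forward-mode automatic differentiation over dual numbers. This is a minimal Python illustration under stated assumptions, not the Stan Math approach the article covers: the `Dual` class, the `sin` wrapper, the `derivative` helper and the test function `f` are all hypothetical names introduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to the input."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x: Dual) -> Dual:
    # Chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def derivative(func, x: float) -> float:
    """Evaluate func at x with a seed derivative of 1 and read off df/dx."""
    return func(Dual(x, 1.0)).deriv


def f(x):
    return x * x + 3 * x + sin(x)


# Analytically, f'(x) = 2x + 3 + cos(x).
print(derivative(f, 2.0), 2 * 2.0 + 3 + math.cos(2.0))
```

Unlike finite differencing, each operation here propagates an exact derivative alongside the value, so the only error is ordinary floating-point rounding.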

Categories: R

Predicting When Kickers Get Iced with {tidymodels}

Normally, I would do some EDA to better understand the data set but in the interest of word count I’ll jump right into using tidymodels to predict whether or not a given field goal attempt will be iced. In order to make the data work with the XGBoost algorithm I’ll subset and convert some numeric … Read more

Categories: R

How I started writing Data Science blog posts: overcoming fear and procrastination

To be completely transparent, I am not a successful author — or even a particularly successful blogger — so if you are aiming to make a living from writing articles this may not be for you. However, if you are a busy Data Scientist, Machine Learning Engineer or Software Engineer, who has been meaning to … Read more