How to Boost 10X Productivity with Tmux

Personal Advice for Productive Data Scientists. For Data Scientists and Software Engineers. Tmux is a terminal multiplexer: it lets you view multiple terminal views and histories in one session (sourced by author). Oh no, I closed my terminal and lost my progress running <ML model>! Oh no, I lost track of my commands … Read more

Visualizing Missing Values in Python is Shockingly Easy

You should start by importing the packages:

    # Package imports
    import seaborn as sns
    import pandas as pd
    import missingno as msno
    %matplotlib inline

Importing missingno with the alias msno is the recommended way. Now you can use seaborn to import the Titanic dataset. This dataset is one of seaborn's example datasets, and you can simply run the command: # Load … Read more

Creating deep neural networks with 3 to 5 lines of code

We can create new deep neural networks by changing very few lines of code of already proposed models. Image by author. When dealing with supervised learning within deep learning, we might say that there are some classical approaches to follow. The first solution is the so-called “heroic” strategy where one creates a completely new deep … Read more

Detailed Guide to Multiple Linear Regression Model, Assessment, and Inference in R

Model Development, Interpretation, Variance Calculation, F-test, and t-test. Linear regression is one of those old-school statistical modeling approaches that are still popular. With the development of new languages and libraries, it is now much improved and much easier to work with. Multiple linear regression is an extension of simple linear regression. In simple … Read more

8 Ways To Filter a Pandas DataFrame by a Partial String or Pattern

Filtering a DataFrame refers to checking its contents and returning only those that fit certain criteria. It is part of the data analysis task known as data wrangling and is efficiently done using the Pandas library of Python. The idea is that once you have filtered this data, you can analyze it separately and gain … Read more
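
As a minimal sketch of this kind of filtering, the snippet below uses pandas' `Series.str.contains` to keep only the rows whose text column matches a substring or a regular expression. The DataFrame contents and column names are made up for illustration and do not come from the original article.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "name": ["Alice Smith", "Bob Jones", "Carol Smithson", None],
        "city": ["Nairobi", "Boston", "Newark", "Oslo"],
    }
)

# Keep rows whose name contains "smith", ignoring case.
# na=False treats missing values as non-matches instead of propagating them.
mask = df["name"].str.contains("smith", case=False, na=False)
smiths = df[mask]

# A regular expression works too: cities that start with "N".
n_cities = df[df["city"].str.contains(r"^N", regex=True)]

print(smiths)
print(n_cities)
```

The boolean mask is an ordinary Series, so several conditions can be combined with `&` and `|` before indexing.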

PyQt & Relational Databases — Data Format

DATA. PyQt provides a convenient way to present relational database data. Image by Author, background by Pexels. What have we learned? We have learned that the QTableView widget is a convenient and flexible way to present data to the user. In addition, it can handle relational databases efficiently, and it is needless to emphasize that … Read more

Wrapping numpy’s arrays

The container approach. Numpy’s arrays are powerful objects, and are often used as a base data structure for more complex objects, like those in pandas or xarray. That being said, you can of course also use numpy’s powerful arrays in your own classes, and to do so, you basically have 2 approaches: the subclass approach: you … Read more
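
A rough sketch of what a container (as opposed to a subclass) might look like is given below. The `Signal` class, its `unit` attribute, and its `scaled` method are hypothetical names chosen for illustration. The only NumPy hook used is the `__array__` method, which lets `np.asarray` convert the container into a plain array.

```python
import numpy as np


class Signal:
    """Hypothetical container holding a numpy array plus some metadata."""

    def __init__(self, values, unit):
        self.values = np.asarray(values, dtype=float)
        self.unit = unit

    def __array__(self, dtype=None, copy=None):
        # Called by np.asarray(...) to obtain a plain ndarray view of the data.
        if dtype is None:
            return self.values
        return self.values.astype(dtype)

    def scaled(self, factor):
        # Operations return a new container so the metadata is preserved.
        return Signal(self.values * factor, self.unit)


s = Signal([1, 2, 3], unit="V")
total = float(np.asarray(s).sum())
doubled = s.scaled(2)
print(total, doubled.values, doubled.unit)
```

The design choice here is that the class owns an array rather than being one, so only the operations it explicitly defines keep the metadata attached.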

A Quick Note on Graphs and the Formulation of Their Downstream Tasks

Photo by NASA on Unsplash. The entities in the context of knowledge graphs are treated as triples. Triple classification boils down to a binary classification task where there are two labels, namely, positive and negative. Positive triples are the ones that are part of the knowledge graph and negative ones are the ones that are … Read more

How Do You Use Categorical Features Directly with CatBoost?

This is the 4th (last) boosting algorithm that we cover under the “Boosting algorithms in machine learning” article series. So far, we’ve discussed AdaBoost, Gradient Boosting, XGBoost and LightGBM algorithms in detail with their Python implementations. CatBoost (Categorical Boosting) is an alternative to XGBoost. It has the following special features: Can handle categorical features directly … Read more

Temporal Loops: Intro to Recurrent Neural Networks for Time Series Forecasting in Python

A Tutorial on LSTM, GRU, and Vanilla RNNs, Wrapped by the Darts Multi-Method Forecast Library. People Collective Group, by geralt (free image on Pixabay). Today’s article will take up the ball and go beyond two earlier October articles I wrote on time series forecasts. The earlier tutorials introduced the Darts multi-method forecast library, … Read more

Data Science on Blockchain with R. Part II: Tracking the NFTs

A story about nodes and vertices. Examples of Weird Whale NFTs. These NFTs (token ids 525, 564, 618, 645, 816, 1109, 1523 and 2968) belong to the creator of the collection, Benyamin Ahmed (Benoni), who gave us permission to show them in this article. By Thomas de Marchin and Milana Filatenkova. Thomas is Senior Data Scientist … Read more

Categories: R

Integration of Discontinuous Functions and Euler’s Constant

Does the floor function have an antiderivative? Image from Wikimedia Commons. Recently I’ve been looking into some more experimental mathematics. It is always exciting and a little nerve-racking to present something that is not standard in the sense that the rigor might not be in place or the mathematical community hasn’t accepted it as a … Read more

Cross Validation in R with Example

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers]. What Does Cross-Validation Mean? Cross-validation is a statistical approach for determining … Read more

Categories: R

Fast and scalable forecasting with ahead::ridge2f

Two weeks ago I presented ahead, an R package for univariate and multivariate time series forecasting. Last week, I showed how ahead::dynrmf could be used for automatic univariate forecasting. This week, I compare the execution speeds of ahead::ridge2f (quasi-randomized autoregressive network) and ahead::varf (Vector AutoRegressive model), with their default parameters (notably 1 lag, … Read more

Categories: R

Visualising Similarity Clusters with Interactive Graphs

Take advantage of Python, Plotly, and NetworkX to create interactive graphs to find similarity clusters. Let us assume, as a running example, that my data is composed of word embeddings of the English language. I want to gain insights about the word distribution in the embedding space, specifically, if there are any clusters of very … Read more

Binomial distributions in practice

The ‘!’ notation is the factorial. As you might see, for a non-negative integer x, it is calculated as the product of all positive integers up to x (with 0! = 1), for example: 3! = 3 × 2 × 1 = 6. 2.2. The binomial density function (PMF). Now, we are ready to define the binomial density function as the probability of obtaining m successes in N Bernoulli trials: … Read more
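
To make the definition concrete, here is a small sketch of the standard binomial PMF, P(m) = N! / (m! (N − m)!) · p^m · (1 − p)^(N − m), written with Python's standard library. The function name `binomial_pmf` and the success probability `p` are names chosen for this illustration; the excerpt above is truncated before it states its own formula.

```python
from math import comb, factorial


def binomial_pmf(m: int, N: int, p: float) -> float:
    """Probability of exactly m successes in N independent Bernoulli trials."""
    # comb(N, m) is the binomial coefficient N! / (m! * (N - m)!).
    return comb(N, m) * p**m * (1 - p) ** (N - m)


# The binomial coefficient agrees with the factorial definition.
coefficient = factorial(5) // (factorial(2) * factorial(3))

# Probability of exactly 2 heads in 5 fair coin flips.
prob = binomial_pmf(2, 5, 0.5)
print(coefficient, prob)
```

Summing `binomial_pmf(m, N, p)` over m = 0, …, N gives 1, which is a quick sanity check on any implementation.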

lambda.min, lambda.1se and Cross Validation in Lasso : Binomial Response

    #============================================
    # cross validation by hand
    #============================================
    # get a vector of fold id used in cv.glmnet
    # to replicate the same result.
    # Therefore, this is subject to the change
    foldid <- cvfit$foldid # from glmnet
    #foldid <- user.foldid  # user-defined

    # candidate lambda range
    fit      <- glmnet(x, y, family = "binomial")
    v.lambda <- fit$lambda
    nla      <- length(v.lambda)

    m.mce <- matrix(0, nrow = nfolds, ncol = nla)
    m.tot <- matrix(0, nrow = nfolds, ncol = nla)
    m.mcr <- matrix(0, nrow = nfolds, ncol = nla)

    #-------------------------------
    # iteration over all folds
    #-------------------------------
    for (i in 1:nfolds) {

        # training   fold : tr
        # validation fold : va

        ifd <- which(foldid == i) # i-th fold

        tr.x <- x[-ifd,]; tr.y <- y[-ifd]
        va.x <- x[ifd,];  va.y <- y[ifd]

        # estimation using training fold
        fit <- glmnet(tr.x, tr.y, family = "binomial",
                      lambda = v.lambda)

        # prediction on validation fold
        prd <- predict(fit, newx = va.x, type = "class")

        # misclassification error for each lambda
        for (c in 1:nla) {
            # confusion matrix
            cfm <- confusionMatrix(
                as.factor(prd[,c]), as.factor(va.y))

            # misclassification count
            m.mce[i,c] <- cfm$table[1,2] + cfm$table[2,1]

            # total count
            m.tot[i,c] <- sum(cfm$table)

            # misclassification rate
            m.mcr[i,c] <- m.mce[i,c] / m.tot[i,c]
        }
    }

    # average misclassification error rate (mcr)
    v.mcr <- colMeans(m.mcr)

… Read more

Categories: R

How to set the Minimum Detectable Effect in AB-Tests

Demystifying the most elusive AB-Test Parameter. “What Minimum Detectable Effect Size should we use for this test?” Determining a Minimum Detectable Effect (MDE) value is one of the trickier parts of setting up an AB-Test with product teams. There is a lot of confusion about what this term means. And about what … Read more

Dynamic Mode Decomposition for Spatiotemporal Traffic Speed Time Series in Seattle Freeway

Spatiotemporal traffic data analysis is an emerging area in intelligent transportation systems. In the past few years, data-driven machine learning models have provided new dimensions for understanding real-world data, building data computing paradigms, and supporting real-world applications. In this blog post, we plan to: introduce a publicly available traffic flow dataset from Seattle, USA, design … Read more

SQL Challenge: Case Study

Photo by Emily Morter on Unsplash. In a previous post, I talked about how to use a framework to solve a SQL challenge. Today, I’m going to do a walkthrough of this challenge, asked by Microsoft in February 2021. Before you continue reading this article, I highly encourage you to try solving this question on … Read more

An Introduction To Decision Trees and Predictive Analytics

How can you ensure that a product launch will be successful? Decision trees are a great introduction to using data science for these types of business problems. Image by fietzfotos on Pixabay. Decision trees represent a connecting series of tests that branch off further and further down until a specific path matches a class or … Read more

Regression in R-Ultimate Guide

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers]. Regression in R. In a recent article, we discussed model fitting … Read more

Categories: R

The Polarization of Death

[This article was first published on R on, and kindly contributed to R-bloggers]. I’m continuing to update the covdata package in anticipation of a … Read more

Categories: R

Use ipywidgets and matplotlib to visualize noise decomposition in 3D: this is Pythagorean

Use 3D vectors to help your brain. Image by author. It is a well-known fact that the total noise variance of independent noise sources is the sum of their variances. Hence, for 3 independent noise sources with variances σi², we get: σtot² = σ1² + σ2² + σ3². Hence, the total noise standard deviation is just: σtot = √(σ1² + σ2² + σ3²). This equation can … Read more
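
The rule above (variances of independent sources add, so standard deviations combine like the sides of a right-angled box) can be sketched in a few lines of plain Python. The function name `total_noise_std` and the example values are illustrative assumptions, not taken from the article, which uses ipywidgets and matplotlib for its own visualization.

```python
import math


def total_noise_std(sigmas):
    """Standard deviation of the sum of independent noise sources."""
    # Variances add for independent sources, so take the root of their sum.
    return math.sqrt(sum(s**2 for s in sigmas))


# Three hypothetical noise sources with standard deviations 3, 4 and 12.
sigma_total = total_noise_std([3.0, 4.0, 12.0])
print(sigma_total)  # 13.0, since 9 + 16 + 144 = 169
```

This is the same computation as the length of a 3D vector with components σ1, σ2 and σ3, which is where the Pythagorean picture comes from.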

Introduction to Tensorflow-Probability

The tfp.layers module provides a user-friendly interface that lets developers switch their models from standard neural networks to Bayesian neural networks by replacing the original layers with probabilistic layers. In the following, I list some of the layers I often use, for reference. DenseVariational: epistemic uncertainty; IndependentNormal: aleatory uncertainty; DistributionLambda: aleatory uncertainty … Read more

Brownlow Medal Predictor

Using a data-driven approach to predict AFL Brownlow Medal votes. Ruck Contest. Image by The-Pope, CC BY-SA 4.0, via Wikimedia Commons. The Brownlow Medal is awarded to the “best and fairest” player in the AFL during the home and away season, as determined by the umpires. After each game, the three field umpires award 3, 2 … Read more

Caret vs. tidymodels — Create complete reusable machine learning workflows

An HR analytics battle between the two packages. Photo by Jonathan Tomas on Unsplash. If you use machine learning models in R, you probably use either caret or tidymodels. Interestingly, both packages were developed by the same author, Max Kuhn, along with many others. But how do they compare to each other in terms of feasibility … Read more

Key Metrics for Data Science Team Success

How data science team leaders can measure team performance and demonstrate success for the C-suite. Photo by Kaleidico on Unsplash. As the field of data science continues to grow and mature, many data science leaders struggle when C-suite executives ask them to demonstrate consistent success. A team may have delivered substantial projects, and models delivering … Read more

Amazon Transcribe now supports batch transcription in AWS Stockholm and Cape Town Regions

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for you to add speech-to-text capabilities to your applications without any machine learning expertise. Starting today, Amazon Transcribe supports batch transcription in the AWS Stockholm and Cape Town Regions.  Amazon Transcribe enables organizations to increase the accessibility and discoverability of their audio and … Read more

Categories: AWS

AWS App Mesh Metric Extension is now generally available

AWS App Mesh Metric Extension is now generally available. With the Metric Extension, customers can collect and filter aggregated App Mesh service metrics that help with debugging, simplify monitoring, and reduce usage costs. App Mesh Metric Extension is available to all customers running workloads on Amazon EC2, Amazon ECS, Amazon EKS, and self-managed Kubernetes. AWS App … Read more

Categories: AWS

Automating Machine Learning Pipelines

Using Model Search for Automating Machine Learning. Photo by Pietro Jeng on Unsplash. Creating a Machine Learning model is a difficult task because we need to write a lot of code to try different models and find the best-performing model for that particular problem. There are different libraries that can automate this process to … Read more

4 Different Methods for Changing the Font Size in Python Seaborn

Data visualization 101. Photo by Nick Fewings on Unsplash. Data visualization is an integral part of data science. We use visualizations in exploratory data analysis, model evaluation, and delivering the results. A well-prepared data visualization has the potential to be much more informative than plain numbers. Python being the top programming language in … Read more

Amazon QLDB launches new version of QLDB Shell

Amazon Quantum Ledger Database (QLDB) launches a new version of the QLDB Shell that is easier to install and use. QLDB customers can now download the QLDB Shell tailored to their favorite operating system and begin querying a QLDB ledger without any other installation steps or dependencies. Expert users have the option to … Read more

Categories: AWS

The Ultimate Literature Review for Causal Inference

Part 2: When an experiment is not possible: Quasi-Experiments. 1. Difference in Difference (DID). DID is usually used when there are pre-existing differences between the control and treatment groups. We utilize pre-experiment data to control for these baseline differences in the absence of any interventions. The table here summarizes DID (table by author). While it is widely used … Read more
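
The idea described above can be sketched as simple arithmetic: the DID estimate is the change in the treatment group minus the change in the control group, so any baseline gap that exists before the intervention cancels out. The function name and all numbers below are hypothetical and serve only to illustrate the calculation.

```python
def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences estimate from four group means."""
    treatment_change = treat_post - treat_pre
    control_change = control_post - control_pre
    # Subtracting the control change removes the shared time trend.
    return treatment_change - control_change


# Hypothetical group means: the treatment group starts 2 units above control.
effect = did_estimate(
    treat_pre=10.0, treat_post=15.0, control_pre=8.0, control_post=10.0
)
print(effect)  # (15 - 10) - (10 - 8) = 3.0
```

Note that the estimate is only meaningful under the assumption that both groups would have followed parallel trends without the intervention.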

Data Scientists’ baptisms of fire

It is not easy to tell exactly when a data science cadet becomes an actual data scientist. Unlike in the army, we do not have distinct ranks that would correlate with seniority. Neither is the career path of a data scientist linear. It is not uncommon for a data scientist working in an e-commerce … Read more

How to Create Your Personal Data Science Learning Curriculum

Of the paid learning resources that are out there, we’re going to cross out universities and bootcamps, because they take an instructor-and-mentor approach and are usually delivered on-site in real time. The great thing about this approach is the structured curriculum that they provide. Perhaps a topic for another article. 5.1. Books. Books … Read more

The most popular languages on Reddit, analyzed with Snowflake and a Java UDTF

The most popular language on Reddit (other than English) will surprise you. To build this chart I analyzed almost a million Reddit comments with Snowflake and a Java UDTF (in less than 2 minutes). The most popular languages on Reddit, after analyzing 1M comments: English, German(!), Spanish, Portuguese, French, Italian, Romanian(!), Dutch(!)… Surprising results, compared … Read more

Data Science Interview at Shopee Singapore

Having completed the second technical interview, the HR department informed me 30 minutes later that I’d successfully passed. In less than an hour, I will have my last interview. (The recruiter arranged 2 interviews for me on the same day. If I passed the second interview, I would be allowed to proceed to the last … Read more

Django ORM support for Cloud Spanner is now Generally Available
Software Engineer, GCP Databases

Today we’re happy to announce GA support for Google Cloud Spanner in the Django ORM. The django-google-spanner package is a third-party database backend for Cloud Spanner, powered by the Cloud Spanner Python client library. The Django ORM is a powerful standalone component of the Django web framework that maps Python objects to relational data. It … Read more

Cloud CISO Perspectives: October 2021
VP/CISO, Google Cloud

October has been a busy month for Google Cloud. We just held our annual conference, Google Cloud Next ’21, where we made significant security announcements for our customers of all sizes and geographies. It’s also National Cybersecurity Awareness Month, during which our security teams across Google delivered important research on new threat campaigns and product updates … Read more

Video walkthrough: Set up a multiplayer game server with Google Cloud
Global Content Production & Delivery Lead

In this video, we walk through the real-world situation described above, in which one of our team members wants to create a persistent shared gaming experience with a friend. One of our training experts shows his colleague step-by-step how to use Compute Engine to host a multiplayer instance of Valheim from Iron Gate Studio and … Read more