Faster Hyperparameter Tuning with Scikit-Learn

Photo by Roberta Sorge on Unsplash If you are a Scikit-Learn fan, Christmas came a few days early in 2020 with the release of version 0.24.0. Among the new features are two experimental classes in the model_selection module that support faster hyperparameter optimization: HalvingGridSearchCV and HalvingRandomSearchCV. Like their close cousins GridSearchCV and RandomizedSearchCV, they use … Read more Faster Hyperparameter Tuning with Scikit-Learn

Kaggle — Predict survival on the Titanic challenge in MATLAB

Let’s begin … Reading the dataset >> Titanic_table = readtable('train.csv'); >> Titanic_data = (table2cell(Titanic_table)); The train set has 891 passenger entries and 12 columns. Now, let’s take a look at our data. The head function displays the top rows of a table, similar to the one in Pandas. >> head(Titanic_table) The ‘Survived’ column is our binary target … Read more Kaggle — Predict survival on the Titanic challenge in MATLAB

Synthetic Data to the rescue!

Don’t have enough data? Don’t worry! Synthetic data is here! Data shortage has become an important problem to be addressed in industries such as health, sports, manufacturing, and law due to the lack of data, privacy, and confidentiality. Photo by Luke Chesser on Unsplash Today, great advancements are being made in multiple sectors of society, … Read more Synthetic Data to the rescue!

Using D3.js to create dynamic maps and visuals that show competing climate change scenarios for…

After writing this data story, I realized that it may be helpful to write a piece outlining the power of creating web visualizations for audiences. As a data science graduate student (and data engineer by profession), I’m often disappointed by the lack of follow-through (me included) on how we communicate conclusions of our analyses and … Read more Using D3.js to create dynamic maps and visuals that show competing climate change scenarios for…

Back to the municipality clusters: celestial objects and maps

Photo by sergio souza on Unsplash It was in August of the fateful year 2020 that I wrote a text right here on Medium showing the findings of a model that clusters municipalities using variables related to GDP. The time has come to revisit that model and run new experiments. Here the … Read more Back to the municipality clusters: celestial objects and maps

The graphical user interface of Porting Assistant for .NET is now open source

The graphical user interface of Porting Assistant for .NET is now available in open source. Users can now view, modify, and contribute to its source code. The Porting Assistant for .NET data store and analytics engine, which includes information such as package compatibility and their known replacements, is already available through open source. With … Read more The graphical user interface of Porting Assistant for .NET is now open source

Learn AI Today 05: Image segmentation with U-Net models

Let’s get started and write a basic U-Net in PyTorch based on the diagram in Figure 1. To make the code simpler and cleaner, we can define several modules, starting with the ConvLayer, which is no more than a 2D convolutional layer followed by a 2D batch normalization and the ReLU activation. Notice that the … Read more Learn AI Today 05: Image segmentation with U-Net models

Search Engine Evaluation in Jina

Setup If you are coming from the previous tutorial, you will need to make some small changes to app.py, FinBertQARanker/__init__.py, and FinBertQARanker/tests/test_finbertqaranker.py. Joan Fontanals Martinez and I have added some helper functions and batching in the Ranker to help speed up the process. Instead of pointing out the changes, I have made a new … Read more Search Engine Evaluation in Jina

Eventarc: A unified eventing experience in Google Cloud

Getting events to Cloud Run There are already other ways to get events to Cloud Run, so you might wonder what’s special about Eventarc? I’ll get to this question, but let’s first explore one of those ways, Pub/Sub. As shown in this Using Pub/Sub with Cloud Run tutorial, Cloud Run services can receive messages pushed … Read more Eventarc: A unified eventing experience in Google Cloud

Creating an interactive datetime filter with Pandas and Streamlit

Implementing a visual datetime filter for timeseries data in Python Visual by author. Introduction Perhaps the most proliferated type of data that we grapple with on a daily basis is timeseries data. Basically, anything that is indexed using date, time or both can be considered as a timeseries dataset. And more often than not, you … Read more Creating an interactive datetime filter with Pandas and Streamlit

Three Concepts to Become a Better Python Programmer

Let’s say we have a list: num_list = [1, 2, 3, 4, 5] And we define a function that takes in 5 arguments and returns their sum: def num_sum(num1, num2, num3, num4, num5): return num1 + num2 + num3 + num4 + num5 And we want to find the sum of all the elements in num_list. Well, we can accomplish this by passing in … Read more Three Concepts to Become a Better Python Programmer
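The excerpt above is cut off before it shows how the list is passed in, so the following is only a minimal sketch of two options that would work here. The names total_indexed and total_unpacked are hypothetical and not from the article; num_list and num_sum are taken from the excerpt.

```python
num_list = [1, 2, 3, 4, 5]


def num_sum(num1, num2, num3, num4, num5):
    """Return the sum of five positional arguments."""
    return num1 + num2 + num3 + num4 + num5


# Option 1: pass each element by index (works, but verbose and brittle).
total_indexed = num_sum(num_list[0], num_list[1], num_list[2],
                        num_list[3], num_list[4])

# Option 2: unpack the list into positional arguments with the * operator.
total_unpacked = num_sum(*num_list)

print(total_indexed, total_unpacked)
```

Unpacking only works when the list length matches the number of parameters; a list of any other length raises a TypeError.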

10 Popular Coding Interview Questions on Recursion

Working Smart It takes a sizeable amount of time to prepare for a coding interview. There are so many different topics, data structures, and algorithms to go over. Recursion is one of the most important algorithm types because it is the basis for so many important algorithms like divide and conquer, graph algorithms, dynamic programming, … Read more 10 Popular Coding Interview Questions on Recursion

Predicting Song Skipping on Spotify

Using LightGBM to predict my song skipping habits based solely on audio features Image by Omid Armin https://unsplash.com/@omidarmin In early 2019, Spotify shared interesting statistics about their platform. Out of 35+ million songs on the service, Spotify users created over 2 billion playlists (Oskar Stål, 2019). I thought of the analogy that our music taste … Read more Predicting Song Skipping on Spotify

From high-school physics to GANs: essentials for mastering generative machine learning [1/2]

Wave motion illustration http://animatedphysics.com/insights/modelling-photon-phase/ GANs and other generative machine learning algorithms are still hyped and work fantastically with images, texts, and sounds. They’re able not only to generate data for fun but also to solve important theoretical issues and boost production ML pipelines. Unfortunately, today’s typical practical use case is limited to “fine-tune pre-trained StyleGAN2 for zombies … Read more From high-school physics to GANs: essentials for mastering generative machine learning [1/2]

Who is A Data Champion? Here is Why Your Organization Needs One

DATA SCIENCE Data Champions and Data Driven Culture Happy New Year 2021, dear readers. Another year is ahead of us, and as life slowly returns to normal after a pandemic-filled 2020, it is time to think about the data future of organizations, why this matters in this age of digital transformation and approaches … Read more Who is A Data Champion? Here is Why Your Organization Needs One

R Shiny {golem} – Initializing Your Project – Part 2 – Development to Production

Welcome to the second post of our blog series where we are working on creating a Shiny app with the {golem} package for the hit TV show, The Office. If you are just starting this, please take a look at the first post to see an overview. If you simply search “golem” on Google, you … Read more R Shiny {golem} – Initializing Your Project – Part 2 – Development to Production

How to Report the Distribution of Attributes per Cluster

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. Let’s say that you have applied your Clustering algorithm and … Read more How to Report the Distribution of Attributes per Cluster

Customer segmentation — Part II

Segmentation of online customers by RFM-country and combination with part I Photo by Hal Gatewood on Unsplash Customer segmentation is one of the most common uses of data analysis/data science. This is the second part of a two-post series in which we see an example of customer segmentation. The dataset we use is the Online … Read more Customer segmentation — Part II

Fine-tuning pre-trained transformer models for sentence entailment

A PyTorch and Hugging Face implementation of fine-tuning BERT on the MultiNLI dataset Image from PNGWING. In this article, I will be describing the process of fine-tuning pre-trained models such as BERT and ALBERT on the task of sentence entailment using the MultiNLI dataset (Bowman et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through … Read more Fine-tuning pre-trained transformer models for sentence entailment

HDBSCAN Clustering with Neo4j

HDBSCAN is a clustering algorithm that identifies islands of closely related elements in a sea of noisy outliers. I recently came across the article “How HDBSCAN works” by Leland McInnes, and I was struck by the informative, accessible way he explained a complex machine learning algorithm. Clusters identified with HDBSCAN Unlike clustering algorithms like k-means, … Read more HDBSCAN Clustering with Neo4j

Algorithms For Data Scientists — Insertion Sort

An algorithm is a set of instructions designed to modify an input to create the desired output. The Sorting problem is a typical programming task that Data Scientists and people in other programming disciplines come across. The sorting problem’s main task is to order a sequence of elements in either ascending or descending order. Photo by Andrik Langfield on … Read more Algorithms For Data Scientists — Insertion Sort
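To make the sorting problem concrete, here is a minimal insertion sort sketch in Python. It is a textbook formulation rather than the article's own code; the function name insertion_sort and the descending flag are assumptions made for illustration.

```python
def insertion_sort(items, descending=False):
    """Return a new list with the elements of `items` sorted by insertion sort."""
    result = list(items)  # copy so the caller's sequence is left untouched
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift elements of the sorted prefix one slot to the right
        # until the correct position for `current` is found.
        while j >= 0 and (
            result[j] < current if descending else result[j] > current
        ):
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result


print(insertion_sort([5, 2, 4, 6, 1, 3]))
print(insertion_sort([5, 2, 4, 6, 1, 3], descending=True))
```

Each pass grows a sorted prefix by one element, so the running time is quadratic in the worst case and close to linear on input that is already nearly sorted.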

ease_aes() Demo

Easing In R, easing is the interpolation, or tweening, between successive states of a plot (1). It is used to control the motion of data elements in animated data displays (2), with different easing functions giving different appearances or dynamics to the display’s animation. The ease_aes() Function The ease_aes() function controls the easing of aesthetics … Read more ease_aes() Demo
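The idea of easing can be sketched outside of R as well. The Python snippet below is only an illustration of tweening between two states with different easing functions; it is not how ease_aes() is implemented, and the names linear, cubic_in_out, and tween are hypothetical.

```python
def linear(t):
    """Constant-speed easing: progress equals elapsed time."""
    return t


def cubic_in_out(t):
    """Slow start, fast middle, slow end (a common cubic in-out curve)."""
    if t < 0.5:
        return 4 * t ** 3
    return 1 - ((-2 * t + 2) ** 3) / 2


def tween(start, end, steps, ease=linear):
    """Interpolate from `start` to `end` in `steps` frames (steps >= 2)."""
    frames = []
    for i in range(steps):
        t = i / (steps - 1)  # normalized time in [0, 1]
        frames.append(start + (end - start) * ease(t))
    return frames


print(tween(0, 10, 5))                # evenly spaced positions
print(tween(0, 10, 5, cubic_in_out))  # positions bunch up near both ends
```

Both calls start and end at the same values; only the spacing of the intermediate frames changes, which is what gives each easing function its distinct feel in an animation.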

Where should I eat after the pandemic? (Part 1/2)

To prepare the dataset for training and evaluation, I use the following code to load the SemEval-2014 dataset from my GitHub repository and prepare it for PyTorch: Train Now that we’ve narrowed the scope of models to sentence-pair classification using transformers in conjunction with the preprocessing method defined above, we still have several options for … Read more Where should I eat after the pandemic? (Part 1/2)

3 Simple Questions to Hone Python Skills for Beginners in 2021

– Count the number of prime numbers less than a non-negative number, n. – https://leetcode.com/problems/count-primes/ Walk Through My Thinking Surprisingly, every FAANG company has asked this question. I promise it will trip us up if we don’t know the shortcut. Honestly, my first instinct is to iterate over the range up until the number of interest … Read more 3 Simple Questions to Hone Python Skills for Beginners in 2021
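The excerpt does not say which shortcut the author has in mind. One widely used approach to this problem is the Sieve of Eratosthenes, sketched below as an assumption rather than as the article's solution; the function name count_primes is hypothetical.

```python
def count_primes(n):
    """Count the primes strictly less than n using the Sieve of Eratosthenes."""
    if n < 3:
        return 0  # there are no primes below 2
    is_prime = [True] * n
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p < n:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p * p because
            # smaller multiples were already crossed out by smaller primes.
            for multiple in range(p * p, n, p):
                is_prime[multiple] = False
        p += 1
    return sum(is_prime)


print(count_primes(10))
```

Compared with testing every number for divisibility one at a time, the sieve trades extra memory (one flag per number below n) for far fewer operations.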

Data Science vs Computer Science. Here’s the Difference.

Contents: Introduction, Data Scientist, Computer Scientist, Similarities and Differences, Summary, References. Data Science and Computer Science often go hand-in-hand, but what really makes them different? What do they have in common? After experiencing several different roles in Data Science at various companies, I have realized general themes of the Data Science process, along with how Computer … Read more Data Science vs Computer Science. Here’s the Difference.

Want To Get Good At Time Series Forecasting? Predict The Weather.

Understanding the components of a time series Source: Photo by geralt from Pixabay For someone who originally comes from an economics background, it might seem quite strange that I would spend some time building models that can predict weather patterns. I often questioned it myself — but there is a reason for it. Temperature patterns … Read more Want To Get Good At Time Series Forecasting? Predict The Weather.

Kedro hands-on: Build your own demographics atlas. Pt. 2: building footprints classification

1 Completeness and Correctness — 3.1 Completeness — 3.2 Correctness & Tobler’s law. 2 Building footprints classification pipeline — 2.1 Generate building features — 2.2 Building blocks segmentation with HDBSCAN — 2.3 Building types classification — XGBoost. 3 Wrapup. Since OSM (OpenStreetMap) is a VGI (volunteered geographic information) platform, buildings are traced and tagged voluntarily by the … Read more Kedro hands-on: Build your own demographics atlas. Pt. 2: building footprints classification

Bayesian modelling for tennis player ranking

How can Bayesian modelling help us to model tennis game outcomes and create player rankings? Image from Bessi on Pixabay My last two articles have been about the theory behind Markov Chain Monte Carlo and Variational Inference which lays the foundation for Bayesian modelling. In this article, we’re going to use Bayesian modelling and MCMC … Read more Bayesian modelling for tennis player ranking

Deep Learning for 3D Synthesis

Introduction to 3D Data It’s a consensus that synthesizing 3D data from a single perspective is a fundamental human vision functionality which is extremely challenging for computer vision algorithms. But recent advancements in 3D acquisition technology have taken a great leap after the increased availability and affordability of 3D sensors like LiDARs, RGB-D cameras (RealSense, … Read more Deep Learning for 3D Synthesis

Machine Learning: a Systems Engineering Perspective

This article takes a holistic approach to machine learning using elementary Systems Engineering principles, enabling you to understand and manage the fundamental building blocks in most machine learning systems. You will learn how to get from data to predictions. Figure 1: Machine Learning System Design Systems Engineering principles Machine Learning Exploratory Data Analysis Training … Read more Machine Learning: a Systems Engineering Perspective

Dials, Tune, and Parsnip: Tidymodels’ Way to Create and Tune Model Parameters

Earlier, rf_spec used the default parameter values. To tune the parameters, I need to add the arguments. I provide two ways to do so below.
# Add parameters
rf_spec <- rf_spec %>% update(mtry = tune(), trees = tune())
# Option 2: Start again
rf_spec_new <- rand_forest(mode = "regression", mtry = tune(), trees = tune()) %>% set_engine("randomForest")
The tune() is a placeholder for values … Read more Dials, Tune, and Parsnip: Tidymodels’ Way to Create and Tune Model Parameters

Loading complex CSV files into BigQuery using Google Sheets

The cool thing is that by using a Google Sheet, you can do interactive data preparation in the Sheet before loading it into BigQuery. First, delete the first row (the header) from the sheet. We don’t want that in our data. ELT from a Google Sheet Once it is in Google Sheets, we can use … Read more Loading complex CSV files into BigQuery using Google Sheets

Data Cleaning in R Made Simple

The title says it all Photo by JESHOOTS.COM on Unsplash Data cleaning. The process of identifying, correcting, or removing inaccurate raw data for downstream purposes. Or, more colloquially, an unglamorous yet wholly necessary first step towards an analysis-ready dataset. Data cleaning may not be the sexiest task in a data scientist’s day but never underestimate … Read more Data Cleaning in R Made Simple

Announcing New Segmentation Capabilities for Amazon Pinpoint

Amazon Pinpoint now provides customers additional filters to perform more granular segmentation. Amazon Pinpoint customers can now increase the level of campaign and message personalization by being able to reach more specific audiences. Granular segmentation helps marketers increase user engagement by allowing them to tailor the right messaging and campaigns to specific sub-groups based on … Read more Announcing New Segmentation Capabilities for Amazon Pinpoint

Non-hierarchical edge bundling, flow maps and metro maps in R

This post introduces the R package edgebundle, an R package that implements several edge bundling/flow and metro map algorithms. Note that edgebundle imports reticulate and uses a pretty big python library (datashader). To install all dependencies, use install_bundle_py(). Edge bundling The expected input of each edge bundling function is a graph (igraph/network or tbl_graph object) … Read more Non-hierarchical edge bundling, flow maps and metro maps in R

Explore art media over time in the #TidyTuesday Tate collection dataset

[This article was first published on rstats | Julia Silge, and kindly contributed to R-bloggers]. This is the latest in my series of screencasts demonstrating … Read more Explore art media over time in the #TidyTuesday Tate collection dataset

7 Examples to Master SQL Joins

A comprehensive practical guide Photo by Mineragua Sparkling Water on Unsplash SQL is a programming language used by most relational database management systems (RDBMS) to manage data stored in tabular form (i.e. tables). A relational database consists of multiple tables that relate to each other. The relation between tables is formed with shared columns. When … Read more 7 Examples to Master SQL Joins
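As a small illustration of tables related through a shared column, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and rows are made up for this example and do not come from the article.

```python
import sqlite3

# In-memory database with two hypothetical tables that share cust_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customer AS c
    JOIN orders AS o ON c.cust_id = o.cust_id
    ORDER BY o.order_id
""").fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as NULL (None).
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customer AS c
    LEFT JOIN orders AS o ON c.cust_id = o.cust_id
    ORDER BY c.cust_id, o.order_id
""").fetchall()

conn.close()
print(inner_rows)
print(left_rows)
```

The customer without orders disappears from the inner join but appears in the left join with an empty amount, which is the practical difference between the two join types.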

3 Essential Ways to Calculate Feature Importance in Python

Probably the easiest way to examine feature importances is by examining the model’s coefficients. For example, both linear and logistic regression boil down to an equation in which coefficients (importances) are assigned to each input value. Put simply, if an assigned coefficient is a large (negative or positive) number, it has some influence on the … Read more 3 Essential Ways to Calculate Feature Importance in Python

Distinct Values in DAX

The differences and similarities between values, distinct, and all Photo by Noah Näf on Unsplash The functional language DAX used in Power Pivot, SQL Server Analysis Services, and Power BI is powerful. Like all powerful languages, understanding the nuanced differences in their usage is important. On the surface, the two functions VALUES and DISTINCT appear … Read more Distinct Values in DAX

Analysing the Flow of NFL Games with One Simple Statistic

How play success can help you make sense of plain box scores Photo by Tim Mielke on Unsplash In their weekend matchup against the Colts, the Bills QB Josh Allen threw 26 passes off 35 attempts for 324 yards and 2 touchdowns. The Bills run game also had 21 rushes for 96 yards. In response, … Read more Analysing the Flow of NFL Games with One Simple Statistic

7 Proven Ways to Develop a Coding Habit

This is a well-tested and often touted method of habit-forming. The Habit Loop was developed by journalist and productivity author Charles Duhigg who worked in tandem with neurologists, psychologists, and researchers to create a method for developing habits that people will maintain. He describes the common phenomenon of people entering the new year all gung-ho … Read more 7 Proven Ways to Develop a Coding Habit

Top 13 Resources to Learn Python Programming

Thanks to the push in the e-learning domain, the internet is now filled with loads of convenient resources for learning Python, such as videos, online courses, eBooks, websites, and so much more. Much like any other website, the Python websites that we have covered in this article are free to access and cover everything from … Read more Top 13 Resources to Learn Python Programming

glmnet v4.1: regularized Cox models for (start, stop] and stratified data

My latest work on the glmnet package has just been pushed to CRAN! In this release (v4.1), we extend the scope of regularized Cox models to include (start, stop] data and strata variables. In addition, we provide the survfit method for plotting survival curves based on the model (as the survival package does). Why is this … Read more glmnet v4.1: regularized Cox models for (start, stop] and stratified data

How to transition from Academia to Data Science

A practical guide Image Source: Unsplash Oh boy, another one of these blog posts about transitioning to Data Science from Academia. Well, in this post I’ll try to add a slightly different take on the usual advice along with some more traditional advice. This is not a step-by-step guide because I don’t think such a … Read more How to transition from Academia to Data Science

Azure and HITRUST publish shared responsibility matrix

Healthcare solutions offered in the cloud are drawing unprecedented attention today with the ongoing global pandemic and the accompanying need for social distancing. Microsoft has been on the forefront of empowering health organizations to leverage the power of the cloud. Protecting health information and complying with health regulations are critical components of any healthcare solution … Read more Azure and HITRUST publish shared responsibility matrix