Ten Tricks To Speed Up Your Python Codes

1. Get familiar with built-in functions. Python comes with many built-in functions implemented in C, which are very fast and well maintained (Figure 1, Built-in Functions in Python 3). We should at least be familiar with these function names and know where to find them (some commonly used computation-related functions are abs(), len(), max(), …
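As a rough illustration of the point about built-ins, the sketch below times a hand-written loop against the built-in max(). The helper name manual_max and the test data are assumptions made for this example, not code from the article, and the exact timings will vary by machine and interpreter.

```python
import timeit


def manual_max(values):
    """Find the largest item with an explicit Python-level loop (hypothetical helper)."""
    largest = values[0]
    for v in values[1:]:
        if v > largest:
            largest = v
    return largest


# Illustrative data only; any non-empty list of comparable items works.
data = list(range(100_000))

# Both approaches must agree before their timings are worth comparing.
assert manual_max(data) == max(data)

loop_time = timeit.timeit(lambda: manual_max(data), number=20)
builtin_time = timeit.timeit(lambda: max(data), number=20)
print(f"manual loop: {loop_time:.4f} s, built-in max(): {builtin_time:.4f} s")
```

On CPython the built-in is usually the faster of the two, because its loop runs in C rather than in interpreted bytecode.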

A walk through imbalanced classes in machine learning through a visual cheat sheet

What imbalanced training data is and how to address it through precision, recall and F1 score. There are many detailed articles explaining the problem of imbalanced training samples and how to cope with it. In this article, I summarize an understanding of the problem in a visual cheat sheet. I often find it useful …

To Publish, or Not to Publish

Very broadly, let’s divide publication possibilities into four options: top-tier journals [Nature, Science, The Lancet, …]; clinical journals or clinical conferences [Journal Rankings]; technically oriented medical conferences [MICCAI, MIDL, MLHC, …]; and top technical conferences [CVPR, ECCV, ICLR, NeurIPS, …]. All of these options benefit the company, the team, and also the culture, but in a …

Architecture as a Graph

The Bayesian approach used in this article demonstrates the relevance of stochasticity for the design process. On the one hand, statistical inference allows us to model and replicate complicated phenomena, here the complexity found among floorplans. On the other hand, it allows us to generate a wide variety of options that will inspire the creative process. …

To Serve Man

Deploying models on Kubernetes using Seldon Core. One of the very best papers ever written about machine learning ironically had very little to do with, well, actual machine learning! In the paper, Hidden Technical Debt in Machine Learning Systems, a group of machine learning researchers from Google astutely pointed out that “only a …

Using Kalman Filter to Predict Corona Virus Spread

To predict the spread of COVID-19, I’ve implemented a Kalman filter algorithm alongside other linear models. The optimization problem was solved with Python, and the script is available in a Google Colab notebook. The process of this project is described below; the full code can be found on GitHub. Pre-processing data: Read the data from Github-contain …

How to approach technical questions in an analytics / data science interview

We cover how to approach technical questions with 2 real examples asked during an interview for an analytics / data science role. I love to engage with my readers and learn about their concerns when it comes to the technical interview. They are often full of anxiety and don’t know where to start when trying …

SVD in Machine Learning: PCA

Intuitively, PCA is a transformation procedure that converts a data matrix with possibly correlated features into a set of linearly uncorrelated variables called principal components. Each principal component is a linear combination of the original features (PCᵢ = Xϕᵢ, where PCᵢ stands for the i-th principal component) and accounts for the largest possible variance while …

Real or Spurious Correlations: Attractive People You Date Are Nastier

But how do we know if the negative correlation between attractiveness and personality is real? Data scientists deal with correlations regularly, and a good way to gain more intuition about the data and learn analysis methods is via simulation. So let’s simulate some data to test our intuition. First, let’s import the common data science modules: …

The Engine of the Neural Network: the Backpropagation Equation

Let us consider the simplest neural network, with a single input and a single output. Let’s use the following notation: In this case, the cost function c(…) is simply … or the difference between the predicted final value and the target variable. It is important to remember that … where A represents any activation …

A Data Analysis of the Democratic Debates

Who’s talking and what are they saying? As the Democratic primary season heats up and the debate count mounts, it can be difficult to follow what the candidates have been saying. Thanks to online debate transcripts, however, it’s fairly easy to retrieve every word that has been spoken. Rather …

Stop duplicating deep learning training datasets with Amazon EBS multi-attach

Use a single training dataset EBS volume for distributed training on up to 16 EC2 instances and 128 GPUs! Some of you doing deep learning are lucky enough to have infrastructure teams who’ll help set up GPU clusters for you, install and manage job schedulers, and host and manage file systems for training datasets. The …

String Similarity Matching for Big Data using Distributed Cloud Computations

Before becoming a man and facing t̶h̶e̶ ̶r̶e̶a̶l̶ ̶w̶o̶r̶l̶d̶ production, a boy needs to develop his skills in the playground. A fancy one is Google Colaboratory, a Jupyter notebook hosted on the cloud, offering free, powerful computing resources (including GPUs/TPUs), and that’s where the first tests were run. A computer scientist would classify our problem …

Creating and connecting a PostgreSQL database with Amazon’s Relational Database Service (RDS)

Getting started using a database in the cloud doesn’t have to be hard. In fact, it’s never been easier thanks to Amazon Web Services (AWS). Relational Database Service (RDS) is part of Amazon’s comprehensive suite of web products, and as of today, RDS supports 6 database engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database and …

The significance of the sector on the salary in Sweden, a comparison between different occupational groups, part 2

In my last post, I examined the significance of the sector on the salary for different occupational groups using statistics from different regions. In previous posts I have shown a correlation between salary and experience and also between salary and education. In this post, I will examine the correlation between salary and sector using statistics …

Synthetic micro-datasets: a promising middle ground between data privacy and data analysis

Intro: the need for microdata, and the risk of disclosure. Survey and administrative data are essential for scientific research; however, accessing such datasets can be very tricky, or even impossible. In my previous job I was responsible for getting access to such “scientific micro-datasets” from institutions like Eurostat. In general, getting access to these micro datasets was only …

What does an AI system think about Austria?

Let us look at some examples of initial texts and continuations: “The most popular person in Austria is …” The most popular person in Austria is probably the head of the Social Democratic Party, Gabriel. His party won the most votes in the parliamentary elections in October 2016. The most popular person in Austria …

How to Enter Your First Kaggle Competition

Detecting disaster tweets. One of the latest competitions on the website provides a data set containing tweets together with a label that tells us whether they are really about a disaster or not. This competition has a leaderboard with nearly 3,000 entries and a top cash prize of $10,000. The data and competition outline can …

Web Scraping News Articles to Build an NLP Data Pipeline

Now that we have cleaned and normalized our text and split it into sentences, it is time to construct a data pipeline with TensorFlow 2.0. In many cases, feeding the text content directly into the NLP model is not an efficient way of managing the data input process. The TensorFlow tf.data API provides better performance …

Mobile Data Collection: What it is and what it can do

Data collection is nothing new, but the introduction of mobile devices has made it more interesting and efficient. Before the advent of mobile technology, we needed to use pen and paper to record information on the spot, or manually enter it into a computer to organize the information. But now, mobile data collection means information …

A practical guide to using Human Centered Design to deliver Advanced Analytics Projects

Using human centered design can reduce the risk of failure associated with advanced analytics projects and help drive innovation for employees and customers. Firms are at risk of wasting billions on failed analytics and data science projects over the next 3 years. Over the past decade, the failure rates on analytics projects, particularly advanced …

Transforming Real Photos Into Master Artworks with GANs

Art, the ability to create something original, to use one’s unbounded creativity and imagination, is something that we humans like to believe is unique to us. After all, no other animal or computer thus far has come close to matching the artistic skill of humans when it comes to realistic paintings. I mean …

RetinaNet : Custom Object Detection training with 5 lines of code

Making computer vision easy with Monk, a low-code deep learning tool and a unified wrapper for computer vision. Indoor object detection: In a previous article, we built a custom object detector using Monk’s EfficientDet. In this article, we will build an indoor object detector using Monk’s RetinaNet, built on top of PyTorch retinanet. These …

Scrapy: This is how to successfully login with ease

Demystifying the process of logging in with Scrapy. Once you understand the basics of Scrapy, one of the first complications is having to deal with logins. To do this, it’s useful to understand how logging in works and how you can observe that process in your browser. We …

R is turning 20 years old next Saturday. Here is how much bigger, stronger and faster it got over the years

[This article was first published on Jozef’s Rblog, and kindly contributed to R-bloggers]. It is almost the 29th of February 2020! A day that is …

relgam: Fitting reluctant generalized additive models

Introduction and motivation. tl;dr: Reluctant generalized additive modeling (RGAM) produces highly interpretable sparse models which allow non-linear relationships between the response and each individual feature. However, non-linear relationships are only included if deemed important in improving prediction performance. RGAM works with quantitative, binary, count and survival responses and is computationally efficient. Consider the supervised learning …

Getting Started With Jupyter Notebooks in Visual Studio Code

Insert and Delete Cells: To insert a cell, click the plus sign in the toolbar or the one to the left side of the cell. To delete a cell, click the delete sign (i.e., the trash symbol) on the cell’s right side. Switch Cell Content Type and State: To switch the …

A Beginner’s Guide to Simulating Dynamical Systems with Python

Numerically integrate ODEs in Python. Consider the simple pendulum. We’ve just got a mass m hanging from a string of length L, swinging back and forth. It’s about as simple a system as we can work with. Don’t let this simplicity fool you though, it …
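For readers who want to see what numerically integrating the pendulum can look like, here is a minimal sketch. It assumes the undamped equation of motion θ'' = -(g/L) sin θ (the mass m cancels out) and uses a hand-written fourth-order Runge-Kutta step with only the standard library; this is a technique swapped in for illustration and not necessarily the integrator the article itself uses. All function names, the step size and the initial angle are assumptions.

```python
import math


def pendulum_derivs(theta, omega, g=9.81, length=1.0):
    """Return (dtheta/dt, domega/dt) for an undamped simple pendulum."""
    return omega, -(g / length) * math.sin(theta)


def rk4_step(theta, omega, dt, g=9.81, length=1.0):
    """Advance the state (theta, omega) by one classical RK4 step of size dt."""
    k1t, k1w = pendulum_derivs(theta, omega, g, length)
    k2t, k2w = pendulum_derivs(theta + 0.5 * dt * k1t, omega + 0.5 * dt * k1w, g, length)
    k3t, k3w = pendulum_derivs(theta + 0.5 * dt * k2t, omega + 0.5 * dt * k2w, g, length)
    k4t, k4w = pendulum_derivs(theta + dt * k3t, omega + dt * k3w, g, length)
    theta_new = theta + dt / 6.0 * (k1t + 2 * k2t + 2 * k3t + k4t)
    omega_new = omega + dt / 6.0 * (k1w + 2 * k2w + 2 * k3w + k4w)
    return theta_new, omega_new


def simulate(theta0, omega0=0.0, dt=0.001, t_end=10.0, g=9.81, length=1.0):
    """Integrate from t = 0 to t_end and return a list of (t, theta, omega) tuples."""
    steps = int(round(t_end / dt))
    theta, omega = theta0, omega0
    trajectory = [(0.0, theta, omega)]
    for i in range(1, steps + 1):
        theta, omega = rk4_step(theta, omega, dt, g, length)
        trajectory.append((i * dt, theta, omega))
    return trajectory


def energy_per_mass(theta, omega, g=9.81, length=1.0):
    """Kinetic plus potential energy divided by the mass m."""
    return 0.5 * (length * omega) ** 2 + g * length * (1.0 - math.cos(theta))


# Release the pendulum from rest at 0.2 rad (an arbitrary illustrative value).
trajectory = simulate(theta0=0.2)
t_final, theta_final, omega_final = trajectory[-1]
print(f"t = {t_final:.1f} s, theta = {theta_final:.4f} rad, omega = {omega_final:.4f} rad/s")
```

Because the system has no damping, the energy per unit mass should stay essentially constant along the trajectory, which makes energy_per_mass a convenient sanity check on the integrator.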

A Succinct TensorFlow 2.0 Solution for Kaggle House Prices Prediction Challenge

As when I modeled the house price dataset using scikit-learn in the previous post, exploratory data analysis (EDA) and data transformation are done first in the EDA file. Then, to model with tf.keras and tf.estimator, the transformed training and testing datasets are loaded using pickle. Process flow of building model with …

The Demographic Crisis Due to China’s One-Child Policy

Data Doesn’t Lie. China is massive. The kind of massive that makes it difficult to wrap your head around. Beginning in 1950, China’s population ballooned from an already respectable 540 million to 940 million by 1976 and to a staggering 1.4 billion today. To put that into context, this population upsurge has …

Using ColumnTransformer to combine data processing steps

Create cohesive pipelines for processing data where different columns require different techniques. This scikit-learn tool comes in extremely handy, but also has some quirks of its own. Today we’ll be using it to transform data on ferry wait times for the Edmonds-Kingston route of the Washington State Ferries. (Thank you, WSF, for the data!) Full …

Find and play with ‘molecule’ datasets

Machine learning has become popular in all fields. It has helped make industrial processes more efficient in all industries, be it logistics or defense. Although pharmaceutical companies have been slow to adopt machine learning, they are surely catching up. In this article, I talk about some of the standard datasets which can …

Object-Oriented Reinforcement Learning

Reinforcement learning provides a set of tools to train a software or physical agent to take optimal actions within an environment (a real or simulated world) by trial and error (i.e. executing an action and then experiencing its effect), guided only by positive or negative rewards; these are scalar feedback signals that the agent has …

Utilizing your Data Science Project

Putting a LendingClub machine learning model into production. (Figure: the graphical user interface for my LendingClub dashboard application.) One of the strongest trends in the data science industry in the past few years is an increased emphasis on deploying machine learning models in a production environment. Employers are expecting more than just feature engineering and modeling. …

Designers need Augmented Intelligence not Black Box AI

The building construction industry faces an existential crisis. It is one of the least digitized industries and the world’s leading producer of CO2 emissions. With rising temperatures around the globe and mass migration to urban centers, there is a dire need for a digital disruption that will enable more sustainable ways of working. We as …

Understanding MC experiment by a gaming example.

Oftentimes we encounter uncertainty in projects where we need to estimate something given varying chances of success. A very good approach to solving such scenarios is to use repeated simulation of outcomes drawn from the underlying uncertain probabilities. Such techniques are called Monte Carlo methods (or MC for …
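The idea of repeated simulation can be sketched in a few lines. The scenario below is hypothetical and not taken from the article: it estimates the chance of getting at least 3 successes in 10 independent attempts that each succeed with probability 0.25, then compares the Monte Carlo estimate with the exact binomial answer. All names and numbers are assumptions chosen for illustration.

```python
import math
import random


def simulate_once(rng, n_attempts, p_success, needed):
    """Play one round: True if at least `needed` of the attempts succeed."""
    successes = sum(rng.random() < p_success for _ in range(n_attempts))
    return successes >= needed


def monte_carlo_estimate(n_trials, n_attempts, p_success, needed, seed=0):
    """Estimate the success probability as the fraction of simulated rounds that succeed."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    hits = sum(simulate_once(rng, n_attempts, p_success, needed) for _ in range(n_trials))
    return hits / n_trials


def exact_probability(n_attempts, p_success, needed):
    """Exact binomial tail probability, used only to check the simulation."""
    return sum(
        math.comb(n_attempts, k) * p_success ** k * (1 - p_success) ** (n_attempts - k)
        for k in range(needed, n_attempts + 1)
    )


estimate = monte_carlo_estimate(n_trials=20_000, n_attempts=10, p_success=0.25, needed=3)
exact = exact_probability(n_attempts=10, p_success=0.25, needed=3)
print(f"Monte Carlo estimate: {estimate:.4f}, exact value: {exact:.4f}")
```

The estimate tightens as n_trials grows, with an error that shrinks roughly in proportion to one over the square root of the number of trials. In practice the simulation is most useful when no closed-form answer like exact_probability is available.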

Data Crafting: How Play & Craft Changes Data Comprehension

I recently led a workshop on Data Crafting. The idea (of getting our hands dirty by crafting with data) spontaneously arose in a conversation between myself and Natalie Vladis, a Quantitative Fellow at Harvard Medical School. Here’s what happened: Natalie loves crafting and lives in data. I love data visualization, occasionally hodgepodge things, and believe …

The next package release into AWS Athena

RAthena 1.7.1 and noctua 1.5.1 package versions have now been released to CRAN. They both bring along several improvements to the connection to AWS Athena, notably in performance and several creature comforts. These packages have both been designed to reflect one another, even down to how they connect to AWS Athena. This means …

Correlogram in R: how to highlight the most correlated variables in a dataset

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers]. Correlation, often computed as part …