What and who is IT community? What does it take to be part?

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t. This blog post is long overdue and has been rattling … Read more What and who is IT community? What does it take to be part?

Do not underestimate the need for DevOps in AI.

As machine learning matures, the need to build infrastructure that supports running these workflows grows. In a large enterprise setting, there are on average more than 200 data scientists and DL/ML engineers who run model training and inference jobs. Ensuring that these users get easy hardware/software access to train … Read more Do not underestimate the need for DevOps in AI.

Dropping Missing Values? You Probably Shouldn’t.

Returning to the Discussion About Missing Values. You shouldn’t be surprised, but data analysts are always on the offensive when dealing with missing values in a dataset. In fact, more often than not, missing values represent major pieces of information, albeit information that might or might not differ from what the rest of … Read more Dropping Missing Values? You Probably Shouldn’t.

Deploying a Simple Machine Learning Model into a WebApp using TensorFlow.js

Let’s have a simple HTML page that uses the HTML5 Canvas component that lets us draw on it. Let’s call this file “tfjs.html”. The core drawing code comes from this website: Using the HTML5 Canvas component, we can hook mouse events to draw into the Canvas. canvas.addEventListener('mousedown', function(e) { context.moveTo(mouse.x, mouse.y); context.beginPath(); canvas.addEventListener('mousemove', onPaint, false); }, false); var onPaint = … Read more Deploying a Simple Machine Learning Model into a WebApp using TensorFlow.js

How the 80/20 Rule can help decide which skills you need to start a career in Data Science

Vilfredo Pareto was a 19th-century Italian engineer, sociologist, economist, political scientist, and philosopher who first described what is now known as the 80/20 Rule, or the Pareto Principle. The idea behind the Pareto principle is that some observable phenomena follow an uneven distribution, with 80% of the results (or effect) coming from … Read more How the 80/20 Rule can help decide which skills you need to start a career in Data Science

Paper Tuesday: Image reconstruction without data

Every Tuesday I highlight an interesting paper that I came across in research or work. I hope that my review can help you get the juiciest part of the paper in under 2 minutes! Image reconstruction is a challenging learning task because no one knows what the original image looks like. Therefore, it seems that the … Read more Paper Tuesday: Image reconstruction without data

The Titanic: Did Anyone Get Lucky?

Introduction Kaggle’s famous (or infamous?) introductory data-science project is a heavy one — predicting the survival of each passenger on the Titanic based on a few personal characteristics and some details about their accommodations. In this article, I’m going to take an unusual angle on this project — looking at who aboard the Titanic got … Read more The Titanic: Did Anyone Get Lucky?

How is information gain calculated?

This post will explore the mathematics behind information gain. We’ll start with the base intuition behind information gain, but then explain why it has the calculation that it does. What is information gain? Information gain is a measure frequently used in decision trees to determine which variable to split the input dataset on at each … Read more How is information gain calculated?
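The calculation the excerpt refers to can be sketched in a few lines of Python. This is a minimal illustration that assumes the common entropy-based definition (parent entropy minus the size-weighted entropy of the child subsets); the function names and the toy labels are made up for the example and are not taken from the post.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy data (illustrative): one split separates the classes, the other does not.
parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A split that leaves each child as mixed as the parent gains nothing, while a split that produces pure children recovers the full parent entropy.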

Lasso Regression (home made)

To compute Lasso regression, $\frac{1}{2}\|\mathbf{y}-\mathbf{X}\mathbf{\beta}\|_{\ell_2}^2+\lambda\|\mathbf{\beta}\|_{\ell_1}$, define the soft-thresholding function $S(z,\gamma)=\text{sign}(z)\cdot(|z|-\gamma)_+=\begin{cases}z-\gamma&\text{ if }\gamma<|z|\text{ and }z>0\\z+\gamma&\text{ if }\gamma<|z|\text{ and }z<0\\0&\text{ if }\gamma\geq|z|\end{cases}$ To solve our optimization problem, set $\mathbf{r}_j=\mathbf{y}-\left(\beta_0\mathbf{1}+\sum_{k\neq j}\beta_k\mathbf{x}_k\right)=\mathbf{y}-\widehat{\mathbf{y}}^{(j)}$ so that the optimization problem can be written, equivalently, $\min\left\lbrace\frac{1}{2n}\sum_{j=1}^p [\mathbf{r}_j-\beta_j\mathbf{x}_j]^2+\lambda |\beta_j|\right\rbrace$ hence $\min\left\lbrace\frac{1}{2n}\sum_{j=1}^p \beta_j^2\|\mathbf{x}_j\|^2-2\beta_j\mathbf{r}_j^T\mathbf{x}_j+\lambda |\beta_j|\right\rbrace$ and one gets $\beta_{j,\lambda} = \frac{1}{\|\mathbf{x}_j\|^2}S(\mathbf{r}_j^T\mathbf{x}_j,n\lambda)$ or, if we develop \beta_{j,\lambda} = \frac{1}{\sum_i … Read more Lasso Regression (home made)
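The soft-thresholding function and the coordinate-wise update for a single coefficient can be sketched as follows. This is an illustrative Python version, not the R function from the post; the names soft_threshold and update_coefficient, and the plain-list inputs, are assumptions made for the example.

```python
def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0  # covers gamma >= |z|


def update_coefficient(x_j, r_j, lam):
    """One coordinate update: beta_j = S(r_j^T x_j, n * lam) / ||x_j||^2.

    x_j is the j-th column of X and r_j the partial residual, both as lists.
    """
    n = len(x_j)
    dot = sum(r * x for r, x in zip(r_j, x_j))
    norm_sq = sum(x * x for x in x_j)
    return soft_threshold(dot, n * lam) / norm_sq


print(soft_threshold(3.0, 1.0))   # 2.0
print(soft_threshold(-3.0, 1.0))  # -2.0
print(soft_threshold(0.5, 1.0))   # 0.0
```

A full coordinate-descent solver would cycle this update over all j, recomputing the partial residual each time, until the coefficients stop changing.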

Clustered randomized trials and the design effect

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. I am always saying that simulation can help illuminate interesting statistical … Read more Clustered randomized trials and the design effect

Hyperparameter tuning and #TidyTuesday food consumption

[This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers]. Last week I published a screencast demonstrating how to use … Read more Hyperparameter tuning and #TidyTuesday food consumption

1 Week until the Berlin Rent Freeze: How many illegal overpriced offers can I find online?

Having moved to Berlin, finding a reasonably priced flat was definitely one of the hardest parts. To get an idea of how competitive it can be, in November, 1,749 flat-hunters queued outside to visit a vacant apartment in the popular Schöneberg district. The flat was advertised online just 12 … Read more 1 Week until the Berlin Rent Freeze: How many illegal overpriced offers can I find online?

DevOps for Data Science with GCP

Deploying Production-Grade Containers for Model Serving. One of the functions of data science teams is building machine learning (ML) models that provide predictive signals for products and personalization. While DevOps has not always been considered a core responsibility of data science teams, it is becoming increasingly important as these teams start to take … Read more DevOps for Data Science with GCP

Combining DataFrames using Pandas

Another widely used function for combining DataFrames is merge(). The concat() function simply stacks DataFrames on top of each other or places them side by side; it works more like appending. merge() combines DataFrames based on values in shared columns, which makes it more flexible than concat(). It will be clearer when you see the … Read more Combining DataFrames using Pandas

AWS Firewall Manager now supports AWS CloudFormation

AWS Firewall Manager now supports AWS CloudFormation, allowing customers to manage all Firewall Manager policy types and resources with CloudFormation stack templates. AWS Firewall Manager is a security management service that allows you to centrally configure and manage firewall rules across your accounts and applications in AWS Organizations. With Firewall Manager, you can manage AWS … Read more AWS Firewall Manager now supports AWS CloudFormation

The Open Cities AI Challenge

There is now a growing abundance of locally-validated open map data and high resolution drone imagery in diverse built environments. How might we best address these obstacles and enhance the state of practice in machine learning to support mapping for urban development and risk reduction for Africa’s cities? Dataset Working with partners Azavea and DrivenData, … Read more The Open Cities AI Challenge

Stop making data scientists manage Kubernetes clusters

Building models is hard enough. Disclaimer: The following is based on my observations of machine learning teams—not an academic survey of the industry. For context, I’m a contributor to Cortex, an open source platform for deploying models in production. Production machine learning has an organizational problem, one that is a byproduct of its … Read more Stop making data scientists manage Kubernetes clusters

Create effective data visualizations of proportions

Best ways to see individual contributions to a whole and changes over time, at various dataset sizes — (includes simple, visual demonstrations, code & data) Various visualisations of proportions Plotting proportions of a whole might be one of the most common tasks in data visualisation. Examples include regional differences in happiness, economic indicators or crime, … Read more Create effective data visualizations of proportions

Reinforcement Learning, Brain, and Psychology: Introduction

Reinforcement Learning, Artificial Intelligence, and Humans. An introduction to a series on the connection between reinforcement learning and humans. “Inspiration can be found even in weather forecasts.” The human brain is probably one of the most complex systems in the world, and thus it’s a bottomless source of inspiration for any AI researcher. For decades reinforcement learning has been … Read more Reinforcement Learning, Brain, and Psychology: Introduction

Part 6: How not to validate your model with optimism corrected bootstrapping

When the same data is used to train and test a machine learning model, the evaluation is overly optimistic: the model appears to perform much better than it would if it were applied to completely new data. This is because the model uses random noise within the data to learn … Read more Part 6: How not to validate your model with optimism corrected bootstrapping

String Functions in SQL

There’s far more to analysing strings in SQL than LIKE ‘%…%’ Untangling strings using SQL can seem difficult but is made easier by knowing the right functions Although SQL sometimes has a ‘perpetual bridesmaid’ reputation next to more richly featured analytics environments, such as R or Python with their richly featured libraries, in the last … Read more String Functions in SQL

How to build your Ultimate Data Science Portfolios

My advice to Fellow Data Science Colleagues and Juniors. A Great Data Scientist Builds Products that Matter. “I am going to build a fitness tracker to analyse my fitness/diet metrics” “I have these Tableau dashboards that I worked on. No plan, just for fun” A few days ago, I had a … Read more How to build your Ultimate Data Science Portfolios

Pytolemaic — A Toolbox for Model Quality Analysis

A short intro to the Pytolemaic package. This blog post provides a short introduction to the Pytolemaic package and its capabilities. The post covers the following components: model analysis techniques (feature sensitivity, scoring and confidence intervals, covariate shift measurement) and analysis of the model’s predictions (prediction uncertainty, LIME explanations). Building a Machine Learning (ML) model is quite … Read more Pytolemaic — A Toolbox for Model Quality Analysis

Creating MS Word reports using the officer package

Commonly, the final product that a data scientist or a statistician generates is a report, usually in MS Word format. The officer package enables generating such a report from within R. It also enables generating PowerPoint presentations, but this is beyond the scope of this post. While the package has many great features, using the … Read more Creating MS Word reports using the officer package

Exploring Moving Averages to Build Trend Following Strategies in Python

How moving averages can be used to improve portfolio performance over the benchmark. “Buy low, sell high” is a common goal everyone in finance wants to achieve. This, however, is more difficult than it appears, since it is almost impossible to predict what direction the market is going. Many investors … Read more Exploring Moving Averages to Build Trend Following Strategies in Python

An Overview Of Importing Data In Python

Python built-in file methods (read(), readline(), and readlines()). In general, a text file (.txt) is the most common file we will deal with. Text files are structured as a sequence of lines, where each line includes a sequence of characters. Let’s assume we need to import in Python the following text file (sample.txt), shown here with its lines separated by commas: Country/Region, Mainland China, Japan, Singapore, Hong Kong, Japan, Thailand, South … Read more An Overview Of Importing Data In Python
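The difference between the three methods named in the excerpt can be seen in a short, self-contained sketch. The file written here is only a small stand-in for sample.txt; its contents and the temporary path are assumptions made for the example.

```python
import os
import tempfile

# Write a small stand-in for sample.txt (contents are illustrative).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Country/Region\nMainland China\nJapan\n")

with open(path, encoding="utf-8") as f:
    whole_text = f.read()        # entire file as a single string

with open(path, encoding="utf-8") as f:
    first_line = f.readline()    # one line, including its trailing newline

with open(path, encoding="utf-8") as f:
    all_lines = f.readlines()    # list of lines, each ending with "\n"

print(repr(first_line))  # 'Country/Region\n'
print(len(all_lines))    # 3
```

For large files, iterating over the file object line by line avoids loading everything into memory, which read() and readlines() both do.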

An Impossible AI Challenge?

François Chollet is looking for a Unicorn. Abstraction and Reasoning Challenge is the title of a just-released Kaggle competition hosted by François Chollet. Sub-titled “Create an AI capable of solving reasoning tasks it has never seen before,” this one is for those truly dedicated to advancing AI. Here’s the problem … Read more An Impossible AI Challenge?

Python ETL Tools: Best 8 Options

ETL is the process of fetching data from one or many systems and loading it into a target data warehouse after doing some intermediate transformations. The market has various ETL tools that can carry out this process. Some tools offer a complete end-to-end ETL implementation out of the box and some tools help you to … Read more Python ETL Tools: Best 8 Options

Reordering Pandas DataFrame Columns: Thumbs Down On Standard Solutions

A solution that simplifies the process of changing the order of DataFrame columns. Dozens of blog posts, stackoverflow.com threads, quora.com articles, and other resources show similar standard methods for moving columns around (changing their order) in a Pandas DataFrame. This article first provides example data. Following the example data, the article … Read more Reordering Pandas DataFrame Columns: Thumbs Down On Standard Solutions

Build a custom-trained object detection model with 5 lines of code

Ideally, you’ll want at least 100 images of each class. The good thing is that you can have multiple objects in each image, so you could theoretically get away with 100 total images if each image contains every class of object you want to detect. Also, if you have video footage, Detecto makes it easy … Read more Build a custom-trained object detection model with 5 lines of code

A better way for asynchronous programming: asyncio over multi-threading

A brief introduction to asyncio

    import asyncio
    from aiohttp import ClientSession

    async def fetch(url):
        async with ClientSession() as session:
            async with session.get(url) as response:
                return await response.read()

This is basically the asyncio version of fetch_url. I use aiohttp because it provides an excellent client session where we can make HTTP requests asynchronously. Besides aiohttp.ClientSession, the code probably looks strange with async … Read more A better way for asynchronous programming: asyncio over multi-threading
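To show why coroutines like fetch help, here is a minimal standard-library sketch that runs several of them concurrently. It does not use aiohttp: fake_fetch is a hypothetical stand-in that simulates network latency with asyncio.sleep, and the URLs are placeholders.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for an HTTP request: waits, then returns a label."""
    await asyncio.sleep(delay)  # yields control to the event loop while "waiting"
    return f"body of {url}"


async def fetch_all(urls):
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
start = time.perf_counter()
results = asyncio.run(fetch_all(urls))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")  # the waits overlap instead of adding up
```

Because each coroutine gives up control while it waits, a single thread can keep many requests in flight, which is the advantage over multi-threading that the post's title points to.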

Restoring intuition over multi-dimensional space

We would not be human if we did not curse things. As beings confined to a three-dimensional world, we tend to blame space whenever we have trouble visualizing data that extends to more than three dimensions. From scientific books and journal papers to simple blog articles and comments the term: “curse … Read more Restoring intuition over multi-dimensional space

Confusion Matrix “Un-confused”

Breaking down the confusion matrix. The goal of applied machine learning in industry is to drive business value. Therefore, being able to evaluate your machine learning algorithm’s performance is extremely important for deriving insights into your model. In this post, I aim to dive into the confusion matrix in a way that is accessible for … Read more Confusion Matrix “Un-confused”

The easiest way to download YouTube videos using Python

And how to use a custom class to extract frames as images. In one of my first articles on Medium, I showed how to train a Convolutional Neural Network to classify images coming from old GameBoy games: Mario and Wario. After over a year, I wanted to revisit one aspect of the … Read more The easiest way to download YouTube videos using Python

9 Time-Saving Tricks for your Command Line

You’ve already encountered a few environment variables: PS1, HISTSIZE and HISTFILESIZE. In general, these are variables written in CAPITAL LETTERS that define important properties of the system. You can access the complete list of them with the set command. Another example (of many) is SHELLOPTS. It lists all the shell options that are set to … Read more 9 Time-Saving Tricks for your Command Line

9 fascinating Novel Coronavirus statistics and data visualizations

Here’s what you should know about the coronavirus as of today. These numbers are as of February 15, 2020 and are taken from WHO’s situation reports, the National Health Commission (NHC) of the People’s Republic of China, and the Health Commission of Hubei Province, China. Links are provided … Read more 9 fascinating Novel Coronavirus statistics and data visualizations

Building an Incremental Recommender System

A recommender system should ideally adapt to changes as they happen. Although I will try to keep the math jargon to a minimum, this story expects that the reader is familiar with concepts like user-item interaction matrix, matrix factorization, embedding spaces, as well as basic machine learning terminology. This story is not an introduction to … Read more Building an Incremental Recommender System