Using custom images as maps in Tableau

To do this, you first need to prepare the data source, add the image, and add the required coordinates to the data source. Here are the steps in detail. 1. Prepare the data source Open a spreadsheet file and add three columns: X, Y, and an identifier. I used Year. Enter random numbers in X … Read more

5 Python Best Practices That Every Programmer Should Follow

Whether you are writing a small script or building a detailed project, well-structured code with properly named modules, correct indentation, and documentation improves the code’s usability in the future. Especially when building projects, you must include a README file for describing your project, the file for properly setting up your … Read more

NHS English Prescribing Data (EPD) Analysis Using Python (Part 2)

Distribution As can be seen above, the quantity of oral antihistamines each year appears to follow a normal distribution. There also appears to be some relationship between the highest pollen count across the types of pollen each day and the quantity of oral antihistamines supplied. This relationship and distribution must be confirmed and/or quantified. To … Read more

How To Use “yield” in Python?

Python Generator — from basic to advanced usage If you’re a Python developer, I believe that you must know the Generator in Python. The key to defining a Python generator is to use the “yield” keyword. The Python generator is ubiquitously used in scenarios when we need a large collection, improve the readability of our … Read more
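As a minimal sketch of what the “yield” keyword does, the hypothetical `countdown` function below (not taken from the article) produces its values lazily, one per request, instead of building the whole collection up front:

```python
def countdown(start):
    """Yield start, start - 1, ..., 1 lazily, one value per request."""
    current = start
    while current > 0:
        yield current  # execution pauses here until the next value is requested
        current -= 1


gen = countdown(3)
first = next(gen)      # runs the function body up to the first yield
remaining = list(gen)  # drains whatever values are left

print(first, remaining)
```

Calling `countdown(3)` does not run any of the function body; the body only advances when a value is requested through `next()` or iteration.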

How to think about probability

Flip a coin, and you will have one of only two possible outcomes (heads or tails). Pull a card from a standard deck, and you will have one of only fifty-two possible outcomes. When the outcome is one of a fixed number of possible results, we call it a “categorical” outcome. Discrete probability deals with … Read more
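The coin and card examples can be sketched in a few lines, assuming every outcome is equally likely (a fair coin, a well-shuffled deck). The helper name `probability` is illustrative, not from the article:

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(event) when all outcomes are equally likely: favourable / total."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))


coin = ["heads", "tails"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]  # 52 outcomes

p_heads = probability({"heads"}, coin)
p_ace = probability({("A", suit) for suit in suits}, deck)

print(p_heads, p_ace)
```

Using `Fraction` keeps the results exact, which makes it easy to see that four aces out of fifty-two cards reduces to one in thirteen.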

What is IMU?

An IMU (Inertial Measurement Unit) is a common sensor that provides motion data in a time-series format. In this post we review it. An IMU sensor provides time-series data used in Human Activity Recognition problems, Tracking & Navigation problems, and many more. In the AI era, this cheap and reliable sensor can … Read more

5 Must-Know AI Concepts In 2021

Low-code and no-code initiatives appeared a few decades ago as a reaction to the increasingly large skill gap in the coding world. The technical ability to create good code and know how to handle tasks at different points in the design-production pipeline was expensive. As software products got more complex, so did the programming languages. … Read more

Python Web Apps Are a Terrible Idea for Analytics Projects.

As long as you have only short-living requests or long-running tasks with only a few anticipated requests from users, you’re good. Python frameworks like Flask and Django are ideal because you can keep everything in one language — Python. The complexity arises when you have long-running tasks with significant demand. To keep the server up, … Read more

Data Transformation and Feature Engineering

When the data sample follows a power-law distribution, we can use log scaling to bring the right-skewed distribution closer to a normal distribution. To achieve this, simply use the np.log() function. In this dataset, most variables fall under this category. Before transformation (image by author). ## log transformation – power law distribution ## log_var = ['Income_M$', … Read more
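A small sketch of the effect of log scaling, using only the standard library so it runs anywhere; `np.log()` applies the same natural logarithm element-wise to an array. The sample values are made up for illustration and are not from the article's dataset:

```python
import math


def log_transform(values):
    """Apply the natural log to each value; every value must be strictly positive."""
    return [math.log(v) for v in values]


# Hypothetical right-skewed sample: each value is 10x the previous one.
incomes = [1.0, 10.0, 100.0, 1000.0]
transformed = log_transform(incomes)

# After the transform, the gaps between neighbouring values are all equal.
gaps = [b - a for a, b in zip(transformed, transformed[1:])]
print(transformed, gaps)
```

Values that were multiplicatively spaced become evenly spaced, which is why the long right tail is compressed. Zero or negative values would need to be handled before applying the log.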

Cleaning & Preprocessing Text Data by Building NLP Pipeline

What are the main NLP text preprocessing steps? The list of text preprocessing steps below is really important, and I have written the steps in the order in which they should be applied. Step-1: Remove Accented Characters This is a crucial step to convert all characters like accented characters into machine-understandable language. So that further steps … Read more

Understanding Flask vs FastAPI Web Framework

A comparison of two different RestAPI frameworks Photo by Daria Shevtsova taken from Unsplash Introduction Being a Data Scientist does not end with Model Building, but working towards the next step, Model Deployment. Model Deployment is important to show your final results to others (or it could be your clients). This group of people might … Read more

AWS Lambda integration with Snowflake

Resource policy After replacing the relevant fields in the following JSON add the same resource policy for the API. Finally, in the lambda console, you should observe API triggered Lambda function. API Integration We create an API integration in Snowflake. This integration will create a user and allow that user to assume the role we … Read more

Adversarial Machine Learning: Attacks and Possible Defense Strategies

Information Theory An overview of one of the emerging research fields in Machine Learning and Artificial Intelligence. Image by Author Research on Machine Learning (ML) models has evolved in recent years, leading to the definition of very precise models. In fact, the primary goal of the ML researchers has always been to develop ever more … Read more

Categorizing user-uploaded documents

How insights from data were used to help build the taxonomy and our approach to assign categories to the user-uploaded documents. Scribd offers a variety of publisher and user-uploaded content to our users and while the publisher content is rich in metadata, user-uploaded content typically is not. Documents uploaded by the users have varied subjects … Read more

Supercharge your Vim Skills

8 Vim tips to edit your files faster. (Image by author) Vim is a text editor that, in the hands of a skilled user, can enable blazing-fast edits closer to the speed of thought, much faster than what’s usually achievable with a traditional text editor. For everything we do on the computer, there … Read more

The Easiest Headless Raspberry Pi Setup

Let’s get started. I have a Raspberry Pi 3, but any Raspberry Pi will work with this setup. All we’ll need to get set up is the following: a Raspberry Pi; a 4GB or greater microSD card; a Windows, Mac, or Linux computer; adapter(s) to plug your microSD card into your computer; an iPhone or Android device. Adapter hell. … Read more

Understanding LIME

First things first, we need to install LIME using pip. You can find the source code for LIME in [2]. pip install lime We will use the iris dataset provided to us by Scikit-learn [3] as an example to demonstrate the package usages. First things first, we need to import the different packages which we … Read more

Album covers by GANs

A step-by-step code and intuition guide to generating album covers. A random sample of generated album covers from WGAN Yeah, GANs can be pretty cool. If you somehow managed to stumble upon this little article, it’s probably safe to say that you’re somewhat interested in generative adversarial networks — GANs. I definitely was. In seeing … Read more

A Better Way for Data Preprocessing: Pandas Pipe

Efficient, organized, and elegant. Photo by Sigmund on Unsplash Real-life data is usually messy. It requires a lot of preprocessing to be ready for use. Pandas, one of the most widely used data analysis and manipulation libraries, offers several functions to preprocess the raw data. In this article, we will focus on one particular function … Read more

Thinking Like a Chef Will Make You a Better Data Scientist

The prominent chefs all have their own restaurant and/or unique style of cooking, and they still practice all their fundamentals every day. There isn’t a single chef breaking the rules who hasn’t mastered the rules in the first place. Essentially, they know how and why to break rules in a way that’s meaningful. That simply … Read more

What is Deep Analytics?

And why we need to rethink business intelligence Going for a dive. Photo by Joe Pohle from Unsplash. As data analysts, we waste too much time on making dashboards for other people and not enough time on answering deep questions about critical business issues. This is a waste of resources for the individual, and a … Read more

Self-Supervised Learning in Vision Transformers

Anyone who has ever approached the world of machine learning has certainly heard of supervised learning and unsupervised learning. These are in fact two important possible approaches to Machine Learning that have been widely used for years. Only recently, however, has there been an explosion of a new term, Self-Supervised Learning! But let’s get there … Read more

What are the Most Popular Skills for Data Science Jobs? Ask a Graph Database!

Finding your Next Job by Building a Job Graph with TigerGraph, Indeed Job Data, and Kaggle API Graphs are everywhere and can help with so much, including finding a job. Platforms like LinkedIn are powered by graph databases to help recommend jobs to you. In this blog, we’ll create an Indeed Graph that can … Read more

Can Github’s Copilot replace developers?

In simple words, Copilot really understands what you want to code in the next line. In my case, it even understands bad comments perfectly. Sometimes, it makes a few silly mistakes like declaring the same variable repeatedly; these kinds of bugs were already expected, which is why Github initially gave developers access to give their … Read more

Introduction to Time Series Forecasting — Part 2 (ARIMA Models)

Most time series forecasting methods assume that the data is ‘stationary,’ but in reality it often needs certain transformations for further processing. Photo by Miguel Luis on Unsplash In the first article, we looked at Simple Moving Average and Exponential Smoothing methods. In this article we will look at more complex methods like ARIMA and … Read more
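One common transformation for making a trending series closer to stationary is first-order differencing. The excerpt does not say which transformations the article uses, so this is only an illustrative sketch with made-up data:

```python
def first_difference(series):
    """Return the change between consecutive observations (one element shorter)."""
    return [current - previous for previous, current in zip(series, series[1:])]


# Hypothetical series with a steady upward trend, so its mean is not constant.
trend_series = [2, 4, 6, 8, 10]
differenced = first_difference(trend_series)

print(differenced)
```

The original values keep rising, while the differenced values stay at a constant level, which is the kind of behaviour stationary-data methods expect.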

How Big Is Cost Overrun for the Olympics?

All Games, without exception, have had cost overruns. For no other type of mega-project is this the case. With Alexander Budzier and Daniel Lunn Photo by Bryan Turner on Unsplash Percentage cost overrun for the Olympic Games 1960–2016 is shown in real terms in the table below. Data on cost overrun were available for 19 … Read more

5 Things I (didn’t) learn at University

Opinion: University badly failed to prepare me for my IT and Data Science career. How much is a Bachelor’s degree in IT and Data Science worth? Photo by Raychan on Unsplash In my current career, I am dealing with several trending topics regarding digitalization such as the renewal of IT through the cloud, … Read more

Predicting Electric Vehicle & Commercial Charger Demand in Washington State

Which Washington counties will have the most EVs and need the most commercial chargers? If you’ve stepped outside in the past couple months, chances are you’ve felt like a melting scoop of ice cream more than any other summer. Well, it is no coincidence that the global land-only surface temperature for June 2021 was the … Read more

How 400k+ Tweets Show That Simone Biles Wins

Here are the top 10 retweeted tweets. Top retweeted tweets referencing ‘Simone Biles’ | Skanda Vivek All of the top 10 retweeted tweets are in support of Simone Biles! And here are the top 10 liked tweets. Top 10 liked tweets referencing ‘Simone Biles’ | Skanda Vivek Same in this case — All of the … Read more

Practical Guide to Ensemble Learning

The intuition behind ensemble learning is often described with a phenomenon called the Wisdom of the Crowd, which means aggregated decisions made by a group of individuals are often better than the individual decisions. There are multiple methods for creating aggregated models (or ensembles), which we can categorize as heterogeneous and homogeneous ensembles. In heterogeneous … Read more
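The Wisdom of the Crowd idea can be sketched as hard majority voting over the predictions of several models. The labels and votes below are made up, and majority voting is only one of several aggregation methods:

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from three models for three samples.
votes_per_sample = [
    ["spam", "spam", "ham"],
    ["ham", "ham", "spam"],
    ["spam", "ham", "spam"],
]
ensemble_labels = [majority_vote(votes) for votes in votes_per_sample]

print(ensemble_labels)
```

Each individual model is wrong on at most one sample here, yet the aggregated decision can still be right on all of them. With an even number of models, ties are possible and would need an explicit tie-breaking rule.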

Automatic Parallel Parking: Path Planning, Path Tracking & Control

Path Tracking The kinematic model of the car is: dx/dt = v cos(φ), dy/dt = v sin(φ), dv/dt = a, dφ/dt = v tan(δ)/L. The state vector is z = [x, y, v, φ], where x: x-position, y: y-position, v: velocity, φ: yaw angle. The input vector is u = [a, δ], where a: acceleration, δ: steering angle. Control The MPC controller controls vehicle speed and steering based … Read more
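A minimal simulation sketch of this kinematic model, reading the left-hand sides of the model equations as time derivatives and taking L to be the wheelbase (both are assumptions about the excerpt). The forward-Euler update, the function name `step`, and the numeric values are illustrative choices, not the article's implementation:

```python
import math


def step(state, control, wheelbase, dt):
    """Advance the state z = [x, y, v, phi] by one forward-Euler step of length dt."""
    x, y, v, phi = state
    a, delta = control  # u = [a, delta]: acceleration and steering angle
    x_next = x + v * math.cos(phi) * dt
    y_next = y + v * math.sin(phi) * dt
    v_next = v + a * dt
    phi_next = phi + v * math.tan(delta) / wheelbase * dt
    return (x_next, y_next, v_next, phi_next)


# Hypothetical values: start at the origin, heading along x at 1 m/s.
state = (0.0, 0.0, 1.0, 0.0)
next_state = step(state, control=(0.5, 0.0), wheelbase=2.5, dt=0.1)

print(next_state)
```

With zero steering angle the yaw stays unchanged and the car moves straight along x while speeding up. A controller such as MPC would call a model like this repeatedly to predict future states for candidate inputs.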

How to Run Animations in Altair and Streamlit

Data Visualisation A ready-to-run tutorial, which describes how to build an animated line chart using Altair and Streamlit. Image by Author Altair is a very popular Python library for data visualisation. Through Altair, you can build very complex charts with few lines of code, since the library follows the guidelines provided by the Vega-Lite … Read more

Top Surprising Data Science Trends

Table of contents: Introduction; Arts and Entertainment; Utility Script; Earth and Nature; Summary; References. This article will outline the most popular data science trends that are designated as tags on Kaggle [2]. From those popular tags, I have picked three that I think are the most surprising. Understanding trends in data science can be helpful in a variety … Read more

Who are you Data Engineer?

In this post, I will explain the data roles that exist today and in particular — who is a data engineer? What are the role definition, responsibilities, and challenges contained in it? Photo by Christina @ on Unsplash For the past few years, I have been working as a big-data engineer, and although it … Read more

Explaining a BigQuery ML model

How to obtain and interpret explanations of predictions BigQuery ML is an easy-to-use way to invoke machine learning models on structured data using just SQL. Although it started with only linear regression, more sophisticated models like Deep Neural Networks and AutoML Tables have been added by connecting BigQuery ML with TensorFlow and Vertex AI as … Read more

Pre-Pruning or Post-Pruning

In a previous article, we talked about post pruning decision trees. In this article, we will focus on pre-pruning decision trees. Let’s briefly review our motivations for pruning decision trees, how and why post-pruning works, and its advantages and disadvantages. If you’d like some more details, check out this article. Decision Trees are grown using … Read more

A Detailed, Novice Introduction to Natural Language Processing (NLP)

There are a total of 5 execution steps when building a Natural Language Processor: Lexical Analysis: Processing of Natural Languages by the NLP algorithm starts with identifying and analyzing the input words’ structure. This part is called Lexical Analysis and Lexicon stands for an anthology of the various words and phrases used in a language. … Read more

Automating EDA & Machine Learning

Using MLJAR-Supervised for Automating EDA, Machine Learning Models, and Creating Markdown Reports Source: By Author Exploratory Data Analysis is an important step for understanding the data that we are working on. It helps us identify any hidden patterns in the data and the correlation between different columns of the data, and analyze the properties … Read more

Detecting Semantic Drift within Image Data

Contents: 1. Metadata and Features (Image features; Image metadata) 2. Semantic Drifts (Custom Features: Distances from Cluster Centers) 3. Conclusion. Even though we don’t have an actual model for prediction, let’s assume that our model input is expected to consist mainly of landscape images. Using a simulated production stage, we can test if it’s possible to detect … Read more

It’s Time to Use AI and Machine Learning Like Bar Charts.

Organizations need to deploy AI and machine learning more widely, beyond just data science teams. Yes, democratizing ML will lead to imperfect models, and sometimes even the wrong decision. But imperfect ML is not worse than imperfect Excel business analysis. The available size and scale of data requires more analysts empowered with an upgraded suite … Read more

Word2Vec Explained

Table of Contents: Introduction; What is a Word Embedding?; Word2Vec Architecture (CBOW (Continuous Bag of Words) Model; Continuous Skip-Gram Model); Implementation (Data; Requirements; Import Data; Preprocess Data; Embed; PCA on Embeddings); Concluding Remarks; Resources. Word2Vec is a recent breakthrough in the world of NLP. Tomas Mikolov, a Czech computer scientist and currently a researcher at … Read more

Heteroscedasticity in Regression Model

Use of Statsmodels to check Heteroscedasticity Image from Unsplash Introduction Oftentimes, regression analysis is carried out on data that may have a built-in feature of high variance across different independent variable values. One of the artifacts of this type of data is heteroscedasticity which indicates variable variances around the fitted values. When we observe heteroscedasticity, … Read more

Building a Streamlit app to visualise Covid-19 data

Photo by Mika Baumeister on Unsplash There are many datasets on the OWID GitHub repository; however, they have aggregated many of the key data points into one combined structure. This makes life considerably easier, as transformations are less intensive. Let’s have a look at the CSV. import pandas as pd df = pd.read_csv('owid-covid-data.csv') print(df.shape) (101752, 60) print('Total unique … Read more