Working with Hive using AWS S3 and Python

This article is a guide to connecting to Hive from Python and executing queries. I’m using the “PyHive” library for that. My connection class is named “HiveConnection”, and Hive queries are passed into its functions. AWS S3 is used as the file storage for Hive tables. import pandas … Read more Working with Hive using AWS S3 and Python

Reasoning With Probability — Is My Model Good Enough?

    # Check the model accuracy score again
    accuracy = my_tree.score(X_100_test, y_100_test)
    print(f'Accuracy is {accuracy * 100:.2f}%')
    >>> Accuracy is 86.00%

We got a slightly lower number, but still comparable. That said, while the numbers are close, having faith in the model based on 100 examples is harder than having faith based on 14,000 examples. Intuitively, our confidence in the test … Read more Reasoning With Probability — Is My Model Good Enough?
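The intuition that 100 test examples warrant less confidence than 14,000 can be made concrete with a rough interval around the measured accuracy. The sketch below is not taken from the article: it assumes the normal approximation to the binomial distribution, a 95% level (z = 1.96), and, purely for illustration, that the same 86% accuracy is observed at both test-set sizes. The function name `accuracy_interval` is hypothetical.

```python
import math


def accuracy_interval(accuracy, n_examples, z=1.96):
    """Approximate confidence interval for an accuracy measured on n_examples.

    Assumes the normal approximation to the binomial distribution;
    z=1.96 corresponds to a 95% confidence level.
    """
    std_error = math.sqrt(accuracy * (1.0 - accuracy) / n_examples)
    margin = z * std_error
    # Clip to [0, 1] because an accuracy cannot leave that range.
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


# Illustrative assumption: the same 86% accuracy at both test-set sizes.
for n in (100, 14_000):
    low, high = accuracy_interval(0.86, n)
    print(f"n={n}: accuracy is roughly between {low:.3f} and {high:.3f}")
```

Under these assumptions the margin is about 6.8 percentage points either way for 100 examples and about 0.6 for 14,000, which is one way to quantify why the smaller test set inspires less faith.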

RStudio Blogs 2019

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t. If you are lucky enough to have some extra time for discretionary … Read more RStudio Blogs 2019

Training on batch: how to split data effectively?

As for the spectrogram, you can think of it as a way of describing how much of each “tune” is present within the audio track. For instance, when a bass guitar is being played, the spectrogram would reveal high intensity more concentrated on the lower side of the spectrum. Conversely, with a soprano singer, we … Read more Training on batch: how to split data effectively?

Cluster analysis: theory and implementation of unsupervised algorithms

Industry applications: Why is clustering so popular in the statistics and machine learning fields? Because cluster analysis is a powerful data mining tool across a wide range of business applications. Here are just a few of many applications: Exploratory data analysis (EDA): Clustering is one of the most basic data analysis techniques employed … Read more Cluster analysis: theory and implementation of unsupervised algorithms

An Implementation of Distributed ACID Transactions

By Daniel Goméz Ferro and Monte Zweben Splice Machine is a Hybrid Transactional/Analytical Processing database (HTAP) that is designed to modernize legacy applications. By combining aspects of a traditional RDBMS database, such as ANSI SQL support and ACID (Atomicity, Consistency, Isolation, Durability) transactions, with the scalability, efficiency, and availability of in-memory analytics and machine learning, … Read more An Implementation of Distributed ACID Transactions

The Importance of Ethics in Artificial Intelligence

(Or any other form of technology for that matter) “Just because we can, doesn’t mean we should” could be something to keep in mind when it comes to innovating with technology. The arrival of The Internet has 10x’d the speed of innovation and allows us to pretty much create anything we can think of. Artificial … Read more The Importance of Ethics in Artificial Intelligence

Let me recall you this: accuracy isn’t everything in Machine Learning.

Why is recall so important when evaluating your Machine Learning model? Every day, data science professionals can’t stop thinking about one thing: is that model really working? Data is like a live creature: it changes and gets messy almost every day. In the end, all we want is to find a way to handle it and … Read more Let me recall you this: accuracy isn’t everything in Machine Learning.
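A small worked example shows why accuracy alone can mislead. The data below is invented for illustration (5 positives among 100 examples, and a model that always predicts the negative class); the function name is hypothetical and nothing here comes from the article itself.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 is the positive class."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    # Recall is the share of actual positives that the model found.
    actual_pos = true_pos + false_neg
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Illustrative, made-up data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A degenerate model that never predicts the positive class.
y_pred = [0] * 100

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The model scores 95% accuracy while finding none of the positive cases, which is the situation recall is designed to expose.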

Write Clean and SOLID Scala Spark Jobs

Creating data pipelines by writing Spark jobs is easier nowadays thanks to the growth of new tools and data platforms that allow multiple data parties (analysts, engineers, scientists, etc.) to focus on understanding data and writing logic to get insights. Nevertheless, new tools like notebooks that allow easy scripting are sometimes not well used and … Read more Write Clean and SOLID Scala Spark Jobs

Introduction to Data Science in R, Free for 3 days

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. To celebrate the new year and the recent release of … Read more Introduction to Data Science in R, Free for 3 days

A quick and dirty guide to visualization in Plotly for Python

There are endless options to create plots in Python, but one of the libraries that I have started to gravitate towards is Plotly. It is a solid option to create beautiful and presentable plots easily within your Jupyter workflow. It has a vast selection of customizable plots that you can use for your visualizations (the entire … Read more A quick and dirty guide to visualization in Plotly for Python

Animated Information Graphics

with Python and Plotly. Animated information graphics of various datasets are a popular topic on YouTube; for example, the channel Data Is Beautiful has almost a million subscribers. I will show in … Read more Animated Information Graphics

How to Create a Simple Cancer Survival Prediction Model with EDA

A Thorough Walkthrough of Exploratory Data Analysis Techniques with Haberman’s Cancer Survival Dataset using Python to Create a Simple Predictive Cancer Survival Model Illustration by John Flores The impetus for this blog and the resultant cancer survival prediction model is to provide a glimpse into the potential of the healthcare industry. Healthcare continues to learn … Read more How to Create a Simple Cancer Survival Prediction Model with EDA

Everything you need to know about “Activation Functions” in Deep learning models

This article is your one-stop solution to every question that might come to mind about the activation functions used in deep learning models. These are basically my notes on activation functions, with everything I know about the topic gathered in one place. So, without going into any … Read more Everything you need to know about “Activation Functions” in Deep learning models

A Beginner’s Guide to Preprocessing Text Data Using NLP Tools

Code used to preprocess Tweets obtained from Twitter. Source: https://www.blumeglobal.com/learning/natural-language-processing/ Below I’ve outlined the code I used to preprocess text for my Natural Language Processing projects. This code has mostly been used on tweets obtained from Twitter for classifiers. My hope is that this guide assists aspiring data scientists and machine learning engineers by familiarizing them with … Read more A Beginner’s Guide to Preprocessing Text Data Using NLP Tools

How Machine Learning Enhances Business Automation

From Predictive Analytics to predictive HR Support Image Source: Pexels.com The machines are here. They’re learning. And they’re coming for your business — with the power to build or destroy your ability to compete in the near future. As Margaret Laffan, Machine Learning Business Development Director for SAP says in Forbes, “Those companies not considering … Read more How Machine Learning Enhances Business Automation

Redesigning a Bad Graph — Spaghetti to Micromaps

Authors: Chaithanya Pramodh Kasula and Aishwarya Varala. Introduction: The primary focus of the current report is redesigning a bad graph depicting Opioid overdose death rates per 100,000 (Age-Adjusted), in different states of the US. It also discusses detailed approaches and techniques used to obtain meaningful insights from … Read more Redesigning a Bad Graph — Spaghetti to Micromaps

Costa Rican Household Poverty Level Prediction in R

A comprehensive data analysis and prediction in R using Machine Learning. Authors: Chaithanya Pramodh Kasula and Aishwarya Varala. Introduction: The current report details the process of answering several research questions related to the poverty levels of Costa Rican households. It covers data sources, exploratory data analysis through visualization, … Read more Costa Rican Household Poverty Level Prediction in R

The Biggest AI Risk of the Next Decade is not a Robot Uprising

Grand generalisations about the future impacts of artificial general intelligence overshadow the more pressing issues we face today. Finally, robotic beings rule the world — pictures of the Terminator and HAL are just played out at this point. (Flight of The Conchords, Robots) The past decade has seen many interesting and impressive developments in tech … Read more The Biggest AI Risk of the Next Decade is not a Robot Uprising

Multi-Armed Bandits and Reinforcement Learning

A Gentle Introduction to the Classic Problem with Python Examples Photo by Carl Raw on Unsplash Multi-armed bandit problems are some of the simplest reinforcement learning (RL) problems to solve. We have an agent which we allow to choose actions, and each action has a reward that is returned according to a given, underlying probability … Read more Multi-Armed Bandits and Reinforcement Learning
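The setup described above (an agent choosing actions whose rewards follow an underlying probability distribution) can be sketched with an epsilon-greedy agent. This is a minimal illustration, not the article's own code: the Bernoulli reward probabilities, the epsilon value, the step count, and the function name are all assumptions chosen for the example.

```python
import random


def run_epsilon_greedy(reward_probs, n_steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit with an epsilon-greedy agent.

    reward_probs[i] is the (hidden) chance that arm i pays a reward of 1.
    Returns the agent's estimated value per arm and the pull counts.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Hypothetical three-armed bandit; the agent never sees these probabilities.
estimates, counts = run_epsilon_greedy([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated values:", [round(e, 2) for e in estimates])
print("pull counts:", counts, "best arm:", best_arm)
```

The incremental mean avoids storing every reward, and the exploration rate keeps every arm sampled often enough for its estimate to stay meaningful.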

The Competition Mindset: where Kaggle and Industry diverge

There are three major reasons why Kaggle is not Industry Data Science: The Objective: Kaggle and Industry Data Science have radically different objectives and aims. The Techniques: Kaggle competitions prioritize techniques not readily utilized in industry. The Data: Kaggle provides you the data; the real world doesn’t. 1. The Objective of Kaggle You might have … Read more The Competition Mindset: where Kaggle and Industry diverge

mapply and Map in R

An older post on this blog talked about several alternative base apply functions. This post will talk about how to apply a function across multiple vectors or lists with Map and mapply in R. These functions are generalizations of sapply and lapply, which allow you to more easily loop over multiple vectors or lists simultaneously. … Read more mapply and Map in R

Image Similarity Detection in Action with Tensorflow 2.0

The main purpose of this script is to calculate image similarity scores using image feature vectors we have just generated in the previous chapter. It has two functions: match_id(filename) and cluster(). cluster() function does the image similarity calculation with the following process flow: Builds an annoy index by appending all image feature vectors stored in … Read more Image Similarity Detection in Action with Tensorflow 2.0

Dimensional Data Modeling

Why do you need dimensional data modeling, and how do you implement it? Dimensional modeling (DM) is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball, which includes a set of methods, techniques and concepts for use in data warehouse design. The approach focuses on identifying the key business processes within a business and … Read more Dimensional Data Modeling

Using GraphSAGE to Learn Paper Embeddings in CORA

Here we use the stellargraph library to learn paper embeddings on CORA via the GraphSAGE algorithm. CORA[1] is a dataset of academic papers of seven different classes. It contains the citation relations between the papers as well as a binary vector for each paper that specifies if a word occurs in the paper. Thus, CORA contains both … Read more Using GraphSAGE to Learn Paper Embeddings in CORA

Reliability chapter of ‘evidence-based software engineering’ updated

[This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers]. The Reliability chapter of my evidence-based software engineering … Read more Reliability chapter of ‘evidence-based software engineering’ updated

Fast Fourier Transform

Let’s take a look at how we could go about implementing the Fast Fourier Transform algorithm from scratch using Python. To begin, we import the numpy library.

    import numpy as np

Next, we define a function to calculate the Discrete Fourier Transform directly.

    def dft(x):
        x = np.asarray(x, dtype=float)
        N = x.shape[0]
        n = np.arange(N)
        k = n.reshape((N, 1))
        M …

Read more Fast Fourier Transform
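Because the excerpt is cut off before the transform is complete, here is a self-contained sketch of the same idea. It is not the article's code: it uses only the standard library's `cmath` instead of numpy, and it assumes the common recursive radix-2 Cooley-Tukey scheme, which requires the input length to be a power of two.

```python
import cmath


def dft(x):
    """Direct O(N^2) Discrete Fourier Transform of a sequence of numbers."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n_samples = len(x)
    if n_samples == 0 or n_samples & (n_samples - 1):
        raise ValueError("input length must be a power of two")
    if n_samples == 1:
        return [complex(x[0])]
    # Transform the even- and odd-indexed samples separately.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n_samples // 2
    # Apply the twiddle factors to the odd half before combining.
    twiddled = [cmath.exp(-2j * cmath.pi * k / n_samples) * odd[k]
                for k in range(half)]
    return ([even[k] + twiddled[k] for k in range(half)]
            + [even[k] - twiddled[k] for k in range(half)])


# Made-up signal used only to check that both functions agree.
signal = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
max_diff = max(abs(a - b) for a, b in zip(dft(signal), fft(signal)))
print(f"largest difference between dft and fft: {max_diff:.2e}")
```

Comparing the fast version against the direct one on a short signal is a cheap way to catch indexing or sign mistakes in the twiddle factors.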

Get Your Own Website Online In Four Steps

A simple guide for using R Blogdown. I like to collect blog websites that have a nice design or interesting topics. Not only does the website itself attract me, but so do the people who built it. I admire them for spending time on designing their own website and publishing content constantly. I always imagine how … Read more Get Your Own Website Online In Four Steps

State Space Model and Kalman Filter for Time Series Prediction: Basic Structural & Dynamic Linear…

Time series consist of four major components: Seasonal variations (SV), Trend variations (TV), Cyclical variations (CV), and Random variations (RV). Here, we will perform predictive analytics using a state space model on univariate time series data. I have used historical data of Schlumberger Limited (SLB) from 1986 onwards. Let’s load the data in our work … Read more State Space Model and Kalman Filter for Time Series Prediction: Basic Structural & Dynamic Linear…

3 techniques to make your Python code faster

You won’t even break a sweat. In this post I’ll be sharing 3 Python efficiency techniques that you may use in your daily scripting, and how to measure the performance improvement between 2 solutions. Let’s get started! Performance may refer to many different factors in a solution (e.g. execution time, CPU usage, memory usage, etc.). … Read more 3 techniques to make your Python code faster
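One common way to measure the execution-time difference between two solutions is the standard library's `timeit` module. The sketch below is an assumption about how such a comparison might look, not the article's own benchmark: the two functions and the helper `best_time` are hypothetical, and the excerpt does not say which techniques the article covers.

```python
import timeit


def squares_loop(n):
    """Build a list of squares with an explicit loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


def best_time(func, n=10_000, repeat=5, number=100):
    """Best average seconds per call over several timing runs."""
    timings = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings) / number


# Both solutions must agree before their speed is worth comparing.
assert squares_loop(1000) == squares_comprehension(1000)

loop_seconds = best_time(squares_loop)
comprehension_seconds = best_time(squares_comprehension)
print(f"loop:          {loop_seconds * 1e6:.1f} microseconds per call")
print(f"comprehension: {comprehension_seconds * 1e6:.1f} microseconds per call")
```

Timings vary by machine and Python version, so the numbers are only meaningful relative to each other on the same system.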

Three ways to build a Neural Network in Pytorch

TL;DR: This isn’t meant to be a tutorial on PyTorch nor an article that explains how a neural network works. Instead, I thought it would be a good idea to share some of the stuff I’ve learned in the Udacity Bertelsmann Scholarship, AI Program. Having said this, the goal of this article is to illustrate a … Read more Three ways to build a Neural Network in Pytorch

Can You Become a Data Scientist Without a Quantitative Degree?

2. Find a great team. It matters, a lot. I almost put this one as #1 because in my own experience, being able to work with great people who recognize your unique value and genuinely want to see you succeed has been the single most important factor that has contributed to both my satisfaction and … Read more Can You Become a Data Scientist Without a Quantitative Degree?

How Insurance Companies use AI Data and Machine Learning to Enhance Services

Could your insurance company offer better services by harnessing big data, AI, and machine learning? Image Source: UnSplash. The take-up of A.I. has become a key feature in driving business changes across the insurance journey lifecycle. Early adopters use it to obtain better lead scoring, higher conversion rates, more effective cross-sell and upsell, increased retention, and … Read more How Insurance Companies use AI Data and Machine Learning to Enhance Services

Liability for Artificial Intelligence in the European Union

A Summary of The Report From the EU Expert Group on Liability and New Technologies Released in 2019 This article is a brief summary of the report from the EU Expert Group on Liability and New Technologies. In short, it outlines potential regulation in regards to emerging technologies, and although stated in the form of … Read more Liability for Artificial Intelligence in the European Union

A Visual Introduction to Clustering with KMeans

Getting an intuitive understanding of KMeans clustering with visual means in a 2- and 3-dimensional space and interactive graphics Clustering is a type of unsupervised learning that we use when we do not know beforehand what groups each observation in the data belongs to. The KMeans algorithm for clustering basically consists of 5 steps: Step … Read more A Visual Introduction to Clustering with KMeans
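The article's five steps are cut off in this excerpt, so the sketch below follows the standard Lloyd's-algorithm formulation of KMeans rather than the article's exact wording: pick initial centroids, assign each point to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing. The sample points, the seed, and the function name are illustrative assumptions.

```python
import math
import random


def kmeans(points, k, n_iters=100, seed=0):
    """Cluster points (tuples of coordinates) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct points drawn at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # Converged: no point changed cluster.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignments


# Two made-up, well-separated groups in 2-dimensional space.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print("centroids:", centroids)
print("assignments:", assignments)
```

Note that the result of KMeans depends on the initial centroids in general; the fixed seed here only makes the example repeatable.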

Seven Ways Machine Learning Detects Anomalies in Healthcare

Identifying Insurance Fraud and Predicting Outcomes in the Medical Industry Using Data-Driven Technology Image Source: UnSplash The digital revolution has changed the healthcare landscape irrevocably. Patients expect faster, more efficient care that costs less, which is where artificial intelligence (AI) can help. AI and machine learning allow healthcare organizations to evolve and keep up with … Read more Seven Ways Machine Learning Detects Anomalies in Healthcare

AutoViz: A New Tool for Automated Visualization

Data scientists are often tasked with working through massive data stores to provide workable insights. These insights are then analyzed in order to identify patterns related to business intelligence or even human behavior. However, it may be one thing to construct data queries and machine learning pipelines, employing all types of optimizations and clever algorithms. … Read more AutoViz: A New Tool for Automated Visualization

Types of Neural Networks (and what each one does!) Explained

Machine learning, a subset of Artificial Intelligence, incorporates neural networks to create some amazing software that we use on a daily basis. If you used Google to find this Medium article, you used Google’s neural network that ranks the most relevant pages based on the keyword(s) you gave it. If you recently went … Read more Types of Neural Networks (and what each one does!) Explained

Long Short Term Memory and Gated Recurrent Unit’s Explained — ELI5 Way

Hi all, welcome to my blog “Long Short Term Memory and Gated Recurrent Unit’s Explained — ELI5 Way”. This is my last blog of the year 2019. My name is Niranjan Kumar and I’m a Senior Consultant Data Science at Allstate India. Recurrent Neural Networks (RNN) are a type of Neural Network where the output from … Read more Long Short Term Memory and Gated Recurrent Unit’s Explained — ELI5 Way

Let’s POP those filters in Tableau!

It’s Datafied! Tableau Playbook. Photo by Nicolas Picard on Unsplash. So here we are, back on the journey towards learning interesting data visualization concepts using Tableau, dashboarding best practices, and a few handy tips/tricks along the way. In this write-up, we go through the following activities: Using Level of Detail (LOD) expressions to identify whether … Read more Let’s POP those filters in Tableau!

Meta-learners for Estimating Treatment Effect in Causal Inference

Meta-algorithms are used to calculate the conditional average treatment effect (CATE). The most common meta-algorithms take two steps: First, they use base-learners to estimate the conditional expectations of the outcomes separately for control and treatment groups. Second, they take the difference between these estimates. To be more specific, here are the few meta-learners used in the repo. T-Learner: When the base-learner is … Read more Meta-learners for Estimating Treatment Effect in Causal Inference
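The two steps described above can be sketched as follows. This is an illustration only and not the repo's code: it assumes a one-dimensional feature, uses ordinary least squares as a stand-in base-learner, and runs on noiseless, made-up data. The names `fit_line` and `two_model_cate` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns a predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def two_model_cate(xs, treated, ys):
    """Estimate the treatment effect as a function of x using two base-learners."""
    control = [(x, y) for x, t, y in zip(xs, treated, ys) if t == 0]
    treatment = [(x, y) for x, t, y in zip(xs, treated, ys) if t == 1]
    # Step 1: estimate the expected outcome separately for each group.
    mu_control = fit_line(*zip(*control))
    mu_treatment = fit_line(*zip(*treatment))
    # Step 2: the effect estimate is the difference between the two predictions.
    return lambda x: mu_treatment(x) - mu_control(x)


# Made-up data: control outcome is 1 + 2x, treated outcome is 3 + 3x,
# so the true effect at x is 2 + x.
xs = [0, 1, 2, 3, 0, 1, 2, 3]
treated = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [1, 3, 5, 7, 3, 6, 9, 12]

cate = two_model_cate(xs, treated, ys)
print(f"estimated effect at x=2: {cate(2):.2f}")
```

Any regression model could replace `fit_line` as the base-learner; the two-step structure stays the same.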

Fit a Linear Regression Model with Gradient Descent from Scratch

Implement gradient descent to find optimal weights for a simple linear regression. We all know sklearn can fit models for us. But do we know what it’s actually doing when we call .fit()? Keep reading to find out. Today we’ll write a set of functions which implement gradient descent to fit a linear regression model. … Read more Fit a Linear Regression Model with Gradient Descent from Scratch
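As a rough illustration of the idea, the sketch below fits y = w * x + b by batch gradient descent on the mean squared error. It is not the article's set of functions: the learning rate, iteration count, toy data, and function name are assumptions chosen so that this small example converges.

```python
def fit_linear_regression(xs, ys, learning_rate=0.05, n_iters=5000):
    """Fit y = w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Prediction errors under the current weights.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up, noiseless data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

w, b = fit_linear_regression(xs, ys)
print(f"w={w:.4f}, b={b:.4f}")
```

A learning rate that is too large makes the updates diverge, so on differently scaled data the step size would need to be retuned or the features standardized first.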