From Data Lakes to Data Reservoirs

The Emergence of Standards. Good ideas take hold and spread like wildfire. Recently the data community has standardized on at least one core data format that is good enough to get behind: the file storage format Parquet. We are going to learn a little more about why this is such an … Read more

Clear charts with Matplotlib

How to build Matplotlib charts that people actually want to read and use. Data science is largely about convincing people: showing what you have found and which patterns you actually understood. Charting goes a long way in this exercise. Sadly, many scientific libraries allow you to produce charts out of the box such as the … Read more

Amazon Personalize enhances Recommendation Filters with filtering on item metadata

Today, we are pleased to announce enhancements to Recommendation Filters in Amazon Personalize, which provide you greater control on recommendations your users receive by allowing you to exclude or include items to recommend based on criteria that you define. For example, when recommending products for your e-retail store you can exclude unavailable items from recommendations; … Read more

PANDAS: Put Away Novice Data Analyst Status

How Pandas can make you a better data analyst. Learn about one-liners for different steps in the data analysis process. (Photo by cheese yang on Unsplash.) Pandas, or as I call it, Put Away Novice Data Analyst Status, is a powerful open-source data analysis and manipulation library. It can help you to do various operations on … Read more

Pandas Time/Date Series Functionality

Extensive capabilities and features for time series analysis. (Photo by Markus Winkler on Unsplash.) Time series and date functionality play a major part in financial data analysis. pandas contains extensive capabilities and features for working with time series data in all domains. Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large … Read more

Manage access to AWS centrally for OneLogin users with AWS Single Sign-On

The interoperability of AWS SSO and OneLogin enables administrators to assign users and groups access centrally to their AWS Organizations accounts and AWS SSO integrated applications. This makes it easier for an AWS administrator to manage access to AWS and ensure OneLogin users have the right access to the right AWS accounts. Ongoing management is … Read more

Custom PySpark Accumulators

dict, list and set types of PySpark accumulators. (Photo by Joshua Sortino on Unsplash.) By default, Spark provides int/float accumulators, which support commutative and associative operations. Spark also provides a class, AccumulatorParam, to inherit from in order to support other accumulator types. One just needs to implement two methods, zero and addInPlace. … Read more

Amazon GuardDuty now available in AWS Africa (Cape Town) and Europe (Milan) Regions

Available globally, Amazon GuardDuty continuously monitors for malicious or unauthorized behavior to help protect your AWS resources, including your AWS accounts, access keys, and data stored in Amazon S3. GuardDuty identifies unusual or unauthorized activity, like crypto-currency mining, access to data stores in S3 from unusual locations, or infrastructure deployments in a region that has … Read more

Reinforcement Learning — Part 2

Markov Decision Processes. Deep Learning at FAU. Image under CC BY 4.0 from the Deep Learning Lecture. These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video and matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was … Read more

NLP: Classification & Recommendation Project

There are various algorithms that can be used for text classification. I started by exploring these models: Logistic Regression, Naive Bayes, Linear SVC, and Random Forest. My method was to run all the models in this section and then choose the best one to optimize. Hence, I ran all the models with their default parameters to … Read more

Word Embedding in NLP: One-Hot Encoding and Skip-Gram Neural Network

I’m a poet-turned-programmer who has just begun learning about the wonderful world of natural language processing. In this post, I’ll be sharing what I’ve come to understand about word embedding, with the focus on two embedding methods: one-hot encoding and skip-gram neural network model. Last year, OpenAI released a (restricted) version of GPT-2, an AI … Read more

Why R? 2020 (Remote) Call for Papers Extended

[This article was first published on, and kindly contributed to R-bloggers]. This decided to give you one more week to submit a talk or … Read more

The math behind Machine Learning Algorithms

How do different machine learning algorithms learn from the data and predict on unseen data? (Photo by Roman Mager on Unsplash.) Machine learning algorithms are designed so that they learn from experience, and their performance improves as they are fed more and more data. Every algorithm has its own way … Read more

fairmodels: let’s fight with biased Machine Learning models (part 1 — detection)

Author: Jakub Wiśniewski. TL;DR: The fairmodels R package facilitates bias detection through model visualizations. It implements a few mitigation strategies that could reduce the bias. It enables easy-to-use checks for fairness metrics and comparison between different Machine Learning (ML) models. Longer version: Fairness in ML … Read more

How to Calculate and Analyze Relative Strength Index (RSI) Using Python

The relative strength index is a momentum oscillator commonly used to predict when a company’s stock is oversold or overbought. The calculation process is straightforward: Observe the last 14 closing prices of a stock. Determine whether the current day’s closing price is higher or lower than the previous day’s. Calculate the average gain and loss over … Read more
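The steps listed in the excerpt can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `rsi` is hypothetical, the averages are simple means over the last 14 price changes (which needs 15 closing prices), and the final step uses the standard formula RSI = 100 - 100 / (1 + average gain / average loss). The original article may use a different smoothing method.

```python
def rsi(closes, period=14):
    """Relative strength index from a list of closing prices.

    Assumption: simple (unsmoothed) averages over the last `period`
    day-to-day changes, so at least `period + 1` closes are required.
    """
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")

    window = closes[-(period + 1):]
    # Day-to-day change: positive means the close was higher than the previous day.
    changes = [today - prev for prev, today in zip(window, window[1:])]

    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period

    if avg_loss == 0:
        # No losing days in the window: by convention the RSI is at its maximum.
        return 100.0

    relative_strength = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + relative_strength)
```

A series that alternates between equal gains and losses sits in the middle of the scale, while a series that only falls yields the minimum value of 0.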

Bayesnote v0.0.1 release note

Bayesnote is a frictionless integrated notebook environment for data scientists and data engineers. It provides a user interface to build dashboards and deploy machine learning models right from a notebook. It also supports the operation of notebooks by a workflow system, Noteflow. … Read more

The Three Questions about AI that Startups Need to Ask

Billion-dollar investments in AI are booming. What does this mean for startups looking to AI for their innovative and competitive edge? The strategy seems simple: take one of humanity’s perennial problems and fix it with machine learning. Google, Facebook, Netflix, and Uber did it. It can often seem like the obvious question is why not … Read more

Predicting Sentiment of Employee Reviews

In my previous articles, we learned how to scrape, process, and analyze employee reviews from Feel free to take a look and offer feedback. I would love to hear how you would improve the code. In particular, how to dynamically overcome changes to the website’s HTML. In this article, I would like to take … Read more

Impressive Medium Articles on AI/ML This Month

One-stop-shop to get information into the history, development and potential of GPT-3. Julien Lauret’s article is a comprehensive summary of the journey taken so far to create GPT-3. Julien has managed to summarize years of development and introductions of methodology and techniques to model language and solve natural language processing into several small, concise paragraphs. … Read more

How to Create a GraphQL API using AWS AppSync

Nowadays, whenever we talk or think about creating or designing an API, what first comes to mind is REST. REST (REpresentational State Transfer) has been the go-to standard until recently when developing an API platform. Even though REST became the standard, it has its own disadvantages. One of the main disadvantages is the inflexibility for … Read more

What is Data Science?

Exploring the history of data science and understanding what it is now. (Image by Trist’n Joseph.) Data has become the driving force behind the world’s industries. Now, more than ever, businesses need individuals who can help them optimize their operations. Because of this, Data Science jobs have been ranked Glassdoor’s number one best job consecutively … Read more

Classification Model from Scratch

Beginner’s guide to building a Naive Bayes classifier (a simple classification model) from scratch using Python. (Image: Cameron Foxly, “BASIC programming into an old computer”.) In machine learning, we can use probability to make predictions. Perhaps the most widely used example is called the Naive Bayes algorithm. Not only is it straightforward to understand, but it … Read more

Best Free Resources to Learn Programming, Software Engineering, Machine Learning, And More

All you need to learn… (Source: Unsplash, by 🇸🇮 Janko Ferlič.) Did you know that you can take courses from MIT, Stanford, and Harvard for free? Lots of their undergraduate and graduate-level course materials are free for students around the globe to use. I am going to talk about some of the … Read more

The Sardinas-Patterson Algorithm in Simple Python

Checking for Unique Decodability in Variable-Length Codes. (Image by S. Hermann & F. Richter from Pixabay.) Two fields that often get left on the sidelines in conversations about data science are Information Theory, which studies the quantification, storage, and communication of information, and Coding Theory, which studies the properties of codes and their respective fitness for … Read more
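For readers who want a feel for the check named in the title, here is a minimal sketch of the Sardinas-Patterson test, written independently of the article. The function names are hypothetical. The idea is to repeatedly compute sets of "dangling suffixes" (what is left over when one string is a proper prefix of another); the code is uniquely decodable exactly when no such suffix is itself a codeword.

```python
def dangling_suffixes(prefixes, words):
    """Non-empty w such that p + w == word for some p in prefixes, word in words."""
    return {
        word[len(p):]
        for p in prefixes
        for word in words
        if len(word) > len(p) and word.startswith(p)
    }


def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test for a finite set of non-empty codeword strings."""
    code = set(codewords)
    if not code or "" in code:
        raise ValueError("expected a non-empty collection of non-empty codewords")

    # First round: suffixes left when one codeword is a proper prefix of another.
    current = dangling_suffixes(code, code)
    seen = set()
    while current:
        if current & code:
            # A dangling suffix is a codeword: some string has two parsings.
            return False
        frozen = frozenset(current)
        if frozen in seen:
            # The sequence of suffix sets has started to cycle without a hit.
            return True
        seen.add(frozen)
        # Next round: compare codewords against suffixes in both directions.
        current = dangling_suffixes(code, current) | dangling_suffixes(current, code)
    return True
```

For instance, `{"0", "01", "10"}` fails the test because the string `010` can be read as `0, 10` or as `01, 0`, whereas any prefix-free code passes immediately because the first suffix set is empty.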

How to Draw Venn Diagrams on Jupyter

➡️ Introduction ➡️ A, B ➡️ TRUE, FALSE ➡️ A AND B, A NAND B ➡️ A OR B, A NOR B ➡️ A XOR B, A XNOR B ➡️ NOT A, NOT B ➡️ A NOT B, B NOT A ➡️ Implication, A → B, B → A ➡️ Mutually exclusive ➡️ Complement ➡️ Subset ➡️ Conclusion. In this article, you will find how to draw … Read more

Demand Forecasting using FB-Prophet

A seasonal decomposition of the time series is performed using the statsmodels.tsa.seasonal.seasonal_decompose function. The charts above show a linear growth in sales over time (across categories and states) along with seasonal effects. Linearity is particularly evident in the latter half of the time series, starting from the year 2014. A yearly seasonality is seen in all states … Read more

I like to MVO it!

[This article was first published on R on OSM, and kindly contributed to R-bloggers]. In our last post, we ran through a bunch of weighting … Read more

Handling R6 objects in C++

[This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers]. Introduction: When we are using R6 objects and want to introduce some … Read more

rfm 0.2.2

We’re excited to announce the release of rfm 0.2.2 on CRAN! rfm provides tools for customer segmentation using Recency, Frequency, Monetary value analysis. It includes a Shiny app for interactive segmentation. You can install rfm with: install.packages("rfm") In this blog post, we will summarize the changes implemented in the current (0.2.2) and previous release (0.2.1). … Read more

Spatial GLMM(s) using the INLA Approximation

Model setup. We have a count outcome (deaths and births) in counties over time, and a set of time-constant covariates. We have several options in the GLM framework with which to model these data, for example: Binomial – \[y_{ij} \sim Bin(\pi_{ij}) \text{: } logit(\pi_{ij}) = \beta_{0} + x'\beta_k \] Poisson – \[y_{ij} \sim Pois(\lambda_{ij} E_{ij}) … Read more

Feature Leakage, and identifying it with Exploratory data analysis and Machine Learning

library(tidyverse)
# Loading some data
loan_data <- structure(list(finalClass = c("Reject/Cancel", "Success", "Reject/Cancel", "Success", "Success", "Reject/Cancel", "Reject/Cancel", "Success", "Reject/Cancel", "Success", "Reject/Cancel", "Success", "Reject/Cancel", "Success", "Success", "Reject/Cancel", "Success", "Reject/Cancel", "Reject/Cancel", "Success", "Reject/Cancel", "Success", "Reject/Cancel", "Success", "Success", "Reject/Cancel", "Reject/Cancel", "Success", "Success", "Reject/Cancel", "Success", "Reject/Cancel", "Success", "Reject/Cancel", "Reject/Cancel", "Success", "Reject/Cancel", "Reject/Cancel", "Success"), balance_new_bracket = c("01. <= 10k", "01. <= … Read more

Explainable ‘AI’ using Gradient Boosted randomized networks Pt2 (the Lasso)

This post is about LSBoost, an Explainable ‘AI’ algorithm which uses Gradient Boosted randomized networks for pattern recognition. As we discussed last week, LSBoost is a cousin of GFAGBM’s LS_Boost. More specifically, in LSBoost the so-called weak learners from LS_Boost are based on randomized neural networks’ components and variants of Least Squares regression … Read more

Decentralized Reinforcement Learning

Detailed overview of a new paradigm in Reinforcement Learning. Many associations in the world, such as biological ecosystems, governments, and corporations, are physically decentralized yet unified in their functionality. For instance, a financial institution operates with a global policy of maximizing its profits, hence appearing as a single entity; however, … Read more

Random Forest on GPUs: 2000x Faster than Apache Spark

Lightning-fast model training with RAPIDS. (Photo by bady abbas on Unsplash.) Disclaimer: I’m a Senior Data Scientist at Saturn Cloud; we make enterprise data science fast and easy with Python, Dask, and RAPIDS. Prefer to watch? Check out a video walkthrough here. Random forest is a machine learning algorithm trusted by many data scientists … Read more

An In-Depth Crash Course on Random Variables

Every random variable has an associated probability distribution function. A probability distribution function essentially gives the probabilities associated with obtaining each possible value or an interval of values. There are three types of probability distribution functions: the probability mass function (pmf), the probability density function (pdf), and the cumulative distribution function (cdf). Probability Mass Function (pmf) … Read more
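To make the relationship between a pmf and a cdf concrete, here is a small sketch for a discrete random variable. The fair six-sided die is a hypothetical example chosen for illustration and does not come from the article; `pmf` and `cdf` are likewise illustrative names.

```python
from fractions import Fraction

# Hypothetical example: X is the face shown by a fair six-sided die.
# The pmf assigns a probability to each possible value of X.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def cdf(x):
    """P(X <= x): add up the pmf over every outcome not exceeding x."""
    return sum((p for face, p in pmf.items() if face <= x), Fraction(0))
```

The probabilities in a pmf always sum to 1, and the cdf climbs from 0 to 1 as `x` passes over the possible values; here `cdf(3)` is one half because three of the six equally likely faces are at most 3.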

They all got crowns

Debunking the dynamic of pitting pop divas against each other, with an exploratory data analysis on their strengths. If you, like me, are the kind of person that takes pop music seriously, chances are you have already been involved in heated discussions about pop divas. More often than not, arguments regarding this matter end up … Read more

How to Verify the Distribution of Data using Q-Q Plots?

Given a random distribution, we need to verify whether it is a normal (Gaussian) distribution or not. For clarity, we will name this unknown distribution X and a known normal distribution Y. Generate the unknown distribution X: X = np.random.normal(loc=50, scale=25, size=1000). Here we generate 1000 values from a normal distribution with mean=50 and standard deviation=25. … Read more
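The idea behind a Q-Q plot can be sketched without any plotting at all: sort the sample, pair each value with the matching quantile of a standard normal distribution, and check how close the pairs are to a straight line. The sketch below is an assumption-laden illustration rather than the article's code. It swaps NumPy for the standard library (`random.gauss` in place of `np.random.normal`), and `qq_points` and `pearson_r` are hypothetical helper names.

```python
import random
from statistics import NormalDist, mean


def qq_points(sample):
    """Pair each sorted sample value with a standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i + 0.5) / n keep every probability strictly inside (0, 1).
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def pearson_r(points):
    """Pearson correlation of the (theoretical, observed) pairs."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)  # fixed seed so the sketch is repeatable
X = [random.gauss(50, 25) for _ in range(1000)]  # same parameters as the excerpt
points = qq_points(X)
r = pearson_r(points)  # close to 1 when the points lie near a straight line
```

Passing `points` to a scatter plot gives the usual Q-Q picture; a correlation close to 1 is a rough numeric stand-in for "the points fall on a line".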

Structure from Motion

Stereo vision, Triangulation, Feature Correspondence, Visual SLAM. Structure from Motion (SFM) determines the spatial and geometric relationship of the target through the movement of the camera, and is a common method of 3D reconstruction. It only needs an ordinary RGB camera, so the cost is lower and the environment is less restricted, and … Read more

EC2 Hibernation feature is now available in the Africa (Cape Town) and Europe (Milan) AWS Regions

Hibernation requires an EC2 instance to be an encrypted Amazon EBS-backed instance. This ensures protection of sensitive contents in memory (RAM) as they get copied to EBS upon hibernation. You can now enable Amazon EBS Encryption by Default, to ensure all new EBS volumes created in your account are encrypted. Hibernation is available for On-Demand … Read more

Preventing lateral movement in Google Compute Engine

When you do have to directly expose a VM with an external IP address, ensure that your firewall rules restrict network access to only the ports and IP addresses that your application needs. Do: Assign private IP addresses to your VMs; don’t give them public IP addresses at all. Use IAP TCP forwarding to … Read more

Twitter analysis of the current political situation in Belarus

Then, using the spaCy library, I separated the most recent tweets (July 18–26, 2020) related to Lukashenko and his main opponents, Babariko, Tsepkalo and Tikhanovskaya. In order to use sentiment analysis and entity extraction libraries, I had to translate the tweets into English. I used the Google Translation API. Here is a simple way to do it: … Read more

Will Deep Learning Hit the Wall?

Better algorithms or more computing power? If you are interested in deep learning, then you may have already heard about a recent paper published by researchers from US, Korean and Brazilian universities and labs. Neil C. Thompson, MIT Computer Science and A.I. Lab, Kristjan Greenewald, MIT Initiative on the Digital Economy, Keeheon Lee, Underwood International College, Yonsei … Read more

Implementing SGD From Scratch

Custom Implementation of Stochastic Gradient Descent without SKlearn. Before implementing Stochastic Gradient Descent, let’s talk about what Gradient Descent is. The Gradient Descent algorithm is an iterative algorithm used to solve optimization problems. In almost every Machine Learning and Deep Learning model, Gradient Descent is actively used to improve the learning of our algorithm. … Read more
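As a rough illustration of the iterative update described in the excerpt, here is a minimal sketch of stochastic gradient descent fitting a one-feature linear model with squared-error loss. It is not the article's implementation: the function name `sgd_linear_fit`, the learning rate, the epoch count, and the toy data are all assumptions made for the example.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.1, epochs=2000, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent.

    Each update uses a single sample, with loss 0.5 * (prediction - y) ** 2.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a fresh random order each epoch
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of the per-sample loss with respect to w and b.
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Toy, noise-free data generated from y = 2x + 1 (an assumption for illustration).
xs = [i / 10 for i in range(10)]
ys = [2 * x + 1 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

On this noise-free data the estimates settle very close to the true slope and intercept; with noisy data a constant learning rate would leave them jittering around the optimum, which is why practical implementations often decay the learning rate.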

My journey as a Data Science Blogger

Future plans. My current goals are to keep publishing good quality articles once a week on Medium and my blog. I do not have a number of views or followers that I would like to reach. I would rather focus on writing good quality content. This is what I have been doing so far and … Read more

Beginner’s Guide to PyThaiNLP

Utility. Add the following import declaration at the top of your Python file. I will be using Jupyter Notebook for this tutorial. import pythainlp.util PyThaiNLP provides us with quite a lot of built-in functions. For example, you can use the following to determine if an input text is Thai. pythainlp.util.isthai("สวัสดี") # True Furthermore, you can even get … Read more