How to Use Different Data Models and Visual Representation of Databases

A beginner course on databases and SQL. As you get into databases and data science, the first thing you have to master is the relations between entities in your database. This matters because the data you use has to be organized efficiently for whatever is built on it later. Let’s …

Top 5 Natural Language Processing Python Libraries for Data Scientist.

A complete overview of popular Python libraries for natural language processing, in a non-verbose manner. Around 70 percent or more of the data available on the internet is not in a structured format. Since data is essential to data science, researchers have worked hard to push our limits beyond structured …

Generating Titles for Kaggle Kernels with LSTM

Small Deep Learning Project with PyTorch. When I first found out about sequence models, I was amazed by how easily we can apply them to a wide range of problems: text classification, text generation, music generation, machine translation, and others. In this article, I would like to focus on the step-by-step process of creating a …

Understanding Input and Output shapes in Convolution Neural Network | Keras

Let’s look at the following code snippet (Snippet-1). Don’t get tricked by the input_shape argument here. Though it looks like our input shape is 3D, you have to pass a 4D array when fitting the data, shaped like (batch_size, 10, 10, 3). Since there is no batch size value in …
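The shape rule in the excerpt above can be sketched without Keras itself. The helper below is hypothetical (with_batch_dim is not a Keras function); it only illustrates, under that assumption, how a per-sample input_shape of (10, 10, 3) relates to the 4D shape of the array passed at fit time.

```python
from typing import Optional, Tuple


def with_batch_dim(input_shape: Tuple[int, ...],
                   batch_size: Optional[int] = None) -> Tuple[Optional[int], ...]:
    """Prepend the batch dimension to a per-sample shape.

    Hypothetical helper for illustration only. None stands for a batch
    size that is not fixed in advance.
    """
    return (batch_size,) + tuple(input_shape)


per_sample = (10, 10, 3)  # what input_shape describes: a single sample

print(with_batch_dim(per_sample))      # (None, 10, 10, 3)
print(with_batch_dim(per_sample, 32))  # (32, 10, 10, 3)
```

The 3D tuple describes one sample; the array used for fitting carries one extra leading axis for the batch.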

Basics of AI Product Management: Orchestrating the ML Workflow

The theory of ML is hard, the application is even harder! I’ve spent the last few years applying data science in different aspects of business. Some use cases are internal machine learning (ML) tools, analytics reports, data pipelines, prediction APIs, and more recently, end-to-end ML products. I’ve had my fair share of successful and unsuccessful …

Using Spark from R for performance with arbitrary code – Part 1 – Spark SQL translation, custom functions, and Arrow

Apache Spark is a popular open-source analytics engine for big data processing and thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users. This series of articles will attempt to provide practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining …

Midnight Hack Episode 1: Visualizing my Swiggy Order History

It’s 11 pm, and as I’m on my way to order multiple midnight meals, I thought to myself: I don’t really know my ordering habits. Is there a trend? Can I derive some meaningful insights from my ordering history? As I browse through the dessert options on Swiggy (an Indian food order and delivery …

Easy Bar Charts from Simple to Sophisticated

Tell your story with data visualizations. Imagine the simplest code possible to generate data visualizations. Let’s start with a visualization we have all seen, and all need, the bar chart. [Figure: Beach bar chart, horizontal orientation.] Each bar in a bar chart represents a category, or level, of a variable with relatively few unique values, such …

Importance Sampling Introduction

Estimate expectations from a different distribution. Importance sampling is an approximation method rather than a sampling method. It derives from a small mathematical transformation that reformulates the problem in another way. In this post, we are going to: learn the idea of importance sampling; get a deeper understanding by implementing the process; compare results …
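The idea in the excerpt above can be sketched as follows. This is a minimal illustration rather than the post's own implementation: the target p = N(0, 1), the proposal q = N(0, 2^2), and the test function x^2 are all assumptions chosen so that the true answer (1.0) is known. Samples are drawn from q and reweighted by p(x) / q(x).

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def importance_estimate(f, n, seed=0):
    """Estimate E_p[f(X)] for p = N(0, 1) using samples from q = N(0, 2^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)  # draw from the proposal q, not from p
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # p(x) / q(x)
        total += f(x) * weight
    return total / n


# E_p[X^2] is the variance of N(0, 1), so the estimate should land near 1.0.
print(importance_estimate(lambda x: x * x, 100_000))
```

A proposal wider than the target keeps the weights bounded here; a proposal with lighter tails than the target would make the estimate unstable.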

How To Deploy A Neural Network From Beirut

Beirut is Lebanon’s gorgeous capital and comes with the typical problems of a bustling city. On top of that it suffers from frequent power cuts and one of the slowest internet connections in the world. It is also where I spent my summer vacation and an ideal testing ground for the purpose of this article: …

Implementing Prophet Time Series Forecasting Model

A step-by-step approach to predicting the Bitcoin price, for dummies. Understanding time series data is critical to any kind of business. If you are working with numbers and analytics, more often than not you will need to answer questions like how many customers will continue buying in …

‘There is a game I play’ – Analyzing Metacritic scores for video games

[This article was first published on Rcrastinate, and kindly contributed to R-bloggers]. There is a game I play / try to make myself okay / …

Explaining Predictions: Random Forest Post-hoc Analysis (randomForestExplainer package)

[This article was first published on R on notast, and kindly contributed to R-bloggers]. Similar to the previous posts, the Cleveland heart dataset will be …

Great Developers Never Stop Learning

Proofs of Concept (POCs). As an architect I need to justify technical project decisions, so I resort to developing POCs. They help me experience the challenges and benefits of the technology in question so that I can provide forward-looking research, and they make me better at estimating (and at not trivialising how long ‘easy’ tasks …

Reflections on Applying Data Science for Financial Freedom

At the time of writing this article, it’s a casual Friday due to the long weekend, so I thought of doing a nice little reflection article on my thoughts of applying data science for financial freedom. Sorry to my normal peeps that wanted some coding samples. Today’s article is going to feel like a relaxing …

Why the current AI gold rush must not fail

How our investment in the field has made it too important to fail. Everyone is talking about the impending dangers of artificial intelligence. From machines taking over our jobs to Stephen Hawking’s fear of the existential threat they pose to mankind, there are a lot of people talking about what will happen if the current race …

4 Lessons Learned From a Prospective Data Strategist

The revolution of data science has always been predicated on the power of data as a strategic decision-maker. Interestingly enough, no matter how long I stare at my SQL table, I can’t hear it telling me what to do. This attacks the fundamental misconception that data itself is some kind of Oracle able to tell …

Cryptocurrency Analysis with Python — Buy and Hold

In this part, I am going to analyze which coin (Bitcoin, Ethereum or Litecoin) was the most profitable in the last two months using a buy-and-hold strategy. We’ll go through the analysis of these 3 cryptocurrencies and try to give an objective answer. To run this code, download the Jupyter notebook. Bitcoin, Ethereum, and …
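A buy-and-hold comparison like the one described above comes down to one return calculation per asset. The sketch below is an illustration only: the asset names and prices are made-up placeholders, not real market data and not figures from the article.

```python
def buy_and_hold_return(start_price, end_price):
    """Fractional return from buying at start_price and selling at end_price."""
    return (end_price - start_price) / start_price


# Placeholder (start, end) prices for three hypothetical assets.
prices = {
    "coin_a": (100.0, 120.0),
    "coin_b": (50.0, 55.0),
    "coin_c": (10.0, 9.0),
}

returns = {name: buy_and_hold_return(start, end)
           for name, (start, end) in prices.items()}
best = max(returns, key=returns.get)

for name, value in returns.items():
    print(f"{name}: {value:+.1%}")
print("most profitable:", best)
```

Comparing fractional returns rather than absolute price changes puts assets with very different price levels on the same scale.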

Enhancing Static Plots with Animations

Using gganimate to spice up ggplot2 visualisations. This post aims to introduce you to animating ggplot2 visualisations in R using the gganimate package by Thomas Lin Pedersen. The post will visualise the theoretical winnings I would’ve had, had I followed the simple model to predict (or tip as it’s known in Australia) winners in the …

Lesser known dplyr functions

The dplyr package is an essential tool for manipulating data in R. The “Introduction to dplyr” vignette gives a good overview of the common dplyr functions (list taken from the vignette itself): filter() to select cases based on their values. arrange() to reorder the cases. select() and rename() to select variables based on their names. mutate() and transmute() to add new variables that …

Amazon SageMaker Notebooks now export Jupyter logs to Amazon Cloudwatch

With this launch, you no longer need to log into your notebook terminal to access logs and can instead view and analyze the logs directly from CloudWatch. You can use the built-in functionality of CloudWatch to detect anomalies and also set alarms to be automatically notified based on specific conditions. Also, you have the benefit …

Problems in Machine Learning Models? Check your Data First

A brief analysis of cleaning data before it is fed into machine learning algorithms. Data preprocessing is often said to be the most important part of a machine learning algorithm. It is often said that machine learning algorithms will get burst wide open if you …

Seeking postdoc (or contractor) for next generation Stan language research and development

The Stan group at Columbia is looking to hire a postdoc* to work on the next generation compiler for the Stan open-source probabilistic programming language. Ideally, a candidate will bring language development experience and also have research interests in a related field such as programming languages, applied statistics, numerical analysis, or statistical computation. The language …

Container monitoring for Amazon ECS, EKS, and Kubernetes is now available in Amazon CloudWatch

CloudWatch Container Insights helps you troubleshoot infrastructure and performance issues in your container environment to increase development velocity. It’s easy to get started. Start collecting detailed performance metrics, logs, and metadata from your containers and clusters in just a few clicks by following the steps in the CloudWatch Container Insights documentation.

Cracking an 82-year-old stock trading board game using Monte Carlo simulation

Board games are fun. Stock trading is fun. Putting them together, we get the 1937 classic from Copp-Clark Publishing, Stock Ticker. The core gameplay is simple: buy and sell stocks from the broker in a fluctuating market and try to finish with more total assets than everyone around the table. If you’ve never heard of …

Jacobian regularization

Generalization of L1 and L2 regularization. L1 and L2 regularization, also known as Lasso and Ridge, are well-known regularization techniques, used for a variety of algorithms. The idea of these methods is to impose smoothness of the prediction function and avoid overfitting. Consider this example of Polynomial Regression: In this example we fit polynomials …
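For reference, the two penalties named in the excerpt above can be written in a few lines. This is a generic sketch, not the article's code: the function names and the strength parameter lam are assumptions made for illustration. The L1 (Lasso) penalty sums absolute weights, the L2 (Ridge) penalty sums squared weights, and either is added to the data-fit loss.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge-style penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def penalized_loss(data_loss, weights, lam, kind="l2"):
    """Data-fit loss plus the chosen regularization penalty."""
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    if kind == "l2":
        return data_loss + l2_penalty(weights, lam)
    raise ValueError("kind must be 'l1' or 'l2'")


weights = [1.0, -2.0, 3.0]  # e.g. coefficients of a fitted polynomial
print(l1_penalty(weights, 0.1))
print(l2_penalty(weights, 0.1))
print(penalized_loss(2.0, weights, 0.1, kind="l2"))
```

Large coefficients, which are typical of an overfitted high-degree polynomial, raise both penalties, so minimizing the penalized loss favors smoother fits.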

Why R?

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. I was working with our copy editor on Appendix A …

Installing Apache PySpark on Windows 10

Apache Spark Installation Instructions for Product Recommender Data Science Project. Over the last few months, I was working on a Data Science project which handles a huge dataset and it became necessary to use the distributed environment provided by Apache PySpark. I struggled a lot while installing PySpark on Windows 10. So I decided to …

AWS DataSync is now available in the Middle East (Bahrain) Region

DataSync is an online data transfer service that provides you a simple way to automate and accelerate copying data over the Internet or AWS Direct Connect between Network File System (NFS) or Server Message Block (SMB) file servers, Amazon Simple Storage Service (Amazon S3) buckets, and Amazon Elastic File System (Amazon EFS) file systems. You …

Amazon EKS Available in Bahrain Region

Amazon EKS is a highly available, scalable, and secure Kubernetes service. Amazon EKS runs the Kubernetes management infrastructure (control plane) for you and is certified Kubernetes conformant, so you can use existing tooling and plugins from the Kubernetes community and AWS partners.

Flower Species Classifier

Build an image classifier to recognize 102 different species of flowers. • Artificial Intelligence • Deep Learning • Convolutional Neural Networks • Python • PyTorch • Numpy • Matplotlib • Jupyter Notebooks. In this article, I give an overview of the project I developed that led me to be awarded a scholarship to the Deep Learning …

It is Time for CRAN to Ban Package Ads

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. NPM (a popular JavaScript package repository) just banned package advertisements. …

NLP Text Preprocessing: A Practical Guide and Template

Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better. To illustrate the importance of text preprocessing, let’s consider a task on sentiment analysis for customer reviews. Suppose a customer gave the feedback that “their customer support service …

Break up with Excel: Intro and Advanced R Data Science Courses at MSACL.org Salzburg Austria, September 21–23, 2019

[This article was first published on The Lab-R-torian, and kindly contributed to R-bloggers]. MSACL Conference. There are two RStats Data Science courses happening in Salzburg …

New release of Cloud Storage Connector for Hadoop: Improving performance, throughput and more

Cloud Storage Connector is an open source Apache 2.0 implementation of an HCFS interface for Cloud Storage. Architecturally, it is composed of four major components: In the following sections, we highlight a few of the major features in this new release of Cloud Storage Connector. For a full list of settings and how to use …

How to quickly solve machine learning forecasting problems using Pandas and BigQuery

We pass the table name that contains our data, the value name that we are interested in, the window size (which is the input sequence length), the horizon of how far ahead in time we skip between our features and our labels, and the labels_size (which is the output sequence length). Labels size is equal …
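The three windowing parameters described above can be illustrated on a plain Python list. This is a sketch under stated assumptions, not the article's BigQuery query: make_windows is a hypothetical helper, and horizon is assumed to count steps from the last feature to the first label, so horizon=1 means the labels start immediately after the window.

```python
def make_windows(values, window_size, horizon, labels_size):
    """Slice a sequence into (features, labels) training pairs.

    window_size: number of consecutive inputs per example.
    horizon:     steps from the last feature to the first label
                 (assumed convention: 1 means "the very next value").
    labels_size: number of consecutive outputs per example.
    """
    examples = []
    span = window_size + horizon - 1 + labels_size  # values consumed per example
    for start in range(len(values) - span + 1):
        features = values[start:start + window_size]
        label_start = start + window_size + horizon - 1
        labels = values[label_start:label_start + labels_size]
        examples.append((features, labels))
    return examples


series = list(range(10))
for features, labels in make_windows(series, window_size=3, horizon=1, labels_size=2):
    print(features, "->", labels)
```

Raising horizon leaves a gap between the inputs and the targets, which matches forecasting setups where the most recent values are not yet available at prediction time.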

Expanding your patent set with ML and BigQuery

2. Organize the seed set. With the input set determined and the embedding representations retrieved, you have a few options for determining similarity to the seed set of patents. Let’s go through each of the options in more detail. 1. Calculating an overall embedding point (centroid, medoid, etc.) for the entire input set and performing similarity to …

Kubernetes security audit: What GKE and Anthos users need to know

Performing this security audit was a big effort on behalf of the CNCF, which has a mandate to improve the security of its projects via its Best Practices Badge Program. To take Kubernetes through this first security audit, the Kubernetes Steering Committee formed a working group, developed an RFP, worked with vendors, reviewed and then …

From scratch to search: setup Elasticsearch under 4 minutes, load a CSV with Python and read…

{"_index": "test-csv", "_type": "_doc", "_id": "1", "_version": 1, "result": "created", "_shards": {"total": 3, "successful": 3, "failed": 0}, "_seq_no": 0, "_primary_term": 1} Document indexing… checked! We have indexed our first document to our test-csv index, all shards responded correctly. We have indexed a very simple JSON document with only one field, but you can …
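The check described above ("all shards responded correctly") can also be done in code rather than by eye. The sketch below parses the response shown in the excerpt with the standard json module; indexing_succeeded is a hypothetical helper, and treating "created with zero failed shards" as success is an assumption, not an Elasticsearch rule.

```python
import json

# The indexing response from the excerpt, as a JSON string.
RESPONSE = (
    '{"_index": "test-csv", "_type": "_doc", "_id": "1", "_version": 1, '
    '"result": "created", "_shards": {"total": 3, "successful": 3, "failed": 0}, '
    '"_seq_no": 0, "_primary_term": 1}'
)


def indexing_succeeded(response_text):
    """Return True if the document was created and no shard reported a failure."""
    body = json.loads(response_text)
    shards = body["_shards"]
    return body["result"] == "created" and shards["failed"] == 0


body = json.loads(RESPONSE)
print(body["_index"], body["_id"], body["result"])
print("shards:", body["_shards"]["successful"], "of", body["_shards"]["total"])
print("ok:", indexing_succeeded(RESPONSE))
```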

SVD: Where Model Tuning Goes Wrong

1 — Dataset Prerequisites: from surprise import Dataset; data = Dataset.load_builtin('ml-100k'). Surprise is a scikit package for building and analysing recommender systems maintained by Nicolas Hug. Reading its documentation page, an objective of the package is to “alleviate the pain of dataset handling”. One way it does so is through built-in datasets. Movie-Lens 100k is one …

A to Z of SVM — Machine Learning For Everyone

We aim to provide the simplest effective understanding of machine learning algorithms with simple theory, simple math and simple code. We are going to cover the following topics: 1- Overview 2- Introduction 3- How does SVM Work 4- Support Vectors & Margin 5- Linear & Non-Linear SVM 6- Hard Margin & Soft Margin in SVM …