The ‘Ingredients’ of Machine Learning Algorithms

The components that most machine learning algorithms have in common. What’s a cost function, optimization, a model, or an algorithm? The esoteric nuances of machine learning algorithms and terminology can easily overwhelm the machine learning novice. As I was reading the Deep Learning book by Yoshua Bengio, Aaron Courville, …

Paper review: DenseNet - Densely Connected Convolutional Networks

CVPR 2017, Best Paper Award winner. Dense connections. “Simple models and a lot of data trump more elaborate models based on less data.” (Peter Norvig) ‘Densely Connected Convolutional Networks’ received the Best Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017. The paper can be read here. The …

How to visualize data on top of a map in python using the geoviews library

For the purposes of this tutorial, we are going to make a plot to visualize the passenger volume for the busiest airports in my country, Greece, and the neighboring country, Turkey, for comparison. First, we need to import the libraries and the methods we are about to use. import pandas as pd; import numpy as …

A brief intro to the Central Limit Theorem

According to Wikipedia: in probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. Translation: If you take enough samples from a population, the …
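To make the teaser above concrete, here is a minimal simulation sketch. It is not taken from the article: the exponential population, the sample size of 50, and the helper name `sample_means` are all assumptions chosen for illustration. It draws many samples from a clearly non-normal population and looks at how the sample means behave.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples samples (with replacement) and return the mean of each."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

# Hypothetical, strongly skewed population (exponential, not bell-shaped)
rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(10_000)]

means = sample_means(population, sample_size=50, n_samples=2_000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# The sample means cluster around the population mean, with a spread
# close to pop_sd / sqrt(sample_size), as the CLT suggests.
print("population mean:      ", round(pop_mean, 3))
print("mean of sample means: ", round(statistics.fmean(means), 3))
print("expected spread:      ", round(pop_sd / 50 ** 0.5, 3))
print("observed spread:      ", round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` should show the familiar bell shape, even though the population itself is heavily skewed.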

Using TF-IDF to form descriptive chapter summaries via keyword extraction.

TF-IDF is a natural language processing technique useful for the extraction of important keywords within a set of documents or chapters. The acronym stands for “term frequency-inverse document frequency” and describes how the algorithm works. The dataset: As our dataset, we shall take the script of Mary Shelley’s Frankenstein (provided by Project …
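As a rough illustration of what “term frequency-inverse document frequency” means, the sketch below scores words by hand. It is an assumption-laden toy, not the article's code: it uses the plain unsmoothed variant (tf = count / document length, idf = log(N / document frequency)), whitespace tokenization, and three made-up sentences standing in for chapters. The function names `tf_idf` and `top_keywords` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

def top_keywords(scores, k=3):
    """Pick the k highest-scoring terms of each document."""
    return [sorted(s, key=s.get, reverse=True)[:k] for s in scores]

# Made-up stand-ins for chapters (not text from the novel)
chapters = [
    "the creature fled across the ice",
    "the letter reached the sister in england",
    "the creature demanded a companion",
]

scores = tf_idf(chapters)
print(top_keywords(scores))
```

A word such as “the” appears in every document, so its inverse document frequency is log(1) = 0 and it never surfaces as a keyword; words unique to one chapter score highest. Library implementations often add smoothing, so their numbers will differ from this sketch.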

The Easy Way to Extend Pandas API

In this article, you’ll learn how to tailor the pandas API to your business, research, or personal workflow by using pandas_flavor. Pandas-flavor is a library that introduces an API for extending pandas. This API handles the boilerplate code for registering custom accessors onto pandas objects. There are plenty of examples of extensions in the wild including: …

Cleaning Web-Scraped Data with Pandas (Part II)

As I mentioned in my previous post, cleaning data is a prerequisite to machine learning. Measuring the sanity of your data can also give you a good indication of how precise or accurate your model would be. When it comes to web-scraped data, you would often lose a lot of information in the process of …

Utilize Your Self-Imposed Deadlines | Punch Today in The Face

The art of creating self-imposed deadlines is crucial not only to go above and beyond meeting requirements, but also to make our progress toward small and big goals a lot smoother. Love them or hate them, deadlines are incredibly motivational! However, here are two things we should try to avoid when dealing …

Dangerous streets of Bratislava! Animated maps using open data in R

[This article was first published on Peter Laurinec, and kindly contributed to R-bloggers]. At work recently, I wanted to make some interesting start-up pitch …

Steps to basic modern NN model from scratch

After we have defined the matrix multiplication strategy, it’s time to define the ReLU function and the forward pass for the neural network. I would ask readers to go through Part 1 of the series for background on the data used below. The neural network is defined as below: output …

future 1.15.0 – Lazy Futures are Now Launched if Queried

[This article was first published on JottR on R, and kindly contributed to R-bloggers]. No dogs were harmed while making this release. future 1.15.0 is …

A Quick Short Look Into Bootstrapping

Big Questions: After an A/B test, to what extent can we trust that our small sample represents the entire population of our customers? If we repeatedly draw samples of the same size, how would our estimates vary? If we obtain different estimates after repeated sampling, can we gauge the distribution of the population? If we don’t know …
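The questions above are what bootstrapping tries to answer. The following is only a sketch under stated assumptions: the ten order values are invented, the statistic is the mean, and the interval uses the simple percentile method. The helper names `bootstrap_means` and `percentile_interval` are hypothetical.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=5_000, seed=0):
    """Resample the data with replacement and record the mean each time."""
    rng = random.Random(seed)
    n = len(sample)
    return [
        statistics.fmean(rng.choices(sample, k=n))
        for _ in range(n_resamples)
    ]

def percentile_interval(values, confidence=0.95):
    """Simple percentile interval over the bootstrap estimates."""
    ordered = sorted(values)
    alpha = (1 - confidence) / 2
    lower = ordered[round(alpha * len(ordered))]
    upper = ordered[round((1 - alpha) * len(ordered)) - 1]
    return lower, upper

# Invented sample, e.g. order values observed for one A/B test variant
sample = [12.0, 15.5, 9.8, 22.1, 13.4, 17.9, 11.2, 14.6, 19.3, 10.7]

means = bootstrap_means(sample)
low, high = percentile_interval(means)
print("sample mean:", round(statistics.fmean(sample), 2))
print("95% bootstrap interval:", round(low, 2), "to", round(high, 2))
```

The width of the interval gives a feel for how much the estimate would vary if the experiment were repeated with a sample of the same size.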

Reading in Data

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers]. Here’s a common situation: you have a folder full of similarly-formatted …

Central Limit & Large Numbers

If you’re into math equations, let us now turn to formal representations of the theorems in order to understand their claims and the relationship between the two a bit more precisely. Let X₁, X₂, ..., Xₙ be independent and identically distributed random variables with expected value μ and finite variance σ². Then √n (X̄ₙ − μ) / σ converges towards the Standard Normal Distribution in …

How to understand Numpy documentation

When we start to learn Data Science, Machine Learning, Deep Learning, or any other exciting field that uses Python as its programming language, most of us will probably be using numpy as well. In this post, I will cover numpy basics and how to read the documentation properly, based on my experience of using …

Web Scrape Twitter by Python Selenium (Part 1)

Beginning of the tutorial. PS: If you are a beginner, I suggest you work in Jupyter Notebook first, because you will face more errors than ever before. By using Jupyter Notebook you can run the script step by step so that you know where the problem is. Access the Twitter front page. The first step is to …

Predicting Heart Disease Mortality

Building a machine learning model that can identify high-risk states in 2019. According to the Center for Disease Control, “About 610,000 people die of heart disease in the United States every year–that’s 1 in every 4 deaths.” It is unlikely anyone reading this hasn’t been affected by this disease in some way. I, myself, lost …

Reduce Memory Usage and Make Your Python Code Faster Using Generators

A hands-on guide to creating iterators in a Pythonic manner. When I started learning about Python generators, I had no idea how important they would turn out to be. They have helped me immensely while writing custom functions throughout my machine learning journey. Generator functions allow you to …
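For readers who have not met generators yet, here is a small sketch of the idea (my own toy example, not the article's code; the function name `squares` is hypothetical). A generator function produces values one at a time with `yield`, so it never has to hold the whole sequence in memory.

```python
import sys

def squares(n):
    """Yield the squares 0, 1, 4, ... one at a time instead of building a list."""
    for i in range(n):
        yield i * i

# Eager version: every value is computed and stored up front
eager = [i * i for i in range(100_000)]

# Lazy version: nothing is computed until the values are requested
lazy = squares(100_000)

# getsizeof reports only the container itself, but the gap is already large
print("list object size:     ", sys.getsizeof(eager), "bytes")
print("generator object size:", sys.getsizeof(lazy), "bytes")

# A generator can be consumed by any code that expects an iterable
print("sum of squares:", sum(squares(100_000)))
```

Note that a generator can only be iterated once; call the function again to get a fresh one.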

5 Minute Guide to Detecting Holidays in Python

With Pandas, it’s fairly straightforward to construct a list of dates, let’s say for the whole year of 2019: Great. Now we can construct a DataFrame object from those dates; let’s put them into a Dates column: Now here comes a slight problem. The dates look to be stored in a string format, just like …
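The code from the teaser above did not survive on this page, so what follows is a guess at its general shape rather than the author's version. It assumes the dates start out as "YYYY-MM-DD" strings and uses the column name Dates mentioned in the text; the variable names are mine.

```python
import pandas as pd

# Every calendar day of 2019, formatted as "YYYY-MM-DD" strings
date_strings = pd.date_range(start="2019-01-01", end="2019-12-31").strftime("%Y-%m-%d")

# Put the dates into a DataFrame column named "Dates"
df = pd.DataFrame({"Dates": date_strings})
print(df["Dates"].dtype)  # still a string-like dtype at this point

# Convert the strings to real datetime values so date logic works on them
df["Dates"] = pd.to_datetime(df["Dates"])
print(df["Dates"].dtype)

# With a datetime column, accessors such as day_name() become available
print(df["Dates"].dt.day_name().head(3).tolist())
```

Once the column holds datetime values, it can be compared against a holiday calendar or filtered by weekday.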

Using Spark from R for performance with arbitrary code – Part 4 – Using the lower-level invoke API to manipulate Spark’s Java objects from R

[This article was first published on Jozef’s Rblog, and kindly contributed to R-bloggers]. In the previous parts of this series, we have shown how to …

A Quick Primer on Databricks Koalas

Interact with Spark Dataframes with Pandas vocabulary. In a project of mine, I extensively used Spark to manage working with some large data files. Though Spark is best known for its benefits on large distributed systems, it works equally well locally for projects working with large …

How to code effectively without dying in the attempt

1. Find a comfortable working space. Most programming and coding jobs are flexible enough to allow working from home, a common space, a library, or even a coffee shop, without having to be at an office 8 hours a day, 5 days a week. However, the working environment will always have a highly significant …

Design of Experiments for Your Change Management

A step-by-step guide to Design of Experiments. Data science professionals, have you ever faced any of the following challenges? Story 1: Machine learning does not mean experimental design. You are asked to design an experiment because of your statistical expertise, but realize that your machine learning tools do not help you design one. Story 2: …

Let’s calculate Z-scores for Airbnb prices in New York

The Z-score, also called the standard score, according to Wikipedia: in statistics, the standard score is the signed fractional number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured. Translation: a measure of how far a value is from its population …
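The definition above boils down to z = (x − mean) / standard deviation. Here is a minimal sketch using the standard library; the five nightly prices are invented for illustration and are not from the Airbnb dataset, and `z_scores` is a hypothetical helper name.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / population standard deviation for every value."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Invented nightly prices in dollars
prices = [80, 120, 150, 95, 300]

scores = z_scores(prices)
for price, z in zip(prices, scores):
    print(f"${price}: z = {z:+.2f}")
```

A positive score means the price is above the mean, a negative score below it, and the magnitude says by how many standard deviations. The sketch uses the population standard deviation (`pstdev`); using the sample version (`stdev`) would give slightly smaller magnitudes.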

Why Companies Are Using Data Science and Analytics to Inform Benefits Packages

Employee benefits packages can help candidates choose to take job offers or look elsewhere. They can also factor into how long a worker stays at a company and how happy they are while there. If they realize that other companies offer better benefits and they’re frustrated with their job already, they may decide it’s not …

Integrating Python & Tableau

Bring your analyses to life with engaging data visualizations. When performing in-depth analyses on large and unstructured datasets, the power of Python and relevant machine learning libraries cannot be overstated. Matplotlib serves as a great tool to help us visualize results, but its stylization options are not always optimal for use in presentations and dashboards. …

Why your AI might be racist and what to do about it

Individually reasonable correlations can cause an AI to gain a racial bias. Even well-designed AI systems can still end up with a bias. This bias can cause the AI to exhibit racism, sexism, or other types of discrimination, entirely by accident. This is usually considered a political problem, and ignored by scientists. The result is …

An Alternative To Batch Normalization

The development of Batch Normalization (BN) as a normalization technique was a turning point in the development of deep learning models; it enabled various networks to train and converge. Despite its great success, BN exhibits drawbacks that are caused by its distinct behavior of normalizing along the batch dimension. One of the major disadvantages of BN …

Managing virtual environment with pyenv

Most Python developers and data scientists have already heard of virtual environments. However, managing tens of environments created for different projects can be daunting. pyenv will help you streamline the creation, management, and activation of virtual environments. In the old days, before virtualenv became popular, I would keep a single global workspace for all …

Learning Linux – the wrong way – day 2

[This article was first published on HighlandR, and kindly contributed to R-bloggers]. Unborking the borked laptop – Recap. I’m trying to learn some Linux. Ostensibly …

Instrumental variable regression and machine learning

Intro. Just as the question “what’s the difference between machine learning and statistics” has spilled a lot of ink (since at least Breiman (2001)), the same question with statistics replaced by econometrics has led to a lot of discussion as well. I like this presentation by Hal Varian from almost 6 years ago. …

Amazon EC2 now supports Microsoft SQL Server 2019

Amazon EC2 now supports Microsoft SQL Server 2019, the latest release of Microsoft SQL Server. When you run SQL Server 2019 on Amazon EC2, you benefit from the scale, performance, and elasticity of the AWS Cloud, while leveraging the latest features available in Microsoft SQL Server 2019 such as enhanced PolyBase and intelligent query processing. …

Amazon CloudWatch launches cross-account cross-region dashboards

Amazon CloudWatch now includes cross-account cross-region dashboards, which enable you to create high level operational dashboards, and with one click, drill down into more specific dashboards in different AWS accounts without having to log in and out of different accounts or switch AWS Regions. It is intended for centralized operations teams, DevOps engineers, and service …

Introduction to Spark NLP: Foundations and Basic Components

As a native extension of the Spark ML API, the library offers the capability to train, customize, and save models so they can run on a cluster or other machines, or be saved for later. It is also easy to extend and customize models and pipelines, as we’ll cover in detail during this article series. Spark NLP …

Best Practices for NLP Classification in TensorFlow 2.0

Use Data Pipelines, Transfer Learning and BERT to achieve 85% accuracy in Sentiment Analysis. When I first started working with Deep Learning, I went through Coursera and fast.ai courses, but afterwards I wondered where to go from there. I started asking questions like “How do I develop a data …

Using K-Means Clustering Algorithm to Redefine NBA Positions and Explore Roster Construction

Conventional positions within the NBA do not accurately reflect the playing style or functional role a player provides to their team. The overall style of play has changed drastically, and the various eras within the NBA indicate that. Similarly, a player’s style of play is also reflective of this change. Currently the league is fast paced …

The City of the Homeless: Humanitarian Crisis on the Streets of Los Angeles

The dominant narrative around who is living on the street, and why it is so difficult to help them, is that people experiencing homelessness are all drug addicts and/or severely mentally ill. This damaging narrative dehumanizes people experiencing homelessness in a cynical attempt to justify inaction. But it is also factually incorrect: according …

Cloud Risk Assessment through Data- log analysis in AWS

https://aws.amazon.com/getting-started/projects/analyze-big-data/ These are the high-level steps: (Note: An AWS account setup is a pre-requisite. If you try this out, ensure that clusters and buckets are deleted after use to avoid additional charges). Sample data is loaded; in real-life projects the relevant dataset would replace this. Launch a Hadoop cluster using Amazon EMR [Elastic Map Reduce], …

Machine Learning and Data Analysis — Inha University (Part-2)

Welcome to the second part of the Machine Learning and Data Analysis series, based on a graduate course offered by Inha University, Rep. of Korea. In this part, we will discuss data structures in Python. However, if you are viewing this for the first time, we encourage you to read the first part first where …

Automatic Speech Recognition as a Microservice on AWS

Let’s quickly get back to our LAB work and implement this highly-complex piece of work in a few easy steps. At this point, you should have your EC2 up and be SSHed into it. Please refer to the GitHub repository for any missing resources/links. In your home directory [/home/ec2-user], maintain the following directory structure D -> …

How to Write Python Command-Line Interfaces like a Pro

As data scientists, we face many repetitive and similar tasks. That includes creating weekly reports, executing extract, transform, load (ETL) jobs, or training models using different parameter sets. Often, we end up having a bunch of Python scripts, where we change parameters in code every time we run …

Let’s build an Intelligent chatbot

Modern chatbots do not rely solely on text, and will often show useful cards, images, links, and forms, providing an app-like experience. Depending on the way bots are programmed, we can categorize them into two variants of chatbots: Rule-Based (dumb bots) & Self Learning (smart bots). Rule-Based Chatbots: This variety of bot answers questions based on …