Feature engineering

Feature engineering is the process of transforming raw, unprocessed data into a set of targeted features that best represent your underlying machine learning problem. Engineering thoughtful, optimized data is the vital first step. In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition. This … Read more

Analyzing CNET’s Headlines

Exploring the news published on CNET using Python and Pandas Photo by M. B. M. on Unsplash I wrote a crawler to scrape the news headlines from CNET’s sitemap and decided to perform some exploratory analysis on it. In this post, I will walk you through my findings, some anomalies and some interesting insights. You … Read more

3 Machine Learning Books that Helped me Level Up

Source: Pixabay There is a Japanese word, tsundoku (積ん読), which means buying and keeping a growing collection of books, even though you don’t really read them all. I think we Developers and Data Scientists are particularly prone to falling into this trap. Personally, I even hoard bookmarks: my phone’s Chrome browser has so many open … Read more

Exploring the Tokyo Neighborhoods: Data-Science in Real Life

3. Visualization and Data Exploration: 3.1. Folium Library and Leaflet Map: Folium is a Python library that creates interactive Leaflet maps from coordinate data. Since I am interested in restaurants as popular spots, I first create a data frame containing the rows of the previous data frame whose ‘Venue_Category’ column contains the word ‘Restaurant’. I used the following snippet of … Read more

Analyzing Employee Reviews: Google vs Amazon vs Apple vs Microsoft

Which company is it worth working for? Overview Whether it is for their ability to offer high salaries, extravagant perks, or their exciting mission statements, it is clear that top companies like Google and Microsoft have become talent magnets. To put it into perspective, Google alone receives more than two million job applications each year. Working … Read more

Understand the problem statement to optimize your code

Python Shorts How Understanding the problem statement could help you to optimize your code Photo by Helloquence on Unsplash Whenever we talk about optimizing code we always discuss the computational complexity of the code. Is it O(n) or O(n-squared)? But, sometimes we need to look beyond the algorithm and look at how the algorithm is going to … Read more

Speed Up Your Exploratory Data Analysis With Pandas-Profiling

Get an intuition of your data’s structure with just one line of code Source: https://unsplash.com/photos/gts_Eh4g1lk Introduction When importing a new data set for the very first time, the first thing to do is to get an understanding of the data. This includes steps like determining the range of specific predictors, identifying each predictor’s data type, as … Read more

How to use Python features in your data analytics project

Python tutorial in Azure using OO, NumPy, pandas, SQL, PySpark 1. Introduction A lot of companies are moving to the cloud and considering which tooling to use for data analytics. On-premises, companies mostly use proprietary software for advanced analytics, BI and reporting. However, this tooling may not be the most logical choice in a cloud environment. … Read more

Classifying Products as Banned Or Approved using Text Mining- Part II

In this part, we will explain how to optimize the existing Machine Learning model from Part I and how to deploy it using Flask. Connecting the dots: moving from M to L in Machine Learning In the previous article of this series, we discussed the business problem, showed how to train the model using … Read more

Importance of Choosing the Correct Hyper-parameters while defining a model

Often considered the trickiest part of optimizing a Machine Learning algorithm, correct hyperparameter tuning can save a lot of time and help deploy the model faster. All of us Machine Learning aficionados must have participated in hackathons at some time or other to test our Machine Learning skills. Well, some problem statement that we need to solve … Read more

ColumnTransformer Meets Natural Language Processing

Photo credit: Pixabay How to combine several feature extraction mechanisms or transformations into a single transformer in a scikit-learn pipeline Since publishing several articles on text classification, I have received inquiries on how to deal with mixed input feature types, for example, how to combine numeric, categorical and text features in a classification or regression model. … Read more

Advanced candlesticks for machine learning (i): tick bars

1. — Introduction In a previous article we explored why traditional time-based candlesticks are not the most suitable price data format if we are planning to train a machine learning (ML) algorithm. Namely: (1) time-based candlesticks over-sample low activity periods and under-sample high activity periods, (2) markets are increasingly controlled by trading algorithms that no longer follow … Read more
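The excerpt is cut off before tick bars themselves are described. As a rough sketch, assuming the usual definition (one candlestick per fixed number of trades rather than per fixed time interval), the idea can be illustrated as follows. The function name `tick_bars` and its parameters are hypothetical and are not taken from the article.

```python
def tick_bars(prices, ticks_per_bar):
    """Group a sequence of trade prices into open/high/low/close bars.

    Each bar covers exactly `ticks_per_bar` consecutive trades, so busy
    periods produce more bars than quiet ones. Trailing trades that do
    not fill a complete bar are dropped in this sketch.
    """
    bars = []
    for i in range(0, len(prices) - ticks_per_bar + 1, ticks_per_bar):
        chunk = prices[i:i + ticks_per_bar]
        bars.append({
            "open": chunk[0],
            "high": max(chunk),
            "low": min(chunk),
            "close": chunk[-1],
        })
    return bars


# Illustrative, made-up trade prices (not real market data).
example_prices = [1, 3, 2, 5, 4, 6, 7]
for bar in tick_bars(example_prices, ticks_per_bar=3):
    print(bar)
```

With seven trades and three trades per bar, the sketch yields two bars and ignores the last, incomplete group.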

When to ‘Buy the Dip’

Model and Outputs: Once we have our train and test sets created, we can go ahead and train our model and then fit the model to our test set. To do this, we utilize the GaussianMixture class from the sklearn.mixture module. We specify n_components=3 because we are looking to model 3 discrete … Read more

10 Python Pandas tricks to make data analysis more enjoyable

If one has not yet fallen in love with Pandas, it may be because he/she has not seen enough cool examples Photo by Vashishtha Jogi on Unsplash My previous article, 10 Python Pandas tricks that make your work more efficient, received quite a lot of positive feedback from readers (appreciated!). Knowing that these Pandas tricks … Read more

Query Segmentation and Spelling Correction

In English, people generally type queries as words separated by spaces, but sometimes a space is omitted by mistake. The task of finding the word boundaries is known as Query Segmentation. For example, we can easily decipher nutfreechocolates as nut free chocolates. But the machine can’t unless … Read more
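The excerpt stops before the article's own approach is shown. As a minimal sketch of the general idea, one common way to recover word boundaries is dynamic programming over a known vocabulary. The tiny `VOCAB` set and the `segment` function below are hypothetical illustrations, not the article's method; a real system would use a large word list or word frequencies.

```python
# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"nut", "free", "chocolate", "chocolates"}


def segment(query, vocab=VOCAB):
    """Split `query` into vocabulary words, or return None if impossible."""
    n = len(query)
    # best[i] holds one valid segmentation of query[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = query[start:end]
            # Extend a known segmentation of the prefix with one more word.
            if best[start] is not None and word in vocab:
                best[end] = best[start] + [word]
                break
    return best[n]


print(segment("nutfreechocolates"))  # ['nut', 'free', 'chocolates']
```

This sketch returns the first segmentation it finds. Choosing among several valid splits would require a scoring scheme such as word frequencies.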

Minimize for loop usage in Python

Python Shorts How to and Why you should minimize for loop usage in your Python code? Photo by Etienne Girardet on Unsplash Python provides us with many styles of coding. In a way, it is pretty inclusive. One can come from any language and start writing Python. However, learning to write a language and writing a language … Read more

Bite-Sized Python Recipes

A collection of small useful functions in Python Photo by Jordane Mathieu on Unsplash Disclaimer: This is a collection of small useful functions I’ve found around the web, mainly on Stack Overflow or Python’s documentation page. Some may look trivial and some may not. But one way or another, I have used them all in my projects … Read more

Deploy your Data Science Model

Model Persistence with Joblib The process for model persistence with Joblib is more or less the same, but slightly easier in my opinion. However, you will need to import the sklearn.externals.joblib package (removed from recent scikit-learn releases, where the standalone joblib package is imported instead). We set the filename in much the same way as before and perform a joblib.dump on the km model using our just-defined filename. Boom, … Read more

Robotic Process Automation

With Python Photo by Franck V. on Unsplash A recent finding from KPMG’s Global Sourcing Advisory Pulse Survey ‘Robotic Revolution’, suggests that technology experts believe “The opportunities [from RPA] are many — so are the adoption challenges… For most organisations, taking advantage of higher-end RPA opportunities will be easier said than done.” Positioning by the three leading vendors “A … Read more

Top 10 Coding Mistakes Made by Data Scientists

A data scientist is a “person who is better at statistics than any software engineer and better at software engineering than any statistician”. Many data scientists have a statistics background and little experience with software engineering. I’m a senior data scientist ranked top 1% on Stack Overflow for Python coding and work with a lot of … Read more

Connecting POIs to a road network

Scalable interpolation based on the nearest edge This article discusses the process of handling the integration of point geometries and a geospatial network. You may find a demonstration Jupyter Notebook and the script of the function here. Introduction: working with geospatial network When we work with network analysis, our objects of analysis (represented as nodes or … Read more

Why Choose Data Science for Your Career

Data Science has become a revolutionary technology that everyone seems to talk about. Hailed as the ‘sexiest job of the 21st century’, Data Science is a buzzword with very few people knowing about the technology in its true sense. While many people wish to become Data Scientists, it is essential to weigh the pros and … Read more

Autoencoders: Deep Learning with TensorFlow’s Eager API | Data Stuff

We are so deep. Source: Pixabay. Deep Learning has revolutionized the Machine Learning scene in recent years. Can we apply it to image compression? How well can a Deep Learning algorithm reconstruct pictures of kittens? What’s an autoencoder? Today we’ll find the answers to all of those questions. Image Compression: all about the patterns I’ve talked … Read more

Linear programming and discrete optimization with Python using PuLP

Linear and integer programming are key techniques for discrete optimization problems and they pop up pretty much everywhere in modern business and technology sectors. We will discuss how to tackle such problems using the Python library PuLP and get a fast and robust solution. Introduction Discrete optimization is a branch of optimization methodology which deals with … Read more

Building a Collaborative Filtering Recommender System with ClickStream Data

Photo credit: Pixabay How to implement a recommendation algorithm based on prior implicit feedback. Recommender systems are everywhere, helping you find everything from books to romantic dates, hotels to restaurants. There are all kinds of recommender systems for all sorts of situations, depending on your needs and available data. Explicit vs Implicit Let’s face it, explicit … Read more

Making the Mueller Report Searchable with OCR and Elasticsearch

April 18th marked the full release of the Mueller Report, a document outlining the investigation of potential Russian interference in the 2016 presidential election. Like most government documents, it is long (448 pages) and would be painfully tedious to read. To make matters worse, the actual PDF download is basically just an image. You cannot … Read more

3 Awesome Visualization Techniques for every dataset

Categorical Correlation with Graphs: In simple terms, correlation is a measure of how two variables move together. For example, in the real world, income and spending are positively correlated: if one increases, the other also increases. Academic performance and video game usage are negatively correlated: an increase in one predicts a decrease in the other. So if our … Read more
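To make the sign of a correlation concrete, here is a small sketch using Pearson's coefficient for numeric variables. Note that this is the plain numeric measure, not the categorical correlation the heading refers to, and all values below are made up purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical, illustrative values (not real data).
income = [30, 40, 50, 60, 70]
spend = [20, 26, 33, 41, 45]
gaming_hours = [1, 2, 3, 4, 5]
grades = [90, 85, 80, 70, 65]

print(pearson(income, spend))         # positive: the two rise together
print(pearson(gaming_hours, grades))  # negative: one rises as the other falls
```

A coefficient near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.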

Give Or Take a Billion Years

In the following we will use data from Leda, the database of physics of galaxies, to verify Hubble’s law: In other words, there is a linear relationship between the distance of a galaxy and its speed relative to us, or equivalently, it follows the structure of a normal linear equation: With this knowledge we can calculate … Read more
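The equations from the original article did not survive in this excerpt. For orientation, Hubble's law in its standard form, set beside a generic linear equation, can be written as below. The symbols are the conventional ones and are an assumption here, since the article's own notation is not visible: v is the recession speed, d the distance and H_0 the Hubble constant.

```latex
v = H_0 \, d
\qquad \text{compared with} \qquad
y = m x + b, \quad \text{where } m = H_0 \text{ and } b = 0
```

In this reading, the slope of a straight-line fit of speed against distance is an estimate of H_0.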

Visualizing stock trading agents using Matplotlib and Gym

We are going to be extending the code we wrote in the last tutorial to render an insightful visualization of the environment using Matplotlib. If you haven’t read my first article on Creating custom Gym environments from scratch, you should stop here and read that first. If you are unfamiliar with the matplotlib library, don’t … Read more

Calculating the Semantic Brand Score with Python

Data Collection and Text Pre-processing The calculation of the Semantic Brand Score requires combining methods and tools of text mining and social network analysis. Figure 1 illustrates the main preliminary steps, which comprise data collection, text pre-processing and construction of word co-occurrence networks. Figure 1 — From Texts to Networks For this introductory tutorial, we can assume that … Read more

Explaining probability plots

In this article I would like to explain the concept of probability plots: what they are, how to implement them in Python and how to interpret the results. 1. Introduction You might have already encountered one type of probability plot, the Q-Q plot, while working with linear regression. One of the assumptions of the regression we should check … Read more

Bikes of New York

The Data Now that I’ve shared some of my pain with you, I should also share some of my results. As previously mentioned, from January 2018 to February 2019 there were 19,459,370 trips registered. After some cleaning, slicing, and wrangling, my final working data set was reduced to 17,437,855 trips. These are trips by subscribers … Read more

Text can be beautiful

Identifying company commitments with SpaCy In order to understand what companies are actively doing and committing to doing, we will need to create an intelligent way of identifying such commitments in each Modern Slavery return. A typical return will include a lot of non-relevant information such as background on the Company and the Modern Slavery Act. … Read more

Mathematical programming — a key habit to build up for advancing in data science

Introduction The essence of mathematical programming is that you build a habit of coding up mathematical concepts, especially the ones involving a series of computational tasks in a systematic manner. This kind of programming habit is extremely useful for a career in analytics and data science, where one is expected to encounter and make sense … Read more

Two essential Pandas add-ons

These two must-have UIs will help you level-up your Pandas skills The Python Data Analysis Library (Pandas) is the de facto analysis tool for Python. It still amazes me that such a powerful analysis library can be open-source and free to use. But it is not perfect… Yes Pandas does have some shortcomings There are a … Read more

End-To-End Topic Modeling in Python: Latent Dirichlet Allocation (LDA)

LDA Implementation The complete code is available as a Jupyter Notebook on GitHub. Steps: loading data, data cleaning, exploratory analysis, preparing data for LDA analysis, LDA model training, analyzing LDA model results. Loading data The logo of NIPS (Neural Information Processing Systems) For this tutorial, we’ll use a dataset of papers published at the NIPS conference. The NIPS … Read more

Predicting the performance of deep learning models

Power-law scaling explains how a model’s performance will change as we feed it more data It’s widely acknowledged that the recent successes of Deep Learning rest heavily upon the availability of huge amounts of data. Vision was the first domain in which the promise of DL was realised, probably because of the availability of large datasets … Read more

Bitcoin Predictive Price Modeling with Facebook’s Prophet

Two Bitcoin price predictions (blue and red lines) generated using Facebook’s Prophet package. The actual price data is in green, while the shaded areas denote the respective uncertainty in the estimate. As you can see, the uncertainty increases into the future. This is particularly the case with the tighter fitting price model (red). This is a quick … Read more

Using Gitlab’s CI for Periodic Data Mining

Serverless periodic mining of a news portal RSS feed with minimal code and effort One of the most time-consuming and difficult stages in a standard data science development pipeline is creating a dataset. If you have already been provided with a dataset, kudos to you! You have just saved yourself a good amount … Read more

Animations with Matplotlib

Animations Matplotlib’s animation base class deals with the animation part. It provides a framework around which the animation functionality is built. There are two main interfaces for this: FuncAnimation, which makes an animation by repeatedly calling a function func, and ArtistAnimation, which makes an animation using a fixed set of Artist objects. However, out of the two, FuncAnimation … Read more

A couple tricks for using spaCy at scale

The Python package spaCy is a great tool for natural language processing. Here are a couple of things I’ve done to use it on large datasets. Me processing text on a Spark cluster (artist’s rendition). When a project I’m working on requires natural language processing, I tend to turn to spaCy first. Python has several other … Read more

XGBoost: Predicting Life Expectancy with Supervised Learning

A beautiful forest. So random! Source: Pixabay Today we’ll use XGBoost Boosted Trees for regression over an official HDI dataset. Who said Supervised Learning was all about classification? XGBoost: What is it? XGBoost is a Python framework that allows us to train Boosted Trees exploiting multicore parallelism. It is also available in R, though we won’t be … Read more

Vaex: A DataFrame with super-strings

Vaex’ strings are super fast, not related to M-theory yet String manipulations are an essential part of Data Science. The latest release of Vaex adds incredibly fast and memory efficient support for all common string manipulations. Compared to Pandas, the most popular DataFrame library in the Python ecosystem, string operations are up to ~30–100x faster on … Read more

How to Automate Tasks on GitHub With Machine Learning for Fun and Profit

Our friends and colleagues who are data scientists would describe the ideal predictive modeling project as a situation where: There is an abundance of data, which is already labeled or where labels can be inferred. The data can be used to solve real problems. The problem relates to a domain you are passionate about or … Read more