Six Recommendations for Aspiring Data Scientists

Building experience before landing a job. Data science is a field with huge demand, in part because it seems to require experience as a data scientist to be hired as a data scientist. But many of the best data scientists I’ve worked with have diverse backgrounds ranging from humanities to neuroscience, and it … Read more

Taking Google Sheets to (a) Class.

I am currently building a Flask app for teachers. Since teachers have adopted Google Drive, they also use Google Sheets. One of my app’s features lets teachers copy and paste a sheet link into the app and submit it through a form. It will then convert it … Read more

How to setup the PySpark environment for development, with good software engineering practices

In this article we will discuss how to set up our development environment in order to create good-quality Python code, and how to automate some of the tedious tasks to speed up deployments. We will go over the following steps: set up our dependencies in an isolated virtual environment with pipenv; how to set up … Read more

Let’s build an Article Recommender using LDA

Because of my keen interest in learning new topics, I decided to work on a project in which a Latent Dirichlet Allocation (LDA) model recommends Wikipedia articles based on a search phrase. This article explains my approach to building the project in Python. Check out the project on GitHub below. Structure Photo by Ricardo Cruz on Unsplash … Read more

Data Science with no Math

Using AI to Build Mathematical Datasets. This is an addendum to my last article, in which I had to add a caveat at the end that I was not a mathematician and that I was new to Python. I added this because I struggled to come up with a mathematical formula to generate patient data that … Read more

Deep Learning — it`s not only about kitties in mobiles, or how we proceeded in locomotive bogies…

A few days ago, the company Aurorai sent its system for recognizing defects and monitoring bogie condition on the Ermak locomotive for operational tests. The problem is uncommon and very interesting; the first stage included evaluating the condition of the brake pads and the bandage width. We managed to solve this task with an accuracy of up to 1 mm at locomotive speeds not exceeding … Read more

Computer Vision for Beginners: Part 1

Computer Vision is one of the hottest topics in artificial intelligence. It is making tremendous advances in self-driving cars and robotics, as well as in various photo correction apps. Steady progress in object detection is being made every day. GANs are also something researchers have their eyes on these days. Vision is showing us … Read more

Using Wrappers to Log in Python

Logging in Python can be tedious, especially when you use it to debug. I am not a fan of Conda or PyCharm myself (or any other fancy IDE), but those of you who are will always face the problem of debugging and controlling your code when you have to put it in production. A lot of … Read more
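As a rough illustration of the idea in the title, a wrapper (decorator) can add logging around a function without touching its body. This is a minimal sketch using only the standard library; the names `log_calls` and `add` are hypothetical and are not taken from the article.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def log_calls(func):
    """Log each call to `func`, its return value, and any exception it raises."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        logger.debug("calling %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception("%s raised", func.__name__)
            raise
        logger.debug("%s returned %r", func.__name__, result)
        return result
    return wrapper


@log_calls
def add(a, b):
    """Hypothetical example function."""
    return a + b
```

Calling `logging.basicConfig(level=logging.DEBUG)` before `add(2, 3)` would make the debug messages visible; without it, only the error-level message from a failing call is shown by default.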

How to Build a Reporting Dashboard using Dash and Plotly

8. Building the First Data Table. Figure 5: First Data Table with a Condensed View. The first data table in the dashboard presents metrics such as spend (the cost associated with a given advertising product), website sessions, bookings (transactions), and revenue. These metrics are aggregated depending upon the dates selected in the date selector and typically … Read more

Real world implementation of Logistic Regression

Binary Logistic Regression model building in scikit-learn. Binary Classification. A binary logistic regression (often referred to simply as logistic regression) predicts the probability that an observation falls into one of two categories of a dichotomous dependent variable based on one or more independent variables that can be either continuous or categorical. In this example, a … Read more
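To make "predicts the probability that an observation falls into one of two categories" concrete, here is a minimal plain-Python sketch of the prediction step. It is not the scikit-learn code the article builds; the function names, coefficients, and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function, mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, coefficients, intercept):
    """Return P(y = 1 | features) for a fitted binary logistic regression model."""
    # Linear predictor: intercept plus the weighted sum of the features.
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)


def predict_class(features, coefficients, intercept, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return int(predict_probability(features, coefficients, intercept) >= threshold)
```

With hypothetical values, `predict_class([2.0], [1.0], -1.0)` gives class 1 because the linear predictor is positive, while `predict_class([0.0], [1.0], -1.0)` gives class 0.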

Data Science Code Refactoring Example

When learning to code for data science, we don’t usually consider modifying our code to reap a particular benefit in terms of performance. We code to modify our data, produce a visualization, and construct our ML models. But if your code is going to be used for a dashboard or app, … Read more

Predict College Basketball Scores in 30 Lines of Python

Create a machine learning algorithm to predict college basketball scores in less than 30 lines of Python Don’t worry, we’ve all been beaten by a Very Good Boy at least once. Photo by Jenny Marvin on Unsplash Finish last in your office’s March Madness pool again? Did a Golden Retriever or your neighbor’s daughter’s pet rock choose … Read more

Enlightened DataLab Notebooks

With Private Bucket, IAM Permissions, and Safe Firewall Configs Photo by Tim Gouw on Unsplash The road to expertise in Cloud Computing is fraught with harrowingly extended afternoons, and countless under completed blog posts. When you want to quickly spin up a virtual machine, and start working in Python from a notebook in the browser, you’re … Read more

Word Distance between Word Embeddings with Weight

In a previous story, I introduced Word Mover’s Distance (WMD), which measures the distance between word embeddings. You may notice that there is no weighting mechanism between words. How does weighting help in NLP tasks? Huang et al. proposed an improvement named Supervised Word Mover’s Distance (S-WMD). Introduction to Supervised Word Mover’s Distance (S-WMD). Before … Read more

Attention-based Neural Machine Translation

Attention mechanisms are increasingly used to improve the performance of Neural Machine Translation (NMT) by selectively focusing on sub-parts of the sentence during translation. In this post, we will cover two simple types of attention mechanism: a global approach (which attends to all source words) and a local approach (which only looks at a … Read more
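The global approach can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes a dot-product score between the decoder state and each encoder state, which may differ from the scoring function the post uses, and the names `softmax` and `global_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(decoder_state, encoder_states):
    """Attend to ALL source positions; return (weights, context vector)."""
    # Dot-product score between the decoder state and every encoder state
    # (an assumed scoring function; vectors must share the same dimension).
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(decoder_state)
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context
```

For example, with encoder states `[[1.0, 0.0], [0.0, 1.0]]` and decoder state `[1.0, 0.0]`, the first source position scores higher and so receives the larger weight.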

Hypothesis testing in the Northwind dataset using ANOVA

Locating the most profitable customers. Project aim. As part of a project on the Northwind database, I needed to come up with some questions to ask of the data in order to derive valuable business insights for the company. The Northwind database is a sample database from Microsoft for a fictitious company called Northwind Traders, … Read more

Automatically Storing Data from Analyzed Data Sets

How to Store Data Analysis Results to Facilitate Later Regression Analysis. Figure 1: Example Folder Hierarchy. This is the fifth article in a series teaching you how to write programs that automatically analyze scientific data. The first presented the concept and motivation, then laid out the high-level steps. The second taught you how … Read more

Gaussian Mixture Modelling (GMM)

GMM estimation Figure 3 below illustrates what GMM is doing. It clearly shows three clusters modelled by three different Gaussian distributions. I have used a toy data set here just to illustrate this clearly as it is less clear with the Enron data set. As you can see, compared to Figure 2 modelled using spherical … Read more

SQL and Pandas

Where and how should these tools be used? As I mentioned in my previous post, my technical experience has almost exclusively been in SQL. While SQL is awesome and can do some really cool things, it has its limitations — these limitations are in large part why I decided to acquire Data Science superpowers at Lambda School. In … Read more

Artificial Neural Networks Optimization using Genetic Algorithm with Python

Main Project File Implementation The third file is the main file because it connects all functions. It reads the features and the class labels files, filters features based on the standard deviation, creates the ANN architecture, generates the initial solutions, loops through a number of generations by calculating the fitness values for all solutions, selecting … Read more
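The generation loop described above can be sketched generically. The following is a toy, standard-library sketch and not the article's code: the objective is a stand-in (the article's fitness comes from the ANN), and the crossover and mutation steps shown are the usual genetic-algorithm operators, assumed here for illustration. All names and parameter values are hypothetical.

```python
import random


def fitness(solution):
    # Toy objective (an assumption): higher is better, best when all genes are 0.
    return -sum(x * x for x in solution)


def evolve(pop_size=8, genes=4, generations=30, n_parents=4, seed=0):
    rng = random.Random(seed)
    # Generate the initial solutions.
    population = [[rng.uniform(-1, 1) for _ in range(genes)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Calculate fitness for all solutions and keep the best as parents.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:n_parents]
        children = []
        while len(children) < pop_size - n_parents:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genes)   # single-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(genes)        # mutate one randomly chosen gene
            child[i] += rng.uniform(-0.1, 0.1)
            children.append(child)
        # Parents survive unchanged, so the best fitness never gets worse.
        population = parents + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history
```

Because the parents are carried over unchanged into each new generation, the best fitness recorded in `history` is non-decreasing from one generation to the next.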

Checking Automated Data Analysis for Errors

How to Check for Errors, both Manually and Automatically, when Automating Data Analysis. This is the fourth article in a series teaching you how to write programs that automatically analyze scientific data. The first presented the concept and motivation, then laid out the high-level steps. The second taught you how to structure data sets … Read more

10 Steps to Set Up Your Python Project for Success

In this guide we’ll walk through adding tests and integrations to speed development and improve code quality and consistency. If you don’t have a basic working Python package, check out my guide to building one and then meet right back here. Cool. Here’s our ten-step plan for this article: Install Black, Create .pycache, Install pytest, Create Tests … Read more

How to Perform Explainable Machine Learning Classification — Without Any Trees

Credit: Pixabay Strict and clear rules… appear to us as something in the background — hidden in the medium of the understanding. – Ludwig Wittgenstein Decision trees are a popular technique for classification. They’re intuitive, easy to interpret, and often perform well out-of-the-box. Tree models are paths of rules that humans can understand. In certain contexts, being able … Read more

Master Python through building real-world applications (Part 9)

Endnotes. As we all know, we learn from visualizations far better than we learn from raw data. Building visualizations from data is really rewarding, and with the help of external libraries like Bokeh, Python’s visualization game is stronger than ever. In this post, you learned about stock market data, how to download it, what are candlestick … Read more

A “full-stack” data science project

2. Data exploration The notebook exploring the data is available on GitHub here. Regardless of the data analysis you’re performing, or how well you think you know your data, it is always a good idea to take a look at it and be aware of the various characteristics before starting to work on a specific … Read more

Machine Learning for Beginners: An Introduction to Neural Networks

A simple explanation of how they work and how to implement one from scratch in Python. Here’s something that might surprise you: neural networks aren’t that complicated! The term “neural network” gets used as a buzzword a lot, but in reality they’re often much simpler than people imagine. This post is intended for complete beginners and … Read more

Replacing Excel with Python

Importing Excel Files into a Pandas DataFrame. The initial step is to import Excel files into a DataFrame so we can perform all our tasks on it. I will demonstrate the read_excel method of Pandas, which supports the xls and xlsx file extensions. Using read_csv is much the same as using read_excel; we won’t go in depth, but I will share … Read more

Data Science With No Data

Building an AI/ML model with no access to a dataset In this article, we will demonstrate how to generate a dataset to build a machine learning model. According to this, Medicare fraud and abuse cost taxpayers $60 billion per year. AI/ML could significantly help identify and prevent fraud and abuse, but since privacy is of utmost … Read more

Use Google and Tweepy to Build a Dataset of Twitter Users

With ever-increasing value being placed on the effectiveness of social media in marketing, mining data from social platforms is a critical piece of the ad-tech puzzle. Free developer API access to social data is becoming more and more restrictive, and so easily accessing the right data can be a challenge. Twitter is an exception to … Read more

Building a Flask API to Automatically Extract Named Entities Using SpaCy

How to use the Named Entity Recognition module in spaCy to identify people, organizations, or locations in text, then deploy a Python API with Flask. The overwhelming amount of unstructured text data available today provides a rich source of information if the data can be structured. Named-entity Recognition (NER), also known as Named-entity Extraction, is one of … Read more

Extracting faces using OpenCV Face Detection Neural Network

Recently, I came across a website that has some of the greatest tutorials on OpenCV. While reading through its numerous articles, I found that OpenCV has its own Face Detection Neural Network with really high accuracy. So I decided to work on a project using this Neural Network from OpenCV and extract faces from … Read more

Real-time face liveness detection with Python, Keras and OpenCV

Most facial recognition algorithms you find on the internet and in research papers suffer from photo attacks. These methods work really well at detecting and recognizing faces in images, videos and video streams from a webcam. However, they can’t distinguish between real-life faces and faces in a photo. This inability to recognize faces is due to … Read more

CASM = Fractals

Using a simple equation, we can see exactly how the iteration occurs. We first substitute a value for x. Solve the equation for y. Then take the value of y and make it our new x. The best way to illustrate this is to actually use real values. Iteration Our first value was 1 for … Read more
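The feed-the-output-back-in procedure described above can be sketched as follows. The excerpt does not show the equation itself, so `y = x**2 - 1` is a hypothetical stand-in chosen purely for illustration; only the starting value of 1 comes from the text.

```python
def iterate(f, x0, steps):
    """Apply f repeatedly, using each y as the next x; return every value visited."""
    values = [x0]
    for _ in range(steps):
        values.append(f(values[-1]))
    return values


# Hypothetical equation for illustration only: y = x**2 - 1, starting from x = 1.
orbit = iterate(lambda x: x * x - 1, 1, 4)
print(orbit)  # [1, 0, -1, 0, -1]
```

With this stand-in equation the values settle into a repeating cycle between 0 and -1; a different equation or starting value would behave differently.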

Random thoughts on my first ML deployment

5 things I didn’t know six months ago and had better not forget in the months to come. A little bit of context: I’m currently working for a fast-growing yet still medium-sized company that, after having built a robust and widely used product, has decided to start leveraging the data generated during the years … Read more

Building Blocks: Text Pre-Processing

Morphological Normalization. Morphology, in general, is the study of the way words are built up from smaller meaning-bearing units, morphemes. For example, dogs consists of two morphemes: dog and s. Two commonly used techniques for text normalization are: Stemming: The procedure aims to identify the stem of a word and use it in lieu of … Read more
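To give a feel for stemming, here is a deliberately crude suffix-stripping sketch. It is a toy written for illustration, not the Porter stemmer or any library's algorithm; the suffix list, the `min_stem` length rule, and the name `toy_stem` are all assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order


def toy_stem(word, min_stem=3):
    """Strip at most one common English suffix; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word is left to be a plausible stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


print(toy_stem("dogs"))     # dog
print(toy_stem("jumping"))  # jump
print(toy_stem("is"))       # is
```

A real stemmer applies many more rules and exceptions; this sketch only shows the basic idea of replacing a word with its stem.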

Finding Lane Lines — Simple Pipeline For Lane Detection.

Identifying the lanes of the road is a very common task that a human driver performs. It is important for keeping the vehicle within the constraints of the lane. It is also a very critical task for an autonomous vehicle to perform. A very simple lane detection pipeline is possible with simple computer vision techniques. This article will describe … Read more

Set Your Jupyter Notebook up Right with this Extension

Solution: The Setup Jupyter Notebook Extension Rather than just complaining about the problem (it’s easy to be a critic but a lot harder to do something positive) I decided to see what could be done with Jupyter Notebook extensions. The result is an extension that on opening a new notebook automatically: Creates a template to … Read more

Climate Heatmaps Made Easy

Investigating Paleoclimate Data with Pandas and Seaborn Some time ago Dr. Ed Hawkins, who happens to be the creator of the Climate Spirals, released to the world the Warming Stripes graph for Annual Global Temperature ranging from 1850–2017. The concept is simple but also very informative: each stripe represents the temperature for a single year and … Read more

The Python Dreamteam

As a Data Scientist, I code almost entirely in Python. I also get easily scared by configuring stuff. I don’t really know what a PATH is. I have no clue what lies within the /bin directory on my laptop. These are all things that you seemingly have to get familiar with to not have Python … Read more

Boosting: Is It Always The Best Option?

Gradient boosting has become quite a popular technique in the area of machine learning. Given its reputation for achieving potentially higher accuracy than other models, it has become particularly popular as a “go-to” model for Kaggle competitions. However, use of gradient boosting raises two questions: Does this technique really outperform others consistently irrespective of the … Read more