Use Google and Tweepy to Build a Dataset of Twitter Users

With ever-increasing value being placed on the effectiveness of social media in marketing, mining data from social platforms is a critical piece of the ad-tech puzzle. Free developer API access to social data is becoming more and more restrictive, and so easily accessing the right data can be a challenge. Twitter is an exception to …

Understanding time complexity with Python examples

Big-O Complexity Chart: http://bigocheatsheet.com/. Nowadays, with all the data we consume and generate every single day, algorithms must be good enough to handle operations on large volumes of data. In this post, we will understand a little more about time complexity, Big-O notation and why we need to be concerned about it when developing algorithms. …
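As a rough illustration of why time complexity matters on large inputs (this sketch is not taken from the original post, and the function names are hypothetical), the code below counts the comparisons made by a linear scan, O(n), and by a binary search, O(log n), over the same sorted list.

```python
def linear_search_steps(items, target):
    """Count the comparisons a left-to-right scan makes: O(n)."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(items, target):
    """Count the loop iterations of binary search on a sorted list: O(log n)."""
    steps = 0
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return steps


data = list(range(1_000_000))          # already sorted
print(linear_search_steps(data, 999_999))   # one comparison per element
print(binary_search_steps(data, 999_999))   # at most about log2(n) iterations
```

For a million elements the linear scan touches every element in the worst case, while the binary search needs no more than about twenty iterations.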

Building a Flask API to Automatically Extract Named Entities Using SpaCy

How to use the Named Entity Recognition module in spaCy to identify people, organizations, or locations in text, then deploy a Python API with Flask. The overwhelming amount of unstructured text data available today provides a rich source of information if the data can be structured. Named-entity Recognition (NER) (also known as Named-entity Extraction) is one of …

How SQL Is Making Me a Better Scientist

An explanation of why and how I started using SQL. SQL (Structured Query Language) is a computer language for relational database management and data manipulation. Relational databases and SQL are extremely popular in industry, and for good reason. Relational databases are great when working with large and complex databases. SQL, as a language, allows you …

How to Build a Deep Neural Network Without a Framework

So, for the weighted sum, the function is simply: Simple enough! Now, we build a function to feed the result to an activation function (either ReLU or sigmoid): Now, we want to use the sigmoid function on the last layer, and ReLU on all previous layers. This is specific to this application, because we will …
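The code blocks from the original post did not survive in this excerpt. The following is only a minimal sketch of the two steps described, a weighted sum followed by a ReLU or sigmoid activation, written with plain Python lists for a single neuron. The function names and the list-based representation are assumptions, not the author's implementation.

```python
import math


def weighted_sum(weights, inputs, bias):
    """Linear step for a single neuron: z = w . x + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def relu(z):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)


def sigmoid(z):
    """Sigmoid: squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def activate(z, activation):
    """Feed the weighted sum to the chosen activation (hypothetical helper)."""
    if activation == "relu":
        return relu(z)
    if activation == "sigmoid":
        return sigmoid(z)
    raise ValueError(f"unknown activation: {activation}")


# ReLU on a hidden neuron, sigmoid on the output neuron.
z = weighted_sum([0.5, -1.0], [2.0, 1.0], bias=0.25)
hidden = activate(z, "relu")
output = activate(weighted_sum([1.0], [hidden], bias=0.0), "sigmoid")
print(hidden, output)
```

A real network would apply the same two steps to whole layers at once, typically with matrix operations rather than Python loops.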

Erlang/Elixir solutions: struggle for quality

In any information system there is a risk of failures of different kinds, such as: hardware and power failures; network failures: configuration mistakes and broken or out-of-date firmware; logical mistakes: from algorithm coding problems to architecture-related issues which appear at the border of subsystems and systems; safety issues along with cyber attacks …

Extracting faces using OpenCV Face Detection Neural Network

Recently, I came across the website https://www.pyimagesearch.com/ which has some of the greatest tutorials on OpenCV. While reading through its numerous articles, I found that OpenCV has its own Face Detection Neural Network with really high accuracy. So I decided to work on a project using this Neural Network from OpenCV and extract faces from …

March Edition: Making Sense Of So Much Data

8 Must-Read Articles. We are blessed in the 21st Century with all the rapid advancements happening around compute, storage and data. Data might be the new oil but without the right tools, methodologies and infrastructure it would be as useless as sitting on an oil well and doing nothing. Big Data is no longer hyped …

Real-time face liveness detection with Python, Keras and OpenCV

Most facial recognition algorithms you find on the internet and in research papers suffer from photo attacks. These methods work really well at detecting and recognizing faces in images, videos and video streams from a webcam. However, they can’t distinguish between real-life faces and faces in a photo. This inability to recognize faces is due to …

Text Classification of Freedom of Information Requests: Part III

Deep Learning with Recurrent Neural Nets. As a last effort with this dataset, we’ll employ deep learning in the form of recurrent neural networks (RNNs). RNNs retain some memory of steps that come before in a sequence, or, for a bidirectional implementation, before and after the current step. They are therefore extremely useful in tasks like …

Another Stage Of Visualization: Be Reactive with Dash

A gentle invitation to Dash by Plotly. Dash is an open-source Python library which enables us to create web applications with Plotly. It makes it easy to build an interactive visualization with simple reactive decorators like a dropdown, a slider bar, and markdown text data. We can even update the plots according to the input …

Fundamentals of Machine Learning (Part 3)

Information. Let’s motivate our discussion by assuming our goal is to detect the author of a given text by the words that are used. Which words are useful for detecting authorship? Intuitively, words like “the”, “or”, and “it” aren’t going to be very useful, because those words have a high probability of showing up in any …

Unsupervised NLP Topic Models as a Supervised learning input

Predicting Future Yelp Review Sentiment. Topic Modeling Overview. Topic modeling in NLP seeks to find hidden semantic structure in documents. Topic models are probabilistic models that can help you comb through massive amounts of raw text and cluster similar groups of documents together in an unsupervised way. This post specifically focuses on Latent Dirichlet Allocation (LDA), which …

Strategies for Productionizing our Machine Learning Models

So you have written your best machine learning code, and have now tuned the model for the best accuracy. Now what? How would you deploy your model so that the business can actually take advantage of the model and make better decisions? This is a follow-up post from my last post where I discussed Productionizing …

Random thoughts on my first ML deployment

5 things I didn’t know six months ago and had better not forget in the months to come. A little bit of context: I’m currently working for a fast-growing yet still medium-sized company that, after having built a robust and widely used product, has decided to start leveraging the data generated during the years …

Using word2vec to Analyze News Headlines and Predict Article Success

Word embeddings are a powerful way to represent the latent information contained within words, as well as within documents (collections of words). Using a dataset of news article titles, which included features on source, sentiment, topic, and popularity (# shares), I set out to see what we could learn about articles’ relationships to one another …

Understanding the ROC and AUC metrics.

ROC Graphs. The ROC (Receiver Operating Characteristic) curve can help in deciding the best threshold value. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis). True Positive Rate indicates what proportion of people ‘with heart disease’ were correctly classified. False Positive Rate indicates the proportion of people classified as ‘not …
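To make the two rates concrete, here is a small sketch that computes the True Positive Rate and False Positive Rate at a given threshold. Each (FPR, TPR) pair is one point on an ROC curve. The labels and scores are made-up illustration data, `tpr_fpr` is a hypothetical helper, and the code assumes both classes are present in the labels.

```python
def tpr_fpr(labels, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive.

    Assumes labels contain at least one positive (1) and one negative (0).
    """
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Made-up data: 1 = has the condition, 0 = does not.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Sweeping the threshold traces out the points of the ROC curve.
for threshold in (0.0, 0.5, 1.0):
    print(threshold, tpr_fpr(labels, scores, threshold))
```

Lowering the threshold raises both rates, which is why the curve helps in choosing a threshold that balances them.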

OpenAI, Deceptive Technology, and Model Risk Management

This article connects some dots as I sort through my own thoughts on releasing technology and information that could have adverse effects. It touches on the release of GPT-2 by OpenAI and the Digital Defense Playbook by Our Data Bodies, destructive versus deceptive technology, malicious intent, deepfakes, verification and surveillance, and model risk management. Enjoy! …

Probabilistic Graphical Models: Bayesian Networks

Concepts of Probability. Axioms of Probability: 1. For any event A, the probability of occurrence of the event will always be equal to or greater than zero. Figure-1: Probability of an event A. 2. If there are disjoint events in a sample space, then the probability of the union of all events is the sum of the individual probabilities. Figure-2: Union …
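The figures referenced in this excerpt are not reproduced here. In standard notation, the two axioms it states can be written as follows, where the symbols A_i for the disjoint events are an assumption of this sketch:

```latex
% Axiom 1: non-negativity
P(A) \ge 0

% Axiom 2: additivity for pairwise disjoint events A_1, A_2, \dots
P\left(\bigcup_{i} A_i\right) = \sum_{i} P(A_i),
\qquad A_i \cap A_j = \emptyset \ \text{for } i \ne j
```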

Review: CRF-RNN — Conditional Random Fields as Recurrent Neural Networks (Semantic Segmentation)

An Approach Integrating CRF into End-to-end Deep Learning Solution. In this story, CRF-RNN, Conditional Random Fields as Recurrent Neural Networks, by University of Oxford, Stanford University, and Baidu, is reviewed. CRF is one of the most successful graphical models in computer vision. It is found that Fully Convolutional Network (FCN) outputs very coarse segmentation results. …

Better, Faster Speech Recognition with Wav2Letter’s Auto Segmentation Criterion

Facebook AI’s New Loss Function Improves a Decade-Old Technique in ASR. By Zach C, Mar 3. In 2016, Facebook AI Research (FAIR) broke new ground with Wav2Letter, a fully convolutional speech recognition system. In Wav2Letter, FAIR showed that systems based on convolutional neural networks (CNNs) could perform as well as traditional recurrent neural network-based approaches. In this …

Rendezvous Architecture for Data Science in Production

Part 2: The Solution. Meet the rendezvous architecture. Summarising the previous introduction to the problem statement, we are looking for an architecture to: evaluate a large number of incumbent and challenger models in parallel; manage the model life cycle; handle an increasing heterogeneity of data science toolkits; allow experimentation in production without impacting the user experience …

What Kind of Data Science Do You Practice?

Beyond Tools and Skills to Domains of Expertise. Photo by Philip Swinburn on Unsplash. Descriptions of data science are very often centered on the tools that data scientists use and the skills they bring to the job. But what if we shifted our focus to a part of data science that gets relatively less attention? What …

How Reliable Are Amazon Reviews?

Building An Index To Identify Fake Reviews. Introduction. As a self-proclaimed tech enthusiast, I’ve been following the tech review community, especially on YouTube, for quite a while. During that time, I recognized a certain pattern emerge after every new iPhone release: highly popular videos (as well as articles) would be released criticizing initial problems with the new …

Building Blocks: Text Pre-Processing

Morphological Normalization. Morphology, in general, is the study of the way words are built up from smaller meaning-bearing units, morphemes. For example, dogs consists of two morphemes: dog and s. Two commonly used techniques for text normalization are: Stemming: The procedure aims to identify the stem of a word and use it in lieu of …

Utilizing Free Image Tools For Home Security

The baby monitor portion of our webapp utilized a different Clarifai offering: the general model. Clarifai’s general model is a convolutional neural network trained to classify objects within pictures with thousands of labels. Given the picture to the left, the model would assign the labels “baby” and “dog”, assessing that there is both a baby …

Introduction to Uber’s Ludwig

Using Ludwig. To use Ludwig we need to install it, which can be done with the following commands: pip install git+https://github.com/uber/ludwig followed by python -m spacy download en. The next step would be to create our model definition YAML file that specifies our input and output features as well as some additional information about the specific preprocessing steps …

Measuring Performance: AUPRC

The area under the precision-recall curve (AUPRC) is another performance metric that you can use to evaluate a classification model. If your model achieves a perfect AUPRC, it means your model can find all of the positive samples (perfect recall) without accidentally marking any negative samples as positive (perfect precision). It’s important to consider both …
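As a small illustration of the two quantities behind the curve, the sketch below computes precision and recall at a single threshold, which gives one point on a precision-recall curve. It does not compute the area itself. The data is made up, `precision_recall` is a hypothetical helper, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall(labels, scores, threshold):
    """Return (precision, recall) when scores >= threshold are predicted positive.

    Assumes labels contain at least one positive (1).
    """
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1          # positive sample correctly found
        elif predicted_positive:
            fp += 1          # negative sample wrongly marked positive
        elif label == 1:
            fn += 1          # positive sample missed
    # Assumed convention: with no positive predictions, precision is 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall


# A perfectly separable toy example: every positive outscores every negative.
labels = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.2, 0.1]
print(precision_recall(labels, scores, 0.5))

# A lower threshold still finds every positive but lets negatives through.
print(precision_recall(labels, scores, 0.15))
```

The first call corresponds to the perfect case described above, where all positives are found and no negatives are marked positive.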

Generative Design & Metric space analysis

The Generative Process: Generative design is essentially a multi-stage process which broadly consists of: The Geometric model: The design of a geometric model which can create many design variations. This requires constructing a robust parametric configuration such that unique parameters define the geometric variations that the design can possibly have. Design | Performance Metrics: The …

The Great Molasses flood — predicting the melting point of metals

Ok but why? Because it’s not like we have enough machine learning. This project was a personal challenge given to me on my visit to Hamilton, Ontario for Deltahacks, where ArcelorMittal, the largest steel producer in the world, had sponsored the event. Since their Dofasco HQ was just down the street, I asked them about what they …

Being a Data Scientist does not make you a Software Engineer!

Introduction. As we have seen before in the famous Venn diagram of Steven Geringer, Data Science is the intersection of 3 disciplines: Computer Science, Mathematics/Statistics and a particular Domain knowledge. Data Science Venn Diagram [Copyright Steven Geringer]. Having basic (or even advanced) programming skills is key to putting your end-to-end experiment together, however …

Understanding Decision Trees (once and for all!)

This article is made for complete beginners in Machine Learning who want to understand one of the simplest algorithms, yet one of the most important because of its interpretability, power of prediction and use in different variants like Random Forest or Gradient Boosting Trees. This article is also for all the Machine Learners like me who …

Should Data Scientists Be Licensed?

Licensing could lead to increased public safety, but at the cost of slowing down innovation. Every day your life is impacted by different machine learning algorithms. Some are innocuous, such as movie recommendations on Netflix. Others such as loan approval and bail sentencing could cause unmitigated harm if they aren’t developed properly. With the growing …

Getting started with Git and GitHub: the complete beginner’s guide

Git and GitHub basics for the curious and completely confused (plus the easiest way to contribute to your first open source project ever!). Photo by James Bold on Unsplash. Looking to get started with Git and GitHub? Do you need to collaborate with a team? Are you working on a project? Have you recently discovered that you …

Visual Deep Computer Vision

Or how you can run deep learning algorithms for computer vision without being an expert. Introduction. Deep learning has powered and solved a variety of computer vision problems, such as object detection, motion tracking, action recognition, human pose estimation, semantic segmentation and more. The biggest advances in the field have been possible due to Convolutional Neural …

Why And How To Use Merge With Pandas in Python

It doesn’t matter whether you’re a data scientist, data analyst, business analyst, or data engineer. If you’ve been using Python in your work, especially for data preprocessing/cleaning, you’ve used Pandas in some way. Why “Merge”? You’ve probably encountered multiple data tables that have various bits of information that you would like to see all in …

Implementing a linear-chain Conditional Random Field (CRF) in PyTorch

Code. Let’s start our code by creating a class called CRF which inherits from PyTorch’s nn.Module in order to keep track of our gradients automatically. Also, I added special tokens for the beginning/end of the sentence, and a special flag to inform whether we are passing tensors where the batch is the first dimension …

Failing to land Flight Delay Predictions

In an earlier article (“The Loss of Inference”) I referenced the misuse of non-independent variables in models utilizing the auto-mpg dataset as a symptom of the mathematical/technological determinism that now pervades data science due to the focus on production, which has denigrated critical evaluation. In this article, I hope to show how this shift …

Machine Learning Algorithms In Layman’s Terms, Part 1

(i.e. how to explain machine learning algorithms to your grandma). As a recent graduate of the Flatiron School’s Data Science Bootcamp, I’ve been inundated with advice on how to ace technical interviews. A soft skill that keeps coming to the forefront is the ability to explain complex machine learning algorithms to a non-technical person. Image source: https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/machine-learning.png. This …

Finding Lane Lines — Simple Pipeline For Lane Detection.

Identifying the lanes of the road is a very common task that human drivers perform. It is important for keeping the vehicle within the constraints of the lane. It is also a very critical task for an autonomous vehicle to perform. A very simple lane detection pipeline is possible with simple computer vision techniques. This article will describe …

All the Steps to Build your first Image Classifier (with code)

Now that you know the basics of the convolution, we can start building one! Preparing the data. This part is useful only if you want to use your own data, or data that can’t be found on the web easily, to build a convolutional neural network that may be better adapted to your needs. Otherwise, here is the …

Set Your Jupyter Notebook up Right with this Extension

Solution: The Setup Jupyter Notebook Extension. Rather than just complaining about the problem (it’s easy to be a critic but a lot harder to do something positive), I decided to see what could be done with Jupyter Notebook extensions. The result is an extension that, on opening a new notebook, automatically: creates a template to …

Weekly Selection — Mar 1, 2019

A brief introduction to Markov chains. By Joseph Rocca, 19 min read. In 1998, Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd published “The PageRank Citation Ranking: Bringing Order to the Web”, an article in which they introduced the now famous PageRank algorithm at the origin of Google.