Can Data Science help us find what makes a hit television show?

At Data Science Salon, Los Angeles, I recently talked about deriving attributes from sitcom transcripts and exploring whether we can learn what makes a television show popular. This article presents the investigation and the results I got after training neural networks as well as leveraging pre-trained …

Swarm Intelligence — Swarm-Based Dimensionality Reduction

Introduction: Technology has evolved tremendously over the years, and the access gap shrinks year by year as individuals and businesses gain access to more processing power, larger memory capacities, and smarter, more efficient data storage technologies, whether on-premises or in the cloud. Despite this, data volume …

Understanding Matrix Factorization for recommender systems

To understand more, let’s take an example. Suppose we have a 3 × 3 square matrix, as illustrated below. To get zeros in the lower-left corner of this matrix, we would do the following: What I’ve done here is divide 4 by 2, which is called the …

Understanding HDBSCAN and Density-Based Clustering

Now that we have an idea of what type of data we are dealing with, let’s explore the core ideas of HDBSCAN and how it excels even when the data has arbitrarily shaped clusters, clusters with different sizes and densities, and noise. HDBSCAN uses a density-based approach which makes few implicit assumptions about the clusters. It is a …

CNN Sentiment Analysis

Convolutional neural networks, or CNNs, form the backbone of many modern computer vision systems. Image classification, object detection, semantic segmentation: all these tasks can be tackled by CNNs successfully. At first glance, it seems counterintuitive to use the same technique for a task as different as Natural Language Processing. This post is …

4 Types of Tree Traversal Algorithms

Before jumping into the tree traversal algorithms, let’s first define a tree as a data structure. That will help you grasp the concepts in a meaningful way. A tree is a hierarchical data structure that stores information naturally in the form of a hierarchy, unlike linear data structures such as linked lists and stacks. A tree contains …
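The excerpt is cut off before it names the four traversals. As a minimal sketch, assuming they are the usual pre-order, in-order, post-order, and level-order traversals of a binary tree, they could look like this in Python. The `Node` class, the function names, and the sample tree are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical binary tree node used only for this illustration."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def preorder(node: Optional[Node]) -> List[int]:
    # Visit the node first, then its left subtree, then its right subtree.
    if node is None:
        return []
    return [node.value] + preorder(node.left) + preorder(node.right)


def inorder(node: Optional[Node]) -> List[int]:
    # Visit the left subtree, then the node, then the right subtree.
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)


def postorder(node: Optional[Node]) -> List[int]:
    # Visit both subtrees before the node itself.
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.value]


def level_order(root: Optional[Node]) -> List[int]:
    # Breadth-first: visit nodes level by level using a FIFO queue.
    if root is None:
        return []
    visited, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        visited.append(node.value)
        if node.left is not None:
            queue.append(node.left)
        if node.right is not None:
            queue.append(node.right)
    return visited


# Made-up sample tree:
#       1
#      / \
#     2   3
#    / \
#   4   5
root = Node(1, Node(2, Node(4), Node(5)), Node(3))

print(preorder(root))     # [1, 2, 4, 5, 3]
print(inorder(root))      # [4, 2, 5, 1, 3]
print(postorder(root))    # [4, 5, 2, 3, 1]
print(level_order(root))  # [1, 2, 3, 4, 5]
```

The three depth-first variants differ only in where the node's own value is placed relative to the recursive calls, while level-order swaps recursion for an explicit queue.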

Deep Learning in Brain-Computer Interface

A Brain-Computer Interface (BCI) is a system that extracts and translates the brain activity patterns of a subject (humans or animals) into messages or commands for an interactive application. The brain activity patterns are signals obtained with Electroencephalography (EEG). The concept of controlling devices solely with our minds is nothing new. Science fiction and Hollywood …

Automatic Portfolio Optimization

Extract optimal asset weights for your portfolio using Python. Introduction: Portfolio optimization is a widely studied topic, especially in academia. The main idea is to maximize a portfolio’s value by finding the most productive combination of assets to yield the highest return. In this article, I will show you how to create your own Python …

Newsvendor Inventory Problem with R

The SCperf package for R contains several functions for inventory planning and management methods. For this example, let’s consider the following values for the Newsvendor model variables (assuming demand is normally distributed): D = 100 units, sd = 30 units, p = $4 per unit, c = $1 per unit, and s = …

TinyBERT for Search: 10x faster and 20x smaller than BERT

We used the code from this repo for knowledge distillation and modified it for training and evaluation on the MS Marco dataset. We initially trained a teacher bert-base-uncased network in PyTorch with the MS Marco training triples set. Then we used it as a teacher to train a smaller student BERT network with only 4 …

The Many Ways To Call Axes In Matplotlib

In matplotlib terminology, a basic plot starts from one figure and at least one axes (if you are confused about these terms, you may find this post useful). In a close analogy with painting, the figure is the canvas and the axes is the artistic composition. A canvas (figure) can have only one type or many different …

Data Cleaning with Pandas — Avoid this Mistake!

Pandas is an extremely useful data manipulation package in Python. For the most part, its functions are intuitive, speedy, and easy to use. But I once spent hours debugging a pipeline, only to discover that mixing types in a Pandas column causes all sorts of problems later in the pipeline. Read more to discover what …

Architecting Serverless Data Integration Hubs on AWS for Enterprise Data Delivery (2020)

Happy New Year! If data is the new oil, why do we experience so many backfires in our systems? A prior 2-part article series titled Serverless Data Integration discussed the importance of data integration, particularly in the serverless realm, in detail. We concluded that data integration was a critical business need, that …

Predictive Analytics on Customer Behavior with Support Vector Machines (SVM)

It appears that those with higher levels of education are less likely to respond to an offer. Similar to the first graph, it appears that customers with luxury vehicles are less likely to respond to an offer. How can we further explore the seemingly negative correlation between wealth and response rate? This graph further supports …

Accurate, reliable and fast robustness evaluation

As an additional benefit, the attack has only one hyperparameter (the maximum step length or trust region radius), and if one simply sets this parameter to its default value one can already reach very good results. I believe that this attack should be part of the standard toolbox of anyone trying to evaluate model robustness. …

A Keras-Based Autoencoder for Anomaly Detection in Sequences

Time for Some Code: Let’s get into the details. I should emphasize, though, that this is just one way to go about such a task using an autoencoder. There are other ways and techniques to build autoencoders, and you should experiment until you find the architecture that suits your project. These are the …

A Layman’s Guide to Fuzzy Document Deduplication

Why Gensim? Gensim is a Python library popularly used for topic modeling. However, it also has very valuable utilities for deduplication. While there are several efficient ways to calculate cosine similarity in Python, including the popular SKLearn library, Gensim’s major advantage comes when your dataset grows very large. When your corpus grows beyond …

Integrate JupyterLab with Google Drive

And as a data scientist, your work, insights, and conclusions are of vital importance, whether they are work-related or just something you’ve been working on on the side. Sure, you can always bring a flash drive with you, but that’s an inconvenient option and, needless to say, flash drives are easy to lose. …

Billion Dollar Data Science

How to prevent costly failures in high-risk data science applications. The intrinsic risk of predicting customer churn on a smartphone plan is not the same as that of pricing a trillion dollars of mortgage-backed securities or detecting a pedestrian in a crosswalk for a self-driving car. In the first situation a wrong decision has an opportunity cost, …

Why learn Data Science in 2020?

In conclusion, Data Science, Machine Learning, and related fields are very promising: they are exciting and fun, and have endless applications. Despite there being a lot of practitioners, there is a shortage of qualified professionals in these areas. If you become one more name in that list of professionals, you can only expect quality …

Addressing AI’s Hidden Agenda

What are the sources and significant examples of implicit bias in artificial intelligence? What can be done to minimize their consequences in establishing acceptable data practices? (Image credit: Bayeté Ross Smith.) When we evoke the implicit bias of artificial intelligence, we are recognizing that machine learning reflects the unconscionable attitudes and stereotypes that influence managerial …

Introducing Linear Regression (Least Squares) the Easy Way

Now, consider a very basic linear model with two variables. Temperature is the independent variable, and we wish to find out the effect of temperature on sales (the dependent variable). The twelve data points are in the form (x, Y). The x-coordinate represents the temperature in degrees Celsius and the Y-coordinate represents the sales in dollars. …
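The twelve data points are not reproduced in this excerpt, so the sketch below uses made-up (temperature, sales) pairs purely for illustration. It shows one common way to compute the least-squares line Y = slope · x + intercept by hand, using the closed-form formulas for simple linear regression. The function name `fit_line` and the sample values are assumptions, not the article's own code or data.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: temperature in degrees Celsius and sales in dollars.
# The points lie exactly on Y = 10 * x + 20 so the result is easy to check by hand.
temps = [10, 15, 20, 25, 30]
sales = [120, 170, 220, 270, 320]

slope, intercept = fit_line(temps, sales)
print(slope, intercept)  # 10.0 20.0
```

With real, noisy data the fitted line will not pass through every point; the slope then estimates the average change in sales per one-degree change in temperature.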

Big Data Analytics: Apache Spark vs. Apache Hadoop

“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” [11]. Big data analytics can be time-consuming, complicated, and computationally demanding without the proper tools, frameworks, and techniques. When the volume of data is too high to process and analyze on a single …

A Minimalism Approach to Understand What Data Science Is

In an insightful Medium article, which I strongly recommend you read if you’re serious about learning more about data science, Cassie Kozyrkov, the Chief Decision Scientist at Google, gave her so-called pithiest definition: “Data science is the discipline of making data useful.” Subject of Concern: Data. From a personal perspective as a minimalist, I think …

Getting Started with Data Analysis on AWS

Learn how to use AWS Glue, Amazon Athena, and Amazon QuickSight to transform, enrich, analyze, and visualize semi-structured data. According to Wikipedia, data analysis is “a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.” In this post, we will explore how to get …

Are your coding skills good enough for a Data Science job?

5 coding sniffs you must know if you are working in the Data Science industry. “It was Friday evening. I clearly remember how excited I was to spend the rest of the day with my family. My parents had traveled to Bangalore for the first time and I already had plans of showing them the …

A (Really) Gentle Introduction to NLP in Python

NLP Made Simple. Hold my hand and let’s get started together. I know, it’s not easy. NLP is a thing that everybody talks about, and it seems like everyone is doing it except you, lost and sad in the middle of the crowd. No worries, Natural Language Processing (NLP) is a hard …

What’s New in Splice Machine 3.0

Splice Machine 3.0 adds geo-replication, Kubernetes support, time travel, Jupyter notebooks, in-DB machine learning model deployment, and more. Splice Machine is part of a relatively new segment of the database management systems (DBMS) market where transactional and analytical functionality converge. In July 2013, Gartner used the term HTAP or Hybrid Transaction Analytical Processing to describe …

Neural Style Transfer using VGG model

A technique to transform a digital image so that it adopts the style of a different image. Introduction: Before we begin, let’s go to this website to get some inspiration. On the website, we choose a photo from the local computer (let’s assume the image is named Joey.jpg). Let’s call this the content image. Then we choose another image, say …

Interviewing the 1.5B GPT-2 model by OpenAI

Analysis of Generated Text and Model Training. The 1558M-parameter version of GPT-2 is able to generate coherent text for a variety of prompts. Interestingly, it is able to describe neural networks, deep learning, machine learning, and data science fairly accurately. However, these language models still lack intrinsic models of the world. For instance, it generates …

Maximizing Efficiency in Python — Six Best Practices for Implementing Python3.7 in Production.

3. Always Account For Memory and Efficiency. Simple Python programs will rarely run into memory issues; however, this topic becomes crucial as scripts grow larger and more complex. Unlike languages with manual memory management, the Python interpreter manages memory in the background, leaving users with little direct control. For more information regarding memory …

Continuous quality evaluation for ML projects using GitHub Actions.

I will use three different models (plus a baseline) to emulate step-by-step “work” on the task: a mean model (baseline), random predictions, linear regression, and gradient boosting over decision trees (LightGBM). In real-world problems, this is equivalent to continuously improving a model whose changes are pushed to the repository. One should also define metrics that estimate how good the …

How Deepfake Technology Can Become More Dangerous Than a Nuclear Weapon

“The powers that be no longer have to stifle information. They can now overload us with so much of it, there’s no way to know what’s factual or not. The ability to be an informed public is only going to worsen with advancing deep fake technology.” (J. Andrew Schrecker) All of us have heard Donald …

Training Object Detectors with No Real Data using Domain Randomization

Solving sim2real transfer for specialized object detectors with no budget. Deep learning has recently become the favored approach to object detection problems. However, as with many other uses of this technology, annotating training data is cumbersome and time-consuming, especially if you are a small company with a specific use case. In this article, I present some …

Who’s smarter? An IQ test for both AI systems and humans

Cutting through the hype surrounding artificial intelligence, François Chollet, an AI researcher at Google, has proposed the Abstraction and Reasoning Corpus (ARC), an intelligence test that could shape the course of future AI research. To date, there has been no satisfactory definition of artificial intelligence nor …

Mastering the data science job hunt

Editor’s note: The Towards Data Science podcast’s “Climbing the Data Science Ladder” series is hosted by Jeremie Harris, Edouard Harris and Russell Pollari. Together, they run a data science mentorship startup called SharpestMinds. Getting hired as a data scientist, machine learning engineer or data analyst is hard. And …

Using neural networks for a functional connectivity classification of fMRI data

We can see that the connections are not particularly strong for either group (the diagonal line can be ignored, as it shows each component’s correlation with itself and thus always equals 1). To better visualize the connections and the differences, we can project these back onto the brain. #Getting the center coordinates from the component decomposition …

Ultimate Setup for Your Next Python Project

Starting any project from scratch can be a daunting task… But not if you have this ultimate Python project blueprint! Whether you are working on a machine learning/AI project, building web apps in Flask, or just writing a quick Python script, it’s always useful to have some template for …

The Norwegian National Strategy for Artificial Intelligence Has Launched!

A Summary and Review of the New Strategy for AI on the Day of Its Launch, 14th of January. This day is special to me because I have been covering most of the AI strategies in Europe, and today my home country has released its own national strategy. The Norwegian national strategy was released on …

Run your data science team like an Admiral …

Running a data science team is hard! We need to take inspiration wherever we can, and the culture of Nelson’s navy is one place to start. (Portrait of Nelson by Lemuel Francis Abbott.) “No captain can do very wrong if he places his ship alongside that of the enemy.” Admiral Horatio Nelson, before the battle …

Kaggle 1st place winner cheated, $10,000 prize declared irrecoverable

How a team obtained private data, constructed a fake AI model, and got away with the money from a platform for adopting neglected pets. The cheaters stole from Petfinder.my, a platform for adopting homeless and neglected pets. Kaggle just announced that the 1st Place Team, Bestpetting[1], has been disqualified from the Petfinder.my competition …