How to extract online data using Python

Basic concepts about HTML, XPath, Scrapy, and spiders “It would be nice to have all the documents of the website,” one of her colleagues said. “Yeah, that could give us a lot of information,” said another colleague. “Can you do the scraper?” They both turned to look at her. “Ehhhh… I could….” She started mumbling. “Perfect,” they both said. “….try” She … Read more

PDF Processing with Python

Introduction Being a high-level, interpreted language with a relatively easy syntax, Python is perfect even for those who don’t have prior programming experience. Popular Python libraries are well integrated and provide ways to handle unstructured data sources such as PDF, and can be used to make that data more sensible and useful. PDF is one of … Read more

Functional Programming features in Scala

Data Engineering Scala language features and patterns I’ve been exploring functional programming with Scala and its ecosystem for the past few months. In this post, I’ll highlight some of the features of the language that enable intuitive creation of functional code for distributed systems and data operations. Photo by apoorv mittal on Unsplash As … Read more

Linked List Implementation Guide

The following figures demonstrate the implementation of a linked list. Although the demonstration was written in C, the procedure and steps shown apply to implementing a linked list in any programming language. Required Libraries and Structure Definitions Figure 1 below includes the required libraries, defines two functions that are commonly used throughout the demonstration, … Read more

Linked Lists vs. Arrays

Linked Lists A linked list is another approach to collecting similar data. However, unlike an array, elements in a linked list are not in consecutive memory locations. A linked list is composed of nodes which are connected with each other using pointers. Figure 3 below illustrates a linked list. Figure 3: Diagram of singly linked … Read more
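The description above (nodes that are not stored contiguously and are connected by pointers) can be sketched in a few lines. This is a minimal Python illustration rather than the C code from the post; the names `Node`, `LinkedList`, `append`, and `to_list` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Node:
    """One element of a singly linked list (hypothetical name)."""
    value: Any
    next: Optional["Node"] = None  # reference to the following node, None at the tail


class LinkedList:
    """Minimal singly linked list, for illustration only."""

    def __init__(self) -> None:
        self.head: Optional[Node] = None

    def append(self, value: Any) -> None:
        """Walk to the tail and attach a new node there."""
        new_node = Node(value)
        if self.head is None:
            self.head = new_node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = new_node

    def to_list(self) -> List[Any]:
        """Collect the values by following the next references from the head."""
        values = []
        current = self.head
        while current is not None:
            values.append(current.value)
            current = current.next
        return values


numbers = LinkedList()
for n in (3, 1, 4):
    numbers.append(n)
print(numbers.to_list())
```

Unlike an array, nothing here relies on the elements sitting next to each other in memory: each node only knows where the next one is.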

The XIV Meltdown

What is the VIX? And Why Does Everyone Short It? Check out the chart below. The dark blue line is SVXY, an ETF that shorts VIX futures (that is still alive and kicking today unlike XIV). The light blue line (that pretty much looks horizontal) is the S&P 500. From the start of 2013 to the … Read more

How to do hyperparameter tuning of a BigQuery ML model

Bayesian Optimization using Cloud AI Platform or Grid Search using scripting When carrying out machine learning, there are many parameters that we choose rather arbitrarily. These include factors such as the learning rate, the level of L2 regularization, the number of layers and nodes in a neural network, the maximum depth of a boosted tree, … Read more

Deploying Models to Flask

Passing model outputs to the HTML template So we’ve made our predictions using our python files, but now it’s time to display them using HTML templates. Templates are just a way to change HTML code so that we can update the user with new values (such as our predictions). My goal isn’t to teach you … Read more

Don’t Put All Your Eggs in One Basket

The Model The purpose is to determine what fraction of a portfolio to invest in each of several possible assets with the goal of minimizing the volatility of the portfolio, subject to a target return. To frame the question mathematically, suppose f is an n-dimensional vector of the fractions that I’ll invest in each of … Read more
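One common way to write down the problem described above is the classic mean-variance form. This is a sketch under assumptions, not necessarily the post's own formulation: only the vector f comes from the excerpt, while the covariance matrix \Sigma, the expected-return vector \mu, and the target return r^{*} are symbols introduced here for illustration.

```latex
\begin{aligned}
\min_{f \in \mathbb{R}^{n}} \quad & f^{\top} \Sigma f
    && \text{(portfolio variance; volatility is its square root)} \\
\text{subject to} \quad & \mu^{\top} f = r^{*}
    && \text{(hit the target return)} \\
& \mathbf{1}^{\top} f = 1
    && \text{(the fractions sum to one)}
\end{aligned}
```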

Artificial Intelligence in Video Games

An overview of how video game A.I. has developed over time and current uses in games today Written by Laura E. Shummon Maass and Andy Luc Virtual Reality Photo by Harsch Shivam Most people probably imagine that the majority of games released in the last couple of years have highly sophisticated A.I. for any non-player controlled characters, … Read more

Implement K-Nearest Neighbors classification Algorithm

Building Heart disease classifier using K-NN algorithm The most crucial task in the healthcare field is disease diagnosis. If a disease is diagnosed early, many lives can be saved. Machine learning classification techniques can significantly benefit the medical field by providing an accurate and quick diagnosis of diseases, and hence save time for both doctors … Read more

Ecom Data Series: What is Demand Forecasting?

MAKING ECOMMERCE DATA SCIENCE CONCEPTS SIMPLE ONE TOPIC AT A TIME. Black magic that has powered retail and logistics operations for generations. Ecom Data Talk Episode 4: What is Demand Forecasting? Understanding of past events to predict future sales 📈📊 is fundamental to retail and ecommerce operation optimization. Before you accurately measure your pricing and … Read more

Language Detection Benchmark using Production Data

This is a benchmark on real-life social media data for multilingual language detection algorithms. The Tower of Babel by Pieter Bruegel the Elder (1563) As data scientists, we’re accustomed to processing many different types of data. But when it comes to text-based data, knowing the language of the data is a top priority. I experienced this … Read more

Press Coverage of the early 2020 Primary

Observations of the early press coverage in the 2020 Democratic presidential primary race Admittedly, we’re still in the year 2019 and the next U.S. presidential election is 2020, about 17 months from the time of this writing. However, the election process has already begun, and there are over 20 individuals who have declared candidacies and are … Read more

Apply and Lambda usage in pandas

Filtering a dataframe Filtering…. Pandas makes filtering and subsetting dataframes pretty easy. You can filter and subset dataframes using the normal comparison operators together with the &, | and ~ operators. # Single condition: dataframe with all movies rated greater than 8 df_gt_8 = df[df['Rating']>8] # Multiple conditions: AND – dataframe with all movies rated greater than 8 and having more than … Read more
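The filtering pattern in the excerpt can be tried on a small, self-contained example. This is only a sketch: the sample data and the `Title` and `Votes` columns are hypothetical; only the `Rating` column and the `df_gt_8` name appear in the excerpt.

```python
import pandas as pd

# Hypothetical sample data; only the 'Rating' column name comes from the excerpt.
df = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "D"],
        "Rating": [8.5, 7.9, 9.1, 8.2],
        "Votes": [120_000, 300_000, 90_000, 10_000],
    }
)

# Single condition: rows with Rating greater than 8
df_gt_8 = df[df["Rating"] > 8]

# AND: wrap each condition in parentheses, because & binds tighter than >
df_and = df[(df["Rating"] > 8) & (df["Votes"] > 100_000)]

# OR: rows matching at least one condition
df_or = df[(df["Rating"] > 9) | (df["Votes"] > 250_000)]

# NOT: invert a boolean mask with ~
df_not = df[~(df["Rating"] > 8)]

print(df_gt_8["Title"].tolist())
```

The parentheses matter: without them, Python would evaluate `8 & df["Votes"]` first and the expression would fail or give an unintended result.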

Malware Detection Using Deep Learning

Malware Detection Using Convolutional Neural Networks In fast.ai Photo by Markus Spiske on Unsplash What is Malware? Malware refers to malicious software perpetrators dispatch to infect individual computers or an entire organization’s network. It exploits target system vulnerabilities, such as a bug in legitimate software (e.g., a browser or web application plugin) that can be … Read more

Bayesian inference problem, MCMC and variational inference

Markov Chain Monte Carlo (MCMC) As we mentioned before, one of the main difficulties faced when dealing with a Bayesian inference problem comes from the normalisation factor. In this section we describe MCMC sampling methods that constitute a possible solution to overcome this issue as well as some other computational difficulties related to Bayesian inference. The … Read more

Uncovering what neural nets “see” with FlashTorch

Motivation behind FlashTorch When I discovered the world of feature visualisation, I was immediately drawn to its potential in making neural nets more interpretable and explainable. Then I quickly realised that there was no tool available to easily apply these techniques to neural networks I’ve built in PyTorch. So I decided to build one: FlashTorch, which … Read more

7 Ways to Secure Amazon Athena

Broadly, data security can be considered in two areas: when data is at rest and when data is in flight. Let’s consider data at rest. Scenario #1: You have an S3 bucket containing data you want to query from Athena. How can you ensure the data is secure in the bucket? First, make sure the … Read more

Tweepy for beginners

Using Twitter’s API to build your own data set A good way to build out your portfolio is with a natural language processing project, but like every project, the first step is getting hold of the data. Twitter can be a great resource for text data; it has an API, credentials are easy to acquire and … Read more

How to write a do-while loop on Tensorflow?

Two difficulties arise: There is no simple while statement in Tensorflow, and instead we must use the function tf.while_loop(cond, body, loop_vars). Tensorflow, at least in graph mode, prohibits using tf.Tensor objects as boolean objects (True/False) for control flow. We must instead use the tf.cond(pred, true_fn, false_fn) statement. Concerning the first … Read more

NVIDIA Jetson Nano and LEGO Minifigures

LEGO Minifigures object detection with NVIDIA Jetson Nano. NVIDIA Jetson Nano is a small AI computer which people often refer to as “Raspberry Pi on steroids.” I received my Jetson Nano Developer Kit a few days ago and decided to build a small project with it: LEGO Minifigures object detection. Setting up Jetson Nano … Read more

Generating Beatles’ Lyrics with Machine Learning

The Beatles were a huge cultural phenomenon. Their timeless music still resonates with people today, both young and old. Personally, I’m a big fan. In my humble opinion, they are the greatest band to have ever lived¹. Their songs are full of interesting lyrics and deep ideas. Take these bars for example: When you’ve seen … Read more

Practical Psychology for Data Scientists

You aren’t as logical as you think. None of us are. We are susceptible to cognitive biases each and every day. If you have had the pleasure of reading Thinking, Fast and Slow by Daniel Kahneman, then you are more than familiar with this reality. We are imperfect creatures and the world is an imperfect … Read more

7 Useful Pandas Tips for Data Management

A Premier League Financial Review Example Money Ball The Premier League is big business. In fact, Premier League clubs have paid out more than £260m to football agents during 2018–19 – an increase of £49m on the previous 12 months. This statistic alone piqued my interest and drove me to delve deeper into Premier League spending … Read more

Feature Elimination Using SVM Weights

Specifically for SVMLight, but this feature elimination methodology can be used for any linear SVM. Figure 1: a random example of accuracy based on the number of SVM features used. While working on my M.Sc thesis, circa 2005–2007, I had to calculate feature weights based on an SVM model. This was before SKlearn, which started in 2007. … Read more

What Separates Good from Great Data Scientists?

The most valuable skills in an evolving field The data science job market is changing rapidly. Being able to build machine learning models used to be an elitist skill that only a few distinguished scientists possessed. But nowadays, anyone with basic coding experience can follow the steps to train a simple scikit-learn or keras model. Recruiters … Read more

Machine Learning Clustering: DBSCAN Determine The Optimal Value For Epsilon (EPS) Python Example

Density-Based Spatial Clustering of Applications with Noise, or DBSCAN for short, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to classify unlabeled data. In other words, the samples used to train our model do not come with predefined categories. In comparison to other clustering algorithms, DBSCAN is particularly well suited for … Read more

Machine Learning At Scale With Apache Spark MLlib Python Example

For most of their history, computer processors became faster every year. Unfortunately, this trend in hardware stopped around 2005. Due to limits in heat dissipation, hardware developers stopped increasing the clock frequency of individual processors and opted for parallel CPU cores. This is fine for playing video games on a desktop computer. However, when it … Read more

What is Reinforcement Learning?

“Reinforcement Learning, like many topics with names ending in -ing, such as Machine Learning, Deep Learning, planning, and mountaineering, is simultaneously a problem, a class of solution methods that work well on the class of problems, and the field that studies these problems and … Read more

Semantic similarity classifier and clustering sentences based on semantic similarity.

Recently we have been doing some experiments to cluster semantically similar messages, leveraging pre-trained models so we can get something off the ground using no labelled data. The task: given a list of sentences, cluster them so that semantically similar sentences land in the same cluster, with the number of clusters not predetermined. … Read more

The Incredible Shrinking Bernoulli

Simulating Hacker News inter-arrival times distribution with the flip of a coin Joey Kyber via Pexels Bernoulli counting process Bernoulli distributions sound like a complex statistical construct, but they represent flipping a (possibly biased) coin. What I find fascinating is how this simple idea can lead to modeling more complex processes such as the probability to get a … Read more
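The coin-flip idea can be sketched as a simulation: flip a biased coin once per time step, treat each success as an arrival, and record how many steps separate consecutive arrivals. This is an illustrative sketch, not the post's code; the function name `inter_arrival_times`, the success probability `p=0.1`, and the step count are assumptions made for the example.

```python
import random
from typing import List


def inter_arrival_times(p: float, n_steps: int, seed: int = 0) -> List[int]:
    """Flip a biased coin n_steps times; return the gaps (in steps) between successes."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    gaps = []
    steps_since_last = 0
    for _ in range(n_steps):
        steps_since_last += 1
        if rng.random() < p:  # success: an "arrival" happens in this time step
            gaps.append(steps_since_last)
            steps_since_last = 0
    return gaps


p = 0.1  # hypothetical probability of an arrival in any single step
gaps = inter_arrival_times(p=p, n_steps=100_000)
mean_gap = sum(gaps) / len(gaps)
print(f"arrivals: {len(gaps)}, mean gap: {mean_gap:.2f} steps, theory: {1 / p:.0f}")
```

The gaps follow a geometric distribution with mean 1/p, so the simulated mean should land close to 10 steps for p = 0.1.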

Determining Presidential Approval Rating Using Reddit Sentiment Analysis

The Team As mentioned, this problem was tackled by 6 Duke undergraduate students — Milan Bhat, a sophomore studying Electrical and Computer Engineering, Andrew Cuffe, a senior studying Economics and Computer Science, Catherine Dana, a junior studying Computer Science, Melanie Farfel, a senior studying Economics and Computer Science, Adam Snowden, a junior studying Biology and Computer Science, … Read more

The Truth About Open Data

I’m currently volunteering at a data journalism startup in Cali, Colombia. In the past two weeks I’ve had meetings with business owners, students, mayoral candidates, and government officials to dive deep into data. I’ve learned some interesting things. The city of Cali is one of a few places in Latin America that has really begun … Read more

Trail Secrets: An Intelligent Recommendation Engine for Finding Better Hikes

I recently went on a weekend camping trip in The Enchantments, which is just over a two hour drive from where I live in Seattle, WA. To plan for the trip, we relied on AllTrails, which is a fantastic application with over 75,000 hand-curated hiking trails along with photos, reviews, and in-depth trail information. AllTrails … Read more

Analyzing Online Activity and Sleep Patterns

Making Data Science Fun Analyze your Facebook friends’ online activity and sleep patterns There is a wealth of information publicly available on social networks, some of which we even forget exists. Something as small as the online activity of our Facebook friends can enable us to deduce information like when they sleep or when they are most active … Read more

Collaborative filtering to “predict” the efficacy of a drug (2)

Another case study and some thoughts on domain knowledge Yu Liu, Jun 29 I showed the result of using collaborative filtering to predict the interaction strength between a drug and its target in the first blog post of this series. In this sequel, I will try to work on another dataset and discuss the significance … Read more

Defining Quality: Towards a Better Understanding of “Statistical Quality Control”

Quality of products and services plays an important role in the decision-making processes of different customer segments. Maintaining quality at the desired level, though it may be challenging, is imperative for achieving a high level of customer satisfaction, as well as maximizing revenue and market share, eliminating waste product for companies, and prolongation of product … Read more

Writing a simple Flask Web Application in 80 lines

Sample tutorial for getting started with Flask Flask is a microframework for Python based on Werkzeug, Jinja 2 and good intentions. Its features: easy to use; built-in development server and debugger; integrated unit testing support; RESTful request dispatching; uses Jinja2 templating; support for secure cookies (client-side sessions); 100% WSGI 1.0 compliant; Unicode based; extensively documented. The … Read more

Modelling with Tidymodels and Parsnip

Overview I recently completed the Business Analysis With R online course focused on applied data and business science with R, which introduced me to a couple of new modelling concepts and approaches. One that especially captured my attention is parsnip and its attempt to implement a unified modelling and analysis interface (similar to Python’s … Read more

Automated Data Quality Testing at Scale using Apache Spark

I have been working as a Technology Architect, mainly responsible for the Data Lake/Hub/Platform kind of projects. Every day we ingest data from 100+ business systems so that the data can be made available to the analytics and BI teams for their projects. Problem Statement While ingesting data, we avoid any transformations. The data is … Read more