SQL Order-based Calculations

Combining field values in SQL is common, as in firstname+lastname or year(birthday). No matter how many fields an expression contains, they all come from the same row. We call this an intra-row calculation. Correspondingly, there are inter-row calculations. Examples include getting the difference between the result of the champion and the …

Pairs Trading ADR and SPY. The price dynamics of ADR-SPY spreads motivate a mean reversion trading strategy.

Algo Trading ADR price dynamics motivate a trading strategy Looking in the same direction? Photo by SK Yeong on Unsplash Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s …

Central Limit Theorem: Proofs & Actually Working Through the Math

… Not another ‘hand-wavy’ CLT explanation… Let’s actually work through the math Photo by Diego PH on Unsplash For anyone pursuing study in Data Science, Statistics, or Machine Learning, stating that “The Central Limit Theorem (CLT) is important to know” is an understatement. Particularly from a Mathematical Statistics perspective, in most cases the CLT is …

5 Probability Questions to Test Your Skills

With many of you applying to Data Science positions, you can expect to be asked various sorts of probability questions during the technical portion of the interview process. Within this post, I aim to cover 5 different probability questions (increasing in difficulty) which I believe serve as a good blanket to the various types of …

Filter Learning with Unsupervised Learning

An unsupervised learning method for learning filters that can extract meaningful features out of images Data is everything. Especially in deep learning, the amount of data, type of data, and quality of data are the most important factors. Sometimes the amount of labeled data that we have is not enough or the problem domain that …

Databases 101: Introduction to Databases for Data Scientists

Data science is one of the fastest-growing fields, and I can’t see it slowing down any time soon, not with how our dependence on data is growing day by day. Data science is all about data: collecting it, cleaning it, analyzing it, visualizing it, and using it to make our lives better. Handling large amounts of data …

Practical Machine Learning Basics

My first exploration of Machine learning using the Titanic competition on Kaggle Louis & Lola, survivors of the Titanic disaster (Photo from Library of Congress Prints and Photographs, No known restrictions on publication) This article describes my attempt at the Titanic Machine Learning competition on Kaggle. I have been trying to study Machine Learning but …

XGBoost, LightGBM, and Other Kaggle Competition Favorites

An Intuitive Explanation and Exploration Kaggle is the data scientist’s go-to place for datasets, discussions, and perhaps most famously, competitions with prizes of tens of thousands of dollars to build the best model. With all the flurried research and hype around deep learning, one would expect neural network solutions to dominate the leaderboards. It turns …

3 Ways to Build Neural Networks in TensorFlow with the Keras API | by Orhan Gazi Yalçın | Medium

Building Deep Learning models with Keras in TensorFlow 2.x is possible with the Sequential API, the Functional API, and Model Subclassing Figure 1. The Sequential API, The Functional API, Model Subclassing Methods Side-by-Side If you are going around, checking out different tutorials, doing Google searches, spending a lot of time on Stack Overflow about TensorFlow, …

Ultimate Pandas Guide — Mastering the Groupby

We can also index with a single column (as opposed to a list): sales_data.groupby('month').agg(sum)['purchase_amount'] In this case, we get a Series object instead of a DataFrame. I tend to prefer working with DataFrames, so I typically go with the first approach. Now that we have the basics down, let’s go through a few of the more …
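The difference between the two indexing styles mentioned in this excerpt can be sketched roughly as follows. The contents of sales_data are made up for illustration, the "first approach" is assumed to be list-style indexing, and the sketch passes the string "sum" to agg instead of the built-in sum, which should give the same totals here.

```python
import pandas as pd

# Hypothetical data; only the column names follow the excerpt.
sales_data = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb"],
    "purchase_amount": [10.0, 5.0, 7.5],
    "units": [1, 2, 3],
})

# Sum every numeric column within each month.
totals = sales_data.groupby("month").agg("sum")

# Indexing with a list of column names keeps a DataFrame.
as_frame = totals[["purchase_amount"]]

# Indexing with a single column name returns a Series.
as_series = totals["purchase_amount"]

print(type(as_frame).__name__)
print(type(as_series).__name__)
```

Both objects hold the same monthly totals; the choice mostly affects which methods and display format you get downstream.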

What I learned as a college student running a large open-source project

My name is Palash Shah, and I’m the author of Libra: a machine learning library that lets you build and train models in one line of code. I’m also an undergraduate student at the University of Virginia. My journey in the open source community started as a normal college student: I worked on …

Domain Expertise: What deep learning needs for better COVID-19 detection

The world probably doesn’t need another neural network, but it needs a coffee chat with those on the front lines. By now, you’ve probably seen a few, if not many, articles on how deep learning could help detect COVID-19. In particular, convolutional neural networks (CNNs) have been studied as a faster and cheaper alternative to …

Learning to Rank for Information Retrieval: A Deep Dive into RankNet.

Machine Learning and Artificial Intelligence are currently driving innovation in the field of Computer Science, and they are being applied to a multitude of fields across disciplines. However, traditional ML models can still be broadly categorized as solutions to two types of problems. Classification, which aims at labelling a particular instance of data …

Building the Ultimate AI Agent for Doom using Dueling Double Deep Q-Learning

A Reinforcement Learning Implementation in PyTorch. Over the last few articles, we’ve discussed and implemented various value-learning architectures for the VizDoom environment, and examined their performance in maximizing reward. To summarize, these include: Overall, vanilla Deep Q-learning is a highly flexible and responsive online reinforcement learning approach that utilizes rapid intra-episodic updates to its estimations …

Evolutionary Decision Trees: When Machine Learning draws its Inspiration from Biology

2.5. Mutation Mutation refers to small random changes made to individuals of a population. It is essential for ensuring genetic diversity and enabling the genetic algorithm to search a broader space. In the context of Decision Trees, it can be implemented by randomly changing the attribute and the split value of a randomly selected node. …

Model Lifecycle: From ideas to value

Value scoping, discovery, delivery, and stewardship Created by Authors based on Youtube video Monarch Butterfly Metamorphosis time-lapse FYV 1080 HD In Part 1 of this series we examined the key differences between software and models; in Part 2 we explored the twelve traps of conflating models with software; and in Part 3 we looked at …

Business Intelligence Visualizations with Python

The installation process is pretty straightforward. Just open your terminal and enter the following command: pip install matplotlib. A. Line Plot: After installing the library, we can move on to plot creation. The first type we’re going to create is a simple Line Plot: # Begin by importing the necessary libraries: import matplotlib.pyplot as plt …

Data Processing Example using Python

Just some of the steps involved in prepping a dataset for analysis and machine learning. Source: Image Created by Author Forbes’s survey found that the least enjoyable part of a data scientist’s job takes up 80% of their time: 20% is spent collecting data and another 60% is spent cleaning and organizing data sets. Personally, …

How to Query PostgreSQL using Python (with SSH) in 3 Steps

STEP 3: Query! Now we’re ready to start querying! The defined class only provides a handful of basic functions. Let’s walk through how to use the class and what we can do with it. First, we’ll need to specify our PostgreSQL connection arguments, and SSH arguments (if SSH tunneling is required to access the remote …

Training Better Deep Learning Models for Structured Data using Semi-supervised Learning

Deep learning is known to work well when applied to unstructured data like text, audio, or images but can sometimes lag behind other machine learning approaches like gradient boosting when applied to structured or tabular data. In this post, we will use semi-supervised learning to improve the performance of deep neural models when applied to structured …

Latent Dirichlet Allocation: Intuition, math, implementation and visualisation

TL;DR: Latent Dirichlet Allocation (LDA, sometimes LDirA/LDiA) is one of the most popular and interpretable generative models for finding topics in text data. I’ve provided an example notebook based on web-scraped job description data. Although running LDA on a canonical dataset like 20Newsgroups would’ve provided clearer topics, it’s important to witness how difficult …

Machine Learning Model Explanation using Shapley Values

Learn how to interpret a black box model using SHAP (SHapley Additive exPlanations) Photo by Frank Vessia on Unsplash Article Outline: Why SHAP (SHapley Additive exPlanations), About Dataset, Loading Dataset, Model Fitting, Shapley values estimation, Variable Importance plot, Summary plot, Dependence Plot, Force Plot. Tutorial DataSet Why SHAP (SHapley Additive exPlanations)? The very common problem …

Understanding Apache Parquet

Data Warehousing | Data Lake | Parquet. Understand why Parquet should be used for warehouse/lake storage. “Apache Parquet is a columnar storage format available to any project […], regardless of the choice of data processing framework, data model or programming language.” (https://parquet.apache.org/) This description is a good summary of this format. This post will talk …

SQL Window (Analytic) Functions Explained in 4 minutes

To provide some more clarity, suppose we have the following table: If we wanted to get the average GPA by gender, we could use an aggregate function and run the following query: SELECT Gender, AVG(GPA) AS avg_gpa FROM students GROUP BY Gender. The next part is key. Now suppose we wanted …
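To try the aggregate query from this excerpt without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The student rows and the Name column are invented for illustration, since the original table is not shown, and the ORDER BY clause is an addition that only makes the output order predictable.

```python
import sqlite3

# In-memory database with a hypothetical students table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (Name TEXT, Gender TEXT, GPA REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Ann", "F", 3.5), ("Bea", "F", 4.0), ("Cal", "M", 3.0), ("Dan", "M", 2.0)],
)

# GROUP BY collapses the rows into one output row per gender.
rows = conn.execute(
    "SELECT Gender, AVG(GPA) AS avg_gpa "
    "FROM students GROUP BY Gender ORDER BY Gender"
).fetchall()
conn.close()

for gender, avg_gpa in rows:
    print(gender, avg_gpa)
```

Note that the four input rows become two result rows: the aggregate keeps one line per group and drops the individual students.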

Hierarchical topic modeling with BigARTM library

The matrix with topics per document is rather sparse, so we got exactly what we needed. It will be convenient to read the articles which relate to a particular topic. So here we can obtain a list of articles sorted by topic probability. Building hierarchy The topics …

path.chain: Concise Structure for Chainable Paths

[This article was first published on krzjoa, and kindly contributed to R-bloggers]. The path.chain package provides an intuitive and easy-to-use system of nested objects, which represents different …

Slicing the onion 3 ways- Toy problems in R, python, and Julia

Between writing up my thesis, applying to jobs (hire me! I’m quite good at programming), and the ongoing pandemic, I don’t really have time to write full blogposts. I have however decided to brush up my python skills and dive headfirst into Julia. As such, I like to answer the toy problems posted at fivethirtyeight’s …

What is AI? A straight-forward introduction

Artificial Intelligence (AI) is a part of our daily lives. From language translation to medical diagnostics and driverless cars to facial recognition, it’s making more of an impact on industry and society every day. But what exactly is AI? Simply put, AI is a technology that replicates human intelligence through computers, systems or …

Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis

[This article was first published on R – Myscape, and kindly contributed to R-bloggers]. Generalized linear models (GLM) are a classic method for analyzing RNA-seq …

On Demand Materialized Views: A Scalable Solution for Graphs, Analysis or Machine Learning

Let’s create a simple example with some mock data. In this example we will aggregate generic posts and determine how many posts each profile has, then we will aggregate comments. If you are using the code snippets to follow this article, you will want to create a few data points following the style below. However …

Solving a Social Distancing Problem using Genetic Algorithms

“Social distancing” has become very popular these days, but it is not always obvious how the rules can fit our daily life. In this story, we are going to study a social distancing problem and find a solution to it using Genetic Algorithms. After setting the problem and its constraints, I’ll summarize the principles of Genetic …

Why and How to use Cross Entropy

Working out the cross entropies of each observation shows that when the model incorrectly predicted 1 with a low probability, there was a smaller loss than when the model incorrectly predicted 0 with a high probability. Minimizing this loss function will prevent high probabilities from being assigned to incorrect predictions. To demonstrate why cross entropy …
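The comparison described in this excerpt can be reproduced with a few lines of arithmetic. This is a minimal sketch of the standard binary cross-entropy formula; the probabilities 0.6 and 0.1 are illustrative values, not the article's own observations.

```python
import math

def binary_cross_entropy(y_true, p_pred):
    """Cross entropy of one observation; p_pred is the predicted P(y = 1)."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

# True label is 0, but the model leans towards 1 with low confidence: P(1) = 0.6.
loss_weak_wrong = binary_cross_entropy(0, 0.6)

# True label is 1, but the model predicts 0 with high confidence: P(1) = 0.1.
loss_confident_wrong = binary_cross_entropy(1, 0.1)

print(round(loss_weak_wrong, 3), round(loss_confident_wrong, 3))
```

The confidently wrong prediction is penalized far more heavily than the hesitant one, which is why minimizing this loss discourages assigning high probabilities to incorrect classes.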

How to Plan and Organize a Data Science Project? | by Yin Zhang

Conducting a data science/analytics project always takes time and has never been easy. A successful and comprehensive analytics project is way beyond coding. Instead, it involves sophisticated planning and a large amount of communication. Photo by Octavian Dan on Unsplash What is the Life Cycle of an Analytics Project? To complete a data science/analytics project, …

How to make your deep learning experiments reproducible and your code extendible

Lessons learned from building an open-source deep learning for time series framework. Photo by author (taken while hiking at the Cutler Coast Preserve in Machias ME) Note this is roughly based on a presentation I made back in February at the Boston Data Science Meetup Group. You can find the full slide deck here. I …

Ultimate Pandas Guide — Joining data with Python

Photo by Laura Woodbury from Pexels Master the difference between “Merge” and “Join” Everyone who works in data knows this: before you build machine learning models or produce stunning visualizations, you have to get your hands dirty with data wrangling. And one of the core skills in data wrangling is learning how to join together …

A Summer as a Data Scientist

A retrospective on my summer as a data scientist and how GSI Technology’s summer program breaks the internship status quo. GSI Technology. Reposted with Author’s Permission Data science is a field that can be hard to break into, especially if you are an undergraduate student. My name is Braden Riggs and some of you reading …

Convolutional Neural Network: How is it different from the other networks? | by YANG Xiaozhou | Sep, 2020 | Towards Data Science

Roughly speaking, there are two important operations that make a neural network: 1. Forward propagation 2. Backpropagation. Forward propagation: This is the prediction step. The network reads the input data, computes its values across the network, and gives a final output value. But how does the network compute an output value? Let’s see what happens in a …

Introducing TMS: a Trading Market Simulator

An easy-to-use trading simulator to test trading (ML/AI) algorithms and strategies in Python. Simulation of AAPL on September 9th, 2020, using TMS. Some time ago, I wrote an article on how to download stock market data for free using Alpaca, a trading broker and API. I published this article because I had worked on …