TARGET HK: A Quick Dive Into China’s Disinformation Campaign On Twitter

This is a quick dive into the trove of Chinese state troll tweets released by Twitter on Aug 19. More to come in the coming days and weeks. An example of a Chinese state troll tweet exposed by Twitter on Aug 19. On August 19, Twitter dropped a new trove of state troll tweets that the …

Notes on Becoming an RStudio Certified Trainer

[This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers]. I recently became an RStudio Certified Trainer, and thought that it …

Boston Job Market for Data Analysts and Scientists : August 2019 Update

Most Hiring Companies, Top Tools & Tech, and More. Introduction: This is an August 2019 update of my original project where I simply aim to explore the job market for data analysts and data scientists in the Greater Boston Area. These visuals were produced only from job listings posted on Indeed with the search term …

GeoVec: word embeddings for geosciences

We can see that it is organised by layers and contains details such as colour, presence of roots, descriptions of the pores, textural class (estimated proportion of clay, silt and sand), etc. Most of the time the descriptions follow some recommended format but they might contain more or less free-form text depending on the study. …

KL Divergence Python Example

As you progress in your career as a data scientist, you will inevitably come across the Kullback–Leibler (KL) divergence. We can think of the KL divergence as a kind of distance measure (although it isn’t symmetric) that quantifies the difference between two probability distributions. One common scenario where this is useful is when we are working with a …
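The asymmetry mentioned above is easy to check numerically. The following is a minimal sketch for discrete distributions, not code from the article: the helper name `kl_divergence` and the two example distributions are assumptions chosen purely for illustration.

```python
import math


def kl_divergence(p, q):
    """Return KL(P || Q) in nats for two discrete distributions.

    Hypothetical helper for illustration. p and q are sequences of
    probabilities over the same outcomes; terms with p_i == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative distributions (assumed, not taken from the article).
p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)   # KL(P || Q)
reverse = kl_divergence(q, p)   # KL(Q || P)

print(forward, reverse)         # the two values differ
print(kl_divergence(p, p))      # identical distributions give 0.0
```

Swapping the arguments changes the result, which is why the KL divergence is usually described as a divergence rather than a true distance metric.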

Introducing the BigQuery Terraform module

It’s no secret software developers love to automate their work away, and cloud development is no different. Since the release of the Cloud Foundation Toolkit (CFT), we’ve offered automation templates with Deployment Manager and Terraform to help engineers get set up with Google Cloud Platform (GCP) quickly. But as useful as the Terraform offering was, …

Reducing SAP implementations from months to minutes with Azure Logic Apps

It’s always been a tricky business to handle mission-critical processes. Much of the technical debt that companies assume comes from having to architect systems that have multiple layers of redundancy, to mitigate the chance of outages that may severely impact customers. The process of both architecting and subsequently maintaining these systems has resulted in huge …

Detecting and modeling outliers with PyOD

As the name suggests, outliers are data points that differ significantly from the rest of your observations. In other words, they are far away from the average path of your data. In statistics and Machine Learning, detecting outliers is a pivotal step, since they might affect the performance of your model. For example, imagine you want to …

Deep Learning and Momentum Investing

V. Test Set Results and Interpretability of Predictions
A. Out-of-Sample Results
First, to gauge the model’s ability to generalize on unseen data, let’s have a look at the test set loss. Figure 5 plots the ensemble loss relative to its validation loss (dashed black line normalized to 1). The red line draws the average loss …

Azure Sphere’s customized Linux-based OS

Security and resource constraints are often at odds with each other. While some security measures involve making code smaller by removing attack surfaces, others require adding new features, which consume precious flash and RAM. How did Microsoft manage to create a secure Linux-based OS that runs on the Azure Sphere MCU? The Azure Sphere …

Making PATE Bidirectionally Private

This guide is based on this repo. Some sections of the code will be skipped or modified for readability of the article. Initial Setup: First, we need to import the necessary libraries. This guide assumes all libraries are already installed locally. We’re declaring the necessary libraries and hooking Syft with Torch. To demonstrate how PATE …

Data Science Roles: A Classification Problem

Netflix relies on data to deliver personalized experiences for 130 million Netflix members worldwide. According to the Netflix Tech Blog: Every day more than 1 trillion events are written into a streaming ingestion pipeline, which is processed and written to a 100PB cloud-native data warehouse. And every day, our users run more than 150,000 jobs …

How to Prepare for Your Data Engineering Interview

You should feel very accomplished if you get to the on-site interview, but the hardest part is yet to come! On-sites can be grueling affairs of interviewing with 4–10 people in 3–6 hours, especially if you’re not prepared. Knowing what to expect and doing realistic preparation beforehand go a long way toward reducing fear and …

You Don’t Have to be Struck by Lightning to Win the Lottery

A few weeks ago while in lecture, I was asked the following question: “What’s the likelihood of making a living by playing the lottery?” Not very high, you think? Well, in 2005, a group of MIT students got together and formed a betting syndicate. They had found the game they wanted to bet on, calculated …

RcppQuantuccia 0.0.3

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers]. A maintenance release of RcppQuantuccia arrived on CRAN earlier …

Introducing Open Forensic Science in R

The free online book Open Forensic Science in R was created to foster open science practices in the forensic science community. It comprises eight chapters: an introduction and seven chapters covering different areas of forensic science: the validation of DNA interpretation systems, firearms analysis of bullets and casings, latent fingerprints, shoe outsole impressions, …

simstudy updated to version 0.1.14: implementing Markov chains

I’m developing study simulations that require me to generate a sequence of health statuses for a collection of individuals. In these simulations, individuals gradually grow sicker over time, though sometimes they recover slightly. To facilitate this, I am using a stochastic Markov process, where the probability of a health status at a particular time depends …
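As a rough illustration of the idea, a Markov chain over health states can be sketched in a few lines of Python. This is not the simstudy (R) implementation: the state names, the transition probabilities, and the `simulate` helper are all hypothetical, and the sketch assumes the next status depends only on the current one.

```python
import random

# Hypothetical health states, ordered from best to worst.
STATES = ["healthy", "mild", "severe"]

# Hypothetical transition probabilities: each row gives the chance of
# moving from the current state to each state in STATES.
# Worsening is more likely than recovery, but recovery is possible.
TRANSITIONS = {
    "healthy": [0.80, 0.18, 0.02],
    "mild": [0.10, 0.70, 0.20],
    "severe": [0.02, 0.13, 0.85],
}


def simulate(start, n_steps, rng):
    """Generate one individual's sequence of health states.

    The next state is drawn using only the current state (Markov property).
    """
    path = [start]
    for _ in range(n_steps):
        current = path[-1]
        path.append(rng.choices(STATES, weights=TRANSITIONS[current])[0])
    return path


rng = random.Random(42)  # seeded so repeated runs give the same path
path = simulate("healthy", 10, rng)
print(path)
```

Running `simulate` once per individual gives a collection of trajectories; the probabilities above would need to be replaced with values appropriate to the study being simulated.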

Correspondence Analysis visualization using ggplot

[This article was first published on Rcrastinate, and kindly contributed to R-bloggers]. What we want to do: Recently, I used a correspondence analysis from the …

Fitting ‘complex’ mixed models with ‘nlme’. Example #1

Fitting mixed models has become very common in biology and recent developments involve the manipulation of the variance-covariance matrix for random effects and residuals. To the best of my knowledge, within the framework of frequentist methods, the only freeware solution in R should be based on the ‘nlme’ package, as ‘lmer’ (from the ‘lme4’ package) does not …

Referring to POTUS on Twitter: a stance-based perspective on variation in the 116th House

[This article was first published on Jason Timm, and kindly contributed to R-bloggers]. In this post, we investigate how (& how often) members of the …

Intelligent Loan Selection for Peer-to-Peer Lending

Automatic Investing on Lending Club Using a Neural Network while Controlling Risk in Loan Selection. In this article I describe how to train a neural network to evaluate loans that are offered on the crowd lending platform Lending Club. I also cover how to test the model, how to adjust the risk in loan selection, …

The Ultimate Guide to using the Python regex module

The first thing we need to learn while using regex is how to create patterns. I will go through some of the most commonly used patterns one by one. As you would think, the simplest pattern is a simple string.
    import re
    pattern = r'times'
    string = "It was the best of times, it was the worst of times."
    # count the non-overlapping matches of the literal pattern
    print(len(re.findall(pattern, string)))
But …

spaCy Basics

A guide for getting started with NLP and spaCy. A major challenge of text data is extracting meaningful patterns and using those patterns to find actionable insights. NLP can be thought of as a two-part problem: Processing. Converting the text data from its original form into a form the computer can understand. This includes data …

Simulate Images for ML in PyBullet — The Quick & Easy Way

When applying deep Reinforcement Learning (RL) to robotics, we are faced with a conundrum: how do we train a robot to do a task when deep learning requires hundreds of thousands, even millions, of examples? To achieve 96% grasp success on never-before-seen objects, researchers at Google and Berkeley trained a robotic agent through 580,000 real-world …

Run Amazon SageMaker Notebook locally with Docker container

The main aim of the local Docker container is to preserve as many of the most important features of the AWS-hosted instance as possible while enhancing the experience with the local-run capability. The following features have been replicated: Jupyter Notebook and Jupyter Lab: this is simply taken from Jupyter’s official Docker images with a …

Skip the heavy lifting: Moving Redshift to BigQuery easily

Enterprise data warehouses are getting more expensive to maintain. Traditional data warehouses are hard to scale and often involve lots of data silos. Business teams need data insights quickly, but technology teams have to grapple with managing and providing that data using old tools that aren’t keeping up with demand. Increasingly, enterprises are migrating their …

Meta-Transfer Learning for Few-shot Learning

Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend …

Can you have your groceries delivered in under 15 minutes?

A quick Simulation and Optimization study for rapid delivery. Instacart, Amazon Prime Now, Farmstead and many more startups in the delivery space are tackling very interesting supply, demand, simulation and logistics optimization problems. Back in 2018 my cofounder Ricky Wong and I wanted to validate a radical grocery delivery idea: get groceries delivered under …

What does a modern analytics platform need to offer companies real added value?

Currently, new, innovative platforms are sprouting up on the market again and again – implemented with technical competence and ideally suited to the respective analytical approaches. But the question arises: Is that enough? Is it enough to develop software that allows reliable …

Build your own custom hotword detector with zero training data and $0!

TLDR: Google TTS -> Noise augment -> {wav files} -> SnowBoy -> {.pmdl models} -> Raspberry Pi. OK, so it’s that time of the year again. You know there’s *that* thing in the desert. Last time around, I rigged up a Google AIY vision kit and added espeak on Chip and Terra, the art installations of …

An easy introduction to unsupervised learning with 4 basic techniques

Deep Learning has gotten a lot of love from both the AI community and the general public. But most recently, researchers have started to question and doubt that deep learning is really the future of AI. The prominent deep learning techniques used today all rely on supervised learning, yet we see quite clearly that humans …

Regular Sequences

[This article was first published on R-exercises, and kindly contributed to R-bloggers]. So far in this series, we used vectors from built-in datasets (rivers, women …

Announcing the general availability of Python support in Azure Functions

Python support for Azure Functions is now generally available and ready to host your production workloads across data science and machine learning, automated resource management, and more. You can now develop Python 3.6 apps to run on the cross-platform, open-source Functions 2.0 runtime. These can be published as code or Docker containers to a Linux-based …

Why Machine Learning is more Practical than Econometrics in the Real World

[This article was first published on R – Remix Institute, and kindly contributed to R-bloggers]. Motivation: I’ve read several studies and articles that claim Econometric …

Anomalies in Global Suicide Data

Mental Health Search Interest on Google Trends. Every Mental Health Awareness Day (October 10), there is a peak in search interest for “mental health” on Google Trends. However, this past October, there was the highest search interest ever seen. Mental health in the United States is growing as a part of the global conversation – …

Ridge Regression Python Example

Overfitting, the process by which a model performs well for training samples but fails to generalize, is one of the main challenges in machine learning. In this article, we’ll cover how we can use regularization to help prevent overfitting. To be specific, we’ll talk about Ridge Regression, a distant cousin of Linear Regression, and …

Defining A Data Science Problem

The most important non-technical skill for a Data Scientist. According to Cameron Warren, in his Towards Data Science article Don’t Do Data Science, Solve Business Problems, “…the number one most important skill for a Data Scientist above any technical expertise — [is] the ability to clearly evaluate and define a problem.” As a data scientist …

Best Investment Portfolio Via Monte-Carlo Simulation In Python

There exists a risk-free rate, which is the rate that an investor earns on his/her investment without taking any risk, such as by buying government treasury bills. There is a tradeoff between risk and return. If an investor chooses an investment option riskier than the risk-free one, then he/she is expecting …