Towards Ethical Machine Learning

https://initiatives.provost.uci.edu/event/philosophy-machine-learning-knowledge-causality/ I quit my job to enter an intensive data science bootcamp. I understand the value behind the vast amount of data available that enables us to create predictive machine learning algorithms. In addition to recognizing its value on a professional level, I benefit from these technologies as a consumer. Whenever I find myself in …

How to give money to the R project

By Mark Niemann-Ross, an author, educator, and writer who teaches about R and Raspberry Pi at LinkedIn Learning. I spend a LOT of time at r-project.org, in particular the sections for documentation and CRAN. But I hadn’t spent much time in the other areas: R Project, R Foundation, and links. When I recently wandered into the foundation area, …

Parsing XML, Named Entity Recognition in One-Shot

Photo credit: Lynda.com. Conditional Random Fields, Sequence Prediction, Sequence Labelling. Parsing XML is the process of reading XML and making its contents usable by programs. An XML parser is the piece of software that reads XML files and makes the information from those files available to applications. While reading an …

An introduction to web scraping with Python

Introduction: As a data scientist, I often find myself looking for external data sources that could be relevant for my machine learning projects. The problem is that it is uncommon to find open source data sets that perfectly correspond to what you are looking for, or free APIs that give you access to data. In …

Top Examples of Why Data Science is Not Just .fit().predict()

In this post, I’m going to review some of the top concepts I learned that turned me from a technical data scientist into a good data scientist. Two months ago, I finished my second year as a data scientist at YellowRoad, so I decided to do a retrospective analysis on my projects, what did I …

Pew Study Answers on Artificial Intelligence and the Future of Humans

The AI future is uncertain, but generally, I think it will improve life. I was one of the 900+ futurists interviewed for The Pew Research study released yesterday, “Artificial Intelligence and the Future of Humans.” Conducted with Elon University, the study revolved around AI and the 50th anniversary of the Internet. The report asked three questions …

Classification (Part 2) — Linear Discriminant Analysis

An explanation of Bayes’ theorem and linear discriminant analysis. Photo by Jerry Kiesewetter on Unsplash. Overview: Previously, logistic regression was introduced for classification. Unfortunately, like any model, it has some flaws: when classes are well separated, parameter estimates from logistic regression tend to be unstable; when the data set is small, logistic regression is also unstable …

AWS Architecture For Your Machine Learning Solutions

The Undertaking: Recently, I was involved in developing a machine learning solution for one of the largest North American steel manufacturers. The company wanted to leverage the power of ML to get insights on customer segmentation, order prediction and product-volume recommendations. This article revolves around why and how we leveraged AWS for deploying our deliverables …

How to tune a BigQuery ML classification model to achieve a desired precision or recall

Select the probability threshold based on the ROC curve. BigQuery provides an incredibly convenient way to train machine learning models on large, structured datasets. In an earlier article, I showed you how to train a classification model to predict flight delays. Here’s the SQL query that will predict whether a flight is going to be late …

How to deploy a predictive service to Kubernetes with R and the AzureContainers package

It’s easy to create a function in R, but what if you want to call that function from a different application, with the scale to support a large number of simultaneous requests? This article shows how you can deploy an R fitted model as a Plumber web service in Kubernetes, using Azure Container Registry (ACR) and …

Implementing Defensive Design in AI Deployments

A series of insights and battle scars from the world of medical device design. With the upcoming launch of one of our AI products, there has been a recurring question that clients keep asking. This same question also shows up once in a while in our consulting engagements, to a lesser degree, but still demands an …

Object detection and tracking in PyTorch

Detecting multiple objects in images and tracking them in videos. In my previous story, I went over how to train an image classifier in PyTorch, with your own images, and then use it for image recognition. Now I’ll show you how to use a pre-trained classifier to detect multiple objects in an image, and later track …

10 Lessons Learned From Participating in Google AI Challenge

Key Points of My Work. Disclaimers: I will present only a portion of the code I wrote for this competition; my teammates are absolutely not responsible for my awful and buggy code. A portion of this code is inspired by great Kagglers sharing their insights and code in Kaggle kernels and forums. I hope I did …

AI: the silver bullet to stop Technical Debt from sucking you dry

You’ve heard a lot about student debt, but what about technical debt? It’s Friday evening in the Bahamas. You’re relaxing under a striped red umbrella with a succulent glass of wine and your favorite book; it’s a great read and you love the way the ocean breeze moves the pages like leaves on a tree. As …

Pitching Artificial Intelligence to Business People

From silver bullet syndrome to silver linings. In this article I plan to share with you our recent experience pitching AI to business folk, and what lessons we learned along the way. As a small firm of AI experts, we follow an awareness marketing approach. Rather than relying solely on one marketing channel, we attend conferences …

A Thought on Using Machine Learning Models

During my training classes, during or after the discussion of common machine learning models, I usually bring up one topic: the use of insights from these models, or the implementation of the model in a business or organizational process. For instance, we can get the most accurate model, one that is very good at ‘predicting’ which …

Improving Patient Flows With Data Science And Analytics

Reducing Costs By Improving Processes. Our team was recently asked how data analytics and data science can be used to reduce bottlenecks and improve patient flows in hospitals. Healthcare providers and hospitals can have very complex patient flows. Many steps can intertwine, resources have to shift between tasks all the time, and severity of patients …

How a High School Junior Made a Self-Driving Car

Questions about this repository, from a project I created almost three years ago, are among the ones I receive most often. The repository itself is really nothing too special, just an implementation of an Nvidia paper that was released about a year prior. A graduate student later managed to implement my code in an …

Simpson’s Paradox and Interpreting Data

The challenge of finding the right view through data. Edward Hugh Simpson, a statistician and former cryptanalyst at Bletchley Park, described the statistical phenomenon that bears his name in a technical paper in 1951. Simpson’s paradox highlights one of my favourite things about data: the need for good intuition regarding the real world and how most …

Word Representation in Natural Language Processing Part II

In the previous part (Part I) of the word representation series, I talked about fixed word representations that make no assumption about semantics (meaning) and similarity of words. In this part, I will describe a family of distributed word representations. The main idea is to represent words as feature vectors. Each entry in the vector stands …

AlphaZero implementation and tutorial

A walk-through of implementing AlphaZero using custom TensorFlow operations and a custom Python C module. I describe here my implementation of the AlphaZero algorithm, available on GitHub, written in Python with custom TensorFlow GPU operations and a few accessory functions in C for the tree search. The AlphaZero algorithm has gone through three main iterations, first …

TensorFlow Filesystem — Access Tensors Differently

TensorFlow is great. Really, I mean it. The problem is it’s great up to a point. Sometimes you want to do very simple things, but TensorFlow gives you a hard time. The motivation I had behind writing TFFS (TensorFlow File System) can be shared by anyone who has used TensorFlow, including you. All I …

To all Data Scientists — The one Graph Algorithm you need to know

Dec 8, 2018. Photo by Alina Grubnyak on Unsplash. Graphs provide us with a very useful data structure. They can help us to find structure within our data. With the advent of machine learning and big data, we need to get as much information as possible about our data. Learning a little bit of graph theory …

Beating the Fantasy Premier League game with Python and Data Science

Our Moneyball approach to the EPL Fantasy League. My friend and I have been playing the Official Fantasy English Premier League game for many years, and despite our firm belief that we know everything about English soccer, we tend to get “unlucky” year after year and somehow never seem to pick the winning team. So, we …

Maximum Likelihood Estimation: How it Works and Implementing in Python

Previously, I wrote an article about estimating distributions using nonparametric estimators, where I discussed the various methods of estimating statistical properties of data generated from an unknown distribution. This article covers a very powerful method of estimating parameters of a probability distribution given the data, called the Maximum Likelihood Estimator. This article is part of …

A Data Analysis of Riding The Bus

What should I expect before a round of the popular drinking game? Recommended equipment for Ride The Bus. College. It’s a time for things like exploring your personality, finding your values, and making lifelong friends. Those are all well and good, but college is also a time for drinking games! There’s plenty of time in the …

Building a molecular charge classifier

The intersection of Chemistry and A.I. A.I. has seen unprecedented growth in the past couple of years. Although machine learning architectures like Neural Networks (NN) have been known for a long time thanks to breakthroughs from top researchers like Geoffrey Hinton, only recently have NNs become powerful tools in an A.I. specialist’s toolbox. This is credited mainly …

A gentle journey from linear regression to neural networks

Deep Learning: What are we talking about? A quick search on Google gives us the following definition of “deep learning”: “the ensemble of deep learning methods is a part of a broader family of machine learning methods that aims at modelling data with a high level of abstraction”. Here, we should understand that deep learning consists …

A short guide to using Docker for your data science environment

WHY: One of the most time-consuming parts of starting work on a new system, starting a new job, or just plain sharing your work is the variation in available tools (or lack thereof) due to differences in hardware, software, security policies, and whatnot. Containerization has risen in recent years as a ready-to-use …

Data network effects for an artificial intelligence startup

The artificial intelligence (AI) ecosystem is maturing, and it is becoming increasingly difficult to impress customers, investors, and potential acquirers by just attaching an .ai domain to whatever you are doing. Therefore, the significance of building a defensible business model in the long run becomes obvious. In this post, I explore how an AI startup may unlock various …

R some blog 2018-12-08 04:19:00

Motivation: The dplyr functions select and mutate are nowadays commonly applied to perform data.frame column operations, frequently combined with magrittr’s forward %>% pipe. While they work well interactively, these methods often require additional checking if used in “serious” code, for example, to catch column name clashes. In principle, the container package provides a dict-class …

How To Ask The Right Questions As A Data Scientist

How to define a problem statement by asking the right questions? Admit it or not, defining a problem statement (or data science problem) is one of the most important steps in the data science pipeline. “A problem well defined is a problem half-solved” (Charles Kettering). In the following part, we’ll go through the four …

Feel discouraged on the sparse data in your hand? Give Factorization Machine a shot (2)

With a solid foundation in Matrix Factorization, your exploration of the advanced models derived from it, such as LDA, LSI, PLSA and Tensor Factorization, will be much smoother. The models derived from the concept of Matrix Factorization. In the last session, we talked about the basic …

Python Virtual Environment

Conda: How to set up virtual environments using conda for the Anaconda Python distribution. A virtual environment is a named, isolated, working copy of Python that maintains its own files, directories, and paths so that you can work with specific versions of libraries or Python itself without affecting other Python projects. Virtual environments …

“Increase sample size until statistical significance is reached” is not a valid adaptive trial design; but it’s fixable.

TLDR: Begin with N of 10, increase by 10 until p < 0.05 or max N reached. This design has inflated type-I error. Lower p-value threshold needed to ensure specified type-I error rate. The number of interim analyses and max N affect the type-I error rate. Threshold can be identified using simulation. A recent Facebook …
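
The TLDR above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis from the post: it uses a one-sample z-test with known standard deviation 1 (the excerpt does not say which test was used), and the function names, the maximum N of 100, and the 2000 simulated trials are all illustrative choices.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample):
    """Two-sided p-value of a one-sample z-test of mean 0, assuming known sd 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def trial_rejects(rng, threshold, start_n=10, step=10, max_n=100):
    """Simulate one trial under the null, testing after every batch of observations."""
    sample = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        if two_sided_p(sample) < threshold:
            return True   # stopped early with a false positive
        if len(sample) >= max_n:
            return False  # reached max N without rejecting
        sample.extend(rng.gauss(0, 1) for _ in range(step))


def type_one_error(threshold, n_sims=2000, seed=0, **design):
    """Estimate the fraction of null trials that end in a rejection."""
    rng = random.Random(seed)
    rejections = sum(trial_rejects(rng, threshold, **design) for _ in range(n_sims))
    return rejections / n_sims


# Sequential design from the TLDR: N = 10, 20, ..., 100, stop once p < 0.05.
sequential_rate = type_one_error(0.05)
# Fixed design for comparison: a single test at N = 100.
fixed_rate = type_one_error(0.05, start_n=100, max_n=100)

print(f"sequential design: {sequential_rate:.3f}")
print(f"fixed design:      {fixed_rate:.3f}")
```

With repeated looks the estimated rate should come out well above the nominal 0.05, while the single-look design stays near it. Calling `type_one_error` with smaller thresholds is one way to search for a threshold that brings the sequential design back to the intended rate.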

Shortcoming of Under-sampling Algorithms: CCMUT and E-CCMUT

What, Why, Possible Solution and Ultimate Utility. In one of my previous articles, “Under-sampling : A Performance Booster on Imbalanced Data”, I applied the Cluster Centroid based Majority Under-sampling Technique (CCMUT) to Adult Census Data and demonstrated a model performance improvement over the state-of-the-art model, “A Statistical Approach to Adult Census Income Level Prediction”[1]. But there are …

“Artist” in Matplotlib — something I wanted to know before spending tremendous hours on googling…

Originally published at dev.to and modified a bit to fit Medium’s editing system. It’s true that matplotlib is a fantastic visualization tool in Python. But it’s also true that tweaking details in matplotlib is a real pain. You may easily lose hours finding out how to change a small part of your plot. Sometimes …

Avoiding Parking Tickets in San Francisco Using Data Analytics

Although still not a perfect predictor, this model was more accurate than the first. The streets identified as best also showed much less variability than the worst. We could also reduce the number of tickets by over 50% by choosing the best population rather than the worst. Interestingly, parking density was …

Comparative study on Classic Machine learning Algorithms

2. Logistic Regression. Just as linear regression is for regression, logistic regression is the right algorithm to start with for classification. Even though the name contains ‘Regression’, it is not a regression model but a classification model. It uses a logistic function to model a binary output. The output of logistic regression is a probability (0≤x≤1), …
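
The idea in this excerpt, a logistic function turning a linear score into a probability that is then thresholded into a class, can be sketched as follows. This is a minimal illustration rather than code from the post: the weights, bias, and the 0.5 threshold are made-up values, and a real model would learn the weights from data.

```python
import math


def logistic(z):
    """Map any real score to (0, 1), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_proba(features, weights, bias):
    """Probability of the positive class: logistic of the linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(score)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a binary class using a decision threshold."""
    return int(predict_proba(features, weights, bias) >= threshold)


# Hypothetical, hand-picked parameters for illustration only.
weights, bias = [1.0, 1.0], -1.0

for point in ([2.0, 1.0], [-2.0, 0.0]):
    p = predict_proba(point, weights, bias)
    print(point, round(p, 3), predict_label(point, weights, bias))
```

Raising or lowering `threshold` trades false positives against false negatives without changing the underlying probabilities.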

F# Advent Calendar — A Christmas Classifier

The ML.NET Model. The model is defined in Program.fs. The dataLoader specifies the schema of the input data. Input Data Schema. The dataLoader is then used to load the training and test data views. Load Training and Test Data. The dataPipeline specifies the transforms that should be applied to the input tsv. Since this is a …

Gender Diversity in the R and Python Communities

Many (if not most) tech communities have far more representation from men than from women (and even fewer from nonbinary folk). This is a shame, because everybody uses software, and these projects would self-evidently benefit from the talent and expertise from across the entire community. Some projects are doing better than others, though, and data …

How to determine the best model?

Machine learning models play a critical role in many aspects of today’s business. The use of a predictive model can improve the business bottom line, and a slightly improved model can result in an increase of millions of dollars. Although you may not know all the popular algorithms (and more powerful algorithms in the future), …