Google Knows What You Are Saying With Only 80 MBs

So what does this have to do with data science? And how are cats involved? Statistical models and machine learning! HMMs, RNN-Ts, CTC, DNNs, LSTMs, CNNs, a brief history and fun with letters! Traditionally, voice dictation has used Hidden Markov Models as a basis to predict output. A hidden Markov model is a statistical model … Read more

Comparing Text Summarization Techniques

Text Summarization is an increasingly popular topic within NLP and, with the recent advancements in modern deep learning, we are consistently seeing newer, more novel approaches. The goal of this article is to compare the results of a few approaches that I experimented with: Sentence Scoring based on Word Frequency; TextRank using Universal Sentence Encoder; … Read more
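To make the first approach in that list concrete, here is a minimal sketch of sentence scoring based on word frequency. It is not the article's code: the function names `score_sentences` and `summarize` are hypothetical, the tokenizer is a crude regular expression, and stop-word removal is left out for brevity.

```python
import re
from collections import Counter


def score_sentences(text):
    """Return (sentence, score) pairs; the score is the mean corpus frequency of the sentence's words."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption, not a real tokenizer).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Dividing by the token count keeps long sentences from winning automatically.
        score = sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0
        scored.append((sentence, score))
    return scored


def summarize(text, n=1):
    """Keep the n highest-scoring sentences, in their original order."""
    scored = score_sentences(text)
    top = {s for s, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]}
    return [s for s, _ in scored if s in top]


print(summarize("Cats sleep. Cats eat fish. Dogs bark.", n=1))
```

A real pipeline would usually drop stop words before counting, since otherwise words like "the" dominate the frequency table.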

Play with the cyphr package

The cyphr package seems to be a good choice for a small research group that shares sensitive data over the internet (e.g., Dropbox). I ran a simple experiment myself to make sure it can actually serve my purpose. I ran the experiment on two computers (using openssl): I created the test data on my Linux workstation running … Read more

Categories: R

Data science productionization: scale

Let’s look at the word-normalizing code (from my previous two posts) one more time. The code we wrote previously works fine for a single word. It would even work fine for a few thousand words. But if we need to normalize millions or billions of words, it will take more time than we probably want … Read more

Finding the right model parameters

If you’ve been reading about Data Science and/or Machine Learning, you must have come across articles and projects that work with the MNIST dataset. The dataset includes a set of 70,000 images where each image is a handwritten digit from 0 to 9. I also decided to use the same dataset to understand how fine tuning … Read more

Why Norms Matters — Machine Learning

Play with norms: Evaluation is a crucial step in all modeling and machine learning problems. Since we are often making predictions on entire datasets, providing a single number that summarizes the performance of our model is both simple and effective. There are a number of situations where we need to compress information about a … Read more
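As a small illustration of how a norm compresses a whole vector of prediction errors into one number, the sketch below relates the L1 norm to mean absolute error and the L2 norm to root mean squared error. The data values and helper names are made up for the example; they do not come from the article.

```python
import math


def l1_norm(v):
    """Sum of absolute values."""
    return sum(abs(x) for x in v)


def l2_norm(v):
    """Square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))


# Hypothetical targets and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]
errors = [p - t for p, t in zip(y_pred, y_true)]

n = len(errors)
mae = l1_norm(errors) / n              # mean absolute error = L1 norm / n
rmse = l2_norm(errors) / math.sqrt(n)  # root mean squared error = L2 norm / sqrt(n)

print(mae, rmse)
```

Because the L2 norm squares each error, a single large miss moves RMSE more than it moves MAE, which is one reason the choice of norm matters.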

The Single Course That Fast-Tracked My Data Science Learning Journey

That boosted my understanding, skills and confidence tremendously when I first started out in data science. Before digging into the awesome resources I know and trust, an important disclosure: Some of the links below are affiliate links, which means that if you choose to make a purchase, I will earn a commission. This commission comes at … Read more

Summer Interns 2019

We received almost 400 applications for our 2019 internship program from students with very diverse backgrounds. After interviewing several dozen people and making some very difficult decisions, we are pleased to announce that these twelve interns have accepted positions with us for this summer: Therese Anders: Calibrated Peer Review. Prototype tools to conduct experiments to … Read more

Categories: R

The Actual Difference Between Statistics and Machine Learning

Statistical Models vs Machine Learning: Linear Regression Example. It seems to me that the similarity of methods that are used in statistical modeling and in machine learning has caused people to assume that they are the same thing. This is understandable, but simply not true. The most obvious example is the case of linear regression, which … Read more

De-Googling Bach: Counterpointing Bach’s Rules of the Road With American Populist Music

Counterpointing Bach’s Rules of the Road With American Populist Music. “Thinking Outside the Bachs” by Max Harper Ellert. I had fun this week playing with the Google Doodle to create Bach harmonies from simple melodies. Looking through some of the articles about the process of melding A.I. and the principles of counterpoint was interesting too, like this … Read more

Corners in Images and Angular Representation of Their Relationships

Corner detection has been an important subject in image processing because it helps us find the unique features in images. There are several methods for detecting corners in images. The most famous one, I assume, is Harris Corner Detection. After I read about it in the OpenCV documentation, it gave … Read more

nice student project

In all of my undergraduate classes, I require a term project, done in groups of 3-4 students. Though the topic is specified, it is largely open-ended, a level of “freedom” that many students are unaccustomed to. However, some adapt quite well. The topic this quarter was to choose a CRAN package that does not use … Read more

Categories: R

Understanding Negative Log Loss

While learning, I decided to test out the “3 lines of code” on some dataset other than the ones used in the course. The wiki page of has some recommendations and I decided to try out as many as possible. The first recommended dataset under the easy category was Dogs vs. Cats Redux: … Read more

On Retractions in Biomedical Literature

The fierce competition in academia and the rush to publish often lead to flawed results and conclusions in scientific publications. While some of these are honest mistakes, others are deliberate scientific misconduct. According to one study, 76% of retractions were due to scientific misconduct in papers retracted from a specific journal¹. Another study from … Read more

Will Scientific Research be able to avoid Artificial Intelligence pitfalls?

It’s now obvious that AI, Machine Learning and Deep Learning are no longer buzzwords, as they are increasingly present in every industry. Although the trend was overhyped in 2017, we are now certain that these technologies will be ubiquitous by 2020. Scientific research has not been left behind and AI has been … Read more

Don’t let them GO!

Using machine learning to detect customer churn. We have an example of a virtual company called ‘Sparkify’ that offers paid and free listening services. Customers can switch between the two services, and they can cancel their subscription at any time. The given customer dataset is huge (12GB), thus the standard tools for analysis and machine learning … Read more

ShinyProxy 2.2.0

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise or larger organizations. Secured Embedding of Shiny Apps: Since version 2.0.1 ShinyProxy provides a REST API to manage (launch, shut down) Shiny apps and consume the content programmatically inside broader web applications or portals. This makes it possible to cleanly separate the responsibility for … Read more

Categories: R

Exploring FIFA

SalRite, Mar 24. ‘The thing about football - the important thing about football - is that it is not just about football.’ ~Sir Terry Pratchett. Soccer, or Association Football, is not just a game; it’s an emotion for many. People follow their favorite clubs no less devotedly than their religion! Great players are celebrated all over the world. But not … Read more

The Deployment Pain

Possible Causes of Deployment Anxiety. This article was co-authored with Patrick Slavenburg. [Figure: The data science cycle in magnets: data access, data processing, model training, and deployment] In October 2017, I was running the KNIME booth at the ODSC London conference. At the booth, we had the usual conference material to distribute: informative papers, various gadgets, … Read more

Natural Language Processing with Spacy in Node.js

Show Me some Examples. Extract Dates: Say you want to extract all of the dates from this text: The United States increased diplomatic, military, and economic pressures on the Soviet Union, at a time when the communist state was already suffering from economic stagnation. On 12 June 1982, a million protesters gathered in Central Park, New … Read more

Using R and H2O to identify product anomalies during the manufacturing process.

Introduction: We will identify anomalous products on the production line by using measurements from testing stations and deep learning models. Anomalous products are not failures; they are products close to the measurement limits. Detecting them lets us display warnings before the process starts to make failed products, so that the stations can receive maintenance in time. … Read more

Categories: R

Something You don’t know about data File if you just a Starter in Data Science, Import data File…

To be a master in data science, you have to understand how to manage your data and import it from the web, because approximately 90% of real-world data comes straight from the internet. [Figure: Data Engineer Life (Source: Agula)] If you are new to the Data Science field, then you must be working hard to learn … Read more

DeViSE Zero-shot learning

Let’s take a closer look at the class probabilities an image classifier returns: With a softmax output layer, each picture can belong to only one single category as softmax is designed to assign a high probability to one single class. This means that you should not introduce an additional category “dog” because the network would … Read more
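The behaviour described here, where softmax pushes most of the probability mass onto a single class, can be seen in a few lines. This is a minimal sketch with made-up logits and hypothetical class names, not the classifier from the article.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes, e.g. "poodle", "cat", "car".
logits = [4.0, 1.0, 0.5]
probs = softmax(logits)

for label, p in zip(["poodle", "cat", "car"], probs):
    print(f"{label}: {p:.3f}")
```

Since the outputs must sum to 1, raising the probability of one class necessarily lowers the others, which is why overlapping categories compete for the same mass.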

The complete beginner’s guide to machine learning: simple linear regression in four lines of code!

Even you can build a machine learning model. Seriously! Good data alone doesn’t always tell the whole story. Are you trying to figure out what someone’s salary should be based on their years of experience? Do you need to examine how much you’re spending on advertising in relation to your yearly sales? Linear regression might … Read more
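For the salary-versus-experience question above, simple linear regression has a closed-form least-squares solution. The sketch below is not the article's four lines (the excerpt is cut off before them); it fits a line in plain Python, and both the data and the `fit_line` name are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience vs. salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [40, 45, 50, 55, 60]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # estimate for six years of experience
print(slope, intercept, predicted)
```

Libraries such as scikit-learn wrap the same idea behind a fit/predict interface, which is how a model can fit in a handful of lines.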

How cdata Control Table Data Transforms Work

With all of the excitement surrounding cdata-style control-table-based data transforms (the cdata ideas being named as the “replacements” for tidyr’s current methodology, by the tidyr authors themselves!) I thought I would take a moment to describe how they work. cdata defines two primary data manipulation operators: rowrecs_to_blocks() and blocks_to_rowrecs(). These are the … Read more

Categories: R

Which Data Science Bootcamp is right for you?

If you’re thinking about attending a data science bootcamp but have zero data science experience yourself, you’ll probably not be able to sort the good from the bad. You won’t know which ones focus on the right things, the unnecessary things, the weird edge-case things. And most importantly, you … Read more

Learning Theory: (Agnostic) Probably Approximately Correct Learning

In my previous article, I discussed what Empirical Risk Minimization is and the proof that it yields a satisfactory hypothesis under certain assumptions. Now I want to discuss Probably Approximately Correct Learning (which is quite a mouthful but kinda cool), which is a generalization of ERM. For those who are not familiar with ERM, I … Read more

Everybody has a right to know what’s happening with the planet: towards a global commons

The importance of knowing our environmental history. How can we judge today if we don’t know what happened yesterday? For anyone to be able to understand ecosystem services and the value they represent to the environment, they must first have insight into past environmental conditions. Some selected point in the past (often referred to in … Read more

When Excel isn’t enough: Using Python to clean your Data, automate Excel and much more…

How a Data Analyst can survive in a spreadsheet-driven organization. Excel is a very popular tool in many companies, and Data Analysts and Data Scientists alike often find themselves making it part of their daily arsenal of tools for data analysis and visualization, but not always by choice. This was certainly my experience at … Read more

Should you Fly or Should you Drive?

The thought of a plane crash gives me the creeps because I need to fly home regularly to visit my family. Recently there was a tragic Ethiopian Airlines accident in which all passengers died. If you are interested in details about the plane crash, you can get them here: If such a crash happens, … Read more

Facial Keypoint Detection: Detect relevant features of face in a go using CNN & your own dataset…

Facial key-points are relevant for a variety of tasks, such as face filters, emotion recognition, pose recognition, and so on. So if you’re into these projects, keep reading! In this project, facial key-points (also called facial landmarks) are the small magenta dots shown on each of the faces in the image below. In each training … Read more

Strength of a Lennon song exposed with R function glue::glue

love_verse <- function(w1, w2, w3) {
  glue::glue(
    "Love is {b}, {b} is love
Love is {y}, {y} love
Love is {u} to be loved",
    b = w1, y = w2, u = w3)
}
In return, the parameters sometimes give echoes of poetry.
love_verse('real', 'feeling', 'wanting')
Love is real, real is love
Love is feeling, feeling … Read more

Categories: R

Exploratory Data Analysis: An Illustration in Python

Import the Toolkit. We begin by importing some Python packages. These will serve as your toolkit for an effective EDA:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
In this example, we will use the Boston housing dataset (practice with it afterward and convince yourself). Let’s … Read more

Can you turn 1,500 R$ into 1,000,430 R$ by investing in the stock market?

In the last few weeks we’ve seen a great deal of controversy in Brazil regarding financial investments. To keep it short, Empiricus, an ad-based company that massively sells online courses and subscriptions, posted a YouTube ad where a young girl, Bettina, says the following: Hi, I’m Bettina, I am 22 years old and, starting with … Read more

Categories: R

RcppArmadillo 0.9.300.2.0

A new RcppArmadillo release based on a new Armadillo upstream release arrived on CRAN and Debian today. Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R … Read more

Categories: R

Data Fun – Inspired by Datasaurus

After my recent post on Anscombe’s Quartet, in which I demonstrated how to efficiently adjust any data set to match mean, variance, correlation (x,y), as well as regression coefficients, Philip Waggoner tuned me on to Justin Matejka and George Fitzmaurice’s Datasaurus R package/paper, in which the authors demonstrate an alternative method of modifying existing data to … Read more

Categories: R

Statistician proves that statistics are boring

Back-to-basics with nuanced vocabulary. I’m about to prove to you that statistics are boring… to help you appreciate the point of all those fancy calculations that statisticians like myself get up to. As an added bonus, this is pretty much what you’d learn about on day 1 of most STAT101 classes, so it doubles as … Read more

Why we Did Not Name the cdata Transforms wide/tall/long/short

We recently saw this UX (user experience) question from the tidyr author as he adapts tidyr to cdata techniques. The terminology that he is not adopting from cdata is “unpivot_to_blocks()” and “pivot_to_rowrecs()”. One of the research ideas in the cdata package is that the important thing to call out is record structure. The key point … Read more

Categories: R

Decode Lyrics in Pop Music: Visualise Prose with the Songsim algorithm

The post Decode Lyrics in Pop Music: Visualise Prose with the Songsim algorithm appeared first on The Lucid Manager. Music is an inherently mathematical form of art. Ancient Greek mathematician Pythagoras was the first to describe the logic of the scales that form melody and harmony. Numbers can also represent the rhythm of the music. … Read more

Categories: R

A Quick and Tidy Look at the 2018 GSS

The data from the 2018 wave of the General Social Survey was released during the week, leading to a flurry of graphs showing various trends. The GSS is one of the most important sources of information on various aspects of U.S. society. One of the best things about it is that the data is freely … Read more

Categories: R

Predicting the ‘Future’ with Facebook’s Prophet

Making the Predictions. Making the dataset ‘Prophet’ compliant: let’s convert the data into the format desired by Prophet. We shall rename ‘Date’ to ‘ds’ and ‘Views’ to ‘y’:
df.columns = ['ds', 'y']
df.head()
Prophet follows the sklearn model API wherein an instance of the Prophet class is created and then the fit and predict methods are called. The model … Read more

A Design Thinking Mindset for Data Science

Adapted from a research paper written for The University of Texas capstone. Abstract: Data science has received recent attention in technical research and business strategy; however, there is an opportunity for increased research and improvements on the data science research process itself. Through the research methods described in this paper, we believe there … Read more

Managing Data Science Workflows the Uber Way

Orchestrating workflows is one of the main challenges of machine learning solutions in the real world. A machine learning solution involves more than just picking the right model and productizing it. Data ingestion, training, deployment or optimization are common steps in any machine learning workflow. Unfortunately, the technology stacks for building and managing coordinated actions … Read more

Data science productionization: maintenance

In the last post, I used a simple word-normalizing function to illustrate a few principles of code portability: Now let’s look at the same function, but this time prioritizing maintenance: The first part doesn’t even include the function itself. What I’ve set up here is logging infrastructure. I’ve designated a file for recording errors (called … Read more

AFL teams Elo ratings and footy-tipping by @ellis2013nz

So now that I live in Melbourne, to blend in with the locals I need to at least vaguely follow the AFL (Australian Football League). For instance, my work, like many others, has an AFL footy-tipping competition. I was initially going to choose my tips based on wisdom of the crowds (i.e., choose the favourite) … Read more

Categories: R