Doing meaningful work with Machine Learning — Classify Disaster Messages

Build models to help disaster organizations save people’s lives. I’m writing this post at 1am in Bucharest, Romania. Hello there again! Welcome to my fourth piece of content about Machine Learning. I’ve recently done a project that I believe to be socially meaningful. I’ll give a brief overview of what this is all about and I’ll dive … Read more

Live Object Detection

Object Detection. As said above, the example notebook can be reused for our new application. This is because the main part of the notebook is importing the needed libraries, downloading the model and specifying useful helper code. The only section we need to modify is the detection section, which comprises the last three cells … Read more

Reinforcement Learning: From Grid World to Self-Driving Cars

0. Agents, Environments, and Rewards. Underlying many of the major announcements from researchers in Artificial Intelligence in the last few years is a discipline known as reinforcement learning (RL). Recent breakthroughs are mostly driven by minor twists on classic RL ideas, enabled by the availability of powerful computing hardware and software that leverages said hardware. … Read more

Machine Learning Versus The News

PART TWO: A SOLUTION. So, what to do? Mathematically, we may be tempted to think that to know the truth in its unvarnished and untarnished essence, we must read every article that covers the events of the story. Somehow we would then average away all the noise and be left with a well-informed and unbiased view … Read more

Supervised Learning: Basics of Classification and Main Algorithms

Introduction. As stated in the first article of this series, classification is a subcategory of supervised learning where the goal is to predict the categorical class labels (discrete, unordered values, group membership) of new instances based on past observations. There are two main types of classification problems: Binary classification: The typical example is e-mail spam … Read more

What’s your soccer team’s nemesis?

Is Barcelona really Real Madrid’s toughest opponent? Historical data paint an interesting story. Real Madrid vs Barcelona. Manchester United vs Liverpool. Inter vs Milan. Olympique Lyonnais vs Olympique de Marseille. Chelsea vs everybody. European soccer is filled with some amazing rivalries. These rivalries were created and evolved over time for reasons on … Read more

Keras challenges the Avengers

Sentiment Analysis, also called Opinion Mining, is a useful tool within natural language processing that allows us to identify, quantify, and study subjective information. Because quintillions of bytes of data are produced every day, this technique gives us the possibility to extract attributes of this data such as negative or positive … Read more

Creation of Sentence Embeddings Based on Topical Word Representations

An approach towards universal language understanding. I have been researching word and sentence embeddings for over a year now and recently also wrote my master’s thesis [1] in this area. The results I am presenting now were also published here and came about in cooperation with SAP and the University of Liechtenstein. In the following … Read more

What I Learned from Writing a Data Science Article Every Week for a Year

3. Consistency is the critical factor. The 98 articles I published in 2018 totaled 264,894 words. For every word published, there was at least 1 word that didn’t make it through editing. This works out to about 530,000 words or 1,500 words per day. The only way this was possible while studying and working full-time was to … Read more

AI Problems are Human Problems

I have no illusions about the nature of wide-scale problem solving throughout the course of history. Rarely are sweeping changes noticed, worked on, and introduced to the populace by genius technocrats. Instead, magical innovations are often the synthesis of seemingly disparate ideas; cultural shifts do not occur due to governmental policy, but rather due to … Read more

Interactive Data Visualization with Python Using Bokeh

Simple and basic go-through example. Recently I came across this library, learned a little about it, tried it, of course, and decided to share my thoughts. From the official website: “Bokeh is an interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of versatile graphics, and to … Read more

Reinforcement Learning with Hindsight Experience Replay

Sparse and Binary Rewards. Reinforcement learning has gained a lot of popularity in recent years due to some spectacular successes such as defeating the Go world champion and (very recently) winning matches against top professionals in the popular real-time strategy game StarCraft 2. One of the impressive aspects of achievements such as that of AlphaZero (the … Read more

How Big is Big Data?

We have entered the Age of Data for good. Everything we do online and even offline leaves traces in data, from cookies to our social media profiles. So how much data is there really? How much data do we process on a daily basis? Welcome to the Zettabyte Era. Data … Read more

How GPL makes me leave R for Python :-(

Being a data scientist in a startup I can program with several languages, but often R is a natural choice. Recently I wanted my company to build a product based on R. It simply seemed like a perfect fit. But this turned out to be a slippery slope into the open-source code licensing field, which … Read more

Categories: R. Tags: Excerpt.

Thinking Of Switching Careers To A Developer?

I Have The Answers. But How? I know what you’re wondering: how do I even have the answers? Well, I could say from experience, but as an aspiring data scientist I’d rather demonstrate how data science can make any decision-making process easier and help ensure you make the correct decision. I’ll be using data from the 2018 … Read more

Unmaking Graphs

This is how things usually go when I first create any graph: Imagine I just got my hands on a juicy new dataset and I’m doing some exploratory data analysis — hunched over the keyboard with a magnifying glass looking for correlations and analyzing clues. I decide to conjure up some graphs to visualize the data because … Read more

Using Data Science to read 10 years of Luxembourgish newspapers from the 19th century

I have been playing around with historical newspaper data (see here and here). I have extracted the data from the largest archive available, as described in the previous blog post, and now created a shiny dashboard where it is possible to visualize the most common words per article, as well as read a summary of each article. The summary was made … Read more

Categories: R. Tags: Excerpt.

Announcing new software peer review editors: Melina Vidoni and Brooke Anderson

We are pleased to welcome Brooke Anderson and Melina Vidoni to our team of Associate Editors for rOpenSci Software Peer Review. They join Scott Chamberlain, Anna Krystalli, Lincoln Mullen, Karthik Ram, Noam Ross and Maëlle Salmon. With the addition of Brooke and Melina, our editorial board now includes four women and four men, located in … Read more

Categories: R. Tags: Excerpt.

Book review: Beyond Spreadsheets with R

Disclaimer: Manning publications gave me the ebook version of Beyond Spreadsheets with R – A beginner’s guide to R and RStudio by Dr. Jonathan Carroll free of charge. Beyond Spreadsheets with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. You’ll build on simple programming techniques … Read more

Categories: R. Tags: Excerpt.

missing digit in a 114 digit number [a Riddler’s riddle]

A puzzling riddle from The Riddler (as Le Monde had a painful geometry riddle this week): this number with 114 digits 530,131,801,762,787,739,802,889,792,754,109,70?,139,358,547,710,066,257,652,050,346,294,484,433,323,974,747,960,297,803,292,989,236,183,040,000,000,000 is missing one digit and is a product of some of the integers between 2 and 99. By comparison, 76! and 77! have 112 and 114 digits, respectively, while 99! has 156 digits. … Read more
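The digit counts quoted in this excerpt are easy to check for yourself. Here is a minimal sketch in Python (the original post is filed under R, so this is only an illustration, and the helper name `digit_count` is made up for this example):

```python
import math


def digit_count(n: int) -> int:
    """Return the number of decimal digits of a positive integer."""
    return len(str(n))


# Count the digits of the three factorials mentioned in the excerpt.
counts = {k: digit_count(math.factorial(k)) for k in (76, 77, 99)}

for k, digits in counts.items():
    print(f"{k}! has {digits} digits")
```

Python integers have arbitrary precision, so the factorials are computed exactly and no floating-point rounding is involved.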

Categories: R. Tags: Excerpt.

The Unsung Heroes of Modern Software Development

Open Source Foundation Leaders I’ll highlight six open source foundations that are key to many important projects. For each foundation I’ll give a brief bio, provide the number of projects being supported as of early 2019, and highlight some well-known projects. Note that these groups fall under various IRS classifications for charitable and trade organizations — not … Read more

Introducing the AI Project Canvas

AI Project Canvas. Imagine the following scenario: You have a brilliant idea for a new AI project. To make it happen, you need to convince management to fund your idea. You need to pitch your AI project idea to stakeholders and management. Yuck. This is the first step where the AI Project Canvas comes into play. … Read more

The Grass Really is Greener on the Other Side: Buying Local and its Shortcomings

Evidence-Based Policy is Bigger than You or Your Feelings — Part II Just because your vegetables travel thousands of kilometers to your kitchen table doesn’t mean they can’t be better for the environment than produce from your local farmer’s market. There. I’ve said it. As unpopular opinions go, this one is somewhere between ‘pineapple on pizza’ and ‘healthcare … Read more

ML Algorithms: One SD (σ)

The obvious questions to ask when facing a wide variety of machine learning algorithms are “which algorithm is better for a specific task, and which one should I use?” The answers to these questions vary depending on several factors, including: (1) The size, quality, and nature of data; (2) The available computational time; (3) The urgency of … Read more

How to Pace the London Marathon: Fuelled by Data

Chris is a current MSc Computer Science student at the University of Warwick, UK. He is also the co-founder of Sustain Investing. Before that, Chris worked at Citi Ventures and at Citi Markets. It started with an excuse. Hi, I’m Chris. While building Sustain with my cofounders Andre, Nick Foden and Sylwia Zieba I’ve been studying … Read more

The Blockchain Scalability Problem & the Race for Visa-Like Transaction Speed

Yes, blockchain has a scalability problem. Here’s what it is, and here’s what people are doing to solve it. The battle for a scalable solution is the blockchain’s moon race. Bitcoin processes 4.6 transactions per second. Visa does around 1,700 transactions per second on average (based on a calculation derived from the official claim of … Read more

Making Sense of Startup Valuations with Data Science

The following is a condensed and slightly modified version of a Radicle working paper on the startup economy in which we explore post-money valuations by venture capital stage classifications. We find that valuations have interesting distributional properties and then go on to describe a statistical model for estimating an undisclosed valuation with considerable ease. In … Read more

Value Investing with Machine Learning

Your favourite holding period doesn’t have to be forever… The Oracle of Omaha once said: “Price is what you pay, value is what you get.” (Warren Buffett) But how can you be certain that you are paying a fair price for an investment? How can you make the most of a fair or unfair situation? This … Read more

Introducing Snorkel

How this Tiny Project Solves One of the Major Problems in Real World Machine Learning Solutions. Building high quality training datasets is one of the most difficult challenges of machine learning solutions in the real world. Disciplines like deep learning have helped us to build more accurate models but, to do so, they require vastly … Read more

Fast Static Maps Built with R

Luke Whyte posted an article (apologies for a Medium link) over on Towards Data Science showing how to use a command line workflow involving curl, node and various D3 libraries and javascript source files to build a series of SVG static maps. It’s well written and you should give it a read especially since he … Read more

Categories: R. Tags: Excerpt.

Cross Validation — Why & How

So, you have been working on an imbalanced data set for a few days now, trying out different machine learning models, training them on a part of your data set and testing their accuracy, and you are ecstatic to see the score going above 0.95 every time. Do you really think you have achieved 95% accuracy … Read more

Price’s Protein Puzzle: 2019 update

Chains of amino acids strung together make up proteins and since each amino acid has a 1-letter abbreviation, we can find words (English and otherwise) in protein sequences. I imagine this pursuit began as soon as proteins were first sequenced, but the first reference to protein word-finding as a sport is, to my knowledge, “Price’s … Read more

Categories: R. Tags: Excerpt.

R Markdown Template for Business Reports

In this post I’d like to introduce the R Markdown template for business reports by INWTlab. It’s been my aim to have a nice and clean template that is easy to customize in colors, cover and logo. I know there are quite a few templates available, but I was missing one to be used in … Read more

Categories: R. Tags: Excerpt.

Quick Hit: Using seymour to Subscribe to your Git[la|hu]b Repo Issues in Feedly

The seymour Feedly API package has been updated to support subscribing to RSS/Atom feeds. Previously the package was intended to just treat your Feedly as a data source, but there was a compelling use case for enabling subscription support: subscribing to code repository issues. Sure, there’s already email notice integration for repository issues on most … Read more

Categories: R. Tags: Excerpt.

Using Tensorflow Serving GRPC

How to write a GRPC Client for the wrapped model. Once you have your Tensorflow or Keras based model trained, you need to think about how to use it and deploy it in production. You may want to Dockerize it as a micro-service, implementing a custom GRPC (or REST, or not) interface. Then deploy this to server … Read more

A Dog Detector and Breed Classifier

In a field like physics, things keep getting harder, to the point that it’s very difficult to understand what’s going on at the cutting edge unless it’s in highly simplified terms. In computer science though, and artificial intelligence in particular, knowledge built up slowly over 70+ years by people all over the world is still … Read more

Build a Pipeline for Harvesting Medium Top Author Data

Nuts and Bolts. One key requirement was to make deployment of my Luigi workflow very simple. I wanted to assume only one thing about the deployment environment: that the Docker daemon would be available. With Docker, I wouldn’t need to be concerned with Python version mismatches or other environmental discrepancies. It took me a little while … Read more

New R package: load and chart oceanic storms

Mapping historical storms data is now a little bit easier. Off the back of this blog, I have authored an R package (available at basilesimon/noaastorms) that downloads, cleans and parses NOAA IBtrack data for you. The National Oceanic and Atmospheric Administration releases datasets known as International Best Track Archive for Climate Stewardship. These datasets are … Read more

Categories: R. Tags: Excerpt.

December 2018: “Top 40” New CRAN Packages

By my count, 157 new packages stuck to CRAN in December. Below are my “Top 40” picks in ten categories: Computational Methods, Data, Finance, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities and Visualization. This is the first time I have used the Medicine category. I am pleased that a few packages that appear to … Read more

Categories: R. Tags: Excerpt.

Power BI

Using Power BI and R. Tutorial here: Run R scripts in Power BI Desktop. The only twist that I want to add is an idea on how to enable users without admin access to run R code. This can be achieved by storing a portable R installation on mountable file storage. R Download the … Read more

Using custom scales with the ‘scales’ package

Maybe you already heard of the package “scales” – and if you didn’t hear about it, you might have used it without knowing (e.g., in the context of ggplot2 graphs). I want to show you a few of the functionalities of the “scales” package. I will also show you how to create your own scales. … Read more

Categories: R. Tags: Excerpt.

Time Travel with RStudio Package Manager 1.0.4

We all love packages. We don’t love when broken package environments prevent us from reproducing our work. In version 1.0.4 of RStudio Package Manager, individuals and teams can navigate through repository checkpoints, making it easy to recreate environments and reproduce work. The new release also adds important security updates, improvements for Git sources, further access to retired packages, and beta … Read more

Categories: R. Tags: Excerpt.

Pix2Pix

Shocking result of Edges-to-Photo Image-to-Image translation using the Pix2Pix GAN Algorithm. This article will explain the fundamental mechanisms of a popular paper on Image-to-Image translation with Conditional GANs, Pix2Pix. Following is a link to the paper: Article Outline: I. Introduction, II. Dual Objective Function with Adversarial and L1 Loss, III. U-Net Generator, IV. PatchGAN Discriminator … Read more

Probability — Fundamentals of Machine Learning (Part 1)

The Mathematics of Probability. In the beginning, I suggested that probability theory is a mathematical framework. As with any mathematical framework, some vocabulary and important axioms are needed to fully leverage the theory as a tool for machine learning. Probability is all about the possibility of various outcomes. The set of all possible outcomes … Read more