Doing meaningful work with Machine Learning — Classify Disaster Messages

Build models to help disaster organizations save people’s lives. I’m writing this post at 1am in Bucharest, Romania. Hello there again! Welcome to my fourth piece of content about Machine Learning. I’ve recently done a project that I believe to be socially meaningful. I’ll give a brief overview of what this is all about and I’ll dive … Read more Doing meaningful work with Machine Learning — Classify Disaster Messages

Reinforcement Learning: From Grid World to Self-Driving Cars

0. Agents, Environments, and Rewards Underlying many of the major announcements from researchers in Artificial Intelligence in the last few years is a discipline known as reinforcement learning (RL). Recent breakthroughs are mostly driven by minor twists on classic RL ideas, enabled by the availability of powerful computing hardware and software that leverages said hardware. … Read more Reinforcement Learning: From Grid World to Self-Driving Cars

Supervised Learning: Basics of Classification and Main Algorithms

Introduction As stated in the first article of this series, Classification is a subcategory of supervised learning where the goal is to predict the categorical class labels (discrete, unordered values, group membership) of new instances based on past observations. There are two main types of classification problems: Binary classification: The typical example is e-mail spam … Read more Supervised Learning: Basics of Classification and Main Algorithms

What’s your soccer team’s nemesis?

Is Barcelona really Real Madrid’s toughest opponent? Historical data paint an interesting story. Real Madrid vs Barcelona. Manchester United vs Liverpool. Inter vs Milan. Olympique Lyonnais vs Olympique de Marseille. Chelsea vs everybody. European soccer is filled with some amazing rivalries. These rivalries got created and evolved over time for reasons on … Read more What’s your soccer team’s nemesis?

Keras challenges the Avengers

Sentiment Analysis, also called Opinion Mining, is a useful tool within natural language processing that allows us to identify, quantify, and study subjective information. Because quintillions of bytes of data are produced every day, this technique gives us the possibility to extract attributes of this data such as negative or positive … Read more Keras challenges the Avengers

The Danger of Artificial Intelligence in Recruiting (and 3 Suggestions)

I recently came across one of the most well-intended, and most unnerving, applications of AI in recruiting: a talking robot head pitched as a solution to avoid bias in interviewing. Picture a robot the size of an Alexa with an actual human face painted on top. The face changes, tries to show expression and … Read more The Danger of Artificial Intelligence in Recruiting (and 3 Suggestions)

Creation of Sentence Embeddings Based on Topical Word Representations

An approach towards universal language understanding I have been researching word and sentence embeddings for over a year now and recently also wrote my master’s thesis [1] in this area. The results which I am presenting now were also published here and resulted in cooperation with SAP and the University of Liechtenstein. In the following … Read more Creation of Sentence Embeddings Based on Topical Word Representations

What I Learned from Writing a Data Science Article Every Week for a Year

3. Consistency is the critical factor The 98 articles I published in 2018 totaled 264,894 words. For every word published, there was at least 1 word that didn’t make it through editing. This works out to about 530,000 words or 1,500 words per day. The only way this was possible studying and working full-time was to … Read more What I Learned from Writing a Data Science Article Every Week for a Year
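The arithmetic in this excerpt can be checked with a short sketch. It uses only the figures quoted above; the 365-day year and the exact one-to-one ratio of discarded to published words are simplifying assumptions (the excerpt says "at least 1"), and the variable names are illustrative.

```python
# Figures quoted in the excerpt.
published_words = 264_894
articles = 98

# Assumption: exactly one discarded word per published word (the stated lower bound).
total_words = 2 * published_words

# Assumption: the writing was spread evenly over a 365-day year.
words_per_day = total_words / 365

# Average length of a published article.
words_per_article = published_words / articles

print(total_words)
print(round(words_per_day))
print(round(words_per_article))
```

Under these assumptions the total comes to 529,788 words, or roughly 1,450 per day, which is consistent with the rounded figures of 530,000 and 1,500 in the excerpt.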

AI Problems are Human Problems

I have no illusions about the nature of wide-scale problem solving throughout the course of history. Rarely are sweeping changes noticed, worked on, and introduced to the populace by genius technocrats. Instead, magical innovations are often the synthesis of seemingly disparate ideas; cultural shifts do not occur due to governmental policy, but rather due to … Read more AI Problems are Human Problems

Interactive Data Visualization with Python Using Bokeh

Simple and basic go-through example Recently I came across this library, learned a little about it, tried it, of course, and decided to share my thoughts. From the official website: “Bokeh is an interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of versatile graphics, and to … Read more Interactive Data Visualization with Python Using Bokeh

Reinforcement Learning with Hindsight Experience Replay

Sparse and Binary Rewards Reinforcement learning has gained a lot of popularity in recent years due to some spectacular successes such as defeating the Go world champion and (very recently) winning matches against top professionals in the popular real-time strategy game StarCraft 2. One of the impressive aspects of achievements such as that of AlphaZero (the … Read more Reinforcement Learning with Hindsight Experience Replay

Reinforcement Learning Tutorial Part 1: Q-Learning

This is the first part of a tutorial series about reinforcement learning. We will start with some theory and then move on to more practical things in the next part. During this series, you will not only learn how to train your model, but also what is the best workflow for training it in the … Read more Reinforcement Learning Tutorial Part 1: Q-Learning

Travelling in the BlockChain Ecosystem with Python

First things first, you’ll want to download Anaconda on your local machine, and set up a conda environment with Python 3.5+, then launch a Jupyter Notebook to run the code chunks below. Better yet, if you haven’t already tried, run the following code in Google Colab for free. Next, we’ll find the number … Read more Travelling in the BlockChain Ecosystem with Python

Thinking Of Switching Careers To A Developer?

I Have The Answers. But How? I know what you’re wondering: how do I even have the answers? Well, I could say from experience, but as an aspiring data scientist I want to demonstrate how data science can make any decision-making process easier and ensure you make the correct decision. I’ll be using data from the 2018 … Read more Thinking Of Switching Careers To A Developer?

Using Data Science to read 10 years of Luxembourguish newspapers from the 19th century

I have been playing around with historical newspaper data (see here and here). I have extracted the data from the largest archive available, as described in the previous blog post, and now created a shiny dashboard where it is possible to visualize the most common words per article, as well as read a summary of each article. The summary was made … Read more Using Data Science to read 10 years of Luxembourguish newspapers from the 19th century

Announcing new software peer review editors: Melina Vidoni and Brooke Anderson

We are pleased to welcome Brooke Anderson and Melina Vidoni to our team of Associate Editors for rOpenSci Software Peer Review. They join Scott Chamberlain, Anna Krystalli, Lincoln Mullen, Karthik Ram, Noam Ross and Maëlle Salmon. With the addition of Brooke and Melina, our editorial board now includes four women and four men, located in … Read more Announcing new software peer review editors: Melina Vidoni and Brooke Anderson

Book review: Beyond Spreadsheets with R

Disclaimer: Manning publications gave me the ebook version of Beyond Spreadsheets with R – A beginner’s guide to R and RStudio by Dr. Jonathan Carroll free of charge. Beyond Spreadsheets with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. You’ll build on simple programming techniques … Read more Book review: Beyond Spreadsheets with R

missing digit in a 114 digit number [a Riddler’s riddle]

A puzzling riddle from The Riddler (as Le Monde had a painful geometry riddle this week): this number with 114 digits 530,131,801,762,787,739,802,889,792,754,109,70?,139,358,547,710,066,257,652,050,346,294,484,433,323,974,747,960,297,803,292,989,236,183,040,000,000,000 is missing one digit and is a product of some of the integers between 2 and 99. By comparison, 76! and 77! have 112 and 114 digits, respectively, while 99! has 156 digits. … Read more missing digit in a 114 digit number [a Riddler’s riddle]
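The digit counts quoted for the factorials can be verified with a minimal sketch using Python's arbitrary-precision integers. The helper name `digit_count` is illustrative and not taken from the original post, and the sketch does not attempt to solve the riddle itself.

```python
import math


def digit_count(n: int) -> int:
    """Return the number of decimal digits of a positive integer."""
    return len(str(n))


# Compare against the counts quoted in the excerpt: 112, 114 and 156.
for k in (76, 77, 99):
    print(k, digit_count(math.factorial(k)))
```

Since 77! already has 114 digits, the same length as the riddle's number, the comparison gives a rough sense of how many of the integers between 2 and 99 can appear in the product.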

The Unsung Heroes of Modern Software Development

Open Source Foundation Leaders I’ll highlight six open source foundations that are key to many important projects. For each foundation I’ll give a brief bio, provide the number of projects being supported as of early 2019, and highlight some well-known projects. Note that these groups fall under various IRS classifications for charitable and trade organizations — not … Read more The Unsung Heroes of Modern Software Development

The Grass Really is Greener on the Other Side: Buying Local and its Shortcomings

Evidence-Based Policy is Bigger than You or Your Feelings — Part II Just because your vegetables travel thousands of kilometers to your kitchen table doesn’t mean they can’t be better for the environment than produce from your local farmer’s market. There. I’ve said it. As unpopular opinions go, this one is somewhere between ‘pineapple on pizza’ and ‘healthcare … Read more The Grass Really is Greener on the Other Side: Buying Local and its Shortcomings

ML Algorithms: One SD (σ)

The obvious question to ask when facing a wide variety of machine learning algorithms is “which algorithm is better for a specific task, and which one should I use?” The answer varies depending on several factors, including: (1) The size, quality, and nature of data; (2) The available computational time; (3) The urgency of … Read more ML Algorithms: One SD (σ)

How to Pace the London Marathon: Fuelled by Data

Chris is a current MSc Computer Science student at the University of Warwick, UK. He is also the co-founder of Sustain Investing. Before that, Chris worked at Citi Ventures and at Citi Markets. It started with an excuse Hi, I’m Chris. While building Sustain with my cofounders Andre, Nick Foden and Sylwia Zieba I’ve been studying … Read more How to Pace the London Marathon: Fuelled by Data

The Blockchain Scalability Problem & the Race for Visa-Like Transaction Speed

Yes, blockchain has a scalability problem. Here’s what it is, and here’s what people are doing to solve it. The battle for a scalable solution is the blockchain’s moon race. Bitcoin processes 4.6 transactions per second. Visa does around 1,700 transactions per second on average (based on a calculation derived from the official claim of … Read more The Blockchain Scalability Problem & the Race for Visa-Like Transaction Speed

What are the Skills Needed to Become a Data Scientist in 2019?

It’s hardly a surprise to anyone in the tech and related industries that “data scientist” is the best job to have in the States. After all, this has been what sources like the Harvard Business Review and Glassdoor report for what is now four years in a row. And even if we take the base … Read more What are the Skills Needed to Become a Data Scientist in 2019?

Making Sense of Startup Valuations with Data Science

The following is a condensed and slightly modified version of a Radicle working paper on the startup economy in which we explore post-money valuations by venture capital stage classifications. We find that valuations have interesting distributional properties and then go on to describe a statistical model for estimating an undisclosed valuation with considerable ease. In … Read more Making Sense of Startup Valuations with Data Science

Predicting Premier league standings — putting that math to some use

I am a casual fan when it comes to football, but the idea of building a mathematical model that can be applied to a real-world problem seemed exciting enough to have a try at it. (Let’s kick off then, shall we? ⚽️) Breaking down the problem The rankings in the league table are primarily determined by … Read more Predicting Premier league standings — putting that math to some use

Price’s Protein Puzzle: 2019 update

Chains of amino acids strung together make up proteins and since each amino acid has a 1-letter abbreviation, we can find words (English and otherwise) in protein sequences. I imagine this pursuit began as soon as proteins were first sequenced, but the first reference to protein word-finding as a sport is, to my knowledge, “Price’s … Read more Price’s Protein Puzzle: 2019 update

Quick Hit: Using seymour to Subscribe to your Git[la|hu]b Repo Issues in Feedly

The seymour Feedly API package has been updated to support subscribing to RSS/Atom feeds. Previously the package was intended to just treat your Feedly as a data source, but there was a compelling use case for enabling subscription support: subscribing to code repository issues. Sure, there’s already email notice integration for repository issues on most … Read more Quick Hit: Using seymour to Subscribe to your Git[la|hu]b Repo Issues in Feedly

A Dog Detector and Breed Classifier

In a field like physics, things keep getting harder, to the point that it’s very difficult to understand what’s going on at the cutting edge unless it’s in highly simplified terms. In computer science though, and artificial intelligence in particular, knowledge built up slowly over 70+ years by people all over the world is still … Read more A Dog Detector and Breed Classifier

Build a Pipeline for Harvesting Medium Top Author Data

Nuts and Bolts One key requirement was to make deployment of my Luigi workflow very simple. I wanted to assume only one thing about the deployment environment: that the Docker daemon would be available. With Docker, I wouldn’t need to be concerned with Python version mismatches or other environmental discrepancies. It took me a little while … Read more Build a Pipeline for Harvesting Medium Top Author Data

Time Travel with RStudio Package Manager 1.0.4

We all love packages. We don’t love when broken package environments prevent us from reproducing our work. In version 1.0.4 of RStudio Package Manager, individuals and teams can navigate through repository checkpoints, making it easy to recreate environments and reproduce work. The new release also adds important security updates, improvements for Git sources, further access to retired packages, and beta … Read more Time Travel with RStudio Package Manager 1.0.4

December 2018: “Top 40” New CRAN Packages

By my count, 157 new packages stuck to CRAN in December. Below are my “Top 40” picks in ten categories: Computational Methods, Data, Finance, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities and Visualization. This is the first time I have used the Medicine category. I am pleased that a few packages that appear to … Read more December 2018: “Top 40” New CRAN Packages

Power BI

Using Power BI and R Tutorial here: Run R scripts in Power BI Desktop The only twist that I want to add is an idea on how to enable users without admin access to run R code. This can be achieved by storing a portable R installation on a mountable file storage. R Download the … Read more Power BI

New R package: load and chart oceanic storms

Mapping historical storms data is now a little bit easier. Off the back of this blog, I have authored an R package (available at basilesimon/noaastorms) that downloads, cleans and parses NOAA IBtrack data for you. The National Oceanic and Atmospheric Administration releases datasets known as the International Best Track Archive for Climate Stewardship. These datasets are … Read more New R package: load and chart oceanic storms

How Does Back-Propagation in Artificial Neural Networks Work?

Our Neural Network Let’s finally draw a diagram of our long-awaited neural net. It should look something like this: The leftmost layer is the input layer, which takes X0 as the bias term of value 1, and X1 and X2 as input features. The layer in the middle is the first hidden layer, which also takes … Read more How Does Back-Propagation in Artificial Neural Networks Work?

Pix2Pix

Shocking result of Edges-to-Photo Image-to-Image translation using the Pix2Pix GAN Algorithm This article will explain the fundamental mechanisms of a popular paper on Image-to-Image translation with Conditional GANs, Pix2Pix. Following is a link to the paper: Article Outline I. Introduction II. Dual Objective Function with Adversarial and L1 Loss III. U-Net Generator IV. PatchGAN Discriminator … Read more Pix2Pix

Probability — Fundamentals of Machine Learning (Part 1)

The Mathematics of Probability In the beginning, I suggested that probability theory is a mathematical framework. As with any mathematical framework, some vocabulary and important axioms are needed to fully leverage the theory as a tool for machine learning. Probability is all about the possibility of various outcomes. The set of all possible outcomes … Read more Probability — Fundamentals of Machine Learning (Part 1)