AI, Machine Learning and Data Science Roundup: January 2020

A roundup of news about Artificial Intelligence, Machine Learning and Data Science. This is an eclectic collection of interesting blog posts, software announcements and data applications from Microsoft and elsewhere that I’ve noted recently. Open Source AI, ML & Data Science News: Pandas 1.0.0 is released, a milestone for the ubiquitous Python data frame package. … Read more AI, Machine Learning and Data Science Roundup: January 2020

Too big to deploy: How GPT-2 is breaking production

A look at the bottleneck around deploying massive models to production. The most optimistic of us envision a future in which machine learning is capable of human-level tasks: driving our cars, answering our calls, booking our appointments, responding to our emails. Reality, of course, is different. Modern production machine learning has only effectively tackled very tightly … Read more Too big to deploy: How GPT-2 is breaking production

How Sampling Biases Might Be Ruining Your Predictions

Why understanding your data distributions is key to good models. Sampling bias is the term for a bias in which a sample is collected in such a way that some members of the intended population have a lower sampling probability than others. This is best … Read more How Sampling Biases Might Be Ruining Your Predictions
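The definition above can be made concrete with a small sketch. It assumes each member of the population is kept independently with its own inclusion probability; the function name, the toy population, and the probabilities are illustrative assumptions, not taken from the article.

```python
def large_sample_mean(values, inclusion_probs):
    """Value the sample mean approaches when member i is kept with probability p_i.

    With unequal inclusion probabilities, members that are less likely to be
    sampled are under-represented, so the result drifts away from the true mean.
    """
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have the same length")
    total_weight = sum(inclusion_probs)
    if total_weight <= 0:
        raise ValueError("at least one inclusion probability must be positive")
    weighted_sum = sum(p * v for v, p in zip(values, inclusion_probs))
    return weighted_sum / total_weight


# Hypothetical population: the integers 1 to 10 (true mean 5.5).
population = list(range(1, 11))

# Unbiased design: every member is equally likely to be sampled.
uniform_probs = [1.0] * len(population)

# Biased design: members with a value above 5 are half as likely to be sampled.
biased_probs = [1.0 if v <= 5 else 0.5 for v in population]

unbiased_mean = large_sample_mean(population, uniform_probs)
biased_mean = large_sample_mean(population, biased_probs)
```

In this toy setup the biased design pulls the mean down from 5.5 to roughly 4.67, because the larger values are under-sampled.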

The right Electric Vehicle for me: a use case for Conjoint Analysis

The lease details are contained in the fields monthly_cost, upfront_cost and term. The remaining fields concern specifications of the electric vehicle (e.g. range, Sedan or SUV). The relative popularity of the vehicles can be found here: https://insideevs.com/news/343998/monthly-plug-in-ev-sales-scorecard/ Given a dizzying amount of choice, which EV should you purchase? Conjoint Analysis provides a principled way of … Read more The right Electric Vehicle for me: a use case for Conjoint Analysis

The role of Process Mining in Digital Transformations

Can you transform what you don’t comprehend? Genchi gembutsu is a Japanese term that translates to “go and see.” These are two words that transformation leaders must never forget. In the context of a digital transformation, what genchi gembutsu means is that without analyzing the place where … Read more The role of Process Mining in Digital Transformations

Will Streamlit cause the extinction of Flask?

Maybe for Machine Learning (ML) and Deep Learning (DL). For other full-stack applications, probably not! We have yet to encounter one of our Flask-based ML or DL micro-services that cannot be refactored into a Streamlit service. The challenge is to keep Streamlit micro-services small by replacing only 2 to 3 Flask-based micro-services. Extinction … Read more Will Streamlit cause the extinction of Flask?

AWS Backup is now available for Amazon Elastic File System (Amazon EFS) in 4 additional regions

AWS Backup offers a centralized, managed service to back up data across AWS services in the cloud and on premises using Storage Gateway. AWS Backup serves as a single dashboard for backup, restore, and policy-based retention of different AWS resources, including Amazon EBS volumes, Amazon EC2 instances, Amazon RDS databases, Amazon DynamoDB tables, Amazon EFS … Read more AWS Backup is now available for Amazon Elastic File System (Amazon EFS) in 4 additional regions

A Quick Introduction to CMIP6

Climate Data Science: How to easily access the next generation of climate models with Python. The Coupled Model Intercomparison Project (CMIP) is a huge international collaborative effort to improve knowledge about climate change and its impacts on the Earth System and on our society. It has been around since the 90s and today we … Read more A Quick Introduction to CMIP6

Here’s how to make Pandas Iteration 150x Faster

Now, I’ve tried to do data science in Go (and it’s possible), but it is not even remotely as pleasant as in Python, mostly due to the static nature of the language and data science being a mostly exploratory field. I’m not saying that you can’t benefit performance-wise by rewriting the finished solution in Go, but that’s … Read more Here’s how to make Pandas Iteration 150x Faster

Top 10 resources to become a Data Scientist in 2020

For the absolute beginner, like I was. I am a mechanical engineer by education, and I started my career with a core job in the steel industry, with those heavy steel-reinforced gumboots and that plastic helmet, venturing around big blast furnaces and rolling mills. Artificial safety measures, to say the least, as I knew … Read more Top 10 resources to become a Data Scientist in 2020

Identifying and tracking toil using SRE principles

One of the key measures that Google site reliability engineers (SREs) use to verify our effectiveness is how we spend our time day-to-day. We want ample time available for long-term engineering project work, but we’re also responsible for the continued operation of Google’s services, which sometimes requires doing some manual work. We aim for less … Read more Identifying and tracking toil using SRE principles

Understanding AdaBoost for Decision Tree

An implementation with R. Decision Trees are popular Machine Learning algorithms used for both regression and classification tasks. Their popularity mainly arises from their interpretability and representability, as they mimic the way the human brain makes decisions. In my previous article, I introduced some ensemble methods for decision trees, whose aim is that of … Read more Understanding AdaBoost for Decision Tree

Solving One Truly Big Number Problem in Transport

What effect would having a few pickups (as opposed to only deliveries) have? Let’s pretend that instead of making 60 deliveries we have 50 deliveries and 10 pickups. This means we have 50 Crates to deliver, with 50 Crates capacity, plus 10 Crates to pick up and return to the depot. What approach would you … Read more Solving One Truly Big Number Problem in Transport

50+ Free DataSets for DataScience Projects

Hello All, This is just a short note to specify that the list of FREE datasets is updated for 2020. There are 50+ sites and links to the newly released Google Dataset search engine. So, have fun exploring these data repositories to master programming, create stunning visualizations and build your own unique project portfolios. Some … Read more 50+ Free DataSets for DataScience Projects

Lewis Carroll’s proposed rules for tennis tournaments by @ellis2013nz

Last week I wrote about the impact of seeding the draw in a tennis tournament. Seeding is one way to increase the chance of the top players making it to the final rounds of a single elimination tournament, leading to fairer outcomes and to a higher chance of the best matchups happening in the finals. … Read more Lewis Carroll’s proposed rules for tennis tournaments by @ellis2013nz

Measuring Quality of Conversations of AI Agents

Artificial Intelligence (AI) agents are everywhere. They are embedded within your smartphone (Apple Siri, Google Assistant), they are in your smart home devices (Amazon Alexa, Google Home), you have probably interacted with some while speaking to a company’s customer service department, and they are embedded in the chat widget for many websites you visit; you … Read more Measuring Quality of Conversations of AI Agents

Time-Series Forecasting in Real Life: Budget forecasting with ARIMA

There are several ways you can model a time series. The most popular are: Simple moving average: with this approach, you’re saying the forecast is based on the average of the n previous data points. Exponential smoothing: it exponentially decreases the weight of previous observations, such that increasingly older data points have less impact in … Read more Time-Series Forecasting in Real Life: Budget forecasting with ARIMA
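The two approaches named in the excerpt can be sketched in a few lines of plain Python. This is a minimal illustration rather than the article's own code: the function names, the window size n, and the smoothing factor alpha are assumptions chosen for the example.

```python
def moving_average_forecast(series, n):
    """Forecast the next value as the mean of the last n observations."""
    if n <= 0 or n > len(series):
        raise ValueError("n must be between 1 and len(series)")
    window = series[-n:]
    return sum(window) / n


def exponential_smoothing_forecast(series, alpha):
    """Forecast the next value with simple exponential smoothing.

    Each step blends the newest observation with the previous smoothed level,
    so the weight of an observation shrinks geometrically as it gets older.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical monthly budget figures, for illustration only.
budget = [10.0, 20.0, 30.0, 40.0]

ma_forecast = moving_average_forecast(budget, n=2)
es_forecast = exponential_smoothing_forecast(budget, alpha=0.5)
```

A larger alpha makes the smoothed forecast react faster to recent observations, while a larger n makes the moving average smoother but slower to follow changes.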

How to build and apply Naive Bayes classification for spam filtering

I believe that today almost everyone has a smartphone, and a lot of people keep an email account or two. This means that you must be familiar with tons of messages promising a lot of money, fantastic lottery wins, great presents and secrets of life. We get dozens of spam messages every day unless you … Read more How to build and apply Naive Bayes classification for spam filtering

What is a GAN?

How a weird idea became the foundation of cutting-edge AI. Take any course on machine learning and you’ll invariably encounter Generative Adversarial Networks, or GANs. Understanding them means mastering the surprising power of playing a computer off against itself. It’s around five o’clock and you’ve just finished your homework. ‘I’m done!’ ‘Great! Would you like … Read more What is a GAN?

Analyzing Yelp Dataset with Scattertext spaCy

Exploratory data analysis and visualization for text data using NLP, Scattertext and spaCy. One of the most crucial tasks in the text mining field is to present the content of the text data visually. Using natural language processing (NLP), a data scientist can summarize documents, create topics, explore storylines of the content from different angles and … Read more Analyzing Yelp Dataset with Scattertext spaCy

Keep Jupyter Notebook Running Even After Browser is Closed

Ways to Leave the Jupyter Notebook Running and their drawbacks. Keeping the browser tab open to run Jupyter Notebook files for days is not the most exciting work. It can be troublesome. Here are some ways I found to solve the issues. Each solution has its benefits and drawbacks, so without further ado, let’s start … Read more Keep Jupyter Notebook Running Even After Browser is Closed

Bayesian Neural Networks with TensorFlow Probability

If you have not installed TensorFlow Probability yet, you can do it with pip, but it might be a good idea to create a virtual environment first. pip install --upgrade tensorflow-probability Open your favorite editor or JupyterLab. Import all necessary libraries. # Load libraries and functions. import pandas as pd import numpy as np import tensorflow as tf tfk … Read more Bayesian Neural Networks with TensorFlow Probability

Comparing Ensembl GTF and cDNA

It seems that most people think Ensembl’s GTF file and cDNA fasta file mean the same transcripts: Watch out! @ensembl‘s Fasta and GTF annotation files available via https://t.co/2AhCSnL7py do not match (there are transcripts in the GTF not found in the Fasta file. Anyone else expected them to match? — K. Vitting-Seerup (@KVittingSeerup) August 13, … Read more Comparing Ensembl GTF and cDNA

A Shiny App for Tracking Moral Networks

This is a post outlining a ShinyApp that I made for visualising inter-participant agreement on questions relating to Haidt’s Moral Foundations (e.g., Haidt and Joseph 2008). This is part of a line of research on moral judgements, inspired by the DAFINET project, where I aim to investigate the role of agreement with others in the robustness … Read more A Shiny App for Tracking Moral Networks

An efficient way to install and load R packages

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers]. Unlike other programs, only fundamental functionalities come by default … Read more An efficient way to install and load R packages

another easy Riddler

[This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers]. A quick riddle from the Riddler: In a two-person game, … Read more another easy Riddler

Python and AWS SSM Parameter Store

Now, it’s time to write the script that will retrieve the secret parameter we just stored in SSM parameter store and decrypt it so we can use it in our application. Let’s create a new Python file in the project directory and name it retrieve_and_decrypt_ssm_secret.py. Notice the get_parameter() function’s argument named WithDecryption. In this … Read more Python and AWS SSM Parameter Store
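The retrieval step described above might look roughly like the sketch below. It is not the article's script: the helper names and the example parameter name and region are hypothetical, and only the boto3 SSM call get_parameter with Name and WithDecryption comes from the excerpt. The client is passed in as an argument so the logic can be exercised without AWS credentials.

```python
def retrieve_secret(ssm_client, name):
    """Return the decrypted value of a parameter stored in SSM Parameter Store.

    WithDecryption=True asks SSM to decrypt a SecureString value before
    returning it; without it, the encrypted ciphertext is returned.
    """
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


def make_ssm_client(region_name):
    """Create a real SSM client (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so retrieve_secret can be tested offline

    return boto3.client("ssm", region_name=region_name)


# Example usage with hypothetical values (commented out: it needs AWS access):
# client = make_ssm_client("us-east-1")
# secret = retrieve_secret(client, "/myapp/db_password")
```

Passing the client in, instead of creating it inside the function, is a design choice for this sketch; it keeps the retrieval logic independent of how credentials and regions are configured.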

Top 5 Open Data Science Competitions with Cash Prizes

Last week over dinner celebrating the Chinese New Year, an acquaintance who is learning Data Science in his free time asked me how to formulate a problem worth solving with data and how to find the right dataset to practice his new skills. We talked a bit about his interests and strengths, and … Read more Top 5 Open Data Science Competitions with Cash Prizes

Has Artificial Intelligence Progress Ground to a Screeching Halt?

We just entered the year 2020 and the world is no closer to looking like the Jetsons with flying cars and robot maids. With all the surrounding talk and hype about artificial intelligence taking over our lives and replacing us with robots, we should all be unemployed and homeless, but we are not. Why? According … Read more Has Artificial Intelligence Progress Ground to a Screeching Halt?

Is Explainable AI (xAI) the Next Step, or Just Hype?

Recent years have seen the expansion of artificial intelligence into an array of industries with varying levels of disruption. Once a horizon technology (perhaps similar to how we now view quantum computing), AI has officially breached everyday life, and informed opinions are no longer reserved for tech enthusiasts and elite data scientists. Now, stakeholders include executives, … Read more Is Explainable AI (xAI) the Next Step, or Just Hype?

How Data Scientists Can Balance Practicality and Rigor

When building quantitative systems that drive commercial value, pragmatism and innovation are not in conflict with one another. For growing and lean start-ups with challenging research problems and data-focused customers, data science research must yield clear business wins quickly and iteratively. An effective approach to scaling technology in these environments must embody a mix of … Read more How Data Scientists Can Balance Practicality and Rigor

AWS Batch now available in AWS GovCloud (US) Regions

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU, GPU, or memory-optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, … Read more AWS Batch now available in AWS GovCloud (US) Regions

Windows Server applications, welcome to Google Kubernetes Engine

In the beta release of Windows Server container support in GKE (version 1.16.4), Windows and Linux containers can run side-by-side in the same cluster. This release also includes several other features aimed at helping you meet the security, scalability, integration and management needs of your Windows Server containers. Some highlights include: Private clusters: a security … Read more Windows Server applications, welcome to Google Kubernetes Engine

Building the R Community in Southern Africa

[This article was first published on R Consortium, and kindly contributed to R-bloggers]. By Heather Turner, Chair of Forwards, the R Foundation taskforce for underrepresented … Read more Building the R Community in Southern Africa

A tactile guide to Python Collections Final

Python is a powerful programming language with dynamic semantics, guided by 19 principles known as the “Zen of Python”. The principles are listed below: “Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better … Read more A tactile guide to Python Collections Final

Guinea Pig Breed Classification

Step 6. Model Comparison and Selection. The InceptionV3 transfer learning model had the best scores overall. (Figure: metric comparison for different models.) On top of that, it was able to brilliantly classify ‘Skinny’, where classical classifiers had generally failed (high recall). Notably, none of the models were able to confidently identify ‘Abyssinian’ as such (low … Read more Guinea Pig Breed Classification

Newbie’s Guide to Study Reinforcement Learning

Taking baby steps in the realm of Reinforcement Learning. If the metered paywall is bothering you, go to this link. If you want to know my path for Deep Learning, check out my article on Newbie’s Guide to Deep Learning. What I am going to talk about here is … Read more Newbie’s Guide to Study Reinforcement Learning

P versus NP — The million dollar problem!

On May 24, 2000, the Clay Mathematics Institute announced seven mathematical problems, a solution to any one of which earns the solver a US $1,000,000 reward. They are famously known as the Millennium Problems, and only one of the seven has been solved to date. Want to make a million dollars? Try solving … Read more P versus NP — The million dollar problem!

Hyperledger Fabric on Azure Kubernetes Service Marketplace template

Customers exploring blockchain for their applications and solutions typically start with a prototype or proof of concept effort with a blockchain technology before they get to build, pilot, and production rollout. During the latter stages, apart from the ease of deployment, there is an expectation of flexibility in the configuration in terms of the number … Read more Hyperledger Fabric on Azure Kubernetes Service Marketplace template

Building Models with Keras

Keras is a high-level API for building neural networks in Python. The API supports sequential neural networks, recurrent neural networks, and convolutional neural networks. It also allows for easy and fast prototyping due to its modularity, user-friendliness, and extensibility. In this post, we will walk through the process of building sequential neural networks for regression … Read more Building Models with Keras

Introducing PyLathe: Use Lathe In Python Using Julia

Handling Data. Machine learning works almost exactly as it should in PyLathe, with very few issues that jeopardize this. Some of these issues come to light from time to time but, for the most part, they aren’t a big deal. The good news about using PyLathe is that now we’re in Python, and with … Read more Introducing PyLathe: Use Lathe In Python Using Julia

Building sensitivity atlases

Researchers, environmental managers, ecologists: we are all looking for a better perspective. Fieldwork, remote sensing and systemic understanding let us piece together knowledge, which in turn can be used to prioritize human interaction with our environment. All this knowledge represents generalizations. The paradox of generalizations is that they are sometimes needed for … Read more Building sensitivity atlases