Optimizing Feature generation

Feature generation is the process of creating new features from one or multiple existing features, potentially for use in statistical analysis. This process makes new information available during model construction and therefore, hopefully, results in a more accurate model. In this article I describe how to use a feature interaction detection algorithm based on … Read more
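
As a minimal sketch of the idea, assuming plain Python and hypothetical feature names (`age`, `income`) that do not come from the article, pairwise products are one simple way to derive new features from existing ones. This is illustrative only and is not the interaction detection algorithm the article goes on to describe:

```python
from itertools import combinations


def add_pairwise_products(row):
    """Return a copy of `row` extended with the product of every feature pair."""
    out = dict(row)
    # Sort the names so the generated feature names are deterministic.
    for a, b in combinations(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out


# Hypothetical feature values, chosen only for illustration.
sample = {"age": 30.0, "income": 2.0}
expanded = add_pairwise_products(sample)
```

The original features are kept unchanged, so a model can still use them alongside the generated ones.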

Decision Tree from Scratch in Python

Decision trees are among the most powerful Machine Learning tools available today and are used in a wide variety of real-world applications from Ad click predictions at Facebook¹ to Ranking of Airbnb experiences. Yet they are intuitive, easy to interpret — and easy to implement. In this article we’ll train our own decision tree classifier … Read more

Discerning Odors Using Machine Learning

Hey Google, what does this smell like? Deep learning has made many advances in sight: using computer vision to identify objects, detect cancer in cells, and guide self-driving cars. It has also made many advances in sound: live captioning, AI-generated music, and offline speech recognition are some examples. It is because of these … Read more

The Most Important Supreme Court Decision For Data Science and Machine Learning

Google Books ruled legal in massive win for fair use (updated), Ars Technica Nov 14 2013. Google Wins: Court Issues a Ringing Endorsement of Google Books, Publishers Weekly, Nov 14, 2013. Google book-scanning project legal, says U.S. appeals court, Reuters, October 16, 2015. “We trust that the Supreme Court will see fit to correct the … Read more

Amazon DocumentDB (with MongoDB compatibility) is now available in the Europe (Paris) region

Amazon DocumentDB is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. You can use Amazon DocumentDB in the following AWS regions: US East (N. Virginia, Ohio), US West (Oregon), Europe (Paris, Ireland, Frankfurt, London), and Asia Pacific (Mumbai, Singapore, Sydney, Tokyo, Seoul). For more information on AWS … Read more

Imagineering & Resurrections

Generative design technologies can be used to emulate and reconfigure things that already exist. There has already been much discussion on the impact of “deepfakes” — an application of deep learning that creates fake photos, videos, and writing based on their real counterparts. But there’s been less discussion on how entire works might be pulled … Read more

How GCP helps you take command of your threat detection (Cloud Developer Advocate; Product Manager)

Why do we keep talking about security all the time? Why hasn’t anyone just gone and fixed it? You’ve probably heard these questions, whether from your leadership, or a board member, or just from friends. Then you labor at explaining why security in the cloud is so complex and challenging, the constant arms race, and … Read more

Keep Parquet and ORC from the data graveyard with new BigQuery features (Product Manager, Google BigQuery)

“At Pandora, we have petabytes of data spread across multiple Google Cloud storage services; accordingly, we expect BigQuery’s federated query capability to be a useful tool for integrating our diverse data assets into a unified analytics ecosystem,” says Greg Kurzhals, product manager at Pandora. “The support for Parquet and other external data source formats will … Read more

DataOps and data science at enterprise scale

Editor’s note: This is the 11th episode of the Towards Data Science podcast’s “Climbing the Data Science Ladder” series, hosted by Jeremie Harris, Edouard Harris and Russell Pollari. Together, they run a data science mentorship startup called SharpestMinds. You can listen to the podcast below: One thing that you might not realize if you haven’t … Read more

Data Science Bootcamp: Would I do it again?

One year ago today (yes, on Halloween), I started a data science bootcamp. This seemed like as good a time as any to look back and share some thoughts and takeaways. Bootcamps are not for everyone. The cost can limit who participates. If you do a bootcamp, focus on the bootcamp. What I learned is … Read more

Automating bits and pieces of your daily life

Being an avid techie and problem solver, my mind is always looking out for opportunities to apply what I’ve learnt. Other than during my time in internships, I haven’t really put my school fees to good use. Until one fateful day, a notification showed up on my phone: My mum’s daily routine of collating meal … Read more

Gold-Mining Week 9 (2019)

[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. The post Gold-Mining Week 9 (2019) appeared first … Read more

Business intelligence applied to a user engagement problem

Consumer retention, or user engagement, is one key area where companies often turn to data for guidance on where the problems lie and how they can be solved. So, let’s imagine that we are analysts working for a technology company whose most important KPI revolves around how … Read more

Tensorflow 2.0 Data Transformation for Text Classification

A complete end-to-end process for classifying text. In this article, we will utilize TensorFlow 2.0 and Python to create an end-to-end process for classifying movie reviews. Most TensorFlow tutorials focus on how to design and train a model using a preprocessed dataset. Typically, preprocessing the data is the most time-consuming part of an AI project. … Read more

What Triggers Crime in NYC Parks?

Walking your dog in one of the city’s parks may be your daily routine or you may occasionally visit the park to get some fresh air. Regardless of the purpose of your visit, safety is the most critical concern. Talking about safety, I recall a great quote … Read more

Offensive Programming in action (part III)

[This article was first published on NEONIRA, and kindly contributed to R-bloggers]. This is the third post on offensive programming, dedicated to using offensive programming … Read more

It’s Been a Bad Year for Smartphone Facial Recognition Technologies

It was recently announced that the Google Pixel 4 would be replacing its fingerprint technology with facial recognition technology to increase security. This was one of the most exciting selling points for the model, but it didn’t take long to discover the loopholes. It was quickly discovered that … Read more

Reinforcement Learning with AWS DeepRacer

From a toy car to AlphaGo and autonomous Teslas. In March 2016, Lee Sedol, the greatest Go player of the past decade, was defeated 4–1 by AlphaGo. Computers have beaten the best humans at chess before, but Go is on another level of complexity. Do you know what’s even crazier? The machine had only … Read more

AWS App Mesh is now available in Europe (Paris) Region

AWS App Mesh is a service mesh that provides application-level networking to make it easy for your services to communicate with each other across multiple types of compute infrastructure. App Mesh standardizes how your services communicate, giving you end-to-end visibility and ensuring high availability for your applications.

Jupyter notebook autocompletion

How can you use Jupyter notebook autocompletion? The good news is that you do not need to install anything, as it comes with the standard Jupyter notebook setup. To start using autocompletion, start typing your variable name and press the Tab key on your keyboard. When you do it the box … Read more

Scraping Hansard with Python and BeautifulSoup

Packages required

These are the packages I used:

    import csv
    from bs4 import BeautifulSoup
    import pandas as pd
    import requests

csv allows you to manipulate and create csv files. BeautifulSoup is the web scraping library. Pandas will be used to create a dataframe to put our results into a table. Requests is used to send HTTP requests; to … Read more

Clustering in detail

It takes context, plots, algorithms, metrics, and fiddling! Some of the students in a data science course showed me this very interesting dataset about Brazilian states they found on the internet. It was love at the first plot! But it took much more than just the first plot to grasp some of its nuances. If … Read more

Calculating String Similarity in Python

As before, let’s start with a basic definition: Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.[2] I know, it’s not the cleanest of definitions, but I find it good enough. It requires some math knowledge, … Read more
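
The definition above can be sketched in a few lines of plain Python. This is a minimal illustration under the assumption that the vectors are equal-length lists of numbers; the function name `cosine_similarity` is chosen here and is not taken from the article, which may well use a library instead:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The definition only covers non-zero vectors.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


parallel = cosine_similarity([1, 2, 3], [2, 4, 6])   # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])       # right angle
```

Vectors pointing in the same direction score close to 1 and perpendicular vectors score 0, regardless of their lengths.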

Introduction to Sequence Modeling Problems

In sequence learning problems, we know that the true output at timestep ‘t’ is dependent on all the inputs that the model has seen up to the time step ‘t’. Since we don’t know the true relationship, we need to come up with an approximation such that the function would depend on all the previous … Read more
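
To make the dependence on all inputs up to time step t concrete, here is a minimal sketch in plain Python. The helper `running_outputs` and the choice of a running mean as the function are assumptions made purely for illustration; they are not the approximation the article builds:

```python
def running_outputs(inputs, f):
    """Compute y_t = f(x_1, ..., x_t) for every time step t."""
    return [f(inputs[: t + 1]) for t in range(len(inputs))]


def mean(xs):
    return sum(xs) / len(xs)


# Each output depends on the whole prefix of inputs seen so far.
ys = running_outputs([1.0, 3.0, 5.0], mean)
```

Changing an early input changes every later output, which is the property a sequence model has to capture.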

Let the kids into the library

The data ops team should provide tools to lower the barrier for all your employees, not only the Java or Python developers. ETL/ELT should be easy to do for everyone, maybe via SQL instead of Java or maybe using a data integration tool that has a drag and drop interface, such as [CDAP] or [Matillion]. … Read more

TensorFlow Enterprise makes accessing data on Google Cloud faster and easier (Developer Programs Engineer, Google Cloud AI)

Data is at the heart of all AI initiatives. Put simply, you need to collect and store a lot of it to train a deep learning model, and with the advancements and increased availability of accelerators such as GPUs and Cloud TPUs, the speed of getting the data from its storage location to the training … Read more

Practical Data Problems in ML

Before we can do anything else, we have to find where the data lives and who can provide us access. This step is usually harder than it sounds. Large enterprises will have well-structured data sitting in a dozen different silos (each with their own owner), while startups and smaller companies lack the time and funding … Read more

Business Strategy For Data Scientists

But most of us will hopefully work for growing, high-potential companies (less stress of being laid off). Thus, let’s take a look at the two primary deciders of whether our firm will achieve liftoff and ultimately become profitable and successful: customer lifetime value and customer acquisition cost. … Read more

A brief primer on Variational Inference

[This article was first published on Fabian Dablander, and kindly contributed to R-bloggers]. Bayesian inference using Markov chain Monte Carlo methods can be notoriously slow. … Read more

AWS Athena helps to find the worst place to park your car in Portland.

After visiting Portland, OR, last weekend, I decided to explore some publicly available datasets about the city. In this post, we are going to calculate the number of incidents related to vehicles (theft from or theft of a vehicle) and the number of parking spots in each Portland neighborhood using Athena geo queries. After that, … Read more

How vital are powerful graphics for Data-Science?

The saving grace of the GPU comes in the form of packages designed with CUDA in mind. Most statistical machine-learning operations involve moving high volumes of data in the form of numerical matrices and computing values in real time. Luckily, this is precisely what a graphics card is designed to do. Although there are shortcomings … Read more

Full Stack Machine Learning on Azure

Jupyter Notebooks are great (see my full notebook here): Easy way to annotate my thinking as I test out different ideas, which helps when revising later or passing it to someone else! Quicker to set up and run than a VS Code project; the ability to isolate cells and run them independently is like a modified REPL! … Read more

81st TokyoR Meetup Roundup: A Special Session in {Shiny}!

[This article was first published on R by R(yo), and kindly contributed to R-bloggers]. As another sweltering summer ends, another TokyoR Meetup! With global warming in … Read more

The Mysterious Case of the Ghost Interaction

This spooky post was written in collaboration with Yoav Kessler (@yoav_kessler) and Naama Yavor (@namivor). Experimental psychology is moving away from repeated-measures-ANOVAs, and towards linear mixed models (LMM). LMMs have many advantages over rmANOVA, including (but not limited to): Analysis of single trial data (as opposed to aggregated means per condition). Specifying more than one … Read more

Location, Location, Location! in data science

Location, Location, Location! You have heard this many times. It is a common mantra in real estate. Does that apply in data science as well? How do we embrace the location component in Data Science? Is it only another column in your dataset? Or perhaps spatial is special. Location data (big data) is ubiquitous as … Read more

Using Python To Create a Slack Bot

Working at a startup, we needed to automate messages in order to get notified of certain events and triggers. For example, the company I work with deals with connections to certain stores. If that connection is broken, Python will read that information in our database. We can now send … Read more

Natural Language Understanding — Core Component of Conversational Agent

We are living in an era where messaging apps deal with all sorts of our daily activities, and in fact, these apps have already overtaken social networks, as indicated in the BI Intelligence Report. In addition to this clear point, the consumption of messaging platforms is further expected to grow significantly in the … Read more

Power of XGBoost & LSTM in Forecasting Natural Gas Price

A forecasting cost model is a prerequisite to the development and validation of new optimization methods and control tools. Here, I will show a simple yet powerful approach to forecasting using machine learning algorithms. Psychics and fortune tellers have used Tarot cards for hundreds of years, and Trusted Tarot will give us an accurate reading that’s … Read more