Introduction The ‘Data Science Strategic Guide — Get Smarter with Data Science’ is envisioned as a series of articles that serve as a strategic guide to the essential challenges, pitfalls and principles to keep in mind when implementing and executing data science projects in the real world. We also focus on how you can get maximum … Read more Get Smarter with Data Science — Tackling Real Enterprise Challenges
Yet another boring barplot? No! I’ve asked my students from MiNI WUT to visualize some data about their favorite movies or series. The results are pretty awesome. Believe it or not, the charts in these posters were created with ggplot2 (most of them)! Star Wars Fan of StaR WaRs? Find out which color is the most popular for lightsabers! Yes, these … Read more Data, movies and ggplot2
Condemned to spend a whole lifetime preparing some farewell (Desarraigo, Extremoduro) I live just a few minutes from the Spanish National Museum of Science and Technology (MUNCYT), where I go from time to time with my family. The museum is full of interesting artifacts, from a portrait of Albert Einstein made with thousands … Read more Spinning Pins
This is my second blog post from the series of My R take on Advent of Code. If you’d like to know more about Advent of Code, check out the first post from the series or simply go to their website. Below you’ll find the challenge from Day 2 and the solution that worked for … Read more My R take on Advent of Code – Day 2
Walkable neighborhoods are great for health, happiness and economic growth. Cities around the world that want to attract a talented young workforce are increasingly focused on creating a good pedestrian experience. How could we measure and map walkability using data science tools? This blog suggests an approach drawing on Pandana, an excellent Python library developed by … Read more Measuring pedestrian accessibility
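The teaser above stops short of the method, but the core computation behind Pandana-style accessibility metrics can be sketched in plain Python (a toy illustration on made-up data, not Pandana's actual API): count the points of interest reachable within a walking-distance budget along the street network.

```python
import heapq

def shortest_dists(graph, source):
    """Dijkstra: network distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def walkability(graph, pois, origin, budget_m=800):
    """Number of POIs within budget_m metres of network distance from origin."""
    dist = shortest_dists(graph, origin)
    return sum(1 for p in pois if dist.get(p, float("inf")) <= budget_m)

# Tiny hypothetical street graph: edges are (neighbour, length in metres).
graph = {
    "home": [("cafe", 300), ("park", 900)],
    "cafe": [("home", 300), ("shop", 400)],
    "shop": [("cafe", 400)],
    "park": [("home", 900)],
}
pois = ["cafe", "shop", "park"]
print(walkability(graph, pois, "home"))  # cafe (300 m) + shop (700 m) -> 2
```

Pandana does essentially this — network-constrained distance queries — but vectorized over hundreds of thousands of nodes.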
Imagine this. You wake up and find a frightening mark on your skin so you go to the doctor’s office to get it checked out. They say it’s fine so you go home and don’t worry about it for a couple months, but then you have a throbbing pain from that spot — it looks ugly and … Read more Classifying Skin Lesions with Convolutional Neural Networks
Introduction to Quick Thoughts In a previous story, I shared skip-thoughts for computing sentence embeddings. Today, we have another unsupervised learning approach to computing sentence embeddings: Quick Thoughts. Logeswaran et al. (2018) introduced the quick-thoughts approach to retrieve sentence embeddings for downstream applications. After reading this article, you will understand: Quick-Thoughts Design, Evaluation, Experiments, Reference … Read more Building sentence embeddings via quick thoughts
Improving yield by removing bad quality material with image recognition Authors: Partha Deka and Rohit Mittal. Automation in industrial manufacturing: today’s increased level of automation in manufacturing also demands automation of material-quality inspection with little human intervention. The trend is to reach human-level accuracy or better in quality inspection with automation. To stay … Read more Quality inspection in manufacturing using deep learning based computer vision
Maybe so. Maybe not. I’ll level with you: I’m a PhD dropout. I’ve gotten a lot of mileage out of that title, by the way: it hints that I’ve done a lot of grad school, but still maintains the aura of badassery that only the word “dropout” can provide. In some ways, it’s the ultimate humble … Read more Do you need a graduate degree for data science?
I. Background This project was built by Danny Vo, Troy Stidd, Patrick Zhu, Aaron Li, and Samuel Zhang. The code for our project can be found in this repo. For our project, we chose to look for problems in the domain of Magic the Gathering (MTG) because our group has some domain knowledge and the … Read more It’s Magic : Data Science Lab Project
In my last article, we looked at how to get meaningful insights from a huge collection of medical articles gathered from PubMed, a free archive of biomedical and life sciences literature. This time, we are going to continue working with medical articles to create something completely different. We will use Supervised Machine Learning to identify … Read more Applying Logistic Regression to PubMed
In this previous post, I showed how one can scrape top-level NBA game data from BasketballReference.com. In the post after that, I demonstrated how to scrape play-by-play data for one game. After writing those posts, I thought to myself: why not do both? And that is what I did: scrape all the box scores for … Read more All the (NBA) box scores you ever wanted
Characterising biological pathways from gene expression data Gene Set Variation analysis is a technique for characterising pathways or signature summaries from a gene expression dataset. GSVA builds on top of Gene Set Enrichment analysis where a set of genes is characterised between two condition groups defined in the sample. GSEA (Gene set enrichment analysis) works … Read more Decoding Gene Set Variation Analysis
How much data do we need to build this computer vision classifier? This is the data question. In my experience the data question comes up in almost every computer vision project we’ve taken on, and the answer is usually, “it depends.” Along with others, the State Farm® data science community has been investigating the impact … Read more The Data Question
How many times will you be forced to hear “Wonderful Christmastime”? 122 hours, 1,510 tracks. Only 80 original songs. Source: 106.7 LiteFM; 11/30/2018–12/5/2018; Download the data. It starts well before Thanksgiving. This year it was November 16th at 5pm to be precise. That’s when New York’s WLTW 106.7 LiteFM makes a hard switch to an all-Christmas … Read more I Analyzed 122 Hours of Holiday Radio
So…you want to play a pRank with R? This short post will give you a fun function you can use in R to help you out! How to change a file’s modified time with R Let’s say we have a file, test.txt. What if we want to change the last modified date of the file … Read more So you want to play a pRank in R…?
Shiny is a great tool for fast prototyping. When a data science team creates a Shiny app, sometimes it becomes very popular. From that point on, the app is a production tool used by many people; it should be reliable and fast for many concurrent users. There are many ways to optimize a Shiny app like … Read more Alternative approaches to scaling Shiny with RStudio Shiny Server, ShinyProxy or custom architecture.
Source: unsplash.com Dec 18, 2018 The quality and accuracy of machine learning models depend on many factors. One of the most critical factors is pre-processing the dataset before feeding it into the machine learning algorithm that learns from the data. Therefore, it is critical that you feed them the right data for the problem you … Read more Serverless Distributed Data Pre-processing using Dask, Amazon ECS and Python (Part 1)
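The teaser above doesn't reach the Dask part, but the partition-wise pattern that Dask scales out can be illustrated with the standard library alone (a hedged toy sketch, not the post's actual ECS pipeline): clean each partition in parallel, combine partial statistics, then apply the global statistic back to every partition.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(seq, size):
    """Split a list into partitions of at most `size` elements."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def preprocess(values, n_chunks=4):
    """Two-pass, partition-wise preprocessing: drop missing values, mean-centre."""
    parts = chunk(values, max(1, len(values) // n_chunks))
    with ThreadPoolExecutor() as ex:
        # Pass 1: per-partition cleaning plus partial sums (the "map" step).
        cleaned = list(ex.map(lambda p: [v for v in p if v is not None], parts))
        total = sum(sum(p) for p in cleaned)
        count = sum(len(p) for p in cleaned)
        mean = total / count
        # Pass 2: broadcast the global statistic back to each partition.
        centred = list(ex.map(lambda p: [v - mean for v in p], cleaned))
    return [v for p in centred for v in p]

print(preprocess([1.0, None, 3.0, 5.0, None, 7.0]))  # mean 4.0 -> [-3, -1, 1, 3]
```

Dask's dataframes apply the same map-then-combine logic, only across worker containers instead of threads.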
Using ML and AI as a force-multiplier will be a significant competitive advantage for networking product teams Photo by Hitesh Choudhary on Unsplash Machine learning and related techniques have seen tremendous advances in the last few years. And while at times it might feel that there’s a lot of hype surrounding the space, it’s clear that … Read more Think Machine Learning and AI Won’t Impact Your Networking Product — Think Again!
Imagine Snoopy without Woodstock or Calvin without Hobbes, Friends without Rachel, Batman without Robin or Mowgli without Baloo. Social platforms thrive on the ability of their members to find relevant friends to interact with. The network effect is what drives growth, time spent, and daily active users on the application. This is even more … Read more Friend Recommendation Using Heterogeneous Network Embeddings
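The post's heterogeneous network embeddings go well beyond it, but the classic baseline for friend recommendation — rank non-friends by mutual-friend count — is a useful reference point and fits in a few lines (toy graph with hypothetical names):

```python
from collections import Counter

def recommend(friends, user, top_k=2):
    """Rank non-friends of `user` by number of mutual friends."""
    scores = Counter()
    for friend in friends.get(user, set()):
        for fof in friends.get(friend, set()):
            if fof != user and fof not in friends.get(user, set()):
                scores[fof] += 1  # one more shared friend
    return [name for name, _ in scores.most_common(top_k)]

friends = {
    "ana":  {"bob", "cara"},
    "bob":  {"ana", "cara", "dan"},
    "cara": {"ana", "bob", "dan", "eve"},
    "dan":  {"bob", "cara"},
    "eve":  {"cara"},
}
print(recommend(friends, "ana"))  # dan shares 2 mutuals, eve shares 1
```

Embedding-based recommenders replace the raw mutual count with similarity in a learned vector space, which also captures indirect and cross-type (heterogeneous) connections.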
Fusion of multiple modalities using Deep Learning Being highly enthusiastic about research in deep learning, I was always searching for unexplored areas in the field (though it is tough to find one). I had previously worked on Maths word problem solving and many such related topics. The challenge of using Deep Neural Networks as black boxes … Read more Multimodal Deep Learning
vtreat‘s purpose is to produce pure numeric R data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing values re-encoded with indicators, and high-degree categorical variables re-encoded with effects codes or impact codes). … Read more vtreat Variable Importance
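vtreat is an R package, so the following is only a toy Python sketch of the two re-encodings named above — missing-value indicators and impact coding — not vtreat's implementation:

```python
def impact_code(categories, outcomes):
    """Impact coding: replace each level with mean(outcome | level) - grand mean."""
    grand = sum(outcomes) / len(outcomes)
    by_level = {}
    for c, y in zip(categories, outcomes):
        by_level.setdefault(c, []).append(y)
    impact = {c: sum(ys) / len(ys) - grand for c, ys in by_level.items()}
    return [impact[c] for c in categories]

def with_indicator(values, fill=0.0):
    """Missing values: fill with a constant and add an is-missing indicator column."""
    filled = [fill if v is None else v for v in values]
    is_na = [1 if v is None else 0 for v in values]
    return filled, is_na

print(impact_code(["a", "a", "b", "b"], [1.0, 3.0, 5.0, 7.0]))  # [-2, -2, 2, 2]
print(with_indicator([1.5, None, 2.5]))  # ([1.5, 0.0, 2.5], [0, 1, 0])
```

Note that vtreat estimates impact codes with cross-validation to avoid target leakage; the naive in-sample version above is for illustration only.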
How to Use: Slightly Friendlier Version First install docker. Instructions for your machine can be found here. The docker getting started guide is useful for learning how docker works, although we don’t need the details to use it effectively with repo2docker. Make sure docker is running. If docker run hello-world shows the message Hello from … Read more Docker Without the Hassle
Samuel Berchuck is a Postdoctoral Associate in Duke University’s Department of Statistical Science and Forge-Duke’s Center for Actionable Health Data Science. Joshua L. Warren is an Assistant Professor of Biostatistics at Yale University. Looking Forward in Glaucoma Progression Research The contribution of the womblR package and corresponding statistical methodology is a technique for correctly accounting … Read more Statistics in Glaucoma: Part III
The Ecology Hackathon Almost one year ago now, ecologists filled a room for the “Ecology Hackathon: Developing R Packages for Accessing, Synthesizing and Analyzing Ecological Data” that was co-organised by rOpenSci Fellow, Nick Golding and Methods in Ecology and Evolution. This hackathon was part of the “Ecology Across Borders” Joint Annual Meeting 2017 of BES, … Read more rcites – The story behind the package
Reduce training time for deep neural networks by using many GPUs Marenostrum Supercomputer — Barcelona Supercomputing Center https://bsc.es (This post will be used in my master course SA-MIRI at UPC Barcelona Tech with the support of Barcelona Supercomputing Center) “Methods that scale with computation are the future of Artificial Intelligence” — Rich Sutton, father of reinforcement learning (video 4:49) In … Read more Distributed TensorFlow using Horovod
Some practical examples, tips, and thoughts on supervised ML Earlier this year, through my MBA program at Cornell Tech, I took a great intro course on Machine Learning with a fantastic professor, Lutz Finger. Lutz’s course inspired me to dig even deeper into ML and AI, so I recently started a hands-on Introduction to Machine … Read more A journey into supervised machine learning
Rethinking the problem I decided to pivot and try something new. It seemed to me that there was a clear disconnect between the odd look of the training data and images that my model was likely to see in real life. I decided I’d try building my own dataset. I had been working with OpenCV, an … Read more Training a Neural Network to Detect Gestures with OpenCV in Python
Meng Li, Dec 17. Ever seen a destination map in Tableau? It’s usually used to show the tracks of flights, bus maps, traffic and so on. There are loads of videos and articles teaching you how to create a destination map from a dataset containing all the information you need, for example, this video uses … Read more Creating US Immigration Path Map in Tableau with R
What is mlFlow? mlFlow is a framework that supports the machine learning lifecycle. This means that it has components to monitor your model during training and running, the ability to store models, load the model in production code and create a pipeline. The framework introduces 3 distinct features, each with its own capabilities. MlFlow Tracking: Tracking is … Read more Getting started with mlFlow
Storage endpoints Perhaps the most relevant part of AzureStor for most users is its client interface to storage. With this, you can upload and download files and blobs, create containers and shares, list files, and so on. Unlike the ARM interface, the client interface uses S3 classes. This is for a couple of reasons: it … Read more AzureStor: an R package for working with Azure storage
Indeed, true personalization understands customers at a deeper level — their real-time intent, purchasing history, preferences and complex shopping journeys. It then utilizes these insights to tailor congruent, 1:1 interactions across channels. So far, most companies rely on machine learning to take all this customer data and build predictive models on it, operating not just on what’s … Read more Deep Learning and Hyper-Personalization
Introduction Playing around with PyTorch and R Shiny resulted in a simple Shiny app where the user can upload a flower image and the system will then predict the flower species. Steps that I took: download labeled flower data from the Visual Geometry Group, install PyTorch and download their transfer learning tutorial script. You need to … Read more An R Shiny app to recognize flower species
Let’s take a look at the following diagram that illustrates the purposes of the specific layers in the CNN. As we can see above, starting from the left we are learning low-level features, and the further we go to the right, the more specific things are being learned. The idea behind Transfer Learning is to … Read more Histopathologic Cancer Detector – Finding Cancer Cells with Machine Learning
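The transfer-learning idea described above — keep the early, general-purpose layers frozen and train only the task-specific head — can be shown in miniature without any deep learning framework (a toy sketch with a stand-in "extractor"; the post itself fine-tunes a real CNN):

```python
import math

def extractor(x):
    """Stand-in for a pretrained feature extractor: frozen, never updated below."""
    return [x, x * x]

def train_head(data, lr=0.1, epochs=500):
    """Train only the small logistic-regression head on the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = extractor(x)
            z = w[0] * f[0] + w[1] * f[1] + b
            p = 1 / (1 + math.exp(-z))
            g = p - y  # gradient of log-loss w.r.t. z
            w = [w[0] - lr * g * f[0], w[1] - lr * g * f[1]]
            b -= lr * g
    return w, b

def predict(x, w, b):
    f = extractor(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

data = [(-2, 0), (-1, 0), (1, 1), (2, 1)]  # toy binary task
w, b = train_head(data)
print([predict(x, w, b) for x, _ in data])
```

Because the extractor's parameters never receive gradients, only the tiny head is fit — exactly the regime that makes transfer learning cheap when labeled data is scarce.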
Identifying firms with intentional distortion of financial statements is a challenging and exciting problem among auditors, banks and investors who rely on financial information to make decisions. Yet it is difficult to flag these firms, as intentional accounting misstatement (cooking the books) can take several forms: hiding company losses through other entities, recognizing revenue … Read more Detecting Firms with Intentional Misstatements using Machine Learning
We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it … Read more Day 17 – little helper to_na
Data Collection There are three datasets we’re using to run this experiment: A dataset we’ll collect ourselves that includes over 3400 song lyrics between 1970 and 2018. A list of prohibited/restricted words from www.freewebheaders.com that we’ll use to assess the perceived levels of profanity in lyrics. A training dataset from Kaggle (originally used for the … Read more 49 Years of Lyrics: Why so Angry?
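The second dataset in the list above — a restricted-word list used to assess perceived profanity — suggests a simple token-matching score, sketched here as a hedged toy (the banned words below are stand-ins, not the actual freewebheaders.com list):

```python
import re
from collections import Counter

def profanity_score(lyrics, banned):
    """Share of tokens in the lyric that appear on the restricted-word list."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    hits = sum(1 for t in tokens if t in banned)
    return hits / len(tokens) if tokens else 0.0

def flagged_words(lyrics, banned):
    """Which restricted words occur, and how often (useful for per-year trends)."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(t for t in tokens if t in banned)

banned = {"darn", "heck"}  # hypothetical stand-in list
print(profanity_score("oh darn oh heck oh darn", banned))  # 3 of 6 tokens -> 0.5
```

Normalizing by track length, as above, keeps longer songs from looking angrier merely because they have more words.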
For anyone who has been paying attention, it will not have gone unnoticed that the past year has seen a dramatic expansion in the use of face recognition technology, including at schools, border crossings, and in interactions with the police. Most recently, Delta announced that some passengers in Atlanta will be able to check in and … Read more On the Perils of Automated Face Recognition
In a project of developing PPNR balance projection models, I tried to use the Phillips-Ouliaris (PO) test to investigate the cointegration between the historical balance and a set of macro-economic variables, and noticed that implementations of the PO test in various R packages, e.g. urca and tseries, would give different results. After reading through the … Read more Phillips-Ouliaris Test For Cointegration
Explore further the world of data science, machine learning and artificial intelligence We are on a mission to get the best content relevant to data science, machine learning, and artificial intelligence out there for everyone. One of the challenges with any content platform on the internet is having a dedicated and curated list of resources … Read more Our Collections
Article jointly written by Arthur Pesah and Antoine Wehenkel Motivation There are usually two ways of coming up with a new scientific theory: starting from first principles, deducing the consequent laws, and coming up with experimental predictions in order to verify the theory; or starting from experiments and inferring the simplest laws that explain your data. … Read more Improve your scientific models with meta-learning and likelihood-free inference
How to conceptualize and implement effective data science projects Results, not hype Motivation The more I delve into data science, the more convinced I am that companies and data science practitioners must have a clear view on how to cut through the machine learning and AI hype, to implement an effective data science strategy that drives business … Read more 6 uncommon principles for effective data sciences
Using Keras and TensorFlow.js to classify seven types of skin lesions Alex Yu, Dec 16. After doing research on Convolutional Neural Networks, I became interested in developing an end-to-end machine learning solution. I decided to use the HAM10000 dataset to build a web app to classify skin lesions. In this article, I’ll provide some background information … Read more Building a Skin Lesion Classification Web App
Learn how to Deal with Anxiety. When you start researching how to become a data scientist, you will discover an unfortunate fact about the profession. Namely, that becoming a data scientist requires knowledge of a broad and deep set of tools, technologies, and skills. All of which makes the prospect of becoming a data scientist VERY … Read more How to Learn Data Science: Staying Motivated.
Ho, ho, ho! It’s almost Christmas time and I don’t know about you, but I can’t wait for it! And what can be a better way of killing the waiting time (advent!) than participating in the excellent Advent of Code? Big thanks to Colin Fay for telling me about it! It’s a series of coding riddles, … Read more My R take on Advent of Code – Day 1
Advanced NLP techniques for deep learning With the problem of Image Classification more or less solved by Deep Learning, Text Classification is the next developing theme in deep learning. For those who don’t know, text classification is a common task in natural language processing, which transforms a sequence of text of indefinite length … Read more What Kagglers are using for Text Classification
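As a concrete baseline for the task defined above — mapping a variable-length text to a category — here is a tiny multinomial naive Bayes classifier (a toy sketch on made-up documents; the Kaggle solutions the post surveys use deep models):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count words per class and class frequencies over a labeled corpus."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def classify(text, model):
    """Pick the class maximizing log P(class) + sum log P(word|class), add-one smoothed."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label in class_counts:
        lp = math.log(class_counts[label] / n_docs)
        total = sum(word_counts[label].values())
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("great fun loved it", "pos"), ("boring waste of time", "neg"),
        ("loved the plot great acting", "pos"), ("dull and boring", "neg")]
model = train_nb(docs)
print(classify("loved it great", model))  # -> pos
```

The deep approaches (CNNs, RNNs, attention) replace the bag-of-words independence assumption with learned representations of word order and context.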
This report describes several R packages that allow HTML content to be rendered as part of an R plot. The core package is called ‘layoutEngine’, but that package requires a “backend” package to perform HTML layout calculations. Three example backends are demonstrated: ‘layoutEngineCSSBox’, ‘layoutEnginePhantomJS’, and ‘layoutEngineDOM’. We also introduce two new font packages, ‘gyre’ and … Read more 2018-13 Rendering HTML Content in R Graphics
In a new paper in Monthly Weather Review, minimum CRPS and maximum likelihood estimation are compared for fitting heteroscedastic (or nonhomogenous) regression models under different response distributions. Minimum CRPS is more robust to distributional misspecification while maximum likelihood is slightly more efficient under correct specification. An R implementation is available in the crch package. Citation … Read more Minimum CRPS vs. maximum likelihood
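Minimum CRPS estimation fits the regression parameters by minimizing the continuous ranked probability score; for a Gaussian response the CRPS has a well-known closed form, sketched here in Python (the paper's crch implementation is in R):

```python
import math

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)   # standard normal density
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))          # standard normal CDF
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

# A perfectly centred forecast still pays for its spread:
# CRPS = sigma * (2/sqrt(2*pi) - 1/sqrt(pi)) ~= 0.2337 * sigma
print(round(crps_normal(0.0, 1.0, 0.0), 4))  # -> 0.2337
```

Summing this over the training sample and minimizing in (mu, sigma) — each modeled as a function of covariates — is exactly the minimum CRPS estimator the paper compares with maximum likelihood.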
Dec 16, 2018 Source: Dark Reading Background Our project was inspired by Jamie Ryan Kiros, who created a model trained on 14 million romance passages to generate a short romantic story for a single image input. Similarly, the ultimate goal of our project was to output a short story for children. “neural-storyteller is a recurrent neural … Read more Is a Picture Worth A Thousand Words?
Opening up a Colab Notebook When using Colab for the first time, you can launch a new notebook here: Once you have a notebook created, it’ll be saved in your Google Drive (Colab Notebooks folder). You can access it by visiting your Google Drive page, then either double-click on the file name, or right-click, and then … Read more Getting Started with TensorFlow in Google Colaboratory