The evolution of Big Data technologies over the last 20 years has been a history of battles with growing data volume. The challenge of big data has not been solved yet, and the effort will certainly continue, with the data volume continuing to grow in the coming years. The original relational database system … Read more Design Principles for Big Data Performance
Setting up. Configuring the databricks-connect client is fairly easy. You will need to accept the agreement, enter the URL (including the https://), enter the token, enter the cluster ID, and press Enter twice to accept the default values for the Org ID and Port questions. Do you accept the above agreement? [y/N] y Set new … Read more Working On a Databricks Cluster From A Remote Machine
What’s Alexa got to do with it? With the advent of Alexa, Google Assistant, Siri, and Alibaba and Baidu killing it in smart speaker adoption in China, consumer voice AI is eating the world, but to what end? Case in point: Alexa Echo devices don’t make much of a profit for Amazon on hardware sales. … Read more If Software is Eating the World
Who would win the 2018 presidential election if there were no fake news? How many new drivers would sign up in San Francisco if Uber had carried out the alternative incentive plan? Would employees be more efficient at work if companies encourage a 10-minute coffee break every two hours? These questions are difficult to answer … Read more Why do we do and how can we benefit from experimental studies?
Supervised learning is the most common subbranch of machine learning today. Typically, new machine learning practitioners will begin their journey with supervised learning algorithms. Therefore, the first of this three-post series will be about supervised learning. Supervised machine learning algorithms are designed to learn by example. The name “supervised” learning originates from the idea … Read more A Brief Introduction to Supervised Learning
Emmanuel Macron, President of France. Addressing world leaders at the UN General Assembly’s annual high-level debate on Tuesday, French President Emmanuel Macron called for courage, and for politicians to take the risks needed to achieve real solutions to contemporary challenges. Full text here. The President’s speech had a polarity of 0.08, which indicates it was neutral … Read more Visualizing the speeches of world leaders at UNGA
The easy way. Spark, Wikipedia. Running PySpark on your remote machine, and using it from within Jupyter or Python, requires a bit of installation and playing around in your shell. The following method worked for me: I was able to install PySpark and run the demo code from inside Jupyter Lab. So let’s begin. Install … Read more How to Install PySpark on a remote machine
According to Wikipedia, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. In simple terms, Gini impurity is the measure of impurity in a node. Its formula is: where J is … Read more The Simple Math behind 3 Decision Tree Splitting criterions
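As a rough illustration of the Gini impurity described in the excerpt above, the sketch below assumes the standard definition: one minus the sum of the squared class proportions in a node. The function name `gini_impurity` is hypothetical and not taken from the article.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty node is treated as pure
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# A node holding a single class is pure; an even two-class split is maximally impure.
pure = gini_impurity(["a", "a", "a", "a"])
mixed = gini_impurity(["a", "a", "b", "b"])
```

A pure node scores 0, and the score rises as the labels in the node become more evenly mixed, which is why a lower value indicates a better split.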
Note: This was originally posted at martinheinz.dev There are plenty of articles written about lots of cool features in Python such as variable unpacking, partial functions, enumerating iterables, but there is much more to talk about when it comes to Python, so here I will try to show some of the features I know and … Read more Python Tips and Trick, You Haven’t Already Seen
The way I extract data from scanned PDF images. When I was extracting data to perform an analysis of Malaysia’s vehicle market, I faced the problem of how to retrieve clean data. I kept googling to see whether there were other sources for the data, but I found no answer. At … Read more How to Retrieve Data from A Complex Formatted File
Best Practices for Statistical Ethics. This is not a situation unique to government agencies. Many companies and organizations face these challenges. How should those of us empowered with data act in order to reinforce good practices and ethics? One place to start is the American Statistical Association’s ethical guidelines. Not surprisingly these include: Choosing methods … Read more Data Science and Politics
The next step is to set up the project and the environment in PyCharm. There are two possible scenarios: starting a new project, or cloning a project from GitLab. Setting up a new project with an existing environment is very straightforward in PyCharm. Once you open the initial window, you will see the option: + Create a … Read more A Framework to Distribute Data Projects Across Teams
With the enormous amount of data that the world is currently collecting, alongside the proliferation of AI, Machine Learning, and “Big Data” methodologies, especially in the last several years, many data roles have been invented to use these data and methods to deliver real-world value. There are many skillsets that are … Read more Lessons Learned From The Front Line of Analytics
Photo by Alice Donovan Rouse on Unsplash A log is a sequence of records ordered by time. It’s configured to allow more and more records to be appended at the end: Logs keep track of anything and everything. There are all kinds of logs in computing environments: Server logs are important. They keep track of … Read more Data Logs: Data’s unifying abstraction
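To make the idea of a time-ordered, append-only sequence of records concrete, here is a minimal sketch. The class `AppendOnlyLog` and its methods are hypothetical names chosen for illustration, not something from the article or from any particular logging system.

```python
import time
from dataclasses import dataclass, field


@dataclass
class AppendOnlyLog:
    """Records are only ever added at the end, so offsets follow arrival order."""

    records: list = field(default_factory=list)

    def append(self, payload):
        # The offset is the record's position in the log; existing records never move.
        offset = len(self.records)
        self.records.append((offset, time.time(), payload))
        return offset

    def read_from(self, offset):
        # A reader can resume from any offset and see everything appended since.
        return [payload for (_, _, payload) in self.records[offset:]]


log = AppendOnlyLog()
log.append("server started")
log.append("user logged in")
last = log.append("user logged out")
```

Because nothing is updated in place, two readers that start from the same offset always see the same records in the same order.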
Like anything, boundaries and frameworks need to be established, and artificial intelligence should be no different. Whether we have realized it or not, AI is changing the way we live. It’s present in the way social media feeds are organised; the way predictive searches show up on Google; and how music … Read more Ethics in AI: Decisions by Algorithms
Note: This was originally posted at martinheinz.dev For me, the biggest struggle when starting a new project has always been trying to set the project up “perfectly”. I always try to use the best directory structure so everything is easy to find and imports work nicely, set up all commands so that I’m always one click/command away … Read more Ultimate Setup for Your Next Golang Project
In this project, we will use the Foursquare API to explore neighborhoods in San Francisco, get the most common venue categories in each neighborhood, use the k-means clustering algorithm to find similar neighborhoods, and use the Folium library to visualize the neighborhoods in San Francisco and their emerging clusters. Project Flow. This is the clustering map … Read more Kickstart Your First Clustering Project in San Francisco Neighborhoods
I’m proud to announce the release of an R package that has cured one of my own personal itches: pulling and working with USDA data, specifically Quick Stats data from NASS. tidyUSDA is a minimal package for doing just that. The following is cut out from the package vignette, which you can find here: https://github.com/bradlindblad/tidyUSDA … Read more Announcing tidyUSDA: An R Package for Working with USDA Data
I thoroughly enjoyed my first hackathon (you can read about my experience with scope in a previous post). The opportunity arose through BetaNYC to participate in the Mobility for All Abilities Hackathon, part of the larger National Day of Civic Hacking of 2019. I was on the Reliable Access to Subways team, partnered with TransitCenter … Read more Getting Stuff Done at Hackathons for Rookies
It’s great news to see that there are more family-friendly neighborhoods in London than there are neighborhoods to avoid. In fact, there are 136 neighborhoods to choose from. Here is a simple breakdown: So for any families like my own who are looking for the best family-friendly neighborhoods in London, England, I suggest you start … Read more My First Data Science Project — Family-Friendly Neighborhoods in London
Source: Pexels You have just started your first job as a data scientist and you are excited to start using your random forest skills to actually make a difference. You get all set up, primed to start your Jupyter Notebook, only to realize you first need to “SSH” into a different machine to run … Read more An Often Overlooked Data Science Skill
We create a Sampling Distribution of the mean of the WeightLoss samples assuming our Null hypothesis is True. Central Limit Theorem: The central limit theorem simply states that if you have a population with mean μ and standard deviation σ, and take random samples from the population, then the distribution of the sample means will … Read more P-value Explained Simply for Data Scientists
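The sampling distribution of the mean can be explored with a small simulation. This is only a sketch under stated assumptions: the population here is synthetic and deliberately skewed (exponential), standing in for the WeightLoss data, which is not available in the excerpt, and the helper `sample_means` is a hypothetical name.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated random samples (with replacement) and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


# Assumption: a skewed synthetic population, clearly not normally distributed.
rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(10_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

means = sample_means(population, sample_size=50, n_samples=2_000)
mean_of_means = statistics.fmean(means)
spread_of_means = statistics.stdev(means)
expected_spread = sigma / 50 ** 0.5  # standard error: sigma / sqrt(n)
```

Even though the population is skewed, the sample means cluster around μ with a spread close to σ/√n, which is the behavior a null-hypothesis sampling distribution relies on.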
An intuition and tutorial on trust score. Several efforts to improve deep learning performance have been made over the years, but only a few works have aimed at better understanding the models and their predictions, and whether they should be trusted or not. In this article, we shall lightly probe the trustworthiness of a model … Read more How can I trust you?
What your expected salary will be after graduating based on college degree and college region. Teenagers reach a point in their lives where they need to pursue their goals. Some have ambitions that require college education. Some are still unsure about their goals or ambitions so they go to college to find them. … Read more Where should you go for college?
There is a general consensus that when we talk about open data we are referring to any piece of data or content that is free to access, use, reuse, and redistribute. Due to the way most governments have rolled out their open data portals, however, it would be easy to assume that the data available … Read more Is There a Difference Between Open Data and Public Data?
After reading all of that stuff about positives and negatives (a couple of times preferably), you now have a basic idea and intuition about the confusion matrix, and you see that it’s not that confusing after all; it just needs to “sink in” properly. But is that all there is to the confusion matrix? I hope you’re kidding. … Read more A Non-Confusing Guide to Confusion Matrix
‘If you want to be good at swimming in pools, that is fine, go for Kaggle. If you want to be good on the open sea, go for Omdena’ — Leonardo Sanchez, Omdena challenge collaborator from Brazil. ‘What I learned in the past couple of months in Omdena’s AI challenge is much more than what … Read more Why Kaggle Is Not Inclusive and How to Improve It.
Following my previous article on the Strata Data Science Conference, I started to ponder future developments in data science and business intelligence — namely, how these two simple terms will change the way we work, think, and live. To be honest, “data science” seems somewhat distant to me; however, the concept of “business intelligence” can … Read more Power BI as a Tool for Business Intelligence
In the last story, we discussed RASA NLU, which is an open-source conversational AI tool. We used the TensorFlow pipeline for intent classification. The pipeline has different components such as tokenizer, featurizer, entity extractor, and intent classifier. Our intent classifier itself has sub-components such as TensorFlow embedding. Now we are going to discuss … Read more The crux of word embedding layers -Part 1
This is the first post in the series “Digital Image Processing”. In this series, we will be discussing digital images and how to process them. Let’s discuss what an image is. If you come from a signal processing background, then you might consider an image as a two-dimensional signal, i.e., a function of two variables f(x,y) … Read more Introduction to Digital Images
‘Growth Hacks’ vs ‘Growing Pain’. Some people accused him of fraud; I’m not so sure about that. To me, this looks more like an honest mistake due to a lack of experience in scaling up and entering a field he is not familiar with. See, most of his more popular videos are entry-level tutorials with … Read more When ‘Growth Hacks’ Meets ‘Growing Pain’
In Search Of A Better Approach To Physical Training Credit: Coen Van Den Broek I cycle from time to time. I’m talking road cycling. I’m not a pro, I don’t want to be — but I love competition, I love racing and I do love a challenge. Turns out cycling is a sport heavily tangled … Read more Machine Learning, Cycling & 300W FTP (Part 1)
What’s the actual objective of a business case interview? It’s to test the ability of a candidate to both think critically and creatively when faced with an open-ended problem. But as an interviewer how do you assess these things? The thinking critically part is not as hard — if the person is stumbling through basic … Read more Making Data Science Interviews Better
Machine learning prediction models using time-series weather data. Image licensed from Adobe Stock Dengue, commonly called dengue fever, is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, Dengue can cause severe … Read more Using Keras and TensorFlow to Predict Dengue Fever Outbreaks
Lesson 3 of “Practical Deep Learning for Coders” by fast.ai “I have not failed. I’ve just found 10,000 ways that won’t work.” ~Thomas Edison I’m a math adjunct working my way through Lesson 3 of “Practical Deep Learning for Coders” by fast.ai, and this week has been a major pride-swallower for me. At the end … Read more 10,000 Ways That Won’t Work
Time series prediction Photo by rawpixel.com from Pexels The idea of using a Neural Network (NN) to predict the stock price movement on the market is as old as NNs. Intuitively, it seems difficult to predict the future price movement looking only at its past. There are many tutorials on how to predict the price … Read more LSTM for time series prediction
So, how exactly do you leverage this amazing technology? Luckily, you’re a whiz with a keyboard and don’t even need to see the screen to do this — here’s how it breaks down: You’re doing data science — you need data! Then you gotta see if that data is balanced and usable. Prepare your data. … Read more Hey, Can (A)I Get Your Number?
Learn visualization using Python and Folium, from scratch. Data visualization is not merely a science, it is an art. Given the way the human brain works, it is really easy to process information in the form of visualization. After almost 25 years of digital mapping, and with many companies using machine learning to collect massive amounts of data, … Read more Visualizing Tesla Superchargers in France
Exploring the Difference or Nuance between Monolithic Kernel as Opposed to Microkernel. In the dictionary, a kernel is a softer, usually edible part of a nut, seed, or fruit stone contained within its shell, such as “the kernel of a walnut”. It can also be the central or most important part of something: “this is … Read more What is the Kernel?
Why we need a new breed of leader in the data-fueled era Multiple choice time! What’s the best kind of worker? A) Reliable workers who carry out orders precisely, quickly, and efficiently. B) Unreliable workers who may or may not feel like doing what they’re told. If you think this is a no-brainer and reliable … Read more Artificial Intelligence: Do stupid things faster with more energy!
For our analyses of anonymized mobile phone location data here at Invenium we use, amongst others, Apache Spark™. In our applications, we interface it directly using the Java API as well as using the Python API pyspark. Recently we noticed an unusual performance drop when running our algorithms. After making sure that we haven’t made … Read more How to get the Python Environment of all Spark Cluster Nodes
SciPy wants your ideas to help it become more user-friendly You’ve heard of SciPy. You’ve probably used it. You might have looked through some of the technical documentation and user guides. You might even have an opinion of the documentation… But have you given any thought to actually getting involved and letting SciPy know how … Read more Get Involved With SciPy!
Endgame for “AI Winter”. How a competition, ImageNet, along with a noisy algorithm, Stochastic Gradient Descent, changed the fate of AI? Picture from The Elder Scrolls | Skyrim In the early 1980s, Winter was coming for Artificial Intelligence (AI) with a period of reduced funding and interest in AI research, which would later be called … Read more A classic bedtime story: Cinderella of Neural Networks
Variance as Information. In Machine Learning, we need features for the algorithm to figure out patterns that help differentiate classes of data. The more features there are, the more variance (variation in data) there is, and hence the model finds it easier to make ‘splits’ or ‘boundaries’. But not all features provide useful information. They can have noise … Read more Introduction to Principal Component Analysis (PCA) — with Python code
Organizations widely recognize the potential power of artificial intelligence (Ai). They instinctively understand that it feels like we’re on the cusp of something that will change our lives and our businesses in a profound way. Yet, many struggle with where to use it. If you’re looking for how and where your company should use Ai, … Read more Ai: Where To Begin?
How to discover if your NumPy is using a fast BLAS library. The NumPy Logo, Wikipedia. If your research work is highly dependent on NumPy-based calculations, such as vector or matrix additions and multiplications, then it is advisable to run a few checks to see if NumPy is using one of three … Read more Is your Numpy optimized for speed?
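One quick check along these lines is to look at NumPy's own build report. The sketch below assumes NumPy is installed and uses `numpy.show_config()`, which prints the build and linking information, including any BLAS/LAPACK libraries it detected. The exact layout of that report varies between NumPy versions, so the helper (a hypothetical name, `numpy_build_info`) only captures the text and leaves interpretation to the reader.

```python
import contextlib
import io

import numpy as np


def numpy_build_info() -> str:
    """Capture the text that numpy.show_config() prints to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


build_info = numpy_build_info()
# Inspect the report for the BLAS/LAPACK sections to see which library is linked.
print(build_info)
```

If the report names an optimized library in its BLAS section, matrix operations are likely delegated to it. A simple timing of a large matrix multiplication is a reasonable follow-up check.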
What is Ai? It depends on who you ask. Since the term was coined in 1956, “Artificial Intelligence” has endured a lifetime of misunderstanding. Explainability is the missing link and the reason why it’s misunderstood. The problem lies in the interpretation of the word “intelligence.” In the words of legendary computer scientist Edsger Dijkstra: “The question … Read more Artificial Intelligence: Explainable in every language
Photo by Ross Findon on Unsplash However, most modern web pages are quite interactive. The concept of “single-page application” means that the web page itself will change without the user having to reload or getting redirected from page to page all the time. Because this happens only after specific user interactions, there are few options … Read more Image Scraping with Python
https://www.youtube.com/watch?time_continue=82&v=ARJ8cAGm6JE This month I went to visit a friend of mine in Ireland who had just remodeled a house. She had purchased a large mirror and asked the workers onsite if they could hang it in the dining room and then she headed out for the day. When she returned, she found that while the … Read more Mirrors, Self-driving trucks, Unconscious Bias and Machine Learning/Artificial Intelligence