Using unsupervised learning with K-Modes to identify regional tech hubs
1. The setup
Hiring in the US is both expensive and time-consuming. For some context, small business owners spend approximately 40% of their time on non-revenue-generating tasks such as hiring. Separately, MIT estimates that each new US hire costs the company approximately 1.4x base salary. Startups feel these effects even more acutely, as they are typically strapped for both time and cash.
Hiring remote teams offers startups access to larger talent pools and the opportunity to pay salaries at localized rates. So, with limited time and resources, how does a startup find the best location for a remote office with the specialized skills it needs? Are there certain countries that specialize in a specific tech stack? And are China and India still the best places to look?
2. The data
I started answering these questions with the 2019 Stack Overflow Developer Survey, which included approximately 40,000 non-US-based respondents. Along with their location, each respondent reported the tech stacks and frameworks that they actively use, such as Java, Python, Angular, and Django.
Before investing in an offshore office, startups need to make sure that there is a plentiful talent pool that meets the tech and experience requirements of their company. To adequately evaluate the quality of the local talent pool, I added dimensionality to my dataset. I created weights for experience, education, and salary for each respondent to establish their quality as a potential candidate — to distinguish them beyond the stack that they work with.
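A minimal sketch of what this kind of candidate-quality weighting might look like. The column names below match the 2019 survey schema, but the weight values are purely illustrative, not the ones I used:

```python
import pandas as pd

# Toy slice of survey data; the weights below are illustrative only
df = pd.DataFrame({
    "YearsCodePro": [2, 8, 15],
    "EdLevel": ["Bachelor's", "Master's", "Bachelor's"],
    "ConvertedComp": [30_000, 55_000, 90_000],  # annual salary in USD
})

ed_weight = {"Bachelor's": 1.0, "Master's": 1.5}
df["quality"] = (
    df["YearsCodePro"] * 0.1            # reward experience
    + df["EdLevel"].map(ed_weight)      # reward education
    - df["ConvertedComp"] / 100_000     # penalize cost
)
```

The resulting `quality` score rides along with each respondent's binarized stack, so candidates can be distinguished beyond the technologies they use.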
3. The algorithm
I binarized the columns that included the respondent’s tech stack. At the end of this processing, my data looked something like this:
Example of binarized data
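A minimal sketch of that binarization step, using a toy stand-in for the survey's semicolon-delimited stack column; pandas' `str.get_dummies` does the heavy lifting:

```python
import pandas as pd

# Toy stand-in for the survey's semicolon-delimited stack column
df = pd.DataFrame({
    "Respondent": [1, 2],
    "LanguageWorkedWith": ["Java;Python", "Python;C#"],
})

# One binary column per technology: 1 if the respondent uses it, else 0
binarized = df["LanguageWorkedWith"].str.get_dummies(sep=";")
```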
Clustering binary data requires a specialized algorithm. I used a cousin of K-Means called K-Modes, which allowed me to cluster my data in its current form. A quick refresher: K-Means is a partitioning algorithm that groups continuous data into K clusters based on existing (and unseen!) similarities.
1. Initialize K centroids at random,
2. Assign each point to the nearest centroid based on a distance metric (such as Euclidean distance),
3. Recompute each centroid as the mean of its cluster,
4. Reassign points to the nearest centroid,
5. Repeat steps 2 through 4 until no points are assigned to a different cluster.
The result is a grouping of the data in which the distance between objects within each cluster is minimized.
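The steps above can be sketched in a few lines of NumPy. This is a toy implementation for intuition, not a production clusterer:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids at random (here, sampled from the data)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 4-5: repeat until the assignments (and centroids) stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Running it on two well-separated blobs of points recovers the two groups, which is exactly the behavior the list above describes.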
As highlighted in its name, K-Means uses the mean to calculate each centroid. However, the algorithm only works for continuous data, not categorical data. For example, if we take the Euclidean distance between Respondent 1 and Respondent 2 (as seen above), K-Means would assign both respondents to the same cluster when we know that this is incorrect.
So, for this binarized data, how can we calculate the distance between different developers’ skills? The solution lies in the K-Modes algorithm, which uses dissimilarities instead of distances between each data point and each centroid. In this case, the dissimilarity or “distance” between a data point and a centroid can be defined as the number of tech stacks they disagree on. When the data point and the centroid agree on a tech stack, the “distance” is lower; when they diverge, it is higher.
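A minimal sketch of this matching dissimilarity; the column labels here are illustrative:

```python
# "Distance" used by K-Modes: the number of attributes on which
# two binary records disagree (simple matching dissimilarity)
def matching_dissimilarity(a, b):
    return sum(x != y for x, y in zip(a, b))

# Columns: [Java, Python, C#]
respondent = [1, 1, 0]   # uses Java and Python
centroid   = [1, 0, 1]   # mode says Java and C#
```

Here the respondent and centroid agree on Java but disagree on Python and C#, giving a dissimilarity of 2.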
The K-Modes algorithm also diverges from K-Means when calculating new centroids. Instead of taking the average of the cluster, K-Modes takes the mode of the cluster, which is exactly what I needed to cluster respondents based on the particular technologies they use.
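The mode-based centroid update can be sketched like this, using a hypothetical three-respondent cluster:

```python
import numpy as np

# Hypothetical cluster of three respondents' binarized stacks
# Columns: [Java, Python, C#]
cluster = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
])

# K-Modes update: each centroid attribute is the per-column mode,
# i.e. a majority vote over the cluster, rather than the mean
counts = cluster.sum(axis=0)
centroid = (counts * 2 >= len(cluster)).astype(int)
```

All three respondents use Java, and a majority skip Python and C#, so the centroid becomes `[1, 0, 0]`.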
K-Modes is a relatively new algorithm, introduced in a 1998 paper by Huang, and it is not yet part of the scikit-learn package. You can find K-Modes and its cousin K-Prototypes on GitHub for installation and documentation.
4. The implementation
Now that I had an algorithm that would work with my data, I needed to decide on the number of clusters. I used the Jaccard dissimilarity function within the K-Modes package, which measures how dissimilar my clusters are from one another.
The Jaccard distance, as it is also known, is one minus the size of the intersection over the size of the union, illustrated here:
Jaccard Distance Formula
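In code, the same formula is a small helper over Python sets:

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|: 0 for identical sets, 1 for disjoint ones."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)
```

For example, `{"Java", "Python"}` and `{"Python", "C#"}` share one of three distinct technologies, giving a distance of 2/3.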
As with the silhouette score for K-Means, I applied the “elbow” method to the Jaccard dissimilarity score to find the number of clusters that best fit my model. I found the elbow at 40 clusters.
5. The app
Now that I had run my model with k=40, I wanted to understand the geographic nature of my clusters and visually display the regionality of the developers within them. In particular, I wanted to build a first-line tool that early-stage startups can use when locating an offshore office.
To do this, I built a Flask app that takes the parameters of stack, education, experience, and salary, and returns an interactive map of the cluster that meets these constraints.
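A stripped-down sketch of what such a route might look like. The clusters, countries, and endpoint below are hypothetical placeholders, not the app's real data or routes:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical pre-computed lookup: in the real app, each cluster's stack
# and member countries would come from the fitted K-Modes model.
CLUSTERS = {
    0: {"stack": {"Hadoop", "Spark", "Postgres"},
        "countries": ["Estonia", "Poland", "Finland"]},
    1: {"stack": {"Java", "Kotlin", "C#"},
        "countries": ["Brazil", "Argentina", "Mexico"]},
}

@app.route("/search")
def search():
    # e.g. /search?stack=Hadoop&stack=Spark
    wanted = set(request.args.getlist("stack"))
    # Return the countries of every cluster whose stack covers the request
    matches = [c["countries"] for c in CLUSTERS.values() if wanted <= c["stack"]]
    return jsonify(matches)
```

The real app renders the matching cluster on an interactive map rather than returning JSON, but the filtering logic follows the same shape.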
The demo in the video below searches for respondents who use Hadoop, Spark, and Postgres, have at least a Bachelor’s degree and at least 4 years of experience, and make under $75K. With these parameters, my model shows that I should begin my search in Estonia, Poland, and Finland, as there are hubs of individuals with that experience.
Search for Data Engineer by tech stack
However, if I needed to build a remote office dedicated to Android app development with the specific tech stack of Java, Kotlin and C#, my model suggests looking in Central and South America first.
Search for Android Developer by tech stack
6. The conclusion
Great talent is everywhere. Companies can more accurately target regions when locating offshore offices for the specific talent they need. This model and app are an initial tool to help companies pursuing an offshore strategy take the first step toward building a remote team.
7. Additional Resources