Applying product methodologies in data science

Gareth Walker

What makes a great data-driven product? Fancy models? Groundbreaking ideas? The truth is that the secret sauce usually lies in successfully implementing a product methodology.

In this post I carry out a retro on a recent hackathon experience, using lean and agile methodology concepts of Minimum Viable Product, Risky Assumptions, and Spikes. I explore how these approaches can help a team quickly identify a use case, map the risks and complexity of the solutions envisioned, and iterate rapidly towards a shippable product.

GitHub code available here

“Where and what types of human rights abuses are taking place around the world?”

This was the not-so-modest briefing issued to Insight Data Science Fellows in a recent joint hackathon with Microsoft’s AI for Good initiative. I was a member of the Seattle team, initially composed of a room full of PhDs shaking their heads and muttering about resources, scope, and the very premise that the brief was even a valid research question. We forged ahead, got pretty far down the road, and ended up as one of two finalist teams presenting to Microsoft.

Our work analyzed the US State Department’s annual report on the human rights status of countries, published for about 160 countries worldwide. We used a quantitative scoring method designed to summarize country performance in each area of human rights (using data and methods from the Cingranelli-Richards (CIRI) Human Rights Data Project). We then produced a dashboard visualizing (1) how types of human rights violations cluster across countries, (2) their relationship to macroeconomic and development indicators, and (3) the keywords typically associated with these clusters in the State Department’s reports.

Final output of the Seattle hackathon team: credit to Kyle Chezic for data vis code

This was our first hackathon. I’m extremely proud of our team and the work we did. However, the feedback from the judges gave me the impression that while the methods and models we used showed strong technical ability, we sometimes struggled to tell a clear story about what we did and why. It has since gotten me thinking about applying product thinking to data science projects, something that, truth be told, tends to get lost in the rush to showcase technical prowess.

Insight is a post-doctoral training fellowship bridging the gap between academia and data science. The program’s remit ensures that its organizers take special pleasure in putting PhDs in situations where their old academic coping strategies die a painful death. As a species, we were not bred for our ability to quickly cut a problem down to its bare bones, understand the solution’s beneficiaries, and draw the shortest line between the two. Far better to spend the next year carefully refining the research question…and the next four years or so publishing heavily caveated answers in journals buried behind paywalls.

To be clear, I’ve no desire to add to the brisk trade of data science blogs triumphantly lambasting academia. PhDs make good data scientists because of their scientific training, not in spite of it. However, making the transition from academia to the tech sector involves deciding what tools to bring with you on the journey, and which ones to lovingly store in the attic “for later,” alongside that bread maker, rock climbing harness, and other relics of past ambitions.

Into the attic must go the academic research mindset: in particular, the instinct to evaluate the merit of a research question by its potential contribution to a corpus of knowledge, and the corresponding belief that only an in-depth approach forged in the fires of peer review will pass muster. To be retained is the understanding that domain knowledge is essential; that no matter how good your code and math, if you don’t understand the system you’re modeling, and the impact of your own biases, you’re still wandering around in the dark.

So what to take away from Microsoft’s feedback? Well, one comment indicated the use case wasn’t clear: who was using this product, and what problem were we solving for them? The second was a lack of clarity on our product methodology. We needed to tell a better story about feature prioritization and how we allotted our time to developing those features.

Defining the use case

In retrospect, the use case we had in mind was built on the back of some work Microsoft had already done with the Office of the United Nations High Commissioner for Human Rights (OHCHR). To summarize, Microsoft and OHCHR ran a joint ideation session to map out what was really needed by OHCHR staff. The output was RightsView, essentially a live dashboard of information on emerging and ongoing human rights violations around the world. A quote from the workshop indicated that OHCHR wanted the dashboard:

“To provide a clear human rights perspective on potential, emerging or ongoing crises, and to get the appropriate responses to them by engaging other parts of the UN and the international community more broadly.”¹

The session also produced a sketch of the dashboard mockup, which is a helpful starting point in deciding where to go with product development.

Product Vision: ‘The Right(s)View’, a dashboard for human rights monitoring. Source: https://news.microsoft.com/features/technology-helps-un-advance-protection-human-rights-new-ways/

The MVP approach

Given a well-researched user and use case, the task becomes to describe the functionality of that product and identify immediate priorities for development. A leading methodology for doing so has been to develop a Minimum Viable Product (MVP). This approach prescribes identifying the minimum set of features which would satisfy early users, shipping those features as soon as possible, and iterating on the product through building on successes and integrating user feedback.

Looking at the example of the RightsView dashboard, the first thing that jumps out at me is that this is essentially a problem of drinking from the firehose: we’re trying to provide a service that ingests a wide array of unstructured data with high variability and noise, and returns something structured, relevant, concise, and predictive that leads or supports the user in taking action. The depth of potential features within such a service is formidable, and we can picture it as a multi-stage funnel in which we move closer and closer to the end goal of ‘structured, relevant, concise, and predictive’ (see figure below).

When drinking from the firehose, use a funnel.

From an MVP perspective, we might then ask: what is the smallest amount of functionality that would help a user digest all this data? I would say it’s the first step of ingesting all that unstructured data and somehow translating it into classified and geolocated events, combined with a presentation layer for the structured data (most likely a map dashboard).

Prioritized MVP functionality
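To make that concrete, here is a minimal sketch of what a single classified, geolocated event record might look like before it reaches the presentation layer. The field names are my own illustration, not taken from RightsView or any existing system.

```python
# Illustrative only: a minimal record for one classified, geolocated event.
# Field names are assumptions, not taken from RightsView or any existing tool.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class RightsEvent:
    source_url: str          # where the raw text came from (RSS item, report, etc.)
    published: date          # publication date of the source
    category: str            # e.g. "Worker Rights", one of the report sections
    confidence: float        # classifier confidence in the assigned category
    country: Optional[str]   # country resolved from detected place names
    lat: Optional[float]     # geocoded latitude, if a place name was found
    lon: Optional[float]     # geocoded longitude, if a place name was found
    snippet: str             # short excerpt to display on the map dashboard
```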

It should be noted that MVP approaches are often accused of paring a product down to the point where the initial release is trivial and of limited use. So it’s worth looking at the market for examples of this functionality as a standalone product.

In the case of event mapping from public unstructured data, there’s actually a pretty healthy market. For example, the Armed Conflict Location & Event Data Project (ACLED) comes very close to this first step. However, its data relies on researchers manually coding unstructured reports submitted to the website. While this makes it a more trustworthy data source and a favorite for journalists (the Guardian newspaper, for example, uses its data frequently), the manual approach makes it a time-consuming and expensive product.

The ACLED uses a mostly manual approach to codifying events data

A good example at the other end of the spectrum is the Global Database of Events, Language, and Tone (GDELT), which represents an attempt to fully automate the process of parsing public media feeds into event data. The project’s founder Philip Schrodt outlines the necessary steps for generating this dataset in a 2011 paper, and the dataset is now available for anyone to use here. While GDELT is almost too vast to be applicable to the question of what types of human rights violations are happening and where, it does demonstrate that event data production is its own field of study, with a wide user base in policy analysis and sociology.

GDELT is an example of fully automated event detection

Having identified an MVP, and taking a lead from the steps outlined in Schrodt’s 2011 paper, I carried out some further work of my own to see what an MVP iteration of the project would have looked like. There are plenty of Medium posts walking through text classification procedures step by step, so I will just provide a brief outline and a link to the GitHub project if you’re interested.

The MVP scraped text from the State Department’s annual human rights country reports from 2015–2018. Each year covered about 160 countries; each report ran to roughly 10,000 words and was divided into the following sections:

  • Corruption and Lack of Transparency in Government
  • Discrimination, Societal Abuses, and Trafficking in Persons
  • Respect for Civil Liberties
  • Respect for the Integrity of the Person
  • Worker Rights
  • Freedom to Participate in the Political Process

Each sentence was labeled with the section in which it occurred, and these labeled sentences formed the training data. Following some bootstrap resampling to address class imbalance, the data were cleaned, vectorized, and used to train a Support Vector Machine. This then served as the classification algorithm for news stories parsed from various RSS feeds.
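A condensed sketch of that training step is below, assuming the labeled sentences already sit in a pandas DataFrame with `text` and `section` columns. The column names, the resampling helper, and the train/test split are my own simplifications, not the hackathon code verbatim.

```python
# Sketch of the classification step: balance classes by bootstrap resampling,
# then fit a TF-IDF + linear SVM pipeline on section-labeled sentences.
# Column names and split details are assumptions made for this example.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.utils import resample

def balance_classes(df: pd.DataFrame, label_col: str = "section") -> pd.DataFrame:
    """Bootstrap-resample each class up to the size of the largest class."""
    target = df[label_col].value_counts().max()
    groups = [
        resample(group, replace=True, n_samples=target, random_state=0)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(groups, ignore_index=True)

def train_section_classifier(df: pd.DataFrame):
    """Train on a DataFrame with one sentence per row: columns 'text' and 'section'."""
    df = balance_classes(df)
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["section"], test_size=0.2, random_state=0
    )
    model = make_pipeline(
        TfidfVectorizer(stop_words="english", lowercase=True),  # clean and vectorize
        LinearSVC(),                                            # the Support Vector Machine
    )
    model.fit(X_train, y_train)
    print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
    return model
```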

Concurrently, I used the Natural Language Toolkit (NLTK) to carry out entity detection on each news story, specifically searching for place names. Where a place name was found, it was passed to the Google Maps API in order to geolocate the story in question. The results were stored in a pandas DataFrame and visualized using Tableau.
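The place-name step looked roughly like the sketch below. The NLTK chunking calls are standard; the `googlemaps` client library and the placeholder API key are my assumptions about the setup rather than the exact code that was run.

```python
# Sketch of entity detection and geocoding for a single news story.
# Requires one-time NLTK downloads: "punkt", "averaged_perceptron_tagger",
# "maxent_ne_chunker", "words". The googlemaps client is an assumption.
import nltk
import googlemaps

def find_place_names(text: str) -> list[str]:
    """Return GPE (geo-political entity) names found by NLTK's NE chunker."""
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    return [
        " ".join(word for word, _ in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "GPE"
    ]

def geolocate(place: str, gmaps: googlemaps.Client):
    """Geocode a place name; return (lat, lon) or None if nothing was found."""
    results = gmaps.geocode(place)
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# Usage (the API key is a placeholder):
# gmaps = googlemaps.Client(key="YOUR_API_KEY")
# places = find_place_names(story_text)
# coords = geolocate(places[0], gmaps) if places else None
```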

Outline of the process for event coding RSS newsfeeds

The outcome was a basic version of the existing approaches I just surveyed, but with a specific focus on human rights violations reported in news feeds. In training, the model achieved 76% accuracy across the six categories; when applied to RSS feeds, this dropped to 67%.

Demo of MVP for human rights events in RSS feeds

Clearly this is a much simpler outcome than the approach taken by our team. But is it any better? Would our time have been better spent refining this “first step” of the product, or were we right to explore a much more ambitious scope of functionality? To answer this question, it’s helpful to mobilize a second product development methodology…

The Riskiest Assumption approach

An MVP seduces with false reassurances of a clear, linear path to an optimized solution. A Riskiest Assumption Test puts the focus on learning…²

When looked at from this perspective, the funnel outlined in the MVP section can instead be thought of as a set of increasingly risky assumptions about what can and can’t be done, particularly in the time frame allotted.

Taking a Risky Assumption perspective means explicitly acknowledging the scale and risk of the challenge

Once again, it really pays to do some market research on what’s out there when assessing what is and isn’t possible. It turns out that predicting outbreaks of state-sponsored crimes, particularly violence, has been attempted by several groups. For example, the Early Warning Project bills itself as a first-of-its-kind public system, designed to “spotlight countries where mass atrocities have not begun, but where the risk for such violence is high.”³

The Early Warning Project attempts to highlight countries ‘at risk’ of mass killings

The first thing that stands out to me is the care with which the project’s spokesperson emphasizes that it is not a predictive model. Here’s Jill Savitt, a director at the U.S. Holocaust Memorial Museum, which hosts the project:

“We’re not forecasting with precision. That’s not the intention of the tool,” Savitt says. “What we’re doing is trying to alert policymakers that here’s a situation that is ripe for horrors to happen and give them a heads up that there are actions that can be taken to avert it.”³

Similar models appear to be operating in the slightly murkier waters of the private defense and security sector, and their architects are notably less shy about their predictive performance. For example, Lockheed Martin promotes its Integrated Crisis Early Warning System (ICEWS) as having 80% accuracy in predicting crises across the world. Details of what these “crises” actually are, or of how accuracy was measured, aren’t given…

So it appears we do indeed have some risky assumptions present in our product: most notably that we can detect any correlation between human rights violations and socio-economic indicators, and, riskier still, that we can predict changes in human rights performance over time.

Exploring risky assumptions and the concept of code spikes

Our team spent a great deal of time exploring different approaches to moving from description to prediction with our data, and in the process eliminated quite a few approaches that didn’t work. This is essentially what the risky assumption approach is about: moving directly to test the core assumptions of your project, rather than starting with the easiest or minimum elements only to hit a brick wall later. As one description puts it, the key to the risky assumptions approach is:

…rapid, small tests. What’s the smallest experiment you can do to test your biggest assumption?²

When these tests take the form of writing skeletal code to see if an idea ‘has legs’, as was the case with us, they can also be described as spikes. Briefly, a spike is a term that emerged from the Extreme Programming (XP) school of product development. It’s characterized as an open-ended, but ideally brief and focused, coding exercise designed to test an assumption. Put another way:

‘What is the simplest thing we can program that will convince us we are on the right track?’⁴

Looking back at the way our team approached this challenge, the model of identifying risky assumptions, combined with using spikes to test them, was the intuitive approach we used. Our team split into four broad workstreams, each focused on different assumptions and spike tasks:

Risky Assumptions can be tested using Spikes

The outcomes were mixed. We found a good way to translate structured text into ‘scores’ for each human rights category, we found a way to cluster countries into different types of human rights profiles, and we at least made a start on exploring the role of socio-economic variables in shaping those clusters. Prediction, somewhat predictably, wasn’t possible.
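For illustration only, clustering countries on per-category scores can be sketched as below; KMeans and the score-matrix layout are my assumptions for this example, not necessarily the method the team settled on.

```python
# Illustrative sketch: cluster countries on per-category human rights scores.
# KMeans and the matrix layout (rows = countries, columns = categories) are
# assumptions for this example, not the team's exact method.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_countries(scores: pd.DataFrame, n_clusters: int = 4) -> pd.Series:
    """Assign each country (one row per country) to a human rights profile cluster."""
    scaled = StandardScaler().fit_transform(scores)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(scaled)
    return pd.Series(labels, index=scores.index, name="cluster")
```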

Having used both an MVP and a riskiest assumptions approach to map out how a product is developed, we can now take a look at the outcomes of each and reflect on their strengths and weaknesses.

Short-term versus long-term risk

One of the takeaways for me has been that an MVP approach mitigates short-term risk, while tackling the riskiest assumptions takes the long view. With an MVP, you’re more likely to get something out the door that meets the most minimal user requirements. However, your roadmap might fall off a cliff because what you’re actually trying to do (predict human rights abuses) might not be possible in the end.

Managing velocity

This is an especially important issue for a hackathon, where a product outcome is expected in a matter of weeks. More often than not, an MVP will be composed of user stories where complexity and ambiguity are low(er), and as such you have a better chance of mapping out how long it’s going to take. Tackling your riskiest assumption in a lean approach is by its nature more exploratory, so you will need to be far more aggressive about allotting a strict time allowance to each code spike.

Feature bloat

An MVP approach helps you to focus on a single feature and nail it. Because the hackathon was a competition, we felt compelled to include the outcomes of each code spike on our riskiest assumptions. As a result, we ended up with a large number of features, which showed off how much work we’d done but overwhelmed the Microsoft judges, and potentially our future users.

A huge thanks to all my teammates who contributed to our success. Please do check out their profiles:

Priscilla Addison

Tyler Blair

Kyle Chezic

Colin Dietrich

Stephanie Lee

Marie Salmi

  1. https://news.microsoft.com/features/technology-helps-un-advance-protection-human-rights-new-ways/
  2. https://hackernoon.com/the-mvp-is-dead-long-live-the-rat-233d5d16ab02
  3. https://www.npr.org/sections/goatsandsoda/2018/12/20/675582639/is-genocide-predictable-researchers-say-absolutely
  4. http://agiledictionary.com/209/spike/