Practical Debugging for Data Science

A primer on debugging your machine learning system.

Manu Joseph

Now before writing about this topic, I did a quick Google Search to see how much of this is already covered and quickly observed a phenomenon that I see increasingly in the field — Data Science = Modelling, at best, Modelling + Data Processing. Open a MOOC, they talk about the different models and architectures, go to a bootcamp, they will make you write code to fit and train a machine learning model. While I understand why the MOOCs and bootcamps take this route (because these machine learning models are at the heart of data science), they sure have made it seem like machine learning models are the only thing in Data Science. But Data Science in practice is radically different. There are no curated datasets or crisply formatted notebooks, only a deluge of unorganized, unclean data, and complex processes. And to effectively practice Data Science there, you need to be a good programmer. Period.

  1. Problem and Model formulation

In Dijkstra’s classic paper “On the Cruelty of Really Teaching Computing Science”, he argues the case for calling bugs errors, because it puts the blame squarely where it should reside — with the programmer, and not with a gremlin who creeps up while we are sleeping and deletes a line or an indentation. This change in vocabulary has a profound impact on how you approach a problem in your code. Before, the program was “almost correct”, with some unexplained bugs which the hero programmer would find and fix. But once you start calling them errors, the program is simply wrong, and the programmer who made the error should find a way to correct it — and themselves — in the process. It goes from a “me-against-the-world” action movie to a thoughtful, introspective drama about a protagonist who undergoes a profound change in character over the course of the movie.

In one of his lectures, Jeremy Howard mentioned something profound, and it derives directly from mindset #1.

  1. Read the error message and understand what may have gone wrong.
  2. Find the line in your script which threw the error, and think through the possibilities that can raise this particular error.
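The two steps above can be sketched with a toy example. The code below deliberately passes a string where a list of numbers is expected, then captures the traceback; the function name and error message here are invented purely for illustration. Read the traceback bottom-up: the last line names the exception and its message, and the frame just above it points at the exact statement that failed.

```python
import traceback

def scale(values, factor):
    # A common slip: `values` arrives as a string instead of a list of numbers.
    return [v * factor for v in values]

try:
    scale("1,2,3", 2.0)  # wrong input type
except TypeError:
    tb = traceback.format_exc()
    # The last line of the traceback names the exception and message;
    # the frame above it points at the line that actually raised it.
    print(tb.strip().splitlines()[-1])
```

Once you know the offending line, you can reason backwards about which of its inputs could plausibly produce that particular exception.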

You should consider yourself lucky if the error in your code throws an exception and a helpful traceback. But many times, the error is not so superficial. It either does not throw an error at all, or manifests itself in a totally different form and raises an unexplained exception. Such errors are the hardest to debug.

  1. Split the code in two. You can either comment out half the code or put a logger midway through to check the dtype.
  2. Find the offending half. If the dtype is what you expected at the end of the first block, the second block is the culprit, and vice versa.
  3. Pick the offending half and repeat steps 1–2 until you’ve zeroed in on the line where it all goes wrong.
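Here is a minimal sketch of that bisection process, assuming a hypothetical two-stage pipeline in which the dtype silently goes wrong. A logging checkpoint at the split point shows the dtype is still a string after the second stage, so the error must live in `transform` (both function names are invented for this example):

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("bisect-debug")

def load(raw):
    # First half of the pipeline: strip whitespace from raw strings.
    return [r.strip() for r in raw]

def transform(parsed):
    # Second half: the error is here; we forgot to cast with float(p).
    return [p for p in parsed]

raw = [" 1.5 ", " 2.5 "]
parsed = load(raw)
# Checkpoint midway: log the dtype at the split point.
log.debug("after load: %s", type(parsed[0]).__name__)       # str, as expected here
result = transform(parsed)
log.debug("after transform: %s", type(result[0]).__name__)  # still str: the bug is in transform
```

Because the dtype at the end of the first block matched expectations, the second block is the culprit, and you would repeat the split inside `transform` next.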

Now that I’ve made a blasphemous statement and caught your attention, let me clarify. Jupyter Notebook is a brilliant tool and I use it all the time, but only for quick prototyping. Once you’ve made substantial progress in the coding process and you have a long block of code, it becomes unwieldy.

In machine learning, an error can stem from the programming logic or from the mathematics, and it is important to be able to quickly diagnose and isolate its source so that you don’t waste all day chasing it.

Have you ever run a model and hit 90% accuracy on a difficult problem and you feel elated that you achieved such a stellar result without putting in much effort? But, a little voice in your head is nagging you, telling you this is not possible. I’m here to hand an amplifier to that voice. Listen to it. More often than not, that voice got it right.

  • When the data is temporal in nature and you end up using information from the future to fit the model, e.g. by using standard K-fold cross validation instead of a time-based split.
  • When there are duplicates in the data and they are split across both train and validation.
  • When you fit a PCA on the entire dataset, including the validation data, and use the extracted components in your model.
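A minimal sketch, using scikit-learn and synthetic data invented for illustration, of how the first and third leaks can be avoided: wrapping PCA in a `Pipeline` means it is re-fit on each training fold only, and `TimeSeriesSplit` keeps the validation data strictly after the training data, which plain K-fold does not guarantee:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# PCA inside the Pipeline is re-fit on each training fold only,
# so no information from the validation fold leaks into the components.
pipe = Pipeline([("pca", PCA(n_components=5)),
                 ("clf", LogisticRegression())])

# TimeSeriesSplit always validates on data that comes after the training data.
scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=5))
print(scores.mean())
```

If you had instead fit the PCA once on the full dataset before cross-validating, the validation scores would be optimistically biased.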

If you are working on a classification problem with a high class imbalance, it has its own pitfalls.
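One such pitfall is the suspicious 90%+ accuracy itself. The sketch below, on synthetic data invented for illustration, shows a classifier that always predicts the majority class scoring high accuracy while being completely useless; a stratified split and a metric like F1 expose this:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 5% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)

# stratify=y keeps the class ratio identical in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# A classifier that always predicts the majority class already scores ~95%
# accuracy here, so accuracy alone is a misleading metric.
majority = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
acc = accuracy_score(y_te, majority.predict(X_te))
f1 = f1_score(y_te, majority.predict(X_te))
print(acc, f1)  # high accuracy, zero F1
```

Whenever the classes are imbalanced, check a metric that accounts for the minority class before celebrating an accuracy number.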

The errors in your model will tell you exactly the story you need to make your model perform better, and the process of extracting this story from your results is Error Analysis. When I say Error Analysis, it includes two parts — an ablation study to identify the errors/benefits contributed by each component of the system, like preprocessing, dimensionality reduction, modelling, etc., and the analysis of the results and the errors in them, like Andrew Ng tells us to do.

Classifier Visualizations

1. Confusion Matrix

Regression Visualizations

1. Residual Plots
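A brief sketch of both visualizations' underlying numbers, using scikit-learn and tiny made-up arrays. In a confusion matrix, rows are the true classes and columns the predicted classes, so the off-diagonal cells are the errors worth investigating; for regression, residuals with any visible pattern against the predictions suggest structure the model failed to capture:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy classification results, invented for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Rows = true classes, columns = predicted classes;
# off-diagonal cells count the misclassifications.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Toy regression results: residuals = actual - predicted.
preds = np.array([1.1, 2.0, 2.9])
actual = np.array([1.0, 2.0, 3.0])
residuals = actual - preds
print(residuals)
```

In practice you would plot the confusion matrix as a heatmap and the residuals against the predictions (e.g. with matplotlib), but the arrays above are what both plots are drawn from.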
