Measuring Machine Learning

From desktop, to single board computer, to micro-controller

Alasdair Allan
Talking at Crowd Supply’s Teardown conference in Portland, OR, on Saturday the 22nd of June 2019.

A transcript of a talk I gave at Crowd Supply’s Teardown conference in Portland, OR, in June 2019. While the video is of the talk as given, the transcript has been expanded with details of some events that have happened since it was presented.

Machine learning is traditionally associated with heavy-duty, power-hungry processors. It’s something done on big servers. Even if the sensors, cameras, and microphones taking the data are themselves local, the compute that controls them is far away, and the processes that make decisions are all hosted in the cloud. But this is changing: things are moving towards the edge.

Now, for anyone that’s been around a while, this isn’t going to be a surprise. Throughout the history of the industry, depending on the state of technology, we seem to oscillate between thin- and thick-client architectures. Either the bulk of our compute power and storage is hidden away in racks of sometimes distant servers, or it’s in a mass of distributed systems much closer to home. We’re on the swing back towards distributed systems once again, or at least a hybrid between the two. No surprises there, and machine learning has a rather nice split that can be made between development and deployment.

Initially an algorithm is trained on a large set of sample data, which is generally going to need a fast, powerful machine or cluster. But then that trained network is deployed into an application that needs to interpret real data in real time, and that’s an easy fit for lower powered distributed systems. Sure enough this deployment, or “inference,” stage is where we’re seeing the shift to local processing, or edge computing if you want to use the latest buzzword, right about now.

Which is sort of a good thing. Recently researchers at the University of Massachusetts, Amherst, performed a life cycle assessment for training several common large AI models. They found that the process can emit the equivalent of more than 626 thousand pounds of CO2 — nearly five times the lifetime emissions of the average American car.

Source: Strubell et al. (📊: MIT Technology)

Now I’ve been hearing about this study a lot, and I’ve got some problems with it, and how it looks at machine learning. Firstly the sort of machine learning it looks at is natural language processing (NLP) models, that’s a small segment of what’s going on in the community.

But also it’s based on their own academic work, their last paper, where they found that the process of building and testing the final paper-worthy model required training 4,789 models over a six-month period. That’s just not been my experience of how, out in the real world, you train and build a model for a task. The analysis is fine as far as it goes, but it ignores some things about how models are used, about those two stages, development and deployment.

Because using a trained model doesn’t take anything like the resources required to train it in the first place, and just like software, once trained a model isn’t a physical thing. It’s not an object.

One person using it doesn’t stop someone else using it.

You have to split that sunk cost down amongst everyone or every object that uses it — potentially thousands or even millions of instances. It’s okay to invest a lot into something that’s going to be used a lot. It also sort of ignores the facts about how long those models might hang around.

The first job I ever had as an adult, fresh out of University was at a now defunct defence contractor. There, amongst other things, I built neural network software for video compression. To be clear, this was during the first, well maybe second, time machine learning was trendy, back in the early nineties when machine learning was still called neural networks.

The compression software I built around the neural network leaves rather specific artefacts in the video stream, and every so often I still see those artefacts in video today, in products by a certain large manufacturer who presumably picked up the intellectual property of the defence contractor for a bargain price after it went bankrupt.

Those networks, presumably now buried at the bottom of a software stack wrapped inside a black box with “here be magic” written on the outside — the documentation I left behind was probably that bad — are therefore still around, something like 25 to 30 years later.

Which makes accessibility to pre-trained models, and what have become colloquially known as ‘model zoos’, rather important. Because while you might draw an analogy between a trained model and a binary, and between the data set the model was trained on and source code, it turns out that the data isn’t as useful to you (or at least to most people) as the trained model.

Because let’s be real for a moment. The secret behind the recent successes of machine learning isn’t the algorithms; this stuff has been lurking in the background for decades waiting for computing to catch up. Instead, the success of machine learning has relied heavily on the corpus of training data that companies like Google have managed to build up.

For the most part these training datasets are the secret sauce, and closely held by the companies, and people, that have them. But those datasets have also grown so large that most people, even if they had them, couldn’t store them, or train a new model based on them.

So unlike software, where we want source code not binaries, I’d actually argue that for machine learning the majority of us want models, not data. Most of us—developers, hardware folks—should be looking at inferencing, not training.

To be fair, I’ll now state up front that this is a fairly controversial opinion.

However, it’s the existence of pre-trained models that lets us easily and quickly build prototypes and projects on top of machine learning. Which is what people that aren’t focused on the machine learning, but just want to get things done, actually want.

A retro-rotary phone powered by AIY Projects Voice Kit and a Raspberry Pi. (📹: Alasdair Allan)

As late as last year mid range single board computers, like the Raspberry Pi, were really struggling at the limits of their capabilities to carry out fairly straightforward tasks like hot word voice detection, without talking to the cloud. However things have moved on a lot in the last year.

Because over the last year or so there has been a realisation that not everything can, or should, be done in the cloud. The arrival of hardware designed to run machine learning models at vastly increased speeds, and inside relatively low power envelopes, without needing a connection to the cloud, is starting to make edge based computing that much more of an attractive proposition for a lot of people.

The ecosystem around edge computing is actually starting to feel mature enough that real work can get done at long last. Which is where accelerator hardware like the Coral Dev Board from Google comes in. These are leading indicators.

The Coral Dev Board from Google. (📷: Alasdair Allan)

Underneath the ludicrously sized heat sink is something called the Edge TPU. It’s part of a tidal wave of custom silicon that we’ve seen released to market over the last six months or so. Intended to speed up machine learning inferencing on the edge, no cloud needed. No network needed. Take the data. Act on the data. Throw the data away.

But that’s a whole different talk around data privacy.

The difference the new generation of custom silicon makes is dramatic, and on the market at the moment we have hardware from Google, Intel, and NVIDIA, with hardware from smaller companies coming soon, or already in production.

Some of it is designed to accelerate existing embedded hardware, like the Raspberry Pi, while some of it is designed as evaluation boards for System-on-Module (SoM) units that should be available in volume later in the year.

An edge computing hardware zoo. Here we have the Intel Neural Compute Stick 2 (left, top), a Movidius Neural Compute Stick (left, bottom), the NVIDIA Jetson Nano (middle, top), a Raspberry Pi 3, Model B+ (middle, bottom), a Coral USB Accelerator (right, top), and finally the Coral Dev Board (right, bottom).

But before we look at that custom silicon we should take a look at the Raspberry Pi. The Raspberry Pi 3, Model B+, which until very recently was the fastest Raspberry Pi you could buy, is built around a 64-bit quad-core ARM Cortex-A53 clocked at 1.4GHz. You should bear in mind the Cortex-A53 isn’t a performance core; it was designed as a mid-range core, and for efficiency.

Installing TensorFlow on the Raspberry Pi used to be a difficult process, however towards the middle of last year everything became a lot easier.

However it’s actually sort of interesting: it’s incredibly hard to find a good tutorial on how to do inferencing. A lot of the tutorials you’ll find on ‘how to get started with TensorFlow’ talk about training models, and some even just stop once you’ve trained the model. They don’t bother to use it.

I find this sort of puzzling, and presumably it speaks to the culture of the community around machine learning right now. It’s still sort of vaguely academic in nature. You see similar sorts of weirdness with cryptography: a lot of discussion of mathematics, and little about using it.

Anyway this is roughly how you do inferencing on an image when you’re using an object detection model, like MobileNet, where you’re expecting a bounding box returned.

Feeding our code a test image containing two recognisable objects, a banana and an apple, gives us reasonably shaped bounding boxes.
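The step that turns raw model output into bounding boxes can be sketched in a few lines. This is a minimal illustration rather than the code used in the talk. It assumes the model returns normalised boxes in [ymin, xmin, ymax, xmax] order alongside class IDs and scores, which is the layout used by the TensorFlow object detection models. The function name, the 0.5 threshold, and the sample numbers are all made up for the example.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Detection(NamedTuple):
    label: int
    score: float
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def to_pixel_boxes(
    boxes: Sequence[Sequence[float]],
    classes: Sequence[int],
    scores: Sequence[float],
    image_width: int,
    image_height: int,
    threshold: float = 0.5,
) -> List[Detection]:
    """Keep detections at or above `threshold` and scale boxes to pixels."""
    detections = []
    for (ymin, xmin, ymax, xmax), label, score in zip(boxes, classes, scores):
        if score < threshold:
            continue  # drop low-confidence detections
        detections.append(
            Detection(
                label=int(label),
                score=float(score),
                box=(
                    round(xmin * image_width),
                    round(ymin * image_height),
                    round(xmax * image_width),
                    round(ymax * image_height),
                ),
            )
        )
    return detections


# Hypothetical model output for a 640 x 480 image (made-up numbers and IDs).
example_boxes = [
    [0.25, 0.125, 0.75, 0.5],
    [0.5, 0.5, 1.0, 1.0],
    [0.0, 0.0, 0.125, 0.125],
]
example_classes = [52, 53, 1]
example_scores = [0.9, 0.75, 0.2]

found = to_pixel_boxes(example_boxes, example_classes, example_scores, 640, 480)
for detection in found:
    print(detection)
```

With these made-up inputs the third detection falls below the threshold and is dropped, leaving two boxes in pixel coordinates that can be drawn straight onto the image.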

Running the code gives us roughly 2 frames per second, more-or-less, when benchmarked using Google’s MobileNet models, v2 and v1. Now v1 models are a bit less processor intensive than v2, and generally return detections with a bit less confidence. I’m also using something called “Depthwise Separable Convolution” to reduce the model size and complexity there, which reduces the detection confidence a bit more, but speeds things up. Anyway, that 2 frames a second is not great. But it gives us a yard stick to look at the accelerator hardware.
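For anyone wanting to reproduce this sort of figure, a timing harness along the following lines is one way to go about it. It is a sketch under stated assumptions, not the benchmark code behind the numbers quoted here: `time_inference`, `frames_per_second`, and the sleeping stand-in for a model call are all hypothetical names introduced for illustration.

```python
import time
from statistics import mean
from typing import Callable, List


def time_inference(
    run_once: Callable[[], object], runs: int = 10, warmup: int = 1
) -> List[float]:
    """Return per-run wall-clock times in milliseconds, skipping warm-up runs."""
    for _ in range(warmup):
        run_once()  # the first run often includes one-off setup costs
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def frames_per_second(mean_ms: float) -> float:
    """Convert a mean inference time in milliseconds into frames per second."""
    return 1000.0 / mean_ms


def fake_inference() -> None:
    # Stand-in for a real model call; sleeps for roughly 5 ms.
    time.sleep(0.005)


timings = time_inference(fake_inference, runs=5)
average_ms = mean(timings)
print(f"mean {average_ms:.1f} ms, {frames_per_second(average_ms):.1f} fps")

# An inference time of 500 ms per image corresponds to 2 frames per second.
print(frames_per_second(500.0))
```

Discarding the first run matters on this class of hardware, since loading the model and allocating buffers can take far longer than any subsequent inference.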

Now Intel were first to market with custom silicon intended to speed up machine learning. They were actually way ahead of everyone else as they bought a startup called Movidius, and then rebranded their silicon all the way back in 2016. Adoption has sort of been slow, but the custom silicon has shown up in a bunch of places, and most of the boards, cards, sticks, and other widgets you see advertising themselves as machine learning accelerators are actually based around it.

We’re going to take a look at Intel’s own product, called the Neural Compute Stick. There have actually been two generations of Intel hardware spun up around two generations of the Movidius chip.

I have both on my desk as, unsurprisingly, I’m an early adopter.

The Intel Neural Compute Stick 2. (📷: Alasdair Allan)

Now this is where things start to get a little hairy. Because unfortunately you can’t just use TensorFlow natively with Intel’s hardware. You have to use their OpenVINO framework, and of course that means you can’t just use your TensorFlow model off the shelf.

Fortunately you can convert TensorFlow models to OpenVINO’s IR format. That’s good, because if you’re trying to compare timings you want to keep everything more-or-less the same, which means I really need to use the same model here as everywhere else. However this turns out to be a sticking point, as the software we need to convert TensorFlow models isn’t included as part of the cut down version of the OpenVINO toolkit installed onto the Raspberry Pi.

Which means we need an actual x86 machine running Ubuntu Linux with OpenVINO installed. Fortunately, we don’t need to have a Neural Compute Stick attached. We just need to have a full OpenVINO installation, and we can do that in the cloud. So the easiest way to do this is to spin up an instance on a cloud provider like Digital Ocean, install the OpenVINO toolkit, and then run the model optimiser, the piece of software that can convert our TensorFlow model to Intel’s OpenVINO IR format, on the cloud instance.

Unfortunately it turns out converting models from TensorFlow to OpenVINO is a bit of a black art, and the instructions don’t really cover how to convert anything except the most basic of models. It’s not formulaic. The best, and as far as I can see the only, place to get help on this topic is the Computer Vision forum in the Intel Developer Zone. The whole thing is intensely frustrating and requires a moderately deep understanding of the details of the model you’re trying to convert.

But once you’ve finally converted your model you can throw your image against it. The code is slightly different, but only really in the details, the essentials are very much the same.

Here we’re getting much better performance, roughly 10 frames per second. So by offloading your inferencing onto Intel’s Movidius chip we’re seeing a ×5 improvement. Although you should bear in mind we’re not being entirely fair to the Neural Compute Stick here, the Raspberry Pi 3 only has USB 2, and the Neural Compute Stick is a USB 3 device. There’s going to be throttling issues, so you’re not seeing the full speed advantage you could be seeing.

The NVIDIA Jetson Nano. (📷: Alasdair Allan)

Next is the NVIDIA Jetson Nano. It’s built around a 64-bit quad-core Arm Cortex-A57 CPU running at 1.43GHz alongside an NVIDIA Maxwell GPU with 128 CUDA cores. It’s a pretty weighty piece of hardware, and it really needs the ludicrously sized heatsink.

Now in theory we can just throw our TensorFlow model onto the NVIDIA hardware, but it turns out that while it works, everything runs really slowly. Stupidly slowly. Looking at the timings I’m sort of unconvinced whether ‘native’ TensorFlow actually gets offloaded to the GPU at all. If you want things to run quickly you need to optimise your TensorFlow model using NVIDIA’s TensorRT framework, and predictably that’s stupidly hard. Although not actually as opaque as trying to use Intel’s OpenVINO toolkit.

TensorFlow (on the left, dark blue bars) and TensorRT models (on the right, the light blue bars).

However after optimising your models using TensorRT, things go a lot faster, and the Jetson Nano has an inferencing performance around 15 frames per second.

The Coral Dev Board from Google. (📷: Alasdair Allan)

Back to the Coral Dev Board. The board is built around an ARM quad-core Cortex-A53, with a removable System-on-Module carrying Google’s Edge TPU. That’s their accelerator hardware that does all the work. The Dev Board is essentially a demonstrator board for the Edge TPU. But, unlike Intel and the Movidius, it doesn’t look like Google is going to be willing to sell just the silicon. If you want to build a product around the Edge TPU you’ll have to buy it on the SoM, which should be available later this year on its own in quantity.

The Coral USB Accelerator. (📷: Alasdair Allan)

However you can also pick up the Edge TPU in a Neural Compute Stick-like form factor, although Google’s USB Accelerator stick has a USB-C connector.

Predictably of course you can’t just use your off the shelf TensorFlow model. The Coral hardware is expecting TensorFlow Lite models that have further been compiled to run on the Edge TPU.

Now this is the first time quantisation has come up. TensorFlow Lite is intended to run specially optimised (quantised) models on mobile and embedded hardware. Quantising of neural networks uses techniques that allow for reduced precision representations of weights and, optionally, activations for both storage and computation.

Essentially we’re using 8-bits to represent our tensors rather than 32-bit numbers. That makes things easier on low end hardware, but it also makes things a lot easier to optimise in hardware, hence the Edge TPU.
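To make the idea concrete, here is a minimal sketch of 8-bit affine quantisation, where each real value is approximated as (q - zero_point) * scale for an integer q between 0 and 255. This is the general form of the scheme TensorFlow Lite uses, but the sketch leaves out a great deal of the real thing, such as per-channel scales and quantisation of activations. The function names and the sample weights are made up for the example.

```python
from typing import List, Tuple


def quantisation_params(values: List[float]) -> Tuple[float, int]:
    """Choose a scale and zero point that map the value range onto 0..255."""
    low = min(min(values), 0.0)   # the representable range must include zero
    high = max(max(values), 0.0)
    scale = (high - low) / 255.0
    if scale == 0.0:              # all-zero input: any non-zero scale will do
        scale = 1.0
    zero_point = round(-low / scale)
    return scale, zero_point


def quantise(values: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to 8-bit integers, clamping to the 0..255 range."""
    return [min(255, max(0, round(v / scale) + zero_point)) for v in values]


def dequantise(quantised: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floats from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantised]


# Made-up weights, chosen so the scale comes out at roughly 0.01.
weights = [-1.0, -0.5, 0.0, 0.123, 1.55]
scale, zero_point = quantisation_params(weights)
quantised = quantise(weights, scale, zero_point)
restored = dequantise(quantised, scale, zero_point)

print(quantised)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst rounding error
```

Each weight now takes one byte rather than four, and the worst-case rounding error for values inside the range is half the scale, which is the trade being made for smaller models and simpler arithmetic.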

Once you have converted your TensorFlow model to TensorFlow Lite, which is about as painful as you’d expect and only works for models that have been trained in a ‘quantisation aware’ fashion, you have to throw the model at the Edge TPU compiler. This used to be web-only, but there’s an offline version now as well.

On the plus side, once you have your model in TensorFlow Lite format the code to use the inference engine is incredibly simple. Things also run a lot faster, we’re looking here at between 50 and 60 frames per second.

So where are we? Well, it looks like this.

Google’s Edge TPU beats out all comers, even when I throttle it by connecting the USB Accelerator via USB 2 on the Raspberry Pi rather than using a full USB 3 connection. I’d expect it to perform more-or-less on par with the Dev Board when connected to USB 3.

Unsurprisingly the Jetson Nano is in second place, with both generations of Intel hardware at the back of the pack. While they had first mover advantage, that also means the age of the hardware is starting to show.

Inferencing speeds in milli-seconds for MobileNet SSD V1 (orange) and MobileNet SSD V2 (red) across all tested platforms. Low numbers are good!
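As a rough cross-check, the rounded frame rates quoted earlier in the talk can be converted into the milliseconds-per-inference scale used in the chart, along with a speedup relative to the unaccelerated Raspberry Pi. The numbers below are only those approximate spoken figures, not the measured benchmark data behind the chart, and the Edge TPU entry uses the lower end of the quoted 50 to 60 frames per second. The dictionary keys and the `summarise` helper are names invented for this sketch.

```python
from typing import Dict, List, Tuple

# Approximate frame rates quoted in the talk, in frames per second.
quoted_fps: Dict[str, float] = {
    "Raspberry Pi 3 B+ (TensorFlow)": 2.0,
    "Intel Neural Compute Stick": 10.0,
    "NVIDIA Jetson Nano (TensorRT)": 15.0,
    "Google Edge TPU (lower bound)": 50.0,
}


def summarise(
    fps_by_platform: Dict[str, float], baseline_fps: float
) -> List[Tuple[str, float, float]]:
    """Return (name, ms per inference, speedup vs baseline), slowest first."""
    rows = []
    for name, fps in sorted(fps_by_platform.items(), key=lambda item: item[1]):
        rows.append((name, 1000.0 / fps, fps / baseline_fps))
    return rows


baseline = quoted_fps["Raspberry Pi 3 B+ (TensorFlow)"]
rows = summarise(quoted_fps, baseline)
for name, ms, speedup in rows:
    print(f"{name:32s} {ms:7.1f} ms  x{speedup:.1f}")
```

Working in milliseconds rather than frames per second is what makes low numbers good in the chart, and it keeps the comparison additive when other per-frame costs need to be included.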

So, the Edge TPU hardware wins?

Not so fast. One big advantage that the Coral hardware has is quantisation. What happens if we use TensorFlow Lite on our other platforms? Well, it doesn’t work at all on the Intel hardware, which is only supported by OpenVINO.

However while it’s still extremely early days, TensorFlow Lite has recently introduced support for GPU acceleration for inferencing. Running models using TensorFlow Lite with GPU support should reduce the time needed for inferencing on the Jetson Nano. This leaves open the possibility that the gap between the NVIDIA and Google platforms might shrink in the future. Last I heard, about a week ago, they’re pursuing that hard.

But what we can do, is look again at the Raspberry Pi.

Unfortunately the official TensorFlow wheel maintained by Google doesn’t have TensorFlow Lite included, I really don’t know why. But fortunately there is a community maintained wheel which does.

Code using TensorFlow Lite is somewhat different than TensorFlow, and gets a little bit further down into the underlying mess than its big brother. But it looks sort of the same.

We see an approximately ×2 increase in inferencing speed between the original TensorFlow figures and the new results using TensorFlow Lite.

Yellow bars on the left are the TensorFlow Lite results and red bars on the right are our original TensorFlow results. There doesn’t seem to be any effect on confidence of object detections, at all.

Which sort of makes you wonder whether there might be something to quantisation. Because it really doesn’t look like you need more accuracy.

Just last month a startup called Xnor.ai finally released their AI2GO platform to public beta. They’ve been in closed testing, but I’d been hearing rumours about them for a while. What they’re doing isn’t TensorFlow, not even close. It’s a new generation of binary weight models. There are some technical white papers, and I’m currently wading my way through them.

But just testing things out was easy. You configure a model ‘bundle’ online, and then download and install it as a Python wheel.

Inferencing is as simple as this: an image goes in, and a list of detected objects and associated bounding boxes comes out.

However feeding the AI2GO our test image containing two recognisable objects, a banana and an apple, does give us somewhat odd bounding boxes compared with the bounding boxes we’re used to from TensorFlow.

Which is sort of different. Not wrong. Not crazy. But definitely different.

But putting that to one side, it’s really rather fast, yet another factor of ×2 faster than TensorFlow Lite, which was ×2 faster than TensorFlow.

Comparing this to our original results, that makes the Raspberry Pi 3, Model B+, competitive with pretty much everything else except the Edge TPU, which of course is also using quantised models.

Which makes you wonder whether we’ve gone ahead and started optimising in hardware just a little too soon. If we can get that much leverage out of software, then perhaps we need to wait till the software in the embedded space has matured enough so that we know what to optimise for? It also makes Microsoft’s decision to stick with FPGA for now, rather than rolling their own custom ASIC like everybody else seems to be, look a lot more sensible.

Just something to ponder there…

The new Raspberry Pi 4, Model B. (📷: Alasdair Allan)

It also makes the arrival of the new Raspberry Pi 4, Model B, which was released just recently, all the more interesting. Because while we can’t run TensorFlow Lite quite yet, we can get both TensorFlow and the AI2GO framework working on the new board.

Inferencing time in milli-seconds for the Raspberry Pi 3 (blue, left) and Raspberry Pi 4 (green, right).

With roughly twice the NEON capacity of the Raspberry Pi 3, we would expect this order of speedup in performance for well-written NEON kernels. As expected we see an approximate ×2 increase in inferencing speed between the original TensorFlow benchmarks and the new results from the Raspberry Pi 4, along with a similar increase in inferencing speed using the Xnor AI2GO platform.

However we see a much bigger change when looking at the results from the Coral USB Accelerator from Google. The addition of USB 3.0 to the Raspberry Pi 4 means we see an approximate ×3 increase in inferencing speed between our original results and the new results.

Conversely the inference times for the Coral USB Accelerator when it was connected via USB 2, rather than the new USB 3 bus, actually increased by a factor of ×2. This somewhat surprising result is most likely due to the architectural changes made to the new Raspberry Pi. With the XHCI host now at the far end of the PCI Express bus, there’s potentially much more latency in the system. Depending on the traffic pattern you could imagine that blocking, as opposed to streaming, use of the channel could well be slower.

The performance increase seen with the new Raspberry Pi 4 makes it a very competitive platform for machine learning inferencing at the edge, and it performs rather well when compared to all that custom silicon.

But of course in parallel to the arrival of accelerator hardware we’ve seen the arrival of machine learning on much, much, lower powered hardware.

Micro-controllers, not micro-processors. The custom silicon I’ve been talking about so far is actually the high end of the embedded hardware stack.

Officially announced at the TensorFlow Dev Summit earlier in the year is TensorFlow Lite for Micro-controllers. This is a distribution of TensorFlow specifically intended for bare metal systems, and the core library fits inside just 16KB. To be absolutely clear, while the accelerator hardware is great fun to play around with, and it sure is fast, I actually sort of think that this is the future of edge computing.

It’s really early days, but I’m starting to think that the biggest growth area in machine learning practice over the next year or two could well be around inferencing, rather than training.

The OpenMV Cam H7 with an IR camera running blob tracking during ARM Dev Day. (📷: Alasdair Allan)

There are a lot of cameras out in the world. It’s probably the best sensor we have, and adding machine learning makes those sensors better. TensorFlow Lite running on micro-controllers makes that, if not trivial, then at least easily doable inside the power and processing envelope already available in those cameras.

Whether you think that’s a good idea or not is another matter.

The SparkFun Edge is the board that got spun up to act as the demonstrator board for TensorFlow Lite for Micro-controllers. It’s built around Ambiq Micro’s latest Apollo 3 micro-controller. It’s an ARM Cortex-M4F running at 48MHz with 96MHz burst mode operation, and has built in Bluetooth.

It uses somewhere between 6 and 10 μA per MHz. So around 0.3 to 0.5 mA running flat out and it draws just 1 μA in deep sleep mode with Bluetooth turned off. That’s insanely low power, the Raspberry Pi draws around 400 mA, and for comparison the ESP32 draws between 20 and 120 mA. Probably the closest comparison, the Nordic nRF52840 draws around 17mA. The chip at the heart of this board runs flat out within a power budget less than many micro-controllers draw in deep sleep mode, and it runs TensorFlow Lite.

The TensorFlow Lite for Micro-controllers “Yes/No” demo.

Real-time machine learning on a micro-controller board powered by a single coin cell battery that should last for months, even years. No cloud needed, no network needed, no private personal information leaves the board.
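The arithmetic behind those power figures is easy to check. The sketch below takes the 6 to 10 μA per MHz, 48MHz, and 1 μA deep sleep numbers from the talk, and then adds two assumptions of its own that are not from the talk: a nominal coin cell capacity of 225 mAh, typical of a CR2032, and a handful of hypothetical duty cycles. It ignores Bluetooth, the microphone, regulator losses, and battery self-discharge, so treat the results as idealised upper bounds.

```python
def active_current_ma(ua_per_mhz: float, clock_mhz: float) -> float:
    """Active current in mA from a microamps-per-megahertz figure."""
    return ua_per_mhz * clock_mhz / 1000.0


def average_current_ma(active_ma: float, sleep_ma: float, duty_cycle: float) -> float:
    """Average current when the chip is awake for `duty_cycle` of the time."""
    return duty_cycle * active_ma + (1.0 - duty_cycle) * sleep_ma


def battery_life_days(capacity_mah: float, current_ma: float) -> float:
    """Idealised runtime in days, ignoring self-discharge and other loads."""
    return capacity_mah / current_ma / 24.0


CLOCK_MHZ = 48.0        # from the talk
SLEEP_MA = 0.001        # 1 uA deep sleep, from the talk
COIN_CELL_MAH = 225.0   # assumption: typical CR2032 nominal capacity

best = active_current_ma(6.0, CLOCK_MHZ)
worst = active_current_ma(10.0, CLOCK_MHZ)
print(f"flat out: {best:.3f} to {worst:.3f} mA")

for duty in (1.0, 0.1, 0.01):  # hypothetical fractions of time spent awake
    current = average_current_ma(worst, SLEEP_MA, duty)
    days = battery_life_days(COIN_CELL_MAH, current)
    print(f"duty {duty:>4}: {days:7.0f} days")
```

Under these assumptions the chip running flat out all the time would last a matter of weeks on a coin cell, so figures of months or years depend on it spending most of its time in deep sleep and waking only briefly.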

At least on the open market, right now this is machine learning at the absolute limit of what our current hardware is capable of. It didn’t get any cheaper or lower powered than this, at least until recently.

The SparkFun Artemis.

This is the SparkFun Artemis. The same Ambiq Apollo 3 chip, in a 10 × 15 mm module, that should make it through FCC/CE approval sometime next month, and if everything goes well, be available in tape-and-reel quantities soon after that.

It’s fully Arduino compatible as SparkFun have put together their own in-house Arduino core built on top of Ambiq’s Hardware Abstraction Layer. You can now use this insanely low powered chip from the Arduino development environment, and then drop down into the HAL from your Arduino code if you need to get a bit low level.

The “official” Google port of TensorFlow Lite for Micro-controllers.

Of course it was only a matter of time before someone took the TensorFlow demo and ported it, along with TensorFlow Lite for Micro-controllers to the Arduino development environment. Turns out it’s Adafruit that got there first.

Making TensorFlow Lite for Micro-controllers available from within the Arduino environment is a big deal, and like the availability of more pre-trained models, will be a huge change in the accessibility of machine learning in the emerging edge computing market. Arguably perhaps one of the major factors that drove the success of the Espressif ESP8266 was the arrival of Arduino compatibility.

It will be fascinating to see if the same will happen with machine learning.

Links to Previous Benchmarks

If you’re interested in the details behind the previous benchmarks, they’re linked below.

Benchmarking Edge Computing

Comparing Google, Intel, and NVIDIA accelerator hardware

Benchmarking TensorFlow and TensorFlow Lite on the Raspberry Pi

I recently sat down to benchmark the new accelerator hardware that is now appearing on the market intended to speed up…

Benchmarking the Xnor AI2GO Platform on the Raspberry Pi

I recently sat down to benchmark the new accelerator hardware that is now appearing on the market intended to speed up…

Benchmarking Machine Learning on the New Raspberry Pi 4, Model B

How much faster is the new Raspberry Pi? It’s a lot faster.

Links to Getting Started Guides

If you’re interested in getting started with any of the accelerator hardware I used during my benchmarks, I’ve put together getting started guides for the Google, Intel, and NVIDIA hardware I looked at during the analysis.

Hands on with the Coral Dev Board

Getting started with Google’s new Edge TPU hardware

How to use a Raspberry Pi to flash new firmware onto the Coral Dev Board

Getting started with Google’s new Edge TPU hardware

Hands on with the Coral USB Accelerator

Getting started with Google’s new Edge TPU hardware

Getting Started with the Intel Neural Compute Stick 2 and the Raspberry Pi

Getting started with Intel’s Movidius hardware

Getting Started with the NVIDIA Jetson Nano Developer Kit

Getting started with NVIDIA’s GPU-based hardware

