Object detection and tracking in Python

[This article was first published on poissonisfish, and kindly contributed to R-bloggers.]

How are common objects identified and tracked in real-world applications? [source]
Over six months ago I decided to embark on a learning journey of image analysis using Python. After carefully reviewing various options I took a two-course offer from OpenCV.org for about US$479, chiefly because of i) the pivotal role of OpenCV as an open-source toolkit in computer vision, ii) the relevance of the course modules, and iii) the vast experience of the instructor, Satya Mallick. In the end this choice was worth every cent.
One of the topics that most fascinated me in the course of this six-month journey was object detection and tracking on video. Such was the experience that, after having written about image, text and audio data, it seemed logical to make my video analysis debut.

In this tutorial we will use OpenCV to combine a YOLOv3 detector with a tracking system to identify and track objects from 80 classes on video. To follow along with this tutorial you will need a video recording of your own. Code and further instructions are available in a dedicated repository. Lights, camera, action 🎬

Note: Apparently some browsers display the code without indentation. For better readability I recommend using Chrome or Firefox.


Computer vision is practically everywhere – summoned whenever you unlock your phone, check in at the airport or drive an autonomous vehicle. In industry, it is revolutionising fields ranging from precision agriculture to AI-assisted medical imaging. Many such applications are based on object detection, one of the key topics of this tutorial and to which we will turn our attention next.

Object detection

We have seen how convolutional neural networks (CNNs) can be used for image classification. In this setting, the CNN classifier returns a fixed number of class probabilities per input image. Object detection, on the other hand, attempts to identify and locate any number of class instances by extending CNN classification to a variable number of region proposals, such as those captured by bounding boxes.

Unlike image classification, where a single prediction is made (CAT), object detection assesses and predicts from a large number of region proposals (CAT, DOG, DUCK) [source]

Object detectors form two major groups – one-stage and two-stage detectors. One-stage detectors, such as You Only Look Once (YOLO) [1], are based on a single CNN, whereas two-stage detectors such as Faster R-CNN [2] decouple region proposal and object detection into two separate CNN modules. One-stage detectors are generally faster though less accurate than their two-stage counterparts. Let us now briefly introduce YOLO.

The YOLO detector was first developed in 2015 using the Darknet framework, and various updates have come out since. As illustrated below, YOLO leverages the CNN receptive field to divide the image into an S x S grid. For each cell in the grid,

  • it estimates the centre (x, y), size (w, h) and objectness score for each of B bounding boxes per cell (bounding boxes + confidence)
  • it emits the probabilities of all C object classes (class probability map)

For a given input image, this large search space yields a three-dimensional tensor of size S x S x (B x 5 + C). Arriving at the final detections requires filtering for high-confidence predictions, followed by non-maximum suppression (NMS), which discards any box that overlaps a higher-scoring box beyond a set threshold.
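To make the two post-processing steps concrete, here is a minimal pure-Python sketch of confidence filtering followed by greedy NMS. It is an illustration only, not the routine OpenCV uses internally; the function names `iou()` and `nms()`, the thresholds and the toy boxes are all assumptions made for this example. Boxes are given as (left, top, width, height).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (left, top, width, height)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh, overlap_thresh):
    """Return the indices of the boxes kept, highest score first."""
    # Step 1: drop low-confidence predictions and sort the rest by score
    order = sorted((i for i, s in enumerate(scores) if s > score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    # Step 2: greedily keep a box unless it overlaps an already kept box too much
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= overlap_thresh for j in keep):
            keep.append(i)
    return keep


# Made-up boxes: two overlapping boxes on one object, one elsewhere, one weak
boxes = [(10, 10, 100, 100), (20, 20, 100, 100), (300, 300, 50, 50), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = nms(boxes, scores, score_thresh=0.5, overlap_thresh=0.4)
print(kept)
```

In this toy case the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the last box never passes the confidence filter.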

The YOLO detector takes advantage of receptive fields to simultaneously identify and locate objects [source]

In this tutorial we will use YOLOv3 [3], the 2018 model update with the architecture represented below, inspired by feature pyramid networks. This particular version extends object detection to three different scales – owing to the introduction of residual blocks – each of which is responsible for predicting three bounding boxes per cell. The model takes RGB images with 416 x 416 resolution as input and returns three tensors of size S x S x 3 x (5 + C), one per detection scale, where S is one of 52, 26 or 13. Furthermore, the model is trained to minimise the error between the bounding box coordinates (regression), class probabilities (multi-label classification) and objectness scores (logistic regression) of observed and predicted boxes.

Schematic representation of the YOLOv3 architecture. Adapted from [source]

The YOLOv3 detector was originally trained with the Common Objects in Context (COCO) dataset, a large object detection, segmentation and captioning compendium released by Microsoft in 2014 [4]. The dataset features a total of 80 object classes that YOLOv3 learned to identify and locate. To give a perspective of their diversity, here is a graphical representation of a random sample.



The resulting detector enjoyed so much success that following its release, it became widely used for inference based on the COCO classes and transfer learning to solve different detection problems.

At a processing rate of ~35 FPS, one of the tasks at which this detector most excels is object detection on video. However, detection in successive frames is computationally intensive and oblivious to transitions between successive predictions, and may furthermore fail due to problems of occlusion or change in appearance. In this context, devising a framework that alternates between object detection and tracking can alleviate these issues.

Object tracking

Example of pedestrian tracking from CCTV footage [source]

Following object detection, various methods, including MIL, KCF, CSRT, GOTURN and Median Flow, can be used to carry out object tracking. For tracking of multiple objects using any such method, OpenCV supplies multi-tracker objects to carry out frame-to-frame tracking of a set of bounding boxes until further action or failure.

For the purpose of this tutorial we will use Median Flow, a simple, fast and scalable tracking method that works best provided there is little to no occlusion [5]. Under the hood, Median Flow initialises points inside a bounding box, tracks the points using the Lucas-Kanade algorithm, estimates the forward-backward tracking error, discards the 50% of points with the largest error and updates the bounding box coordinates using the median vector of the consistent trajectories. The process is then repeated over a sequence of frames. Here is an insightful, interactive visualisation of Median Flow in action.
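The box update at the heart of Median Flow can be sketched in a few lines of pure Python. This is a simplified illustration rather than OpenCV's implementation: it assumes the points have already been tracked forwards and backwards (so positions and forward-backward errors are given as inputs), it only translates the box, and it omits the scale update the full method also performs. The function name `median_flow_step()` and the toy points are made up for this example.

```python
from statistics import median


def median_flow_step(box, pts_prev, pts_next, fb_errors):
    """Shift an (x, y, w, h) box by the median displacement of reliable points.

    pts_prev / pts_next: point positions in the previous and current frame.
    fb_errors: forward-backward error of each point (lower is more reliable).
    """
    # Keep the half of the points with the smallest forward-backward error
    cutoff = median(fb_errors)
    reliable = [k for k, e in enumerate(fb_errors) if e <= cutoff]
    # Median displacement of the reliable points along each axis
    dx = median(pts_next[k][0] - pts_prev[k][0] for k in reliable)
    dy = median(pts_next[k][1] - pts_prev[k][1] for k in reliable)
    x, y, w, h = box
    return (x + dx, y + dy, w, h)


# Made-up trajectories: three points move by (+2, +1), the fourth one drifts away
pts_prev = [(0, 0), (10, 0), (0, 10), (10, 10)]
pts_next = [(2, 1), (12, 1), (2, 11), (40, -30)]
fb_errors = [0.1, 0.2, 0.3, 9.0]
new_box = median_flow_step((5, 5, 20, 20), pts_prev, pts_next, fb_errors)
print(new_box)
```

Because the drifting point has by far the largest forward-backward error, it is discarded before the median is taken and does not affect the updated box.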

Schematic representation of the Median Flow algorithm [source]

Having introduced this much, you should now be able to follow the different steps we will take next. I nonetheless highly encourage reading more about YOLO and Median Flow. If you are ready, have a coffee and get set to code ☕


Workspace setup

Prior to Python coding we need to set up a few things. After creating a MOV video recording, for example using an iPhone, move it to your working directory. With formats other than MOV you will need to make the necessary changes to the code below. Then, simply run a full workspace setup with the terminal command ./init.sh <PATH_TO_MOV>. Let us have a closer look at what this Bash script does.

First, it creates the subdirectories yolov3/, input/ and output/ which will contain the YOLOv3 dependencies, the input video and the output video, respectively.

Second, it converts your MOV video file to MP4 using FFmpeg. This conversion will generate input/input.mp4 using the following options:

  • -vcodec h264, to select the H.264 (MPEG-4 AVC) codec used for MP4 conversion
  • -vf scale=720:-2,setsar=1:1, to resize the output video to a width of 720 pixels, with the height set automatically to preserve the display aspect ratio, and to fix the sample aspect ratio at 1:1
  • -an, to discard the audio channel, since we do not need it

Third, it downloads three files that together provide all 80 COCO class labels, the network configuration and the network weights from training with the COCO dataset – these are, in respective order, coco.names, yolov3.cfg and yolov3.weights.
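For readers who prefer to stay in Python, the conversion step can be sketched as an argument list built from the options listed above. This is only an approximation of what init.sh does, since the script itself is not reproduced here; the helper name `build_ffmpeg_cmd()` and the file name my_video.mov are hypothetical.

```python
import shlex


def build_ffmpeg_cmd(src, dst="input/input.mp4", width=720):
    """Assemble the FFmpeg call described in the text as an argument list."""
    return [
        "ffmpeg",
        "-i", src,                              # input recording, e.g. a MOV file
        "-vcodec", "h264",                      # H.264 video codec
        "-vf", f"scale={width}:-2,setsar=1:1",  # fixed width, even automatic height, square pixels
        "-an",                                  # drop the audio stream
        dst,
    ]


# Hypothetical input file name
cmd = build_ffmpeg_cmd("my_video.mov")
print(shlex.join(cmd))
```

The list can then be handed to subprocess.run(cmd, check=True), provided FFmpeg is installed and the input/ directory already exists.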

Model loading and configuration

Switching to Python, we import the few modules needed and set three inference parameters in advance – the thresholds for objectness score, object class probabilities and NMS overlap. It is also advisable to seed the analysis if, for example, you set out to compare different configurations.

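#%% Imports and constants
import cv2, os
import numpy as np
import matplotlib.pyplot as plt
# Define objectness, prob and NMS thresholds
# (assumed placeholder values; tune them for your own footage)
OBJ_THRESH = 0.6
P_THRESH = 0.6
NMS_THRESH = 0.5
# Set random seed (arbitrary value)
np.random.seed(999)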
Next, we load the 80 COCO class labels and assign them each a random colour. This will enable the identification of the corresponding class probabilities in the YOLOv3 output tensors, and facilitate the distinction among different object types in the output video. Then, we load YOLOv3 by passing the configuration and weight files to cv2.dnn.readNetFromDarknet(), and extract the output layer names to more easily access predictions during inference.

#%% Load YOLOv3 COCO weights, configs and class IDs
# Import class names
with open('yolov3/coco.names', 'rt') as f:
    classes = f.read().rstrip('\n').split('\n')
# One random colour per class
colors = np.random.randint(0, 255, (len(classes), 3))
# Give the configuration and weight files for the model and load the network using them
cfg = 'yolov3/yolov3.cfg'
weights = 'yolov3/yolov3.weights'
# Load model
model = cv2.dnn.readNetFromDarknet(cfg, weights)
# Extract names from output layers
# (flatten() copes with the nested or flat index arrays returned by different OpenCV versions)
layersNames = model.getLayerNames()
outputNames = [layersNames[i - 1] for i in np.array(model.getUnconnectedOutLayers()).flatten()]

To make the process more structured we will also define and incorporate the function where_is_it(). It takes a video frame along with the corresponding YOLO output, and returns OpenCV-format bounding box coordinates, class probabilities and labels of all predictions that meet our objectness score and probability criteria – let us call these predictions high-confidence boxes. More concretely, for each of the three detection scales the function identifies high-confidence boxes and, for each of these boxes, scales the predicted coordinates to the image width and height, computes the box top-left corner position and determines the maximum class probability and corresponding index in the output tensor.

#%% Define function to extract object coordinates if successful in detection
def where_is_it(frame, outputs):
    frame_h = frame.shape[0]
    frame_w = frame.shape[1]
    bboxes, probs, class_ids = [], [], []
    for preds in outputs: # different detection scales
        hits = np.any(preds[:, 5:] > P_THRESH, axis=1) & (preds[:, 4] > OBJ_THRESH)
        # Save prob and bbox coordinates if both objectness and probability pass respective thresholds
        for i in np.where(hits)[0]:
            pred = preds[i, :]
            # Scale normalised coordinates to the frame size
            center_x = int(pred[0] * frame_w)
            center_y = int(pred[1] * frame_h)
            width = int(pred[2] * frame_w)
            height = int(pred[3] * frame_h)
            # Top-left corner, as expected by OpenCV
            left = int(center_x - width / 2)
            top = int(center_y - height / 2)
            # Maximum class probability and its index
            class_id = int(np.argmax(pred[5:]))
            prob = float(pred[5 + class_id])
            # Append all info
            bboxes.append([left, top, width, height])
            probs.append(prob)
            class_ids.append(class_id)
    return bboxes, probs, class_ids
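
To make the coordinate arithmetic in where_is_it() concrete, here is a small standalone example that needs neither OpenCV nor NumPy. The helper names `to_opencv_box()` and `best_class()`, the frame size and the prediction values are made up for illustration.

```python
def to_opencv_box(pred, frame_w, frame_h):
    """Convert a normalised (cx, cy, w, h) prediction to pixel (left, top, width, height)."""
    center_x = int(pred[0] * frame_w)
    center_y = int(pred[1] * frame_h)
    width = int(pred[2] * frame_w)
    height = int(pred[3] * frame_h)
    left = int(center_x - width / 2)
    top = int(center_y - height / 2)
    return [left, top, width, height]


def best_class(class_probs):
    """Return (index, probability) of the most probable class."""
    class_id = max(range(len(class_probs)), key=lambda k: class_probs[k])
    return class_id, class_probs[class_id]


# Made-up prediction: a box centred in a 720 x 1280 frame, a quarter of its size
box = to_opencv_box([0.5, 0.5, 0.25, 0.25], frame_w=720, frame_h=1280)
label = best_class([0.1, 0.7, 0.2])
print(box, label)
```

The centre (360, 640) and size (180, 320) translate into a top-left corner at (270, 480), which is the format OpenCV drawing and tracking functions expect.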

Note that the three YOLO output tensors passed under outputs are in fact two-dimensional, and not three-dimensional as we had discussed. This is because in practice, the model predictions are unfolded with respect to both bounding boxes and grid cells, yielding three tables of size (S x S x 3) x (5 + C), or more specifically 507 x 85, 2028 x 85 and 8112 x 85. Hence, the first five columns in each of these tables carry the predicted coordinates and objectness score of individual boxes, while the remaining 80 provide the corresponding probabilities of all COCO classes.
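The table sizes quoted above follow directly from the grid arithmetic, as this short sketch shows. It assumes the standard YOLOv3 strides of 32, 16 and 8 between the input image and the three detection grids; the function name `yolov3_output_shapes()` is hypothetical.

```python
def yolov3_output_shapes(input_size=416, strides=(32, 16, 8),
                         boxes_per_cell=3, n_classes=80):
    """Return (S, rows, columns) of the unfolded output table at each scale."""
    shapes = []
    for stride in strides:
        s = input_size // stride           # grid size S at this scale
        rows = s * s * boxes_per_cell      # one row per predicted box
        cols = 5 + n_classes               # x, y, w, h, objectness + class probabilities
        shapes.append((s, rows, cols))
    return shapes


shapes = yolov3_output_shapes()
for s, rows, cols in shapes:
    print(f"S = {s}: {rows} x {cols}")
total_boxes = sum(rows for _, rows, _ in shapes)
print("Candidate boxes per frame:", total_boxes)
```

Summing over the three scales gives 10,647 candidate boxes per frame, which is why filtering and NMS are needed before anything is drawn or tracked.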

Video processing

At last we have all the pieces in place to begin the processing of the MP4 input video. Here is mine for reference, showing my living room and featuring a famous cat 🔇

Ahead of processing, we must set up both video capture and writing – the latter conforming to the same FPS rate, width and height as the former. Together, these two OpenCV utilities enable looping over one frame at a time, running detection or tracking accordingly and storing each frame with the overlaid results in output/output.mp4.

#%% Load video capture and init VideoWriter
vid = cv2.VideoCapture('input/input.mp4')
vid_w, vid_h = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH)), int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter('output/output.mp4', cv2.VideoWriter_fourcc(*'mp4v'),
                      vid.get(cv2.CAP_PROP_FPS), (vid_w, vid_h))
# Check if capture started successfully
assert vid.isOpened()

Now, assuming you have a basic knowledge of Python, I will summarise how processing unfolds.

We will perform detection every 60 frames and object tracking in between. If no high-confidence boxes are predicted we repeat detection in the next frame; likewise, if tracking fails we switch back to detection. The processing of the input video will be monitored in real time using a cv2.namedWindow() instance. As long as the video capture is open and feeding frames, we check whether detection or tracking should take place and proceed accordingly:

  • For detection, we first pass the current frame to the loaded YOLOv3 model after appropriate preprocessing. Preprocessing comprises scaling pixel intensities to the 0-1 range, resizing the input frame to 416 x 416 and reordering the BGR channels to RGB. Next, a forward pass with the preprocessed frame outputs the model predictions, from which we filter high-confidence boxes using the custom function where_is_it(). Lastly, any filtered boxes are subjected to NMS and the resulting final detections, along with the current frame, are used to create a multi-tracker object. If otherwise no boxes are returned, a red message indicating detection failed is printed on the top-left corner of the frame.
  • For tracking, we pass the current frame to the existing multi-tracker object. If tracking is successful, we extract the new box coordinates with which to draw rectangles around the previously detected objects and print the corresponding class labels on the current frame.

As a result, tracked objects will be highlighted in successive frames, and these in turn will be added to the output video file. Once the capture is exhausted, we release the output writer. Note that cv2.waitKey() allows for breaking the loop by pressing ESC – this can be helpful for debugging.

#%% Initiate processing
# Init count
count = 0
# Create new window
cv2.namedWindow('stream')
while vid.isOpened():
    # Perform detection every 60 frames
    perform_detection = count % 60 == 0
    ok, frame = vid.read()
    if ok:
        if perform_detection: # perform detection
            blob = cv2.dnn.blobFromImage(frame, 1 / 255, (416, 416), [0,0,0], swapRB=True, crop=False)
            # Pass blob to model
            model.setInput(blob)
            # Execute forward pass
            outputs = model.forward(outputNames)
            bboxes, probs, class_ids = where_is_it(frame, outputs)
            if len(bboxes) > 0:
                # Init multitracker
                # (recent opencv-contrib builds expose the tracker factories under cv2.legacy instead)
                mtracker = cv2.MultiTracker_create()
                # Apply non-max suppression and pass boxes to the multitracker
                # (flatten() copes with the nested or flat indices returned by different OpenCV versions)
                idxs = np.array(cv2.dnn.NMSBoxes(bboxes, probs, P_THRESH, NMS_THRESH), dtype=int).flatten()
                for i in idxs:
                    bbox = [int(v) for v in bboxes[i]]
                    x, y, w, h = bbox
                    # Use median flow
                    mtracker.add(cv2.TrackerMedianFlow_create(), frame, (x, y, w, h))
                # Increase counter
                count += 1
            else: # declare failure
                cv2.putText(frame, 'Detection failed', (20, 80),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0,0,255), 2)
        else: # perform tracking
            is_tracking, bboxes = mtracker.update(frame)
            if is_tracking:
                for i, bbox in enumerate(bboxes):
                    x, y, w, h = [int(val) for val in bbox]
                    class_id = classes[class_ids[idxs[i]]]
                    col = [int(c) for c in colors[class_ids[idxs[i]], :]]
                    # Mark tracking frame with corresponding color, write class name on top
                    cv2.rectangle(frame, (x, y), (x+w, y+h), col, 2)
                    cv2.putText(frame, class_id, (x, y - 15),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.5, col, 2)
                # Increase counter
                count += 1
            # If tracking fails, reset count to trigger detection
            else:
                count = 0
        # Display the resulting frame and add it to the output video
        cv2.imshow('stream', frame)
        out.write(frame)
        # Press ESC to exit
        if cv2.waitKey(25) & 0xFF == 27:
            break
    # Break if capture read does not work
    else:
        print('Exhausted video capture.')
        break
# Release capture and writer, close window
vid.release()
out.release()
cv2.destroyAllWindows()
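
The interplay between the frame counter, detection and tracking is easier to see in isolation. The sketch below replays only the scheduling logic of the loop above on a made-up sequence of outcomes, with no video or model involved; the function name `schedule()` and the shortened interval of 4 frames are assumptions made for illustration.

```python
def schedule(events, every=60):
    """Replay the detect/track decision for a sequence of frames.

    events: one boolean per frame, True if the step attempted on that frame
    (detection or tracking) succeeded. Returns the mode chosen for each frame.
    """
    count, modes = 0, []
    for success in events:
        detect = count % every == 0
        modes.append("detect" if detect else "track")
        if detect:
            # A failed detection leaves count unchanged, so detection is retried
            if success:
                count += 1
        else:
            # A failed tracking step resets count, forcing detection on the next frame
            count = count + 1 if success else 0
    return modes


# Made-up outcomes: detection fails once, tracking fails once
events = [False, True, True, False, True, True, True, True, True]
modes = schedule(events, every=4)
print(modes)
```

Detection therefore runs on the first frame, after every failure, and whenever the counter reaches a multiple of the chosen interval.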

Here is my output video 🔇

As you can see, this detection-tracking framework can simultaneously identify, delineate and smoothly track various objects on video. In my living room example we can identify books, bottles, potted plants, chairs, a dining table, a sofa, a cat, a wine glass and various cars parked outside. On the other hand we have a caravan and a glass erroneously predicted as bus and cup, respectively – neither actually too far off. We can also easily note that detection takes place immediately before a new set of objects undergoes tracking and becomes highlighted.

In this tutorial we built an OpenCV-based framework with which to identify and track objects on video. We have seen how most highlighted objects – however few in number – were accurately identified and tracked over successive frames. Here are some suggestions to improve on this prototype:

  • Tweak the objectness score, class probability and NMS thresholds. There is nothing special about the thresholds set at the beginning of this exercise. Experiment with these and work out your precision-recall ‘sweet spot’ for the first two – raising either threshold will lead to lower recall and higher precision, and vice versa. The detection frequency is just as easily adjustable 🎛
  • Do not take Median Flow for a walk. Object tracking with Median Flow worked well because there was no movement in my living room; testing this framework in a more lively scene would almost certainly fail. In that case, alternative methods that cope better with occlusion such as KCF and CSRT will render tracking more stable at the expense of more computation ⚠
  • Go find the state-of-the-art. Believe it or not, the techniques presented here aged quite rapidly. This is particularly obvious for YOLOv3, as object detection has been advancing fast in recent years; in fact, the performance of convolutional methods is now rivalled by vision transformers, which are inherently capable of multimodal self-supervised learning 🤯
  • Try out processing from a live stream. With minor changes to the code above you can perform live detection and tracking, for example using a webcam. Live processing might motivate you to create a lighter framework to boost the FPS rate 🎥
  • Keep a modest video resolution. One might expect a larger input video resolution to improve predictions, but alas it does not. Frames are invariably resized to 416 x 416 before inference, thus preserving a higher resolution will at best render a higher resolution output video. Differences in the contribution of bicubic (FFmpeg) and bilinear (OpenCV) interpolations to downscaling should be unimportant 📺
  • Up your game with transfer learning. For a more advanced object detection tutorial, transfer learning with YOLO might be the perfect fit. This requires a convolutional backbone, a list of custom class names, a modified network configuration template and ground-truth boxes and labels. With a line of code you can train the model to solve your own detection problem. If you are interested, check out the two Colab notebooks I wrote outlining the fine-tuning of YOLOv3 and YOLOv4, to identify bare and mask-wearing faces 😷
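To illustrate the first suggestion, the precision-recall trade-off can be explored with a toy calculation. The candidate boxes below are entirely made up, each one a pair of confidence score and whether it matches a real object, and the sketch assumes every real object has exactly one matching candidate; the function name `precision_recall()` is hypothetical.

```python
def precision_recall(candidates, threshold):
    """Precision and recall of the candidates kept above a confidence threshold."""
    kept = [ok for conf, ok in candidates if conf > threshold]
    tp = sum(kept)                                # correct boxes that were kept
    total_true = sum(ok for _, ok in candidates)  # all correct boxes available
    precision = tp / len(kept) if kept else 1.0
    recall = tp / total_true if total_true else 0.0
    return precision, recall


# Made-up candidates: (confidence, matches a real object?)
candidates = [(0.95, True), (0.85, True), (0.70, False),
              (0.60, True), (0.40, False), (0.30, True)]
for threshold in (0.2, 0.5, 0.8):
    p, r = precision_recall(candidates, threshold)
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}")
```

Raising the threshold from 0.5 to 0.8 removes the one false positive that was kept, at the cost of missing a real object, which is the trade-off to weigh when choosing a threshold for your own footage.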

I hope you had fun implementing object detection and tracking to explore our surroundings. Please leave your comments below; I always appreciate some feedback. Until next time!

