SORT: Simple Online and Realtime Tracking

In this article, we discuss SORT (Simple Online and Realtime Tracking), a method published in 2016 that has influenced current multiple object tracking (MOT) research.

The SORT paper is available on arXiv, and the code is available on GitHub.

SORT is an implementation of a tracking-by-detection framework: it tracks objects using the bounding boxes (bboxes) of multiple objects detected in each frame of an image sequence. As its name suggests, SORT is simple and fast.

At the time, the MOT benchmark was dominated by methods with elaborate data association, such as Multiple Hypothesis Tracking (MHT) and Joint Probabilistic Data Association (JPDA). MHT achieves high tracking performance but struggles to run in real time, whereas JPDA runs in real time but lags in tracking performance.

Tracker

Here, we look at the tracker, which is the core of SORT.

SORT emphasizes real-time performance and combines a Kalman filter and the Hungarian algorithm as components for tracking.
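To make the Kalman filter's role concrete, here is a minimal sketch of its predict and update steps, assuming a one-dimensional constant-velocity model with position-only measurements. It is a toy and not SORT's actual bbox filter: the class name `ConstantVelocityKF1D`, the noise values `q` and `r`, and the initial covariance are all illustrative assumptions.

```python
class ConstantVelocityKF1D:
    """Toy 1-D Kalman filter with state [position, velocity].

    Hypothetical illustration only: the noise values and the initial
    covariance are assumptions, not taken from the SORT paper.
    """

    def __init__(self, position, q=0.01, r=1.0):
        self.pos, self.vel = float(position), 0.0
        # Covariance P as a 2x2 matrix; the velocity starts out uncertain.
        self.p = [[1.0, 0.0], [0.0, 10.0]]
        self.q, self.r = q, r

    def predict(self):
        """Advance one frame with F = [[1, 1], [0, 1]]; return the predicted position."""
        self.pos += self.vel
        (p00, p01), (p10, p11) = self.p
        # P <- F P F^T + Q, with Q = q * I
        self.p = [
            [p00 + p01 + p10 + p11 + self.q, p01 + p11],
            [p10 + p11, p11 + self.q],
        ]
        return self.pos

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        (p00, p01), (p10, p11) = self.p
        innovation = z - self.pos
        s = p00 + self.r            # innovation covariance
        k0, k1 = p00 / s, p10 / s   # Kalman gain
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        # P <- (I - K H) P
        self.p = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]


kf = ConstantVelocityKF1D(position=0.0)
for z in range(1, 10):          # the object moves one unit per frame
    kf.predict()
    kf.update(float(z))
next_position = kf.predict()    # prediction for the frame after the last measurement
```

After a few frames the velocity estimate settles near one unit per frame, so the filter can predict where the object should appear even before the next detection arrives.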

The tracking process consists of three steps:

In the first step, prediction, the Kalman filter predicts the bbox for each existing track.
In the second step, data association, a cost matrix based on intersection-over-union (IoU) is computed between the bboxes predicted by the Kalman filter and the detected bboxes, which serve as observations. The Hungarian algorithm then uses the cost matrix to assign observations to tracks.
In the third step, update, the Kalman filter updates each track with the observation assigned to it. New tracks are created from observations that were not assigned to any existing track, and tracks that have not been updated for a given number of frames are deleted.
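The data association step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: boxes are `(x1, y1, x2, y2)` tuples, the cost is `1 - IoU`, the function names and the `iou_min` gating threshold are hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm so that the example needs only the standard library.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(predicted, detected, iou_min=0.3):
    """Match predicted track boxes to detections by minimising total (1 - IoU).

    Exhaustive search stands in for the Hungarian algorithm: it also finds
    a minimum-cost assignment, but is only practical for a handful of boxes.
    iou_min is an assumed gating threshold.
    """
    cost = [[1.0 - iou(p, d) for d in detected] for p in predicted]
    n_t, n_d = len(predicted), len(detected)
    best_pairs, best_cost = [], float("inf")
    if n_t <= n_d:
        # Each track picks a distinct detection.
        candidates = (list(enumerate(perm))
                      for perm in permutations(range(n_d), n_t))
    else:
        # Each detection picks a distinct track.
        candidates = ([(t, d) for d, t in enumerate(perm)]
                      for perm in permutations(range(n_t), n_d))
    for pairs in candidates:
        total = sum(cost[t][d] for t, d in pairs)
        if total < best_cost:
            best_pairs, best_cost = pairs, total
    # Reject pairs whose overlap is too small to be the same object.
    matches = sorted((t, d) for t, d in best_pairs if 1.0 - cost[t][d] >= iou_min)
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_tracks = [t for t in range(n_t) if t not in matched_t]
    unmatched_detections = [d for d in range(n_d) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_detections


predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
detected = [(21, 21, 31, 31), (1, 1, 11, 11), (100, 100, 110, 110)]
matches, lost, new = associate(predicted, detected)
```

In this toy input the third detection overlaps no track, so it would start a new track in the update step. For realistic numbers of boxes, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` is the practical choice.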

With this lightweight design, SORT is about eight times faster than JPDA, so real-time operation is easy to achieve without a high-performance processor.

Detector

Here, we look at the detector that generates the input data for SORT. With tracking components this simple, one might worry that tracking performance would suffer.

In SORT, the inputs are the bboxes of objects detected by an object detection model based on a convolutional neural network (CNN). According to the paper, Faster Region CNN (Faster R-CNN) is used as the object detection framework.

The paper compares three detectors: Aggregated Channel Features (ACF), Faster R-CNN (ZF), which uses ZF as the CNN, and Faster R-CNN (VGG16), which uses VGG16 as the CNN. The results show that detection quality has a significant effect on tracking performance. MOTA, a metric of tracking performance, is 15.1% with ACF and 34.0% with Faster R-CNN (VGG16). The difference of 18.9 percentage points corresponds to the "up to 18.9%" improvement in tracking performance from changing the detector that is described in the abstract of the paper.
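For reference, MOTA in the CLEAR MOT metrics penalizes false negatives, false positives, and ID switches relative to the number of ground-truth objects. The short sketch below computes it and reproduces the 18.9-point gap from the two MOTA values quoted above. The function name and the error counts in the first usage line are made up for illustration and are not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, ground_truth):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / ground_truth


# Hypothetical counts, chosen only to show the formula in use.
example = mota(false_negatives=300, false_positives=100, id_switches=10,
               ground_truth=1000)

# MOTA values (in percent) quoted from the paper's detector comparison.
mota_acf, mota_vgg16 = 15.1, 34.0
gain_points = round(mota_vgg16 - mota_acf, 1)
```

Because ID switches appear in the numerator alongside detection errors, both a weak detector and an unstable tracker lower the score.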

By taking detections from a CNN-based object detection model as input, SORT achieves performance comparable to MHT despite being a simple tracker, which we consider a significant result.

Note: In the published implementation, the file of detected bboxes is created in advance, so real-time performance is not guaranteed once the detector is included.

Issues of SORT

Here, we will look at the issues of SORT.

Comparing the tracking performance metrics in the SORT paper with those of other methods, we find that the number of ID switches, where an object is assigned an ID different from the one on its previous track, is no better than that of the state-of-the-art methods.

The paper states that SORT does not use object appearance features and ignores both short-term and long-term occlusion. In addition, it is designed to delete tracks early, for the following two reasons.

The constant velocity model is a poor predictor of the true dynamics.
Re-identification of objects is beyond the scope of this work.

Specifically, tracks that have not been updated for two consecutive frames are deleted. As a result, if the object corresponding to an existing track goes undetected for two consecutive frames, for example because of occlusion, a track with a new ID is created when the object is detected again. This results in an ID switch.
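The deletion rule and the resulting ID switch can be illustrated with a toy simulation. It assumes a single object, skips the Kalman filter and data association entirely, and uses a hypothetical function `track_ids` whose `max_missed` parameter is set to the two-frame limit described above.

```python
def track_ids(detected_flags, max_missed=2):
    """Return the track ID reported in each frame for a single object.

    detected_flags[i] is True when the detector found the object in frame i.
    A track is deleted once it has gone max_missed consecutive frames
    without an update (hypothetical simplification of the rule in the text).
    """
    next_id = 0
    track = None  # [track_id, consecutive_missed_frames] or None
    ids = []
    for seen in detected_flags:
        if seen:
            if track is None:
                # No live track: the detection starts a new one with a new ID.
                track = [next_id, 0]
                next_id += 1
            else:
                # The detection updates the live track and resets its miss count.
                track[1] = 0
            ids.append(track[0])
        else:
            if track is not None:
                track[1] += 1
                if track[1] >= max_missed:
                    track = None  # deleted: the old ID is gone for good
            ids.append(None)
    return ids


# One missed frame is survived; two consecutive missed frames cause an ID switch.
ids = track_ids([True, True, False, True, False, False, True])
```

Here the object keeps ID 0 through the single missed frame, but after the two-frame gap it comes back as ID 1 even though it is the same object.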

Because SORT does not use the appearance features of objects, it is understandable that long-term occlusion is difficult for it to handle. It is less obvious why it is not designed to handle short-term occlusion.

We speculate that the reason is that SORT’s simple algorithm has only one condition for deleting a track. That condition has to be strict enough to quickly remove tracks caused by false positives from the detector, and as a result it also removes tracks during short-term occlusion.

Summary

In this article, we reviewed SORT, which was published in 2016 and has influenced current MOT research.

We believe that SORT made the following three major contributions to the MOT field.

Detection quality is identified as a key factor of tracking performance, and the effectiveness of CNN-based object detection models is demonstrated in the MOT field.
A pragmatic tracking approach based on a Kalman filter and the Hungarian algorithm is proposed.
The code is open sourced.