DeepSORT: SORT with a Deep Association Metric

In this article, we will discuss DeepSORT, which was published in 2017 and has influenced current multiple object tracking.

The DeepSORT paper Simple Online and Realtime Tracking with a Deep Association Metric is available on ArXiv and the implementation deep_sort is available on GitHub.

Overview

Simple Online and Realtime Tracking (SORT), introduced in the related article, is a multiple object tracking method that emphasizes real-time performance, published in 2016. However, in order to keep its algorithm simple, SORT does not use object appearance features and is designed not to address short-term and long-term occlusion issues. For this reason, the SORT issue is that when occlusion occurs, it is prone to ID switches where the ID of a track changes.

A method published in 2017 to improve the above SORT issues is Simple Online and Realtime Tracking with a Deep Association Metric (DeepSORT). DeepSORT creates a cost matrix from the appearance features output by the deep learning model, and performs data association using the cost matrix. As a result, DeepSORT has succeeded in significantly reducing ID switches.

DeepSORT

DeepSORT uses a deep learning (DL) model to output appearance features for bounding box regions detected by an object detection model. Appearance features from the DL model are converted into a cost matrix whose components are cosine distances to the appearance features stored in the track, and used for data association. Data association employs an algorithm called matching cascade. In the matching cascade, matching with detections is done in order from the most recently updated tracks. Tracks with corresponding detections by data association store the corresponding detection’s appearance features and state estimates updated by the Kalman filter.

Core Components of DeepSORT

Here, we will look at the core components of DeepSORT.

The paper shows the following four core components.

Track Handling and State Estimation
Assignment Problem
Matching Cascade
Deep Appearance Descriptor

Let’s look at the details in the order above, including the differences from SORT.

Track Handling and State Estimation

Track handling is different between SORT and DeepSORT. There is only one condition for deleting tracks in SORT, but there are two in DeepSORT.

In SORT, tracks with no assignments for 2 consecutive frames are deleted.

In DeepSORT, tracks are treated as tentative for the first 3 frames, and tracks that are not successfully assigned during this time are deleted. This condition has been added to remove tracks generated by single false positives. For tracks not deleted above, the track will not be deleted until the number of frames since the last assignment exceeds the set number of frames A_max. Note that A_max is set to 30 frames in the experiments in the paper. This makes it less likely that an ID switch will occur if the occlusion effect is within 30 frames since the corresponding track has not been deleted.

In DeepSORT, part of the state vector is changed from SORT, but there is no change in the framework for state estimation using the Kalman filter.

Assignment Problem

The assignment problem is a big part of the difference between SORT and DeepSORT. First, we’ll look at how to construct the cost matrix.

SORT uses intersection-over-union (IoU) to create a cost matrix.

DeepSORT can create a cost matrix by combining the cosine distance of appearance feature vectors and the Mahalanobis distance often used in radar. However, in the experiment, the weights of the cosine distance and the Mahalanobis distance are set to 1 and 0, respectively, which means that only the cosine distance is used to create the cost matrix.

Next, we will look at the details of DeepSORT’s cosine distance.

An appearance feature vector corresponding to each detected bounding box is generated using the deep appearance descriptor described later. Let r_j denote the appearance feature vector corresponding to the j-th detection. The track stores up to 100 newest appearance feature vectors corresponding to the assigned detections. Let us denote the k-th appearance feature vector of the i-th track as r_k⁽ⁱ⁾.

The cosine distance d(i, j) between the ith track and the jth detection is defined as
d(i, j) = min{1 – r_j^Tr_k⁽ⁱ⁾ | r_k⁽ⁱ⁾ ∈ {r_k⁽ⁱ⁾}_k=1^L_k}.
The cosine distance d(i, j) is the smallest cosine distance among the appearance feature vectors stored by the i-th track.

Also, DeepSORT has an additional gate matrix that SORT did not have.

Gating is also a popular method in radar, similar to Mahalanobis distance. Basically, a threshold is used to determine if the value is in or out of the gate. If it is judged to be out of the gate, it will be excluded from the assignment target. In the cost matrix, the Mahalanobis distance weights are set to 0, but in the gate matrix, both cosine and Mahalanobis distance results are used.

Matching Cascade

As mentioned above, DeepSORT does not delete confirmed tracks that survive the first 3 frames if the number of frames since the last assignment is less than or equal to the maximum number of frames A_max. This can result in mixed tracks with different number of frames from the last assignment.

The matching cascade addresses the above situation by making assignments to tracks with the lowest number of frames since the last assignment.

Deep Appearance Descriptor

DeepSORT uses an appearance feature vector corresponding to each detected bounding box, as described in assignment problem.

The deep appearance descriptor generates the above appearance feature vector. As you can imagine from the name of deep, convolutional neural network (CNN) is used.

Summary

In this article, we reviewed DeepSORT, which was published in 2017 and has influencing current multiple object tracking.

DeepSORT improves SORT performance by integrating appearance information. DeepSORT uses a deep learning model to output appearance features for bounding box regions detected by an object detection model. The appearance features from the deep learning model are converted to a cost matrix, and the cost matrix is used to perform data association, detection and track association. As a result, the number of ID switches due to occlusion, which is an issue of SORT, is effectively reduced.