reading note: detect to track and track to detect

TITLE: Detect to Track and Track to Detect

AUTHOR: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

ASSOCIATION: Graz University of Technology, University of Oxford

FROM: arXiv:1710.03958

CONTRIBUTION

  1. A ConvNet architecture is set up for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression.
  2. Correlation features that represent object co-occurrences across time are introduced to aid the ConvNet during tracking.
  3. Frame-level detections are linked to produce high accuracy detections at the video-level based on across-frame tracklets.

METHOD

For frame-level detections, this work adopts R-FCN as the base framework to detect objects in a single frame. The inter-frame correlation features are extracted from the feature maps of the two frames. A multi-task loss of localization, classification and displacement is used to train the net work. The workflow of this work is shown in the following figure.

Framework

The key innovation of this work is an operation denoted as ROI tracking. The input of this operation is the bounding box regression features of the two frames , and the correlation features , which are concatenated. The correlation layer performs point-wise feature comparison of two feature maps ,

where $-d leq p leq d$ and $-d leq q leq d$ are offsets to compare features in a square neighbourhood around the locations $i$, $j$ in the feature map, defined by the maximum displacement $d$.

The loss function is written as

A class-wise linking score is defined to combine detections and tracks across time

where the pairwise term $phi$ evaluates to 1 if the IoU overlap a track correspondences $T^{t,t+tau}$ with the detection boxes $D_{i}^{t}$, $D_{i}^{t+tau}$ is larger than 0.5. $p_{i,c}^{t}$, $p_{j,c}^{t+tau}$ is the softmax probability for class $c$. The optimal path across a video can be found by maximizing the scores over the duration $T$ of the video. Once the optimal tube is found, the detections corresponding to that tube are removed. Then reweight the detection scores in the tube by adding the mean of the 50% highest scores in that tube. And the procedure is applied again to the remaining detections.