
Objects as Points

1. Motivation:

  • Traditional detectors need post-processing (NMS), which is hard to differentiate and prevents end-to-end training.
  • One-stage detectors: slide anchors over the image and classify them directly.
  • Two-stage detectors: recompute image features for each potential box, then classify them.

2. Methods:

2.1 Highlights:

  • CenterNet:
    • represents each object by a single point: the center of its bounding box
    • regresses other object properties (e.g. size) from features at that point
  • Inference:
    • a single network forward pass, with no NMS post-processing

2.2 Implementation

  • Center points of bboxes (see the peak-extraction sketch after this list):

    • generate a heatmap with an FCN (keypoint prediction network)

    • extract local peaks of the keypoint heatmap

      • background: CornerNet etc. use keypoint estimation to detect box corners

  • Size (via regression):

    • image features at each peak predict the object's bbox size

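The paper replaces NMS with a simple "keep only 3×3 local maxima" step on the heatmap. A minimal PyTorch sketch of that decoding step (tensor shapes and the top-k count are illustrative assumptions, not the paper's exact code):

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap, k=100):
    """Keep only 3x3 local maxima of the class heatmaps (the NMS
    replacement), then return the top-k scoring peaks.

    heatmap: (B, C, H, W) tensor of per-class center scores in [0, 1].
    """
    # A location is a peak iff it equals the max of its 3x3 neighbourhood.
    hmax = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (hmax == heatmap).float()

    b, c, h, w = peaks.shape
    scores, idx = torch.topk(peaks.view(b, -1), k)   # flatten over C*H*W
    cls = idx // (h * w)                             # class id of each peak
    ys = (idx % (h * w)) // w                        # row on the output map
    xs = (idx % (h * w)) % w                         # column on the output map
    return scores, cls, xs, ys
```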
The network predicts a total of C + 4 outputs at each location (C class heatmaps, 2 channels for the local offset, 2 channels for the object size). All outputs share a common fully-convolutional backbone network. For each modality, the backbone features are passed through a separate 3×3 convolution, ReLU, and another 1×1 convolution.
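A sketch of such a head in PyTorch, assuming 64-channel backbone features and 256 hidden channels (both channel counts are illustrative choices, not taken from the paper):

```python
import torch.nn as nn

def make_head(in_channels, out_channels, hidden=256):
    """One output head: 3x3 conv + ReLU followed by a 1x1 conv."""
    return nn.Sequential(
        nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(hidden, out_channels, kernel_size=1),
    )

# Three separate heads on top of the shared backbone features:
# C class heatmaps + 2 offset channels + 2 size channels = C + 4 outputs.
num_classes, backbone_channels = 80, 64   # illustrative values
heatmap_head = make_head(backbone_channels, num_classes)
offset_head  = make_head(backbone_channels, 2)
size_head    = make_head(backbone_channels, 2)
```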

Output stride R = 4, i.e. the output resolution is 4× smaller than the input (e.g. a 512×512 image yields a 128×128 output map).
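A tiny worked example of what the stride does to a ground-truth center, and of the fractional remainder that the offset head (see $L_{off}$ below) is trained to recover (the pixel coordinates are made up for illustration):

```python
# Map a ground-truth center from input-image pixels to the output map
# (stride R = 4) and compute the local offset left over by rounding.
R = 4
cx, cy = 203.0, 117.0                # center in input-image pixels (made up)
cx_low, cy_low = cx / R, cy / R      # 50.75, 29.25 on the low-res output map
ix, iy = int(cx_low), int(cy_low)    # integer heatmap peak location: (50, 29)
offset = (cx_low - ix, cy_low - iy)  # (0.75, 0.25) -> regression target for L_off
```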

2.3 Loss

A weighted sum of 3 losses:

  • $L_k$: penalty-reduced pixel-wise logistic regression with focal loss
  • $L_{off}$: additionally predicts a local offset
    • to recover the discretization error caused by the output stride
    • each center point has 2 offsets (one for x, one for y)
  • $L_{size}$: regresses the object size for each object
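
The paper combines the three terms into a single training objective; the weights below are the defaults it reports (the focal loss in $L_k$ uses α = 2, β = 4):

$$
L_{det} = L_k + \lambda_{size}\, L_{size} + \lambda_{off}\, L_{off}, \qquad \lambda_{size} = 0.1,\ \lambda_{off} = 1
$$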