Structured Knowledge Distillation for Semantic Segmentation

Motivation

  • The need for neural networks with small model size, light computational cost, and high segmentation accuracy for applications on mobile devices.
  • Deep neural networks have achieved significant improvement in segmentation accuracy.
  • Knowledge distillation has been shown to be effective in classification tasks.

Contributions

  • Applies knowledge distillation to the training of accurate, compact semantic segmentation networks.
  • Proposes two structured knowledge distillation strategies for semantic segmentation: pair-wise distillation and holistic distillation.
  • Validates the approach experimentally on three datasets with different network configurations.

Methods

T denotes the teacher network (PSPNet with a ResNet101 backbone); S denotes the student network (ResNet18, MobileNetV2Plus, ESPNet-C, or ESPNet).

  • Overall architecture
    Consists of three parts: (a) Pair-wise distillation; (b) Pixel-wise distillation; (c) Holistic distillation.
  • Structured knowledge distillation
    • Pixel-wise distillation
      Treats the segmentation task as a collection of separate per-pixel classification problems, and directly aligns the class probabilities of each pixel produced by S with those produced by T.
      Pixel-wise distillation loss:

        \ell_{pi}(S) = \frac{1}{|R|} \sum_{i \in R} \mathrm{KL}(q_i^s \,\|\, q_i^t)

      where q_i^s and q_i^t are the class probabilities of the i-th pixel produced by S and T, respectively, \mathrm{KL}(\cdot) is the Kullback-Leibler divergence, and R refers to the set of all pixels. A PyTorch sketch of this loss follows the Methods list.
    • Pair-wise distillation
      Pair-wise distillation is inspired by the pair-wise Markov random field framework. Instead of the per-pixel class probabilities, pair-wise similarities among pixels are transferred.
      Pair-wise distillation loss:

        \ell_{pa}(S) = \frac{1}{|R|^2} \sum_{i \in R} \sum_{j \in R} (a_{ij}^s - a_{ij}^t)^2

      where a_{ij}^s and a_{ij}^t refer to the similarity between the i-th and j-th pixels produced by S and T, respectively.
      In the implementation, the similarity between two pixels is computed as the cosine similarity of their features:

        a_{ij} = \frac{f_i^{\top} f_j}{\|f_i\|_2 \, \|f_j\|_2}

      where f_i and f_j are the feature vectors of the i-th and j-th pixels taken from the last feature map. A PyTorch sketch of this loss follows the Methods list.
    • Holistic distillation
      Conditional generative adversarial learning is employed: S is treated as the generator conditioned on the input image I, its segmentation map Q^s is the fake sample, and the teacher's map Q^t is regarded as the real sample. Q^s needs to be as similar as possible to Q^t, and the Wasserstein distance is used to evaluate the difference between them:

        \ell_{ho}(S, D) = \mathbb{E}_{Q^s \sim p_s(Q^s)}[D(Q^s \mid I)] - \mathbb{E}_{Q^t \sim p_t(Q^t)}[D(Q^t \mid I)]

      \mathbb{E}[\cdot] is the expectation operator. D(\cdot) is an embedding network consisting of five convolutions, with two self-attention modules inserted between the final three layers; the segmentation map (Q^s or Q^t) is concatenated with I and fed into D. A sketch of the critic terms follows the Methods list.
  • Optimization and training
    Overall loss function:

      \ell(S, D) = \ell_{mc}(S) + \lambda_1 (\ell_{pi}(S) + \ell_{pa}(S)) - \lambda_2 \ell_{ho}(S, D)

    where \ell_{mc}(S) is the conventional multi-class cross-entropy loss against the ground truth and \lambda_1, \lambda_2 are weights balancing the distillation terms.
    Optimization alternates between two steps:
    • Train the discriminator D by minimizing \ell_{ho}(S, D), with S held fixed.
    • Train the segmentation network S, with D held fixed, by minimizing

        \ell_{mc}(S) + \lambda_1 (\ell_{pi}(S) + \ell_{pa}(S)) - \lambda_2 \, \mathbb{E}_{Q^s \sim p_s(Q^s)}[D(Q^s \mid I)]

      (the Q^t term of \ell_{ho} does not depend on S). A sketch of one training step follows this list.
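The pixel-wise loss above maps directly to a few lines of code. A minimal PyTorch sketch, assuming both networks output (B, C, H, W) score maps of the same spatial size; the function and argument names are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def pixel_wise_distillation(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(q^s || q^t) per pixel, averaged over all pixels (and the batch).

    student_logits / teacher_logits: (B, C, H, W) score maps with matching
    shapes (upsample one of them beforehand if the output strides differ).
    """
    log_q_s = F.log_softmax(student_logits, dim=1)   # log q^s
    log_q_t = F.log_softmax(teacher_logits, dim=1)   # log q^t
    q_s = log_q_s.exp()
    # KL(q^s || q^t) = sum_c q^s * (log q^s - log q^t), summed over classes.
    kl = (q_s * (log_q_s - log_q_t)).sum(dim=1)      # (B, H, W)
    return kl.mean()                                  # average over R (all pixels)
```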
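Similarly, a sketch of the pair-wise loss under the same assumptions. Note that the full |R| x |R| similarity matrix is memory-heavy for large feature maps, so in practice the features are often pooled or sub-sampled to a coarser grid first:

```python
import torch
import torch.nn.functional as F

def pair_wise_distillation(feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
    """Squared difference between the pixel-pixel similarity matrices of S and T.

    feat_s / feat_t: (B, C_s, H, W) and (B, C_t, H, W) last-layer feature maps;
    channel counts may differ, spatial sizes are assumed to match here.
    """
    def similarity(feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        f = feat.view(b, c, h * w)                 # (B, C, N), one column per pixel
        f = F.normalize(f, p=2, dim=1)             # unit L2 norm per pixel feature
        return torch.bmm(f.transpose(1, 2), f)     # (B, N, N): a_ij = f_i^T f_j / (||f_i|| ||f_j||)

    a_s = similarity(feat_s)
    a_t = similarity(feat_t)
    return ((a_s - a_t) ** 2).mean()               # average over all pixel pairs
```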
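For the holistic term, a hedged sketch of the Wasserstein-style critic terms. D here can be any critic network that takes the concatenation of a segmentation map and the image and returns a score (the five-convolution embedding network with self-attention described above would fit); a Lipschitz constraint such as weight clipping or a gradient penalty, needed for a meaningful Wasserstein critic, is omitted for brevity:

```python
import torch

def holistic_distillation_terms(D, q_s, q_t, image):
    """Wasserstein-style holistic distillation terms for a conditional critic.

    D: critic / embedding network mapping cat(segmentation map, image) -> score.
    q_s, q_t: student / teacher segmentation maps (B, C, H, W); image: (B, 3, H, W).
    Returns (d_loss, g_term): d_loss = l_ho(S, D) is minimized w.r.t. D,
    and g_term = -E[D(Q^s | I)] is the part of -l_ho that depends on S.
    """
    score_s = D(torch.cat([q_s, image], dim=1)).mean()  # E[D(Q^s | I)]
    score_t = D(torch.cat([q_t, image], dim=1)).mean()  # E[D(Q^t | I)]
    d_loss = score_s - score_t    # l_ho(S, D); pass q_s detached when updating D
    g_term = -score_s             # S is pushed to raise the critic's score on its own maps
    return d_loss, g_term
```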
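Finally, a sketch of one alternating training step tying the pieces together. It reuses the three loss helpers above and assumes S(image) and T(image) each return a (score map, last-layer features) pair; all interfaces, names, and the weighting arguments are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def training_step(S, T, D, opt_S, opt_D, image, label, lambda1, lambda2):
    """One alternating update for the overall loss above."""
    with torch.no_grad():
        logits_t, feat_t = T(image)                  # teacher is frozen
    logits_s, feat_s = S(image)
    q_s = F.softmax(logits_s, dim=1)
    q_t = F.softmax(logits_t, dim=1)

    # Step 1: update the discriminator/critic D by minimizing l_ho (student detached).
    d_loss, _ = holistic_distillation_terms(D, q_s.detach(), q_t, image)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Step 2: update the student S on the full objective, with D held fixed.
    _, g_term = holistic_distillation_terms(D, q_s, q_t, image)
    loss = (F.cross_entropy(logits_s, label)                                   # l_mc
            + lambda1 * (pixel_wise_distillation(logits_s, logits_t)
                         + pair_wise_distillation(feat_s, feat_t))
            + lambda2 * g_term)                                                # -lambda2 * E[D(Q^s|I)]
    opt_S.zero_grad()
    loss.backward()
    opt_S.step()
    return loss.detach()
```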

Results

  • Effectiveness of the three distillation strategies

  • Cityscapes

  • CamVid

  • ADE20K

Conclusion

  • A new strategy to incorporate a GAN into segmentation: directly enforcing alignment between the segmentation map and the ground truth may limit the success of the GAN discriminator, because there is a mismatch between the generator's continuous output and the discrete true labels. Here, the segmentation map is instead compared with the continuous output of the teacher network.
  • The pixel-wise and holistic distillations were applied to the final score maps, and the pair-wise distillation was applied to the feature maps of the last layer. In the comparison, attention transfer appears to have been applied only to the score maps, so its full capacity may not have been utilized.