Unsupervised Learning of Visual Representations Using Videos

Triggering Idea

  • Our key idea is that visual tracking provides the supervision for representation learning.

  • We argue that static images themselves might not have enough information to learn a good visual representation.

  • We design a Siamese-triplet network with a ranking loss to train the CNN representation. The ranking loss enforces that, in the learned deep feature space, the first-frame patch is much closer to the tracked patch than to any other randomly sampled patch (a minimal sketch of this loss follows this bullet).
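
As a concrete illustration, the following is a minimal PyTorch sketch of such a ranking loss. The cosine-distance hinge form follows the paper's description above; the margin value of 0.5, the function name `ranking_loss`, and the batched `(batch, dim)` embedding shapes are assumptions for illustration.

```python
import torch.nn.functional as F

def ranking_loss(query, tracked, rand_patch, margin=0.5):
    """Hinge ranking loss over a triplet of patch embeddings.

    Pushes the first-frame patch (query) to be closer, in cosine
    distance, to its tracked patch than to a randomly sampled patch,
    by at least `margin`. All inputs are (batch, dim) tensors.
    """
    d_pos = 1.0 - F.cosine_similarity(query, tracked, dim=1)     # D(X, X+)
    d_neg = 1.0 - F.cosine_similarity(query, rand_patch, dim=1)  # D(X, X-)
    # Loss vanishes once the tracked patch is closer by the margin.
    return F.relu(d_pos - d_neg + margin).mean()
```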

  • Most of the work in this area can be broadly divided into three categories:

  1. The first class of algorithms focuses on learning generative models with strong priors.
  2. The second class uses manually defined features such as SIFT or HOG and performs clustering over the training data to discover semantic classes.
  3. The third class, and the one most related to our paper, is unsupervised learning of visual representations from the pixels themselves using deep learning approaches.
  • Since we do not have labels, it is not clear what the loss function should be or how we should optimize it.
    (In re-ID, the loss function is not necessarily tied to the labels; supervision only serves to make the loss function better.)
  • Since we are tracking these patches, we know that the first and last tracked frames correspond to the same instance of the moving object or object part. Therefore, any visual representation that we learn should keep these two data points close in the feature space (see the triplet-mining sketch below).
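
To make the tracking-as-supervision idea concrete, here is a small self-contained Python sketch of how tracked patch pairs could be turned into training triplets for the ranking loss above. The function name, the `(first_patch, last_patch)` pair format, and the uniform negative sampling are illustrative assumptions; the paper additionally mines hard negatives within a batch after an initial training stage.

```python
import random

def sample_triplets(tracked_pairs, negatives_per_pair=1):
    """Convert tracked (first_patch, last_patch) pairs into triplets.

    Tracking guarantees that both patches in a pair show the same
    object instance, so they serve as (query, positive); negatives
    are patches drawn from other tracked pairs, which most likely
    show different instances.
    """
    triplets = []
    for i, (first, last) in enumerate(tracked_pairs):
        # Candidate negatives: every patch from every other pair.
        others = [p for j, pair in enumerate(tracked_pairs)
                  if j != i for p in pair]
        for neg in random.sample(others, negatives_per_pair):
            triplets.append((first, last, neg))
    return triplets
```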