Unsupervised Learning of Visual Representations Using Videos

Triggering Idea

  • Our key idea is that visual tracking provides the supervision for representation learning.

  • We argue that static images themselves might not have enough information to learn a good visual representation.

  • We design a Siamese-triplet network with a ranking loss to train the CNN representation. The ranking loss enforces that, in the learned deep feature space, the first-frame patch is much closer to the tracked patch than to any other randomly sampled patch (a minimal sketch of this loss follows this bullet).
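
As a concrete illustration, the following is a minimal PyTorch sketch of such a ranking loss. The cosine-distance hinge form follows the paper's description above; the margin value of 0.5, the function name `ranking_loss`, and the batched `(batch, dim)` embedding shapes are assumptions for illustration.

```python
import torch.nn.functional as F

def ranking_loss(query, tracked, rand_patch, margin=0.5):
    """Hinge ranking loss over a triplet of patch embeddings.

    Pushes the first-frame patch (query) to be closer, in cosine
    distance, to its tracked patch than to a randomly sampled patch,
    by at least `margin`. All inputs are (batch, dim) tensors.
    """
    d_pos = 1.0 - F.cosine_similarity(query, tracked, dim=1)     # D(X, X+)
    d_neg = 1.0 - F.cosine_similarity(query, rand_patch, dim=1)  # D(X, X-)
    # Loss vanishes once the tracked patch is closer by the margin.
    return F.relu(d_pos - d_neg + margin).mean()
```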

  • Most of the work in this area can be broadly divided into three categories:

  1. The first class of algorithms focuses on learning generative models with strong priors.
  2. The second class uses manually defined features such as SIFT or HOG and performs clustering over the training data to discover semantic classes.
  3. The third class, and the one most related to our paper, is unsupervised learning of visual representations from the pixels themselves using deep learning approaches.
  • Since we do not have labels, it is not clear what the loss function should be or how we should optimize it.
    (In re-ID, the loss function is not necessarily tied to the labels; supervision only serves to make the loss function better.)
  • Since we are tracking these patches, we know that the first and last tracked frames correspond to the same instance of the moving object or object part. Therefore, any visual representation that we learn should keep these two data points close in the feature space (see the triplet-mining sketch below).
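
To make the tracking-as-supervision idea concrete, here is a small self-contained Python sketch of how tracked patch pairs could be turned into training triplets for the ranking loss above. The function name, the `(first_patch, last_patch)` pair format, and the uniform negative sampling are illustrative assumptions; the paper additionally mines hard negatives within a batch after an initial training stage.

```python
import random

def sample_triplets(tracked_pairs, negatives_per_pair=1):
    """Convert tracked (first_patch, last_patch) pairs into triplets.

    Tracking guarantees that both patches in a pair show the same
    object instance, so they serve as (query, positive); negatives
    are patches drawn from other tracked pairs, which most likely
    show different instances.
    """
    triplets = []
    for i, (first, last) in enumerate(tracked_pairs):
        # Candidate negatives: every patch from every other pair.
        others = [p for j, pair in enumerate(tracked_pairs)
                  if j != i for p in pair]
        for neg in random.sample(others, negatives_per_pair):
            triplets.append((first, last, neg))
    return triplets
```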