What is it about?
Many participants were asked to group photos that they felt were similar to each other, and the similarity between two photos was measured as the proportion of participants who placed them in the same group. We call this measure visual similarity. Similarity between photos by computer vision is computed either as the cosine similarity between image features produced by an image encoder, or as the Hellinger distance between the class probability distributions output by a 1000-class classifier. We developed a method for estimating the similarity between photos whose visual similarity has not been measured, by combining the measured visual similarity with the similarity obtained by computer vision.

Each photo set whose visual similarity had already been measured was randomly split into two groups, A and B, and the visual similarity of group A was used to predict the visual similarity of group B. The proposed method was evaluated by its prediction accuracy, measured by how well the 3D MDS coordinates of group B in the original photo set were restored by the 3D MDS coordinates computed from the predicted similarity.

We compared prediction accuracy across three photo sets with different content. Image features from a Vision Transformer pre-trained on ImageNet-21K and transfer-learned on ImageNet-1K yielded high prediction accuracy, and the image encoder of CLIP was also relatively accurate. Combining multiple computer vision models improved prediction accuracy further. Accuracy was not as high for the garden landscape photo set, which contained only garden photos. For the student life photo set, which includes varied photos such as food, people, buildings, and landscapes, prediction accuracy was very high and the MDS coordinates could be restored almost perfectly. Prediction accuracy was also high for the townscape photo set consisting of buildings and landscapes.
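For concreteness, the computer-vision similarity measures and the MDS-based evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes feature vectors and 1000-class probability distributions have already been extracted for each photo, and the Procrustes-based comparison of MDS coordinates is our own assumption about one way restoration quality could be checked.

    # Minimal sketch (Python: NumPy, scikit-learn, SciPy) of the similarity
    # measures and 3D MDS evaluation described above. Not the authors' code;
    # the Procrustes comparison is an illustrative assumption.
    import numpy as np
    from sklearn.manifold import MDS
    from scipy.spatial import procrustes

    def cosine_similarity(f1, f2):
        # Cosine similarity between two image feature vectors (e.g. from ViT or CLIP).
        return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

    def hellinger_distance(p, q):
        # Hellinger distance between two 1000-class probability distributions.
        return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

    def mds_coordinates(similarity, dim=3):
        # Embed photos in 3D with metric MDS from a pairwise similarity matrix.
        dissimilarity = 1.0 - similarity  # convert similarity to dissimilarity
        mds = MDS(n_components=dim, dissimilarity="precomputed", random_state=0)
        return mds.fit_transform(dissimilarity)

    def restoration_error(coords_original, coords_predicted):
        # One possible restoration check (our assumption): Procrustes disparity
        # between original and predicted 3D MDS coordinates (lower is better).
        _, _, disparity = procrustes(coords_original, coords_predicted)
        return disparity

In this sketch, lower Procrustes disparity would correspond to better restoration of the original coordinates; the paper's own accuracy measure may differ in detail.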
Why is it important?
We clarified which combination of pre-trained computer vision models is best for obtaining similarity between photographs that is close to human perception. We found that, for a photo set containing a variety of photos, the best computer vision model closely approximates human similarity perception.
Perspectives
This work shows one way to bring computer vision closer to human perception.
Hiroshi Omori
Graduate School of Agricultural and Life Sciences, The University of Tokyo
Read the Original
This page is a summary of: Predict Inter-photo Visual Similarity via Pre-trained Computer Vision Models, December 2022, ACM (Association for Computing Machinery). DOI: 10.1145/3579654.3579769