Humans effortlessly grasp the connection between sketches and real-world objects, even when these sketches are far from
realistic. Moreover, human sketch understanding goes beyond categorization: critically, it also entails recognizing how
individual elements within a sketch correspond to parts of the physical world it represents.
What are the computational ingredients needed to support this ability?
Towards answering this question, we make two contributions: first, we introduce a new sketch-photo correspondence
benchmark, PSC6K, containing 150K annotations of 6,250 sketch-photo pairs across 125 object categories,
augmenting the existing Sketchy dataset with fine-grained correspondence metadata.
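For illustration, one way such a fine-grained correspondence annotation could be represented is as a record pairing keypoint locations in the photo with their matching locations in the sketch. The field names, identifiers, and coordinates below are hypothetical stand-ins, not the released PSC6K schema:

```python
# Hypothetical example of a single sketch-photo correspondence annotation.
# All field names, identifiers, and coordinate values are illustrative
# assumptions; consult the released PSC6K data for the actual format.
annotation = {
    "category": "airplane",             # one of the 125 object categories
    "photo_id": "photo_0001",           # illustrative photo identifier
    "sketch_id": "sketch_0001",         # illustrative sketch identifier
    "keypoints": [
        # (x, y) location of a point in the photo and the matching point in the sketch
        {"photo_xy": (112, 87),  "sketch_xy": (104, 95)},
        {"photo_xy": (201, 143), "sketch_xy": (188, 150)},
    ],
}
```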
Second, we propose a self-supervised method for learning dense correspondences between sketch-photo pairs, building upon
recent advances in correspondence learning for pairs of photos.
Our model uses a spatial transformer network to estimate the warp flow between latent representations of a sketch and a
photo, which are extracted by a ConvNet backbone trained with contrastive learning.
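For concreteness, the following is a minimal PyTorch sketch of this kind of pipeline, not the paper's implementation. The class name, the ResNet-50 stand-in for the contrastively trained backbone, and the affine parameterization of the warp (in place of a dense flow field) are all illustrative assumptions:

```python
# A minimal sketch of a spatial-transformer-style warp estimator operating on
# backbone features of a sketch-photo pair. Assumptions: ResNet-50 trunk as a
# stand-in for a contrastively pretrained backbone, and a simple affine warp
# rather than a dense flow field.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class SketchPhotoWarpNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extractor shared by sketch and photo (weights would come from
        # self-supervised contrastive pretraining; here left uninitialized).
        resnet = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # B x 2048 x h x w

        # Localization head of the spatial transformer: regresses the parameters
        # of an affine warp from the concatenated sketch/photo feature maps.
        self.localization = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(2 * 2048, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 6),  # 2x3 affine matrix
        )
        # Initialize to the identity transform so training starts from "no warp".
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float)
        )

    def forward(self, sketch, photo):
        f_sketch = self.backbone(sketch)
        f_photo = self.backbone(photo)
        theta = self.localization(torch.cat([f_sketch, f_photo], dim=1)).view(-1, 2, 3)
        # The sampling grid plays the role of the warp mapping photo coordinates
        # onto sketch coordinates.
        grid = F.affine_grid(theta, sketch.size(), align_corners=False)
        warped_photo = F.grid_sample(photo, grid, align_corners=False)
        return warped_photo, grid


# Usage: warp one 256x256 photo toward its paired sketch.
model = SketchPhotoWarpNet()
sketch = torch.randn(1, 3, 256, 256)
photo = torch.randn(1, 3, 256, 256)
warped, grid = model(sketch, photo)
```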
We found that this approach outperformed several strong baselines and produced predictions that were quantitatively
consistent with other warp-based methods.
However, our benchmark also revealed systematic differences between the predictions of the models we tested and those of
humans.
Taken together, our work suggests a promising path towards developing artificial systems that achieve more human-like
understanding of visual images at different levels of abstraction.