Learning Correlation Structures for Vision Transformers

1POSTECH,  2Korea University,  3Google Research
Equal contribution

Abstract

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.

Structural Self-Attention (StructSA)

Given an input video and a query indicated by the red box in (a), the query-key correlation maps in (b) clearly reveal the structures of spatial layout and motion with respect to the query. We introduce a novel self-attention mechanism, named structural self-attention (StructSA), shown in (c), that effectively incorporates rich structural patterns of query-key correlation into contextual feature aggregation. StructSA consists of two steps: (i) structural query-key attention and (ii) contextual value aggregation. Unlike vanilla query-key attention, where individual correlation values themselves are used as attention scores, structural query-key attention takes the correlation map as a whole and detects structural patterns in it when scoring attention. The subsequent contextual value aggregation then combines the attention scores to compute diverse sets of kernel weights, which are used to dynamically collect local contexts of value features.
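Below is a minimal, single-head 2D sketch of these two steps in PyTorch. The module and parameter names (StructSASketch, struct_dim for the structure dimension D, kernel_size for M), the projection layers, and the softmax placement are illustrative assumptions for exposition, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StructSASketch(nn.Module):
    """Minimal single-head 2D sketch of StructSA (illustrative, not the reference code).

    (i) structural query-key attention: convolve each query's correlation map
        to score structural patterns.
    (ii) contextual value aggregation: turn those scores into dynamic M x M
        kernels that aggregate local value contexts.
    """

    def __init__(self, dim, struct_dim=8, kernel_size=3):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # (i) D learnable filters that detect structural patterns in the correlation map.
        self.struct_conv = nn.Conv2d(1, struct_dim, kernel_size,
                                     padding=kernel_size // 2)
        # (ii) map the D pattern scores at each location j to an M*M dynamic kernel.
        self.to_kernel = nn.Linear(struct_dim, kernel_size * kernel_size)
        self.kernel_size = kernel_size

    def forward(self, x):                                        # x: (B, H, W, C)
        B, H, W, C = x.shape
        N, M2 = H * W, self.kernel_size ** 2
        q = self.q(x).reshape(B, N, C)
        k = self.k(x).reshape(B, N, C)
        v = self.v(x)
        corr = torch.einsum('bic,bjc->bij', q, k) / C ** 0.5     # (B, N, N) correlations
        # (i) treat each query's correlation map as a 2D image and convolve it.
        scores = self.struct_conv(corr.reshape(B * N, 1, H, W))  # (B*N, D, H, W)
        scores = scores.flatten(2).softmax(dim=-1).permute(0, 2, 1)  # (B*N, N, D)
        # (ii) dynamic kernel per (query i, location j) applied to the local value context.
        kernels = self.to_kernel(scores).reshape(B, N, N, M2)    # (B, N, N, M*M)
        v_loc = F.unfold(v.permute(0, 3, 1, 2), self.kernel_size,
                         padding=self.kernel_size // 2)          # (B, C*M*M, N)
        v_loc = v_loc.reshape(B, C, M2, N).permute(0, 3, 2, 1)   # (B, N, M*M, C)
        out = torch.einsum('bijm,bjmc->bic', kernels, v_loc)     # (B, N, C)
        return out.reshape(B, H, W, C)
```

For example, `StructSASketch(dim=64)(torch.randn(2, 14, 14, 64))` returns a tensor of shape (2, 14, 14, 64); a practical implementation would add multiple heads and avoid materializing the full (N, N, M*M) kernel tensor.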

Experimental Results

Image Understanding

ImageNet-1K

*Trained with Token Labeling.

MS COCO


AP$^\textrm{b}$ and AP$^\textrm{m}$ indicate box mAP and mask mAP, respectively. We measure FLOPs at $800 \times 1280$ resolution.

ADE20K


We measure FLOPs using $512 \times 2048$ resolution images.

Video Understanding

Kinetics-400


Something-Something


Ablation Study


Effects of the structure dimension $D$ (left), kernel size $M$ (center), and contextual aggregation (right) on ImageNet-1K and Something-Something V1.

Qualitative Results

Attention map visualization of SA, ConvSA, and StructSA


The query location $i$ is set to the center of the image and the kernel size is $M = 3 \times 3$. Given (a) input images, we illustrate (b) attention maps of SA, (c) dynamic kernels $\kappa^{\mathrm{conv}}_{i,j}$, (d) final attention maps of ConvSA, i.e., aggregated weights of $\kappa^{\mathrm{conv}}_{i,j}$, (e) dynamic kernels $\kappa^{\mathrm{struct}}_{i,j}$, and (f) final attention maps of StructSA, i.e., aggregated weights of $\kappa^{\mathrm{struct}}_{i,j}$, respectively. Note that in (c) and (e), each location $j$ has an aggregation map of kernel size $M = 3 \times 3$, so we show enlarged maps for three different sampled locations $j$. The figures demonstrate that StructSA contextualizes the entire feature map in a structure-aware manner, considering objects' layouts and shapes; for instance, StructSA aggregates global contexts while distinguishing different parts of an orange (2nd row) or an ostrich (3rd row).
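As a reading aid for (d) and (f): assuming each dynamic kernel $\kappa_{i,j}$ assigns one weight to every position $p$ in the $M$-sized window $\mathcal{N}_M(j)$ around $j$ (the window notation is introduced here for illustration), the aggregated attention weight shown at $p$ is the sum of all kernel weights that fall on $p$:

$$A_i(p) = \sum_{j \,:\, p \in \mathcal{N}_M(j)} \kappa_{i,j}(p).$$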



Kernel visualization of spatio-temporal StructSA


The top row shows the input frames containing the spatio-temporal local context (indicated by green boxes) used in the dynamic kernel computation. The bottom row presents the resulting dynamic kernels $\kappa^{\mathrm{struct}}_{i,j}$ for a StructSA head when $i = j$. Note that the dynamic kernels are computed from the self-similarity map ($i = j$) to illustrate their effectiveness in capturing motion in videos. We use StructViT-S-4-1 with $M = 5 \times 5 \times 5$.

BibTeX

@inproceedings{kim2024learning,
  title     = {Learning Correlation Structures for Vision Transformers},
  author    = {Kim, Manjin and Seo, Paul Hongsuck and Schmid, Cordelia and Cho, Minsu},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2024}
}