Test-time Adaptation with Slot-Centric Models

Carnegie Mellon University, Google DeepMind, Google Research
ICML 2023
Figure: segmentation accuracy curve.

Slot-TTA improves segmentation accuracy by optimizing a reconstruction loss on each test sample.

Abstract

Current visual detectors, though impressive within their training distribution, often fail to parse out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently and have shown promising results towards generalization outside the training distribution for the task of image classification.

In our work, we find evidence that these losses are insufficient for the task of scene decomposition, without also considering architectural inductive biases. Recent slot-centric generative models attempt to decompose scenes into entities in a self-supervised manner by reconstructing pixels.

Drawing upon these two lines of work, we propose Slot-TTA, a semi-supervised slot-centric scene decomposition model that at test time is adapted per scene through gradient descent on reconstruction or cross-view synthesis objectives. We evaluate Slot-TTA across multiple input modalities, images or 3D point clouds, and show substantial out-of-distribution performance improvements against state-of-the-art supervised feed-forward detectors, and alternative test-time adaptation methods.

Slot-TTA

Figure: Slot-TTA method.

Model architecture of Slot-TTA for posed multi-view or single-view RGB images (top) and 3D point clouds (bottom).

Slot-TTA maps the input (multi-view posed) RGB images or 3D point cloud to a set of token features with an appropriate encoder backbone. It then maps these token features to a set of slot vectors using Slot Attention. Finally, it decodes each slot into its respective segmentation mask and RGB image or 3D point cloud, and fuses the per-slot renders with weighted averaging or max-pooling. For RGB images, we show results in both multi-view and single-view settings; in the multi-view setting the decoder is conditioned on a target camera viewpoint. We train Slot-TTA using reconstruction and segmentation losses. At test time, we optimize only the reconstruction loss.
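Below is a minimal PyTorch-style sketch of this forward pass for the single-view RGB case: an encoder produces token features, a simplified slot-attention loop maps them to slots, and a per-slot decoder predicts an RGB image and an alpha mask, which are fused by weighted averaging. The module names, dimensions, and the simplified attention update are illustrative assumptions, not our exact implementation.

# Minimal, illustrative Slot-TTA-style forward pass (assumed design, not the released code).
import torch
import torch.nn as nn

class SlotTTASketch(nn.Module):
    def __init__(self, num_slots=7, token_dim=64, slot_dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters = num_slots, iters
        # Encoder backbone: a tiny CNN standing in for the image / point-cloud encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, token_dim, 5, padding=2), nn.ReLU(),
            nn.Conv2d(token_dim, token_dim, 5, padding=2), nn.ReLU())
        self.slots_init = nn.Parameter(torch.randn(num_slots, slot_dim) * 0.02)
        self.to_q = nn.Linear(slot_dim, slot_dim, bias=False)
        self.to_k = nn.Linear(token_dim, slot_dim, bias=False)
        self.to_v = nn.Linear(token_dim, slot_dim, bias=False)
        self.gru = nn.GRUCell(slot_dim, slot_dim)
        # Per-slot decoder: predicts RGB plus an alpha (mask) logit for every pixel.
        self.decoder = nn.Sequential(
            nn.Linear(slot_dim, 256), nn.ReLU(),
            nn.Linear(256, 64 * 64 * 4))  # 4 = RGB + alpha logit

    def forward(self, images):                       # images: (B, 3, 64, 64)
        B = images.shape[0]
        tokens = self.encoder(images).flatten(2).transpose(1, 2)   # (B, N, D)
        slots = self.slots_init.expand(B, -1, -1)                  # (B, K, D)
        k, v = self.to_k(tokens), self.to_v(tokens)
        for _ in range(self.iters):                  # simplified Slot Attention
            q = self.to_q(slots)
            attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=1)
            attn = attn / attn.sum(-1, keepdim=True)                # weighted mean over tokens
            updates = attn @ v                                      # (B, K, D)
            slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])).view_as(slots)
        dec = self.decoder(slots).view(B, self.num_slots, 4, 64, 64)
        rgb_per_slot, alpha = dec[:, :, :3], dec[:, :, 3:]
        masks = torch.softmax(alpha, dim=1)          # per-pixel slot assignment (segmentation)
        recon = (masks * rgb_per_slot).sum(dim=1)    # weighted average of per-slot renders
        return recon, masks

During training, the reconstruction loss is applied to recon and a segmentation loss to masks; at test time only the reconstruction loss is used.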

Decomposing 3D point clouds

We visualize point cloud reconstruction and segmentation across test-time adaptation steps. As can be seen, segmentation improves as we optimize the reconstruction objective via gradient descent at test time on a single test sample. In this setup, we train with supervision on a subset of PartNet categories and test on novel categories.
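The adaptation loop itself is the same in every setting. A minimal sketch, reusing the hypothetical SlotTTASketch module above (the optimizer choice, learning rate, and step count are illustrative assumptions), looks as follows.

# Illustrative per-sample test-time adaptation loop (assumed hyperparameters).
import copy
import torch
import torch.nn.functional as F

def adapt_on_single_sample(trained_model, test_image, steps=100, lr=1e-4):
    # Adapt a fresh copy so every test sample starts from the trained weights.
    model = copy.deepcopy(trained_model)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        recon, masks = model(test_image)             # test_image: (1, 3, 64, 64)
        loss = F.mse_loss(recon, test_image)         # reconstruction loss only at test time
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        _, masks = model(test_image)
    return masks.argmax(dim=1)                       # per-pixel slot assignment after adaptation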

Figures: segmentation accuracy curves.

More examples.

Decomposing RGB images in multi-view scenes

We visualize RGB reconstruction and segmentation across test-time adaptation steps. As can be seen, segmentation of the target viewpoint improves as we optimize the view-synthesis objective via gradient descent at test time on a single test scene. In this setup, we train with supervision on Kubric's MSN-Easy and test on MSN-Hard.
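In the multi-view setting, the decoder additionally takes a target camera pose, so the test-time objective is cross-view synthesis: render the scene from a held-out viewpoint and compare against that view's RGB. Below is a minimal sketch of such a pose-conditioned decoder and loss; the decoder interface and flattened pose encoding are illustrative assumptions.

# Illustrative pose-conditioned decoding and cross-view synthesis loss (assumed interface).
import torch
import torch.nn as nn

class PoseConditionedDecoder(nn.Module):
    def __init__(self, slot_dim=64, pose_dim=12, out_hw=64):
        super().__init__()
        self.out_hw = out_hw
        self.mlp = nn.Sequential(
            nn.Linear(slot_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, out_hw * out_hw * 4))     # RGB + alpha logit per pixel

    def forward(self, slots, target_pose):           # slots: (B, K, D); target_pose: (B, pose_dim)
        B, K, _ = slots.shape
        pose = target_pose.unsqueeze(1).expand(-1, K, -1)
        dec = self.mlp(torch.cat([slots, pose], dim=-1)).view(B, K, 4, self.out_hw, self.out_hw)
        rgb, alpha = dec[:, :, :3], dec[:, :, 3:]
        masks = torch.softmax(alpha, dim=1)          # per-pixel slot assignment in the target view
        return (masks * rgb).sum(dim=1), masks       # fused render for the target viewpoint

def view_synthesis_loss(decoder, slots, target_pose, target_rgb):
    # Compare the render from a held-out viewpoint against that view's ground-truth RGB.
    render, _ = decoder(slots, target_pose)
    return torch.nn.functional.mse_loss(render, target_rgb)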

Figures: segmentation accuracy curves.

Predicted 360° multi-view videos of RGB and segmentation masks.

We find that Slot-TTA's novel-view synthesis quality improves after test-time adaptation.

Decomposing 2D RGB images

We visualize RGB reconstruction and segmentation across test-time adaptation steps. As can be seen, segmentation improves as we optimize the reconstruction objective via gradient descent at test time on a single test sample. In this setup, we train with supervision on CLEVR and test on ClevrTex.

Figures: segmentation accuracy curves.

BibTeX

@misc{prabhudesai2022test,
  title={Test-time Adaptation with Slot-Centric Models},
  author={Prabhudesai, Mihir and Goyal, Anirudh and Paul, Sujoy and van Steenkiste, Sjoerd and Sajjadi, Mehdi S. M. and Aggarwal, Gaurav and Kipf, Thomas and Pathak, Deepak and Fragkiadaki, Katerina},
  year={2022},
  eprint={2203.11194},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}