FORLA: Federated Object-centric Representation Learning with Slot Attention

Guiqiu Liao1 Matjaž Jogan1 Eric Eaton2 Daniel A. Hashimoto1,2
1PCASO Laboratory, Dept. of Surgery, University of Pennsylvania
2Dept. of Computer and Information Science, University of Pennsylvania
NeurIPS 2025
FORLA two-stage pipeline
Learning efficient visual representations across heterogeneous unlabeled datasets remains a central challenge in federated learning. Effective federated representations require features that are jointly informative across clients while disentangling client-specific factors without supervision. We thus introduce FORLA, a novel framework for federated object-centric representation learning and feature adaptation using unsupervised slot attention. At the core of our method is a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that learns to reconstruct the adapted features. To optimize this adapter, we design a two-branch student–teacher architecture. In each client, a student decoder learns to reconstruct full features from foundation models, while a teacher decoder reconstructs their adapted, low-dimensional counterparts. The shared slot attention module bridges cross-domain learning by aligning object-level representations across clients. Experiments on multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. This work highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.
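The snippet below is a minimal PyTorch sketch of the per-client pieces described above, assuming frozen DINO/SAM patch features as input. Module names, dimensions, and the way the two reconstruction losses are combined are illustrative assumptions, not the official implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAdapter(nn.Module):
    # Shared adapter: projects high-dimensional foundation-model features (e.g. DINO/SAM
    # patch tokens) into a compact space whose weights are aggregated across clients.
    def __init__(self, in_dim=768, out_dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                  nn.Linear(out_dim, out_dim))

    def forward(self, feats):                      # feats: (B, N, in_dim)
        return self.proj(feats)


class SlotAttention(nn.Module):
    # Simplified slot attention (Locatello et al., 2020) run on adapted features.
    def __init__(self, num_slots=7, dim=64, iters=3):
        super().__init__()
        self.iters, self.scale = iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slot = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, slots=None):              # x: (B, N, dim)
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        if slots is None:
            slots = self.slots_init.expand(x.shape[0], -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slot(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete per token
            updates = (attn / (attn.sum(-1, keepdim=True) + 1e-8)) @ v  # (B, S, dim)
            slots = self.gru(updates.flatten(0, 1), slots.flatten(0, 1)).view_as(updates)
        return slots, attn


class BroadcastDecoder(nn.Module):
    # MLP broadcast decoder (in the spirit of DINOSAUR): each slot predicts per-position
    # features plus an alpha logit; per-slot predictions are alpha-blended.
    def __init__(self, slot_dim=64, out_dim=768, num_pos=196):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(1, 1, num_pos, slot_dim) * 0.02)
        self.mlp = nn.Sequential(nn.Linear(slot_dim, 256), nn.GELU(),
                                 nn.Linear(256, out_dim + 1))

    def forward(self, slots):                      # slots: (B, S, slot_dim)
        out = self.mlp(slots.unsqueeze(2) + self.pos)       # (B, S, N, out_dim + 1)
        feats, alpha = out[..., :-1], out[..., -1:].softmax(dim=1)
        return (feats * alpha).sum(dim=1)                   # (B, N, out_dim)


def client_losses(foundation_feats, adapter, slot_attn, student_dec, teacher_dec):
    # foundation_feats: (B, N, 768) frozen DINO/SAM features for one client's batch.
    adapted = adapter(foundation_feats)
    slots, _ = slot_attn(adapted)
    loss_student = F.mse_loss(student_dec(slots), foundation_feats)     # reconstruct full features
    loss_teacher = F.mse_loss(teacher_dec(slots), adapted.detach())     # reconstruct adapted features
    return loss_student + loss_teacher

In this sketch, student_dec would be instantiated with out_dim matching the foundation feature width (e.g. 768) and teacher_dec with the adapter output width (e.g. 64); how gradients flow between the two branches and how the losses are weighted are details of the paper, not of this sketch.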

Challenges

(1) Learn features that are jointly informative across clients without any labels.
(2) Disentangle client-specific factors while preserving shared object structure.

Solutions

(1) A shared feature adapter and slot attention module trained collaboratively across clients.
(2) A teacher–student architecture with EMA updates and local + global FedAvg aggregation to align object-level slots across clients (a sketch follows below).
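Below is a minimal sketch, assuming PyTorch modules like those in the sketch after the abstract, of the two pieces of bookkeeping in (2): an EMA teacher update on each client and size-weighted FedAvg over the shared adapter and slot-attention weights. The decay value and the single-level averaging are illustrative defaults; the paper's exact local + global schedule may differ.

import torch


@torch.no_grad()
def ema_update(teacher, student, decay=0.996):
    # Teacher parameters track the student with an exponential moving average.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)


@torch.no_grad()
def fedavg(global_module, client_modules, client_sizes):
    # Size-weighted FedAvg over one shared module (e.g. the adapter or slot attention).
    total = float(sum(client_sizes))
    avg_state = {
        key: sum(m.state_dict()[key] * (n / total)
                 for m, n in zip(client_modules, client_sizes))
        for key in global_module.state_dict()
    }
    global_module.load_state_dict(avg_state)
    return global_module


# Schematic round: every client runs local steps on its own unlabeled data, calls
# ema_update on its teacher branch, and the server then calls fedavg on the shared
# adapter and slot-attention modules before broadcasting them back to all clients.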

Multi-domain unsupervised FL

Stage 1 overview

FORLA can be used for unsupervised federated feature adaptation as well as for unsupervised federated object segmentation.
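As an illustration of the segmentation use case, the snippet below (an assumption-level sketch, not the official inference code) turns the (B, S, N) slot attention matrix produced by a SlotAttention module like the one sketched earlier into hard object masks at image resolution, with N = h*w patch tokens.

import torch
import torch.nn.functional as F


def slot_segmentation(attn, h, w, image_hw):
    # attn: (B, S, N) slot attention over N = h*w patch tokens.
    B, S, N = attn.shape
    maps = attn.view(B, S, h, w)
    maps = F.interpolate(maps, size=image_hw, mode="bilinear", align_corners=False)
    return maps.argmax(dim=1)          # (B, H_img, W_img): each pixel's dominant slot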

Qualitative comparison grid (four samples): Input | SAM (slot-attention adapted) | DINO (slot-attention adapted) | FORLA.

Comparison across SAM (features adapted with slot attention, SA), DINO (SA adapted), and FORLA on sample videos. With vanilla RNN video inference, FORLA produces more robust slot attention masks without using additional modalities such as motion or depth. A sketch of this inference scheme follows.
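The sketch below illustrates the vanilla RNN video inference mentioned in the caption, assuming the adapter and SlotAttention modules from the earlier sketch: slots estimated on one frame initialize slot attention on the next, so slot identities persist over time without motion or depth cues.

import torch


@torch.no_grad()
def video_inference(frame_features, adapter, slot_attn):
    # frame_features: list of (B, N, in_dim) foundation-model features, one entry per frame.
    slots, per_frame_attn = None, []
    for feats in frame_features:
        slots, attn = slot_attn(adapter(feats), slots=slots)   # warm-start from previous frame
        per_frame_attn.append(attn)
    return per_frame_attn                                       # per-frame (B, S, N) slot masks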

Individual/Centralized training vs. FORLA


FORLA scales slot attention models across different domain combinations and outperforms both individual and centralized training.

FL representation on surgical domain


FL representation on natural domain


BibTeX

@inproceedings{liao2025forla,
  title     = {FORLA: Federated Object-centric Representation Learning with Slot Attention},
  author    = {Liao, Guiqiu and Jogan, Matjaž and Eaton, Eric and Hashimoto, Daniel A.},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}