FORLA: Federated Object-centric Representation Learning with Slot Attention

Guiqiu Liao1 Matjaž Jogan1 Eric Eaton2 Daniel A. Hashimoto1,2
1PCASO Laboratory, Dept. of Surgery, University of Pennsylvania
2Dept. of Computer and Information Science, University of Pennsylvania
NeurIPS 2025
FORLA two-stage pipeline
Learning efficient visual representations across heterogeneous unlabeled datasets remains a central challenge in federated learning. Effective federated representations require features that are jointly informative across clients while disentangling client-specific factors without supervision. We thus introduce FORLA, a novel framework for federated object-centric representation learning and feature adaptation using unsupervised slot attention. At the core of our method is a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that learns to reconstruct the adapted features. To optimize this adapter, we design a two-branch student–teacher architecture. In each client, a student decoder learns to reconstruct full features from foundation models, while a teacher decoder reconstructs their adapted, low-dimensional counterparts. The shared slot attention module bridges cross-domain learning by aligning object-level representations across clients. Experiments on multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. This work highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.
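The snippet below is a minimal PyTorch sketch of the per-client pieces described above, assuming frozen DINO/SAM patch features as input. Module names, dimensions, and the way the two reconstruction losses are combined are illustrative assumptions, not the official implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAdapter(nn.Module):
    # Shared adapter: projects high-dimensional foundation-model features (e.g. DINO/SAM
    # patch tokens) into a compact space whose weights are aggregated across clients.
    def __init__(self, in_dim=768, out_dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                  nn.Linear(out_dim, out_dim))

    def forward(self, feats):                      # feats: (B, N, in_dim)
        return self.proj(feats)


class SlotAttention(nn.Module):
    # Simplified slot attention (Locatello et al., 2020) run on adapted features.
    def __init__(self, num_slots=7, dim=64, iters=3):
        super().__init__()
        self.iters, self.scale = iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slot = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, slots=None):              # x: (B, N, dim)
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        if slots is None:
            slots = self.slots_init.expand(x.shape[0], -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slot(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete per token
            updates = (attn / (attn.sum(-1, keepdim=True) + 1e-8)) @ v  # (B, S, dim)
            slots = self.gru(updates.flatten(0, 1), slots.flatten(0, 1)).view_as(updates)
        return slots, attn


class BroadcastDecoder(nn.Module):
    # MLP broadcast decoder (in the spirit of DINOSAUR): each slot predicts per-position
    # features plus an alpha logit; per-slot predictions are alpha-blended.
    def __init__(self, slot_dim=64, out_dim=768, num_pos=196):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(1, 1, num_pos, slot_dim) * 0.02)
        self.mlp = nn.Sequential(nn.Linear(slot_dim, 256), nn.GELU(),
                                 nn.Linear(256, out_dim + 1))

    def forward(self, slots):                      # slots: (B, S, slot_dim)
        out = self.mlp(slots.unsqueeze(2) + self.pos)       # (B, S, N, out_dim + 1)
        feats, alpha = out[..., :-1], out[..., -1:].softmax(dim=1)
        return (feats * alpha).sum(dim=1)                   # (B, N, out_dim)


def client_losses(foundation_feats, adapter, slot_attn, student_dec, teacher_dec):
    # foundation_feats: (B, N, 768) frozen DINO/SAM features for one client's batch.
    adapted = adapter(foundation_feats)
    slots, _ = slot_attn(adapted)
    loss_student = F.mse_loss(student_dec(slots), foundation_feats)     # reconstruct full features
    loss_teacher = F.mse_loss(teacher_dec(slots), adapted.detach())     # reconstruct adapted features
    return loss_student + loss_teacher

In this sketch, student_dec would be instantiated with out_dim matching the foundation feature width (e.g. 768) and teacher_dec with the adapter output width (e.g. 64); how gradients flow between the two branches and how the losses are weighted are details of the paper, not of this sketch.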

Challenges

(1) Learn features that are jointly informative across clients without any labels.
(2) Disentangle client-specific factors while preserving shared object structure.

Solutions

(1) A shared feature adapter and slot attention module trained collaboratively across clients.
(2) A teacher–student architecture with EMA updates and local + global FedAvg aggregation to align object-level slots across clients (a sketch follows below).
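Below is a minimal sketch, assuming PyTorch modules like those in the sketch after the abstract, of the two pieces of bookkeeping in (2): an EMA teacher update on each client and size-weighted FedAvg over the shared adapter and slot-attention weights. The decay value and the single-level averaging are illustrative defaults; the paper's exact local + global schedule may differ.

import torch


@torch.no_grad()
def ema_update(teacher, student, decay=0.996):
    # Teacher parameters track the student with an exponential moving average.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)


@torch.no_grad()
def fedavg(global_module, client_modules, client_sizes):
    # Size-weighted FedAvg over one shared module (e.g. the adapter or slot attention).
    total = float(sum(client_sizes))
    avg_state = {
        key: sum(m.state_dict()[key] * (n / total)
                 for m, n in zip(client_modules, client_sizes))
        for key in global_module.state_dict()
    }
    global_module.load_state_dict(avg_state)
    return global_module


# Schematic round: every client runs local steps on its own unlabeled data, calls
# ema_update on its teacher branch, and the server then calls fedavg on the shared
# adapter and slot-attention modules before broadcasting them back to all clients.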

Multi-domain unsupervised FL

Stage 1 overview

FORLA can be used for unsupervised federated feature adaptation as well as for unsupervised federated object segmentation.
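As an illustration of the segmentation use case, the snippet below (an assumption-level sketch, not the official inference code) turns the (B, S, N) slot attention matrix produced by a SlotAttention module like the one sketched earlier into hard object masks at image resolution, with N = h*w patch tokens.

import torch
import torch.nn.functional as F


def slot_segmentation(attn, h, w, image_hw):
    # attn: (B, S, N) slot attention over N = h*w patch tokens.
    B, S, N = attn.shape
    maps = attn.view(B, S, h, w)
    maps = F.interpolate(maps, size=image_hw, mode="bilinear", align_corners=False)
    return maps.argmax(dim=1)          # (B, H_img, W_img): each pixel's dominant slot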

Qualitative comparison grid (four samples): Input | SAM (slot-attention adapted) | DINO (slot-attention adapted) | FORLA.

Comparison across SAM (features adapted with slot attention, SA), DINO (SA adapted), and FORLA on sample videos. With vanilla RNN video inference, FORLA produces more robust slot attention masks without using additional modalities such as motion or depth. A sketch of this inference scheme follows.
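The sketch below illustrates the vanilla RNN video inference mentioned in the caption, assuming the adapter and SlotAttention modules from the earlier sketch: slots estimated on one frame initialize slot attention on the next, so slot identities persist over time without motion or depth cues.

import torch


@torch.no_grad()
def video_inference(frame_features, adapter, slot_attn):
    # frame_features: list of (B, N, in_dim) foundation-model features, one entry per frame.
    slots, per_frame_attn = None, []
    for feats in frame_features:
        slots, attn = slot_attn(adapter(feats), slots=slots)   # warm-start from previous frame
        per_frame_attn.append(attn)
    return per_frame_attn                                       # per-frame (B, S, N) slot masks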

Individual/Centralized training vs. FORLA


FORLA scales slot attention models across different domain combinations and outperforms both individual and centralized training.

FL representation on surgical domain


FL representation on natural domain


BibTeX

@inproceedings{liao2025forla,
  title     = {FORLA: Federated Object-centric Representation Learning with Slot Attention},
  author    = {Liao, Guiqiu and Jogan, Matjaž and Eaton, Eric and Hashimoto, Daniel A.},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}