Affordance grounding, the task of localizing object regions from natural language descriptions of interactions, is a key capability for intelligent agents that understand and interact with their environments.
However, this task remains challenging due to the need for fine-grained part-level localization,
the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets.
In this work, we introduce Affogato,
a large-scale benchmark comprising 150K instances annotated with open-vocabulary text descriptions
and corresponding 3D affordance heatmaps across a diverse set of objects and interactions.
Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder.
Our models trained on the Affogato dataset achieve promising performance on existing 2D and 3D benchmarks and, notably, generalize effectively in open-vocabulary, cross-domain settings.
The Affogato dataset is publicly available at https://huggingface.co/datasets/project-affogato/affogato.