Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale

1Pohang University of Science and Technology (POSTECH), 2RLWRLD

* indicates equal contribution

Affogato is the largest 3D affordance grounding dataset to date.
Our dataset provides 150K 3D object instances with open-vocabulary natural text queries paired with spatially localized heatmap annotations, surpassing all existing datasets in scale and diversity.

Abstract

Affordance grounding, the task of localizing object regions based on natural language descriptions of interactions, is critical for enabling intelligent agents to understand and interact with their environments. However, the task remains challenging due to the need for fine-grained part-level localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets.

In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Models trained on Affogato achieve promising performance on existing 2D and 3D benchmarks and, notably, generalize effectively across domains in the open-vocabulary setting. The Affogato dataset is publicly available at https://huggingface.co/datasets/project-affogato/affogato.

Affogato-Engine

Data Generation Pipeline Diagram
Affogato-Engine is a three-stage automated pipeline that generates open-vocabulary affordance annotations for 3D objects by combining vision-language models and multi-view aggregation.
  • Stage 1. Open-vocabulary affordance query generation: Given multi-view rendered images of a 3D object, Gemma3 generates natural language affordance queries via chain-of-thought prompting.
  • Stage 2. Language-guided interaction point prediction: Given each affordance query and the rendered views, Molmo predicts 2D interaction points that localize the queried interaction.
  • Stage 3. Affordance heatmap generation and aggregation: The Molmo-predicted points serve as prompts for SAM, which generates 2D segmentation masks; the masks are projected onto the object and aggregated across views by voting to form 3D affordance heatmaps.
  • Rendering 2D affordance heatmaps: We project the aggregated 3D affordance heatmaps onto 2D image planes to obtain consistent and accurate 2D affordance heatmaps. The resulting 2D data is used to train a 2D affordance grounding model.
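
The aggregation and rendering steps above can be illustrated with a small sketch. This is not the Affogato-Engine code: the `Projector` camera type, the function names, the toy orthographic cameras, and the choice to normalize votes by the number of views are all assumptions made for illustration, and occlusion handling is omitted. Each view contributes one vote to every 3D point whose projection falls inside that view's 2D mask, and the resulting per-point heatmap is then projected back onto an image plane.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]
# Hypothetical camera model: maps a 3D point to the pixel it lands on.
Projector = Callable[[Point3D], Pixel]


def aggregate_heatmap(
    points: Sequence[Point3D],
    views: Sequence[Tuple[Projector, Set[Pixel]]],
) -> List[float]:
    """Vote per 3D point: one vote for each view whose mask covers its projection."""
    votes = [0] * len(points)
    for project, mask in views:
        for i, point in enumerate(points):
            if project(point) in mask:
                votes[i] += 1
    if not views:
        return [0.0] * len(points)
    # Assumption: normalize by the number of views so values lie in [0, 1].
    return [v / len(views) for v in votes]


def render_heatmap(
    points: Sequence[Point3D],
    heat: Sequence[float],
    project: Projector,
) -> Dict[Pixel, float]:
    """Project per-point heat onto one image plane, keeping the max per pixel."""
    image: Dict[Pixel, float] = {}
    for point, value in zip(points, heat):
        pixel = project(point)
        image[pixel] = max(image.get(pixel, 0.0), value)
    return image


# Toy orthographic cameras (assumed for the example, no occlusion test).
def front(p: Point3D) -> Pixel:
    return (int(p[0]), int(p[1]))


def side(p: Point3D) -> Pixel:
    return (int(p[2]), int(p[1]))


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
views = [
    (front, {(1, 0), (1, 1)}),  # mask pixels in the front view
    (side, {(0, 1)}),           # mask pixels in the side view
]
heat = aggregate_heatmap(points, views)
front_image = render_heatmap(points, heat, front)
```

In this toy setup the third point is covered by both masks and receives the highest value, while the first point is covered by neither. Because the 2D heatmap is rendered from the aggregated 3D values rather than from a single view's mask, it reflects the evidence from all views.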

Dataset Analysis

  • Semantic breadth of object classes and affordance queries: Affogato provides a large and diverse open-vocabulary dataset with over 150K 3D object instances and 750K affordance query–heatmap pairs.
    Object classes

    Affordance classes

  • High coverage and diversity of affordance annotations: Our heatmaps capture a wide range of interaction patterns, from fine-grained point interactions (e.g., pressing a button) to broad surface-level actions (e.g., holding). Affogato significantly exceeds existing datasets in both annotation diversity and coverage.

    Diverse annotations

    Heatmap quality

  • Comparison with prior datasets: Affogato overcomes key limitations of prior 3D affordance datasets (such as annotation mismatch, incomplete heatmap coverage, and insufficient data resolution) by generating dense, semantically aligned heatmaps through an automated pipeline.

Model Architecture

We present Espresso, a minimal yet effective architecture for open-vocabulary affordance grounding. Built upon a shared design, Espresso comes in two variants: Espresso-3D for point clouds and Espresso-2D for images. Each model comprises a modality-specific visual encoder, a text encoder, and a text-conditioned heatmap decoder. Instead of using learnable queries, the decoder leverages text embeddings as queries, supporting open-vocabulary affordance grounding without predefined categories.
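
To make the idea of using text embeddings as decoder queries concrete, here is a minimal sketch. It is not the Espresso implementation: the function names, the single cross-attention step, the residual update, and the sigmoid scoring are assumptions chosen for illustration, and the real models use learned encoders and decoder layers. The text embedding acts as the query, attends over per-point (or per-pixel) visual features, and the updated query is scored against each feature to produce a heatmap.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_query_heatmap(
    text_emb: Sequence[float],
    visual_feats: Sequence[Sequence[float]],
) -> List[float]:
    """One cross-attention step with the text embedding as the query,
    followed by per-feature scoring in (0, 1)."""
    scale = 1.0 / math.sqrt(len(text_emb))
    # The text query attends over all visual features.
    attn = softmax([scale * dot(text_emb, f) for f in visual_feats])
    context = [
        sum(w * f[d] for w, f in zip(attn, visual_feats))
        for d in range(len(text_emb))
    ]
    # Residual update of the query (assumed; no learned projections here).
    query = [q + c for q, c in zip(text_emb, context)]
    # Score every visual feature against the updated query.
    return [
        1.0 / (1.0 + math.exp(-scale * dot(query, f)))
        for f in visual_feats
    ]


# Toy 2-D embeddings standing in for encoder outputs.
text_emb = [1.0, 0.0]
visual_feats = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
heatmap = text_query_heatmap(text_emb, visual_feats)
```

Because the query comes from a text encoder rather than from a fixed set of learned query vectors, any natural language description can be scored against the visual features, which is what removes the need for a predefined category list.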

Experimental Results

3D Affordance Grounding

Cross-dataset Evaluation

Cross-dataset generalization

Training on Affogato leads to significantly better cross-dataset generalization due to greater scale and diversity.

Open-vocabulary Generalization

Espresso-3D consistently outperforms existing methods on cross-category splits, demonstrating superior generalization to unseen object categories.


2D Affordance Grounding

Zero-shot evaluation

Zero-shot evaluation on AGD20K

Comparison with training data

Pre-training on the 2D rendered Affogato dataset significantly improves performance on AGD20K,
demonstrating strong zero-shot and fine-tuned generalization despite the domain gap.

Visualizations

3D AffordanceNet (Top) vs. Affogato-Engine (Bottom)

Qualitative results of Espresso-3D on the LASO test split

Qualitative results of Espresso-2D (Affogato pretrained, AGD20K-Full fine-tuned) on AGD20K

BibTeX

@article{lee2025affogato,
  title={Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale},
  author={Lee, Junha and Park, Eunha and Park, Chunghyun and Kang, Dahyun and Cho, Minsu},
  journal={arXiv preprint arXiv:2506.12009},
  year={2025}
}