Affordance grounding, the task of localizing object regions from natural language descriptions of interactions, is a key capability for intelligent agents that understand and interact with their environments.
However, this task remains challenging due to the need for fine-grained part-level localization,
the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets.
In this work, we introduce Affogato,
a large-scale benchmark comprising 150K instances annotated with open-vocabulary text descriptions
and corresponding 3D affordance heatmaps across a diverse set of objects and interactions.
Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder.
Our models trained on the Affogato dataset achieve promising performance on existing 2D and 3D benchmarks and, notably, generalize effectively in open-vocabulary, cross-domain settings.
The Affogato dataset is publicly available at https://huggingface.co/datasets/project-affogato/affogato.