CAVER

Curious Audiovisual Exploring Robot

Anonymous Authors

Abstract

Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object’s visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D-printed end-effector, attachable to parallel grippers, that excites objects’ audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner, prioritizing interaction with high-uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations.

Overview

Method Overview

CAVER utilizes a curiosity-driven exploration strategy to build multimodal representations. The system integrates local visual features, global scene context, and acoustic responses generated by our custom 3D-printed end-effector.
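The fusion of local visual features, global scene context, and acoustic responses can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the function name `audiovisual_embedding`, the concatenation scheme, and the per-modality weights are all assumptions for exposition.

```python
import numpy as np

def audiovisual_embedding(local_visual, global_visual, audio, weights=(1.0, 1.0, 1.0)):
    """Hypothetical fusion: L2-normalize each modality's feature vector,
    scale it by a per-modality weight, and concatenate the results into
    a single audiovisual descriptor."""
    parts = []
    for feat, w in zip((local_visual, global_visual, audio), weights):
        feat = np.asarray(feat, dtype=float)
        norm = np.linalg.norm(feat)
        # Guard against zero vectors before normalizing.
        parts.append(w * feat / norm if norm > 0 else w * feat)
    return np.concatenate(parts)
```

Normalizing each modality before concatenation keeps any one feature stream (e.g., a high-dimensional audio spectrum) from dominating the joint representation by magnitude alone.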

Experiments

We evaluate CAVER across three primary scenarios:

  • Active Exploration: Comparing curiosity-based interaction against random and uncertainty-only baselines.
  • Material Classification: Testing how audiovisual features improve the identification of diverse objects.
  • Audio-to-Action Imitation: Demonstrating the robot's ability to reproduce sounds from audio-only human demonstrations.
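The curiosity-based interaction strategy contrasted with the baselines above can be sketched as a greedy loop: repeatedly interact with the object whose audio response is least predictable, then reduce its uncertainty to model the knowledge gained. The uncertainty scores, the decay factor, and the function names here are illustrative assumptions, not the paper's algorithm.

```python
def select_object(uncertainties):
    """Curiosity step: pick the object id with the highest
    predictive uncertainty (ties broken arbitrarily)."""
    return max(uncertainties, key=uncertainties.get)

def explore(uncertainties, decay=0.5, steps=3):
    """Greedy curiosity-driven exploration sketch: at each step,
    interact with the most uncertain object, then decay its
    uncertainty to reflect what the interaction revealed."""
    u = dict(uncertainties)   # don't mutate the caller's dict
    history = []
    for _ in range(steps):
        obj = select_object(u)
        history.append(obj)
        u[obj] *= decay
    return history
```

Under this sketch, `explore({"mug": 0.9, "box": 0.2, "can": 0.5})` visits the mug first, then the can once the mug's uncertainty has decayed below it, whereas a random baseline would spend interactions on already-predictable objects.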

Results

Our results indicate that CAVER achieves higher coverage of unique acoustic signatures with 30% fewer interactions than standard baselines. Furthermore, the combined audiovisual representation significantly outperforms vision-only models in cross-modal retrieval tasks.