Sounds That Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling

Chung-Ang University, Republic of Korea

ICASSP 2026

More details coming soon!

"Imagine What Shape Sound Could Take"


tl;dr: We introduce an audio-driven 3D mesh and texture generation system that creates 3D content from a single audio input using pretrained 2D diffusion models.


Method Overview



Audio-to-3D Generation Results


Audio: 🔊 (Fire Crackling)

A23D Result:

Audio: 🔊 (Forest)

A23D Result:

Audio: 🔊 (Underwater Bubbling)

A23D Result:

Audio: 🔊 (Snow)

A23D Result:

Our system generates 3D meshes and textures from a single audio file using pretrained 2D diffusion models.


Audio-Driven Text-to-3D Variation Results


Audio: 🔊 (Fire Crackling)
Text: 💬 "A vase"

Ours:

Audio: 🔊 (Underwater)
Text: 💬 "A shoe"

Ours:

Audio: 🔊 (Forest)
Text: 💬 "A cup"

Ours:

Audio: 🔊 (Splashing Water)
Text: 💬 "A chair"

Ours:

With auxiliary text prompts, our attribute-decoupled denoising guidance enables more expressive and realistic 3D content, as sketched below.
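Conceptually, such decoupled guidance can be pictured as a classifier-free-guidance-style composition in which the text and audio conditions each contribute their own guidance direction. The sketch below is a minimal illustration under assumed interfaces, not the paper's exact formulation: `unet`, `text_emb`, `audio_emb`, and the guidance weights are all hypothetical placeholders.

```python
import torch

def decoupled_guidance(x_t, t, text_emb, audio_emb, unet,
                       w_text=7.5, w_audio=3.0):
    """CFG-style composition of separate text and audio guidance.
    `unet(x, t, cond=...)` is an assumed interface for a pretrained
    diffusion model; the weights are illustrative, not tuned values."""
    eps_uncond = unet(x_t, t, cond=None)       # null-condition baseline
    eps_text = unet(x_t, t, cond=text_emb)     # object identity from text
    eps_audio = unet(x_t, t, cond=audio_emb)   # ambient attributes from audio
    # Each modality adds its own guidance direction, so the two
    # conditions steer denoising independently instead of being
    # flattened into a single mixed prompt.
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_audio * (eps_audio - eps_uncond))
```

The combined noise prediction then drives the score distillation update in place of a single-condition estimate.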


Ablation Study: Cross-Modal 3D Generation


Audio: 🔊 (Null)
Text: 💬 "A chair with fire crackling ..."

Text-to-2D:

Audio: 🔊 (Null)
Text: 💬 "A chair with fire crackling ..."

Text-to-3D:

Audio: 🔊 (Fire Crackling)
Text: 💬 "A chair"

Pure Audio-Driven Text-to-3D:

Audio: 🔊 (Fire Crackling)
Text: 💬 "A chair"

Ours:

Our attribute-decoupled guidance produces more realistic 3D structures, aligning each property with its corresponding input: object identity with the text prompt and ambient attributes with the audio.


Q&A


  1. Why Gaussian Splatting instead of NeRFs?

    Gaussian Splatting offers a good balance between quality and efficiency in training time, interpretability, and compute. No one wants to wait 5+ hours to create a simple 3D object. Using Gaussian Splatting as the 3D representation allows fast optimization and easy manipulation, producing a 3D object within 2 minutes using 12 GB of VRAM.


  2. Why SDS?

    At the time of this research, SDS-based 3D generation methods (e.g., DreamFusion, DreamGaussian) offered substantial flexibility: they enable 3D content creation from a single condition using pretrained diffusion models, with no 3D-aware training or paired cross-modal data required. Thanks to this capability, we can build our audio-to-3D system directly on top of pretrained audio-to-image diffusion models. A single SDS update is sketched below.
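For reference, one SDS optimization step looks roughly like the following. This is a minimal DreamFusion-style sketch, not our exact implementation; `render_fn`, `diffusion.add_noise`, and `diffusion.pred_noise` are assumed interfaces standing in for a differentiable renderer and a frozen pretrained 2D diffusion model.

```python
import torch

def sds_step(render_fn, diffusion, cond_emb, optimizer,
             num_train_steps=1000, t_min=0.02, t_max=0.98):
    """Single Score Distillation Sampling update (DreamFusion-style sketch).
    `render_fn` differentiably renders the 3D scene (e.g., Gaussians) from
    a random camera; `diffusion` is a frozen pretrained 2D model. Both
    interfaces are illustrative assumptions, not a specific library API."""
    image = render_fn()                            # (1, C, H, W), requires grad
    t = torch.randint(int(t_min * num_train_steps),
                      int(t_max * num_train_steps), (1,),
                      device=image.device)
    noise = torch.randn_like(image)
    noisy = diffusion.add_noise(image, noise, t)   # forward diffusion q(x_t | x_0)
    with torch.no_grad():                          # the 2D prior stays frozen
        eps_pred = diffusion.pred_noise(noisy, t, cond_emb)
    # SDS gradient is (eps_pred - noise); the timestep weighting w(t)
    # is omitted here for brevity.
    grad = eps_pred - noise
    # Surrogate loss whose gradient w.r.t. `image` equals `grad`, so
    # backprop flows only through the renderer, not the diffusion U-Net.
    loss = (grad.detach() * image).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the gradient stops at the rendered image, the diffusion model stays frozen and only the 3D representation is optimized.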