Sounds That Shape: Audio-Driven 3D Mesh Generation with Attribute-Decoupled Score Distillation Sampling

1Chung-Ang University, Republic of Korea

*Indicates Corresponding Author
ICASSP 2026

"Imagine What Shape Sound Could Take"


tl;dr: We introduce an audio-driven 3D mesh and texture generation system that uses pretrained 2D diffusion models and takes a single audio input.


Method Overview



Audio-to-3D Generation Results


Audio: 🔊 (Fire Crackling)

A23D Result:

Audio: 🔊 (Forest)

A23D Result:

Audio: 🔊 (Underwater Bubbling)

A23D Result:

Audio: 🔊 (Snow)

A23D Result:

Our system generates 3D meshes and textures from a single audio file using pretrained 2D diffusion models.


Audio-Driven Text-to-3D Variation Results


Audio: 🔊 (Fire Crackling)
Text: 💬 "A vase"

Ours:

Audio: 🔊 (Underwater)
Text: "A shoe"

Ours:

Audio: 🔊 (Forest)
Text: "A cup"

Ours:

Audio: 🔊 (Splashing water)
Text: "A chair"

Ours:

With auxiliary text prompts, our system produces more expressive and realistic 3D content through attribute-decoupled denoising guidance.


Ablation Study: Cross-Modal 3D Generation


Audio: 🔊 (Null)
Text: 💬 "A chair with fire crackling ..."

Text-to-2D:

Audio: 🔊 (Null)
Text: 💬 "A chair with fire crackling ..."

Text-to-3D:

Audio: 🔊 (Fire Crackling)
Text: 💬 "A Chair"

(pure) Audio-Driven Text-to-3D:

Audio: 🔊 (Fire Crackling)
Text: 💬 "A Chair"

Ours:

Our attribute-decoupled guidance produces more realistic 3D structures whose attributes each align with the corresponding input, whether the text prompt or the audio.
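To give a rough feel for how guidance from two conditions can be kept separate, here is a minimal sketch of generic multi-condition classifier-free guidance, in which the text and audio branches each get their own guidance weight. This is an assumption for illustration only: the additive form, the function name `decoupled_guidance`, and the weights `w_text` and `w_audio` are hypothetical and are not taken from the paper, and the random arrays merely stand in for the noise predictions of a diffusion model.

```python
import numpy as np

def decoupled_guidance(eps_uncond, eps_text, eps_audio, w_text=7.5, w_audio=5.0):
    """Combine noise predictions using one guidance weight per condition.

    Generic multi-condition classifier-free guidance. The additive form and
    the default weights are illustrative assumptions, not the paper's method.
    """
    text_dir = eps_text - eps_uncond    # direction favoured by the text prompt
    audio_dir = eps_audio - eps_uncond  # direction favoured by the audio clip
    return eps_uncond + w_text * text_dir + w_audio * audio_dir

# Random stand-ins for the three noise predictions of a diffusion model.
rng = np.random.default_rng(0)
shape = (4, 8, 8)
eps_uncond = rng.standard_normal(shape)
eps_text = rng.standard_normal(shape)
eps_audio = rng.standard_normal(shape)

eps_hat = decoupled_guidance(eps_uncond, eps_text, eps_audio)
```

Setting both weights to zero recovers the unconditional prediction, and setting `w_text=1.0, w_audio=0.0` recovers the text-only prediction, so each weight controls how strongly its own modality steers the denoising direction.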


Q&A


  1. Why Gaussian Splatting instead of NeRFs?

    Gaussian Splatting offers a good balance of training time, intuitiveness, and compute cost. No one wants to wait 5+ hours to create a simple 3D object. Using GS as the 3D representation allows fast optimization and easier manipulation of 3D objects, within 2 minutes and 12GB of VRAM.


  2. Why SDS?

    At the time of this research, SDS-based 3D generation methods (e.g., DreamFusion, DreamGaussian) provided substantial flexibility: they enable 3D content creation from a single condition on top of pretrained diffusion models, without requiring 3D awareness or cross-modal data. Thanks to this capability, we can build an audio-to-3D system for 3D meshes on top of pretrained audio-to-image diffusion models.
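    As a rough illustration of the SDS idea as introduced in DreamFusion, the sketch below computes one Monte Carlo sample of the SDS gradient: render a view, add noise, ask a frozen diffusion prior to predict that noise, and push the residual back through the renderer. Everything concrete here is a toy assumption: `render` is an identity map rather than a Gaussian Splatting renderer, `predict_noise` is an idealized prior whose data distribution is a single `target` point rather than a pretrained audio-to-image model, and all names are hypothetical.

```python
import numpy as np

def sds_grad(theta, render, render_vjp, predict_noise, alpha_bar, rng, weight=1.0):
    """One Monte Carlo sample of the SDS gradient w.r.t. the 3D parameters."""
    x = render(theta)                                   # rendered view
    eps = rng.standard_normal(x.shape)                  # sampled Gaussian noise
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps  # forward diffusion
    eps_hat = predict_noise(x_t, alpha_bar)             # frozen diffusion prior
    # The denoiser's own Jacobian is skipped, as in DreamFusion's SDS.
    return render_vjp(theta, weight * (eps_hat - eps))  # chain rule through renderer

# --- Toy stand-ins (assumptions, not the paper's components) ---
target = np.array([1.0, -2.0, 0.5])

def render(theta):
    return theta                      # identity "renderer"

def render_vjp(theta, upstream):
    return upstream                   # Jacobian of the identity map

def predict_noise(x_t, alpha_bar):
    # Exact noise prediction for a prior concentrated on `target`.
    return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)

rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(300):
    alpha_bar = rng.uniform(0.2, 0.8)  # random noise level per step
    theta -= 0.1 * sds_grad(theta, render, render_vjp, predict_noise, alpha_bar, rng)
```

    In this toy setup the residual reduces to a positive multiple of `theta - target`, so plain gradient descent pulls the parameters toward what the prior considers likely. Swapping the toy prior for a pretrained conditional diffusion model and the identity map for a differentiable renderer gives the general shape of an SDS-based pipeline.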


References


  1. (SDS) DreamFusion: https://dreamfusion3d.github.io/
  2. (3DGS) 3D Gaussian Splatting: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
  3. (SDS + 3DGS) DreamGaussian: https://dreamgaussian.github.io/
  4. (Text-to-Image) MVDream: https://mv-dream.github.io/
  5. (Audio-to-Image) SonicDiffusion: https://cyberiada.github.io/SonicDiffusion/
  6. (Conditional Sampling) Classifier-free Guidance: https://arxiv.org/abs/2207.12598
  7. (Audio Embedding) CLAP: https://arxiv.org/pdf/2206.04769
  8. (Text Embedding) CLIP: https://arxiv.org/pdf/2103.00020