Audio: Fire Crackling
Audio: Forest
Audio: Underwater Bubbling
Audio: Snow
Audio: Fire Crackling
Text: "A vase"
Audio: Underwater
Text: "A shoe"
Audio: Forest
Text: "A cup"
Audio: Splashing water
Text: "A chair"
Audio: Null
Text: "A chair with fire crackling ..."
Audio: Null
Text: "A chair with fire crackling ..."
Audio: Fire Crackling
Text: "A Chair"
Audio: Fire Crackling
Text: "A Chair"
Why Gaussian Splatting instead of NeRFs?
Gaussian Splatting (GS) offers a good balance of training time, intuitiveness, and compute cost. No one wants to wait 5+ hours to create a simple 3D object. Using GS as the 3D representation allows fast optimization and easier manipulation of 3D objects, within 2 minutes and 12 GB of VRAM.
Why SDS?
At the time of this research, 3D generation methods based on Score Distillation Sampling (SDS), such as DreamFusion and DreamGaussian, provided substantial flexibility: they enable 3D content creation from a single condition on top of pretrained diffusion models, without requiring 3D awareness or cross-modal data. Thanks to this capability, we can build an audio-to-3D system on top of pretrained audio-to-image diffusion models to generate 3D meshes.
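To make the SDS idea concrete, here is a minimal toy sketch of the update rule, w(t) * (predicted noise - injected noise) * dx/dθ, as introduced in DreamFusion. It is not this project's implementation. It rests on loud assumptions: the "renderer" is the identity (the rendered image x equals the parameters θ), and the pretrained conditional diffusion model is replaced by an analytic noise predictor for a data distribution concentrated at a single `target` point. The names `sds_grad` and `optimize` are hypothetical.

```python
import math
import random


def sds_grad(theta, target, abar, rng, weight=1.0):
    """One-sample SDS gradient in a toy setting.

    Assumptions (illustrative only, not the real pipeline):
      * the renderer is the identity, so x = theta and dx/dtheta = 1;
      * the noise predictor is the analytic optimum for data concentrated
        at `target`, standing in for a pretrained conditional diffusion model.
    """
    grads = []
    for x, x_star in zip(theta, target):
        eps = rng.gauss(0.0, 1.0)  # injected noise
        # Forward diffusion of the "rendered" value at noise level abar.
        x_t = math.sqrt(abar) * x + math.sqrt(1.0 - abar) * eps
        # Toy noise prediction (replaces the diffusion network).
        eps_hat = (x_t - math.sqrt(abar) * x_star) / math.sqrt(1.0 - abar)
        # SDS skips the denoiser Jacobian: the residual is used directly.
        grads.append(weight * (eps_hat - eps))
    return grads


def optimize(theta, target, steps=200, lr=0.1, seed=0):
    """Plain gradient descent on theta using the toy SDS gradient."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        abar = rng.uniform(0.02, 0.98)  # random noise level per step
        grad = sds_grad(theta, target, abar, rng)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    print(optimize([0.0, 0.0], [1.0, -2.0]))
```

In this toy, the parameters drift toward `target` because the noise residual points from the current render toward what the (stand-in) diffusion prior considers likely. In a real system the identity renderer would be a differentiable Gaussian Splatting renderer and the analytic predictor would be the conditioned diffusion network.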