# SAMOSA: Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering
What if sounds in your XR headset actually matched the room you're in, reverberating off the right walls and absorbing into the right surfaces? SAMOSA makes virtual audio feel real by understanding your physical environment in real time.
Paradise Hotel Busan, Sydney Room, Busan, Republic of Korea
## The Problem
In XR, realistic sound is crucial for immersion, but existing spatial audio systems use static, one-size-fits-all acoustics. A cathedral and a closet get the same reverb. This mismatch between what you see and what you hear breaks the illusion.
SAMOSA fixes this by building a real-time understanding of your physical space (its geometry, materials, and acoustic character) and using that to synthesize audio that sounds like it truly belongs there.
## How It Works
SAMOSA fuses three real-time sensing streams into a rich multimodal scene representation:
**1. Room Geometry Estimation.** Real-time 3D understanding of room shape and dimensions from device sensors.

**2. Surface Material Detection.** Visual identification of surfaces (concrete, wood, glass, carpet) that determines how sound is reflected and absorbed.

**3. Semantic Acoustic Context.** LLM-driven interpretation of the scene's acoustic character: understanding that a "library" sounds different from a "gym" even at similar sizes.
These three streams feed into an efficient acoustic calibration engine that synthesizes a realistic Room Impulse Response (RIR), the acoustic fingerprint of your space, in real time.
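To make the idea concrete, here is a minimal sketch of scene-aware RIR synthesis. It is not SAMOSA's actual pipeline: it simply combines estimated room dimensions and detected surface materials via Sabine's formula to get a reverberation time, then models a toy RIR as exponentially decaying noise. All function names and absorption coefficients below are illustrative assumptions.

```python
import numpy as np

# Illustrative broadband absorption coefficients (roughly ~1 kHz values).
ABSORPTION = {"concrete": 0.02, "wood": 0.10, "glass": 0.04, "carpet": 0.40}

def estimate_rt60(room_dims, surface_materials):
    """Estimate reverberation time with Sabine's formula:
    RT60 = 0.161 * V / A, where A = sum(alpha_i * S_i)."""
    w, l, h = room_dims
    volume = w * l * h
    # Areas of the six surfaces (floor, ceiling, four walls),
    # paired positionally with the detected materials.
    areas = [w * l, w * l, w * h, w * h, l * h, l * h]
    total_absorption = sum(ABSORPTION[m] * s
                           for m, s in zip(surface_materials, areas))
    return 0.161 * volume / total_absorption

def synthesize_rir(rt60, fs=48000):
    """Model a toy RIR as white noise with an exponential envelope
    whose energy drops 60 dB over rt60 seconds."""
    t = np.arange(int(rt60 * fs)) / fs
    envelope = 10 ** (-3.0 * t / rt60)  # -60 dB (amplitude -30 dB) at t = rt60
    rng = np.random.default_rng(0)
    return rng.standard_normal(len(t)) * envelope

# A 4 x 5 x 3 m room: carpeted floor, wooden ceiling, mixed walls.
rt60 = estimate_rt60((4.0, 5.0, 3.0),
                     ["carpet", "wood", "concrete", "concrete", "glass", "concrete"])
rir = synthesize_rir(rt60)
```

A real renderer would convolve this RIR with the dry source signal (e.g. via `scipy.signal.fftconvolve`) and refresh the RIR whenever the scene estimate changes.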
## Demo
## Results
We validated SAMOSA through a technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12).
| Aspect | Finding |
| --- | --- |
| Room configurations | Validated across diverse geometries and surface materials |
| Sound types | Tested with speech, music, and environmental audio |
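One standard acoustic metric for evaluating a synthesized RIR is its reverberation time (RT60), commonly measured with Schroeder backward integration of the impulse response's energy. The sketch below illustrates that measurement; it is an assumption for exposition, not the paper's actual evaluation code.

```python
import numpy as np

def schroeder_rt60(rir, fs):
    """Estimate RT60 from an impulse response via Schroeder backward
    integration, fitting the T20 range (-5 dB to -25 dB of the energy
    decay curve) and extrapolating the slope to -60 dB."""
    # Backward-integrated energy decay curve (EDC), normalized to 0 dB.
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(energy / energy[0])
    t = np.arange(len(rir)) / fs
    # Linear fit over the -5..-25 dB portion of the decay.
    mask = (edc_db <= -5.0) & (edc_db >= -25.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)
    return -60.0 / slope

# Sanity check on a synthetic RIR with a known 0.8 s decay.
fs = 48000
t = np.arange(int(0.8 * fs)) / fs
rir = np.random.default_rng(1).standard_normal(len(t)) * 10 ** (-3.0 * t / 0.8)
est = schroeder_rt60(rir, fs)  # should land close to 0.8 s
```

Comparing such estimates between a synthesized RIR and a ground-truth measurement (e.g. as an RT60 error) is one way metrics of this kind can quantify how well a rendered room matches the real one.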
## Abstract

In Extended Reality (XR), rendering sound that accurately simulates real-world acoustics is pivotal in creating lifelike and believable virtual experiences. However, existing XR spatial audio rendering methods often struggle with real-time adaptation to diverse physical scenes, causing a sensory mismatch between visual and auditory cues that disrupts user immersion. To address this, we introduce SAMOSA, a novel on-device system that renders spatially accurate sound by dynamically adapting to its physical environment. SAMOSA leverages a synergistic multimodal scene representation by fusing real-time estimations of room geometry, surface materials, and semantic-driven acoustic context. This rich representation then enables efficient acoustic calibration via scene priors, allowing the system to synthesize a highly realistic Room Impulse Response (RIR). We validate our system through technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12). Evaluation results demonstrate SAMOSA's feasibility and efficacy in enhancing XR auditory realism.
```bibtex
@inproceedings{xu2025samosa,
  author    = {Xu, Tianyu and Li, Jihan and Zu, Penghe and Sahay, Pranav and Kim, Maruchi and Obeng-Marnu, Jack and Miller, Farley and Qian, Xun and Passarella, Katrina and Rachumalla, Mahitha and Nongpiur, Rajeev and Shin, D.},
  title     = {{SAMOSA: Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering}},
  booktitle = {Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST '25)},
  year      = {2025},
  address   = {Busan, Republic of Korea},
  publisher = {Association for Computing Machinery},
  keywords  = {extended reality, spatial audio rendering, rir synthesis, multimodal machine learning, large language models, scene representation, room acoustics},
  doi       = {10.1145/3746059.3747730},
}
```