# SAMOSA: Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering
What if sounds in your XR headset actually matched the room you're in, reverberating off the right walls and absorbing into the right surfaces? SAMOSA makes virtual audio feel real by understanding your physical environment in real time.
Paradise Hotel Busan, Sydney Room, Busan, Republic of Korea
## The Problem
In XR, realistic sound is crucial for immersion, but existing spatial audio systems use static, one-size-fits-all acoustics. A cathedral and a closet get the same reverb. This mismatch between what you see and what you hear breaks the illusion.
SAMOSA fixes this by building a real-time understanding of your physical space (its geometry, materials, and acoustic character) and using that to synthesize audio that sounds like it truly belongs there.
## How It Works
SAMOSA fuses three real-time sensing streams into a rich multimodal scene representation:
**1. Room Geometry Estimation.** Real-time 3D understanding of room shape and dimensions from device sensors.

**2. Surface Material Detection.** Visual identification of surfaces (concrete, wood, glass, carpet) that determines how sound is reflected and absorbed.

**3. Semantic Acoustic Context.** LLM-driven interpretation of the scene's acoustic character: understanding that a "library" sounds different from a "gym" even at similar sizes.
These three streams feed into an efficient acoustic calibration engine that synthesizes a realistic Room Impulse Response (RIR), the acoustic fingerprint of your space, in real time.
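To make the idea concrete, here is a minimal sketch of scene-aware RIR synthesis. It is not SAMOSA's actual pipeline: it simply combines estimated room dimensions and detected surface materials via Sabine's formula to get a reverberation time, then models a toy RIR as exponentially decaying noise. All function names and absorption coefficients below are illustrative assumptions.

```python
import numpy as np

# Illustrative broadband absorption coefficients (roughly ~1 kHz values).
ABSORPTION = {"concrete": 0.02, "wood": 0.10, "glass": 0.04, "carpet": 0.40}

def estimate_rt60(room_dims, surface_materials):
    """Estimate reverberation time with Sabine's formula:
    RT60 = 0.161 * V / A, where A = sum(alpha_i * S_i)."""
    w, l, h = room_dims
    volume = w * l * h
    # Areas of the six surfaces (floor, ceiling, four walls),
    # paired positionally with the detected materials.
    areas = [w * l, w * l, w * h, w * h, l * h, l * h]
    total_absorption = sum(ABSORPTION[m] * s
                           for m, s in zip(surface_materials, areas))
    return 0.161 * volume / total_absorption

def synthesize_rir(rt60, fs=48000):
    """Model a toy RIR as white noise with an exponential envelope
    whose energy drops 60 dB over rt60 seconds."""
    t = np.arange(int(rt60 * fs)) / fs
    envelope = 10 ** (-3.0 * t / rt60)  # -60 dB (amplitude -30 dB) at t = rt60
    rng = np.random.default_rng(0)
    return rng.standard_normal(len(t)) * envelope

# A 4 x 5 x 3 m room: carpeted floor, wooden ceiling, mixed walls.
rt60 = estimate_rt60((4.0, 5.0, 3.0),
                     ["carpet", "wood", "concrete", "concrete", "glass", "concrete"])
rir = synthesize_rir(rt60)
```

A real renderer would convolve this RIR with the dry source signal (e.g. via `scipy.signal.fftconvolve`) and refresh the RIR whenever the scene estimate changes.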
## Demo
## Results
We validated SAMOSA through a technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12).
| Aspect | Finding |
| --- | --- |
| Room configurations | Validated across diverse geometries and surface materials |
| Sound types | Tested with speech, music, and environmental audio |
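One standard acoustic metric for evaluating a synthesized RIR is its reverberation time (RT60), commonly measured with Schroeder backward integration of the impulse response's energy. The sketch below illustrates that measurement; it is an assumption for exposition, not the paper's actual evaluation code.

```python
import numpy as np

def schroeder_rt60(rir, fs):
    """Estimate RT60 from an impulse response via Schroeder backward
    integration, fitting the T20 range (-5 dB to -25 dB of the energy
    decay curve) and extrapolating the slope to -60 dB."""
    # Backward-integrated energy decay curve (EDC), normalized to 0 dB.
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(energy / energy[0])
    t = np.arange(len(rir)) / fs
    # Linear fit over the -5..-25 dB portion of the decay.
    mask = (edc_db <= -5.0) & (edc_db >= -25.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)
    return -60.0 / slope

# Sanity check on a synthetic RIR with a known 0.8 s decay.
fs = 48000
t = np.arange(int(0.8 * fs)) / fs
rir = np.random.default_rng(1).standard_normal(len(t)) * 10 ** (-3.0 * t / 0.8)
est = schroeder_rt60(rir, fs)  # should land close to 0.8 s
```

Comparing such estimates between a synthesized RIR and a ground-truth measurement (e.g. as an RT60 error) is one way metrics of this kind can quantify how well a rendered room matches the real one.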
## Abstract

In Extended Reality (XR), rendering sound that accurately simulates real-world acoustics is pivotal in creating lifelike and believable virtual experiences. However, existing XR spatial audio rendering methods often struggle with real-time adaptation to diverse physical scenes, causing a sensory mismatch between visual and auditory cues that disrupts user immersion. To address this, we introduce SAMOSA, a novel on-device system that renders spatially accurate sound by dynamically adapting to its physical environment. SAMOSA leverages a synergistic multimodal scene representation by fusing real-time estimations of room geometry, surface materials, and semantic-driven acoustic context. This rich representation then enables efficient acoustic calibration via scene priors, allowing the system to synthesize a highly realistic Room Impulse Response (RIR). We validate our system through technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12). Evaluation results demonstrate SAMOSA's feasibility and efficacy in enhancing XR auditory realism.
```bibtex
@inproceedings{xu2025samosa,
  author    = {Xu, Tianyu and Li, Jihan and Zu, Penghe and Sahay, Pranav and Kim, Maruchi and Obeng-Marnu, Jack and Miller, Farley and Qian, Xun and Passarella, Katrina and Rachumalla, Mahitha and Nongpiur, Rajeev and Shin, D.},
  title     = {{SAMOSA: Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering}},
  booktitle = {Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST '25)},
  year      = {2025},
  address   = {Busan, Republic of Korea},
  publisher = {Association for Computing Machinery},
  keywords  = {extended reality, spatial audio rendering, rir synthesis, multimodal machine learning, large language models, scene representation, room acoustics},
  doi       = {10.1145/3746059.3747730},
}
```