MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR

Imagine sitting in a noisy café where a guitarist and a cellist are playing while two people talk over them. What if your XR headset could let you tap on the guitarist to hear only their melody, or mute a conversation you're not part of? That's MoXaRt.

📄 Paper: arXiv:2603.10465 · Accepted to CHI 2026

🎤 Talk: Fri, Apr 17 at 9:00 AM

πŸ“ Barcelona International Convention Centre, P1 β€” Room 128


🎯 The Problem

In real-world XR environments, sound sources are entangled: voices overlap with music, and instruments bleed into each other. Existing spatial audio techniques can filter by direction, but they can't separate two sources coming from the same location.

MoXaRt solves this by combining what you see with what you hear, using visual detection of sound-producing objects (faces, instruments) to guide precise audio separation.


πŸ—οΈ How It Works

MoXaRt uses a cascaded architecture with three stages:

1. **Coarse Audio-Only Separation.** Initial blind separation of the mixed audio into approximate source streams.
2. **Visual Source Detection.** Real-time detection of sound-producing objects in the scene (faces, instruments, speakers) using the XR headset's cameras.
3. **Visually-Guided Refinement.** The visual anchors guide a second-stage model that isolates individual sources with high precision, resolving ambiguities the audio-only stage can't handle.
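To make the data flow concrete, here is a minimal Python sketch of the cascade. All component names and method signatures (`coarse.separate`, `detector.detect`, `refiner.refine`) are hypothetical placeholders, not the paper's actual API:

```python
import numpy as np

# Hypothetical sketch of the three-stage cascade. The three components are
# injected as opaque models; only the data flow between them follows the
# description above.

class CascadedSeparationPipeline:
    def __init__(self, coarse_separator, object_detector, guided_refiner):
        self.coarse = coarse_separator    # Stage 1: blind audio separation
        self.detector = object_detector   # Stage 2: visual source detection
        self.refiner = guided_refiner     # Stage 3: visually guided refinement

    def process(self, audio_mix: np.ndarray, camera_frame: np.ndarray) -> dict:
        """Return one isolated audio stream per detected sound-producing object."""
        # Stage 1: split the mixture into approximate source streams.
        rough_streams = self.coarse.separate(audio_mix)

        # Stage 2: locate faces, instruments, and loudspeakers in the frame.
        detected_objects = self.detector.detect(camera_frame)

        # Stage 3: use each visual anchor to pull a clean stream out of the
        # rough estimates, resolving cases the audio-only stage confuses
        # (e.g. two sources arriving from the same direction).
        return {
            obj.id: self.refiner.refine(rough_streams, obj)
            for obj in detected_objects
        }
```

The key design point is that the final stage conditions on a visual anchor rather than on direction alone, which is what lets the system split co-located sources.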

🎬 Demos

See MoXaRt in action: real-time source separation controlled through visual object selection.

Instrument Separation

A live performance with multiple instruments playing simultaneously. MoXaRt identifies each instrument visually and separates its audio stream in real time.

Speech vs. Music Separation

A scenario with overlapping speech and background music. MoXaRt cleanly separates the two, letting you focus on either the conversation or the performance.

Multi-Speaker Separation

Multiple people speaking at once. When you visually select a specific person, MoXaRt isolates their voice from the crowd.
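Behind all three demos sits the same interaction pattern: the pipeline's separated streams are re-mixed with per-object gains, so selecting an object solos it and muting one zeroes its gain. A minimal sketch of that control layer, again with hypothetical names:

```python
# Hypothetical per-object gain control over the separated streams. Tapping an
# object in the headset UI would call solo() or mute() with that object's id.

class SoundFocusController:
    def __init__(self):
        self.gains = {}  # object id -> playback gain in [0.0, 1.0]

    def solo(self, target_id, all_ids):
        """Hear only the selected source, e.g. the guitarist's melody."""
        self.gains = {oid: (1.0 if oid == target_id else 0.0) for oid in all_ids}

    def mute(self, target_id):
        """Silence one source, e.g. a conversation you're not part of."""
        self.gains[target_id] = 0.0

    def mix(self, streams):
        """Re-mix the separated streams (id -> waveform) with current gains."""
        return sum(self.gains.get(oid, 1.0) * wav for oid, wav in streams.items())
```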


📊 Results

We validated MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study.

| Metric | Result |
| --- | --- |
| Speech intelligibility | 36.2% increase in listening comprehension (p < 0.01) |
| Cognitive load | Significantly reduced (p < 0.001) |
| Concurrent sources | Up to 5 (e.g., 2 voices + 3 instruments) |
| Processing latency | ~2 seconds |

(Xu et al., 2026)

References

1. Tianyu Xu, Sieun Kim, Qianhui Zheng, Ruoyu Xu, Tejasvi Ravi, Anuva Kulkarni, Katrina Passarella-Ward, Junyi Zhu, and Adarsh Kowdle. 2026. MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26).