MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR

MoXaRt is a real-time XR system that uses audio-visual cues to separate entangled sound sources and enable fine-grained sound interaction. Accepted to CHI 2026, Barcelona, Spain.


Overview

MoXaRt’s core is a cascaded architecture that performs:

  1. Coarse audio-only separation — initial separation of mixed audio sources
  2. Visual detection of sources — identifying sound-producing objects (e.g., faces, instruments) in the scene
  3. Visually-guided refinement — using the visual anchors to isolate individual sources with high precision

The system separates complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with approximately 2 seconds of processing latency, making it suitable for real-time XR interaction.
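The three stages map onto a simple cascaded pipeline. The sketch below illustrates that flow in Python; the VisualAnchor structure and the coarse_separator, object_detector, and refiner callables are hypothetical placeholders standing in for MoXaRt's actual models, which are not detailed on this page.

```python
# Minimal sketch of the cascaded pipeline described above. All model
# interfaces here are hypothetical placeholders, not MoXaRt's real APIs.
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np


@dataclass
class VisualAnchor:
    """A detected sound-producing object used to guide refinement (assumed structure)."""
    label: str                        # e.g. "face", "guitar"
    bbox: Tuple[int, int, int, int]   # (x, y, w, h) in the camera frame
    embedding: np.ndarray             # visual feature vector for conditioning


def separate_sources(
    mixture: np.ndarray,   # mixed audio, shape (num_samples,)
    frame: np.ndarray,     # RGB camera frame from the XR device, shape (H, W, 3)
    coarse_separator: Callable[[np.ndarray], List[np.ndarray]],
    object_detector: Callable[[np.ndarray], List[VisualAnchor]],
    refiner: Callable[[List[np.ndarray], VisualAnchor], np.ndarray],
) -> List[np.ndarray]:
    """Run the three cascaded stages and return one isolated track per detected object."""
    # Stage 1: coarse audio-only separation of the mix into rough stems.
    rough_stems = coarse_separator(mixture)

    # Stage 2: visual detection of sound-producing objects (faces, instruments, ...).
    anchors = object_detector(frame)

    # Stage 3: visually guided refinement, conditioning on each object's
    # visual anchor to isolate its individual source with high precision.
    return [refiner(rough_stems, anchor) for anchor in anchors]
```

In an interactive XR session, a cascade like this would run on short, rolling audio windows paired with the current camera frame rather than on a full recording at once.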


Results

We validated MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study.

Metric                   Result
Speech intelligibility   36.2% increase in listening comprehension (p < 0.01)
Cognitive load           Significantly reduced (p < 0.001)
Concurrent sources       Up to 5 (e.g., 2 voices + 3 instruments)
Processing latency       ~2 seconds

Citation

(Xu et al., 2026)

References

2026

  1. MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR
    Tianyu Xu, Sieun Kim, Qianhui Zheng, Ruoyu Xu, Tejasvi Ravi, Anuva Kulkarni, Katrina Passarella-Ward, Junyi Zhu, and Adarsh Kowdle.
    In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26).