Cady Tianyu Xu

Hello! 👋 I’m Cady, a researcher at Google DeepMind. 🌀 My work focuses on multimodal LLM agents that move beyond open-ended generation toward closed-loop execution, bridging the gap between high-level reasoning and autonomous task completion in real-world environments.

Previously, I was a machine learning engineer on the Google XR team, where I developed perception models and multimodal LLMs for immersive, context-aware interaction. 🕶️ My recent work also spans multimodal human-centered systems and interactive AI, with publications at UIST and CHI.

Prior to my roles at Google, I was a Software Engineer at Apple. I received my Bachelor’s degree in both Computer Science and Political Science from UC Berkeley. 🐻

I’m always excited to connect with researchers and practitioners working on LLMs, XR, or autonomous agents. Reach out via LinkedIn or email, or check out my Google Scholar for potential collaborations! 🐑

news

Apr 08, 2026	Presenting MoXaRt at CHI 2026 in Barcelona! 🇪🇸 Come find us at the Barcelona International Convention Centre, P1 — Room 128 on Fri, Apr 17 at 9:00 AM. 🎤
Mar 21, 2026	Thrilled to be a speaker at the 2026 Silicon Valley Women in Engineering Conference! I’ll be presenting “Sound, Space and Agency: Building Context-Aware Wearable Systems” in the Emerging Technologies C2 · UX & Wearable Technology session, Sat 3/21 at 1:45–2:45 PM. 🎙️
Mar 08, 2026	Our paper GazeAnywhere has been accepted to CVPR 2026! 🎉 The first foundation model for promptable gaze target estimation. 👀

selected publications

GazeAnywhere
Gaze Target Estimation Anywhere with Concepts

Xu Cao, Houze Yang, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, and James M. Rehg.

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)

Estimating human gaze targets from images in-the-wild is an important and formidable task. Existing approaches primarily employ brittle, multi-stage pipelines that require explicit inputs, like head bounding boxes and human pose. We introduce the Promptable Gaze Target Estimation (PGE) task, a new end-to-end, concept-driven paradigm for gaze analysis that conditions gaze prediction on flexible user text or visual prompts. We propose GazeAnywhere, the first foundation model designed for PGE, which uses a multi-layer transformer-based detector to fuse features from frozen encoders and simultaneously solves subject localization, in/out-of-frame presence, and gaze target heatmap estimation.
@inproceedings{cao2026gazeanywhere,
  author = {Cao, Xu and Yang, Houze and Gunda, Vipin and Zhou, Zhongyi and Xu, Tianyu and Kowdle, Adarsh and Kim, Inki and Rehg, James M.},
  title = {{Gaze Target Estimation Anywhere with Concepts}},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)},
  year = {2026},
  keywords = {gaze estimation, foundation model, promptable vision, computer vision, SAM, transformer},
}
MoXaRt
MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR

Tianyu Xu, Sieun Kim, Qianhui Zheng, Ruoyu Xu, Tejasvi Ravi, Anuva Kulkarni, Katrina Passarella-Ward, Junyi Zhu, and Adarsh Kowdle.

In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26)

In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt’s core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with a 2-second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.
@inproceedings{xu2026moxart,
  author = {Xu, Tianyu and Kim, Sieun and Zheng, Qianhui and Xu, Ruoyu and Ravi, Tejasvi and Kulkarni, Anuva and Passarella-Ward, Katrina and Zhu, Junyi and Kowdle, Adarsh},
  title = {{MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR}},
  booktitle = {Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)},
  year = {2026},
  address = {Barcelona, Spain},
  publisher = {Association for Computing Machinery},
  keywords = {extended reality, audio-visual interaction, multimodal machine learning, spatial audio, sound synthesis, object-guided interaction},
  doi = {10.1145/3772318.3791929},
}
SAMOSA
Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering

Tianyu Xu, Jihan Li, Penghe Zu, Pranav Sahay, Maruchi Kim, Jack Obeng-Marnu, Farley Miller, Xun Qian, Katrina Passarella, Mahitha Rachumalla, Rajeev Nongpiur, and D. Shin.

In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST ’25)

In Extended Reality (XR), rendering sound that accurately simulates real-world acoustics is pivotal in creating lifelike and believable virtual experiences. However, existing XR spatial audio rendering methods often struggle with real-time adaptation to diverse physical scenes, causing a sensory mismatch between visual and auditory cues that disrupts user immersion. To address this, we introduce SAMOSA, a novel on-device system that renders spatially accurate sound by dynamically adapting to its physical environment. SAMOSA leverages a synergistic multimodal scene representation by fusing real-time estimations of room geometry, surface materials, and semantic-driven acoustic context. This rich representation then enables efficient acoustic calibration via scene priors, allowing the system to synthesize a highly realistic Room Impulse Response (RIR). We validate our system through technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12). Evaluation results demonstrate SAMOSA’s feasibility and efficacy in enhancing XR auditory realism.
@inproceedings{xu2025samosa,
  author = {Xu, Tianyu and Li, Jihan and Zu, Penghe and Sahay, Pranav and Kim, Maruchi and Obeng-Marnu, Jack and Miller, Farley and Qian, Xun and Passarella, Katrina and Rachumalla, Mahitha and Nongpiur, Rajeev and Shin, D.},
  title = {{Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering}},
  booktitle = {Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST '25)},
  year = {2025},
  address = {Busan, Republic of Korea},
  publisher = {Association for Computing Machinery},
  keywords = {extended reality, spatial audio rendering, rir synthesis, multimodal machine learning, large language models, scene representation, room acoustics},
  doi = {10.1145/3746059.3747730},
}
EI-Lite
EI-Lite: Electrical Impedance Sensing for Micro-gesture Recognition and Pinch Force Estimation

Junyi Zhu, Tianyu Xu, Jiayu Wang, Emily Guan, JaeYoung Moon, Stiven Morvan, D Shin, Andrea Colaço, Stefanie Mueller, Karan Ahuja, Yiyue Luo, and Ishan Chatterjee.

In Proceedings of UIST 2025

Micro-gesture recognition and fine-grained pinch press enable intuitive and discreet control of devices, offering significant potential for enhancing human-computer interaction (HCI). In this paper, we present EI-Lite, a lightweight wrist-worn electrical impedance sensing device for micro-gesture recognition and continuous pinch force estimation. We elicit an optimal and simplified device architecture through an ablation study on electrode placement with 13 users, and implement the elicited designs through 3D printing. We capture data on 15 participants on (1) six common micro-gestures (plus idle state) and (2) index finger pinch forces, then develop machine learning models that interpret the impedance signals generated by these micro-gestures and pinch forces. Our system is capable of accurate recognition of micro-gesture events (96.33% accuracy), as well as continuously estimating the pinch force of the index finger in physical units (Newton), with the mean-squared-error (MSE) of 0.3071 (or mean-force-variance of 0.55 Newtons) over 15 participants. Finally, we demonstrate EI-Lite’s applicability via three applications in AR/VR, gaming, and assistive technologies.
@inproceedings{zhu2025EILite,
  author = {Zhu, Junyi and Xu, Tianyu and Wang, Jiayu and Guan, Emily and Moon, JaeYoung and Morvan, Stiven and Shin, D and Colaço, Andrea and Mueller, Stefanie and Ahuja, Karan and Luo, Yiyue and Chatterjee, Ishan},
  title = {{EI-Lite: Electrical Impedance Sensing for Micro-gesture Recognition and Pinch Force Estimation}},
  booktitle = {Proceedings of UIST 2025},
  year = {2025},
  address = {Busan, Republic of Korea},
  publisher = {Association for Computing Machinery},
  keywords = {Micro-gesture Recognition, Input, Natural User Interfaces, Interaction Technique, Extended Reality, EIT},
  doi = {10.1145/3746059.3747671},
}
Steerable Chatbots
Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering

Jessica Y. Bo, Tianyu Xu, Ishan Chatterjee, Katrina Passarella-Ward, Achin Kulshrestha, and D. Shin.

As large language models (LLMs) improve in their capacity to serve as personal AI assistants, their ability to output uniquely tailored, personalized responses that align with the soft preferences of their users is essential for enhancing user satisfaction and retention. However, untrained lay users have poor prompt specification abilities and often struggle with conveying their latent preferences to AI assistants. To address this, we leverage activation steering to guide LLMs to align with interpretable preference dimensions during inference. In contrast to memory-based personalization methods that require longer user history, steering is extremely lightweight and can be easily controlled by the user via a linear strength factor. We embed steering into three different interactive chatbot interfaces and conduct a within-subjects user study (n=14) to investigate how end users prefer to personalize their conversations. The results demonstrate the effectiveness of preference-based steering for aligning real-world conversations with hidden user preferences, and highlight further insights on how diverse values around control, usability, and transparency lead users to prefer different interfaces.
@misc{bo2025steerablechatbotspersonalizingllms,
  title = {{Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering}},
  author = {Bo, Jessica Y. and Xu, Tianyu and Chatterjee, Ishan and Passarella-Ward, Katrina and Kulshrestha, Achin and Shin, D.},
  year = {2025},
  archiveprefix = {arXiv},
  primaryclass = {cs.HC},
  keywords = {LLM Personalization, Activation Steering, Chatbot Interfaces},
}
CaliPSo
CaliPSo: Calibrated Predictive Models with Sharpness as Loss Function

Alexandre Capone, Kamron Zaidi, Tianyu Xu, Brian Yang, Geoff Pleiss, and Jeff Schneider.

In ICML 2025 Workshop on Methods and Opportunities at Small Scale

Conformal prediction methods have become increasingly common for accurately capturing uncertainty with machine learning models. However, conformal prediction typically recalibrates an existing model, making it heavily reliant on the quality of the uncalibrated model. Moreover, such methods either enforce marginal calibration strictly, yielding potentially coarse predictive intervals, or attempt to strike a balance between interval coarseness and calibration. Motivated by these shortcomings, we present CaliPSo, a neural network model that is marginally calibrated out-of-the-box and stays so throughout training. This property is achieved by adding a model-dependent constant to the model prediction that shifts it in a way that ensures calibration. During training, we then leverage this to focus exclusively on sharpness (the property of returning tight predictive intervals), rendering the model more useful at test time. We show thorough experimental results, where our method exhibits superior performance compared to several state-of-the-art approaches.
@inproceedings{capone2025calipso,
  title = {{CaliPSo: Calibrated Predictive Models with Sharpness as Loss Function}},
  author = {Capone, Alexandre and Zaidi, Kamron and Xu, Tianyu and Yang, Brian and Pleiss, Geoff and Schneider, Jeff},
  booktitle = {ICML 2025 Workshop on Methods and Opportunities at Small Scale},
  year = {2025},
  keywords = {Conformal Prediction, Uncertainty Quantification, Calibration, Sharpness, Predictive Intervals, Neural Networks},
}
Liquid EIT
Liquids Identification and Manipulation via Digitally Fabricated Impedance Sensors

Junyi Zhu, Young Joong Lee, Yiyue Luo, Tianyu Xu, Chao Liu, Daniela Rus, Stefanie Mueller, and Wojciech Matusik.

In 2024 IEEE International Conference on Robotics and Automation (ICRA)

Despite recent exponential advancements in computer vision and reinforcement learning, it remains challenging for robots to interact with liquids. These challenges are particularly pronounced due to the limitations imposed by opaque containers, transparent liquids, fine-grained splashes, and visual obstructions arising from the robot’s own manipulation activities. Yet, there exists a substantial opportunity for robotics to excel in liquid identification and manipulation, given its potential role in chemical handling in laboratories and various manufacturing sectors such as pharmaceuticals or beverages. In this work, we present a novel approach for liquid class identification and state estimation leveraging electrical impedance sensing. We design and mount a digitally embroidered electrode array to a commercial robot gripper. Coupled with a customized impedance sensing board, we collect data on liquid manipulation with a swept frequency sensing mode and a frequency-specific impedance measuring mode. Our developed learning-based model achieves an accuracy of 93.33% in classifying 9 different types of liquids (8 liquids + air), and 97.65% in estimating the liquid state. We investigate the effectiveness of our system with a series of ablation studies. These findings highlight our work as a promising solution for enhancing robotic manipulation in liquid-related tasks.
@inproceedings{zhu2024liquids,
  title = {{Liquids Identification and Manipulation via Digitally Fabricated Impedance Sensors}},
  author = {Zhu, Junyi and Lee, Young Joong and Luo, Yiyue and Xu, Tianyu and Liu, Chao and Rus, Daniela and Mueller, Stefanie and Matusik, Wojciech},
  booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
  year = {2024},
  pages = {18164--18171},
  doi = {10.1109/ICRA57147.2024.10610518},
  keywords = {electrodes, liquids, robot sensing systems, sensors, frequency measurement, impedance, state estimation},
}