GazeAnywhere: Gaze Target Estimation Anywhere with Concepts

What if you could ask an AI "where is the boy in the red shirt looking?" and get an instant answer from any image? GazeAnywhere is the first foundation model that understands gaze through natural language.

📄 Paper: Accepted to CVPR 2026

💻 Code: github.com/IrohXu/GazeAnywhere


🎯 The Problem

Estimating where people are looking in real-world images is notoriously tough. Current methods rely on brittle, multi-stage pipelines that require rigid inputs like head bounding boxes and human pose. Detection errors cascade through the pipeline, and there's no way to use natural language to specify who you want to analyze.


💡 Key Idea: Promptable Gaze Target Estimation

We define a new task, Promptable Gaze Target Estimation (PGE), that replaces fragile pipelines with a single, flexible model:

  • **Flexible Prompting:** use natural language ("the boy in the red shirt") or a visual prompt (a specific coordinate) to identify who you want to analyze.
  • **End-to-End Integration:** PGE merges subject localization with gaze estimation in a single pass, eliminating cascading errors.
  • **Foundation Model Architecture:** GazeAnywhere uses a multi-layer transformer to jointly predict subject localization, in/out-of-frame presence, and the gaze target heatmap.
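To make the last two outputs concrete, here is a minimal, illustrative post-processing sketch: it decodes a gaze-target heatmap and an in/out-of-frame logit into a final answer. The function name, argument names, and shapes are our own assumptions for illustration; they are not the released GazeAnywhere API.

```python
import numpy as np

def decode_gaze(heatmap, inout_logit, threshold=0.5):
    """Decode a gaze-target heatmap and an in/out-of-frame logit into a
    normalized (x, y) gaze point, or None if the target is out of frame.
    Illustrative sketch only; not the released GazeAnywhere API."""
    # Sigmoid on the presence logit: probability the target is in frame.
    p_in = 1.0 / (1.0 + np.exp(-inout_logit))
    if p_in < threshold:
        return None  # subject is looking outside the image
    h, w = heatmap.shape
    # Soft-argmax: softmax over the heatmap, then a probability-weighted
    # average of pixel coordinates (differentiable, unlike plain argmax).
    probs = np.exp(heatmap - heatmap.max())
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    x = (probs * xs).sum() / (w - 1)  # normalized to [0, 1]
    y = (probs * ys).sum() / (h - 1)
    return float(x), float(y)

# Toy example: a single strong peak near the image center.
hm = np.zeros((64, 64))
hm[32, 32] = 10.0
print(decode_gaze(hm, inout_logit=2.0))
```

With the peak at the center, the decoded point lands near (0.5, 0.5); a strongly negative `inout_logit` instead returns None, signalling an out-of-frame gaze target.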

⚡ Why GazeAnywhere?

  • 🔬 A SAM 3-style foundation model for gaze target estimation
  • 💬 The first gaze estimation model driven by text and visual concept prompts
  • 📋 Defines the Promptable Gaze Target Estimation (PGE) task
  • 🤖 Includes AnyGaze Agent, connecting GazeAnywhere to Gemini APIs

👥 Team

A collaboration between the UIUC Rehg Lab and Google AR (Cao et al., 2026).

References

  1. GazeAnywhere: Gaze Target Estimation Anywhere with Concepts.
     Xu Cao, Houze Yang, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, and James M. Rehg.
     In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.