ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation

Yuzhuo Ao1*, Anbang Wang1*, Yu-Wing Tai2, Chi-Keung Tang1
1The Hong Kong University of Science and Technology, 2Dartmouth College
*Equal contribution
ReasonNavi Pipeline

ReasonNavi uses a reason-then-act paradigm to enable zero-shot embodied navigation by coupling MLLMs with deterministic planners.

Abstract

Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations, which restrict global foresight and lead to inefficient exploration. In contrast, humans plan using maps: we reason globally first, then act locally. We introduce ReasonNavi, a human-inspired framework that operationalizes this reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners. ReasonNavi converts a top-down map into a discrete reasoning space by segmenting rooms and sampling candidate target nodes. An MLLM is then queried in a multi-stage process to identify the candidate most consistent with the instruction (object, image, or text goal), leveraging the model’s semantic reasoning ability while sidestepping its weakness in continuous coordinate prediction. The selected waypoint is grounded into executable trajectories using a deterministic action planner over an online-built occupancy map, while pretrained object detectors and segmenters ensure robust recognition at the goal. This yields a unified zero-shot navigation framework that requires no MLLM fine-tuning, circumvents the brittleness of RL-based policies, and scales naturally with foundation model improvements. Across three navigation tasks, ReasonNavi consistently outperforms prior methods that demand extensive training or heavy scene modeling, offering a scalable, interpretable, and globally grounded solution to embodied navigation.
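To make the "discrete reasoning space" concrete, here is a minimal sketch of sampling candidate target nodes from a room-segmented top-down map. It is an illustration under stated assumptions, not the paper's implementation: the grid encoding (0 for walls or unknown cells, positive integers for room IDs), the regular-stride sampling rule, and the name `sample_candidates` are all hypothetical.

```python
from collections import defaultdict


def sample_candidates(room_map, stride=2):
    """Sample candidate nodes on a regular stride, grouped by room ID.

    room_map: 2D list of ints (assumed encoding: 0 = wall/unknown,
              k > 0 = cell belongs to room k).
    Returns a dict {room_id: [(row, col), ...]}.
    """
    candidates = defaultdict(list)
    for r in range(0, len(room_map), stride):
        for c in range(0, len(room_map[r]), stride):
            room = room_map[r][c]
            if room > 0:  # skip walls and unknown cells
                candidates[room].append((r, c))
    return dict(candidates)


# Toy map: room 1 on the left, room 2 on the right, a wall column between.
room_map = [
    [1, 1, 1, 0, 2, 2],
    [1, 1, 1, 0, 2, 2],
    [1, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]
nodes = sample_candidates(room_map, stride=2)
print(nodes)
```

The point of the discretization is that the model later only has to choose among a finite set of labeled nodes rather than predict continuous coordinates.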


Global Reasoning

ReasonNavi leverages MLLMs to perform global reasoning on a top-down map, identifying potential target nodes based on semantic instructions.
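The selection step can be sketched as a multiple-choice query over candidate IDs. The example below assumes a two-stage, coarse-to-fine scheme (pick a room, then a node inside it); the page only says the process is multi-stage, so this staging is an assumption. The prompt wording, the helper names, and the `ask` callable are hypothetical, and the stub stands in for a real MLLM call, which would also receive the annotated map image.

```python
import re


def build_prompt(goal, options, kind):
    """Format a multiple-choice question with numbered options."""
    lines = [f"Goal: {goal}",
             f"Choose the most likely {kind}. Answer with one ID."]
    lines += [f"[{i}] {desc}" for i, desc in enumerate(options)]
    return "\n".join(lines)


def parse_choice(answer, n_options):
    """Return the first integer in the answer that is a valid index, else 0."""
    for token in re.findall(r"\d+", answer):
        if int(token) < n_options:
            return int(token)
    return 0  # fallback when the reply contains no usable ID


def select_waypoint(goal, candidates, ask):
    """Assumed two-stage selection: first a room, then a node in that room.

    candidates: {room_id: [(row, col), ...]}
    ask: hypothetical callable mapping a prompt string to a reply string.
    """
    rooms = sorted(candidates)
    room_prompt = build_prompt(goal, [f"room {r}" for r in rooms], "room")
    room_idx = parse_choice(ask(room_prompt), len(rooms))
    nodes = candidates[rooms[room_idx]]
    node_prompt = build_prompt(goal, [f"node at {n}" for n in nodes], "node")
    node_idx = parse_choice(ask(node_prompt), len(nodes))
    return nodes[node_idx]


def fake_mllm(prompt):
    """Stub that always answers with option 1 (not a real model)."""
    return "I pick [1]."


candidates = {1: [(0, 0), (0, 2)], 2: [(0, 4), (2, 4)]}
waypoint = select_waypoint("find the bed", candidates, fake_mllm)
print(waypoint)
```

Parsing the reply down to a validated index keeps the downstream planner independent of how the model phrases its answer.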

Local Execution

The selected waypoints are grounded into executable trajectories using a deterministic action planner for robust local navigation.
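As a rough illustration of deterministic grounding, the sketch below plans a shortest path to a selected waypoint with breadth-first search on a 2D occupancy grid. The page does not specify which planner ReasonNavi uses, so BFS on a 4-connected grid is a stand-in chosen for brevity; the grid encoding (1 = blocked, 0 = free) and the name `plan_path` are likewise assumptions.

```python
from collections import deque


def plan_path(occupancy, start, goal):
    """Shortest 4-connected path on a grid via breadth-first search.

    occupancy: 2D list where 1 means blocked and 0 means free (assumed).
    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(occupancy), len(occupancy[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and occupancy[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# A wall across the middle row forces a detour through the right column.
occupancy = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = plan_path(occupancy, (0, 0), (2, 0))
print(path)
```

Because the occupancy map is built online, a planner like this would be re-run as new obstacles are observed; the same inputs always yield the same path, which is what makes the execution stage deterministic.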

Related Links

We thank the creators of the HM3D dataset, which provides high-resolution 3D scans of real-world environments. Our work is evaluated on HM3D.

BibTeX

@article{ao2026reasonnavihumaninspiredglobalmap,
      title={ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation}, 
      author={Yuzhuo Ao and Anbang Wang and Yu-Wing Tai and Chi-Keung Tang},
      year={2026},
      eprint={2602.15864},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.15864}, 
}