
KESTREL

Grounding Self-Refinement for LVLM Hallucination Mitigation

UC Santa Cruz · UC Berkeley · Apple · UNC-Chapel Hill

[Figure: KESTREL teaser]

Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. Because training LVLMs to avoid hallucinations becomes prohibitively expensive at larger scales, training-free methods offer a cheap and flexible alternative, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. Concretely, Kestrel first collects explicit visual evidence and converts tool outputs into reusable, structured textual evidence. It then verifies this evidence with an LVLM judge and iteratively self-refines answers based on the verified evidence, reducing the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., an average +3.31% on POPE and +28.34 points on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis; for example, both the integrated self-refinement module and the grounding agent contribute an average +2.0% gain on POPE.

Paradigm Comparison

Compared with prior strategies, KESTREL emphasizes external grounding tools, deterministic evidence, and verification-driven self-refinement.

[Figure: Paradigm comparison]

Framework

KESTREL follows an agent-grounding refinement loop: initialization, agent grounding, claim-level verification, and self-refinement.

[Figure: KESTREL framework]

Given an image-question pair, Kestrel follows a training-free four-stage pipeline for LVLM hallucination mitigation: (1) Initialization, which obtains an initial answer and rewrites it into question-aligned verifiable claims with associated visual entities and claim types; (2) Agent Grounding, which invokes an external SAM3-based grounding agent to collect explicit visual evidence (e.g., segmentation overlays, boxes, and crop-and-zoom views) and converts it into structured textual evidence; (3) Claim-level Verification, which verifies each claim against the cited evidence to produce claim-wise verdicts, confidence scores, and a top-level verification decision; and (4) Self-Refinement, which performs evidence-gated answer updating based on the current and previous verification traces.
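To make the four stages concrete, below is a minimal Python sketch of the loop. It is an illustration under stated assumptions, not the released implementation: the Claim/Verdict containers, the lvlm, grounder, and judge interfaces, and the max_rounds and conf_threshold parameters are all hypothetical names chosen for exposition.

from dataclasses import dataclass

@dataclass
class Claim:
    text: str            # question-aligned verifiable claim
    entities: list[str]  # visual entities the claim mentions
    claim_type: str      # e.g., "existence", "count", "attribute", "relation"

@dataclass
class Verdict:
    claim: Claim
    supported: bool      # does the cited evidence support the claim?
    confidence: float    # judge confidence in [0, 1]

def kestrel_answer(image, question, lvlm, grounder, judge,
                   max_rounds=3, conf_threshold=0.7):
    """Training-free grounding self-refinement loop (illustrative sketch)."""
    # (1) Initialization: draft an answer to be rewritten into claims.
    answer = lvlm.answer(image, question)
    history = []  # verification traces across rounds
    for _ in range(max_rounds):
        claims = lvlm.extract_claims(question, answer)
        # (2) Agent grounding: collect explicit visual evidence
        # (segmentation overlays, boxes, crop-and-zoom views) and
        # convert it into structured textual evidence.
        evidence = grounder.ground(image, claims)
        # (3) Claim-level verification by an LVLM judge: one verdict
        # with a confidence score per claim.
        verdicts = [judge.verify(claim, evidence) for claim in claims]
        history.append(verdicts)
        # Top-level decision: refine only on confident refutations,
        # which gates the update against over-correction.
        refuted = [v for v in verdicts
                   if not v.supported and v.confidence >= conf_threshold]
        if not refuted:
            return answer  # all claims verified; keep the answer
        # (4) Self-refinement conditioned on current and past traces.
        answer = lvlm.refine(image, question, answer, evidence, history)
    return answer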

Illustrative Examples

Representative cases covering object existence, counting, attributes, and spatial or relational reasoning.

Correction Behavior

A high-level view of how initial answers are preserved, corrected, or over-corrected after refinement.

[Figure: Correction behavior summary]
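As a rough illustration, each example can be bucketed by comparing whether the initial and refined answers are correct. The sketch below uses the three outcome labels above; the fourth label ("unchanged-wrong") covers the remaining case and is our own naming, not the paper's.

def correction_behavior(initial_correct: bool, refined_correct: bool) -> str:
    """Classify what refinement did to one answer (illustrative)."""
    if initial_correct and refined_correct:
        return "preserved"       # correct answer kept correct
    if not initial_correct and refined_correct:
        return "corrected"       # hallucination fixed by refinement
    if initial_correct and not refined_correct:
        return "over-corrected"  # refinement broke a correct answer
    return "unchanged-wrong"     # wrong both before and after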

Quantitative Results

POPE

Higher accuracy (↑) indicates better performance. Best results are bolded and second-best results are underlined.

Backbone     Method            MS-COCO                    A-OKVQA                    GQA
                               Rand.   Pop.    Adv.       Rand.   Pop.    Adv.       Rand.   Pop.    Adv.
Qwen3-VL     Baseline          89.00   86.92   86.20      92.36   86.67   81.87      90.70   83.63   81.50
Qwen3-VL     Qwen3-VL agent    91.03   88.06   86.13      92.87   85.20   78.03      91.41   81.81   78.30
Qwen3-VL     VCD               90.40   88.80   87.41      93.53   87.86   82.00      91.56   85.76   81.93
Qwen3-VL     Woodpecker        89.97   88.03   87.10      93.23   88.90   83.33      91.27   86.27   82.77
Qwen3-VL     RITUAL            86.20   83.67   82.27      87.67   83.50   77.76      86.86   82.30   78.23
Qwen3-VL     OPERA             90.50   88.83   87.50      93.76   89.50   83.86      91.80   87.11   83.30
Qwen3-VL     DeGF              90.33   88.16   86.90      92.96   87.70   82.61      91.13   83.79   82.00
Qwen3-VL     Kestrel (ours)    91.53   89.30   87.53      93.46   91.73   86.76      91.67   90.33   86.27
InternVL3.5  Baseline          90.77   88.10   85.73      92.67   87.83   81.53      89.77   84.10   81.31
InternVL3.5  VCD               91.35   89.22   87.60      92.87   89.73   83.73      91.60   85.07   83.37
InternVL3.5  Woodpecker        91.20   89.11   87.50      93.73   89.80   84.00      91.43   85.16   83.26
InternVL3.5  RITUAL            91.60   89.03   87.48      93.71   89.75   83.90      91.39   85.18   83.29
InternVL3.5  OPERA             91.53   89.18   87.55      93.55   89.79   83.81      91.45   85.20   83.31
InternVL3.5  DeGF              91.43   89.12   87.37      93.39   89.68   84.11      91.48   85.06   83.20
InternVL3.5  Kestrel (ours)    91.27   89.27   88.10      93.57   91.80   87.13      91.57   89.87   86.53
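For context on the metric: POPE poses binary yes/no object-existence questions, and accuracy is the fraction of questions answered with the correct polarity. A minimal scoring sketch, assuming predictions and labels arrive as "yes"/"no" strings:

def pope_accuracy(predictions, labels):
    """Percent of yes/no questions answered with the correct polarity."""
    assert len(predictions) == len(labels)
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, labels))
    return 100.0 * correct / len(labels)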

MME-Hallucination

Higher scores (↑) indicate better performance. Results for each backbone are reported in separate tables below.

Qwen3-VL

Backbone   Method          Existence  Count   Position  Color   MME Score
Qwen3-VL   Baseline        195.00     175.00  168.33    193.33  731.66
Qwen3-VL   Qwen3-VL agent  200.00     181.67  168.33    193.33  743.33
Qwen3-VL   VCD             195.00     180.00  168.33    193.33  736.66
Qwen3-VL   Woodpecker      195.00     173.33  168.33    195.00  731.66
Qwen3-VL   RITUAL          195.00     180.00  168.33    193.33  736.66
Qwen3-VL   OPERA           195.00     180.00  168.33    200.00  743.33
Qwen3-VL   Kestrel (ours)  200.00     186.67  180.00    193.33  760.00

InternVL3.5

Backbone     Method          Existence  Count   Position  Color   MME Score
InternVL3.5  Baseline        200.00     175.00  175.00    193.33  743.33
InternVL3.5  VCD             200.00     175.00  175.00    193.33  736.66
InternVL3.5  Woodpecker      200.00     166.67  161.67    186.67  715.01
InternVL3.5  RITUAL          195.00     175.00  175.00    193.33  738.33
InternVL3.5  OPERA           195.00     173.33  175.00    195.00  738.33
InternVL3.5  DeGF            195.00     175.00  168.33    188.33  726.66
InternVL3.5  Kestrel (ours)  200.00     186.67  181.67    195.00  763.34
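For reference, MME scores each category as per-question accuracy plus per-image accuracy+ (credited only when both paired questions about an image are answered correctly), so each category tops out at 200 and the four hallucination categories at 800. A small sketch, assuming results arrive as per-image boolean pairs:

def mme_category_score(results):
    """results: list of (q1_correct, q2_correct) booleans, one pair per image."""
    n = len(results)
    acc = 100.0 * sum(a + b for a, b in results) / (2 * n)   # per-question accuracy
    acc_plus = 100.0 * sum(a and b for a, b in results) / n  # per-image accuracy+
    return acc + acc_plus  # max 200 per category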

BibTeX

If you find our work helpful for your research, please consider citing it 📃

@misc{mao2026kestrelgroundingselfrefinementlvlm,
      title={Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation}, 
      author={Jiawei Mao and Hardy Chen and Haoqin Tu and Yuhan Wang and Letian Zhang and Zeyu Zheng and Huaxiu Yao and Zirui Wang and Cihang Xie and Yuyin Zhou},
      year={2026},
      eprint={2603.16664},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.16664},
}