InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
Yang Yue1 Fangyun Wei2 Tianyu He2 Jinjing Zhao2 Zanlin Ni1 Zeyu Liu1
Jiayi Guo1 Lei Shi2 Yue Dong2 Li Chen2 Ji Li2 Gao Huang1,✉ Dong Chen2,✉
1Tsinghua University 2Microsoft Research
InsightTok is a discrete visual tokenizer designed to improve the fidelity of text and faces, two of the most challenging yet perceptually important structures in autoregressive image generation.
Existing visual tokenizers are typically trained with generic reconstruction objectives, which do not explicitly prioritize these fidelity-critical regions. InsightTok addresses this limitation through localized, content-aware perceptual supervision, enabling substantially better preservation of textual content and facial details under a compact discrete bottleneck.
- State-of-the-art text and face reconstruction among discrete visual tokenizers at the same compression rate, using 16× downsampling and a compact 16,384-entry codebook
- Minimal additional training overhead over a vanilla VQGAN-style tokenizer
- No changes required to downstream generative modeling; readily compatible with standard autoregressive image generation pipelines
- Tokenizer improvements transfer effectively to downstream text-to-image generation, yielding clearer text and more faithful facial details
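To make the compression setting above concrete, the following is a minimal sketch of the token-budget arithmetic implied by 16× downsampling and a 16,384-entry codebook. The helper name `token_budget` and the 256×256 input resolution are illustrative assumptions, not part of the InsightTok codebase or a statement about its training resolution.

```python
import math


def token_budget(height, width, downsample=16, codebook_size=16_384):
    """Return (num_tokens, bits_per_token, total_bits) for one image.

    Hypothetical helper for illustration only; it is not an InsightTok API.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    # Each spatial side shrinks by the downsampling factor.
    grid_h, grid_w = height // downsample, width // downsample
    num_tokens = grid_h * grid_w
    # One index into the codebook carries log2(codebook_size) bits.
    bits_per_token = math.log2(codebook_size)
    return num_tokens, bits_per_token, num_tokens * bits_per_token


# Assumed example resolution: 256x256.
tokens, bits, total = token_budget(256, 256)
print(tokens, bits, total)  # 256 14.0 3584.0
```

Under these assumptions, a 256×256 image becomes a 16×16 grid of 256 tokens at 14 bits each, or 3,584 bits (448 bytes) in total. This is the size of the discrete bottleneck through which text and face details must be preserved.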
InsightTok delivers substantial improvements in both text and face reconstruction quality while maintaining strong general reconstruction performance.
The benefits of InsightTok also transfer to downstream autoregressive image generation.
Below is a gallery of images generated by InsightAR.
Model checkpoints are available at https://huggingface.co/yueyang2000/InsightTok.
InsightTok follows the standard VQGAN-style autoencoding interface:
# image encoding: returns the latents and the discrete token indices
latents, _, [_, _, indices] = vq_model.encode(input_image_tensor)
# image decoding
recon_image_tensor = vq_model.decode(latents)
We also provide a simple image reconstruction demo in recon_demo.py:
# --ckpt_path is optional; if omitted, the checkpoint is downloaded from Hugging Face
python recon_demo.py \
    --ckpt_path <model-checkpoint-path> \
    --input assets/valset \
    --output outputs/recon
This project builds upon the excellent open-source efforts of LlamaGen, Seed-Voken, Janus-Pro, TokBench, DocTR, and InsightFace.
We sincerely thank the authors and contributors of these projects and benchmarks for making this research possible.
If you find this work useful, please consider citing our paper.
@article{yue2026insighttok,
title={InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation},
author={Yue, Yang and Wei, Fangyun and He, Tianyu and Zhao, Jinjing and Ni, Zanlin and Liu, Zeyu and Guo, Jiayi and Shi, Lei and Dong, Yue and Chen, Li and Li, Ji and Huang, Gao and Chen, Dong},
journal={arXiv preprint arXiv:2605.14333},
year={2026}
}
If you have any questions, please feel free to contact the authors.
Yang Yue: yueyang22@mails.tsinghua.edu.cn



