InsightTok

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Yang Yue¹ Fangyun Wei² Tianyu He² Jinjing Zhao² Zanlin Ni¹ Zeyu Liu¹

Jiayi Guo¹ Lei Shi² Yue Dong² Li Chen² Ji Li² Gao Huang¹,✉ Dong Chen²,✉

¹Tsinghua University ²Microsoft Research

Overview

InsightTok is a discrete visual tokenizer designed to improve the fidelity of text and faces, two of the most challenging yet perceptually important structures in autoregressive image generation.

Existing visual tokenizers are typically trained with generic reconstruction objectives, which do not explicitly prioritize these fidelity-critical regions. InsightTok addresses this limitation through localized, content-aware perceptual supervision, enabling substantially better preservation of textual content and facial details under a compact discrete bottleneck.

Highlights

State-of-the-art text and face reconstruction among discrete visual tokenizers at the same compression rate, using 16× downsampling and a compact 16,384-entry codebook
Minimal additional training overhead over a vanilla VQGAN-style tokenizer
No changes required to downstream generative modeling; readily compatible with standard autoregressive image generation pipelines
Tokenizer improvements transfer effectively to downstream text-to-image generation, yielding clearer text and more faithful facial details
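
To make the compression rate in the first highlight concrete, the token budget can be worked out with a few lines of arithmetic. This is only an illustrative sketch: the 16× downsampling factor and the 16,384-entry codebook come from the highlights, while the 256×256 input resolution and the helper name `token_budget` are assumptions made for the example.

```python
def token_budget(height, width, downsample=16, codebook_size=16384):
    """Return (num_tokens, bits_per_token, total_bits) for one image."""
    # Each spatial side shrinks by the downsampling factor.
    grid_h, grid_w = height // downsample, width // downsample
    num_tokens = grid_h * grid_w
    # Bits needed to address any codebook entry (14 for 16,384 entries).
    bits_per_token = (codebook_size - 1).bit_length()
    return num_tokens, bits_per_token, num_tokens * bits_per_token


# Hypothetical 256x256 input; the resolution is illustrative, not from the source.
tokens, bits, total = token_budget(256, 256)
print(tokens, bits, total)  # 256 14 3584
```

Under these assumptions an image becomes 256 tokens of 14 bits each, i.e. 3,584 bits (448 bytes), compared with 196,608 bytes for the raw 8-bit RGB pixels. That is the size of the bottleneck through which text strokes and facial details have to pass.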

Main Results

Tokenizer Reconstruction

InsightTok delivers substantial improvements in both text and face reconstruction quality while maintaining strong general reconstruction performance.

Autoregressive Image Generation

The benefits of InsightTok also transfer to downstream autoregressive image generation.

Below is a gallery of images generated by InsightAR.

Usage

Model checkpoints are available at https://huggingface.co/yueyang2000/InsightTok.

InsightTok follows the standard VQGAN-style autoencoding interface:

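# image encoding: returns the quantized latents and the discrete codebook indices
latents, _, [_, _, indices] = vq_model.encode(input_image_tensor)
# image decoding: reconstructs the image from the quantized latents
recon_image_tensor = vq_model.decode(latents)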
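
For readers unfamiliar with this interface, the toy class below mimics the same call pattern on plain Python lists. It is a hypothetical stand-in, not the InsightTok model: `ToyVQ`, its three-entry codebook, and the identity decoder are all assumptions made for illustration, and the slots that the snippet above discards with `_` are simply filled with `None` here because their contents are not described in this README.

```python
class ToyVQ:
    """Toy stand-in for a VQGAN-style tokenizer; NOT the InsightTok model."""

    def __init__(self, codebook):
        self.codebook = codebook  # list of code vectors (lists of floats)

    def encode(self, vectors):
        # Nearest-neighbour lookup: map each input vector to its closest code.
        indices = []
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in self.codebook]
            indices.append(dists.index(min(dists)))
        latents = [self.codebook[i] for i in indices]
        # Same unpacking shape as the snippet above; unused slots are None.
        return latents, None, [None, None, indices]

    def decode(self, latents):
        # A real decoder is a neural network; this toy one is the identity.
        return [list(v) for v in latents]


vq_model = ToyVQ(codebook=[[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
latents, _, [_, _, indices] = vq_model.encode([[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]])
recon = vq_model.decode(latents)
print(indices)  # [0, 1, 2]
print(recon)    # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

The point of the sketch is the data flow: `indices` is the discrete representation an autoregressive model would predict, while `latents` holds the corresponding codebook vectors that the decoder consumes.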

We also provide a simple image reconstruction demo in recon_demo.py:

# --ckpt_path is optional; the checkpoint is downloaded from Hugging Face if it is not provided
python recon_demo.py \
  --ckpt_path <model-checkpoint-path> \
  --input assets/valset \
  --output outputs/recon

Acknowledgments

This project builds upon the excellent open-source efforts of LlamaGen, Seed-Voken, Janus-Pro, TokBench, DocTR, and InsightFace.

We sincerely thank the authors and contributors of these projects and benchmarks for making this research possible.

Citation

If you find this work useful, please consider citing our paper.

@article{yue2026insighttok,
  title={InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation},
  author={Yue, Yang and Wei, Fangyun and He, Tianyu and Zhao, Jinjing and Ni, Zanlin and Liu, Zeyu and Guo, Jiayi and Shi, Lei and Dong, Yue and Chen, Li and Li, Ji and Huang, Gao and Chen, Dong},
  journal={arXiv preprint arXiv:2605.14333},
  year={2026}
}

Contact

If you have any questions, please feel free to contact the authors.

Yang Yue: yueyang22@mails.tsinghua.edu.cn
