InsightTok

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Yang Yue¹, Fangyun Wei², Tianyu He², Jinjing Zhao², Zanlin Ni¹, Zeyu Liu¹,

Jiayi Guo¹, Lei Shi², Yue Dong², Li Chen², Ji Li², Gao Huang¹,✉, Dong Chen²,✉

¹Tsinghua University  ²Microsoft Research


Overview

InsightTok is a discrete visual tokenizer designed to improve the fidelity of text and faces, two of the most challenging yet perceptually important structures in autoregressive image generation.

Existing visual tokenizers are typically trained with generic reconstruction objectives, which do not explicitly prioritize these fidelity-critical regions. InsightTok addresses this limitation through localized, content-aware perceptual supervision, enabling substantially better preservation of textual content and facial details under a compact discrete bottleneck.

Highlights

  • State-of-the-art text and face reconstruction among discrete visual tokenizers at the same compression rate, using 16× downsampling and a compact 16,384-entry codebook
  • Minimal additional training overhead over a vanilla VQGAN-style tokenizer
  • No changes required to downstream generative modeling; readily compatible with standard autoregressive image generation pipelines
  • Tokenizer improvements transfer effectively to downstream text-to-image generation, yielding clearer text and more faithful facial details
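To make the compression figures above concrete, the following minimal sketch works out the size of the token grid and the information budget implied by 16× downsampling and a 16,384-entry codebook. The 256×256 input resolution and the helper names `token_grid` and `bits_per_token` are assumptions chosen for illustration; they are not part of the InsightTok codebase.

```python
import math


def token_grid(height: int, width: int, downsample: int = 16) -> tuple[int, int]:
    """Return (rows, cols) of the discrete token grid for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return height // downsample, width // downsample


def bits_per_token(codebook_size: int = 16_384) -> float:
    """Bits needed to index one codebook entry."""
    return math.log2(codebook_size)


# Hypothetical 256x256 input, used only to illustrate the arithmetic.
rows, cols = token_grid(256, 256)
num_tokens = rows * cols
total_bits = num_tokens * bits_per_token()
print(rows, cols, num_tokens, total_bits)
```

Under these assumptions, a 256×256 image maps to a 16×16 grid of 256 tokens at 14 bits each, about 3,584 bits (448 bytes) in total. This illustrates how compact the discrete bottleneck is.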

Main Results

Tokenizer Reconstruction

InsightTok delivers substantial improvements in both text and face reconstruction quality while maintaining strong general reconstruction performance.

Autoregressive Image Generation

The benefits of InsightTok also transfer to downstream autoregressive image generation.

Below is a gallery of images generated by InsightAR.

Usage

Model checkpoints are available at https://huggingface.co/yueyang2000/InsightTok.

InsightTok follows the standard VQGAN-style autoencoding interface:

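# Image encoding: `latents` is what the decoder consumes; `indices` holds the discrete token indices
latents, _, [_, _, indices] = vq_model.encode(input_image_tensor)
# Image decoding: reconstruct the image tensor from the latents
recon_image_tensor = vq_model.decode(latents)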
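The snippet above does not say how `input_image_tensor` should be prepared. The sketch below shows one common convention for VQGAN-style tokenizers: a batch of RGB images in channels-first layout, scaled to [-1, 1]. This convention is an assumption, not something confirmed for InsightTok, so check `recon_demo.py` for the preprocessing the model actually expects. The helpers `to_model_input` and `to_uint8_image` are hypothetical names introduced here.

```python
import numpy as np


def to_model_input(image_u8: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB image to a 1x3xHxW float32 array in [-1, 1].

    The [-1, 1] range and channels-first layout are assumed, not confirmed.
    """
    if image_u8.dtype != np.uint8 or image_u8.ndim != 3 or image_u8.shape[2] != 3:
        raise ValueError("expected an HxWx3 uint8 image")
    x = image_u8.astype(np.float32) / 127.5 - 1.0
    # HWC -> CHW, then add a leading batch dimension.
    return x.transpose(2, 0, 1)[None]


def to_uint8_image(recon: np.ndarray) -> np.ndarray:
    """Convert a 1x3xHxW float array in roughly [-1, 1] back to HxWx3 uint8."""
    x = np.clip(recon[0], -1.0, 1.0)
    x = (x + 1.0) * 127.5
    # CHW -> HWC after rounding to the nearest integer intensity.
    return np.round(x).transpose(1, 2, 0).astype(np.uint8)
```

To pass the result to a PyTorch model, wrap the array with `torch.from_numpy(...)` and move it to the model's device. The decoded tensor can go back through `to_uint8_image` after `.detach().cpu().numpy()`.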

We also provide a simple image reconstruction demo in recon_demo.py:

# --ckpt_path is optional; if omitted, the checkpoint is downloaded from Hugging Face
python recon_demo.py \
  --ckpt_path <model-checkpoint-path> \
  --input assets/valset \
  --output outputs/recon

Acknowledgments

This project builds upon the excellent open-source efforts of LlamaGen, Seed-Voken, Janus-Pro, TokBench, DocTR, and InsightFace.

We sincerely thank the authors and contributors of these projects and benchmarks for making this research possible.

Citation

If you find this work useful, please consider citing our paper.

@article{yue2026insighttok,
  title={InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation},
  author={Yue, Yang and Wei, Fangyun and He, Tianyu and Zhao, Jinjing and Ni, Zanlin and Liu, Zeyu and Guo, Jiayi and Shi, Lei and Dong, Yue and Chen, Li and Li, Ji and Huang, Gao and Chen, Dong},
  journal={arXiv preprint arXiv:2605.14333},
  year={2026}
}

Contact

If you have any questions, please feel free to contact the authors.

Yang Yue: yueyang22@mails.tsinghua.edu.cn
