Overview
Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. Because of this misalignment, generated token sequences may decode into low-quality images, with no direct supervision from pixel space. To address this, we propose VA-π, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-π formulates generator–tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize over the discrete token space, VA-π introduces an RL-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward measures how well the predicted token sequences reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The prior term of the ELBO serves as a natural regularizer that maintains the distributional consistency of the generated tokens.
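The core training signal can be sketched in a few lines. Below is a minimal, self-contained toy (the function names, the 4-entry codebook, and the scalar "pixels" are our illustrative assumptions, not the paper's implementation): teacher-forced greedy tokens are decoded back to pixels and scored by negative reconstruction MSE, while a token-level cross-entropy keeps the policy close to the prior.

```python
import math

# Toy codebook: token id -> "pixel" value (a stand-in for the tokenizer decoder).
CODEBOOK = {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75}

def decode(tokens):
    """Toy tokenizer decoder: map each token id to a pixel value."""
    return [CODEBOOK[t] for t in tokens]

def teacher_forced_predictions(logits_per_step):
    """Greedy (argmax) token at each step; ground-truth context is supplied
    by teacher forcing, so no free-running sampling is needed."""
    return [max(range(len(l)), key=l.__getitem__) for l in logits_per_step]

def reconstruction_reward(pred_tokens, target_pixels):
    """Intrinsic reward: negative pixel MSE between the decoded prediction
    and the original image."""
    recon = decode(pred_tokens)
    mse = sum((r - t) ** 2 for r, t in zip(recon, target_pixels)) / len(recon)
    return -mse

def prior_ce(logits_per_step, gt_tokens):
    """Token-level cross-entropy to ground-truth tokens: the ELBO's prior
    term, keeping the policy near the pre-trained AR distribution."""
    ce = 0.0
    for logits, t in zip(logits_per_step, gt_tokens):
        log_z = math.log(sum(math.exp(x) for x in logits))
        ce += log_z - logits[t]
    return ce / len(gt_tokens)
```

In a real setting the reward would drive a policy-gradient update while the cross-entropy term is added with a weight β; here the pieces are only shown in isolation.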
Observations
1. Alignment to pixel space
VA-π aligns autoregressive image generation with the ground-truth distribution in pixel space. Qualitatively, VA-π corrects off-manifold token sequences that decode into distorted structures, producing more coherent and faithful reconstructions. Quantitatively, both embedding density estimation (KDE) and low-dimensional projections (t-SNE) show that VA-π shifts generated images closer to the ground-truth manifold.
2. VA-π outperforms naive post-training
We benchmark VA-π against naive post-training baselines on two representative settings: class-to-image generation on ImageNet-1K (LlamaGen-XL/XXL; evaluated with and without CFG), and text-to-image reasoning on GenEval (LlamaGen-XL and the unified multimodal model Janus-Pro 1B). We report standard generation metrics (FID/IS/Precision/Recall for C2I; GenEval sub-scores and overall for T2I) and include tuning time and whether an external reward is used.
- Lightweight gains. Without external reward, VA-π improves LlamaGen-XXL on ImageNet-1K from FID 14.36 → 7.65 (w/o CFG) in 25 minutes, and improves LlamaGen-XL from FID 15.55 → 9.23 in 20 minutes.
- Better quality at low cost. VA-π boosts perceptual quality significantly: for LlamaGen-XXL (w/o CFG) IS 86.55 → 116.70, and for LlamaGen-XL (w/o CFG) IS 79.16 → 111.59; with CFG it reaches IS 299.63 on LlamaGen-XL while still keeping tuning time at 20 minutes.
- Generalizes beyond C2I. On GenEval, VA-π improves LlamaGen-XL overall from 0.306 → 0.339, and improves the unified multimodal model Janus-Pro 1B from 0.725 → 0.744, without task-specific fine-tuning on GenEval.
| Model | Ext. Rwd | Time (min) ↓ | FID ↓ (w/o CFG) | IS ↑ (w/o CFG) | Pre. ↑ (w/o CFG) | Rec. ↑ (w/o CFG) | FID ↓ (w/ CFG) | IS ↑ (w/ CFG) | Pre. ↑ (w/ CFG) | Rec. ↑ (w/ CFG) |
|---|---|---|---|---|---|---|---|---|---|---|
| LlamaGen-XL (775M) | -- | -- | 15.55 | 79.16 | 0.62 | 0.69 | 2.79 | 286.88 | 0.84 | 0.54 |
| + AR-GRPO | ✓ | 149 | -- | -- | -- | -- | 3.63 | 293.07 | 0.86 | 0.48 |
| + VA-π (Ours) | × | 20 | 9.23 | 111.59 | 0.71 | 0.59 | 2.94 | 299.63 | 0.84 | 0.53 |
| LlamaGen-XXL (1.4B) | -- | -- | 14.36 | 86.55 | 0.63 | 0.69 | 2.37 | 252.16 | 0.81 | 0.59 |
| + Post-train Tokenizer | × | 18 | 14.26 | 86.70 | 0.63 | 0.68 | 2.72 | 246.97 | 0.80 | 0.59 |
| + Post-train Tokenizer (longer) | × | 207 | 22.99 | 72.49 | 0.56 | 0.68 | 4.31 | 221.57 | 0.75 | 0.58 |
| + STE based Post-train AR | × | 381 | 11.46 | 102.21 | 0.68 | 0.61 | 4.17 | 267.34 | 0.83 | 0.51 |
| + VA-π (Ours) | × | 25 | 7.65 | 116.70 | 0.71 | 0.64 | 2.28 | 273.53 | 0.83 | 0.56 |
| Model | Ext. Rwd | Position ↑ | Color ↑ | Attr. Bind. ↑ | Counting ↑ | Single Obj. ↑ | Two Obj. ↑ | Overall ↑ |
|---|---|---|---|---|---|---|---|---|
| LlamaGen-XL | -- | 0.042 | 0.550 | 0.032 | 0.197 | 0.750 | 0.263 | 0.306 |
| + AR-GRPO | ✓ | 0.040 | 0.593 | 0.030 | 0.228 | 0.791 | 0.263 | 0.324 |
| + VA-π (Ours) | × | 0.050 | 0.606 | 0.040 | 0.238 | 0.769 | 0.328 | 0.339 |
| Janus-Pro 1B | - | 0.605 | 0.902 | 0.540 | 0.531 | 0.972 | 0.801 | 0.725 |
| + VA-π (Ours) | × | 0.600 | 0.912 | 0.585 | 0.540 | 0.988 | 0.835 | 0.744 |
In addition to the quantitative improvements, we provide a qualitative comparison to highlight the visual differences between the naive post-training baselines and VA-π. Under identical decoding settings (ImageNet-1K, CFG = 1.0), VA-π produces samples with more coherent structure and fewer token-induced artifacts.
For additional qualitative comparisons (C2I and T2I), see More Qualitative Results.
Ablation Study
We ablate key components in VA-π, including reward composition, prior regularization, and contextual noise.
Reward and Loss Composition
Reconstruction-only rewards (L_MSE / L_p) are unstable because the policy drifts away from the pre-trained token distribution. Adding token-level prior regularization (L_prior, cross-entropy) stabilizes training, and the full objective achieves the best overall trade-off.
| L_MSE | L_p | L_prior | FID ↓ | IS ↑ | Pre. ↑ | Rec. ↑ |
|---|---|---|---|---|---|---|
| | | | 14.36 | 86.55 | 0.63 | 0.69 |
| ✓ | | | 38.76 | 49.78 | 0.48 | 0.46 |
| ✓ | ✓ | | 38.63 | 48.14 | 0.49 | 0.46 |
| | | ✓ | 14.17 | 88.78 | 0.63 | 0.69 |
| ✓ | ✓ | ✓ | 7.65 | 116.70 | 0.68 | 0.64 |
Prior Regularization Term
We vary the regularization weight β for the KL and CE variants. Without regularization, optimization diverges (FID 38.63); moderate regularization (β = 0.1) improves both FID and IS, while overly strong regularization (β = 1.0) hurts diversity. CE consistently outperforms KL, with the best results at β = 0.1.
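A plausible form of the regularized objective (our notation; the exact weighting in the paper may differ) is

$$\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{\hat{y}\sim\pi_\theta}\!\big[R(\hat{y})\big] \;+\; \beta\,\mathcal{L}_{\text{prior}},$$

where $R$ is the pixel-space reconstruction reward and $\mathcal{L}_{\text{prior}}$ is either the cross-entropy to ground-truth tokens (CE variant) or the divergence to the pre-trained policy (KL variant).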
Contextual Noise
We ablate the corruption probability ξ for contextual noise in the LlamaGen T2I post-training setting. Moderate noise (ξ = 0.5) performs best on GenEval (Overall 0.339), while ξ = 0 or overly strong noise (ξ > 0.75) degrades performance.
| ξ | Position ↑ | Color ↑ | Attr. Bind. ↑ | Counting ↑ | Single Obj. ↑ | Two Obj. ↑ | Overall ↑ |
|---|---|---|---|---|---|---|---|
| 0 | 0.048 | 0.566 | 0.023 | 0.159 | 0.688 | 0.326 | 0.302 |
| 0.25 | 0.043 | 0.598 | 0.025 | 0.215 | 0.700 | 0.306 | 0.315 |
| 0.5 | 0.050 | 0.606 | 0.040 | 0.238 | 0.769 | 0.328 | 0.339 |
| 0.75 | 0.075 | 0.641 | 0.028 | 0.163 | 0.750 | 0.333 | 0.332 |
| 0.95 | 0.043 | 0.652 | 0.040 | 0.181 | 0.728 | 0.328 | 0.329 |
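The corruption itself can be sketched directly. The helper below is our illustrative assumption of how such contextual noise could be applied (the name `corrupt_context` and the uniform-resampling choice are ours, not necessarily the paper's exact scheme): each context token is independently replaced with a random codebook index with probability ξ.

```python
import random

def corrupt_context(tokens, xi, vocab_size, rng=random):
    """Return a copy of `tokens` where each entry is resampled uniformly
    from the codebook with probability `xi`, simulating the imperfect
    context the model sees at generation time."""
    return [rng.randrange(vocab_size) if rng.random() < xi else t
            for t in tokens]
```

With ξ = 0 the context is untouched; as ξ grows, teacher forcing increasingly resembles free-running conditions, which matches the sweep above where a moderate ξ works best.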
More Qualitative Results
Class-to-image generation (ImageNet-1K)
We provide additional qualitative comparisons on C2I generation across ImageNet-1K classes. All samples use identical decoding settings (CFG = 1.0, temperature = 1.0, top-k = 0, top-p = 1.0).
Text-to-image generation (GenEval)
We present additional T2I qualitative comparisons on GenEval prompts, focusing on harder compositional tasks (attribute binding, counting, position, two-object combination). All samples use identical decoding settings (CFG = 5.0, temperature = 1.0, top-k = 0, top-p = 1.0).
Citation
```bibtex
@misc{vapi2025,
  title={VA-$\pi$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation},
  author={Xinyao Liao and Qiyuan He and Kai Xu and Xiaoye Qu and Yicong Li and Wei Wei and Angela Yao},
  year={2025},
  eprint={2512.19680},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.19680},
}
```