Scalable Vector Graphics (SVG) is central to digital design due to its inherent scalability and editability. Despite significant advances in content generation enabled by Vision-Language Models (VLMs), existing text-to-SVG generation methods face a core limitation: the autoregressive training process never incorporates visual perception of the final rendered image, which fundamentally constrains generation quality.
To address this limitation, we propose IntroSVG, an introspective SVG generation framework that uses a unified VLM as both generator and critic. Through Supervised Fine-Tuning (SFT), the model learns to draft SVG code and to provide feedback on its rendered output. We systematically convert early-stage failures into error-correction training data to improve model robustness. We then leverage a teacher VLM to construct a preference dataset and apply Direct Preference Optimization (DPO) to align the generator's policy. At inference time, the model iteratively generates, renders, critiques, and refines SVG code until a quality threshold is met. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple key metrics.
Overview: data construction → Stage 1 (SFT) → Stage 2 (DPO) → introspective inference loop.
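The introspective inference loop above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: `ToyVLM`, `render`, `introspective_generate`, and the threshold value are all placeholder names standing in for the unified generator–critic VLM, the SVG rasterizer, and the stopping criterion.

```python
from dataclasses import dataclass

@dataclass
class ToyVLM:
    """Toy stand-in for the unified generator-critic VLM (illustrative only)."""

    def generate_svg(self, prompt: str) -> str:
        # Draft stage: produce initial SVG code from the text prompt.
        return f'<svg><text>{prompt}</text></svg>'

    def critique(self, prompt: str, image: str):
        # Critic stage: score the rendered output and return feedback.
        # (Toy rule: the score rises once a fill attribute is present.)
        score = 1.0 if 'fill=' in image else 0.5
        feedback = 'ok' if score >= 0.9 else 'add color'
        return score, feedback

    def refine(self, prompt: str, svg: str, feedback: str) -> str:
        # Refinement stage: revise the SVG code using the critic's feedback.
        return svg.replace('<text>', '<text fill="red">')

def render(svg: str) -> str:
    # Placeholder rasterizer; a real pipeline would produce pixels here.
    return svg

def introspective_generate(prompt: str, model: ToyVLM,
                           max_iters: int = 5, threshold: float = 0.9) -> str:
    """Generate, render, critique, and refine until the quality threshold is met."""
    svg = model.generate_svg(prompt)
    for _ in range(max_iters):
        score, feedback = model.critique(prompt, render(svg))
        if score >= threshold:
            break
        svg = model.refine(prompt, svg, feedback)
    return svg
```

The loop terminates either when the critic's score clears the threshold or after a fixed iteration budget, mirroring the generate → render → critique → refine cycle described above.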
Comparison with state-of-the-art methods. Best result in red, second-best in blue, third-best in bold.
| Method | Avg. Token ↓ | RSR% ↑ | FID ↓ | CLIP-T2I ↑ | Aesthetic ↑ | HPS ↑ |
|---|---|---|---|---|---|---|
| GPT-4o | 273.73 | 100 | 37.00 | 0.2748 | 4.4103 | 0.1941 |
| Grok-4 | 360.68 | 100 | 33.07 | 0.2717 | 4.4546 | 0.1944 |
| Claude 4.5 Sonnet | 439.42 | 100 | 39.67 | 0.2853 | 4.5724 | 0.1998 |
| Gemini 2.5 Pro | 356.00 | 100 | 30.52 | 0.2754 | 4.5854 | 0.1981 |
| GPT-5 | 452.34 | 100 | 34.07 | 0.2779 | 4.5232 | 0.1962 |
| Qwen2.5-VL-72B | 300.69 | 94.86 | 42.68 | 0.2533 | 4.4168 | 0.1906 |
| DeepSeek-R1 | 314.45 | 99.92 | 33.98 | 0.2734 | 4.5232 | 0.1962 |
| DeepSeek-V3.1 | 367.58 | 100 | 36.18 | 0.2736 | 4.5539 | 0.1965 |
| OmniSVG (Qwen2.5-3B) | 2260.54 | 75.36 | 142.38 | 0.2297 | 4.7232 | 0.1877 |
| SVGen (Qwen2.5-Coder-7B) | 1531.42 | 84.64 | 26.27 | 0.2339 | 4.5858 | 0.1916 |
| IntroSVG (Ours) | 1707.77 | 99.26 | 26.18 | 0.2529 | 4.8894 | 0.1969 |
↓ lower is better | ↑ higher is better
Effect of each component: SFT data composition, DPO, and iterative loop.
| Model | Training Data | Iteration | FID ↓ | CLIP-T2I ↑ | Aesthetic ↑ | HPS ↑ |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Base) | N/A (Zero-shot) | × | 71.10 | 0.2365 | 4.3240 | 0.1820 |
| M_SFT | D_SFT | × | 30.15 | 0.2472 | 4.8069 | 0.1910 |
| M_Final | D_SFT ∪ D_pref-G | × | 29.76 | 0.2480 | 4.8372 | 0.1919 |
| M_Final (Iterative) | D_SFT ∪ D_pref-G | ✓ | 26.18 | 0.2529 | 4.8894 | 0.1969 |
D_SFT: supervised fine-tuning dataset | D_pref-G: generator preference dataset used for DPO | DPO improves first-shot quality; the iterative loop provides a further boost.
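For reference, the DPO stage in the ablation optimizes the standard DPO objective: the negative log-sigmoid of a scaled margin between the chosen and rejected responses' log-probability ratios against a frozen reference policy. The sketch below computes this per preference pair; the function name, argument names, and the default `beta` are illustrative, not taken from the paper.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard per-pair DPO loss:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]),
    where w is the preferred (chosen) SVG and l the rejected one."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically plain logistic: -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference (zero margin), the loss is log 2; it decreases as the policy assigns relatively more probability to the teacher-preferred SVG, which is what aligns the generator's policy in Stage 2.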
@inproceedings{wang2026introsvg,
title = {IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator--Critic Framework},
author = {Wang, Feiyu and Yang, Jiayuan and Zhao, Zhiyuan and Zhang, Da and Li, Bingyu and Liu, Peng and Gao, Junyu},
booktitle = {CVPR},
year = {2026},
}