IntroSVG

IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework

CVPR 2026 Text-to-SVG Introspective Loop
Feiyu Wang1,2*, Jiayuan Yang3, Zhiyuan Zhao2†, Da Zhang2,3, Bingyu Li2,4, Peng Liu1, Junyu Gao2,3
1Fudan University    2TeleAI    3Northwestern Polytechnical University    4University of Science and Technology of China
*Work done during an internship at TeleAI.    †Corresponding author.
A unified VLM plays both Generator and Critic in a closed loop, using rendered PNG feedback to iteratively refine SVG code.
IntroSVG Teaser (Figure 1)
Teaser. IntroSVG uses a unified VLM that drafts SVG, critiques its rendered PNG, and refines the code iteratively.

Abstract

Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advances in content generation enabled by Vision-Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality.

To address this limitation, we propose IntroSVG, an introspective SVG generation framework that uses a unified VLM as both generator and critic. Through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and provide feedback on rendered outputs. We systematically convert early-stage failures into error-correction training data to enhance model robustness. We then leverage a teacher VLM to construct a preference dataset and apply Direct Preference Optimization (DPO) to align the generator's policy. During inference, the model iteratively generates, renders, critiques, and refines SVG code until a quality threshold is met. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple key metrics.

How IntroSVG Works

Overview: data construction → Stage 1 (SFT) → Stage 2 (DPO) → introspective inference loop.

Framework overview (Figure 2)
Framework overview. Mixed datasets for direct generation, correction, and critique; SFT trains a unified model; DPO optimizes generation; inference performs a generate–introspect–refine loop.
Three-stage pipeline (high level)
  • Data construction: synthesize direct generation, correction, and critique data using an early checkpoint and a teacher VLM.
  • Stage 1 (SFT): train a unified VLM to learn both generation and critique behaviors.
  • Stage 2 (DPO): use teacher-evaluated preference pairs to improve first-shot generation quality.
  • Inference loop: the same model alternates roles: generate SVG → render PNG → critique + score → refine prompt → repeat.
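The preference-pair construction for Stage 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample` stands in for the SFT model's sampler and `teacher_score` for the teacher VLM's rating function; both names, the candidate count `k`, and the pair schema are hypothetical.

```python
def build_preference_pairs(prompts, sample, teacher_score, k=4):
    """Sketch of DPO data construction: for each prompt, sample k SVG
    candidates, score each with a teacher, and keep the highest- and
    lowest-scored candidates as a (chosen, rejected) preference pair."""
    pairs = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(k)]
        # Sort ascending by teacher score; worst first, best last.
        ranked = sorted(candidates, key=lambda c: teacher_score(prompt, c))
        pairs.append({
            "prompt": prompt,
            "rejected": ranked[0],   # lowest teacher score
            "chosen": ranked[-1],    # highest teacher score
        })
    return pairs
```

The resulting list of `{"prompt", "chosen", "rejected"}` records is the standard input shape for DPO training loops.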
Introspective refinement loop
  • Draft: the model produces an initial SVG program from the text prompt.
  • Render: the SVG is rasterized into a PNG, providing explicit visual feedback.
  • Critique: the same model evaluates the rendered image against the prompt and emits a structured critique + score.
  • Rewrite: critique is converted into a correction prompt; the model revises the SVG accordingly.
  • Repeat: iterate until the score passes a threshold or a maximum number of rounds is reached.
Inference uses the same VLM for both generation and self-critique, forming a closed "render-and-repair" loop.
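The five steps above can be sketched as a single control loop. The VLM calls and the rasterizer are passed in as stubs here; in the real system one unified VLM plays both roles and an SVG renderer produces the PNG. All identifiers (`generate`, `render`, `critique`, the threshold value) are illustrative assumptions, not the paper's API.

```python
def introspective_loop(prompt, generate, render, critique,
                       threshold=0.8, max_rounds=4):
    """Sketch of the generate-render-critique-refine loop: iterate
    until the self-critique score passes `threshold` or `max_rounds`
    refinement rounds are exhausted."""
    svg = generate(prompt)                       # Draft
    score = 0.0
    for _ in range(max_rounds):
        png = render(svg)                        # Render to pixels
        feedback, score = critique(prompt, png)  # Critique + score
        if score >= threshold:
            break                                # Quality gate passed
        # Rewrite: fold the critique into a correction prompt.
        svg = generate(f"{prompt}\n[Fix]: {feedback}")
    return svg, score
```

Passing the three callables in makes the control flow testable in isolation; the same structure works whether `critique` is the model itself (as in IntroSVG) or an external judge.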

Results

Main Results (Table 2)

Comparison with state-of-the-art methods. Best in red, 2nd in blue, 3rd in bold.

Method                        Avg. Token ↓  RSR% ↑  FID ↓   CLIP-T2I ↑  Aesthetic ↑  HPS ↑
GPT-4o                        273.73        100     37.00   0.2748      4.4103       0.1941
Grok-4                        360.68        100     33.07   0.2717      4.4546       0.1944
Claude 4.5 Sonnet             439.42        100     39.67   0.2853      4.5724       0.1998
Gemini 2.5 Pro                356.00        100     30.52   0.2754      4.5854       0.1981
GPT-5                         452.34        100     34.07   0.2779      4.5232       0.1962
Qwen2.5-VL-72B                300.69        94.86   42.68   0.2533      4.4168       0.1906
DeepSeek-R1                   314.45        99.92   33.98   0.2734      4.5232       0.1962
DeepSeek-V3.1                 367.58        100     36.18   0.2736      4.5539       0.1965
OmniSVG (Qwen2.5-3B)          2260.54       75.36   142.38  0.2297      4.7232       0.1877
SVGen (Qwen2.5-Coder-7B)      1531.42       84.64   26.27   0.2339      4.5858       0.1916
IntroSVG (Ours)               1707.77       99.26   26.18   0.2529      4.8894       0.1969

↓ lower is better  |  ↑ higher is better

Ablation Study (Table 3)

Effect of each component: SFT data composition, DPO, and iterative loop.

Model                 Training Data       Iteration  FID ↓  CLIP-T2I ↑  Aesthetic ↑  HPS ↑
Qwen2.5-VL-7B (Base)  N/A (zero-shot)     ✗          71.10  0.2365      4.3240       0.1820
M_SFT                 D_SFT               ✗          30.15  0.2472      4.8069       0.1910
M_Final               D_SFT ∪ D_pref-G    ✗          29.76  0.2480      4.8372       0.1919
M_Final (Iterative)   D_SFT ∪ D_pref-G    ✓          26.18  0.2529      4.8894       0.1969

D_SFT: supervised fine-tuning data (direct generation, correction, and critique)  |  D_pref-G: preference pairs used for DPO, which improves first-shot quality  |  The iterative loop provides the final boost

Qualitative results (Figure 3)
Qualitative results. The iterative refinement process progressively improves SVG quality.

Citation

@inproceedings{wang2026introsvg,
  title     = {IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator--Critic Framework},
  author    = {Wang, Feiyu and Yang, Jiayuan and Zhao, Zhiyuan and Zhang, Da and Li, Bingyu and Liu, Peng and Gao, Junyu},
  booktitle = {CVPR},
  year      = {2026},
}
© IntroSVG Team · Built as a static site (HTML/CSS/JS)