Yan Zheng
Department of Computer Science
The University of Texas at Austin
Dissertation Defense · April 16, 2026
Committee: Zhangyang Wang (advisor), Qiang Liu, Georgios Pavlakos, Amy Zhang, Mingyuan Zhou
How should AI understand and interact with the 3D physical world?
Industrial graphics engines provide causal grounding and symbolic interfaces that complement neural generative models — enabling both verifiable spatial intelligence evaluation and geometry-grounded visual synthesis.
Three investigations:
Part I: Flow Generation
- Oscillation Inversion · AAAI 2026 Oral · Yan Zheng, et al., Zhangyang Wang
- FlowMorph · WACV 2026 · Yan Zheng, et al., Zhangyang Wang
- Flow-Optimizer / Straight-SDS · CVPR'25 Workshop · Yan Zheng, et al., Zhangyang Wang

Part II: Neural 3D Geometry
- Neural Volumetric Mesh Generator · NeurIPS 2022 Workshop · Yan Zheng, Lemeng Wu, Xingchao Liu, Zhen Chen, Qiang Liu, Qixing Huang

Part III: Verifiable SSI
- VoxelCodeBench · ICML 2026 (under review) · Yan Zheng, Florian Bordes
- VeriWorld Bench · in progress · Yan Zheng, Zhangyang Wang
Oscillation Inversion (AAAI 2026) · FlowMorph (WACV 2026) · Straight-SDS
Fixed-point iteration for flow inversion: \(z^{(k+1)} = y - (\sigma_0 - \sigma_{t_0}) v_\theta(z^{(k)}, \sigma_{t_0})\)
Discovery: In large flow models (FLUX, HunyuanVideo), this does not converge — it oscillates between semantically coherent clusters. Jacobian has singular values > 1 → locally expanding → oscillation guaranteed.
Goal: find intermediate latent \(z_{t_0}\) such that one-step generation recovers image \(y\):
\(z_{t_0} + (\sigma_0 - \sigma_{t_0})\, v_\theta(z_{t_0}, \sigma_{t_0}) = y\)
Fixed-point iteration to solve:
\(z^{(k+1)} = y - (\sigma_0 - \sigma_{t_0})\, v_\theta(z^{(k)}, \sigma_{t_0})\)
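The iteration can be sketched in a few lines. Here a toy 1-D linear field stands in for the learned velocity \(v_\theta\) (the real model is a large network; this sketch only illustrates the dynamics):

```python
def invert_fixed_point(y, v_theta, sigma_0, sigma_t0, n_iter=50):
    # Fixed-point iteration: z <- y - (sigma_0 - sigma_t0) * v_theta(z, sigma_t0)
    z = y
    traj = []
    for _ in range(n_iter):
        z = y - (sigma_0 - sigma_t0) * v_theta(z, sigma_t0)
        traj.append(z)
    return z, traj

# Toy check with v(z, s) = 0.5 z: the map z -> 1 - 0.5 z has slope -0.5,
# so iterates alternate around the fixed point z* = 2/3 while converging;
# averaging consecutive (odd/even) iterates also recovers z*.
z, traj = invert_fixed_point(1.0, lambda z, s: 0.5 * z, 1.0, 0.0)
```

With a large flow model the map is locally expanding (singular values > 1), so instead of converging the iterates settle into oscillation between clusters, which is exactly the regime the method exploits.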
(a) Toy Gaussian mixture setting. (b–d) Averaging odd/even clusters recovers the true fixed point — validated by Theorem 1.
Trained flow matching on toy distribution. Columns: 1, 2, 4 input images. Row (a): inverted latents. Row (b): one-step predictions. Row (c): trajectory distances — more inputs → more regular oscillation.
Instead of inverting one image, cycle through a group \(\{y_1, \dots, y_m\}\):
\(z^{(k+1)} = y_{((k \bmod m) + 1)} - (\sigma_0 - \sigma_{t_0})\, v_\theta(z^{(k)}, \sigma_{t_0})\)
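The group variant simply cycles the iteration target each step; a sketch with the same toy linear field standing in for the learned velocity (not the real model):

```python
def group_invert(ys, v_theta, sigma_0, sigma_t0, n_iter=60):
    # Cycle the iteration target through the group {y_1, ..., y_m}
    z = ys[0]
    m = len(ys)
    for k in range(m * n_iter):
        z = ys[k % m] - (sigma_0 - sigma_t0) * v_theta(z, sigma_t0)
    return z

# With m = 1 this reduces to the plain fixed-point iteration:
z1 = group_invert([1.0], lambda z, s: 0.5 * z, 1.0, 0.0)
```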
Image Enhancement: input A (low quality) twice + B → 3 clusters. Clusters 1 and 2 are expelled (low quality); cluster 3 is pushed onto the high-quality manifold.
Group inversion fuses low-quality inputs → high-quality output.
| Method | Denoise PSNR↑ | Deblur LPIPS↓ | 4× SR LPIPS↓ | Time |
|---|---|---|---|---|
| BlindDPS | — | 0.257 | 0.345 | 270s |
| GDP | — | 0.304 | 0.357 | 118s |
| BIRD | — | 0.225 | 0.306 | 234s |
| Piscart | 28.21 | 0.15 | 0.12 | 7.8s |
| Ours | 25.50 | 0.12 | 0.17 | +9.5s |
Best LPIPS on denoise/deblur. Training-free, 8.74s/image on A6000.
Per-frame Topaz → inconsistent. Group inversion (A+B) → consistent.
| Method | flow_L1↓ | flicker↓ | T-LPIPS↓ | CLIP_TSC↑ |
|---|---|---|---|---|
| Topaz baseline | 5.090 | 0.132 | 0.0215 | 0.9910 |
| Ours | 5.150 | 0.138 | 0.0179 | 0.9922 |
Better T-LPIPS and CLIP consistency. Any per-frame editor → video editor, training-free.
In rectified flow, geometry and semantics live in separable variables at a single noise level.
Smooth, identity-preserving transitions across poses and expressions.
\(\mathbf{s}(\boldsymbol{\Delta}, \mathbf{u}) = (z_{t_i}^{(y)} + \boldsymbol{\Delta}) - \delta\sigma \cdot \mathbf{u}\)
\(\boldsymbol{\Delta}\) = geometry | \(\mathbf{u}\) = semantics | \(\delta\sigma\) = step length
Flow-Optimizer: optimize \((\Delta, u)\) → match target
Flow-Interpolation: linear \(\Delta\) + SLERP \(\mathbf{u}\) → smooth morph
Both training-free on any frozen flow model.
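The decomposition and the interpolation rule can be sketched with NumPy; latent shapes and the frozen-model call are abstracted away, and the function names are illustrative:

```python
import numpy as np

def slerp(u0, u1, t):
    # Spherical interpolation between two unit-norm semantic directions
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if omega < 1e-8:                      # nearly parallel: fall back to lerp
        return (1 - t) * u0 + t * u1
    return (np.sin((1 - t) * omega) * u0 + np.sin(t * omega) * u1) / np.sin(omega)

def morph_latent(z_y, delta0, delta1, u0, u1, t, d_sigma):
    # s(Delta, u) = (z_y + Delta) - d_sigma * u, with linear interpolation
    # on geometry (Delta) and SLERP on semantics (u)
    delta = (1 - t) * delta0 + t * delta1
    return (z_y + delta) - d_sigma * slerp(u0, u1, t)
```

SLERP is a common choice for keeping the interpolated semantic direction unit-norm while the geometric offset moves linearly.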
vs RF-Inversion, DiffMorpher, SDEditInterp, FreeMorph — ours preserves geometry with smoother transitions
Smooth morphing across identities, expressions, and styles
Composite loss: blend identity + expression + age + style simultaneously by combining multiple target losses. Each target contributes a gradient toward a different attribute — the optimization finds a balanced point.
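One way to sketch the blend is a weighted sum of per-attribute losses; the triple format below is illustrative, not the paper's exact interface:

```python
def composite_loss(pred, objectives):
    # objectives: (weight, target, loss_fn) triples, one per attribute
    # (identity, expression, age, style); the gradient of the sum pulls
    # the optimization toward a balanced point among the attributes
    return sum(w * loss_fn(pred, t) for (w, t, loss_fn) in objectives)
```

Scaling the weights trades the attributes off against each other.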
Combines Oscillation Inversion (AAAI) + FlowMorph (WACV) into a 3D pipeline.
Single reference image → 4K UV texture on MetaHuman mesh. ~5 min on A6000.
Reference
Multi-view renders (Peking Opera makeup)
Harley Quinn style transfer
Optimized 4K diffuse (kd) texture map — directly usable in UE5
Geisha style: reference → multi-view 3D
Works only because UE5 MetaHuman provides the geometric scaffold — the flow model handles appearance, the engine handles structure.
Flow models produce stunning visual content — but cannot generate or maintain 3D geometry on their own. Straight-SDS works only because MetaHuman provides the geometric scaffold.
This raises the question: can neural models generate 3D geometry end-to-end? → Part II
Can diffusion models generate production-quality 3D meshes end-to-end?
Voxel DDPM → volumetric division → neural surface deformation
Ablation: red = flipped faces. Even full model has artifacts.
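The flipped-face diagnostic can be approximated with a simple orientation check (an illustrative heuristic valid for roughly star-shaped meshes, not necessarily the paper's exact metric):

```python
import numpy as np

def flipped_faces(verts, faces):
    # Flag a triangle as "flipped" if its winding-order normal points toward
    # the mesh centroid instead of away from it
    c = verts.mean(axis=0)
    bad = []
    for i, (a, b, d) in enumerate(faces):
        n = np.cross(verts[b] - verts[a], verts[d] - verts[a])
        face_center = (verts[a] + verts[b] + verts[d]) / 3.0
        if np.dot(n, face_center - c) < 0:
            bad.append(i)
    return bad
```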
Lesson: Neural mesh generation remains fragile — production-quality 3D geometry is better provided by engines than generated by networks. This motivates the engine-based approach.
Parts I & II show: flow models excel at appearance but cannot generate or maintain 3D geometry on their own, and end-to-end neural mesh generation remains fragile.
Conclusion: Graphics engines should provide the geometry. But can AI agents use them? Can they reason spatially through code? → Part III
VoxelCodeBench · VeriWorld Bench
A growing body of work uses executable code as the representation for 3D content, replacing raw mesh/voxel outputs with programs that generate geometry.
| Work | Input | Output | Engine | Key Idea |
|---|---|---|---|---|
| MeshCoder (NeurIPS'25) | Point cloud | Blender Python scripts | Blender | Part-decomposed, quad-dominant mesh via code. 41 categories, 86.75% IoU. |
| Code2Worlds (ICML'26) | Text | Simulation code (4D) | Blender | Text → physics-aware 4D scenes. Dual-stream generation + VLM critic for dynamic fidelity. |
| VoxelCodeBench (Ours, ICML'26 review) | Text | Python scripts | Unreal Engine 5 | Benchmark: evaluate code generation for 3D. 220 tasks, 8 models, automated visual reward. |
Our position: MeshCoder and Code2Worlds generate code for 3D content in Blender. VoxelCodeBench evaluates code generation in UE5 with deterministic metrics — complementary to generation-focused work.
Can LLMs build 3D worlds through code in Unreal Engine?
220 tasks across 3 complexity axes:
Open-source platform: VoxelCode renders LLM-generated Python in UE5 with Voxel Plugin 2.0
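One benchmark round reduces to a generate → render → score loop; `generate`, `render_in_ue5`, and `visual_reward` below are hypothetical stand-ins for the model call, the VoxelCode renderer, and the automated scorer:

```python
def evaluate(tasks, generate, render_in_ue5, visual_reward):
    # prompt -> LLM-written Python script -> UE5 render -> automated score
    results = {}
    for task_id, prompt in tasks.items():
        script = generate(prompt)
        image = render_in_ue5(script)
        results[task_id] = visual_reward(image, prompt)
    return results
```

Deterministic rendering is what makes the scores reproducible across runs.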
Representative outputs: characters, shapes, animals, vehicles, architecture
| Model | Shape % | Quality /10 |
|---|---|---|
| GPT-5 | 87.9 | 5.71 |
| GPT-5 Mini | 80.4 | 4.86 |
| Claude Sonnet 4.5 | 80.4 | 5.01 |
| GPT-5 Chat | 69.7 | 3.66 |
| Claude Opus 4 | 69.4 | 4.13 |
| Claude 3.5 Sonnet | 66.9 | 3.30 |
| Claude 3 Opus | 45.2 | 3.40 |
| Gemini Pro | 19.5 | 1.36 |
| Model | Symbolic | Geometric | Artistic |
|---|---|---|---|
| GPT-5 | 87.5 | 66.7 | 97.5 |
| Claude S. 4.5 | 90.3 | 52.8 | 89.5 |
Geometric construction is the bottleneck: 21pp drop from symbolic → geometric
Code-based generation produces objects with coherent internal geometry (ladders, cabin interiors, floor layouts) — impossible with surface-only neural 3D methods
Open-sourced: github.com/facebookresearch/voxelcodebench
Platform + benchmark + evaluation tools
Work done at Meta (FAIR)
11 custom UE5 plugins, 75K+ lines of C++, 45K+ lines of Python runtime, built over 2 years.
| Plugin | Lines | What it does |
|---|---|---|
| UELivePy | 45,677 | Embeds CPython 3.11 inside the game runtime. WebSocket hot-injection, per-frame Tick callbacks, dynamic reflection of all BlueprintCallable functions. |
| SlangCudaPlugin | 30,472 | Integrates Slang shader compiler + CUDA compute into UE5. Agents write GPU shaders at runtime — compiled, executed, and hot-reloaded without restart. |
| MotionHelper | 18,061 | Exposes animation, IK solving, and motion matching to Python. Enables AI-driven character behavior. |
| MovieHelper | 15,480 | Runtime MovieRenderQueue + LevelSequence control. Agents can record videos, set up cinematic cameras programmatically. |
| RuntimeCore | 8,997 | Low-level C++ runtime bridge: tick scheduling, memory management, inter-plugin communication. |
| NiagaraHelper | 5,975 | Particle system control — spawn, configure, animate Niagara effects from Python. |
| VoxelHelper | 5,963 | Terrain manipulation: heightmaps, material weights, stamps — integrates VoxelPlugin 2.0. |
| ChaosHelper | 2,990 | Physics destruction: fracture meshes, apply forces, trigger Chaos physics events. |
| ClothHelper | 2,420 | Cloth simulation control: wind, constraints, material properties at runtime. |
Open-sourced for the research community. The infrastructure enables reproducible, large-scale spatial reasoning evaluation at low cost.
Existing engine Python scripting runs only in the editor. We embed a full runtime inside the game process.
This is how LLM agents control the engine: write code → inject via WebSocket → execute inside running game → observe result → iterate.
ws.send(json.dumps({
    "jsonrpc": "2.0",
    "method": "python_exec",
    "params": {"code": """
import unreal_runtime as ur
actor = ur.Engine.GameplayStatics.GetPlayerCharacter(None, 0)
"""}
}))
def spotlight_follow(dt, elapsed, actors, p):
    # Per-frame Tick callback: keep the spotlight 500 units above the character
    char, light = actors
    pos = char.GetActorLocation()
    pos.Z += 500
    light.SetActorLocation(pos)
    return elapsed < p["duration"]  # returning False ends the callback
AI character behavior · World generation from prompt
"Build a dark misty forest at dusk" → agent composes 8+ skill folders
30+ API calls composed. No predefined tool set could anticipate this combination.
Agent–Engine bridge: LLM → WebSocket → Python + CUDA/Slang → UE5
import unreal_runtime as ur
import inspect
# Discover all engine modules
dir(ur.Engine)
# → ['Actor', 'GameplayStatics',
# 'KismetMathLibrary', ...]
# Discover methods on a class
dir(ur.Engine.GameplayStatics)
# → ['SpawnActor', 'GetPlayerController', ...]
# Read function signature
inspect.signature(
ur.Engine.GameplayStatics.SpawnActor
)
# → (ActorClass, SpawnTransform, ...)
MCP equivalent: manually write JSON schema for 10,000+ engine functions. Every update requires maintenance. Doesn't scale.
An open platform where anyone can define tasks, generate instances, and evaluate agents.
No UE expertise needed. Write a task spec → get a benchmark instance.
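A task spec might look like the following; every field name here is illustrative, not the platform's actual schema:

```python
task_spec = {
    "name": "BoxFold",               # task family
    "instance": {"seed": 7},         # parameters for instance generation
    "conditions": ["V", "S", "C"],   # information-exposure conditions to run
    "verifier": "closed_cube",       # deterministic pass/fail check
    "max_rounds": 8,                 # closed-loop interaction budget
}
```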
VeriWorld uses this platform to systematically evaluate VLM spatial reasoning:
Platform → Benchmark → Diagnosis.
The infrastructure enables the science.
Same maze task under controlled input conditions. Structured (raycast) passes; visual-only fails. This controlled comparison isolates perception as the bottleneck.
Lean 4 spec (proves solvability) → parametric instance generation → interactive UE5 environment → agent closed loop → deterministic pass/fail verifier.
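The solvability spec can be shaped in Lean 4 roughly as follows (a schematic sketch, not the actual VeriWorld formalization):

```lean
-- A task: initial state, transition function, goal predicate.
structure Task (σ α : Type) where
  init : σ
  step : σ → α → σ
  goal : σ → Prop

-- Solvability: some finite action sequence reaches the goal.
-- Proving this for each parametric instance certifies it before
-- the instance is handed to an agent.
def Solvable {σ α : Type} (t : Task σ α) : Prop :=
  ∃ actions : List α, t.goal (actions.foldl t.step t.init)
```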
Interactive 3D tasks with deterministic verification. Agent observes, acts, receives feedback in a closed loop.
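The closed loop itself is small; the env and agent interfaces below are hypothetical stand-ins for the actual harness:

```python
def closed_loop(agent_act, env_reset, env_step, max_rounds=8):
    # observe -> act -> deterministic verifier feedback, until pass or budget
    obs = env_reset()
    for round_idx in range(max_rounds):
        action = agent_act(obs)
        obs, passed = env_step(action)
        if passed:
            return True, round_idx + 1
    return False, max_rounds
```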
BoxFold — fold cube net into closed cube
BallSlide — deform surface so ball reaches target
Same task, same verifier, same environment — vary only the information exposed:
| Condition | Agent Receives | What It Tests |
|---|---|---|
| V (Visual) | Screenshots / video only | Can the model extract spatial structure from pixels? |
| S (Structured) | Coordinates, geometry, physics params | Can the model reason given ground truth? |
| C (Combined) | V + S (all information) | Does more information help? |
| Csel (Selective) | V + selected structured info | Which specific information bridges the gap? |
The S−V gap measures perception difficulty. C vs Csel reveals that selective exposure matters more than amount.
Across task families (BoxFold, MazeNavFPS, DropToTarget), models achieve 0.75–0.91 pass rate under S but only 0.05–0.15 under V.
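Plugging the reported ranges into simple arithmetic bounds the S−V gap:

```python
# Reported pass-rate ranges across task families
s_lo, s_hi = 0.75, 0.91   # Structured (S) condition
v_lo, v_hi = 0.05, 0.15   # Visual-only (V) condition

gap_min = s_lo - v_hi     # weakest S family vs strongest V family
gap_max = s_hi - v_lo
# the S-V gap is at least ~0.60 pass rate in every family
```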
Takeaway: The bottleneck is not reasoning — models can solve tasks when given structure. The bottleneck is extracting that structure from visual input.
What to expose matters more than how much to expose.
Overhead camera — cannot determine fold direction
Side camera — oscillates between +90° and -90°
Given position data → computes fold signs algebraically → completes cube
Same model, same task. S pass + V fail → perception is the bottleneck.
Horizontal: visual extraction difficulty. Vertical: symbolic reasoning difficulty. No single modality dominates — the selection of what to expose matters more than the amount.
| Model | VP Batch | VP Single | VRP Batch | VRP Single | N |
|---|---|---|---|---|---|
| Gemini | 92%/9.3 | 10%/24.0 | 100%/15.9 | 90%/16.2 | 42 |
| Opus | 77%/17.6 | 0%/34.0 | 80%/17.9 | 20%/24.8 | 43 |
| GPT-4.1 | 31%/27.4 | 0%/34.0 | 30%/25.2 | 10%/32.5 | 43 |
| Model | VP Active | VP Batch | VP Single | VRP Active | VRP Batch | VRP Single | N |
|---|---|---|---|---|---|---|---|
| Gemini | 75%/20 | 40%/22 | 0%/30 | 75%/23 | 33%/12 | 0%/30 | 40 |
| Opus | 12%/14 | 25%/26 | 38%/25 | 12%/15 | 33%/26 | 25%/27 | 43 |
| GPT-4.1 | 25%/29 | 0%/30 | 0%/30 | 62%/23 | 0%/30 | 38%/27 | 42 |
R0: MISS (wrong timing)
R3: closer (adjusted pitch)
R6: HIT ✓
R1: FAIL — slope too steep, overshot
R2: PASS ✓ — reduced angle, lands in circle
Observe: bumpy Gaussian terrain
R1: wrong angle
R3: triangulated target, terrain deflects
Spatial reasoning in VLMs is not a single capability — it is a combination of perception, structure extraction, reasoning, and harness-dependent execution. VeriWorld makes each component independently testable.
| Part | Contribution | Venue |
|---|---|---|
| Flow Generation | Oscillation dynamics, geometry-semantics decomposition, MetaHuman texture synthesis | AAAI, WACV |
| Neural 3D Geometry | Volumetric mesh generation — reveals limits of neural 3D | Meta |
| Verifiable SSI | VoxelCodeBench + VeriWorld: diagnostic benchmark for spatial intelligence | ICML |
Graphics engines provide the causal scaffolding that neural generation needs, and the verifiable environment that spatial intelligence evaluation requires.
Questions?
Yan Zheng · yan.zheng@utexas.edu