
ASTRAL Research Preprint

A research preprint on ASTRAL, an architecture for code generation that frames program synthesis as energy minimization over a constraint-shaped landscape.

#ai #code-generation #energy-based-models #diffusion #programming-languages #research

This is a research preprint. For an accessible overview, see Energy, Structure, and Shaped Generation. The full technical document is presented below unabridged.

ASTRAL: Constraint-Shaped Code Synthesis via AST-Anchored Discrete Flow Matching with Energy-Based Guidance

Rand Arete | Ananke | Denver, CO

February 2026

Abstract. We present ASTRAL (AST-aware Refinement And Learning), an architecture for code generation that frames program synthesis as energy minimization over a constraint-shaped landscape, rather than probabilistic text completion. ASTRAL integrates four complementary mechanisms: (1) AST-anchored discrete flow matching that generates code through iterative refinement respecting syntactic hierarchy, (2) state space model conditioning for $O(n)$ repository-scale context encoding, (3) a three-tier constraint system separating hard admissibility pruning, learned energy guidance, and exact verification, and (4) adaptive verification with localized repair. We introduce the principle of shaped generation: constraints should shape what is generated, not filter what was generated, and formalize it through a correspondence between partially-denoised programs in discrete flow matching and typed holes in programming language theory. Each masked position carries type, scope, and constraint information that propagates bidirectionally during denoising, mirroring the progressive refinement of programs with holes. We establish this correspondence formally and show that denoising trajectories under energy-guided flow matching correspond to descent on a composite constraint energy landscape whose terms decompose over independent constraint domains. ASTRAL draws on recent advances in discrete flow matching theory [4, 31, 33], AST-anchored denoising [18], CFG-constrained decoding for diffusion LLMs [16, 17], variable-length infilling [19, 20], provable self-correction via remasking [29], and controlled evidence that diffusion training can surpass autoregressive models on code tasks [21].
We situate ASTRAL within a broader convergence across four independent research threads (constrained AR decoding, discrete diffusion for code, energy-based reasoning in continuous space, and formal verification integration), all arriving at the same structural insight: code generation is constrained search, and the generation paradigm is the search strategy. We provide detailed architectural specifications, formal analysis of constraint propagation, complexity bounds, and an evaluation plan targeting repository-scale editing with verifiable constraints.

1 Introduction

1.1 Shaped Generation

The dominant paradigm for neural code generation, autoregressive (AR) prediction of the next token given all preceding tokens, treats programs as strings. This is a representational mismatch: programs are not strings. They are structured objects inhabiting a space defined by syntactic rules, type systems, semantic constraints, and cross-module dependencies. The string representation is a serialization of this richer structure.

This mismatch manifests as a fundamental commitment problem. AR models must commit to function signatures before knowing what the body requires, choose variable names before seeing their usage patterns, and declare types before understanding downstream constraints. Code has bidirectional dependencies. The implementation informs the signature as much as the signature constrains the implementation. But AR generation forces a unidirectional flow.

The alternative: shaped generation. Rather than generating unconstrained token sequences and filtering for validity post-hoc, constraints should shape the generation process itself. Valid programs are energy minima in a constraint-shaped landscape. Generation descends this landscape: each refinement step moves toward lower energy, where energy means "satisfies more constraints more completely."

This principle is paradigm-independent. It applies to AR models through token masking (as in constrained decoding [10, 16]), to diffusion models through energy-guided denoising trajectories, and to continuous latent-space models through gradient-based optimization. The generation paradigm determines the search strategy; the constraint architecture determines what "valid" means.

In this work, we instantiate shaped generation through discrete flow matching, a non-autoregressive framework where generation proceeds by iterative refinement of all positions simultaneously, naturally supporting the bidirectional information flow that code requires.

1.2 The Typed Hole Correspondence

The connection between shaped generation and flow matching is not metaphorical. It is structural.

In programming language theory, a typed hole is a placeholder in a program that carries a type, a scope environment, and constraints. A program with holes is still well-defined: it can be typechecked, reasoned about, and in systems like Hazel [30], even executed. Filling a hole is a local operation with global consequences: the fill is validated against the hole's captured environment, and information propagates to constrain neighboring holes.

A partially-denoised program in discrete flow matching is a program with typed holes. Every masked position is a hole. Every revealed position is a fill. The constraints that typed holes carry (expected type, scope environment, cross-domain requirements) are exactly the constraints that should guide which tokens are unmasked next and what values they take.

The distinction from AR: in an autoregressive model, holes can only be filled left-to-right. You cannot skip ahead to fill a return type that would constrain the function body, then come back. The fill order is locked to the token order, regardless of the constraint structure. In flow matching, the fill order follows the constraint structure. Hierarchical corruption ensures that high-level holes (signatures, type annotations) fill first, and their constraints propagate downward to implementation-level holes. This is exactly the progressive refinement that typed hole systems are designed for.

This maps directly to bidirectional typing. When a type annotation position is unmasked (synthesis, $\uparrow$), that type information flows down to constrain the function body positions that are still holes (analysis, $\downarrow$). When an implementation detail is filled, its synthesized type flows up to validate against the already-filled signature. The bidirectional information flow that typed holes enable theoretically is what diffusion models do natively.
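The correspondence can be made concrete in a few lines. The sketch below is illustrative only: `Hole`, `holes_of`, and `fill` are our names (not from the paper), and the type narrowing is a crude stand-in for a real checker. It treats masked positions as typed holes whose scope environments update as neighboring positions are filled, in any order.

```python
from dataclasses import dataclass, field
from typing import Optional

MASK = "[M]"

@dataclass
class Hole:
    position: int
    expected_type: Optional[str] = None        # sigma: narrowed as context fills in
    scope: dict = field(default_factory=dict)  # Gamma: bindings visible here

def holes_of(tokens):
    """Induce the holey program H(x_t): one typed hole per masked position."""
    return [Hole(i) for i, tok in enumerate(tokens) if tok == MASK]

def fill(tokens, holes, position, value, inferred_type=None):
    """Fill one hole; propagate the new binding into the remaining holes'
    scope environments (a crude stand-in for narrow())."""
    tokens = list(tokens)
    tokens[position] = value
    remaining = []
    for h in holes:
        if h.position == position:
            continue
        h.scope[position] = value
        if inferred_type is not None and h.expected_type is None:
            h.expected_type = inferred_type
        remaining.append(h)
    return tokens, remaining
```

Filling the return-type hole of `def f(x) -> [M]:` propagates that type into the scope of every remaining body hole, regardless of token order, which is the bidirectional flow described above.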

1.3 Contributions

  1. Shaped generation as a design principle: We formalize the principle that constraints should shape generation trajectories, not filter outputs, and show it unifies AR constrained decoding, energy-guided diffusion, and continuous-space optimization under a common framework (§2).

  2. The typed hole correspondence: We establish a formal correspondence between partially-denoised programs in discrete flow matching and programs with typed holes in PL theory, providing a principled basis for constraint propagation during generation (§3).

  3. Three-tier constraint compilation: We introduce a constraint compiler that separates hard admissibility (CFG intersection, type masks), soft guidance (learned energy functions), and exact verification (compilers, tests), each serving a categorically different role in shaping generation (§4).

  4. Learned guidance with uncertainty gating: We replace the computationally prohibitive per-token candidate evaluation of prior energy-guided approaches with a learned guidance head that predicts logit adjustments directly, gated by calibrated uncertainty estimates (§5).

  5. Integration of recent work: We incorporate AST-anchored denoising [18], CFG-constrained decoding for diffusion LLMs [16, 17], variable-length infilling [19, 20], provable self-correction via remasking [29], and evidence that diffusion surpasses AR under controlled conditions [21] (§6).

1.4 Evidence for Non-Autoregressive Code Generation

Recent results support non-autoregressive architectures as the primary generation engine for code editing and constrained generation:

DiffuCoder [5] demonstrated that a 7B masked diffusion model trained on 130B code tokens achieves competitive performance with AR models, with global planning properties and a diffusion-native RL scheme (coupled-GRPO). Stable-DiffCoder [21] (January 2026) provided the first controlled evidence that under identical architecture and data, block diffusion surpasses its AR counterpart across a broad suite of code benchmarks. AnCoder [18] (February 2026) introduced AST-anchored denoising that prioritizes syntactically salient tokens, improving functional correctness. DICE [27] demonstrated that domain-specific diffusion models (for CUDA kernels) can outperform both AR and generic diffusion models. CFG-constrained decoding [16] achieved near-perfect syntactic correctness for diffusion LLMs with manageable overhead. LAVE [17] (February 2026) improved this to reliable constraints with negligible overhead. DreamOn [19] and CAL [20] solved variable-length infilling for diffusion models. PRISM [29] introduced provable per-token self-correction for masked diffusion via learned quality scores, directly supporting our localized repair strategy.

1.5 The Convergence of Paradigms

Four research threads, developed independently by different teams from different intellectual traditions, are converging on the same structural insight:

Constrained decoding for AR models. Synchromesh [10], llguidance, Outlines, and Ananke enforce constraints through token masks at each generation step. This is shaped generation within the AR paradigm: constraints prune the vocabulary before each prediction. The approach works now and ships value now, but is limited by the left-to-right commitment ordering. Constraints can only look backward.

Discrete diffusion and flow matching for code. DiffuCoder [5], LLaDA [8], Mercury, Stable-DiffCoder [21], and ASTRAL operate in discrete token space with parallel refinement and bidirectional conditioning. Constraints shape the denoising trajectory rather than filtering per-token predictions. The generation order follows the constraint structure, not the serialization order.

Energy-based reasoning in continuous space. Kona [35] (Logical Intelligence) reasons in a continuous latent space, using approximate gradients to iteratively edit complete solutions toward constraint satisfaction. The analogy Logical Intelligence draws is seeing a maze from above instead of wandering through it: all constraints are evaluated simultaneously across the entire output. Kona achieves a 96.2% solve rate on hard Sudoku versus 2% for frontier LLMs, not because it is "smarter" but because simultaneous constraint satisfaction is architecturally native rather than emergent.

Formal verification integration. Aleph (Logical Intelligence), Lean-based proof search, and AlphaProof provide ground-truth correctness signals. Deterministic verification of probabilistic generation creates the feedback loop that makes constraint satisfaction reliable rather than approximate. Aleph is an LLM-based orchestration agent for formal verification, distinct from Kona's EBM reasoning core; it reached 99.4% (668/672) on PutnamBench as of January 2026.

These are different surfaces of the same thesis: code generation is constrained search; the generation paradigm is the search strategy. The typed hole is the unifying abstraction. In AR + masking, each step fills one hole left-to-right. In diffusion, all positions start as holes and fill in parallel. In continuous EBMs, the entire output is a single high-dimensional hole being iteratively refined. The paradigm determines the fill strategy; the constraint architecture determines what "valid fill" means.

| Paradigm | Search Strategy | Constraint Mechanism | Fill Order |
| --- | --- | --- | --- |
| AR + masking | Greedy/beam, per-step filtering | Token masks from CFG/types | Left-to-right (fixed) |
| Diffusion/flow | Parallel iterative refinement | Energy guidance on trajectory | Constraint-driven (learned) |
| EBM (continuous) | Gradient optimization | Energy function minimization | Simultaneous (global) |
| Hybrid (ASTRAL) | Structured refinement | Three-tier compilation | AST-hierarchical |

Energy Matching [23] provides a theoretical umbrella: flow matching and EBMs are unified through a single time-independent scalar potential whose gradient field governs generative dynamics. ASTRAL instantiates this in the discrete setting for code. Kona instantiates it in continuous space for general constraint satisfaction. The architectural details differ; the principle (valid outputs are energy minima in a constraint-shaped landscape) is invariant across paradigms.

2 Theoretical Foundations

2.1 Discrete Flow Matching over Code

We adopt the discrete flow matching (DFM) framework of Gat et al. [4], extended by the kinetic-optimal perspective of Holderrieth et al. [33] and the theoretical convergence analysis of Su et al. [34].

State space. Let $\mathcal{V}$ be a vocabulary of size $|\mathcal{V}| = V$, augmented with a mask token $\texttt{[M]}$, so $\bar{\mathcal{V}} = \mathcal{V} \cup \{\texttt{[M]}\}$. A code sequence of length $L$ lives in $\bar{\mathcal{V}}^L$.

Probability paths. A probability path $p_t(x)$ for $t \in [0, 1]$ interpolates between the data distribution $p_0 = p_{\text{data}}$ and a fully masked source $p_1 = \delta_{\texttt{[M]}^L}$. Generation proceeds from $t = 1$ (fully masked) toward $t = 0$ (clean data).

Probability velocity. The continuous-time Markov chain (CTMC) is characterized by a per-position velocity field $u_t^i(y \mid x)$ specifying the rate at which position $i$ transitions from state $x^i$ to state $y$ at time $t$. Following the single-variable-change restriction [2, 4], only one position changes per infinitesimal step. The velocity generates $p_t$ via the Kolmogorov forward (continuity) equation:

$$\frac{d}{dt}p_t(x) = \sum_{i=1}^{L}\sum_{y \neq x^i} \Big[ u_t^i(x^i \mid x \oplus_i y)\, p_t(x \oplus_i y) - u_t^i(y \mid x)\, p_t(x) \Big]$$

where $x \oplus_i y$ denotes $x$ with position $i$ replaced by $y$. The velocity must satisfy the rate conditions: $u_t^i(y \mid x) \geq 0$ for $y \neq x^i$ and $\sum_{y \neq x^i} u_t^i(y \mid x) \cdot h \leq 1$ for sufficiently small $h$.

Conditional flow matching. Direct optimization of the marginal velocity is intractable. Gat et al. [4] introduce conditional velocities $u_t^i(y \mid x; x_0)$ conditioned on target data $x_0$, and show that the marginal velocity can be recovered:

$$u_t^i(y \mid x) = \mathbb{E}_{x_0 \sim p_{0|t}(\cdot \mid x)}\Big[u_t^i(y \mid x; x_0)\Big]$$

This yields the conditional DFM training loss:

$$\mathcal{L}_{\text{CDFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim p_{\text{data}},\; x_t \sim p_{t|0}(\cdot \mid x_0)} \Big[-\sum_{i: x_t^i = \texttt{[M]}} \log p_\theta^i(x_0^i \mid x_t, t)\Big]$$

where $p_\theta^i$ is the denoiser's predicted distribution at position $i$. Crucially, this loss has the same gradient as the marginal DFM loss [4, Theorem 15].
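Concretely, the objective is cross-entropy restricted to masked positions. A toy transcription (our names; `pred[i]` is an explicit per-position distribution standing in for $p_\theta^i(\cdot \mid x_t, t)$):

```python
import math

MASK = "[M]"

def cdfm_loss(x0, xt, pred):
    """Masked cross-entropy: -sum over masked positions i of log pred[i][x0[i]].
    `pred[i]` maps candidate tokens to probabilities, standing in for the
    denoiser's p_theta^i(. | x_t, t)."""
    return -sum(math.log(pred[i][x0[i]])
                for i in range(len(x0)) if xt[i] == MASK)
```

Unmasked positions contribute nothing, matching the sum over $i: x_t^i = \texttt{[M]}$ in the objective.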

Masked construction. For the masked probability path with scheduler $\kappa_t$, each position $i$ is independently masked:

$$p_{t|0}(x_t^i \mid x_0^i) = \begin{cases} 1 - \kappa_t & \text{if } x_t^i = x_0^i \\ \kappa_t & \text{if } x_t^i = \texttt{[M]} \end{cases}$$

with $\kappa_0 = 0$ (clean data at $t = 0$) and $\kappa_1 = 1$ (fully masked at $t = 1$). The standard choice is $\kappa_t = t$.
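Under $\kappa_t = t$, sampling $x_t \sim p_{t|0}$ is a single independent Bernoulli mask per position. A minimal token-level sketch (seeded RNG for reproducibility; the function name is ours):

```python
import random

MASK = "[M]"

def corrupt(x0, t, rng=None):
    """Sample x_t ~ p_{t|0}(. | x_0) under the masked path with kappa_t = t:
    each position is independently replaced by [M] with probability t."""
    rng = rng or random.Random(0)
    return [MASK if rng.random() < t else tok for tok in x0]
```

At $t = 0$ the data is untouched; at $t = 1$ every position is a hole.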

Connection to any-order autoregressive modeling. A subtlety identified by Hoogeboom et al. and Zheng et al.: under the masked construction, discrete flow matching is equivalent to learning any-order autoregressive models. The model learns conditional distributions $p(x^i \mid x_{\text{unmasked}})$ for every possible subset of observed positions. This equivalence is not a limitation; it is precisely the property that enables constraint-driven fill ordering. The model can unmask positions in any order, and ASTRAL exploits this by choosing an order that follows the constraint hierarchy rather than the token sequence.

Kinetic-optimal paths. Holderrieth et al. [33] show that the masked construction, while popular, does not fully exploit the design space of discrete probability paths. They derive kinetic-optimal velocities that minimize the total kinetic energy:

$$\text{KE}[u] = \int_0^1 \sum_{x}\sum_{i}\sum_{y \neq x^i} \frac{[u_t^i(y \mid x)]^2}{p_t(x)} \, dt$$

subject to generating a given probability path. The kinetic-optimal velocity admits a closed form and can work with arbitrary probability paths beyond masking, including paths tailored to AST structure (§3.3). This theoretical result justifies our AST-anchored corruption schedule: we are not restricted to uniform masking.

Sampling. Given a trained velocity model, generation proceeds from $t = 1$ toward $t = 0$ by discretizing the CTMC into $N$ steps with step size $h = 1/N$:

$$x_{t-h}^i = \begin{cases} y & \text{with probability } u_\theta^i(y \mid x_t) \cdot h, \quad y \neq x_t^i \\ x_t^i & \text{with probability } 1 - \sum_{y \neq x_t^i} u_\theta^i(y \mid x_t) \cdot h \end{cases}$$

Corrector steps [4, §5] can improve sample quality by adding stochastic perturbation followed by denoising, analogous to Langevin corrections in continuous diffusion.
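One discretized sampler step can be sketched as follows (illustrative, not the paper's implementation: `rates[i]` maps candidate tokens to the model's velocities $u_\theta^i(y \mid x_t)$, and $h$ is assumed small enough that per-position jump probabilities sum to at most 1):

```python
import random

def euler_step(x, rates, h, rng=None):
    """One Euler step of the CTMC sampler: position i jumps to y != x[i]
    with probability u_i(y|x)*h, otherwise keeps its current token."""
    rng = rng or random.Random(0)
    new = list(x)
    for i, per_pos in rates.items():
        u, acc = rng.random(), 0.0
        for y, rate in per_pos.items():
            acc += rate * h     # partition [0, 1) by jump probabilities
            if u < acc:
                new[i] = y
                break
    return new
```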

Convergence guarantees. Su et al. [34] establish the first convergence bounds for DFM: if the velocity estimator achieves $\epsilon$-error, the generated distribution is $O(\epsilon \sqrt{L})$ away from the data distribution in total variation. This provides theoretical grounding for the quality of DFM-generated code as a function of model quality.

2.2 Energy-Based Constraint Guidance

Energy-based models (EBMs) define distributions through unnormalized energy functions. For code generation, we define a composite constraint energy landscape over a product of constraint domains.

Constraint domains. Let $\mathcal{D} = \{d_1, \ldots, d_K\}$ be the set of constraint domains: syntax ($d_{\text{syn}}$), types ($d_{\text{type}}$), imports ($d_{\text{imp}}$), control flow ($d_{\text{cf}}$), semantics ($d_{\text{sem}}$), and formal specifications ($d_{\text{spec}}$). Each domain $d$ has a violation function $C_d: \bar{\mathcal{V}}^L \to \mathbb{R}_{\geq 0}$, where $C_d(x) = 0$ iff $x$ satisfies all constraints in domain $d$.

Composite energy. The constraint energy decomposes additively over domains:

$$E(x) = \sum_{d \in \mathcal{D}} \lambda_d \, C_d(x)$$

with domain-specific weights $\lambda_d > 0$ controlling relative importance. This defines the constrained distribution:

$$p_{\text{constrained}}(x) \propto p_{\text{base}}(x) \cdot \exp\big(-\beta \, E(x)\big)$$

Valid programs occupy the zero-energy basin: $E(x) = 0 \iff C_d(x) = 0 \;\forall\, d \in \mathcal{D}$.
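The decomposition transcribes directly. The sketch below uses two toy violation functions of our own invention (unbalanced parentheses as a "syntax" domain, a vacuously satisfied "types" domain); real $C_d$ would be far richer:

```python
def composite_energy(x, violations, weights):
    """E(x) = sum_d lambda_d * C_d(x); zero iff every domain is satisfied."""
    return sum(weights[d] * check(x) for d, check in violations.items())

# Toy domains for illustration only.
violations = {
    "syntax": lambda x: abs(x.count("(") - x.count(")")),  # unbalanced parens
    "types":  lambda x: 0.0,                               # vacuously satisfied
}
weights = {"syntax": 2.0, "types": 1.0}
```

`composite_energy("f(x)", violations, weights)` sits in the zero-energy basin; dropping the closing parenthesis raises the energy.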

Constraint algebra. The constraint domains are not independent. They form a partially ordered set with information-flow morphisms between them. Formally, define the constraint category $\mathbf{Con}$ where:

  • Objects are constraint domains $d \in \mathcal{D}$
  • Morphisms $\phi_{d \to d'}: d \to d'$ are propagation functions mapping constraint updates in domain $d$ to induced updates in domain $d'$

Key morphisms in the code constraint category:

$$\phi_{\text{type} \to \text{imp}}: \text{newly revealed type} \mapsto \text{required import}$$
$$\phi_{\text{cf} \to \text{type}}: \text{unreachable branch} \mapsto \text{relaxed type constraint}$$
$$\phi_{\text{sem} \to \text{type}}: \text{test constraint} \mapsto \text{narrowed expected type}$$
$$\phi_{\text{syn} \to *}: \text{syntactic structure} \mapsto \text{constraints in all domains}$$

Connection to flow matching. Energy Matching [23] (NeurIPS 2025) provides the theoretical bridge between flow matching and EBMs. The key insight: both can be unified through a single time-independent scalar potential $\Phi: \mathbb{R}^d \to \mathbb{R}$ whose negative gradient defines the velocity field:

$$v(x) = -\nabla_x \Phi(x)$$

Far from the data manifold, samples move along curl-free, optimal transport paths from noise to data. Near the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium:

$$p_{\text{eq}}(x) \propto \exp\big(-\beta\,\Phi(x)\big)$$

In continuous space, Energy Matching learns $\Phi$ via:

$$\mathcal{L}_{\text{EM}}(\Phi) = \mathbb{E}_{t, x_t}\Big[\big\|\nabla_x \Phi(x_t) - v_t^*(x_t)\big\|^2\Big]$$

where $v_t^*$ is the optimal transport velocity. The potential $\Phi$ simultaneously serves as generator (via gradient flow) and prior (via Boltzmann distribution).

Discrete analog. ASTRAL instantiates this in the discrete setting. We cannot take gradients of discrete sequences, so we modify transition rates rather than following a gradient field. For a surrogate energy $E_\phi$ and time-dependent guidance strength $\alpha_t$:

$$\tilde{u}_t^i(y \mid x) = u_t^i(y \mid x) \cdot \exp\Big(-\alpha_t \big[E_\phi(x \oplus_i y) - E_\phi(x)\big]\Big)$$

This Boltzmann-weighted rate modification is the discrete analog of Langevin dynamics: transitions that decrease energy are exponentially favored. The modification preserves the rate conditions when $\alpha_t$ is bounded.

Proposition 2 (Rate preservation). If $u_t^i(y \mid x)$ satisfies the rate conditions and $\alpha_t \leq \bar{\alpha}$ for some constant $\bar{\alpha}$, then $\tilde{u}_t^i(y \mid x)$ satisfies the rate conditions with a modified normalization. Specifically:

$$\sum_{y \neq x^i} \tilde{u}_t^i(y \mid x) \cdot h \leq \exp(\alpha_t \cdot \Delta E_{\max}) \sum_{y \neq x^i} u_t^i(y \mid x) \cdot h$$

where $\Delta E_{\max} = \max_{i, y, x} \big|E_\phi(x \oplus_i y) - E_\phi(x)\big|$. The modified rates define a valid CTMC for sufficiently small $h$.
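The rate modification is one multiplication per candidate token. A sketch with a toy surrogate energy (in ASTRAL, $E_\phi$ would be learned; names are ours):

```python
import math

def guide_rates(rates, energy, x, i, alpha):
    """Boltzmann-weighted guidance: u~(y|x) = u(y|x) * exp(-alpha * dE), where
    dE = E(x with position i set to y) - E(x). Energy-decreasing transitions
    are exponentially favored."""
    base = energy(x)
    return {y: u * math.exp(-alpha * (energy(x[:i] + [y] + x[i + 1:]) - base))
            for y, u in rates.items()}
```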

Guidance schedule. Since generation proceeds from $t = 1$ (fully masked) toward $t = 0$ (clean data), guidance strength should increase as $t$ decreases toward zero, applying stronger constraint pressure as the program takes shape. Motivated by the Energy Matching observation that energy terms become dominant near the data manifold:

$$\alpha_t = \alpha_{\max} \cdot \sigma(k(t_0 - t))$$

where $\sigma$ is the sigmoid function, $t_0$ controls the inflection point, and $k$ controls the sharpness. Early in generation ($t \gg t_0$, near the noise end) the sigmoid is near zero, allowing the base model to establish global structure with minimal guidance. Late in generation ($t \ll t_0$, near the data end) the sigmoid saturates, applying strong guidance to ensure constraint satisfaction.
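The schedule is a one-line function; the defaults below are illustrative, not values from the paper:

```python
import math

def guidance_strength(t, alpha_max=2.0, t0=0.5, k=10.0):
    """alpha_t = alpha_max * sigmoid(k * (t0 - t)): weak near the noise end
    (t near 1), strong near the data end (t near 0)."""
    return alpha_max / (1.0 + math.exp(-k * (t0 - t)))
```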

Connection to COLD Decoding. COLD Decoding [9] demonstrated energy-based constrained text generation via Langevin dynamics in a continuous relaxation of the token space. ASTRAL's guidance operator can be viewed as a discrete-native alternative that avoids the continuous relaxation, operating directly on token transition rates. The energy functions serve the same role (shaping generation toward constraint satisfaction), but the mechanism respects the discrete structure of code.

Connection to Kona. Logical Intelligence's Kona [35] operates entirely in continuous latent space, using Langevin calibration to refine complete solutions against hard constraints. ASTRAL operates in discrete token space with structured transition rates. Both implement the same principle: reasoning as energy minimization, with constraints defining the landscape. Kona's continuous space enables true gradient descent; ASTRAL's discrete space enables direct token manipulation with exact verification. The approaches are complementary: Kona excels at problems requiring simultaneous global constraint satisfaction (combinatorial puzzles, grid optimization), while ASTRAL targets the structure-rich, tool-verifiable domain of code.

2.3 The Typed Hole Formalism

We formalize the correspondence between denoising states and typed holes.

Definition 1 (Typed Hole). A typed hole $\square_\sigma^\Gamma$ is a placeholder in a program carrying:

  • A type $\sigma$ constraining what can fill this position
  • A scope environment $\Gamma$ capturing the variables, types, and imports visible at this position
  • A set of cross-domain constraints $\{c_1, \ldots, c_k\}$ from the constraint compiler

Definition 2 (Holey Program). A holey program $P[\square_1, \ldots, \square_n]$ is a program with $n$ typed holes. It is well-scoped if each hole's environment $\Gamma_i$ is consistent with the surrounding program structure.

Proposition 1 (Denoising–Hole Correspondence). Let $x_t$ be a partially-denoised sequence at time $t$ in a discrete flow matching process. Define the induced holey program $H(x_t)$ by:

  • For each position $i$ where $x_t^i \neq \texttt{[M]}$: the token $x_t^i$ is a fill
  • For each position $i$ where $x_t^i = \texttt{[M]}$: position $i$ is a typed hole $\square_{\sigma_i}^{\Gamma_i}$, where $\sigma_i$ and $\Gamma_i$ are inferred from the surrounding revealed tokens and the AST structure

Then the denoising trajectory $x_1 \to x_0$ corresponds to a progressive refinement of $H(x_t)$: each unmasking step fills a typed hole, and the filled value is constrained by the hole's type and environment.

Constraint propagation. When a position $i$ is unmasked to value $v$, the constraint environments of neighboring holes update:

$$\Gamma_j \leftarrow \Gamma_j \cup \{i \mapsto v\}, \quad \sigma_j \leftarrow \text{narrow}(\sigma_j, i \mapsto v)$$

for all holes $j$ in the scope of $i$. This propagation is exactly what the denoiser's bidirectional attention computes: conditioning on all revealed positions to predict masked ones.

Cross-domain morphisms. The constraint domains form a category where morphisms propagate information:

  • Types $\to$ Imports: a newly revealed type may require an import
  • ControlFlow $\to$ Types: unreachable branches relax type constraints
  • Semantics $\to$ Types: test constraints narrow expected types
  • Syntax $\to$ all: syntactic structure constrains all other domains

Each unmasking step triggers a cascade of cross-domain updates that narrow what remaining holes can be. This is precisely the energy gradient descent of §2.2.
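The cascade can be modeled as a fixpoint computation over morphism applications. A sketch with a single toy morphism (the function names and the `numpy` example are ours, for illustration):

```python
def propagate(seed, morphisms):
    """Close a constraint update under the cross-domain morphisms: an update
    (domain, fact) induces updates in downstream domains until a fixpoint."""
    seen, frontier = set(), [seed]
    while frontier:
        domain, fact = frontier.pop()
        if (domain, fact) in seen:
            continue
        seen.add((domain, fact))
        for (src, dst), phi in morphisms.items():
            if src == domain:
                induced = phi(fact)
                if induced is not None:
                    frontier.append((dst, induced))
    return seen

# Toy phi_{type -> imp}: revealing the type `np.ndarray` induces the
# import requirement `numpy`.
morphisms = {("types", "imports"):
             lambda fact: "numpy" if fact == "np.ndarray" else None}
```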

2.4 State Space Models for Context

For repository-scale context ($n_{\text{ctx}} \gg n_{\text{out}}$), quadratic attention is prohibitive. State space models (SSMs) process context in $O(n)$ time via the recurrence:

$$h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t^\top h_t$$

Mamba-2 [3] introduces selective parameters $(A_t, B_t, C_t)$ that vary with input, and reveals deep connections to attention through structured state space duality (SSD). Codestral Mamba [22] demonstrates production-quality code generation with 256k+ context using this architecture.

For ASTRAL, the SSM encodes all conditioning information (surrounding code, specifications, type signatures, cross-file dependencies) into a compressed representation. The linear complexity enables processing entire repositories as context, which is essential for understanding import relationships, type hierarchies, and project-wide conventions.
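The recurrence is a single linear scan. A scalar-state sketch for intuition only (real SSMs use matrix-valued $A_t$ and vector states; this illustrates just the $O(n)$ structure):

```python
def ssm_scan(A, B, C, xs):
    """h_t = A_t*h_{t-1} + B_t*x_t, y_t = C_t*h_t, with per-step (selective)
    parameters. One pass over the input: O(n) time, O(1) recurrent state."""
    h, ys = 0.0, []
    for a, b, c, x in zip(A, B, C, xs):
        h = a * h + b * x
        ys.append(c * h)
    return ys
```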

3 Architecture

3.1 Overview

ASTRAL comprises six modules, detailed in the following subsections.

3.2 SSM Context Encoder

The context encoder processes conditioning information into a compressed representation. We employ a Mamba-2 backbone with AST-aware positional encoding that captures tree depth, sibling position, scope nesting level, and syntactic node type.

```python
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Mamba-2 context encoder with AST-aware positions.

    ASTAwareEmbedding and Mamba2Block are ASTRAL-specific modules assumed
    to be defined elsewhere.
    """
    def __init__(self, d_model=1024, n_layers=24, d_state=128, d_context=1024):
        super().__init__()
        self.embedding = ASTAwareEmbedding(d_model)  # depth + sibling + scope + node type
        self.mamba_layers = nn.ModuleList([
            Mamba2Block(d_model, d_state, d_conv=4, expand=2)
            for _ in range(n_layers)
        ])
        self.projector = nn.Linear(d_model, d_context)

    def forward(self, tokens, ast_positions):
        x = self.embedding(tokens, ast_positions)
        for layer in self.mamba_layers:
            x = layer(x)  # O(n) per layer
        return self.projector(x)
```

Cross-file symbol resolution uses sparse attention masks constructed from tree-sitter: import edges (imports → definitions), call edges (calls → declarations), type edges (annotations → type definitions).

Context compression for sequences exceeding 32k tokens uses Perceiver-style [26] cross-attention to a fixed set of $K$ latent vectors, reducing denoiser cross-attention cost to $O(n_{\text{out}} \cdot K)$ regardless of context length.

3.3 AST-Anchored Flow Denoiser

The denoiser combines two complementary AST-aware strategies:

Corruption-time anchoring (TreeDiff [14]): Corrupt code respecting hierarchical structure. Entire AST subtrees are masked coherently, with corruption probability weighted by AST depth: high-level constructs (function signatures, class declarations) are corrupted last, recovered first.

Denoising-time anchoring (AnCoder/AnchorTree [18]): Prioritize unmasking syntactically salient tokens. Keywords (if, while, def, return) and identifiers are unmasked before surrounding detail, establishing structural scaffolds:

$$\text{priority}(i) = \begin{cases} 1.0 & \text{if } \text{role}(i) = \text{keyword} \\ 0.8 & \text{if } \text{role}(i) = \text{identifier} \\ 0.5 \cdot (1 - \text{depth}(i)/d_{\max}) & \text{otherwise} \end{cases}$$

This mirrors how developers write code: architecture before implementation, structure before detail.
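The anchoring priority transcribes directly into code (the function name is ours):

```python
def unmask_priority(role, depth, d_max):
    """AnCoder-style denoising priority: keywords first, then identifiers,
    then remaining tokens ordered shallow-before-deep in the AST."""
    if role == "keyword":
        return 1.0
    if role == "identifier":
        return 0.8
    return 0.5 * (1 - depth / d_max)
```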

Denoiser architecture: A bidirectional transformer with cross-attention to the SSM context:

$$\hat{x}_0 = f_\theta(x_t, t, \text{ctx}) = \text{TransformerDec}\big(\text{Embed}(x_t) + \text{TimeEmb}(t) + \text{ASTEmb}(x_t),\; \text{ctx}\big)$$

Training objective:

$$\mathcal{L} = \underbrace{\mathcal{L}_{\text{DFM}}}_{\text{flow matching}} + \lambda_{\text{ast}} \underbrace{\mathcal{L}_{\text{AST}}}_{\text{parse validity}} + \lambda_{\text{cal}} \underbrace{\mathcal{L}_{\text{cal}}}_{\text{uncertainty calibration}}$$

where $\mathcal{L}_{\text{AST}}$ penalizes syntactically invalid intermediates by counting tree-sitter ERROR nodes in the decoded prediction:

$$\mathcal{L}_{\text{AST}} = \mathbb{E}_{t, x_t}\big[\text{ErrorNodeCount}\big(\text{parse}(\text{decode}(\hat{x}_0))\big)\big]$$

Variable-length infilling follows DreamOn [19]: the model can emit [EXPAND] to grow and [DELETE] to shrink the output region, requiring no architectural changes. Alternatively, CAL's [20] training-free length estimation from first-step denoising confidence provides a simpler integration path.
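A hedged sketch of the control-token mechanics (the precise semantics in DreamOn may differ; here we assume [EXPAND] widens the infill region by one extra hole and [DELETE] removes its own position):

```python
MASK, EXPAND, DELETE = "[M]", "[EXPAND]", "[DELETE]"

def apply_length_edits(tokens):
    """Grow or shrink the infill region between denoising steps."""
    out = []
    for tok in tokens:
        if tok == EXPAND:
            out += [MASK, MASK]   # one hole becomes two
        elif tok == DELETE:
            continue              # the position disappears
        else:
            out.append(tok)
    return out
```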

3.4 Incremental Parsing Infrastructure

Tree-sitter provides error-tolerant parsing. It always produces a tree, marking unparseable regions with ERROR nodes rather than failing. This gives us:

  • AST-aware corruption/denoising: Extract structure for anchoring
  • Parse-repair loops: Localize errors, remask, regenerate
  • L_AST computation: Count ERROR nodes for the validity regularizer
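
The ERROR-node count behind L_AST reduces to a tree walk. Real code would traverse a tree_sitter.Tree; the (type, children) tuple representation here is an assumption so the walk is runnable without a compiled grammar:

```python
# Count tree-sitter-style ERROR nodes in a parse tree. A node is modeled as
# a (type, children) tuple for illustration; with the real tree-sitter
# bindings one would walk node.children and test node.type instead.
def count_error_nodes(node) -> int:
    node_type, children = node
    own = 1 if node_type == "ERROR" else 0
    return own + sum(count_error_nodes(c) for c in children)
```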

Performance requirements (validated by microbenchmarks in Phase 1):

| Operation | Target latency |
| --- | --- |
| Parse (10k bytes) | < 0.5 ms |
| Error localization | < 0.1 ms |
| Span-to-token mapping | < 0.05 ms |
| Full decode-parse-localize-remask cycle | < 1 ms |

Critical path components are implemented in Zig or Rust for zero FFI overhead with tree-sitter's C core.

4 Three-Tier Constraint Compilation

The constraint compiler is the architectural heart of shaped generation. It translates heterogeneous constraint sources into three coordinated outputs serving categorically different roles:

\texttt{compile}: \text{Specification} \to (\underbrace{\text{HardDecodeIR}}_{\text{Tier 1: prune}},\; \underbrace{\text{SoftEnergyIR}}_{\text{Tier 2: guide}},\; \underbrace{\text{ExactVerifierIR}}_{\text{Tier 3: verify}})

4.1 Tier 1: Hard Admissibility (HardDecodeIR)

Hard constraints prune. They define the admissible region by eliminating token choices that provably cannot lead to valid programs.

Grammar constraints. Following Mündler et al. [16], we reduce admissibility checking to deciding emptiness of the intersection between the target CFG and the regular language of valid completions of the current partial output. For a partial sequence x with infill regions R_1, …, R_k:

\text{admissible}(x, v, i) \iff \mathcal{L}(G) \cap \mathcal{L}(\text{completion}(x \oplus_i v)) \neq \emptyset

where G is the target grammar and completion(x ⊕_i v) is the regular language of all strings obtainable by filling remaining holes.

LAVE [17] improves reliability by exploiting diffusion models' parallel predictions for lookahead verification, checking that proposed tokens maintain extendability of the partial output, not merely local validity.

Type admissibility masks. Where type environments are available, restrict variables to type-compatible alternatives [25].

Overhead. CFG-constrained decoding incurs ~125% median overhead per Mündler et al. [16], growing with the number of infill regions. We make this configurable: enabled by default for verification-critical generation; the repair loop suffices for latency-sensitive applications.

4.2 Tier 2: Soft Energy Guidance (SoftEnergyIR)

Soft constraints guide. They rank trajectories within the admissible region without hard pruning.

The energy landscape decomposes over constraint domains with time-dependent scheduling:

E_{\text{soft}}(x, t) = \sum_{d \in \mathcal{D}} \alpha_d(t) \cdot E_d(x, \text{ctx})

where the schedule α_d(t) controls when each domain exerts influence. Since generation proceeds from t = 1 (noise) to t = 0 (data), domains relevant to global structure activate first (when t is large) and fine-grained semantic constraints activate later (when t is small):

| Domain | Early (t > 0.7) | Mid (0.3 < t < 0.7) | Late (t < 0.3) |
| --- | --- | --- | --- |
| Syntax | Low | Medium | High |
| Types | Low | High | High |
| Imports | Off | Medium | High |
| Control flow | Medium | Medium | Medium |
| Semantics | Off | Low | High |

This schedule reflects the typed hole correspondence: structural constraints matter first, semantic constraints matter as the program takes shape.
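
The schedule table can be read as a step function α_d(t). In the sketch below the breakpoints (0.7, 0.3) come from the table, while the numeric weights assigned to Off/Low/Medium/High are illustrative assumptions:

```python
# Step-function reading of the domain schedule alpha_d(t). The Off/Low/
# Medium/High -> numeric mapping is an assumption; only the breakpoints and
# the per-domain levels come from the table.
WEIGHT = {"off": 0.0, "low": 0.25, "medium": 0.5, "high": 1.0}

SCHEDULE = {  # domain -> (early, mid, late) level
    "syntax":       ("low", "medium", "high"),
    "types":        ("low", "high", "high"),
    "imports":      ("off", "medium", "high"),
    "control_flow": ("medium", "medium", "medium"),
    "semantics":    ("off", "low", "high"),
}

def alpha(domain: str, t: float) -> float:
    """Generation runs from t = 1 (noise) down to t = 0 (data)."""
    early, mid, late = SCHEDULE[domain]
    level = early if t > 0.7 else mid if t > 0.3 else late
    return WEIGHT[level]
```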

The energy terms are computed by the learned guidance head (§5), not by enumerating candidates.

4.3 Tier 3: Exact Verification (ExactVerifierIR)

Exact verifiers are the authority. They provide ground-truth correctness signals.

Each verifier is specified declaratively:

from dataclasses import dataclass
from typing import Callable

@dataclass
class ExactVerifierSpec:
    domain: str               # 'syntax', 'types', 'tests', 'formal'
    tool: str                 # 'tree-sitter', 'mypy', 'pytest', 'z3'
    trigger_policy: str       # 'adaptive', 'final-only', 'every-n'
    required: bool            # Must pass for acceptance
    error_localizer: Callable # Maps errors to token spans

Verifier outputs are structured: pass/fail by domain, localized error spans, error messages, and causal hints (where available).

Acceptance contract: a candidate is accepted only if all required exact verifiers pass. Degradation contract: if required verifiers are unavailable, the system returns an explicit degraded status; it never silently claims verification.

5 Learned Guidance Head with Uncertainty Gating

5.1 Motivation

The original ASTRAL design evaluated surrogate energy functions on top-k candidate tokens per position, an O(k · |masked|) inner loop that is computationally prohibitive for large vocabularies. We replace this with a learned guidance head that directly predicts logit adjustments from the denoiser's hidden states.

5.2 Architecture

The guidance head shares the denoiser backbone and attaches a lightweight projection:

g_\phi: (\text{hidden state } h, \text{ context } c) \mapsto (\Delta\text{logits}, \text{uncertainty})

import torch
import torch.nn as nn

class LearnedGuidanceHead(nn.Module):
    def __init__(self, d_model, d_context, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.ctx_proj = nn.Linear(d_context, d_model)  # Align context width
        self.ctx_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                              batch_first=True)
        self.delta_head = nn.Linear(d_model, vocab_size)  # Logit adjustments
        self.uncert_head = nn.Linear(d_model, 1)          # Uncertainty in [0, 1]

    def forward(self, hidden, context):
        h = self.proj(hidden)
        ctx = self.ctx_proj(context)
        h, _ = self.ctx_attn(h, ctx, ctx)  # Cross-file awareness
        return self.delta_head(h), torch.sigmoid(self.uncert_head(h))

Context-awareness is essential: type errors often depend on imported definitions. A local-only guidance head would fail to predict errors from cross-file symbols.

5.3 Uncertainty-Gated Application

Guidance is applied only where the head is confident:

\text{logits}' = \text{logits} + \alpha_t \cdot \mathbb{1}[\hat{u}_i < \tau] \cdot \Delta\text{logits}_i

where û_i is the predicted uncertainty at position i, τ is the gating threshold, and α_t = α · (1 − t) increases guidance strength as generation converges toward t = 0.

When uncertainty is high, the system falls back to the base flow model plus hard constraints. This prevents over-commitment to unreliable surrogate gradients, a failure mode identified in the original proposal.
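
The gated update is a one-line array operation. In this sketch the default values of α and τ and the [positions, vocab] shape convention are assumptions:

```python
import numpy as np

# Uncertainty-gated guidance application. Shapes: logits and delta are
# [positions, vocab], uncert is [positions]. Default alpha and tau are
# illustrative hyperparameter choices, not values from the paper.
def apply_gated_guidance(logits, delta, uncert, t, alpha=2.0, tau=0.5):
    alpha_t = alpha * (1.0 - t)                 # stronger guidance near t = 0
    gate = (uncert < tau).astype(logits.dtype)  # 1 where the head is confident
    return logits + alpha_t * gate[:, None] * delta
```

Positions with high predicted uncertainty receive zero adjustment and fall back to the base flow model.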

5.4 Training

Phase 1: Supervised pre-training. Train against exact verifier labels on a dataset of (partial_code, context, verifier_outcome) triples. Training data includes:

  • High-temperature AR model outputs (realistic error distribution)
  • On-policy DFM trajectory samples (generator-specific failure modes)
  • Error annotation via exact verifiers (mypy, pytest, tree-sitter)

Phase 2: On-policy refinement. Collect trajectories from the model's own inference, run exact verifiers, and retrain on distribution-shifted examples, following DiffuCoder's [5] coupled-GRPO for RL-style improvement.

Calibration. The uncertainty head is trained with expected calibration error (ECE):

\mathcal{L}_{\text{cal}} = \sum_{b=1}^{B} \frac{|S_b|}{N}\Big|\text{acc}(S_b) - \text{conf}(S_b)\Big|

where S_b is the set of predictions falling in confidence bin b and N is the total number of predictions.
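
A runnable sketch of ECE over equal-width confidence bins:

```python
# Expected calibration error over n_bins equal-width bins. `conf` holds
# predicted confidences in [0, 1]; `correct` holds 0/1 outcomes.
def ece(conf, correct, n_bins=10):
    n = len(conf)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Left-closed first bin so conf == 0.0 is not dropped.
        items = [(c, a) for c, a in zip(conf, correct)
                 if (c > lo or (b == 0 and c >= lo)) and c <= hi]
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            acc = sum(a for _, a in items) / len(items)
            total += len(items) / n * abs(acc - avg_conf)
    return total
```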

Fidelity gate. Spearman ρ > 0.7 between guidance head predictions and exact verifier outcomes, measured on a held-out evaluation set. If this gate is not met, surrogate guidance is disabled. The threshold is empirically motivated, not theoretically justified.

Connection to PRISM [29]. The guidance head's uncertainty estimate is conceptually related to PRISM's per-token quality scores: both learn to identify low-quality tokens for correction. PRISM provides provable guarantees that the quality score converges to the true posterior; our guidance head additionally predicts the direction of correction (logit adjustments), not merely which tokens are wrong.

6 Adaptive Verification and Localized Repair

6.1 Trigger-Driven Exact Verification

Rather than running exact verifiers on a fixed schedule (every N steps), we trigger verification adaptively based on generation state:

| Trigger | Signal | Rationale |
| --- | --- | --- |
| Structural instability | Parse error count jumps | New errors indicate the trajectory is diverging |
| Energy divergence | E_φ(x_{t−h}) > E_φ(x_t) + ε | Moving away from valid programs |
| Uncertainty plateau | ū > τ for k consecutive steps | Guidance head cannot assess quality |
| Imminent finalization | t < 0.1 | Last chance to correct before commitment |
| Cross-region inconsistency | Conflicting types/symbols across edit regions | Multi-region coordination failure |

6.2 Localized Repair

When verification fails, we remask only the minimal region containing errors. The principle: localize failure, recover minimally.

  1. Localize: Map verifier errors to token spans
  2. Dilate: Expand mask slightly to include syntactic context
  3. Remask: Set error region to [M]
  4. Re-denoise: Run flow matching on masked region only, conditioned on surrounding fills
  5. Re-verify: Check patched region; repeat up to k_max times

This is more principled than global regeneration and directly implements the typed hole metaphor: a repair creates new holes in a mostly-complete program, to be filled under the constraints of the surrounding structure.
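
Steps 1–3 can be sketched as a remasking pass over token spans; the mask token and the fixed dilation width are illustrative assumptions:

```python
# Localize-dilate-remask (steps 1-3 above). Error spans are half-open token
# index ranges, as would come from an error_localizer; the fixed +/- dilation
# is a simplification of "expand mask slightly to include syntactic context".
MASK = "[M]"

def remask_errors(tokens, error_spans, dilation=2):
    """Replace each error span, dilated by a few context tokens, with masks."""
    out = list(tokens)
    for start, end in error_spans:
        lo = max(0, start - dilation)
        hi = min(len(out), end + dilation)
        for i in range(lo, hi):
            out[i] = MASK
    return out
```

The re-denoise step (4) then runs flow matching over exactly these mask positions, conditioned on the untouched surroundings.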

7 Inference Algorithm

The complete inference procedure integrates all components:

Algorithm 1: ASTRAL Shaped Generation

Input: Hole specification H, repository context R, constraint specification S
Output: Generated code x^*, verification status

  1. ctx ← SSMEncoder(R)      // O(|R|) context encoding
  2. (F_hard, E_soft, V_exact) ← ConstraintCompiler(S)      // Three-tier compilation
  3. x_1 ← InitializeFromHole(H)      // All-mask or partial
  4. for t = 1, t_1, …, t_N = 0 (decreasing) do
  5.   logits, h ← f_θ(x_t, t, ctx)      // Denoiser forward pass
  6.   logits ← F_hard.filter(logits, x_t)      // Tier 1: hard pruning
  7.   Δ, û ← g_φ(h, ctx)      // Guidance head
  8.   logits ← logits + α_t · 1[û < τ] · Δ      // Tier 2: gated guidance
  9.   x_{t−h} ← SampleTransition(x_t, logits, t)      // CTMC step
  10.  if ShouldVerify(x_t, x_{t−h}, t) then      // Adaptive trigger
  11.    r ← V_exact.verify(decode(x_{t−h}))      // Tier 3: exact check
  12.    if ¬r.passed then x_{t−h} ← LocalRepair(x_{t−h}, r, ctx)
  13. end for
  14. r_final ← V_exact.verify(decode(x_0))      // Final verification gate
  15. return x_0, r_final.status

8 Complexity Analysis

| Operation | AR Transformer | ASTRAL |
| --- | --- | --- |
| Context encoding | O(n_ctx²) | O(n_ctx) via SSM |
| Per denoising step | – | O(n_out²) self-attention |
| Cross-attention to context | – | O(n_out · K) via bottleneck |
| Full generation | O(n_out · n_ctx) | O(k · n_out² + k · n_out · K) for k steps |
| Editing m tokens | O(n_out · m) minimum | O(k · n_out²) regardless of m |
| Hard constraints | – | ~125% overhead (configurable) |
| Guidance head | – | Same forward pass as denoiser |

The key efficiency wins: (1) linear context encoding enables repository-scale conditioning; (2) k denoising steps (typically 10–50) vs. n sequential AR steps; (3) editing cost is independent of edit location; (4) the guidance head adds negligible cost since it shares the denoiser's computation.

Note on compute parity. Stable-DiffCoder [21] demonstrated that with appropriate training recipes, block diffusion can exceed AR performance under identical compute budgets for code, a stronger result than the original proposal's assumption of an inherent compute penalty.

9 Training Strategy

9.1 Phase A: Base NAR Pretraining

Step A1: SSM context encoder. Pretrain Mamba-2 backbone on code with next-token prediction and AST-aware positional encoding. The SSM must learn to compress repository-scale context into useful representations for cross-attention. Training data: 50B+ tokens of code with parsed AST metadata.

Step A2: Flow denoiser, from AR to diffusion. Following the Stable-DiffCoder [21] recipe, which provided the first controlled evidence that block diffusion surpasses AR on code:

  1. Initialize from a pretrained AR checkpoint (e.g., CodeLlama-7B or equivalent). This provides strong code representations.
  2. Continual pretraining with block diffusion:
    • Divide sequences into blocks of size B (typically 64–256 tokens)
    • Within each block, apply masked diffusion with the CDFM loss
    • Between blocks, maintain AR left-to-right conditioning
    • Use a tailored warmup: start with B = 16, gradually increase to B = 256
  3. Block-wise clipped noise schedule: Clip the masking ratio to [0.1, 0.9] during early training to prevent degenerate gradient signals from fully-masked or nearly-clean inputs
  4. AST-anchored corruption: Replace uniform random masking with hierarchical masking respecting AST depth (§3.3). This requires parsing during data loading but imposes no architectural changes.
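
The clipped schedule in step 3 can be sketched as follows; the uniform per-block ratio sampling is an assumption beyond the clipping itself:

```python
import random

# Block-wise clipped masking schedule (step 3): draw a masking ratio per
# block and clip it away from the degenerate fully-masked / nearly-clean
# extremes. Uniform sampling of the raw ratio is an illustrative assumption.
def sample_mask_ratio(lo=0.1, hi=0.9, rng=random):
    return min(hi, max(lo, rng.random()))

def mask_block(tokens, ratio, mask_token="[M]", rng=random):
    n_mask = max(1, int(round(ratio * len(tokens))))
    idx = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in idx else t for i, t in enumerate(tokens)]
```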

The key finding from Stable-DiffCoder: this recipe not only matches but surpasses AR on HumanEval, MBPP, and LiveCodeBench under identical architecture and data budgets.

Step A3: Joint fine-tuning. Unfreeze the SSM encoder, add cross-attention between denoiser and SSM, train jointly with:

\mathcal{L}_{\text{joint}} = \mathcal{L}_{\text{CDFM}} + \lambda_{\text{ast}} \cdot \mathcal{L}_{\text{AST}} + \lambda_{\text{ctx}} \cdot \mathcal{L}_{\text{ctx-recon}}

where L_ctx-recon is a context reconstruction auxiliary loss ensuring the SSM representations retain useful information through the bottleneck.

9.2 Phase B: Constraint and Guidance Learning

Train the guidance head (§5) against exact verifier labels:

  1. Generate training data: Run the Phase A model at high temperature to collect (partial_code, context, verifier_outcome) triples. Include:

    • On-policy DFM trajectory samples at various denoising steps
    • High-temperature AR model outputs (diverse error distribution)
    • Deliberately injected common error patterns (wrong imports, type mismatches, off-by-one)
  2. Train guidance head: L_guide = L_CE(Δlogits, verifier_labels) + λ_ece · ECE(û, accuracy)

  3. Fidelity gate: Compute Spearman ρ on a held-out evaluation set. If ρ < 0.7, do not proceed to Phase C: the guidance head cannot reliably steer generation.
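
The fidelity gate needs only a Spearman rank correlation. A dependency-free sketch (ties receive average ranks; in practice one would use scipy.stats.spearmanr):

```python
# Spearman rank correlation for the fidelity gate, with average ranks for
# ties. Pure Python so the gate logic is explicit and self-contained.
def _ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(pred, truth):
    rp, rt = _ranks(pred), _ranks(truth)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    st = sum((b - mt) ** 2 for b in rt) ** 0.5
    if sp == 0 or st == 0:
        return 0.0  # degenerate: constant predictions carry no signal
    return cov / (sp * st)

def fidelity_gate(pred, truth, threshold=0.7):
    return spearman_rho(pred, truth) >= threshold
```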

9.3 Phase C: On-Policy Refinement with RL

Apply DiffuCoder's [5] coupled-GRPO (Group Relative Policy Optimization) adapted for guided generation:

  1. Sample trajectories from the full ASTRAL pipeline (SSM + denoiser + guidance + verification)

  2. Score with exact verifiers: parse@1, type@1, test_pass@1

  3. Compute group-relative advantages: For each prompt, sample G completions, rank by verifier scores, compute advantages relative to the group mean

  4. Update policy: \mathcal{L}_{\text{GRPO}} = -\mathbb{E}\Big[\hat{A}_g \cdot \log p_\theta(x_g) - \beta \cdot \text{KL}[p_\theta \| p_{\text{ref}}]\Big]

    where Â_g is the group-relative advantage and β controls the KL penalty against the reference policy.

  5. Retrain guidance head on Phase C trajectories to maintain calibration under the shifted distribution.
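
The group-relative advantage in step 3 can be sketched as score normalization within a group. Dividing by the group standard deviation follows common GRPO practice and may differ from the paper's exact variant:

```python
# Group-relative advantages for G completions of one prompt:
# advantage = (score - group mean) / group std. The std normalization is a
# common GRPO convention, assumed here rather than taken from the paper.
def group_advantages(scores):
    g = len(scores)
    mean = sum(scores) / g
    std = (sum((s - mean) ** 2 for s in scores) / g) ** 0.5
    if std == 0:
        return [0.0] * g  # all completions tied: no learning signal
    return [(s - mean) / std for s in scores]
```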

9.4 Data Requirements

| Tier | Raw code | Typed code | Tested code | Formal specs |
| --- | --- | --- | --- | --- |
| MVP | 50–100B tokens | 1B+ | 100M+ | – |
| Frontier | 500B+ | 10B+ | 1B+ | 100M+ (synthetic) |

9.5 Research Roadmap

| Phase | Months | Deliverable | Go/No-Go |
| --- | --- | --- | --- |
| M0: Infrastructure | 0–2 | Tree-sitter integration, AST encoding, parsing benchmarks | Latency targets met (§3.4) |
| M1: Base flow model | 2–5 | 1B-parameter DFM for code, HumanEval-competitive | pass@1 within 10% of AR baseline |
| M2: SSM context | 5–7 | SSM encoder + cross-attention, 32k+ context | Cross-file resolution accuracy > 80% |
| M3: Constraint compiler | 7–9 | Three-tier compilation, CFG-constrained decoding | parse@1 > 99% with constraints |
| M4: Guidance head | 9–10 | Trained guidance + uncertainty gate | Fidelity ρ > 0.7 |
| M5: Integration | 10–12 | Full pipeline, ablation study, paper | All go/no-go gates passed |

10 Related Work

Diffusion LLMs for code. DiffuCoder [5] established masked diffusion for code generation at the 7B scale with coupled-GRPO for RL-based improvement. LLaDA [8] demonstrated that masked diffusion scales to 8B parameters competitive with LLaMA3 (NeurIPS 2025 oral). Stable-DiffCoder [21] provided the first controlled evidence that block diffusion surpasses AR under identical conditions. AnCoder [18] introduced AST-anchored denoising that prioritizes syntactically salient tokens. DICE [27] demonstrated domain-specific diffusion for CUDA kernel generation. Mercury demonstrated commercial-scale discrete diffusion with claimed 5–10× throughput improvements.

Constrained decoding for diffusion. Mündler et al. [16] presented CFG-constrained decoding for diffusion LLMs via formal language intersection, achieving near-perfect syntactic validity with ~125% overhead. LAVE [17] improved reliability by exploiting parallel predictions for lookahead verification with negligible overhead. Cardei et al. [24] demonstrated training-free constrained discrete diffusion via in-sampling enforcement. Mündler et al. [25] extended constrained decoding to type constraints.

Self-correction and remasking. PRISM [29] introduced provable per-token quality scores for remasking, with theoretical guarantees that the quality score converges to the true posterior. Prism (TTS) [32] introduced hierarchical trajectory search for test-time scaling of diffusion LLMs. ReMDM demonstrated training-free low-confidence remasking. RemeDi combined supervised and RL-trained confidence heads. These methods address the fundamental limitation that unmasked tokens in MDMs remain fixed.

Variable-length infilling. DreamOn [19] introduced expand/delete tokens for dynamic-length generation in diffusion models. CAL [20] discovered that first-step denoising confidence implicitly estimates optimal infilling length, enabling training-free length adaptation.

Energy-based generation. COLD Decoding [9] demonstrated energy-based constrained text generation via Langevin dynamics. EDLM [12] incorporated sequence-level energy into discrete diffusion training. Energy Matching [23] (NeurIPS 2025) unified flow matching and EBMs through a time-independent scalar potential, the direct theoretical antecedent for ASTRAL's energy-guided flow matching. Kona [35] demonstrated commercial-scale energy-based constraint satisfaction reasoning in continuous latent space.

AST-aware code models. TreeDiff [14] introduced AST-guided masking during corruption. AST-T5 [7] demonstrated AST-aware pretraining benefits. Russell et al. [15] operated diffusion directly on syntax tree structures rather than token sequences.

SSMs for code. Codestral Mamba [22] demonstrated production-quality code generation with 256k+ context using Mamba architecture. Mamba-2 [3] revealed deep connections between SSMs and attention through structured state space duality.

Test-time compute for diffusion. McDiffuSE [28] optimized denoising order via MCTS. Prism (TTS) [32] demonstrated hierarchical trajectory search with self-verified feedback for efficient test-time scaling.

Novelty Differentiation

CapabilityDiffuCoderAnCoderTreeDiffLAVEPRISMASTRAL
AST-anchored corruption
AST-anchored denoising
CFG-constrained decoding
Learned energy guidance
Uncertainty-gated guidance
Three-tier constraints
Typed hole correspondence
SSM context (repo-scale)
Adaptive exact verification
Provable self-correctionvia repair
Variable-length infilling
RL refinement (GRPO)

ASTRAL's contribution is not in individual components but in their principled integration under the shaped generation framework: three-tier constraint compilation grounded in the typed hole correspondence, with learned guidance + uncertainty gating replacing expensive enumeration, SSM conditioning enabling repository-scale context, and adaptive verification with localized repair providing practical reliability.

11 Evaluation Plan

11.1 Benchmarks

Generation: HumanEval(+), MBPP(+), LiveCodeBench. Editing/Infilling: CodeXGLUE refinement, bug fixing, multi-hole completion, HumanEval-Infilling, SantaCoder-FIM. Constrained: Type-constrained generation, specification-guided synthesis. Proposed ASTRAL Editing Benchmark: Multi-hole infill with repository context (10–50 files), type hints, and test suites, requiring parse + type-check + test pass.

11.2 Metrics

pass@k, parse@1, type@1, verify@1, time_to_first_valid, repair_cycles_per_success, surrogate-verifier ρ, calibration error

11.3 Ablations

(1) SSM vs. Transformer context encoder. (2) AST-anchored vs. random corruption/denoising. (3) Learned guidance head vs. no guidance vs. top-k probing. (4) Uncertainty gating on/off. (5) Hard CFG constraints on/off. (6) Adaptive vs. fixed-step verification. (7) Variable-length infilling method. (8) NAR-first vs. AR-only on editing tasks. (9) Quality vs. guidance fidelity (surrogate correlation sweep). (10) Context bottleneck vs. full cross-attention.

11.4 Go/No-Go Gates

  1. Surrogate ρ ≥ 0.7 with exact verifiers
  2. No increase in exact violation rate when guidance is enabled
  3. Measurable reduction in time_to_first_valid vs. AR baselines on multi-region editing
  4. Deterministic degraded-mode behavior under verifier outages

12 Discussion

12.1 What's Hard

Coherent designs are cheap. Working systems are expensive. These are the hard problems.

Surrogate fidelity. If learned surrogates drift from real verifiers, guidance is noise. There are no formal guarantees that a surrogate trained on partially-generated code will remain calibrated as the generation distribution shifts through training and inference. PRISM [29] provides provable per-token quality scores that converge to the true posterior, but ASTRAL's guidance head additionally predicts the direction of correction (logit adjustments), which requires stronger assumptions. Periodic recalibration against ground truth is the best available approach, and "best available" is far from "solved." The ρ > 0.7 gate is empirically motivated, not theoretically justified.

Discrete vs. continuous optimization. Kona's continuous latent space is mathematically clean for gradient-based optimization: smooth landscapes, well-defined gradients, convergence guarantees. Code is discrete. The Boltzmann-weighted rate modification (§2.2) is the discrete analog of Langevin dynamics, but the discrete energy landscape has combinatorial structure that continuous optimization theory doesn't address. Straight-through estimators and Gumbel-softmax reparameterization are heuristics that work surprisingly well in practice, but the theory is thin.

Training scale. LLaDA [8] required 2.3T tokens and 0.13M H800 GPU hours to train 8B parameters. DiffuCoder [5] needed 7B parameters and 130B code tokens. Adding AST-aware positional encoding, hierarchical corruption schedules, and energy-based guidance increases training complexity. The compute requirements are substantial, and the training dynamics of guided discrete flow models are less understood than standard AR or even standard diffusion.

Verification bottleneck. Ground-truth verifiers (type checkers, test suites, SMT solvers) are expensive. Running them at every denoising step is prohibitive. Running them too rarely lets surrogates drift. The adaptive trigger policy (§6.1) is our mitigation, but the right verification frequency is task-dependent and there is no general answer.

Integration complexity. SSM context encoder + AST-anchored flow denoiser + learned guidance head + uncertainty gate + three-tier constraint compiler + adaptive verifiers + localized repair. That's a lot of moving parts. Each is individually tractable. Integration is where complexity compounds, where interaction effects between components create behaviors absent from any individual specification.

12.2 Scope of Claims

Claims:

  • Shaped generation (constraints shaping trajectories rather than filtering outputs) is a principled design principle with theoretical grounding in the EBM/flow matching unification [23]
  • The typed hole correspondence (Proposition 1) provides a formal basis for constraint propagation during denoising
  • Three-tier constraint compilation separates concerns that prior work conflates
  • AST-anchored denoising with learned guidance and adaptive verification is a coherent architecture for constrained code generation

Does not claim:

  • Formal correctness guarantees from soft constraints (the guidance head is a heuristic, not a prover)
  • That ASTRAL will outperform frontier AR models on unconstrained generation tasks
  • That the compute tradeoffs will be favorable in all regimes
  • That the system will work out of the box without substantial engineering

12.3 Timing and Trajectory

The timing of this work is significant. In the 14 months since the original ASTRAL proposal (December 2024), discrete diffusion for code has moved from niche research to a validated paradigm: LLaDA [8] matched LLaMA3 at the 8B scale (NeurIPS 2025 oral), Stable-DiffCoder [21] surpassed AR under controlled conditions, DiffuCoder [5] demonstrated RL for diffusion code models, and multiple groups independently developed CFG-constrained decoding for diffusion LLMs [16, 17, 24]. The infrastructure for ASTRAL's design is maturing rapidly.

Simultaneously, energy-based reasoning has moved from theoretical promise to commercial deployment. Kona, Logical Intelligence's EBM reasoning core, achieved 96.2% on hard Sudoku (vs. 2% for frontier LLMs). Aleph, their separate LLM-based orchestration agent for formal verification, progressed from 76% on PutnamBench (December 2025) to 99.4% (668/672) by January 2026. Logical Intelligence has entered pilot programs with energy, semiconductor, and manufacturing companies. The convergence between ASTRAL's discrete flow + energy approach and Kona's continuous EBM approach is not coordinated. It reflects genuine structural properties of constraint satisfaction problems.

The question this paper leaves open: will these approaches converge into hybrid systems? SSMs for context (O(n)O(n), repo-scale), flow matching for generation (bidirectional, structure-aware), energy-based guidance for constraints (continuous optimization), and formal verification for ground truth (deterministic correctness). Each component to its strength, composed through orchestration. That is the direction ASTRAL points, and it is the direction the Kona+Aleph architecture already demonstrates at coarser granularity.

13 Conclusion

ASTRAL frames code generation as energy minimization over a constraint-shaped landscape. The principle of shaped generation (constraints shape what is generated, not filter what was generated) is instantiated through discrete flow matching, where partially-denoised programs correspond to programs with typed holes, and the denoising trajectory corresponds to progressive refinement under bidirectional constraint propagation.

The architecture separates concerns that prior work conflates: hard admissibility (what is possible), soft guidance (what is likely to be correct), and exact verification (what is correct). Each tier uses the appropriate tool: formal language intersection for grammar, learned neural surrogates for per-step guidance, and compilers/tests/provers for ground truth.

We are explicit about what ASTRAL provides and what it does not. It provides empirically high constraint satisfaction through shaped generation and repair loops, not formal guarantees from soft constraints. It provides practical syntactic validity through CFG-constrained decoding or error-driven repair, configurable based on requirements. It surfaces verification status rather than silently claiming guarantees.

The generation paradigm matters less than the constraint architecture. Get the constraints right, and the generation substrate becomes a swappable implementation detail.

Appendix A Proof Sketches

A.1 Proposition 1: Denoising–Hole Correspondence

Sketch. We construct the correspondence by induction on the number of unmasked positions.

Base case. At t = 1, x_1 = [M]^L. The induced holey program H(x_1) consists entirely of typed holes with vacuous environments. This is the "empty program" with L holes.

Inductive step. Suppose at time t, H(x_t) is a well-scoped holey program with m masked positions as holes. At time t − h, position j is unmasked to value v. We must show that H(x_{t−h}) is a well-scoped holey program with m − 1 holes.

The key observation: the denoiser's bidirectional attention computes p_θ^j(v | x_t, t) by conditioning on all revealed positions. This conditioning is precisely the environment Γ_j of the typed hole at position j. The predicted distribution p_θ^j(· | x_t, t) is the denoiser's estimate of which fills are consistent with the environment, the analog of type-checking a hole fill.

After unmasking j to v, the environments of remaining holes update:

\Gamma_k \leftarrow \Gamma_k \cup \{j \mapsto v\} \quad \text{for all remaining holes } k

This update is computed by the denoiser's attention in the next forward pass. The newly revealed position j becomes part of the conditioning context for all remaining masked positions.

Well-scopedness. If the denoiser predicts a value v that is syntactically valid in the context of Γ_j, then H(x_{t−h}) remains well-scoped. The AST-anchored corruption schedule (§3.3) ensures that structural tokens (which define scopes) are unmasked before detail tokens (which must respect scopes), maintaining well-scopedness with high probability throughout the trajectory. ∎

A.2 Proposition 2: Rate Preservation

Proof. By definition, ũ_t^i(y | x) = u_t^i(y | x) · exp(−α_t [E_φ(x ⊕_i y) − E_φ(x)]).

Since u_t^i(y | x) ≥ 0 and the exponential is always positive, ũ_t^i(y | x) ≥ 0. ✓

For the sum condition:

\sum_{y \neq x^i} \tilde{u}_t^i(y \mid x) \cdot h = \sum_{y \neq x^i} u_t^i(y \mid x) \cdot \exp(-\alpha_t \Delta E_{i,y}) \cdot h

where ΔE_{i,y} = E_φ(x ⊕_i y) − E_φ(x). Since exp(−α_t ΔE_{i,y}) ≤ exp(α_t |ΔE_{i,y}|) ≤ exp(α_t ΔE_max):

\sum_{y \neq x^i} \tilde{u}_t^i(y \mid x) \cdot h \leq \exp(\alpha_t \Delta E_{\max}) \sum_{y \neq x^i} u_t^i(y \mid x) \cdot h

The original rates satisfy Σ_{y≠x^i} u_t^i(y | x) · h ≤ 1 for small h. Therefore Σ_{y≠x^i} ũ_t^i(y | x) · h ≤ exp(α_t ΔE_max), which is ≤ 1 when h ≤ exp(−α_t ΔE_max) / Σ_y u_t^i(y | x). Equivalently, we require a smaller step size h′ = h · exp(−α_t ΔE_max), which simply means guidance increases the effective number of denoising steps. ∎
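
The proposition can be checked numerically: Boltzmann-weighting keeps the rates nonnegative, and the shrunken step size keeps the total transition probability at most 1. The rates and energy deltas below are illustrative values, not from the paper:

```python
import math

# Numeric sanity check of Proposition 2. `rates` are base CTMC transition
# rates u_t^i(y|x); `delta_e` are energy changes E(x (+)_i y) - E(x); both
# are illustrative. Guided rates multiply by exp(-alpha * dE).
def guided_rates(rates, delta_e, alpha):
    return {y: r * math.exp(-alpha * delta_e[y]) for y, r in rates.items()}

rates = {"a": 0.4, "b": 0.3, "c": 0.2}      # total 0.9 <= 1 with h = 1
delta_e = {"a": -0.5, "b": 0.2, "c": 1.0}
alpha = 1.0
tilde = guided_rates(rates, delta_e, alpha)

de_max = max(abs(d) for d in delta_e.values())
h = math.exp(-alpha * de_max) / sum(rates.values())  # shrunken step size
total_prob = sum(tilde.values()) * h                 # must stay <= 1
```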

A.3 Constraint Propagation Complexity

Claim. Under the AST-locality assumption (that constraint propagation from position $i$ affects only positions within $i$'s scope or connected via the symbol graph), the cost of constraint propagation per unmasking step is $O(|\text{scope}(i)| + |\text{symrefs}(i)|)$ rather than $O(L)$.

Argument. Let $S_i = \{j : j \in \text{scope}(i)\} \cup \{j : (i, j) \in E_{\text{sym}}\}$ be the set of positions affected by unmasking position $i$, where $E_{\text{sym}}$ is the symbol graph edge set. For typical code:

  • $|\text{scope}(i)| = O(b)$ where $b$ is the average scope size (function body, class body)
  • $|\text{symrefs}(i)| = O(r)$ where $r$ is the average number of cross-file references per symbol

Both $b$ and $r$ are typically $\ll L$ for well-structured code. The total propagation cost over all $L$ unmasking steps is $O(L \cdot (b + r))$ rather than $O(L^2)$.

For pathological cases (deeply nested scopes, extensively cross-referenced symbols), the propagation degrades toward $O(L^2)$. The architecture handles this gracefully: the denoiser's attention already processes all positions, so propagation cost is dominated by the forward-pass cost.
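The cost accounting above can be illustrated with a toy model; the scope sizes and symbol-graph degrees below are assumptions chosen for illustration, not measurements:

```python
# Toy cost model for AST-local constraint propagation. Propagating from
# position i touches only scope(i) and symrefs(i), so the total over L
# unmasking steps is O(L * (b + r)) rather than the O(L^2) incurred by
# propagating to every position at every step.

def propagation_cost(scopes, sym_edges):
    """Total positions touched: scopes[i] is the scope size at position i,
    sym_edges[i] is the list of symbol-graph neighbors of position i."""
    return sum(scopes[i] + len(sym_edges[i]) for i in range(len(scopes)))

L = 1000
b, r = 20, 3                        # assumed avg scope size and cross-refs
scopes = [b] * L
sym_edges = [list(range(r))] * L

local = propagation_cost(scopes, sym_edges)
assert local == L * (b + r)         # 23,000 touches
assert local < L * L                # vs. 1,000,000 for dense propagation
```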

Appendix B Extended Related Work

B.1 Masked Diffusion Models at Scale

LLaDA [8] (NeurIPS 2025 oral) demonstrated that masked diffusion models scale to 8B parameters trained on 2.3T tokens, achieving performance competitive with LLaMA3 8B across general tasks, math, code, and in-context learning. Critically, LLaDA operates with a vanilla Transformer backbone and optimizes a variational lower bound on log-likelihood, establishing that principled generative modeling via masked diffusion is viable at LLM scale. SMDM established the first scaling laws for masked diffusion, showing compute-optimal behavior analogous to Chinchilla scaling for AR models.

B.2 Self-Correction in Diffusion Models

A fundamental limitation of masked diffusion is that once a token is unmasked, it remains fixed. Several approaches address this:

  • ReMDM [Wang et al., 2025]: Training-free remasking of low-confidence tokens
  • PRISM [29]: Provable per-token quality scores via a self-correction loss; the learned quality head detects incorrect tokens for remasking without RL or external verifiers
  • Prism (TTS) [32]: Hierarchical trajectory search with partial remasking for test-time scaling of diffusion LLMs, achieving best-of-N performance with fewer function evaluations
  • RemeDi [Huang et al., 2025]: RL-trained confidence head with supervised and reward-based objectives

ASTRAL's localized repair (§6.2) builds on this line of work but adds two innovations: (1) external verification-driven remasking rather than model-confidence-driven, providing ground-truth error signals; and (2) AST-aware span dilation, ensuring that remask regions align with syntactic boundaries.
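Verification-driven remasking with AST-aware span dilation can be sketched as follows; the interval representation of syntactic spans and all function names here are illustrative assumptions, not ASTRAL's implementation:

```python
# Sketch of verification-driven remasking with AST-aware span dilation
# (assumed representation: each AST node is a half-open token interval).
# An external verifier flags token positions; each flagged position is
# dilated to the smallest enclosing syntactic span before remasking, so
# repair regions align with AST boundaries rather than raw token offsets.

def dilate_to_ast(pos, spans):
    """Return the smallest (start, end) interval in `spans` covering pos."""
    covering = [s for s in spans if s[0] <= pos < s[1]]
    return min(covering, key=lambda s: s[1] - s[0])

def remask(tokens, flagged, spans, mask="<M>"):
    """Remask the dilated span around every verifier-flagged position."""
    out = list(tokens)
    for pos in flagged:
        start, end = dilate_to_ast(pos, spans)
        for j in range(start, end):
            out[j] = mask
    return out

toks = ["def", "f", "(", "x", ")", ":", "return", "x", "+", "y"]
spans = [(0, 10), (6, 10), (7, 10)]   # module, return stmt, expression
fixed = remask(toks, flagged=[9], spans=spans)  # verifier: `y` is unbound
# the whole expression `x + y` is remasked, not just the token `y`
```

Dilating to the expression node lets the denoiser regenerate a complete, type-checkable subterm instead of patching a single identifier in place.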

B.3 Test-Time Scaling for Diffusion LLMs

McDiffuSE [28] optimizes denoising order via Monte Carlo tree search, treating slot-filling order as a sequential decision problem. Prism (TTS) [32] introduces hierarchical trajectory search with local branching and self-verified feedback. Both demonstrate that compute-at-inference-time strategies, well-established for AR models (best-of-N, tree search), transfer productively to diffusion LLMs. ASTRAL's adaptive verification triggers (§6.1) concentrate verification resources where they are most needed.

Appendix C Glossary of Notation

| Symbol | Meaning |
| --- | --- |
| $\mathcal{V}$, $\bar{\mathcal{V}}$ | Vocabulary; vocabulary plus mask token |
| $L$ | Output sequence length |
| $x_t \in \bar{\mathcal{V}}^L$ | Sequence state at time $t$ |
| $p_t(x)$ | Probability path at time $t$ |
| $u_t^i(y \mid x)$ | Transition rate: position $i$ from $x^i$ to $y$ at time $t$ |
| $E(x)$, $E_\phi(x)$ | True / surrogate constraint energy |
| $C_d(x)$ | Violation measure for constraint domain $d$ |
| $\alpha_t$ | Guidance strength at time $t$ |
| $\tilde{u}_t^i$ | Energy-guided transition rate |
| $\square_\sigma^\Gamma$ | Typed hole with type $\sigma$ and environment $\Gamma$ |
| $H(x_t)$ | Induced holey program at time $t$ |
| $\phi_{d \to d'}$ | Constraint propagation morphism |
| $\rho$ | Spearman correlation (fidelity gate) |
| $\hat{u}_i$ | Predicted uncertainty at position $i$ |
| $\tau$ | Uncertainty gating threshold |
| $n_{\text{ctx}}$, $n_{\text{out}}$ | Context length, output length |
| $K$ | Number of context bottleneck latents |
| $N$ | Number of denoising steps |

References

[1] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., & van den Berg, R. (2021). Structured Denoising Diffusion Models in Discrete State-Spaces. NeurIPS 2021. arXiv:2107.03006

[2] Campbell, A., Benton, J., De Bortoli, V., Rainforth, T., Deligiannidis, G., & Doucet, A. (2024). A Continuous Time Framework for Discrete Denoising Models. NeurIPS 2024. arXiv:2205.14987

[3] Dao, T., & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024. arXiv:2405.21060

[4] Gat, I., Remez, T., Shaul, N., Kreuk, F., Chen, R. T. Q., Synnaeve, G., Adi, Y., & Lipman, Y. (2024). Discrete Flow Matching. NeurIPS 2024. arXiv:2407.15595

[5] Gong, S., Zhang, R., Zheng, H., Gu, J., Jaitly, N., Kong, L., & Zhang, Y. (2025). DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation. arXiv:2506.20639

[6] Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752

[7] Gong, L., Elhoushi, M., & Cheung, A. (2024). AST-T5: Structure-Aware Pretraining for Code Generation and Understanding. arXiv:2401.03003

[8] Nie, S., Zhu, F., You, Z., et al. (2025). Large Language Diffusion Models. NeurIPS 2025 (Oral). arXiv:2502.09992

[9] Qin, L., Welleck, S., Khashabi, D., & Choi, Y. (2022). COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics. NeurIPS 2022. arXiv:2202.11705

[10] Poesia, G., Polozov, A., Le, V., et al. (2022). Synchromesh: Reliable Code Generation from Pre-trained Language Models. ICLR 2022. arXiv:2201.11227

[11] Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency Models. ICML 2023. arXiv:2303.01469

[12] Xu, M., Geffner, T., Kreis, K., et al. (2024). Energy-Based Diffusion Language Models for Text Generation. ICLR 2025. arXiv:2410.21357

[13] Ye, J., Liu, M., Shi, J., & Yeung, D.-Y. (2024). DREAM: Diffusion Rectification and Estimation-Adaptive Models. CVPR 2024. arXiv:2312.00210

[14] Zeng, Y., Cao, J., Li, Z., et al. (2025). TreeDiff: AST-Guided Code Generation with Diffusion LLMs. arXiv:2508.01473

[15] Russell, S. et al. (2025). Diffusion on Syntax Trees for Program Synthesis. ICLR 2025.

[16] Mündler, N., Dekoninck, J., & Vechev, M. (2025). Constrained Decoding of Diffusion LLMs with Context-Free Grammars. arXiv:2508.10111

[17] Zhang, Y., Li, Y., Liu, Y., et al. (2026). LAVE: Lookahead-then-Verify Reliable Constrained Decoding for Diffusion LLMs. arXiv:2602.00612

[18] Xue, A., Rout, L., Caramanis, C., & Shakkottai, S. (2026). AnCoder: Anchored Code Generation via Discrete Diffusion Models. arXiv:2602.17688

[19] Wu, Z. et al. (2026). DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas. arXiv:2602.01326

[20] Liu, H., Yang, Z., & Su, B. (2026). Diffusion LMs Can Approximate Optimal Infilling Lengths Implicitly. arXiv:2602.00476

[21] Fan, C., Heng, W., Li, B., et al. (2026). Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model. arXiv:2601.15892

[22] Mistral AI. (2024). Codestral Mamba. Technical Report

[23] Balcerak, M., Amiranashvili, T., Shit, S., Terpin, A., et al. (2025). Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling. NeurIPS 2025. arXiv:2504.10612

[24] Cardei, B. et al. (2025). Constrained Language Generation with Discrete Diffusion. arXiv:2503.09790

[25] Mündler, N. et al. (2025). Type-Constrained Code Generation with Language Models. arXiv:2504.09246

[26] Jaegle, A., Gimeno, F., Brock, A., et al. (2021). Perceiver: General Perception with Iterative Attention. ICML 2021. arXiv:2103.03206

[27] Bai, H. et al. (2026). DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels. arXiv:2602.11715

[28] Ong, J. J. L. et al. (2026). McDiffuSE: Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models. arXiv:2602.12586

[29] Kim, T. et al. (2025). PRISM: Plug-in Remasking for Inference-time Self-correction of Masked Diffusions. arXiv:2510.01384

[30] Omar, C., Voysey, I., Hilton, M., Aldrich, J., & Hammer, M. A. (2019). Hazel: A Live Functional Programming Environment with Typed Holes. POPL 2019. DOI:10.1145/3290327

[31] Lipman, Y., Havasi, M., Holderrieth, P., et al. (2024). Flow Matching Guide and Code. arXiv:2412.06264

[32] Bai, J. et al. (2026). Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models. arXiv:2602.01842

[33] Holderrieth, P., Havasi, M., Yim, J., et al. (2024). Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective. ICLR 2025. arXiv:2412.03487

[34] Su, Y., Wang, Z., & Chen, J. (2025). A Theoretical Analysis of Discrete Flow Matching. arXiv:2509.22623

[35] Logical Intelligence. (2026). Kona 1.0: Energy-Based Reasoning Model. Technical Report
