Energy, Structure, and Shaped Generation
On ASTRAL, energy-based models, and what code generation looks like when you stop predicting and start satisfying constraints.
In "The Hole in the Puzzle," I introduced shaped generation: the principle that constraints should shape what gets generated, not filter what was generated. That principle is paradigm-independent. This post explores what it looks like beyond autoregressive models – through discrete flow matching, energy-based reasoning, and the structural properties of code that make these approaches natural. ASTRAL is a research direction I've been developing. Logical Intelligence's Kona is independent proof that energy-based constraint satisfaction works at scale. The convergence is becoming clear: the generation paradigm matters less than the constraint architecture.
Autoregressive models commit left-to-right. Each token constrains all subsequent tokens irrevocably. For prose, this is fine – natural language flows forward, and early words genuinely do constrain later ones in useful ways. For code, the commitment ordering is often wrong. You're deciding the function signature before you understand the implementation. You're choosing a return type before you know what the body computes. You're naming variables before you know their roles.
Humans don't write code this way. We sketch structure, leave gaps, fill in details, revise earlier decisions as later ones clarify what's needed. Constraints flow bidirectionally: the implementation informs the signature as much as the signature constrains the implementation. The type of a loop variable depends on what the loop does, but what the loop does depends on the data structures chosen three functions away.
The previous post showed how shaped generation works for autoregressive models through token masking, and gestured at what it might look like for diffusion and flow matching. This post follows that thread into non-autoregressive territory – where the generation paradigm changes, but the constraint architecture carries over. This isn't AR-bashing. Autoregressive models are remarkably good and getting better. Ananke proves that shaped generation works within the AR paradigm today, and that's where the practical value is right now. This is about what becomes possible when you relax the left-to-right commitment.
The Landscape Shift
When I wrote the previous posts, discrete diffusion for code was a more niche area of research. That's changed fast. Apple released DiffuCoder, a 7B open-source diffusion language model specifically for code generation – trained on code, evaluated on code, with results competitive with AR models of similar scale. Inception Labs shipped Mercury, the first commercial diffusion LLM, claiming 5-10x throughput improvements over autoregressive models of comparable quality. These aren't conference papers about toy problems. They're working systems at meaningful scale, and they landed within months of each other.
The DiffuCoder result I find most interesting isn't the performance numbers – it's the discovery that discrete diffusion language models can choose how causal their generation is. The spectrum between fully autoregressive (one token at a time, strict left-to-right) and fully parallel (all tokens simultaneously) is continuous, not binary. Generation strategy becomes a dial you can turn based on the task, not a fixed architectural commitment.

Discrete diffusion, briefly: start with a sequence of mask tokens. The model learns to iteratively unmask them – predicting which tokens to reveal and what values they should take. Each step refines all positions simultaneously, conditioned on what's already been revealed. It's corruption-then-reversal: corrupt clean sequences to noise during training, learn to reverse the corruption during inference.
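The unmasking loop can be sketched in a few lines. This is a toy, not DiffuCoder's sampler: a real model predicts token values and ranks positions by its own confidence; here we copy values from a known target and use token length as a stand-in "confidence" so that short structural tokens reveal first, just to show that the fill order need not be left-to-right.

```python
MASK = "<mask>"

def toy_denoise(target, steps=4):
    """Toy discrete-diffusion sampler: start fully masked, reveal a few
    positions per step, conditioning on everything already revealed.
    `target` stands in for the model's predictions."""
    seq = [MASK] * len(target)
    history = []                          # order in which positions fill
    per_step = -(-len(target) // steps)   # ceiling division
    for _ in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # Stand-in confidence: shorter tokens are "easier", so they
        # unmask first -- in any order, not strictly left-to-right.
        ranked = sorted(masked, key=lambda i: len(target[i]))
        for i in ranked[:per_step]:
            seq[i] = target[i]
            history.append(i)
    return seq, history

tokens = "def add ( a , b ) : return a + b".split()
final, order = toy_denoise(tokens)
# `final` matches `tokens`; `order` is not monotonically increasing.
```

The point of the sketch is the shape of the loop: every step conditions on the full partially-revealed sequence, and the reveal order is a policy choice, not an architectural constraint.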
But the property that matters most for code isn't speed – it's that bidirectional conditioning is native. In a diffusion model, a token in the middle of a function body is influenced by what comes before and after it, simultaneously. There's no left-to-right bottleneck where early decisions can't be revised in light of later context.
For code, this is structurally significant. A function's body should inform its signature. A return type should constrain the implementation. Error handling should be consistent with the error types defined elsewhere in the module. Bidirectional information flow is how code actually works – the dependency graph is a DAG, not a chain. Autoregressive generation fights this structure; diffusion embraces it. DiffuCoder is open-source and well worth examining. The codebase is clean and the paper is thorough about both what works and what doesn't.
ASTRAL: Structure-Aware Flow Matching
ASTRAL (AST-aware Refinement And Learning) is a research direction I've been developing – a design for combining discrete flow matching with the constraint architecture from Ananke and CLaSH (Coordinated Logical and Semantic Holes). ASTRAL is theoretical at this point. It may be wrong in important ways. But the design space is real, and my thinking continues to evolve as the field moves. The full proposal and reading list are public snapshots of where my thinking was when I wrote them. Some of what follows has already moved past those snapshots.
Context at Scale
The distinctive feature of code generation at the repo level is that the context scales, not the generation. You might need millions of tokens of context – the full repository, dependencies, documentation, tests – to generate hundreds of tokens of new code. The ratio is extreme and it shapes everything.
This means O(n) context processing is a prerequisite, not an optimization. Transformer self-attention is O(n²) in sequence length. At repo scale, that's the difference between possible and impossible. ASTRAL proposes an SSM (Mamba-2 style) context encoder with AST-aware positional encoding – encoding not just sequence position but tree depth, sibling index, scope nesting level, and syntactic node type. The model understands that it's inside the third method of a class definition, not just at token position 47,231.

Why SSMs? State space models process sequences in O(n) time and constant memory per step. At repo scale – hundreds of thousands to millions of tokens of context – this is the difference between fitting in memory and not. The tradeoff is reduced expressiveness for certain long-range dependencies compared to attention, but for code context (where structure provides strong locality), this is favorable.
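The structural coordinates such an encoding would carry can be made concrete with Python's own `ast` module. A minimal sketch, not ASTRAL's implementation: it records node type, tree depth, and sibling index for each AST node, omitting scope nesting and the learned embedding itself.

```python
import ast

def structural_positions(source):
    """Walk a Python AST and record, for each node, structural
    coordinates of the kind an AST-aware positional encoding would
    embed alongside the flat token position."""
    tree = ast.parse(source)
    out = []

    def walk(node, depth, sibling):
        out.append((type(node).__name__, depth, sibling))
        for i, child in enumerate(ast.iter_child_nodes(node)):
            walk(child, depth + 1, i)

    walk(tree, 0, 0)
    return out

positions = structural_positions(
    "class C:\n    def m(self):\n        return 1\n"
)
# Each entry is (node_type, depth, sibling_index); a model conditions
# on these instead of (or in addition to) the flat token index.
```

With coordinates like these, "inside the third method of a class definition" becomes a representable fact rather than something the model must infer from raw token distance.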
Cross-file symbol resolution matters too. The same UserService referenced in a controller, a test file, and a migration should be understood as the same entity, not three unrelated token sequences that happen to match. ASTRAL proposes cross-file symbol graph attention: a sparse attention mechanism that connects symbol references across files through the dependency graph. The SSM handles local sequential context. The symbol graph handles structural relationships – imports, type hierarchies, call graphs – that cross file boundaries. This is how programmers think about codebases: local detail within files, structural relationships between them.
Hierarchical Corruption
Standard diffusion corrupts uniformly – every position has the same probability of being masked at each noise level. ASTRAL proposes hierarchical corruption that respects AST structure. High-level constructs (module declarations, class hierarchies, function signatures) are corrupted last and recovered first. Implementation details (variable names, loop bounds, specific expressions) are corrupted first and recovered last.
This mirrors how humans reason about code: architecture before implementation. You decide you need a recursive function over a tree before you decide what the base case returns. The corruption schedule encodes a structural prior about the natural refinement order of programs.
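A corruption schedule like this can be sketched as a mask probability that depends on a token's abstraction level. The three levels and the linear ramps below are toy assumptions, not ASTRAL's actual schedule; the property being illustrated is only the ordering.

```python
def corruption_prob(level, t):
    """Hierarchical corruption: mask probability for a token at a given
    abstraction level, at noise time t in [0, 1]. Toy level assignment:
    0 = signatures/declarations, 1 = control flow, 2 = expressions and
    names. Low-level tokens corrupt first (small t); high-level
    structure only corrupts as t -> 1, so during denoising (t running
    1 -> 0) structure is recovered first."""
    n_levels = 3
    # Each level starts corrupting at a staggered onset, then ramps to 1.
    onset = (n_levels - 1 - level) / n_levels   # level 2 ramps from t=0
    return min(1.0, max(0.0, (t - onset) / (1.0 - onset)))

# Midway through the forward process, expression-level tokens are half
# masked while signatures are still untouched:
mid = [corruption_prob(level, 0.5) for level in (0, 1, 2)]
```

At every noise level the mask probability is monotone in abstraction level, which is exactly the structural prior: details dissolve before architecture does, so architecture reappears first.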
A parser-based validity regularizer accompanies this: a term in the training loss that penalizes denoising steps producing syntactically invalid intermediates. The denoising trajectory stays closer to valid code throughout, not just at the endpoint. The model learns to refine through valid programs, not through token soup that happens to converge. This maps onto the levels of abstraction of typed holes from the previous post: the corruption schedule is, in a sense, creating and filling typed holes in order of abstraction.
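The check at the heart of such a regularizer can be sketched with Python's parser. This is an assumption-laden toy: a real system would use an error-tolerant incremental parser (tree-sitter, say) over the raw intermediate, and the penalty would enter the loss with a tuned weight; here we plug holes with a placeholder identifier and ask whether the scaffold parses at all.

```python
import ast

MASK = "<mask>"

def validity_penalty(tokens):
    """Toy validity regularizer term: 0.0 if the partially-denoised
    program parses once holes are plugged with a placeholder
    identifier, 1.0 otherwise."""
    source = " ".join("___hole___" if t == MASK else t for t in tokens)
    try:
        ast.parse(source)
        return 0.0
    except SyntaxError:
        return 1.0

# A masked expression still parses; broken structure does not:
ok = validity_penalty("x = <mask> + 1".split())
bad = validity_penalty("def <mask> ( :".split())
```

Summed over denoising steps, a term like this rewards trajectories that pass through well-formed scaffolds rather than converging from noise at the last moment.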
Denoising as Hole-Filling
Here's where the connection to the previous post becomes structural, not just metaphorical.
In The Hole in the Puzzle, I described typed holes: placeholders that carry a type, a context, and constraints. A program with holes is still well-defined – you can typecheck it, reason about it, and in systems like Hazel, even run it. Filling a hole is a local operation with global consequences: the fill is validated against the hole's captured environment, and information propagates to constrain neighboring holes.
A partially-denoised program in ASTRAL's flow-matching process is a program with typed holes. Every masked position is a hole. Every revealed position is a fill. The constraints that typed holes carry – expected type, scope environment, cross-domain requirements – are exactly the constraints that should guide which tokens get unmasked next and what values they take.
The crucial difference from AR generation: in an autoregressive model, holes can only be filled left-to-right. You can't skip ahead to fill a return type that would constrain the function body, then come back. The fill order is locked to the token order, regardless of the constraint structure. In flow-matching, the fill order follows the constraint structure. Hierarchical corruption ensures that high-level holes (signatures, type annotations) fill first, and their constraints propagate downward to implementation-level holes – exactly the progressive refinement that CLaSH was designed for.
This maps directly to bidirectional typing. When a type annotation position gets unmasked (synthesis ↑), that type information flows down to constrain the function body positions that are still holes (analysis ↓). When an implementation detail gets filled, its synthesized type flows up to validate against the already-filled signature. The bidirectional information flow that typed holes enable theoretically is what diffusion models do natively – conditioning on the full context of already-revealed positions, regardless of linear order.
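A minimal sketch of a fill with bidirectional propagation, under stated assumptions: the `Hole` record, the `fill` function, and the dependency map are all illustrative names, not ASTRAL's API, and "types" are just strings.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hole:
    """A masked position carrying constraints, per the typed-hole view
    of denoising."""
    name: str
    expected_type: Optional[str] = None   # analysis: imposed from outside
    filled: Optional[str] = None          # the revealed token(s), if any

def fill(holes, deps, name, value, synthesized_type):
    """Fill one hole and propagate its synthesized type (synthesis
    direction) to the holes it constrains (analysis direction).
    `deps` maps a hole to the holes its type constrains -- e.g. a
    return-type annotation constrains the body's tail expression,
    regardless of token order."""
    holes[name].filled = value
    for dep in deps.get(name, []):
        if holes[dep].filled is None:
            holes[dep].expected_type = synthesized_type
    return holes

holes = {"ret_type": Hole("ret_type"), "body_expr": Hole("body_expr")}
deps = {"ret_type": ["body_expr"]}   # annotation constrains the body
fill(holes, deps, "ret_type", "int", "int")
# The still-masked body hole is now constrained before it fills.
```

The direction of information flow is the point: the fill order followed the constraint structure (annotation first), and the constraint reached a position that, in token order, might sit either before or after it.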
And the cross-domain morphisms from CLaSH become constraint propagation channels during denoising. When a type-level position fills, the Types → Imports morphism fires: the newly revealed type may require an import that wasn't previously constrained. When a control flow structure fills, ControlFlow → Types fires: unreachable branches relax their type constraints. Each unmasking step triggers a cascade of cross-domain updates that narrow what the remaining holes can be.

Propagation cost: this cascade is the expensive part. Naively, every fill triggers O(n) constraint updates across all remaining holes and all domain morphisms. In practice, the AST structure provides locality – most fills only affect holes within the same scope or connected through the symbol graph. Exploiting this sparsity is what makes the propagation tractable.
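The cascade-with-locality idea can be sketched as a worklist over domains, pruned by scope. Everything here is illustrative – the domain names, the scope model (a flat label instead of a real scope tree or symbol graph), and the morphism table.

```python
from collections import deque

def propagate(fill_domain, fill_scope, holes, morphisms):
    """Toy cross-domain cascade: when a hole in `fill_domain` fills,
    fire its morphisms (e.g. "types" -> "imports"), but only touch
    holes sharing the filled hole's scope -- the locality that keeps
    the cascade tractable.

    holes:     list of (domain, scope) pairs still unfilled
    morphisms: dict mapping a domain to its downstream domains
    Returns indices of holes whose constraints were narrowed."""
    affected = []
    work = deque(morphisms.get(fill_domain, ()))
    seen = set()
    while work:
        domain = work.popleft()
        if domain in seen:
            continue
        seen.add(domain)
        for i, (d, scope) in enumerate(holes):
            if d == domain and scope == fill_scope:
                affected.append(i)
        work.extend(morphisms.get(domain, ()))   # transitive morphisms
    return affected

holes = [("imports", "mod"), ("imports", "other"), ("semantics", "mod")]
morphisms = {"types": {"imports", "semantics"}}
# Filling a type hole in scope "mod" narrows only same-scope holes;
# the import hole in scope "other" is never visited.
touched = propagate("types", "mod", holes, morphisms)
```

The `seen` set bounds the cascade even when morphisms form cycles, and the scope filter is the stand-in for the AST/symbol-graph sparsity the text describes.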
Energy-Based Constraint Guidance
This is where ASTRAL connects to the broader energy-based model movement, and where the design gets both most interesting and most speculative.
The constraint propagation from typed holes needs a mechanism to influence the denoising trajectory. Energy functions provide that mechanism. Constraints from specifications, types, and tests compile into energy functions that bias the trajectory at each step. Valid code lives in energy minima. The flow-matching process descends the energy landscape – each denoising step moves toward lower energy, where energy means "satisfies more constraints more completely." The CLaSH product domain from the previous post becomes a composite energy landscape – each constraint domain contributes a term, and the cross-domain morphisms become coupling terms. Filling a type hole changes the energy landscape for import and semantic holes simultaneously.
The design is two-tiered, and the tiers serve fundamentally different roles. Fast learned surrogates provide per-step guidance: lightweight neural networks trained to approximate what real verifiers would say about partially-generated code. They're cheap enough to run at every denoising step, providing the continuous gradient signal needed to shape the trajectory. Ground-truth verifiers – type checkers, test execution, SMT solvers – run at intervals to anchor the surrogates to reality. They're too expensive for every step, but they provide the definitive correctness signal that the surrogates are trained to approximate. The surrogates steer. The verifiers correct course.

Surrogate fidelity is the hardest part of this design. If learned surrogates diverge from real verifiers, the model is steered by a broken compass. The target is Spearman ρ > 0.7 between surrogate scores and verifier outcomes. That's an empirical bet, not a formal guarantee. Periodic recalibration against ground truth is the best available mitigation.
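The composite landscape is just a weighted sum of per-domain terms, which is easy to make concrete. A minimal sketch: the domains, weights, and the feature-dict "candidate" are all invented for illustration; in the real design each term would be a learned surrogate anchored by a verifier, not a hand-written predicate.

```python
def composite_energy(candidate, terms, weights):
    """Composite energy: each constraint domain contributes a term
    (lower = more satisfied); valid code sits at the joint minimum."""
    return sum(weights[name] * fn(candidate) for name, fn in terms.items())

# Toy domain terms over a candidate reduced to a dict of features.
terms = {
    "syntax": lambda c: 0.0 if c.get("parses") else 1.0,
    "types":  lambda c: 0.0 if c.get("typechecks") else 1.0,
    "tests":  lambda c: 1.0 - c.get("tests_passed", 0.0),
}
weights = {"syntax": 5.0, "types": 2.0, "tests": 1.0}

good = {"parses": True, "typechecks": True, "tests_passed": 1.0}
bad  = {"parses": True, "typechecks": False, "tests_passed": 0.5}
# good sits at energy 0; bad pays 2.0 (types) + 0.5 (tests) = 2.5.
```

Coupling terms between domains – the morphism-induced interactions – would enter the same sum as terms over pairs of domains, which is what makes the CLaSH product domain a single landscape rather than separate filters.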
I designed ASTRAL's energy-based guidance system based on theoretical arguments about what code generation should look like when you take constraint satisfaction seriously. What I didn't expect was that someone would ship a commercial proof that energy-based constraint satisfaction works at scale – from an entirely different direction.
Kona and the Physics of Reasoning
Logical Intelligence has an unusual origin for an AI lab. Founded by Eve Bodnia, whose PhD is in quantum information theory and who has 22 papers on dark matter and cosmology. Yann LeCun as founding chair of the Technical Research Board. Their team draws heavily from physics, formal verification, and mathematical optimization rather than the NLP-to-scaling pipeline that produced most current AI labs.

Eve Bodnia's physics background isn't incidental. Energy-based models are variational methods – and variational methods are the bread and butter of quantum mechanics. The intuition transfer from physics to constraint satisfaction is direct: find the state that minimizes the energy functional, subject to boundary conditions.
Kona is their first commercial model, and it's non-autoregressive. It reasons in continuous latent space – not discrete tokens, but dense vector representations that can be smoothly optimized. It uses approximate gradients to iteratively edit outputs toward constraint satisfaction, refining a complete solution in-place rather than building one left-to-right. The metaphor they use: "seeing a maze from above instead of wandering through it." All constraints are evaluated simultaneously across the entire output, not sequentially as each token commits.
The benchmarks are striking. 96.2% Sudoku solve rate versus 2% for frontier LLMs, at roughly 300x the speed. 76% on Putnam competition problems (December 2025) via Aleph – their orchestration layer coordinating Kona with LLMs and Lean 4 for formal verification.

96.2% vs 2% on Sudoku isn't an intelligence gap. LLMs fail Sudoku because sequential token generation can't do simultaneous constraint satisfaction across rows, columns, and boxes. You need to satisfy 27 interlocking constraints at once. Doing that one token at a time, left to right, is like solving a jigsaw puzzle by committing to each piece's position before seeing the next piece. The failure is architectural, not capability.
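The 27 interlocking constraints have a natural energy formulation, which makes the architectural point concrete: a sketch of the objective an energy-based solver descends, not of Kona's method.

```python
def sudoku_energy(grid):
    """Energy = number of violated all-different constraints among the
    27 units (9 rows, 9 columns, 9 boxes) of a 9x9 grid. A solved grid
    has energy 0; an energy-based solver evaluates all 27 constraints
    simultaneously and moves downhill, rather than committing cell by
    cell in sequence."""
    units = []
    units += [[(r, c) for c in range(9)] for r in range(9)]           # rows
    units += [[(r, c) for r in range(9)] for c in range(9)]           # columns
    units += [[(br + r, bc + c) for r in range(3) for c in range(3)]
              for br in range(0, 9, 3) for bc in range(0, 9, 3)]      # boxes
    assert len(units) == 27
    return sum(1 for unit in units
               if len({grid[r][c] for r, c in unit}) < 9)

# A valid grid via the standard shift construction (energy 0):
solved = [[(r * 3 + r // 3 + c) % 9 + 1 for c in range(9)]
          for r in range(9)]
```

Note how a single wrong cell raises the energy of its row, its column, and its box at once – three coupled constraints that a left-to-right generator only discovers after it has already committed.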
Their framing resonates: "It does not predict likely outcomes. It enforces constraints. It replaces trust with proof." Read that again. This is the same thesis as Ananke and CLaSH – generation as constraint satisfaction rather than pattern completion – arrived at independently from physics and formal verification rather than programming language theory. The convergence is not coordinated. That's what makes it interesting.
Their target markets tell you something about where energy-based constraint satisfaction matters most: energy grid optimization, semiconductor design, manufacturing process control. Domains where constraint satisfaction isn't a quality-of-life improvement – it's a safety requirement. Where "probably right" isn't good enough and "provably satisfies constraints" is the minimum bar.
I have no affiliation with Logical Intelligence. I'm watching them because they're building evidence for a thesis I arrived at from a different direction. LeCun has argued for years that autoregressive generation is fundamentally limited for reasoning. His position on Logical Intelligence's advisory board isn't coincidence. The JEPA (Joint Embedding Predictive Architecture) program and Kona's approach are intellectually aligned – both frame intelligence as energy minimization in representation space, not next-token prediction.
The Convergence
Step back. Four threads are converging from different origins, different teams, different intellectual traditions:
Constrained decoding for AR models – Ananke, llguidance, Outlines. Token masks at each generation step. Works now, ships value now, limited by left-to-right commitment.
Discrete diffusion and flow-matching for code – DiffuCoder, Mercury, ASTRAL. Parallel refinement. Bidirectional conditioning. Trajectory shaping through energy guidance.
Energy-based reasoning in continuous space – Kona. Gradient-based optimization in latent space. All constraints simultaneously. Dense vectors, not discrete tokens.
Formal verification integration – Aleph, Lean-based proof search, AlphaProof. Ground-truth correctness. Deterministic verification of probabilistic generation.

Aleph is Logical Intelligence's orchestration layer – a hybrid architecture that routes different parts of a problem to the right tool. Kona for constraint reasoning in continuous space. LLMs for natural language understanding and problem decomposition. Lean 4 for formal verification. Each component to its strength, composed through an orchestration layer that manages the handoffs.
Different surfaces, same thesis: code generation is constrained search; the generation paradigm is the search strategy. The typed hole is the unifying abstraction across all of them. In AR + masking, each generation step fills one hole left-to-right. In diffusion, all positions start as holes and fill in parallel. In Kona's continuous space, the entire output is a single high-dimensional hole being iteratively refined. The paradigm determines the fill strategy; the constraint architecture determines what "valid fill" means. This was the unifying insight from shaped generation in the previous post. It's holding up across paradigms that didn't exist – or at least weren't commercially viable – when I wrote it.
| Paradigm | Search Strategy | Constraint Integration |
|---|---|---|
| AR + masking | Greedy/beam, per-step filtering | Token masks at each step |
| Diffusion/flow | Parallel refinement | Energy guidance on trajectory |
| EBM (continuous) | Gradient-based optimization | Energy function minimization |
| Hybrid | Composed | Composite energy landscape |
LeCun's JEPA provides a theoretical umbrella: energy-based world models predicting in representation space rather than pixel or token space. Both ASTRAL and Kona are instances of this broader program, whether or not they're explicitly designed that way. The generation paradigm varies; the principle – that valid outputs are energy minima in a constraint-shaped landscape – is invariant.
The architectural question that interests me most: will these converge into hybrid systems? SSMs for context processing (O(n), repo-scale). Flow-matching for generation (bidirectional, structure-aware). Energy-based guidance for constraints (continuous optimization). Formal verification for ground truth (deterministic correctness). Each component to its strength, composed through orchestration that routes subproblems to the right substrate.
That's the direction ASTRAL points, and it's the direction Aleph's architecture already demonstrates at a coarser granularity. Aleph uses Kona for constraint reasoning, LLMs for natural language, Lean 4 for proof. The components are different from what ASTRAL proposes, but the principle is the same: no single architecture handles all aspects of the problem well, and the composition strategy matters as much as the individual components.
What's Hard
Coherent designs are cheap and working systems are expensive.
Surrogate fidelity. If learned surrogates drift from real verifiers, guidance is noise. There are no formal guarantees that a surrogate trained on partially-generated code will remain calibrated as the generation distribution shifts. Periodic recalibration against ground truth is the best available approach, and "best available" is a long way from "solved."
Discrete versus continuous. Kona's continuous latent space is mathematically clean for gradient-based optimization – smooth landscapes, well-defined gradients, convergence guarantees from optimization theory. Code is discrete. The approximate gradients at the discrete-continuous interface are exactly that – approximate. Straight-through estimators and Gumbel-softmax reparameterization are hacks that work surprisingly well in practice, but the theory is thin.
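The straight-through trick is simple enough to show directly. A sketch without a real autograd framework, representing an op as a (value, gradient function) pair: the forward pass applies the hard discrete operation, while the backward pass pretends it was the identity.

```python
def ste_round(x):
    """Straight-through estimator for rounding: the forward value is
    discrete, but the gradient function passes the upstream gradient
    through unchanged -- as if d(round)/dx were 1 -- even though
    round() is flat almost everywhere and its true gradient is 0."""
    value = round(x)
    backward = lambda upstream: upstream   # identity surrogate gradient
    return value, backward

y, backward = ste_round(0.7)
# Forward: y is the hard value 1. Backward: an upstream gradient of
# 2.5 reaches x as 2.5, letting optimization proceed through the
# discrete op.
```

That mismatch between the forward function and the gradient actually used is the whole estimator – and the reason the theory is thin: the "gradient" being followed belongs to a different (continuous) function than the one being evaluated.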
Training scale. DiffuCoder needed 7B parameters and significant compute to match AR baselines. Adding AST-aware positional encoding, hierarchical corruption schedules, and energy-based guidance increases the training complexity further. The compute requirements are substantial and the training dynamics are less understood than for standard AR models.

"Energy-Based Models for Code Generation under Compilability Constraints" (Chen et al., 2021) explored this direction years ago. The ideas aren't new. What's new is that the component technologies – discrete flow matching, efficient SSMs, fast incremental parsers – have matured enough that integration might actually be practical. The question was never whether this is the right direction, but whether the infrastructure is ready.
Verification bottleneck. Ground-truth verifiers (type checkers, test suites, SMT solvers) are expensive. Running them at every denoising step is prohibitive. Running them too rarely lets surrogates drift. The right check frequency is task-dependent, and there's no general answer.
Integration complexity. SSM context encoder + flow-matching generator + learned surrogates + ground-truth verifiers + AST-aware positional encoding + parser-based regularizer = many moving parts. Each is individually tractable. Integration is where complexity compounds – where the interaction effects between components create behaviors that weren't in any individual component's specification. I want to be clear about the gap between "this design is coherent" and "this works in practice." ASTRAL is a research direction. The hard problems listed here are genuinely hard, and any one of them could be a showstopper. Design coherence is necessary but nowhere near sufficient.
Onward
The constraint-as-architecture thesis is gaining evidence from multiple independent directions. Ananke proves it works for autoregressive models today. Kona proves energy-based reasoning works for constraint satisfaction at scale. DiffuCoder and Mercury prove discrete diffusion works for code. These are different teams, different paradigms, different intellectual traditions arriving at the same structural insight.
Near-term, I'm continuing to develop Ananke because it ships value now. The CLaSH framework is paradigm-independent by design – the constraint algebra, the product domain, the cross-domain morphisms all port to flow-matching when the generation substrate is ready. The investment in constraint architecture is durable across paradigm shifts.
My thinking on ASTRAL continues to evolve. Watching what Logical Intelligence does with Kona is informing how I think about energy landscapes and constraint compilation. The proposal is a snapshot, not an endpoint.
The tooling gap remains the leverage point. The ideas in this post are tractable for researchers with cluster access and deep familiarity with the literature. Making these patterns accessible to working engineers who need to ship code that satisfies real constraints – that's where actual leverage is. Ananke is one attempt at bridging that gap for AR models today. The CLaSH framework is designed to bridge it for whatever generation paradigm comes next.
The generation paradigm matters less than getting the constraint architecture right. Get the constraints right, and the generation substrate becomes a swappable implementation detail. That's the bet. So far, the evidence from multiple independent directions is that it's a good one.
Links
- ASTRAL proposal
- ASTRAL reading list
- Logical Intelligence / Kona
- Kona EBM Reasoning Blog Post
- Kona Sudoku Benchmark
- DiffuCoder (GitHub)
- Mercury (arXiv)
- Ananke / ananke-sglang
- Previous Blog Posts: