CodeGen as Constrained Search

Why I built a custom constrained decoding engine for SGLang, and what it means for AI code generation.

Code generation is now fast and fluent. That's the easy part. The hard part is that it's also wrong in ways that take hours to discover.

Not syntactically wrong. Not even obviously wrong. Wrong in the ways that matter: it violates the invariants your system actually depends on, the patterns your team has learned the hard way, the constraints that exist nowhere in writing but everywhere in practice.

I spent a lot of time trying to solve this with better context engineering. Retrieval, memory, chunking strategies, prompt optimization. All of it helps at the margins. None of it solves the core problem.

The standard response is to throw more into the context window. But even models with 200k token capacity degrade as you approach that limit. More fundamentally, the constraints that matter are often implicit or diffuse: organizational standards nobody documented, architectural patterns encoded across dozens of files, hard-won lessons from production incidents that live in postmortems and PR comments.

You can't retrieve what was never written down.

A Different Frame

Back in 2023, I started exploring constrained generation as an alternative, then under the aegis of Cheshire AI. The idea: instead of hoping the model infers your constraints from context, enforce them at decode time.

My early experiments used llguidance, which does token-level constraint enforcement during generation. I started simple: language grammars for the specific versions I was targeting. Then I extended to key libraries and APIs, encoding tighter constraints about what valid code looked like in my particular projects.

llguidance is excellent infrastructure. The SGLang and Guidance teams have done real work here. But it's fundamentally designed for structural constraints, not the kind of dynamic, layered, semantically rich constraints I needed.
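The core mechanic of token-level enforcement is simple to sketch. The toy below is illustrative only, not the llguidance API: it assumes a predicate that says whether a partial output can still be extended into a valid program, and masks out any token that would leave the valid space before picking the highest-scoring survivor.

```python
# Minimal sketch of token-level constraint enforcement over a toy vocabulary.
# `is_viable_prefix` stands in for a real grammar engine; `score` stands in
# for the model's logits. All names here are hypothetical.

def allowed_mask(prefix, vocab, is_viable_prefix):
    """For each candidate token, check whether appending it keeps the
    output inside the constrained language."""
    return [is_viable_prefix(prefix + tok) for tok in vocab]

def constrained_decode(vocab, is_viable_prefix, score, max_len=16):
    """Greedy decoding: at each step, drop tokens the constraint forbids,
    then pick the highest-scoring token that survives the mask."""
    out = ""
    for _ in range(max_len):
        mask = allowed_mask(out, vocab, is_viable_prefix)
        candidates = [t for t, ok in zip(vocab, mask) if ok]
        if not candidates:
            break  # no token can extend the output; generation stops
        out += max(candidates, key=lambda t: score(out, t))
    return out
```

The important property: the model never even sees invalid continuations as options, so there is nothing to repair afterward.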

The results were promising. Code that respected project-specific patterns without stuffing everything into the prompt. But the constraint specification was manual and static. Every time I wanted to encode new nuance, I was hand-crafting grammars or constraint models.

That's when the shape of the real system started to become clear.

The Ananke Architecture

What I've built over the past several months is a system called Ananke, which goes much further than Cheshire ever did. The core reframe: treat code generation as constrained search through valid programs, not token prediction with post-hoc repair.

Ananke is named for the Greek goddess of necessity and inevitability. The constraints aren't suggestions; they're the shape of the solution space itself.

The system has three main components:

Clew extracts constraints from existing codebases. Not specifications of what code does, but constraints on what code is allowed to do. It mines ASTs, type systems, dependency graphs, runtime telemetry, and human artifacts like ADRs and incident postmortems. The goal is to make implicit team knowledge explicit and machine-actionable.

I'm also working on a related project, Topos, that approaches the implicit context problem from another direction: lifting semantic structure from codebases into navigable knowledge graphs.
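To make "mining ASTs for constraints" concrete, here is a toy extractor in the spirit of Clew, a sketch only: it walks a codebase's parse trees and records which modules the code already imports, yielding an allowed-imports constraint for generation. The real system mines far more than this.

```python
# Toy Clew-style extraction using Python's stdlib `ast` module: derive an
# "allowed imports" constraint from what the codebase already depends on.
import ast

def extract_allowed_imports(sources):
    """Given module source strings, return the set of top-level modules
    the codebase imports anywhere."""
    allowed = set()
    for src in sources:
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.Import):
                allowed.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                allowed.add(node.module.split(".")[0])
    return allowed
```

A generated `import` of anything outside this set can then be rejected at decode time rather than flagged in review.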

Braid compiles constraints just-in-time for each generation task. It takes baseline constraints from Clew and fuses them with immediate context: what file you're editing, what your git state looks like, what you explicitly asked for, who's asking. The same codebase gets different constraint programs for a production hotfix versus an exploratory prototype.
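The fusion step Braid performs can be sketched as a pure function from baseline constraints plus task context to a constraint program. Everything below is hypothetical naming, not Braid's actual interface; it just shows how the same baseline yields different programs for a hotfix versus a prototype.

```python
# Hypothetical sketch of Braid-style just-in-time constraint compilation:
# baseline constraints fused with per-task context into one program.
from dataclasses import dataclass

@dataclass(frozen=True)
class ConstraintProgram:
    allowed_imports: frozenset
    require_type_annotations: bool
    allow_todo_markers: bool

def compile_for_task(baseline_imports, task_kind):
    """Production hotfixes get the strictest program; prototypes relax."""
    strict = task_kind == "hotfix"
    return ConstraintProgram(
        allowed_imports=frozenset(baseline_imports),
        require_type_annotations=strict,
        allow_todo_markers=not strict,
    )
```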

Maze orchestrates the actual generation. It's not a model; it's an orchestration layer that directs models to search the constraint-bounded space of valid programs.

I still need to integrate Maze with the new SGLang backend. Stay tuned.

Building My Own Backend

The constraint expressiveness I needed wasn't available in existing tools. llguidance handles syntactic constraints beautifully, but I needed to coordinate constraints across five domains: syntax, types, imports, control flow, and semantics. I needed constraints that could adapt based on context. I needed typed holes that could represent structured incompleteness.

Typed holes come from the Hazel research program out of UMich. The key insight: incompleteness isn't failure, it's a precisely scoped question. A hole carries type information, constraints, and context. You can check properties of incomplete programs and fill holes progressively. It's a beautiful piece of PL theory that turns out to be exactly what you need for iterative AI-assisted development.
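A typed hole in this spirit can be sketched as a first-class value that carries its expected type and local context, and checks any fill attempt before accepting it. This is my own toy rendering of the idea, not Hazel's formalism:

```python
# Sketch of a typed hole: structured incompleteness that knows what it
# expects and rejects ill-typed fills. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Hole:
    expected_type: type
    context: dict = field(default_factory=dict)  # names in scope at the hole

    def fill(self, value):
        """Accept a candidate only if it matches the expected type."""
        if not isinstance(value, self.expected_type):
            raise TypeError(f"hole expects {self.expected_type.__name__}")
        return value
```

The program around the hole remains checkable the whole time; the hole is a question the generator answers incrementally.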

So I forked SGLang and built a custom constrained decoding engine.

The ananke-sglang fork supports multi-domain constraint coordination, with enforcement across Python, TypeScript, Go, Rust, Kotlin, Swift, and Zig. Constraints from different domains fuse into unified token masks. The search space narrows with each token.
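"Constraints from different domains fuse into unified token masks" has a simple core shape: a token survives only if every domain permits it. The sketch below uses plain boolean lists and stand-in domains, not the actual ananke-sglang internals.

```python
# Sketch of multi-domain mask fusion: element-wise AND across per-domain
# token masks of equal length. A token is allowed only if syntax, types,
# imports, control flow, and semantics all allow it.

def fuse_masks(*masks):
    """Combine domain masks; each position is True only if every domain
    marks that token as valid."""
    return [all(bits) for bits in zip(*masks)]
```

In the real system the masks are bitsets over the model vocabulary, which is where the per-token microsecond budget goes.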

I still leverage llguidance for some syntactic constraints, but the majority of the work has moved to my own backend. The core Ananke system, with Clew and Braid, is written in Zig for the performance-critical paths, with Rust handling the Maze orchestration layer.

What This Gets You

The system adds latency. We're talking tens to low hundreds of microseconds on average, plus the overhead of constraint compilation. You need access to the inference server, not just an API.

The tradeoff is worth it.

Code comes out respecting the actual constraints of your project without requiring iteration cycles or post-hoc testing to discover violations. I'd rather catch constraint violations at generation time than in code review or production. And I can encode a richer set of constraints than any testing regime can express.

Property-based testing is excellent. I'm a huge proponent. But it's fundamentally post-hoc. Constrained generation is preventive. They're complementary, not competing.

The current system extracts constraints from 9 languages, compiles to JSON Schema, regex, and token masks, and enforces at generation time with sub-50μs per-token overhead. It works. It'll get better.

The Broader Frame

I've started thinking about AI-assisted development in two frames: in-context and out-of-context.

In-context work is what happens inside the context window: retrieval, memory, prompt engineering, data encodings. Important work, mostly incremental.

Out-of-context work is where the novel leverage is. Encoding project knowledge, invariants, and intent into constraints that shape generation at decode time. The context window becomes a channel for immediate intent, not a cargo hold for everything the model might need to know.

This is where I'm spending most of my time now. The constraint extraction pipeline is still very much under development. The constraint DSL (Ariadne) is parsing but not yet type-checking. Multi-model orchestration and diffusion model integration are on the roadmap.

But the core thesis is holding up: if you can control the shape of the search space, you can make code generation both faster and more trustworthy. The model doesn't need to guess what valid means in your context. You tell it.
