How I Broke Down the Process of Writing a Book with LLMs

It's 2am and I'm staring at a wall of text that reads like cardboard. Three hundred pages of technically coherent prose, and every paragraph has the same hollow aftertaste. "It's important to note." "Let's delve into." "This approach offers several advantages." The words are all correct, the sentences parse, but reading it feels like shaking hands with someone who's smiling without their eyes. A verbal uncanny valley: close enough to human to feel wrong.

The project was ambitious: Elements of Agentic System Design, a 300-page technical book about building AI applications. Not a blog post, not a tutorial series. A book. And I was writing it with AI, because the recursive irony felt appropriate: dogfooding the methodology while documenting it, using agents to write about agents.

What I discovered had nothing to do with better prompts. The solution came from an unexpected domain: compiler design. Generation is compilation. The quality of your output is limited by the quality of your intermediate representation. Here's how I got there.


Prompt Harder

First attempt: a "clarity bear" interview prompt, a structured conversation that extracted ideas through targeted questions. Hours of back-and-forth about agents, state management, orchestration patterns, tool design. The transcript accumulated into fifty pages of raw thought, organized by topic, timestamped and indexed. I dropped the whole thing into Claude's context window and asked it to generate the book.

The result wasn't "needs editing." It was fundamentally wrong. Entire chapters materialized about topics barely mentioned in the interview, while core insights I'd spent an hour articulating got a single paragraph buried in section four. The context window contained all the words but none of the intent, like a student who memorized the textbook but can't solve the exam. I threw it away and started over. The problem wasn't the prompt. The problem was the architecture.


80%

Second attempt: procedural generation. Outline first, then chapter by chapter, section by section. Each chapter got a detailed spec before generation: topics to cover, sequence to follow, examples to include. Generate a chapter, review it, move to the next. The assembly line approach.

This got to roughly 80%. Prose was coherent. Ideas flowed in approximately the right sequence. I could see the shape of a book emerging from the noise.

But 80% isn't publishable.

Two problems surfaced. First: AIisms everywhere. "It's important to note" in every third paragraph, "Let's delve into" opening each section, the unmistakable AI aftertaste baked into the texture of every page. Second: hallucinated emphasis. Not factual hallucinations, but something subtler. The model wasn't making things up; it was inventing nuance, flattening distinctions I cared about while emphasizing points I'd barely mentioned. The outline said "explain context construction," and the output explained something adjacent with confident authority.

The outline captured hierarchy and topics, but it didn't encode the actual ideas exhaustively. The specific framings, the precise distinctions, the rhetorical shape I was reaching for: none of that was written down anywhere. It existed only in my head, and the model couldn't read my head. I was decomposing the problem, but into what?


Detection, Not Prevention

I attacked the AIisms first, writing style guides in the system prompt: "Never use 'delve.' Avoid 'it's worth noting.' Don't start paragraphs with 'Additionally.'" I added examples of good prose and bad prose, specified voice with increasing precision, enumerated banned phrases, provided before-and-after rewrites.

None of it stuck. The model would comply for a few paragraphs, then drift back to defaults, like a student who nods along in class and immediately reverts to old habits on the exam. The style guide was in the context window; it just wasn't in the output.

The insight came sideways: stop trying to prevent the problem and instead detect and fix it after. Separate the loop. Let the model generate freely, content first, style be damned, then run a second pass to identify violations and rewrite offending sentences. Two distinct operations instead of one conflated mess.

But detection had to be deterministic. "Sounds too AI-like" is a feeling, not a rule, and I needed something mechanical. So I went through my own edits, every angry margin note: "NO." "This is generic." "Sounds like a robot." "Nobody talks like this." I collected them, and patterns emerged: specific phrases, structural tics, telltale rhythms. I had an AI analyze my complaints, extract the rules, and build a linter.

The linter didn't need to understand good writing; it just needed to catch bad writing reliably. Generate, lint, rewrite flagged passages, repeat until clean. Separate content generation from style verification. Make verification deterministic. The style problem was solved.
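A linter like this can be a few dozen lines. Here is a minimal sketch; the rule list is illustrative (the real one would be distilled from your own margin notes), and `lint` simply reports every banned-phrase hit per line:

```python
import re

# Illustrative rules; a real list would come from your collected complaints.
BANNED_PATTERNS = [
    r"\bdelve\b",
    r"\bit'?s (important|worth) (to note|noting)\b",
    r"^\s*Additionally,",
]

def lint(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_phrase) for every violation."""
    violations = []
    for i, line in enumerate(text.splitlines(), start=1):
        for pattern in BANNED_PATTERNS:
            m = re.search(pattern, line, flags=re.IGNORECASE)
            if m:
                violations.append((i, m.group(0)))
    return violations

draft = "It's important to note that agents need state.\nAgents hold state between calls."
print(lint(draft))  # → [(1, "It's important to note")]
```

Because the rules are plain regexes, the check is deterministic: the same draft always produces the same violations, which is exactly what the rewrite loop needs.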

[Diagram: full rewrite flow]

Prose Is Compiled Output

With style solved, I expected the rest to fall into place. It didn't. Editing was still exhausting. I'd generate a chapter, run the linter, fix flagged patterns, read through the result, and something would still be off. Not style, but substance. This explanation doesn't land. That example feels disconnected from the point. Rewrite the paragraph, regenerate, lint again, read again. Progress measured in inches.

The frustration had a familiar shape. It reminded me of debugging assembly code years ago, tracing registers and memory addresses to find a logic error that would've been obvious in the source. I was working at the wrong layer. The realization hit: prose is compiled output. The sentences, the word choices, the rhythm: these are microstates, countless variations that can express the same underlying idea.

When I edited a sentence, I was fighting at the microstate level. One paragraph can be phrased a hundred different ways; fix one phrasing, the model generates a different one next time, and the cycle repeats forever. But here's the insight: a paragraph is a macrostate. It has a rhetorical function: to claim something, explain a mechanism, provide an example. The specific sentences are just one instantiation of that function. If the macrostate is right, microstates can vary freely. If the macrostate is wrong, no amount of sentence editing will fix it.

I was debugging assembly when I should have been fixing the source.

Writing in the AI age has to change, but not how people think. The goal isn't to stop writing; you still have to articulate your ideas, shape your argument, fight for your voice. What AI accelerates is the expansion from structure to prose, the compilation step. But that only works if you're operating at the right layer. The right layer isn't sentences. It's moves.


What Outlines Throw Away

Reading through chapters, I couldn't tell if my ideas were complete. The outline said "Section 3.2: Context Construction." The output had paragraphs about context construction. But did it cover what I actually meant? Did it make the right distinctions? Did it land the insight I was reaching for? I couldn't tell without reading every word. At 300 pages, careful reading is its own full-time job.

The outline was supposed to be my map, but staring at it, I realized it doesn't tell me what each paragraph does. "Introduction to X": is that a definition, a claim, or a motivation? "Explain Y": mechanism or example? The outline encoded topics, but not moves.

Traditional outlines are lossy compression. They preserve hierarchy and sequence but throw away rhetorical structure, like compiling to an IR that discards type information. You can't typecheck downstream if the types were never preserved, and rhetorical structure is exactly the signal that matters.


Paragraphs Have Types

I stepped back and asked a different question: how do you notate idea flow? Not topics, not structure, but the actual movement of a reader's understanding from paragraph to paragraph. What does each chunk do to the reader's mental state?

Down a research rabbit hole, I found Swales (1990), a linguist who studied how academic papers work. His insight: texts are composed of "rhetorical moves." A move isn't what a paragraph is about. It's what a paragraph does.

The core vocabulary:

  • CLAIM: assert a position
  • MECHANISM: explain why/how something is true
  • EXAMPLE: make abstract concrete
  • CONTRAST: distinguish from alternatives
  • QUESTION: open a tension
  • ANSWER: resolve it

Every paragraph has a type. The type constrains what it can contain.

The outline transformed. Instead of topic labels, move annotations:

Context Construction:
  p1:
    CLAIM: Context is the only input to the model
    KEY: "Everything must fit inside that call"
  p2:
    MECHANISM: Why this constraint exists
    CONCRETE_DETAIL: The model has no persistent memory
  p3:
    CONSEQUENCE: What this means for system design
    PRACTICAL: You must reconstruct state every time

Looking at a move-annotated outline, I could see structure at a glance. QUESTION without ANSWER? Structural bug. CLAIM without MECHANISM? Missing explanation. MECHANISM without EXAMPLE? Too abstract. I could verify content completeness at the outline level because each move type has a semantic contract: MECHANISM must explain why, EXAMPLE must be concrete. Violations are visible in the notation itself, before any prose exists. The IR must preserve the signal you care about verifying.
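Those structural checks are mechanical, so they can be automated. Below is a sketch of an outline-level verifier; the pairing rules are my own illustrative choices, not a fixed standard:

```python
# Which moves demand a later follow-up move in the same section.
REQUIRES = {
    "QUESTION": "ANSWER",     # an opened tension must be resolved
    "CLAIM": "MECHANISM",     # an assertion needs an explanation
    "MECHANISM": "EXAMPLE",   # an explanation needs something concrete
}

def check_section(moves: list[str]) -> list[str]:
    """Flag structural bugs: a move whose required follow-up never appears."""
    bugs = []
    for i, move in enumerate(moves):
        needed = REQUIRES.get(move)
        if needed and needed not in moves[i + 1:]:
            bugs.append(f"{move} at position {i} has no later {needed}")
    return bugs

print(check_section(["CLAIM", "MECHANISM", "QUESTION"]))
# → ['MECHANISM at position 1 has no later EXAMPLE',
#    'QUESTION at position 2 has no later ANSWER']
```

The point is that these bugs surface from the notation alone, before a single sentence of prose has been generated.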

[Interactive demo: move analyzer]

The Pattern

Months in, the pattern crystallized. I'd been treating generation as monolithic: prompt in, prose out. But it's actually compilation. Source material (ideas, research, intent) transforms through intermediate representations into final output (prose), and the quality of each stage depends on what the previous stage preserved.

Why different approaches fail or succeed:

One-shot generation fails because there's no intermediate state to verify. Ideas in, prose out, and if the result is wrong, you can't isolate where it went wrong. No breakpoints, no stack traces.

Multi-step with fuzzy IRs (summaries, traditional outlines) fails subtly. You have stages, but the representations don't preserve what matters. The outline looks like structure, but it's lossy, and critical information evaporates at the boundary.

Multi-step with typed IRs works. Move notation preserves rhetorical structure, the style linter preserves voice constraints, and each pass has a verifiable invariant. You can inspect each IR and know whether it's correct before compilation continues.

Multi-step AI generation works when each step produces an intermediate representation with verifiable invariants.

The general pattern for any AI generation task:

  1. Identify what signal you need to preserve
  2. Design an IR that encodes that signal explicitly
  3. Make verification deterministic where possible
  4. Separate concerns into dedicated passes

This is why "better prompts" is the wrong frame. It's not about prompting. It's about architecture.
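The architecture amounts to a pipeline of passes, each pairing a transform with a deterministic verifier of its output. A minimal sketch, with toy stand-ins for the transforms (in practice each would call a model):

```python
from typing import Callable

# Each pass: (name, transform, verifier). All names here are illustrative.
Pass = tuple[str, Callable[[object], object], Callable[[object], bool]]

def run_pipeline(source: object, passes: list[Pass]) -> object:
    """Run each transform, halting the moment an IR fails its invariant."""
    state = source
    for name, transform, verify in passes:
        state = transform(state)
        if not verify(state):
            raise ValueError(f"pass '{name}' produced an invalid IR")
    return state

# Toy pipeline: ideas -> move outline -> prose.
passes = [
    ("outline", lambda ideas: [("CLAIM", i) for i in ideas],
                lambda ir: all(m in {"CLAIM", "MECHANISM"} for m, _ in ir)),
    ("prose",   lambda ir: " ".join(text for _, text in ir),
                lambda prose: "delve" not in prose),
]
print(run_pipeline(["context is the only input"], passes))
```

The structure guarantees the property the essay argues for: a bad intermediate representation stops compilation at its own stage, instead of surfacing 300 pages later as prose that feels vaguely off.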


If you're writing with AI and fighting the same battles: start with verification, not generation. Build a style linter from your own complaints. Annotate one existing piece with move notation to see whether structure becomes visible. Generate per move instead of per section. The generation will follow from the verification.

The book isn't done, but the system works. Move notation doesn't eliminate editing; it localizes errors. The style linter doesn't prevent AIisms; it catches them mechanically. Each tool solves one problem cleanly instead of all problems poorly.

The goal was never "AI writes for me." It's "AI generates, I verify at the right abstraction level."

You're not writing. You're compiling.