Writ Architecture

1. The 5-Stage Retrieval Pipeline

retrieval

For every code-related prompt, Writ asks: "which rules matter for this exact task?" and returns the answer in under a millisecond by running five small stages in sequence.

Why this exists

A single retriever (keyword-only, or vector-only) misses rules a coding agent actually needs. Combining BM25 + ANN + graph closes those gaps.
The whole thing must finish in roughly the same time as one Python function call, so it can run on every prompt without slowing the agent down.

What each stage does

Stage 1: Domain filter. Drops rules outside the active domain (Python vs. Magento vs. infrastructure) before any ranking work. Source: writ/retrieval/pipeline.py:246-270.
Stage 2: BM25 keyword search. Tantivy index over rule trigger (2.0x boost), statement, tags, and body (0.5x dilution). Returns up to 50 candidates. Source: writ/retrieval/keyword.py:28-117.
Stage 3: ANN vector search. hnswlib over 384-dim embeddings from all-MiniLM-L6-v2 (ONNX preferred, SentenceTransformer fallback). HNSW params M=16, ef_construction=200, ef_search=50. Returns up to 10 candidates. Source: writ/retrieval/embeddings.py.
Stage 4: Graph traversal. A pre-computed adjacency cache lets us look up neighbors of any rule in constant time (no live Cypher per query). Two-hop expansion. Source: writ/retrieval/traversal.py:23-147.
Stage 5: Two-pass ranking with RRF. First pass picks the top three highest-authority rules. Second pass scores everything else by proximity to those three. Reciprocal Rank Fusion combines BM25 and vector signals before weights are applied. Source: writ/retrieval/ranking.py.
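
The Reciprocal Rank Fusion step in Stage 5 can be sketched as follows. This is a minimal illustration, not the code in writ/retrieval/ranking.py: the function name rrf_fuse, the placeholder rule IDs, and the constant k=60 (a value commonly used for RRF) are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of rule IDs with Reciprocal Rank Fusion.

    Each rule scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed constant, not Writ's.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, rule_id in enumerate(ranking, start=1):
            scores[rule_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by rule ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical candidate lists from Stage 2 (BM25) and Stage 3 (ANN).
bm25_hits = ["RULE-A", "RULE-B", "RULE-C"]
vector_hits = ["RULE-B", "RULE-D", "RULE-A"]
fused = rrf_fuse([bm25_hits, vector_hits])
fused_ids = [rule_id for rule_id, _ in fused]
```

RULE-B ends up first because it ranks near the top of both lists, which is the behavior RRF is chosen for: agreement between retrievers counts for more than a single high rank.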

One example

Prompt: "add a Magento product repository method that loads by SKU." Stage 1 keeps only Magento + arch rules. Stage 2's BM25 hits on "repository," "factory," "loadBy." Stage 3's vector hits on the rationale text. Stage 4 pulls in the related anti-pattern ($factory->create()->load()). Stage 5 ranks them and the agent gets a small bundle of relevant rules.

2. The Weighting System

retrieval

Five signals combine into the final rank. The weights are not guesses; they were tuned against a ground-truth corpus.

The exact weights

Vector (ANN): 0.594 (59.4%)
BM25 (keyword): 0.198 (19.8%)
Severity: 0.099 (9.9%)
Confidence: 0.099 (9.9%)
Graph proximity: 0.01 (1%)

What each weight means

Vector dominates because semantic matches recover rules whose triggers do not share vocabulary with the query.
BM25 anchors for exact-name lookups (rule IDs, identifiers).
Severity lifts critical rules above advisory ones when relevance scores are close. Scale: critical 1.0, high 0.75, medium 0.5, low 0.25.
Confidence reflects how trusted a rule is: battle-tested 1.0, production-validated 0.8, peer-reviewed 0.6, speculative 0.3.
Graph proximity is a tiebreaker: 1.0 for one-hop neighbors of the top-three seed rules, 0.5 for two-hop, 0.0 otherwise.
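
How the five signals could combine is sketched below. The weights and scales are the ones listed above. The linear combination, the assumption that vector and BM25 scores arrive normalized to the range 0 to 1, and the name final_score are illustrative guesses, not a description of ranking.py.

```python
WEIGHTS = {
    "vector": 0.594,
    "bm25": 0.198,
    "severity": 0.099,
    "confidence": 0.099,
    "graph": 0.01,
}
SEVERITY = {"critical": 1.0, "high": 0.75, "medium": 0.5, "low": 0.25}
CONFIDENCE = {
    "battle-tested": 1.0,
    "production-validated": 0.8,
    "peer-reviewed": 0.6,
    "speculative": 0.3,
}
# Hop distance from a top-three seed rule -> proximity signal.
GRAPH_PROXIMITY = {1: 1.0, 2: 0.5}


def final_score(vector, bm25, severity, confidence, hops=None, weights=WEIGHTS):
    """Weighted sum of the five signals (assumed linear combination).

    vector and bm25 are assumed to be normalized to [0, 1].
    hops is the distance to a seed rule, or None if unrelated.
    """
    return (
        weights["vector"] * vector
        + weights["bm25"] * bm25
        + weights["severity"] * SEVERITY[severity]
        + weights["confidence"] * CONFIDENCE[confidence]
        + weights["graph"] * GRAPH_PROXIMITY.get(hops, 0.0)
    )


best_case = final_score(1.0, 1.0, "critical", "battle-tested", hops=1)
critical = final_score(0.5, 0.5, "critical", "speculative")
advisory = final_score(0.5, 0.5, "low", "speculative")
```

Because the five weights sum to 1.0, a rule that maxes out every signal scores 1.0 under this sketch, and two rules with identical relevance are separated by severity alone.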

Literal-query override

When the agent asks an exact-phrase question (for example, looking up a specific rule by ID), BM25 and vector each get 0.396, splitting their combined 0.792 share evenly, because semantic matching adds noise to literal lookups.
Frequency graduation: once a rule has 50 or more observations with a 75%+ positive ratio, the empirical ratio replaces the static confidence weight. Source: writ/frequency.py:28-53.
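
The literal-query override can be sketched as a small weight swap. How Writ decides that a query is literal is not described here, so the is_literal flag is simply passed in by the caller; the function name weights_for_query is hypothetical.

```python
DEFAULT_WEIGHTS = {
    "vector": 0.594,
    "bm25": 0.198,
    "severity": 0.099,
    "confidence": 0.099,
    "graph": 0.01,
}


def weights_for_query(is_literal):
    """Return the ranking weights for a query.

    For literal (exact-phrase) queries, BM25 and vector each get 0.396.
    Detecting a literal query is out of scope for this sketch.
    """
    weights = dict(DEFAULT_WEIGHTS)  # copy so the defaults stay untouched
    if is_literal:
        weights["vector"] = 0.396
        weights["bm25"] = 0.396
    return weights


literal = weights_for_query(is_literal=True)
semantic = weights_for_query(is_literal=False)
```

The override moves weight between the two retrieval signals only, so the total stays at 1.0 and the severity, confidence, and graph terms keep their relative influence.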

Source

writ/retrieval/ranking.py:22-26 for the defaults.

3. Mandatory vs. Retrieved: the Architectural Invariant

invariant

A small set of rules is injected on every prompt, no questions asked. The vast majority of rules are retrieved on demand. This split is the most important design decision in Writ.

Mandatory rules (always on)

Forbidden phrases, communication rules, and structural enforcement rules.
Capped at ~5K tokens per session to keep the budget predictable.
Never retrieved by BM25 or ANN. They are excluded at index-build time and at query time (writ/retrieval/pipeline.py:510).
Loaded via the separate GET /always-on endpoint.

Retrieved rules (RAG-driven)

Every domain rule, skill, playbook, anti-pattern, and pressure scenario.
Pulled into context only when relevant to the current prompt.
Subject to the five-stage pipeline.
Deduplicated across turns so the same rule is not re-injected.
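
Cross-turn deduplication can be sketched with a per-session set of already-injected rule IDs. The class name SessionInjectionLog and the rule IDs are hypothetical; where Writ actually keeps this state is not described in this document.

```python
class SessionInjectionLog:
    """Tracks which retrieved rules have already been injected this session."""

    def __init__(self):
        self._seen = set()

    def filter_new(self, rule_ids):
        """Return only the rules not yet injected, preserving rank order."""
        fresh = []
        for rule_id in rule_ids:
            if rule_id not in self._seen:
                self._seen.add(rule_id)
                fresh.append(rule_id)
        return fresh


log = SessionInjectionLog()
turn_one = log.filter_new(["RULE-A", "RULE-B"])
turn_two = log.filter_new(["RULE-B", "RULE-C"])  # RULE-B was already injected
```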

Why the split matters

The mandatory band is a permanent floor: rules so important they must apply to every turn, even if the agent does not ask for them.
The retrieved band is variable: it lets us add 1000+ rules to the corpus without paying for any of them on prompts that do not need them.
Together they let Writ scale without scaling the context budget.

4. The Knowledge Graph

corpus

Rules are not a flat list. They live in a Neo4j graph with twelve node types and more than ten edge types, so a rule can teach a skill, counter an anti-pattern, or supersede an older rule, and the retrieval pipeline can follow those links.

Twelve node types

Rule: the primary enforceable rule.
Abstraction: a compressed summary of a rule bundle.
Skill: a teachable technique (retrievable).
Playbook: a multi-phase workflow (retrievable).
Technique: an implementation detail (retrievable).
AntiPattern: a code smell to flag (retrievable).
ForbiddenResponse: a prohibited AI output, always on.
Phase: a step inside a playbook.
Rationalization: a counter to a known excuse.
PressureScenario: an adversarial test case.
WorkedExample: a before/after illustration.
SubagentRole: a role definition for an agentic worker.

Edge types

Structural: DEPENDS_ON, PRECEDES, CONFLICTS_WITH, SUPPLEMENTS, SUPERSEDES, RELATED_TO, APPLIES_TO.
Methodology: TEACHES, COUNTERS, DEMONSTRATES, DISPATCHES, GATES, PRESSURE_TESTS, CONTAINS, ATTACHED_TO.

How the graph helps retrieval

Hit one rule, get the bundle: the adjacency cache lets the pipeline expand a single match into its directly related rules without a Cypher round-trip.
Conflicts and supersessions are first-class: the structural gate (see card 7) refuses to ingest a new rule that conflicts with an existing one.
Source: writ/graph/schema.py.
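
The adjacency cache and two-hop expansion can be sketched in memory as follows. This is an illustration under stated assumptions: edges are treated as undirected, edge types are ignored, and the names build_adjacency and expand are hypothetical rather than taken from traversal.py.

```python
def build_adjacency(edges):
    """Pre-compute neighbor sets so lookups need no graph query.

    edges is an iterable of (source_id, target_id) pairs. Treating
    them as undirected is an assumption of this sketch.
    """
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def expand(adjacency, seed, max_hops=2):
    """Return {rule_id: hop_distance} for rules within max_hops of seed."""
    distances = {}
    visited = {seed}
    frontier = {seed}
    for hop in range(1, max_hops + 1):
        next_frontier = set()
        for node in frontier:
            for neighbor in adjacency.get(node, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    distances[neighbor] = hop
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return distances


# Hypothetical chain: R1 - R2 - R3 - R4
cache = build_adjacency([("R1", "R2"), ("R2", "R3"), ("R3", "R4")])
neighbors = expand(cache, "R1")
```

The hop distances returned here are the kind of value the graph-proximity signal needs: one-hop and two-hop neighbors are distinguished, and anything further away is simply absent.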

5. The Mode System

enforcement

Every session declares a mode up front. The mode decides how much ceremony Writ imposes: from "just answer" to "no code without an approved plan and tests."

Four modes

Conversation: discussion, brainstorming, questions. No phase gates. No write blocking.
Debug: investigating a specific problem. Read-heavy, no code generation expected.
Review: evaluating existing code against rules. Structured findings, no edits.
Work: building or modifying code. Full workflow with phase gates (see card 6).

How a mode is set

The agent runs writ-session.py mode set <mode> <session_id> at the start of the session.
If no mode is set, the gate hooks block all writes except plan.md.
The --orchestrator flag tells Writ to suppress broad RAG injection in the master session because sub-agent workers will do their own retrieval.
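
The no-mode rule can be sketched as a guard that runs before any mode-specific check. The Mode enum values and the function name no_mode_guard are assumptions for illustration; the real hooks are bash scripts and may match paths differently.

```python
from enum import Enum
from pathlib import PurePosixPath


class Mode(Enum):
    # The lowercase string values are assumed, not confirmed CLI arguments.
    CONVERSATION = "conversation"
    DEBUG = "debug"
    REVIEW = "review"
    WORK = "work"


def no_mode_guard(mode, path):
    """Return a denial reason if a write is blocked because no mode is set.

    Returns None when this guard has no objection; mode-specific checks
    (not modeled here) would then decide.
    """
    if mode is not None:
        return None
    if PurePosixPath(path).name == "plan.md":
        return None  # plan.md stays writable before a mode is declared
    return "no session mode set: only plan.md may be written"


blocked = no_mode_guard(None, "src/app.py")
plan_ok = no_mode_guard(None, "plan.md")
work_ok = no_mode_guard(Mode.WORK, "src/app.py")
```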

Why modes exist

Most prompts are not "write me production code." Conversation and debug modes give the agent permission to be useful without paying the cost of the full workflow.
Work mode is the strict mode: it is opt-in, and once entered it cannot be silently bypassed.
Source: bin/lib/writ-session.py:835-985.

6. The Phase Gates (Work Mode)

enforcement

Inside Work mode the agent moves through four gated phases. Each gate is a file the user has to approve before the agent is allowed to write the next thing.

The four phases

Planning: the agent writes plan.md and capabilities.md to the project root. Required sections: Files, Analysis, Rules Applied, Capabilities.
Testing: the agent writes test skeleton files with method signatures and assertions. No implementation code is allowed yet.
Implementation: the agent writes the production code that makes the tests pass. Tests are run automatically after each write.
Verification: the agent verifies the result and updates capabilities.md with checked-off items.

How approval works

The user types "approved" after seeing the plan or the test skeletons.
That triggers the /writ-approve command, which calls POST /session/{id}/advance-phase with confirmation_source: "tool".
The audit trail records who approved what, when, and via which path (tool vs. pattern match).
Pattern matching on "approved" is defense in depth only. The primary path is the tool call.
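
The phase sequence and the approval record can be sketched as a small state machine. The phase names and the "tool" confirmation source come from the text above; the class WorkSession, the "pattern" source value, and the shape of the audit entries are hypothetical.

```python
PHASES = ["planning", "testing", "implementation", "verification"]
# "tool" is named in this document; "pattern" is an assumed label
# for the pattern-match fallback path.
CONFIRMATION_SOURCES = {"tool", "pattern"}


class WorkSession:
    """Tracks the current Work-mode phase and an audit trail of approvals."""

    def __init__(self):
        self.phase = PHASES[0]
        self.audit = []

    def advance(self, confirmation_source):
        """Move to the next phase and record how the approval arrived."""
        if confirmation_source not in CONFIRMATION_SOURCES:
            raise ValueError(f"unknown confirmation source: {confirmation_source}")
        index = PHASES.index(self.phase)
        if index == len(PHASES) - 1:
            raise ValueError("already in the final phase")
        previous, self.phase = self.phase, PHASES[index + 1]
        self.audit.append((previous, self.phase, confirmation_source))
        return self.phase


session = WorkSession()
session.advance("tool")     # plan approved via the tool call
session.advance("pattern")  # test skeletons approved via the fallback path
```

A real audit entry would also carry who approved and when; those fields are left out here to keep the sketch deterministic.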

Why this exists

Over-eagerness to write code is the most common failure mode in agentic coding. Phase gates make the agent pause and show its work.
Each gate is a small contract the user can read in 30 seconds.

7. The Enforcement Layer (Hooks)

enforcement

Writ ships about thirty bash hooks that fire at well-defined moments in the agent's lifecycle. They inject rules, validate writes, and block actions that would violate the workflow.

When hooks fire

SessionStart: load mandatory rules, register the session.
UserPromptSubmit: inject the always-active rules block and run the RAG retrieval for the prompt.
PreToolUse (Write, Edit, Bash, Read, ExitPlanMode, Task, TodoWrite): validate that the action is allowed in the current phase, and that the file passes structural checks.
PostToolUse: validate the written file against rules, mark pending tests, and inject any rules surfaced by the write itself.
Stop / SessionEnd: run pending tests, log friction events, audit the session.
PreCompact / PostCompact: snapshot and restore state across context compactions.

The gate system

Each phase has a gate file: plan.md for planning, the test skeletons for testing, the implementation files for implementation.
A pre-write hook checks the gate state before allowing the write. If the current phase forbids the action, the hook denies with an explicit reason.
Sub-agent workers are exempt from the gate: they run in their own session with is_subagent: true, so they can do their assigned task without the orchestrator's phase pressure.
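
A pre-write gate check can be sketched as a pure decision function. The real hooks are bash scripts; this Python version only illustrates the logic. The file classification and the phase-to-file-kind table are assumptions made for the example, and only the sub-agent exemption and the deny-with-reason behavior come from the text above.

```python
from pathlib import PurePosixPath

# Assumed mapping from phase to the kinds of file it may write.
ALLOWED_KINDS = {
    "planning": {"plan"},
    "testing": {"plan", "test"},
    "implementation": {"plan", "test", "code"},
    "verification": {"plan", "test", "code"},
}


def classify(path):
    """Crude, hypothetical classification of a path into plan/test/code."""
    p = PurePosixPath(path)
    if p.name in {"plan.md", "capabilities.md"}:
        return "plan"
    if p.name.startswith("test_") or "tests" in p.parts:
        return "test"
    return "code"


def pre_write_check(phase, path, is_subagent=False):
    """Return (allowed, reason) for a proposed write."""
    if is_subagent:
        return True, "sub-agent sessions are exempt from phase gates"
    kind = classify(path)
    if kind in ALLOWED_KINDS[phase]:
        return True, f"{kind} files are writable during {phase}"
    return False, f"{kind} files may not be written during {phase}"


denied = pre_write_check("planning", "src/repository.py")
exempt = pre_write_check("planning", "src/repository.py", is_subagent=True)
skeleton = pre_write_check("testing", "tests/test_repository.py")
```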

How risky rule proposals are filtered

The "structural gate" runs five checks before any AI-proposed rule enters the graph: schema validation, mechanical enforcement path (for mandatory rules), specificity (rejecting vague language like "consider" or "try to"), redundancy (cosine similarity against existing rules), and conflict (against CONFLICTS_WITH edges).
Source: writ/gate.py:54-221.
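
Two of the five checks, specificity and redundancy, can be sketched as follows. The vague phrases are the two examples named above. The 0.95 similarity threshold, the plain substring match, and the function names are assumptions for illustration, not values taken from writ/gate.py.

```python
import math

VAGUE_PHRASES = ("consider", "try to")  # the two examples named above


def is_specific(statement):
    """Reject statements containing vague language (crude substring match)."""
    lowered = statement.lower()
    return not any(phrase in lowered for phrase in VAGUE_PHRASES)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_redundant(candidate, existing, threshold=0.95):
    """True if the candidate embedding is too close to any existing one.

    The threshold is an assumed value for this sketch.
    """
    return any(cosine_similarity(candidate, e) >= threshold for e in existing)


vague = is_specific("Consider using a repository here")
firm = is_specific("Load products through the repository interface")
duplicate = is_redundant([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0]])
novel = is_redundant([1.0, 0.0], [[0.0, 1.0]])
```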

8. The Evolution Model

corpus

The corpus is not static. AI agents propose new rules, structural gates filter the noise, and frequency tracking promotes rules that actually pay off in practice.

The authority ladder

human: the default for hand-authored rules.
ai-provisional: the only authority an AI proposal can claim, forced by propose_rule. Bound to speculative confidence (weight 0.3).
ai-promoted: set by writ review --promote after a human reviewer accepts an AI-provisional rule. Confidence bumps to peer-reviewed (weight 0.6).

The confidence ladder

speculative (0.3): freshly proposed by AI.
peer-reviewed (0.6): a human approved promotion.
production-validated (0.8): static default for established rules.
battle-tested (1.0): granted only by empirical frequency.

Frequency graduation

Every time a rule fires, the system records whether it stuck (the agent followed it) or was rationalized away.
Once a rule reaches 50 observations with a 75%+ positive ratio, the empirical ratio overrides the static confidence weight in ranking.
Source: writ/frequency.py.
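
Frequency graduation can be sketched as a function that chooses between the static confidence weight and the empirical ratio. The thresholds (50 observations, 75% positive) and the static weights are from this document; the function name effective_confidence and its signature are hypothetical.

```python
STATIC_CONFIDENCE = {
    "speculative": 0.3,
    "peer-reviewed": 0.6,
    "production-validated": 0.8,
    "battle-tested": 1.0,
}
MIN_OBSERVATIONS = 50
MIN_POSITIVE_RATIO = 0.75


def effective_confidence(level, positive, total):
    """Return the confidence weight to use in ranking.

    positive counts observations where the rule stuck; total counts all
    observations. Below either threshold the static weight applies.
    """
    if total >= MIN_OBSERVATIONS:
        ratio = positive / total
        if ratio >= MIN_POSITIVE_RATIO:
            return ratio  # empirical ratio overrides the static weight
    return STATIC_CONFIDENCE[level]


graduated = effective_confidence("speculative", positive=45, total=50)
too_few = effective_confidence("speculative", positive=45, total=49)
too_weak = effective_confidence("peer-reviewed", positive=30, total=60)
```

In this sketch a speculative rule that was followed 45 times out of 50 is ranked with a confidence of 0.9 instead of 0.3, while a rule with too few observations or too low a ratio keeps its static weight.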

The structural gate (one more time)

No AI proposal reaches the graph without passing all five gate checks. This is how Writ keeps an AI from polluting its own corpus.