1. The 5-Stage Retrieval Pipeline
retrieval
For every code-related prompt, Writ asks: "which rules matter for this exact task?"
and returns the answer in under a millisecond by running five small stages in sequence.
Why this exists
- A single retriever (keyword-only, or vector-only) misses rules a coding agent
actually needs. Combining BM25 + ANN + graph closes those gaps.
- The whole thing must finish in roughly the same time as one Python function call,
so it can run on every prompt without slowing the agent down.
What each stage does
- Stage 1: Domain filter. Drops rules outside the active domain
(Python vs. Magento vs. infrastructure) before any ranking work.
Source: writ/retrieval/pipeline.py:246-270.
- Stage 2: BM25 keyword search. Tantivy index over rule trigger (2.0x boost),
statement, tags, and body (0.5x dilution). Returns up to 50 candidates.
Source: writ/retrieval/keyword.py:28-117.
- Stage 3: ANN vector search. hnswlib over 384-dim embeddings from
all-MiniLM-L6-v2 (ONNX preferred, SentenceTransformer fallback).
HNSW params M=16, ef_construction=200, ef_search=50.
Returns up to 10 candidates. Source: writ/retrieval/embeddings.py.
- Stage 4: Graph traversal. A pre-computed adjacency cache lets us look up
neighbors of any rule in constant time (no live Cypher per query).
Two-hop expansion. Source: writ/retrieval/traversal.py:23-147.
- Stage 5: Two-pass ranking with RRF. First pass picks the top three
highest-authority rules. Second pass scores everything else by proximity
to those three. Reciprocal Rank Fusion combines BM25 and vector signals before
weights are applied. Source: writ/retrieval/ranking.py.
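The fusion step in Stage 5 can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is the textbook formula, not code taken from writ/retrieval/ranking.py: the function name, the rule IDs, and the constant k=60 (the conventional default) are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse best-first lists of rule IDs with Reciprocal Rank Fusion.

    Each rule scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is the conventional constant, assumed here; Writ's value is not
    stated in this document.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, rule_id in enumerate(ranking, start=1):
            scores[rule_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to rule ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rule IDs standing in for Stage 2 and Stage 3 output.
bm25_hits = ["ARCH-REPO-001", "ARCH-FACTORY-002", "PY-IMPORT-003"]
ann_hits = ["ARCH-FACTORY-002", "ARCH-DI-004", "ARCH-REPO-001"]

fused = rrf_fuse([bm25_hits, ann_hits])
for rule_id, score in fused:
    print(f"{rule_id}: {score:.5f}")
```

A rule that ranks well in both lists rises above one that tops only a single list, which is the property that lets keyword and vector results be merged without comparing their raw scores.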
One example
- Prompt: "add a Magento product repository method that loads by SKU."
Stage 1 keeps only Magento + arch rules. Stage 2's BM25 hits on "repository,"
"factory," "loadBy." Stage 3's vector hits on the rationale text. Stage 4 pulls
in the related anti-pattern ($factory->create()->load()).
Stage 5 ranks them and the agent gets a small bundle of relevant rules.
2. The Weighting System
retrieval
Five signals combine into the final rank. The weights are not guesses; they were
tuned against a ground-truth corpus.
The exact weights
Vector (ANN): 0.594
BM25 (keyword): 0.198
Severity: 0.099
Confidence: 0.099
Graph proximity: 0.01
What each weight means
- Vector dominates because semantic matches recover rules whose
triggers do not share vocabulary with the query.
- BM25 anchors for exact-name lookups (rule IDs, identifiers).
- Severity bumps critical rules above advisory ones when scores tie.
Scale: critical 1.0, high 0.75, medium 0.5, low 0.25.
- Confidence reflects how trusted a rule is: battle-tested 1.0,
production-validated 0.8, peer-reviewed 0.6, speculative 0.3.
- Graph proximity is a tiebreaker: 1.0 for one-hop neighbors of
the top-three seed rules, 0.5 for two-hop, 0.0 otherwise.
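As a rough sketch of how the five signals might combine, the snippet below applies the weights and scales listed above as a plain weighted sum. The linear form, the function name, and the assumption that the vector and BM25 inputs are already normalized to the range 0 to 1 are illustrative; the actual combination lives in writ/retrieval/ranking.py and may differ.

```python
WEIGHTS = {
    "vector": 0.594,
    "bm25": 0.198,
    "severity": 0.099,
    "confidence": 0.099,
    "graph": 0.01,
}
SEVERITY = {"critical": 1.0, "high": 0.75, "medium": 0.5, "low": 0.25}
CONFIDENCE = {
    "battle-tested": 1.0,
    "production-validated": 0.8,
    "peer-reviewed": 0.6,
    "speculative": 0.3,
}
# Hops from the nearest top-three seed rule; anything else scores 0.0.
GRAPH_PROXIMITY = {1: 1.0, 2: 0.5}


def final_score(vector, bm25, severity, confidence, hops=None, weights=WEIGHTS):
    """Weighted sum of the five signals (assumed linear combination)."""
    signals = {
        "vector": vector,  # assumed normalized to [0, 1]
        "bm25": bm25,      # assumed normalized to [0, 1]
        "severity": SEVERITY[severity],
        "confidence": CONFIDENCE[confidence],
        "graph": GRAPH_PROXIMITY.get(hops, 0.0),
    }
    return sum(weights[name] * signals[name] for name in weights)


score = final_score(0.8, 0.5, "critical", "peer-reviewed", hops=1)
print(round(score, 4))
```

Because graph proximity carries only 0.01 of the total, it can reorder rules whose other signals are nearly equal but cannot lift a weak match over a strong one.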
Literal-query override
- When the agent asks an exact-phrase question (looking up a specific rule by ID),
BM25 and vector are both weighted 0.396 instead of 0.198 and 0.594, because
semantic matching adds noise to exact lookups.
- Frequency graduation: once a rule has 50 or more observations with a 75%+
positive ratio, the empirical ratio replaces the static confidence weight.
Source: writ/frequency.py:28-53.
Source
writ/retrieval/ranking.py:22-26 for the defaults.
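The two adjustments described in this card can be sketched as small pure functions. The function and parameter names are hypothetical, and the fallback to the static weight when a rule has enough observations but a ratio below 75% is an assumption, since this document does not say what happens in that case.

```python
DEFAULT_WEIGHTS = {
    "vector": 0.594,
    "bm25": 0.198,
    "severity": 0.099,
    "confidence": 0.099,
    "graph": 0.01,
}


def weights_for(literal_query):
    """Return the weight table, rebalanced for exact-phrase lookups."""
    weights = dict(DEFAULT_WEIGHTS)
    if literal_query:
        # Literal-query override: BM25 and vector are weighted equally.
        weights["vector"] = 0.396
        weights["bm25"] = 0.396
    return weights


def effective_confidence(static_weight, positive, total,
                         min_observations=50, min_ratio=0.75):
    """Use the empirical ratio once a rule has graduated, else the static weight."""
    if total >= min_observations:
        ratio = positive / total
        if ratio >= min_ratio:
            return ratio
    # Assumption: rules that have not graduated keep their static weight.
    return static_weight


print(weights_for(literal_query=True))
print(effective_confidence(0.6, positive=45, total=50))
```

Both weight tables sum to 1.0, so the override shifts emphasis between the two retrievers without changing the overall scale of the final score.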
3. Mandatory vs. Retrieved: the Architectural Invariant
invariant
A small set of rules is injected on every prompt, no questions asked. The vast majority
of rules are retrieved on demand. This split is the most important design decision in Writ.
Mandatory rules (always on)
- Forbidden phrases, communication rules, and structural enforcement rules.
- Capped at ~5K tokens per session to keep the budget predictable.
- Never retrieved by BM25 or ANN. They are excluded at index-build time
and at query time (writ/retrieval/pipeline.py:510).
- Loaded via the separate GET /always-on endpoint.
Retrieved rules (RAG-driven)
- Every domain rule, skill, playbook, anti-pattern, and pressure scenario.
- Pulled into context only when relevant to the current prompt.
- Subject to the five-stage pipeline.
- Deduplicated across turns so the same rule is not re-injected.
Why the split matters
- The mandatory band is a permanent floor: rules so important they must apply to
every turn, even if the agent does not ask for them.
- The retrieved band is variable: it lets us add 1000+ rules to the corpus without
paying for any of them on prompts that do not need them.
- Together they let Writ scale without scaling the context budget.
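A minimal sketch of the split, assuming a hypothetical Rule record with a mandatory flag: mandatory rules are kept out of the searchable corpus when the index is built, filtered again at query time, and retrieved rules are deduplicated across turns. None of these names come from Writ's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    text: str
    mandatory: bool = False  # hypothetical flag marking always-on rules


def build_index_corpus(rules):
    """Index-build time: only non-mandatory rules become searchable."""
    return [rule for rule in rules if not rule.mandatory]


class TurnDeduplicator:
    """Query time: drop mandatory rules and rules already injected this session."""

    def __init__(self):
        self._seen = set()

    def filter(self, candidates):
        fresh = [
            rule for rule in candidates
            if not rule.mandatory and rule.rule_id not in self._seen
        ]
        self._seen.update(rule.rule_id for rule in fresh)
        return fresh


rules = [
    Rule("COMM-001", "Never use forbidden phrases.", mandatory=True),
    Rule("ARCH-REPO-001", "Load products through a repository."),
]
corpus = build_index_corpus(rules)
dedup = TurnDeduplicator()
first_turn = dedup.filter(rules)
second_turn = dedup.filter(rules)
print([r.rule_id for r in first_turn], [r.rule_id for r in second_turn])
```

Filtering at both points is redundant by design: if a mandatory rule ever leaked into the index, the query-time check would still keep it out of the retrieved band.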
4. The Knowledge Graph
corpus
Rules are not a flat list. They live in a Neo4j graph with twelve node types and
more than ten edge types, so a rule can teach a skill, counter an anti-pattern,
or supersede an older rule, and the retrieval pipeline can follow those links.
Twelve node types
- Rule: the primary enforceable rule.
- Abstraction: a compressed summary of a rule bundle.
- Skill: a teachable technique (retrievable).
- Playbook: a multi-phase workflow (retrievable).
- Technique: an implementation detail (retrievable).
- AntiPattern: a code smell to flag (retrievable).
- ForbiddenResponse: a prohibited AI output, always on.
- Phase: a step inside a playbook.
- Rationalization: a counter to a known excuse.
- PressureScenario: an adversarial test case.
- WorkedExample: a before/after illustration.
- SubagentRole: a role definition for an agentic worker.
Edge types
- Structural: DEPENDS_ON, PRECEDES, CONFLICTS_WITH, SUPPLEMENTS, SUPERSEDES,
RELATED_TO, APPLIES_TO.
- Methodology: TEACHES, COUNTERS, DEMONSTRATES, DISPATCHES, GATES,
PRESSURE_TESTS, CONTAINS, ATTACHED_TO.
How the graph helps retrieval
- Hit one rule, get the bundle: the adjacency cache lets the pipeline expand a
single match into its directly related rules without a Cypher round-trip.
- Conflicts and supersessions are first-class: the gate refuses to ingest a new
rule that conflicts with an existing one.
- Source: writ/graph/schema.py.
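The "hit one rule, get the bundle" behavior can be sketched with an in-memory adjacency cache and a bounded breadth-first expansion. The node IDs are hypothetical, and treating every edge as undirected is an assumption made to keep the example short; the real traversal may respect edge direction and type.

```python
from collections import deque


def build_adjacency_cache(edges):
    """Pre-compute neighbors from (source, edge_type, target) triples."""
    cache = {}
    for source, _edge_type, target in edges:
        # Assumption: edges are followed in both directions.
        cache.setdefault(source, set()).add(target)
        cache.setdefault(target, set()).add(source)
    return cache


def expand(cache, seeds, max_hops=2):
    """Return {node: hop_distance} for nodes within max_hops of any seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in cache.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    # Seeds themselves are not part of the expansion result.
    return {node: hops for node, hops in distance.items() if hops > 0}


edges = [
    ("RULE-1", "COUNTERS", "ANTI-1"),
    ("ANTI-1", "RELATED_TO", "RULE-2"),
    ("RULE-2", "SUPPLEMENTS", "RULE-3"),
]
cache = build_adjacency_cache(edges)
print(expand(cache, ["RULE-1"]))
```

Each neighbor lookup is a dictionary access, which is what makes the expansion cheap enough to run on every prompt without querying the database.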
5. The Mode System
enforcement
Every session declares a mode up front. The mode decides how much ceremony Writ
imposes: from "just answer" to "no code without an approved plan and tests."
Four modes
- Conversation: discussion, brainstorming, questions. No phase
gates. No write blocking.
- Debug: investigating a specific problem. Read-heavy, no code
generation expected.
- Review: evaluating existing code against rules. Structured
findings, no edits.
- Work: building or modifying code. Full workflow with phase
gates (see card 6).
How a mode is set
- The agent runs writ-session.py mode set <mode> <session_id>
at the start of the session.
- If no mode is set, the gate hooks block all writes except plan.md.
- The --orchestrator flag tells Writ to suppress broad RAG injection
in the master session because sub-agent workers will do their own retrieval.
Why modes exist
- Most prompts are not "write me production code." Conversation and debug modes
give the agent permission to be useful without paying the cost of the full workflow.
- Work mode is the strict mode: it is opt-in, and once entered it cannot be silently
bypassed.
- Source: bin/lib/writ-session.py:835-985.
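For illustration only, a wrapper that assembles the mode-setting command might look like the sketch below. The lowercase mode spellings, the python3 launcher, and the position of the --orchestrator flag at the end of the argument list are all assumptions; only the subcommand shape and the script path come from this card.

```python
VALID_MODES = ("conversation", "debug", "review", "work")  # spelling assumed


def mode_set_command(mode, session_id, orchestrator=False):
    """Build the argv list for setting a session mode (not executed here)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    argv = ["python3", "bin/lib/writ-session.py", "mode", "set", mode, session_id]
    if orchestrator:
        # Assumption: the flag is accepted after the positional arguments.
        argv.append("--orchestrator")
    return argv


print(mode_set_command("work", "sess-123"))
```

Passing the result to subprocess.run would invoke the script; validating the mode first keeps a typo from leaving the session without a mode, which would block every write except plan.md.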
6. The Phase Gates (Work Mode)
enforcement
Inside Work mode the agent moves through four gates. Each gate is a file the user has
to approve before the agent is allowed to write the next thing.
The four phases
- Planning: the agent writes plan.md and capabilities.md to the project root.
Required sections: Files, Analysis, Rules Applied, Capabilities.
- Testing: the agent writes test skeleton files with method signatures
and assertions. No implementation code is allowed yet.
- Implementation: the agent writes the production code that makes
the tests pass. Tests are run automatically after each write.
- Verification: the agent verifies the result and updates
capabilities.md with checked-off items.
How approval works
- The user types "approved" after seeing the plan or the test skeletons.
- That triggers the /writ-approve command, which calls
POST /session/{id}/advance-phase with confirmation_source: "tool".
- The audit trail records who approved what, when, and via which path
(tool vs. pattern match).
- Pattern matching on "approved" is defense in depth only. The primary path is
the tool call.
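The approval call described above could be constructed as follows. This sketch only builds the request object and does not send it; the base URL, the session ID, and the assumption that the body is JSON containing nothing but confirmation_source are illustrative.

```python
import json
import urllib.request


def build_advance_phase_request(base_url, session_id, confirmation_source="tool"):
    """Build (but do not send) the POST that advances a session's phase."""
    body = json.dumps({"confirmation_source": confirmation_source}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/session/{session_id}/advance-phase",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, port, and session ID.
request = build_advance_phase_request("http://localhost:8765", "sess-123")
print(request.get_method(), request.full_url)
```

Sending it with urllib.request.urlopen(request) would perform the call. Recording the confirmation source in the payload is what lets the audit trail distinguish a tool-driven approval from a pattern match.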
Why this exists
- Over-eagerness to write code is the most common failure mode in agentic
coding. Phase gates make the agent pause and show its work.
- Each gate is a small contract the user can read in 30 seconds.
7. The Enforcement Layer (Hooks)
enforcement
Writ ships about thirty bash hooks that fire at well-defined moments in the
agent's lifecycle. They inject rules, validate writes, and block actions that would
violate the workflow.
When hooks fire
- SessionStart: load mandatory rules, register the session.
- UserPromptSubmit: inject the always-active rules block and
run the RAG retrieval for the prompt.
- PreToolUse (Write, Edit, Bash, Read, ExitPlanMode, Task, TodoWrite):
validate that the action is allowed in the current phase, and that the file passes
structural checks.
- PostToolUse: validate the written file against rules, mark
pending tests, and inject any rules surfaced by the write itself.
- Stop / SessionEnd: run pending tests, log friction events,
audit the session.
- PreCompact / PostCompact: snapshot and restore state across
context compactions.
The gate system
- Each phase has a gate file: plan.md for planning, the test skeletons
for testing, the implementation files for implementation.
- A pre-write hook checks the gate state before allowing the write. If the
current phase forbids the action, the hook denies with an explicit reason.
- Sub-agent workers are exempt from the gate: they run in their own session
with is_subagent: true, so they can do their assigned task without
the orchestrator's phase pressure.
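The pre-write check can be sketched as a function that returns a decision together with an explicit reason. Writ's hooks are bash; this Python version is only a model of the logic. The test-file naming heuristic and the rule that verification permits writes to capabilities.md alone are assumptions, not documented behavior.

```python
from pathlib import PurePosixPath

PLANNING_FILES = {"plan.md", "capabilities.md"}


def looks_like_test(path):
    # Assumption: tests are recognized by file name; the real check may differ.
    name = PurePosixPath(path).name
    return name.startswith("test_") or name.endswith("_test.py")


def check_write(phase, path, is_subagent=False):
    """Return (allowed, reason) for a proposed write in the given phase."""
    if is_subagent:
        return True, "sub-agent sessions are exempt from the phase gate"
    name = PurePosixPath(path).name
    if phase == "planning":
        if name in PLANNING_FILES:
            return True, "planning documents are allowed"
        return False, "planning phase: only plan.md and capabilities.md may be written"
    if phase == "testing":
        if looks_like_test(path):
            return True, "test skeletons are allowed"
        return False, "testing phase: implementation code is not allowed yet"
    if phase == "implementation":
        return True, "implementation phase: production code is allowed"
    if phase == "verification":
        # Assumption: only the capabilities checklist changes during verification.
        if name == "capabilities.md":
            return True, "capabilities checklist may be updated"
        return False, "verification phase: only capabilities.md may be written"
    return False, f"unknown phase: {phase!r}"


print(check_write("planning", "src/repository.py"))
print(check_write("testing", "tests/test_repository.py"))
```

Returning the reason alongside the decision mirrors the hook behavior described above, where a denied write tells the agent why it was blocked.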
How risky rule proposals are filtered
- The "structural gate" runs five checks before any AI-proposed rule enters the graph:
schema validation, mechanical enforcement path (for mandatory rules), specificity
(rejecting vague language like "consider" or "try to"), redundancy (cosine similarity
against existing rules), and conflict (against CONFLICTS_WITH edges).
- Source: writ/gate.py:54-221.
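Two of the five checks, specificity and redundancy, lend themselves to a short sketch. The phrase list contains only the two examples named above, and the 0.95 similarity threshold and the toy two-dimensional vectors are assumptions; writ/gate.py may use a longer phrase list and a different cutoff.

```python
import math
import re

# Only the two phrases named in this card; the real list is likely longer.
VAGUE_PHRASES = re.compile(r"\b(consider|try to)\b", re.IGNORECASE)


def is_specific(statement):
    """Specificity check: reject statements containing vague language."""
    return VAGUE_PHRASES.search(statement) is None


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_redundant(candidate, existing, threshold=0.95):
    """Redundancy check: is the candidate embedding too close to an existing rule?"""
    # Assumption: threshold=0.95 is illustrative, not Writ's configured value.
    return any(cosine_similarity(candidate, vec) >= threshold for vec in existing)


print(is_specific("Consider using a repository."))
print(is_specific("Load products through ProductRepositoryInterface."))
print(is_redundant([1.0, 0.0], [[1.0, 0.01]]))
```

In practice the vectors would be the same 384-dimensional embeddings used for retrieval, so a proposal that restates an existing rule in different words can still be caught.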
8. The Evolution Model
corpus
The corpus is not static. AI agents propose new rules, structural gates filter the
noise, and frequency tracking promotes rules that actually pay off in practice.
The authority ladder
- human: the default for hand-authored rules.
- ai-provisional: the only authority an AI proposal can claim,
forced by propose_rule. Bound to speculative confidence (weight 0.3).
- ai-promoted: set by writ review --promote after
a human reviewer accepts an AI-provisional rule. Confidence bumps to
peer-reviewed (weight 0.6).
The confidence ladder
- speculative (0.3): freshly proposed by AI.
- peer-reviewed (0.6): a human approved promotion.
- production-validated (0.8): static default for established rules.
- battle-tested (1.0): granted only by empirical frequency.
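The two ladders can be modeled as a small state transition. The record type and function names below are hypothetical, and raising an error when a rule that is not ai-provisional is promoted is an assumption about how the review command behaves.

```python
from dataclasses import dataclass, replace

CONFIDENCE_WEIGHT = {
    "speculative": 0.3,
    "peer-reviewed": 0.6,
    "production-validated": 0.8,
    "battle-tested": 1.0,
}


@dataclass(frozen=True)
class RuleMeta:
    rule_id: str
    authority: str
    confidence: str


def propose(rule_id):
    """An AI proposal is always ai-provisional with speculative confidence."""
    return RuleMeta(rule_id, authority="ai-provisional", confidence="speculative")


def promote(rule):
    """Human-approved promotion: ai-provisional becomes ai-promoted."""
    if rule.authority != "ai-provisional":
        # Assumption: only provisional rules are eligible for promotion.
        raise ValueError("only ai-provisional rules can be promoted")
    return replace(rule, authority="ai-promoted", confidence="peer-reviewed")


proposal = propose("ARCH-REPO-017")  # hypothetical rule ID
promoted = promote(proposal)
print(proposal, CONFIDENCE_WEIGHT[proposal.confidence])
print(promoted, CONFIDENCE_WEIGHT[promoted.confidence])
```

Note that no function here can assign battle-tested: per the ladder above, that level is reached only through observed frequency, never by a manual step.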
Frequency graduation
- Every time a rule fires, the system records whether it stuck (the agent
followed it) or was rationalized away.
- Once a rule reaches 50 observations with a 75%+ positive ratio, the empirical
ratio overrides the static confidence weight in ranking.
- Source: writ/frequency.py.
The structural gate (one more time)
- No AI proposal reaches the graph without passing all five gate checks. This is
how Writ keeps an AI from polluting its own corpus.