Continuity: Execution-Intent-Conditioned Retrieval for Persistent Project Rationale in AI Coding Agents
Abstract
AI coding assistants routinely violate project-specific architectural decisions during long-running tasks, even when those decisions have been explicitly recorded. Existing memory systems treat retrieval as an opt-in action the model must choose to perform, which fails when the model does not recognize that retrieval is warranted. We describe Continuity, an engineering implementation of execution-intent-conditioned retrieval — retrieval keyed on file paths, entities, and mutation targets extracted from agent tool calls — that injects matched project rationale into the agent's context automatically via middleware, removing the model's burden to self-prompt for context. We evaluate Continuity against a no-memory baseline and a passive (opt-in) retrieval variant on two runners — single-prompt action alignment (n=30 per condition, 3 runs, 2 models) and multi-session recall (7 sessions × 20-question quizzes, 3 runs, 2 models) — and against MemPalace on a 50-query head-to-head. Exposing decisions to the agent lifts action alignment from 2.82/10 to 8.77/10 (GPT-4o; 2.88 → 8.22 on GPT-4o-mini), a roughly 3× jump. Automatic in-loop injection does not further improve single-prompt alignment over passive retrieval (8.61 vs. 8.77; within run-to-run variance), but roughly doubles the fraction of multi-session recall questions clearing a 0.7 cosine-similarity threshold (23% → 55% on GPT-4o; 28% → 50% on GPT-4o-mini). Inter-judge validation with a second LLM judge (Gemini 2.5 Flash) confirms the direction of all findings; Spearman ρ = 0.788 across 540 paired scores. We replicate the action-alignment and recall protocols on two additional fixtures (data-pipeline, mobile-app) with claude-sonnet-4-6 substituting for GPT-4o-mini (24-invocation matrix, §4.7); GPT-4o lift holds at 2.7–3.0× across all three corpora and claude-sonnet-4-6 shows a smaller but consistent 1.8–2.2× lift, indicating the benefit is model-agnostic. Across all conditions, we find that retrieval specificity dominates injection timing as the primary driver of performance. This work is an engineering report on a commercial product. The reference implementation is proprietary; we release benchmark fixtures, prompts, and raw result data to enable independent replication against alternative memory systems.
1. Introduction
Long-context degradation in LLMs — variously called “lost in the middle” (Liu et al., 2023), context rot, or instruction decay — is well documented. In the coding-agent setting, it contributes to a specific and costly failure: the agent takes actions inconsistent with architectural decisions it was previously told about. Standard retrieval-augmented generation (RAG) provides a partial remedy: give the model a search tool, store decisions in a structured store, and let the model query when relevant. In practice this fails for a simple reason — the model must first recognize that a query is warranted. For architectural constraints attached to specific files (e.g., “this module must not import from experimental/”), the trigger for retrieval is the act of editing the file, not a semantic signal the model is likely to notice mid-task.
This paper describes Continuity, an implementation of execution-intent-conditioned retrieval — retrieval keyed on file paths and entity references extracted from agent tool calls. The middleware fires on every file-touching tool invocation, but as we show in §4.7, that temporal pattern is not the load-bearing variable; the keying is. The structural shift is from retrieval-as-agent-decision to retrieval-as-system-invariant for a bounded class of file-scoped decisions. Standard RAG systems assume the model will recognize when retrieval is necessary; execution-intent-conditioned retrieval assumes that for certain operations (editing a file with known constraints, executing a command in a path under active rules) retrieval should occur unconditionally and is best keyed on the specific identifiers in the tool arguments rather than on the agent's broader intent. The contribution is an engineering report on a commercial product and an honest evaluation of the trade-offs of this shift. Our headline findings are narrower than the pattern's apparent promise:
- Exposing decisions to the agent at all matters enormously. Both retrieval conditions lift single-prompt action alignment roughly 3× over a no-memory baseline.
- In-loop injection is equivalent to passive retrieval on single-prompt benchmarks. When the relevant decisions are served inline with every prompt, automatic re-firing is redundant.
- In-loop injection is clearly better over multi-session workloads. The fraction of recall questions clearing a quality threshold roughly doubles.
We did not observe the specific failure mode some prior framings call “decision drift” on our benchmark — all three conditions show approximately flat recall across the seven sessions. The multi-session benefit of in-loop retrieval in our data comes from coverage (the agent sees decisions it would not have thought to query for) rather than from arresting drift.
1.1 Where Continuity sits: the Retrieval Prior framework
We position Continuity within a simple taxonomy: memory architectures differ primarily in the state variable used to estimate retrieval relevance — their retrieval prior. This framing makes the contribution legible against the existing literature:
| Retrieval strategy | Retrieval prior | Primary use case |
|---|---|---|
| Standard RAG | Semantic similarity | Broad latent knowledge recall, exploratory navigation |
| Conversational memory | Dialogue history | Turn-by-turn conversational coherence |
| ID-RAG (Platnick et al., 2025) | Identity graph state | Long-horizon persona and structural role coherence |
| Continuity | Execution intent | Architectural adherence, point-of-mutation constraint recall |
Continuity's retrieval prior is the agent's execution intent, derived from tool-call arguments — file paths, edit targets, entity references, the architectural scope under mutation. These are stronger locality signals for engineering rationale than conversational embeddings or identity graphs, because they describe what is about to be touched rather than what was said or who is acting. The §4.7 timing ablation supports this taxonomy empirically: per-query targeted retrieval keyed on execution intent produces a Cohen's d = 5.83 lift over blanket retrieval, while temporal re-firing of the same retrieval adds nothing measurable (d = −0.68). Specificity dominates timing.
3. The Continuity Pattern
Continuity consists of three components (Figure 1):
1. A decision store
A structured, append-mostly record of project decisions — typically one decision per record, each with a rationale, affected files/globs, and a timestamp.
2. A middleware layer
A shim between the agent and its tool executor. Before a file-touching tool call (read, edit, write, bash with file arguments) returns to the model, the middleware looks up which decisions are linked to the affected path(s).
3. An injection step
Matched decisions are prepended to the tool result in a metadata block, so the agent sees the rationale in the same turn as the tool output, without having to ask for it.

The design trade-off is explicit: in-loop retrieval reduces the agent's retrieval decision cost to zero for file-scoped decisions, at the cost of (a) requiring that decisions be indexable by file path and (b) adding per-tool-call overhead. Decisions that are not tied to specific files (e.g., cross-cutting style conventions) are not well served by this pattern and fall back to conventional RAG.
The reference implementation is approximately 180 lines of TypeScript and ships as part of the commercial Continuity product. The implementation handles glob-based path matching, decision ranking for projects with more than ~50 decisions, a per-session lookup cache, and a configurable injection budget. A high-level pseudocode sketch appears in Appendix B. Full implementation details are not part of this paper; the pattern is described at a level of abstraction sufficient for replication.
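For orientation, the sketch below shows the general shape of the middleware in TypeScript. It is illustrative only: the interface names, the prefix-based matching, and the wrapper signature are assumptions made for this sketch, not the production code, which handles glob matching, ranking, caching, and the injection budget as noted above.

```typescript
// Illustrative sketch only; not the production Continuity middleware.
// Decisions are matched by simple path prefix here to keep the sketch
// dependency-free (production uses glob matching and ranking).

interface Decision {
  id: string;
  rationale: string;
  affectedPathPrefixes: string[]; // production stores globs, e.g. "src/payments/**"
  timestamp: string;
}

interface ToolResult {
  content: string;
  _meta?: { relevantDecisions?: Decision[] };
}

type ToolExecutor = (name: string, args: { path?: string }) => Promise<ToolResult>;

// Wrap a tool executor: after a file-touching tool call runs, look up decisions
// linked to the touched path and attach them to the result the model will see.
function withDecisionInjection(decisions: Decision[], execute: ToolExecutor): ToolExecutor {
  return async (name, args) => {
    const result = await execute(name, args);
    if (!args.path) return result; // only file-touching calls carry a path
    const matched = decisions.filter((d) =>
      d.affectedPathPrefixes.some((prefix) => args.path!.startsWith(prefix)),
    );
    if (matched.length === 0) return result;
    return { ...result, _meta: { ...result._meta, relevantDecisions: matched } };
  };
}
```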
3.1 What the public benchmarks actually exercise
Important disclosure for readers replicating from the public repo. The runners in github.com/Alienfader/continuity-benchmarks exercise the retrieval-keying logic of the production middleware, not the tool-call interception delivery mechanism. The continuity-in-loop condition extracts file paths and capitalized entity tokens from the agent's prompt (via runners/shared/retrieval.ts::extractEntities), runs a BM25 query against the decision store on those keys, and prepends matched decisions to the prompt. This faithfully reproduces the middleware's key-extraction and ranking behavior. It does not reproduce the production delivery shape: in the runner, matched decisions are prepended to the prompt rather than injected into a tool-result _meta.relevantDecisions field at the point of tool execution.
The §4.7 timing-ablation conclusions therefore apply to entity-keyed retrieval delivered via prompt-prepend. Whether the production delivery shape (decisions injected in tool-result metadata in the same turn as the tool output) further improves or degrades the result is a separate question the public benchmark does not currently answer. A skeleton runner for end-to-end production-middleware replay (a continuity-mcp-middleware condition) is scaffolded at runners/middleware-replay.ts; running the v2 matrix through it is future work.
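For readers replicating from the public repo, a minimal sketch of the keying step the runner exercises follows. It is an illustration, not the repository code: extractEntities below is a simplified stand-in for runners/shared/retrieval.ts::extractEntities, and a plain term-overlap ranker substitutes for the runner's BM25 index.

```typescript
// Illustrative sketch of the runner's retrieval keying, not the repository code.

interface DecisionRecord {
  id: string;
  text: string; // decision + rationale text as stored in the fixture
}

// Pull path-like tokens (src/payments/gateway.py) and Capitalized entity tokens
// (PaymentService, RetryPolicy) out of the prompt: the execution-intent keys.
function extractEntities(prompt: string): string[] {
  const paths = prompt.match(/[\w./-]+\.[A-Za-z]\w{0,3}/g) ?? [];
  const entities = prompt.match(/\b[A-Z][A-Za-z0-9_]{2,}\b/g) ?? [];
  return [...new Set([...paths, ...entities])];
}

// Rank decisions by how many extracted keys appear in their text
// (a stand-in for the runner's BM25 query).
function rankByOverlap(keys: string[], store: DecisionRecord[], k: number): DecisionRecord[] {
  const lowered = keys.map((s) => s.toLowerCase());
  return store
    .map((d) => ({
      d,
      score: lowered.filter((key) => d.text.toLowerCase().includes(key)).length,
    }))
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.d);
}

// The continuity-in-loop runner condition: matched decisions are prepended to the prompt.
function buildConditionedPrompt(prompt: string, store: DecisionRecord[]): string {
  const matched = rankByOverlap(extractEntities(prompt), store, 5);
  if (matched.length === 0) return prompt;
  const block = matched.map((d) => `- ${d.text}`).join("\n");
  return `Relevant project decisions:\n${block}\n\n${prompt}`;
}
```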
4. Evaluation
We report three experiments, all on the paydash-api fixture (a Python web-service project with 19 recorded architectural decisions). Each experiment's protocol was adapted from the corresponding ID-RAG Parallel runner (Platnick et al., 2025); methods are in Appendix A. Query sets, prompt fixtures, raw result JSONs, and scoring scripts are released at github.com/Alienfader/continuity-benchmarks so a third party can replicate the evaluation against their own memory system.
4.1 Single-prompt action alignment
For each of 30 prompts, the agent is asked to take an action on a file with active decisions. An LLM judge (Claude Sonnet 4.6, temperature 0) scores whether the action conforms to the relevant decisions on a 1–10 rubric. Three runs per condition.
| Condition | GPT-4o (mean, stdev) | GPT-4o-mini (mean, stdev) |
|---|---|---|
| Baseline (no retrieval) | 2.82 (±0.07) | 2.88 (±0.16) |
| Continuity (Passive) | 8.77 (±0.00) | 8.22 (±0.09) |
| Continuity (In-Loop) | 8.61 (±0.07) | 8.13 (±0.18) |
Both retrieval conditions raise alignment by roughly 3× over baseline on both models. The Passive and In-Loop variants are within run-to-run variance of each other on both models — the in-loop mechanism provides no measurable benefit here. We interpret this as the expected outcome: action alignment is measured per-prompt, and passive retrieval already serves the relevant decisions for each prompt, so automatic re-firing is redundant.
4.2 Multi-session recall
Over seven sessions with approximately 5,000 tokens of off-topic noise injected between sessions, a 20-question recall quiz probes retention of decision rationale. Responses are embedded and scored against ground truth by cosine similarity (all-mpnet-base-v2). Three runs per model (two for GPT-4o-mini on this runner — one run failed with a transient API error).
| Condition | GPT-4o mean cosine | GPT-4o-mini mean cosine | GPT-4o frac ≥ 0.7 | GPT-4o-mini frac ≥ 0.7 |
|---|---|---|---|---|
| Baseline | 0.519 (±0.001) | 0.514 (±0.001) | 12% | 13% |
| Continuity (Passive) | 0.600 (±0.003) | 0.589 (±0.001) | 23% | 28% |
| Continuity (In-Loop) | 0.693 (±0.002) | 0.691 (±0.001) | 55% | 50% |

Both retrieval conditions improve recall over baseline. Unlike the action-alignment results, here the in-loop variant is clearly better than passive: the fraction of questions clearing a 0.7 cosine threshold roughly doubles on both models (23% → 55% on GPT-4o; 28% → 50% on GPT-4o-mini). We report the threshold metric alongside mean cosine because the threshold is more interpretable — it measures how many questions are answered well enough to matter rather than averaging across all questions.
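For concreteness, the scoring arithmetic is a straightforward cosine-plus-threshold computation. The sketch below assumes the answer and ground-truth embeddings (from all-mpnet-base-v2) are computed elsewhere; the function names are ours, not the released scoring scripts'.

```typescript
// Recall scoring sketch: cosine similarity between each answer embedding and its
// ground-truth embedding, reported both as a mean and as the fraction of
// questions clearing the 0.7 threshold.

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function scoreQuiz(answers: number[][], groundTruth: number[][], threshold = 0.7) {
  const sims = answers.map((a, i) => cosine(a, groundTruth[i]));
  const mean = sims.reduce((s, x) => s + x, 0) / sims.length;
  const fracAboveThreshold = sims.filter((x) => x >= threshold).length / sims.length;
  return { mean, fracAboveThreshold };
}
```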
Two caveats on this experiment:
First, all three conditions show approximately flat recall across the seven sessions (drift slopes under 0.003 per session in absolute terms). The paydash-api fixture's 19 decisions do not induce the kind of session-over-session drift Platnick et al. observe in persona-grounded agents, so we cannot make the claim that Continuity arrests drift. What our data supports is that in-loop retrieval raises a floor: because relevant decisions arrive with every file-touching call, the agent cannot miss them due to failing to query. This is a coverage benefit, not a drift-reduction benefit. On a different fixture with genuinely drift-prone content, the picture might differ; we have not tested this.
Second, absolute cosine numbers are not apples-to-apples with those reported in Platnick et al. (different ground truth, different retrieval target, same embedding scale). What is comparable is the magnitude of the lift, which is of the same order in both studies.
4.3 Task-oriented contrast vs. MemPalace (v3, Unified Search)
The April 30, 2026 Head-to-Head v3 run was executed against Continuity's production corpus — 1,894 architectural decisions and approximately 1,560 indexed source files. v3 retires the prior Passive/In-Loop split in favour of Continuity's Unified Search (RRF hybrid: Semantic + Keyword + Tags fused with file-snippet retrieval). MemPalace mines the full codebase into ChromaDB. Each query is judged by a single LLM call (Anthropic Claude Sonnet) which returns relevance scores 0–1 and a winner; ties are awarded when neither set dominates.
Methodology for competitive baseline. The MemPalace comparison is a task-oriented contrast rather than a controlled head-to-head over the same artifacts. Both systems were evaluated against the same 50-query technical benchmark using identical prompts. The corpora differ in structure rather than source: both systems were pointed at the same project repository, but they ingest different artifacts of it. Continuity indexes structured decision records (Q/A pairs with tags, sourced from .continuity/decisions.json — 1,894 records at run time) plus a content-plus-filename index over project files (.ts/.js/.json/.md/.yml under 100KB, excluding node_modules, build outputs, and worktrees). MemPalace ingests the raw codebase via its default ChromaDB pipeline (1,177 files mined into vector drawers) and has no equivalent of the structured decision corpus. Continuity's retrieval ran through @continuity/core's SemanticSearchService with RRF hybrid (semantic via all-MiniLM-L6-v2 embeddings + keyword + tag fusion) and returned up to 7 blended results per query; MemPalace was invoked as its CLI subprocess (mempalace search) with no embedding-model substitution and no parameter overrides, returning its top 5 results. No tuning effort was applied to either system. The corpora are therefore not identical at the artifact level, and the top-k returned per query differs by system (7 vs 5) — these differences reflect each system's deployment-shape default rather than a controlled hyperparameter sweep.
The two systems optimize for different retrieval objectives (per the §1.1 taxonomy): MemPalace emphasizes semantic repository exploration and latent code understanding; Continuity emphasizes decision adherence and rationale resurfacing conditioned on execution intent. The results should therefore be read narrowly: execution-intent-keyed retrieval against a structured decision corpus is substantially better suited to architectural-rationale recall tasks than generalized semantic codebase mining. We do not claim this generalizes to all repository retrieval workloads — in exploratory programming, feature discovery, or broad semantic navigation, MemPalace-shaped systems may outperform Continuity's path/entity-linked approach.
| Metric | Continuity (Unified) | MemPalace |
|---|---|---|
| Query wins (N=50) | 42 | 6 |
| Ties | 2 | 2 |
| Mean relevance (LLM-judged, 0–1) | 0.85 | 0.62 |
| Wake-up latency | 16 ms | 3,506 ms |
| Wake-up tokens | 813 | 817 |
| Avg query latency | 138 ms | 1,966 ms |
| Avg query tokens | 686 | 965 |
The latency gap is large and reflects a genuine architectural difference: Continuity does not index the full codebase. The relevance gap is more contingent. Our query set was constructed by the author and skews toward the kind of project-rationale lookups Continuity is designed for (“why does module X do Y?”), underweighting queries about code structure or implementation details where MemPalace's full-codebase index does better — MemPalace's wins cluster on source-tree queries (webpack config, VSIX packaging, CLI commands, benchmark result files, offline operation). The split is therefore not evidence that Continuity is uniformly superior; it is evidence that on rationale-centric queries, a small targeted index beats a large general one. We recommend treating this comparison as suggestive and welcome a third-party query set.
The v3 run consolidates the prior Passive/In-Loop split into a single Unified Search path: at retrieval time both decision rationale and project-file snippets are ranked together via reciprocal rank fusion. The §4.1 finding still holds — passive and in-loop retrieval are indistinguishable in per-prompt retrieval quality, so there is no reason to keep separate retrieval paths — and the production system now serves both flows through one ranker, which is what these v3 numbers measure.
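Reciprocal rank fusion itself is the standard formula; a minimal sketch of how the unified ranker could blend the semantic, keyword, and tag result lists follows. The constant k = 60 is the conventional default from the RRF literature, assumed here rather than confirmed as the production value.

```typescript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank) for every
// document it contains; documents returned by multiple retrievers accumulate score.
function reciprocalRankFusion(rankedLists: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// e.g. blend the semantic, keyword, and tag result lists, then take the top 7:
// const blended = reciprocalRankFusion([semanticIds, keywordIds, tagIds]).slice(0, 7);
```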
4.4 Inter-judge validation
We ran two independent inter-judge replication studies on action-alignment outputs from two separate benchmark matrices: the original paydash-api matrix (April 2026, gpt-4o-mini agent, n=540 paired scores), and the v2 cross-corpus matrix described in §4.7 (May 2026, gpt-4o + claude-sonnet-4-6 agents, n=1,080 paired scores). Both re-scored saved outputs with an independent judge model (Google Gemini 2.5 Flash) using the same prompt and ordinal scale. Combined sample: 1,620 paired Sonnet-vs-Gemini scores across two corpora and three agent models. These are two independent inter-judge replications, not a single replication of one underlying study.
| Study | N | Judges | Spearman ρ | Cohen's κ (linear-weighted) | Δ (Sonnet − Gemini) |
|---|---|---|---|---|---|
| Paydash (Apr 2026) | 540 | Sonnet 4.6 vs Gemini 2.5 Flash | 0.788 | 0.518 (moderate) | −1.44 |
| Cross-corpus (May 2026) | 1,080 | Sonnet 4.6 vs Gemini 2.5 Flash | 0.722 | 0.558 (moderate) | −1.17 |
| Combined | 1,620 | — | ρ ∈ [0.71, 0.79] | κ ∈ [0.51, 0.56] | — |
Both studies land in the same agreement band despite differing in corpus (paydash-api vs data-pipeline + mobile-app), agent model (gpt-4o-mini vs gpt-4o + claude-sonnet-4-6), and run window (April vs May 2026). The agreement signal is itself reproducible across these axes; both runs use the same Judge B model (Gemini 2.5 Flash, temperature 0). The combination of strong rank correlation with moderate absolute agreement is the canonical “same signal, different calibration” pattern. The Continuity-vs-baseline lift is preserved under both judges in every cell of the cross-corpus 24-cell matrix. The per-condition picture (across all 1,080 cross-corpus pairs):
| Condition | N (pairs) | Sonnet mean | Gemini mean | Δ (Sonnet − Gemini) |
|---|---|---|---|---|
| baseline | 360 | 3.91 | 5.65 | −1.74 |
| continuity | 360 | 9.00 | 9.89 | −0.89 |
| continuity-in-loop | 360 | 8.92 | 9.79 | −0.88 |
| Overall | 1,080 | 7.28 | 8.44 | −1.17 |
Aggregating across the two continuity-conditioned columns of the §4.7 alignment table (continuity and continuity-in-loop, each on 12 cells), Sonnet reports a 2.30× lift in the ratio of overall means (continuity ≈ 9.00 / baseline 3.91); Gemini reports 1.75× by the same aggregation (9.89 / 5.65). The continuity-condition mean of ~9.00 is the average of the two continuity columns from the alignment table (8.98 and 8.90); per-cell preservation under both judges is computable from reports/id-rag-parity/inter-judge-cross-corpus.json. Computing instead as the mean of per-cell ratios produces 2.43× and 1.79× respectively — both aggregations agree directionally and are within rounding noise of each other. The ratios are magnitude-shifted but directionally identical: Gemini compresses the lift because it is more lenient on baseline actions, not because Continuity scores differently.
Two caveats: the two judges saw slightly different context (Gemini saw the full fixture decisions per prompt; Sonnet saw top-5 retrieved), so this is “action quality given the question” agreement rather than strict judge-replaceability; and both judges could share a systematic bias a human panel would not. Full per-cell breakdown and reasoning-text keyword analysis: benchmarks/reports/id-rag-parity/INTER_JUDGE_CROSS_CORPUS.md.
4.5 Context overhead
Continuity's per-call overhead is bounded by the size of the matched decision set, which does not grow with total decision count. The short-circuit threshold in the production implementation is approximately 50 decisions: below it, all linked decisions are injected without ranking; above it, reciprocal rank fusion selects a top-k within a configurable token budget (default 500 tokens).
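A sketch of that budget rule follows, with the caveat that the ranked input is assumed to be precomputed (in production it comes from the RRF selection) and the token counter is a rough placeholder rather than a real tokenizer.

```typescript
// Injection-budget sketch: below the ~50-decision short-circuit threshold, inject
// all linked decisions unranked; above it, take ranked decisions until the token
// budget (default 500) is exhausted.

interface RankedDecision {
  text: string;
  score: number; // ranking score, assumed precomputed upstream
}

const approxTokens = (s: string) => Math.ceil(s.length / 4); // rough heuristic, not a tokenizer

function selectForInjection(
  linked: RankedDecision[],
  shortCircuitCount = 50,
  tokenBudget = 500,
): RankedDecision[] {
  if (linked.length <= shortCircuitCount) return linked; // inject everything, no ranking
  const ranked = [...linked].sort((a, b) => b.score - a.score);
  const selected: RankedDecision[] = [];
  let used = 0;
  for (const d of ranked) {
    const cost = approxTokens(d.text);
    if (used + cost > tokenBudget) break;
    selected.push(d);
    used += cost;
  }
  return selected;
}
```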
We distinguish two related but separate quantities here, both of which we report only at a coarse level in this paper:
Per-call overhead (modeled). The number of additional tokens injected on a single file-touching tool call. This is bounded by the configured budget (≤500 tokens by default) and is approximately constant in total decision count. We have not directly measured per-call overhead at scale (i.e. with thousands of decisions in production). Our scaling claim is a model based on the budget cap, not a measurement. The 25.7× efficiency figure cited on the product website refers to this modeled per-call quantity at 1,869 decisions, not to a measured outcome.
Per-session savings (measured). The total tokens consumed across an end-to-end coding session with Continuity active vs. a baseline that loads the full decision corpus into context once at session start. We measured this with tiktoken on Continuity's own production codebase (1,869 decisions, accumulated through dogfooded development). The result was a 56.5% reduction in total session tokens. This is a measurement on a single real codebase, not a controlled benchmark across many projects, and we treat it as a single observation rather than a generalized claim.
The two numbers are not directly convertible: 56.5% session-level savings does not mathematically imply a specific per-call efficiency multiplier, because session-level savings depend on session length, file-touching frequency, and matched-decision distribution. A reader who multiplies one to get the other will produce nonsense. We report both because both are true things we know, but they answer different questions.
4.6 Project Chronos — RAGAS, RGB, and scaling proofs
Project Chronos (April 30, 2026) is the v3.0 evaluation pass that complements the head-to-head in §4.3 with academically grounded RAG quality and robustness frameworks, and provides the first direct measurement supporting the O(1) token-scaling claim made informally in §4.5. All numbers below are reproducible from the public benchmark repo.
RAGAS Generation Quality (n=5, EACL 2024)
| Metric | Score | Notes |
|---|---|---|
| Faithfulness | 0.91 | Extremely high grounding; minimal hallucinations. |
| Answer Relevancy | 0.86 | Answers directly address the query intent. |
| Context Precision | 0.78 | Low noise; retrieved items are highly relevant. |
| Context Recall | 0.85 | Successfully retrieves necessary ground truth. |
RGB Robustness (AAAI 2024)
| Scenario | Score | Notes |
|---|---|---|
| Noise Robustness | 0.95 | Effectively ignored irrelevant injected context. |
| Counter-factual | 0.95 | Resolved temporal conflicts using metadata. |
| Information Refusal | 0.95 | Refused to guess when answer was missing. |
O(1) Token Scaling Proof
| Decisions (N) | Full-Context (Tokens) | Continuity (Tokens) | Savings |
|---|---|---|---|
| 100 | 21,138 | 13,841 | 34.5% |
| 1,000 | 148,239 | 13,856 | 90.6% |
| 2,000 | 289,114 | 13,849 | 95.2% |
| 5,000 | 712,236 | 13,850 | 98.1% |
Continuity's injected context is flat at ≈13.8k tokens regardless of N; the full-context baseline grows linearly and exceeds standard 200k context windows around N = 1,500. This is a direct measurement, not a model.
Multi-Corpus Retrieval Performance
| Corpus | Decisions | Precision@5 | MRR | Hit Rate |
|---|---|---|---|---|
| Web API | 200 | 82.73% | 0.9773 | 97.73% |
| Data Pipeline | 150 | 80.95% | 0.8976 | 92.86% |
| Mobile App | 100 | 79.88% | 0.9405 | 95.24% |
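The three metrics are the standard ranked-retrieval definitions. The sketch below shows how they are computed from per-query ranked results and relevance sets; the names are illustrative, not the repository's scoring scripts.

```typescript
// Standard ranked-retrieval metrics over a set of queries. For each query we
// have the ranked list of returned decision IDs and the set of relevant IDs.

interface QueryResult {
  ranked: string[];       // decision IDs in returned order
  relevant: Set<string>;  // ground-truth relevant decision IDs
}

function precisionAtK(results: QueryResult[], k: number): number {
  const per = results.map(
    (r) => r.ranked.slice(0, k).filter((id) => r.relevant.has(id)).length / k,
  );
  return per.reduce((s, x) => s + x, 0) / results.length;
}

function meanReciprocalRank(results: QueryResult[]): number {
  const per = results.map((r) => {
    const idx = r.ranked.findIndex((id) => r.relevant.has(id));
    return idx === -1 ? 0 : 1 / (idx + 1);
  });
  return per.reduce((s, x) => s + x, 0) / results.length;
}

// Hit rate: fraction of queries with at least one relevant result in the top k.
function hitRate(results: QueryResult[], k: number): number {
  return (
    results.filter((r) => r.ranked.slice(0, k).some((id) => r.relevant.has(id))).length /
    results.length
  );
}
```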
LongMemEval — Stress Test on 1,925 Decisions
- R@5: 69.8%; R@10: 77.2%
- Exact-match recall (R@5): 87.5%
- Paraphrased recall (R@5): 66.0%
- Cross-reference recall (R@5): 53.0%
- Vague-query recall (R@5): 44.0% — the natural floor for under-specified inputs; vague queries motivate the in-loop pattern itself.
Methodology notes.
- Vague-query ambiguity ceiling. Vague queries are generated by extracting a single tag or keyword from a source decision. Without a ceiling, top tags like `architecture` (451 entries) and `mcp` (812 textual matches) produce queries with hundreds of equally relevant “correct” answers — making R@10 structurally bounded by ≈10/N regardless of ranking quality. This benchmark filters vague-query terms to those appearing in ≤5% of the corpus (≤96 decisions at N=1,925), so the result reflects ranking quality rather than collision rate.
- Redundancy dedup. Search results are filtered for paraphrase-level duplicates by cosine similarity on embeddings (threshold 0.92). The threshold is tuned to keep superseded / near-identical decisions in the same cluster visible (e.g. `supersedes` chains, intentional conflict-test variants), collapsing only true paraphrase duplicates.
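A minimal sketch of that dedup rule, assuming the embeddings are computed upstream and passed in:

```typescript
// Paraphrase dedup: keep a result only if its embedding stays below the 0.92
// cosine-similarity threshold against everything already kept.

interface EmbeddedResult {
  id: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function dedupeParaphrases(results: EmbeddedResult[], threshold = 0.92): EmbeddedResult[] {
  const kept: EmbeddedResult[] = [];
  for (const r of results) {
    if (!kept.some((existing) => cosine(r.embedding, existing.embedding) >= threshold)) {
      kept.push(r);
    }
  }
  return kept;
}
```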
ANN Latency Scaling
Vector similarity search latency grows sub-linearly with corpus size:
- 100 decisions: 14 ms
- 500 decisions: 28 ms
- 1,000 decisions: 43 ms
See /docs/math/EMPIRICAL_EVALUATION.md for raw scoring scripts and the full Project Chronos methodology.
4.7 Cross-corpus replication (May 2026)
The single-prompt action alignment and multi-session recall results in §4.1 and §4.2 were measured on a single fixture (paydash-api, 19 decisions, payments domain) with two OpenAI models (GPT-4o and GPT-4o-mini). To test whether the lift generalizes, we re-ran both protocols on two additional fixtures (data-pipeline, 19 decisions, data-engineering; mobile-app, 18 decisions, mobile stack) and replaced GPT-4o-mini with claude-sonnet-4-6 to test a second model family. The matrix is 2 fixtures × 2 models × 3 runs × 2 runners = 24 invocations, executed 2026-05-04 21:17 UTC — 2026-05-05 09:20 UTC.
Action alignment (judge score 1–10, n=30 per condition, 3 runs)
| Fixture | Model | Baseline | Continuity | In-loop | C/B ratio |
|---|---|---|---|---|---|
| data-pipeline | gpt-4o | 2.83 ± 0.37 | 8.47 ± 0.17 | 8.56 ± 0.04 | 2.99× |
| data-pipeline | claude-sonnet-4-6 | 4.29 ± 0.10 | 9.53 ± 0.13 | 9.48 ± 0.17 | 2.22× |
| mobile-app | gpt-4o | 3.15 ± 0.07 | 8.42 ± 0.10 | 8.38 ± 0.44 | 2.68× |
| mobile-app | claude-sonnet-4-6 | 5.41 ± 0.64 | 9.57 ± 0.13 | 9.58 ± 0.03 | 1.77× |
| Overall mean | — | 3.92 | 9.00 | 9.00 | 2.30× |
The original §4.1 finding replicates: GPT-4o lifts 2.68–2.99× on both new fixtures (vs. 3.11× originally on paydash-api). claude-sonnet-4-6 shows a smaller but consistent 1.77–2.22× lift, indicating the benefit is model-agnostic but the magnitude depends on the model's baseline competence (claude-sonnet's baselines are higher, leaving less headroom). Continuity-in-loop and Continuity-passive are within run-to-run variance on every cell, confirming the §4.1 conclusion that automatic re-firing buys nothing on single-prompt benchmarks. Paired Wilcoxon contrasts versus baseline give Cohen's d = 8.94 (continuity vs baseline) and d = 9.02 (continuity-in-loop vs baseline), both at p = 0.002 (n = 12 paired cells); continuity-in-loop vs continuity is p = 0.286, d = −0.27 (within noise, ceiling-bounded — see distribution below).
Score distribution and ceiling effect (action-alignment, 360 actions per condition)
| Condition | N | Mean | % at 10 | % at 9 or 10 | % ≥ 8 | % < 5 |
|---|---|---|---|---|---|---|
| baseline | 360 | 3.92 | 0.0% | 4.7% | 11.4% | 65.8% |
| continuity | 360 | 8.98 | 38.9% | 82.2% | 87.2% | 1.7% |
| continuity-in-loop | 360 | 8.90 | 42.5% | 79.4% | 87.5% | 3.3% |
Both continuity conditions saturate near the top of the 1–10 scale: ~80% of judgments land at 9 or 10 and ~87% at 8 or above, versus 4.7% / 11.4% for baseline. The narrow gap between continuity and continuity-in-loop on single-prompt alignment is therefore consistent with a metric ceiling rather than equivalent performance — the multi-session recall protocol (§4.2 / below), which is not ceiling-bounded, is the cell where the two conditions diverge meaningfully.
Self-judging split — Sonnet-as-judge across Sonnet-agent vs GPT-4o-agent cells (action-alignment, n=180 per cell)
| Condition | Sonnet judges Sonnet-agent | Sonnet judges GPT-4o-agent | Δ (self − xfer) |
|---|---|---|---|
| baseline | 4.96 | 2.88 | +2.08 |
| continuity | 9.54 | 8.42 | +1.13 |
| continuity-in-loop | 9.46 | 8.34 | +1.12 |
The baseline Δ (+2.08) is the agent-model gap on un-retrieved actions; Gemini independently rates a similar gap (+1.51 baseline), so this is a real difference between agent models rather than self-preference. The continuity-conditioned Δs (+1.13 / +1.12) are ~1 point larger than Gemini's corresponding split (~0.08), so roughly one point of the Sonnet-rated continuity score on Sonnet-agent cells is self-preference signal rather than retrieval quality. The cross-judge headline (continuity > baseline in every cell under both judges) still holds. Discussed further in §6.
Recall over time (mean cosine similarity vs. ground truth, 7 sessions, 20 questions, 3 runs)
| Fixture | Model | Baseline | Continuity | In-loop | In-loop lift |
|---|---|---|---|---|---|
| data-pipeline | gpt-4o | 0.520 ± 0.006 | 0.611 ± 0.004 | 0.695 ± 0.006 | +33.8% |
| data-pipeline | claude-sonnet-4-6 | 0.563 ± 0.004 | 0.626 ± 0.007 | 0.720 ± 0.006 | +27.8% |
| mobile-app | gpt-4o | 0.429 ± 0.005 | 0.509 ± 0.005 | 0.622 ± 0.008 | +45.1% |
| mobile-app | claude-sonnet-4-6 | 0.495 ± 0.004 | 0.531 ± 0.006 | 0.659 ± 0.003 | +33.2% |
| Overall mean | — | 0.502 | 0.569 | 0.674 | +34.4% |
In-loop retrieval beats single-shot retrieval on every fixture×model cell, confirming the §4.2 multi-session benefit on a wider corpus. The largest absolute lift (+45.1%) is on mobile-app/gpt-4o, the cell with the lowest baseline — the harder the corpus, the larger the gap Continuity closes. Within-cell variance is small (max spread 0.008 across 3 seeds), so 3 replications is sufficient for statistical signal. Paired Wilcoxon contrasts versus baseline give Cohen's d = 11.19 (continuity-in-loop vs baseline) and d = 11.38 (continuity-perq-frontloaded vs baseline) at p = 0.002 each (n = 12 paired cells); the in-loop vs perq-frontloaded contrast is the M2 timing-only ablation reported in the next table.
Drift across the 7-session window (M3, mean cosine vs ground truth)
| Condition | Session 1 | Session 7 | Δ (S1 − S7) | Mean drift slope (per session) |
|---|---|---|---|---|
| baseline | 0.500 | 0.504 | −0.004 | +0.0001 |
| continuity-blanket | 0.572 | 0.573 | −0.001 | +0.0001 |
| continuity-perq-frontloaded | 0.677 | 0.680 | −0.003 | +0.0002 |
| continuity-in-loop | 0.673 | 0.675 | −0.003 | +0.0002 |
All four conditions are effectively flat across the seven-session window: |S1 − S7| < 0.005 cosine, mean drift slopes are within ±0.0002 per session (positive sign means the score very slightly improves over time, opposite to the drift direction Continuity would have to arrest). The cross-corpus fixtures, like paydash-api in §4.2, do not exhibit the kind of session-over-session drift Continuity could plausibly counter; what the in-loop condition raises is the recall floor, not a downward trend. The “decision-drift prevention” framing from earlier drafts is replaced with “rationale-recall floor lift” throughout this paper.
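One natural way to compute the per-session drift slope is an ordinary least-squares fit of mean cosine against session index; the sketch below shows that computation as an illustration of the metric's shape, not as the repository's exact script.

```typescript
// Ordinary least-squares slope of per-session mean score against session index (1..7).
// A near-zero slope means recall is flat across the session window.
function driftSlope(sessionMeans: number[]): number {
  const n = sessionMeans.length;
  const xs = sessionMeans.map((_, i) => i + 1);
  const xMean = xs.reduce((s, x) => s + x, 0) / n;
  const yMean = sessionMeans.reduce((s, y) => s + y, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - xMean) * (sessionMeans[i] - yMean);
    den += (xs[i] - xMean) ** 2;
  }
  return num / den;
}
```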
Timing ablation (May 2026, v2 matrix)
The original 3-condition cross-corpus run conflated two variables that matter separately: retrieval keying (concatenated-seed vs per-question targeted retrieval) and injection timing (one-shot at session start vs re-fired per session). To dissect them, we re-ran the matrix in May 2026 with a fourth condition continuity-perq-frontloaded: per-question retrieval computed once at session 1, the same 20 question-specific blobs re-injected unchanged at every session boundary. This holds retrieval data constant against in-loop and varies only injection timing.
| Contrast | What it isolates | Mean Δ (cosine) | Cohen's d (paired) | d 95% CI | Wilcoxon p |
|---|---|---|---|---|---|
| Continuity-blanket vs baseline | Effect of any retrieval | +0.067 | 3.30 | [+2.29, +5.50] | 0.003 |
| Per-question vs blanket | Better retrieval keying (timing held) | +0.106 | 5.83 | [+4.91, +6.84] | 0.003 |
| In-loop vs perq-frontloaded | Timing only (M2 ablation) | −0.002 | −0.68 | [−1.49, +0.17] | ≈ 0.05 (frozen) |
| Per-question frontloaded vs baseline | Reference | +0.173 | 11.38 | [+9.18, +15.47] | 0.003 |
d 95% CIs are BCa (bias-corrected and accelerated) bootstrap intervals computed over 10,000 resamples of the 12 paired cell-level diffs (benchmarks/reports/id-rag-parity-v2/bootstrap-ci.json; reproduced via verification/shared/id-rag-parallel/runners/bootstrap-ci.py). The M2 ablation interval [−1.49, +0.17] crosses zero, consistent with the marginal-significance reading.
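For reference, the paired effect size used throughout this section is the standard paired Cohen's d over the 12 cell-level differences; a minimal sketch follows (the BCa bootstrap for the interval lives in the repo's bootstrap-ci.py and is not reproduced here).

```typescript
// Paired Cohen's d: mean of the paired cell-level differences divided by their
// sample standard deviation. With 12 paired cells, `diffs` has 12 entries
// (condition A score minus condition B score per cell).
function pairedCohensD(diffs: number[]): number {
  const n = diffs.length;
  const mean = diffs.reduce((s, x) => s + x, 0) / n;
  const variance = diffs.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  return mean / Math.sqrt(variance);
}
```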
Per-question retrieval (frozen at session 1, re-injected unchanged) and in-loop retrieval (re-fired each session) score within 0.002 cosine units of each other across all 12 paired cells; 11 of 12 cells favor frozen retrieval, marginally. The lift over blanket retrieval (Cohen's d = 5.83) is therefore attributable to per-query targeted retrieval, not to per-tool-call re-firing. On this seven-session benchmark with 5,000 tokens of distractor noise between sessions, automatic injection timing buys nothing measurable over a one-shot session-start injection of the same retrieval data.
We retain the per-tool-call middleware architecture in the reference implementation because it is the deployment shape that requires no changes from the agent operator (no session-boundary hook needed; the middleware is invisible to the agent). The demonstrated empirical contribution is execution-intent-conditioned retrieval surfaced via tool-call middleware — good per-query keying plus zero-config delivery — not the temporal in-loop pattern itself. For software-engineering agents, retrieval relevance appears to depend more on accurately modeling the agent's immediate mutation target than on increasing memory persistence or reinjection frequency. A longer-horizon benchmark (50+ sessions, supersedes events mid-stream, higher noise volumes) might reveal a timing effect this protocol cannot.
Methodology and reproduction
The runner code, fixtures, quizzes, and per-invocation results live at verification/shared/id-rag-parallel/ and benchmarks/reports/id-rag-parity-v2/ in the public benchmark repo. The matrix script is benchmarks/run-id-rag-parity-v2.sh; the per-cell JSON+MD outputs and the full v2 analysis live alongside under EXPERIMENTAL_GAPS_ANALYSIS_V2.md. The original (3-condition) matrix is preserved at benchmarks/reports/id-rag-parity/ for cross-time comparability. Total cost was approximately $30 in API spend across OpenAI (gpt-4o agent) and Anthropic (claude-sonnet-4-6 agent and judge) including the v2 re-run with the M2 ablation condition.
4.8 Production-middleware delivery replay (May 2026)
The §4.7 simulator implements execution-intent-conditioned retrieval as a runner-level operation (BM25 over decision text, prepended to the agent prompt), faithful to the retrieval-keying logic of the production AutoRetrievalMiddleware but not to its delivery shape (matched decisions injected into the tool result's _meta.relevantDecisions field at the point of tool execution). To test whether the delivery-shape distinction matters empirically, we ran an end-to-end replay through the production MCP server against a real code-links-bearing workspace (the Continuity development repository itself: 1,894 decisions, 997 code-links covering 1,065 unique linked decision IDs).
Three production delivery shapes were compared on the same 20 paired questions (gpt-4o-mini agent, single-seed pilot; runner: runners/middleware-replay.ts):
| Mode | What it tests | Mean cosine | ≥ 0.7 frac | Middleware fire rate |
|---|---|---|---|---|
| mcp-search | Single-shot direct call to production search_decisions MCP tool | 0.716 | 60% | — |
| agent-loop | 2-turn agent loop with search_decisions advertised as a tool | 0.750 | 85% | — |
| auto-middleware | 2-turn agent loop with bash advertised; AutoRetrievalMiddleware fires server-side, decisions injected into tool-result _meta | 0.561 | 10% | 40% (8/20) |
auto-middleware underperforms the search-based modes by ~0.16–0.19 cosine. To rule out a fire-rate confound (the middleware activates only when the agent-issued path matches a code-links.json entry), we restricted to the 8 questions where the middleware actually fired and re-paired: auto-middleware (0.550) still trailed mcp-search (0.678) and agent-loop (0.714) on the same questions. The gap is therefore a delivery-shape effect, not a fire-rate effect.
The mechanism: in the auto-middleware path the agent's tool-result context contains both the matched decisions (signal) and the bash output bulk (the cat'd source-file contents — a competing signal that the agent treats as the primary answer source). The two pure-decision modes deliver matched decisions as the entire tool output and don't fight that signal. The §4.7 simulator (decision text prepended to the prompt, no tool output) approximates the mcp-search path, not the auto-middleware path.
Implication for the §4.7 reading. The simulator's d=5.83 retrieval-keying lift is directly attributable to execution-intent-conditioned retrieval delivered as the agent's primary tool signal — i.e., what the production search_decisions tool naturally provides — but it is not directly transferable to tool-result-_meta injection alongside bulk tool output. In production deployments where the Continuity extension renders _meta.relevantDecisions in a separate UI lane (system banner, side panel, etc.), the gap may close; that rendering shape is not testable from the public-benchmark MCP transport. We cite the simulator's d=5.83 as the upper bound of the retrieval-keying contribution under clean delivery; the lower bound under noisy-tool-output delivery is approximately d=−1.24 to d=−1.64 (auto-middleware vs mcp-search/agent-loop paired on n=20).
This is the same shape of disclosure pattern as the §4.7 M2 ablation: a sub-claim of the original framing (in that case, that re-firing matters; in this case, that the production tool-result-injection delivery matters identically to the simulator's prompt-prepend delivery) was tested empirically and walked back. The headline finding — that targeting retrieval to execution intent dominates over no-retrieval baselines — survives both walkbacks unchanged. Reproducibility: npm run bench:middleware-replay -- --retrieval=auto-middleware --workspace <path>; raw artifacts at reports/middleware-pilot/workspace-*/ in the public benchmark repo.
4.9 LongMemEval-S subsample (May 2026)
All experiments above use either Continuity's own production decision corpus (§4.3, §4.5, §4.6, §4.8) or author-built ID-RAG-style synthetic fixtures (§4.1, §4.2, §4.7). To test how Continuity's retrieval keys hold up on a third-party long-term-memory benchmark we did not design, we ran a 50-question subsample of LongMemEval-S (Wu et al., 2024 — the long-term-memory evaluation set for chat-assistant agents, distinct from the production-corpus “LongMemEval”-style retrieval-recall stress test in §4.6). The subsample is stratified across six LongMemEval question types: knowledge-update, multi-session, single-session-assistant, single-session-preference, single-session-user, and temporal-reasoning (8–9 questions each, n=50 total). Each question is answered three times — once per condition (baseline / continuity-blanket / continuity-in-loop) — and each agent answer is scored 0 or 1 by an LLM judge (Gemini 2.5 Flash, temperature 0).
Overall accuracy (50 questions per condition, Gemini 2.5 Flash judge)
| Condition | N | Accuracy | Δ vs baseline (pp) | Judge parse-ok |
|---|---|---|---|---|
| baseline | 50 | 4.0% | — | 100.0% |
| continuity-blanket | 50 | 48.0% | +44.0 | 92.0% |
| continuity-in-loop | 50 | 30.0% | +26.0 | 98.0% |
Both retrieval conditions vastly outperform the no-memory baseline (4.0% accuracy), but unlike on the §4.7 single-prompt and recall protocols, continuity-in-loop underperforms continuity-blanket by 18 percentage points overall (30% vs 48%) on this public benchmark. The per-question-type breakdown explains why:
Per question type (Gemini 2.5 Flash judge)
| Question type | N | baseline | continuity-blanket | continuity-in-loop | in-loop − blanket |
|---|---|---|---|---|---|
| single-session-assistant | 8 | 0.0% | 87.5% | 87.5% | 0.0 pp |
| single-session-user | 8 | 0.0% | 75.0% | 0.0% | −75.0 pp |
| single-session-preference | 8 | 0.0% | 50.0% | 37.5% | −12.5 pp |
| knowledge-update | 8 | 0.0% | 37.5% | 0.0% | −37.5 pp |
| multi-session | 9 | 22.2% | 22.2% | 22.2% | 0.0 pp |
| temporal-reasoning | 9 | 0.0% | 22.2% | 33.3% | +11.1 pp |
The losses cluster on question types where the correct context is a specific past conversation turn rather than a stable architectural rationale: single-session-user (recall what the user said earlier in the session) and knowledge-update (recall the most recent value of a fact that the user updated mid-conversation). On both, in-loop scores 0% while blanket scores 75% / 37.5%. In-loop matches blanket on single-session-assistant (87.5% each) and slightly beats it on temporal-reasoning (33.3% vs 22.2%); all three conditions tie at 22.2% on multi-session, suggesting either that a nine-question bucket is too small to discriminate or that no condition's keying solves cross-session queries on this fixture.
Why in-loop loses here. Continuity's production middleware keys retrieval on the agent's execution intent — file paths, mutation targets, entity references — signals that work well for file-scoped architectural rationale (the §4.7 finding: d=5.83 over blanket keying) but do not exist in a chat-history Q&A setting. When the right context is “what the user told me twelve turns ago,” the agent's current tool-call has no file-path or mutation-target signature that would cause in-loop to retrieve the relevant turn; meanwhile blanket retrieval simply injects all available history at session start, where the agent can recall it via natural in-context attention. This is not a bug in the in-loop pattern; it is a domain boundary. The in-loop pattern's contribution is “retrieval keyed on execution intent”; if execution intent doesn't correlate with the answer's location, the keying provides no help.
Inter-judge replication — GPT-4o vs Gemini 2.5 Flash (n=150 pairs)
| Condition | N | Gemini accuracy | GPT-4o accuracy | Agreement | Cohen's κ |
|---|---|---|---|---|---|
| baseline | 50 | 4.0% | 14.0% | 90.0% | 0.41 (moderate) |
| continuity-blanket | 50 | 48.0% | 66.0% | 82.0% | 0.64 (substantial) |
| continuity-in-loop | 50 | 30.0% | 32.0% | 98.0% | 0.95 (almost perfect) |
| Overall | 150 | 27.3% | 37.3% | 90.0% | 0.77 (substantial) |
GPT-4o is systematically more lenient than Gemini (+10 pp on baseline, +18 pp on blanket, +2 pp on in-loop), but the direction-of-effect is preserved under both judges: blanket > in-loop > baseline, and in-loop underperforms blanket by similar margins (Gemini: −18 pp; GPT-4o: −34 pp). Per-condition agreement is highest on the in-loop column (98% / κ=0.95) where both judges essentially agree on which questions in-loop got right and which it missed; lowest on baseline (κ=0.41) where most answers are wrong and the disagreement is over rare correct ones. This is consistent with the §4.4 cross-corpus pattern (strong rank correlation + moderate absolute agreement). The two-judge LongMemEval-S replication adds a third inter-judge replication on top of §4.4's paydash (n=540) and cross-corpus (n=1,080) studies; raw pair data at reports/longmemeval/run-1/inter-judge-gpt4o.json.
Why this matters for the headline claim. The §4.7 M2 ablation found the timing-only contrast inconclusive on synthetic cross-corpus content (Δ ≈ −0.002 cosine, p ≈ 0.05). §4.9 sharpens that into a more decisive negative result: on a public benchmark whose question types are not file-scoped, in-loop's execution-intent keying is actively harmful relative to blanket retrieval. This is strong support for the central thesis of this paper — retrieval specificity dominates injection timing, and the specificity must match the answer's location. In-loop's tool-call keying is a domain-specific contribution to file-scoped agentic coding, not a general-purpose memory-system improvement.
Methodology notes. Subsample stratification: 50 questions drawn proportional to the LongMemEval-S question-type distribution (8–9 per type across 6 types). Agent: gpt-4o-mini under each condition with identical task prompt. Judge A: Gemini 2.5 Flash (original scoring); Judge B: GPT-4o (re-judge for inter-judge replication). Single seed, single run. Continuity-blanket retrieves all available session-history turns at session start; continuity-in-loop retrieves on each agent tool call keyed on extracted entities from the current question. Raw artifacts (50×3 = 150 agent answers, both judges' scoring, prompts) at benchmarks/reports/longmemeval/run-1/. Subsample script: scripts/longmemeval-subsample.py. This is a single-seed pilot at n=50 per condition; a 200–500-question full replication is the natural follow-up and is flagged for future work.
5. Discussion
The central empirical finding of this work, stated as a causal claim:
Retrieval specificity dominates injection timing. Targeting retrieval to the agent's execution intent (file paths, entity references, mutation scope) produces a Cohen's d = 5.83 lift in rationale recall over blanket retrieval. Re-firing the same retrieval per tool call versus injecting it once at session start adds nothing measurable (mean Δ = −0.002 cosine units, p ≈ 0.05 slightly favoring the frozen condition across 12 paired cells).
This is a contrarian result against the implicit bet of much of the agent-memory startup space — that agents need ever-larger memory windows, continual reinjection, and recursive memory replay. The §4.7 ablation suggests the load-bearing variable is retrieval specificity at execution time, not the temporal loop of when retrieval fires.
What survives over a no-retrieval baseline:
- Retrieval ≫ no retrieval. Continuity beats baseline by ~3× on action alignment (Cohen's d = 8.94) and by 0.17 cosine units on recall (d = 11.38) across all 12 cross-corpus cells. The lift survives evaluator shift (§4.4, 1,620 paired Sonnet-vs-Gemini scores: 540 from paydash + 1,080 from cross-corpus).
- Per-query keying > blanket retrieval. Targeting the retrieval to the agent's execution intent rather than a project-level concatenated seed produces a d = 5.83 lift on recall (§4.7).
- Coverage on multi-session workloads. The fraction of recall questions clearing a 0.7 cosine threshold roughly doubles when targeted retrieval is in play, regardless of whether it's frozen-at-session-1 or re-fired per session.
What it does not buy, on our benchmarks:
- In-loop timing as a contribution. The §4.7 M2 ablation finds in-loop ≈ frozen-at-session-1 (mean Δ = −0.002, p ≈ 0.05 slightly favoring the frozen condition). Per-tool-call re-firing is implementation convenience, not a measured performance contributor.
- Higher single-prompt alignment over passive retrieval. Passive and In-Loop tie in §4.1 (likely a metric ceiling: ~80% of judgments saturate at 9–10).
- Drift reduction per se. No condition exhibits meaningful drift across the seven-session window. The recall lift is a level shift, not a slope correction.
A longer-horizon benchmark (50+ sessions, supersedes events mid-stream, higher noise volumes) might reveal a timing effect this protocol cannot. We leave that as future work.
All claims in this paper are scoped to the fixture suite described in §4 and Appendix A — three author-built fictional projects (paydash-api, data-pipeline, mobile-app; ~18–19 hand-authored decisions each), drawn from a 5-fixture roster (the remaining two, ml-platform and infra-platform, are released for community replication but not reported here). Generalization to noisier production corpora, longer horizons, and decision settings beyond file-scoped architectural rationale is future work.
6. Limitations
- Author-built fixtures. All three fixtures used in this paper (`paydash-api`, `data-pipeline`, `mobile-app` — see §4.7) were constructed in-repo by the author. The decisions, supersede pairs, and gold-standard quiz answers are author-written. We release the corpora publicly to enable independent replication, but no third party has yet rebuilt a fixture from a real codebase they control. The lift figures should not be cited without this caveat.
- Clean decision corpora. A consequence of author-built fixtures: every decision in our corpora is hand-authored and well-formed — no contradictions, no superseded-but-not-tagged records, no duplicates, no malformed entries, no low-quality rationale text. Production `.continuity/decisions.json` files in the wild contain all of these failure modes. We have not measured how much of the observed lift survives when retrieval must navigate noisy, contradictory, or stale rationale corpora. A “noisy-corpus” stress test (incomplete + contradictory + stale + duplicate decisions injected into the fixture) is the most useful next experiment we can scope and is flagged for follow-up.
- Small paired n. Twelve paired cells (2 fixtures × 2 agent models × 3 runs) is small for any null-hypothesis test, even with the very large effect sizes (Cohen's d ≈ 9–11) we observe. The Wilcoxon W = 0 results survive only because every paired difference goes the same direction; with more fixtures and more runs, edge cases that don't conform to the headline pattern become more likely. More fixtures, more runs, and longer trajectories would all tighten the picture.
- Architectural recall, not long-horizon supersession. The seven-session protocol tests recall under inter-session noise; it does not yet test mid-session contradiction resolution, supersedes events landing during a run, weeks-long stale-rationale decay, or evolving-rationale conflict management. A 50+ session protocol with supersedes events mid-stream is the natural extension and is flagged for follow-up.
- Author-constructed MemPalace queries. See §4.3. The 43–5 / 42–6 splits should not be cited without the caveat that the query set was constructed by the author and skews toward rationale lookups, and that judging was performed by a single LLM judge per query.
- In-loop timing is not the contribution. The May 2026 M2 ablation in §4.7 (continuity-perq-frontloaded vs continuity-in-loop, n=12 paired cells) finds mean Δ = −0.002 cosine units, p ≈ 0.05 slightly favoring the frozen condition. Eleven of twelve cells favor frozen-at-session-1 retrieval over fresh-per-session re-retrieval. The recall lift over blanket retrieval (d = 5.83) is therefore attributable to per-query targeted retrieval, not to per-tool-call re-firing. The demonstrated contribution of Continuity is execution-intent-conditioned retrieval delivered via middleware (good keying + zero-config delivery), not the temporal in-loop pattern. Headline copy that frames in-loop as a temporal innovation overstates the result.
- Seven-session horizon does not exhibit drift. Across all conditions in §4.2 and §4.7, |session 1 mean − session 7 mean| < 0.005 cosine units; mean drift slopes are within ±0.001 per session. Continuity raises the floor; it does not slow a downward trend, because no downward trend is present in this benchmark. The “decision drift prevention” framing in earlier drafts is replaced with “decision adherence” / “rationale recall floor lift.” A longer-horizon benchmark might reveal a timing effect this protocol cannot.
- Recall metric shape differs across sections. §4.2 reports the threshold-clearing fraction (% of questions clearing 0.7 cosine); §4.7 reports raw mean cosine. Same underlying scoring, different aggregation. Direct numerical comparison between §4.2 and §4.7 recall numbers requires re-bucketing the §4.7 raw scores; we have not yet published that bucketed view.
- Three base models from two vendors. GPT-4o and GPT-4o-mini (OpenAI) in §4.1/§4.2, plus claude-sonnet-4-6 (Anthropic) in §4.7. A Qwen2.5-7B condition was attempted but could not be completed within the time budget; stronger models, smaller open-weight models, and other vendors may behave differently.
- LLM-judge scores without human validation. Inter-judge agreement (§4.4) addresses whether two LLM judges agree, but not whether they agree with human annotators. A human-labeled gold subset is the single most useful next validation.
- Production middleware delivery shape underperforms the simulator. §4.8 establishes that `auto-middleware` (the production AutoRetrievalMiddleware tool-result `_meta` injection path) underperforms `mcp-search` and `agent-loop` (decisions delivered as the primary tool output) by ~0.16–0.19 cosine on a 20-question paired pilot against a real code-links-bearing workspace. The §4.7 simulator (decision text prepended to the prompt) approximates the cleaner search-delivery path, not the noisier `_meta`-injection path. The simulator's d=5.83 retrieval-keying contribution should therefore be read as the upper bound under clean delivery; production deployments where decisions render in a separate IDE UI lane may close the gap, but that rendering is not testable from the public-benchmark MCP transport. The headline finding (any retrieval ≫ no retrieval) survives this disclosure unchanged; only the §4.7 simulator-to-production transferability claim is scoped narrower.
- Partial pre-registration only. This benchmark was not formally pre-registered. The original paydash protocol had its metrics and thresholds declared before the run, but the v2 cross-corpus matrix added the M2 timing-ablation condition mid-run, after the original 3-condition design proved unable to dissect retrieval keying from injection timing (the v1 in-loop and continuity conditions executed byte-identical code, so the v2 expansion was a corrective rather than an exploratory move — but the M2 split itself was not pre-declared). The M2 ablation is therefore a post-hoc analysis, not a pre-registered test. We disclose this explicitly because it would normally be a credibility risk — and we mitigate it by reporting the M2 result in both directions of significance (frozen-favoring p ≈ 0.0499 by normal approximation, p ≈ 0.052 by exact Wilcoxon — both at the margin of significance), noting the result lands at the conventional threshold from either side. Future benchmark runs will use the `benchmarks/EVAL_PLAN.md` pre-registration template, with a worked retrospective for this matrix at `benchmarks/EVAL_PLAN.example-v2-cross-corpus.md`.
- Self-judging in the cross-corpus matrix. `claude-sonnet-4-6` is both the judge across all 24 §4.7 cells and the agent in 12 of them. LLM-as-judge self-preference (Zheng et al., 2023) would manifest as Sonnet rating its own outputs systematically higher. We measured this directly: on baseline cells, Sonnet rates Sonnet-agent +1.86 above GPT-4o-agent and Gemini independently rates +1.51 (judges agree within 0.35 — the agent-model gap on baseline is real, not self-preference). On continuity-conditioned cells, Sonnet rates Sonnet-agent +1.11 above GPT-4o-agent while Gemini rates them tied (+0.08); this ~1-point gap is a self-preference signal. The cross-judge headline (Continuity > baseline in every cell under both judges) survives because Gemini independently confirms the lift — but cross-model robustness sub-claims should be read with the disclosure that Sonnet's continuity-conditioned scores carry ~1 point of self-preference contribution. The fix going forward is to either rotate which model acts as judge across cells or report agent-stratified means alongside overall means.
- File-scoped decisions only. The pattern does not handle cross-cutting constraints well, and we do not claim it does.
- In-loop keying does not generalize beyond file-scoped rationale. The §4.9 LongMemEval-S subsample shows continuity-in-loop underperforming continuity-blanket by 18 percentage points overall (30% vs 48%, n=50 per condition, Gemini judge; replicated under GPT-4o re-judge with the same direction). Losses cluster on question types where the correct context is a specific past conversation turn rather than a stable architectural rationale (
single-session-user: 0% vs 75%;knowledge-update: 0% vs 37.5%). Continuity's execution-intent keying (file paths, mutation targets, entity references) does not exist in chat-history Q&A; the in-loop pattern provides no help when the answer's location doesn't correlate with the agent's tool-call signature. The contribution is a domain-specific improvement for file-scoped agentic coding, not a general-purpose memory-system improvement. - Context overhead at scale: per-call is modeled, per-session is a single measurement. We have not directly measured per-call overhead on projects with thousands of decisions; that scaling claim is a model based on the configured injection budget, not a benchmark. We have measured per-session token savings on Continuity's own dogfooded codebase (1,869 decisions, 56.5% reduction via
tiktoken), but this is a single observation on a single real project, not a controlled study, and should be read as suggestive rather than a generalized claim. See §4.5. - Reference implementation is not public. The Continuity middleware ships as part of a commercial product. Appendix B describes the pattern at a level sufficient for re-implementation; the production code is not released. Replication against alternative memory systems is supported via the public benchmark fixtures.
- One transient run failure (§4.2). GPT-4o-mini's recall-over-time cell on paydash-api uses n=2 runs instead of n=3 due to a transient API failure on run 2. The §4.7 cross-corpus matrix completed 24/24 cleanly after a credit-pause resume.
- Convergence-time runner not executed. Platnick et al.'s headline efficiency numbers (19% / 58% convergence reduction) come from a runner we did not run on our fixture; we do not have a comparable efficiency metric.
- No user study. All metrics are automated; we have not measured whether developers perceive Continuity-assisted agents as more trustworthy or more useful.
7. Conclusion
In-loop retrieval, as implemented in Continuity, is a practical engineering pattern for reducing a specific failure mode in coding agents: forgetting file-scoped project decisions during long interactions. It is not a general solution to agent memory. On single-prompt workloads, its benefit over passive retrieval is below the noise floor — the substantial lift (≈3×) comes from exposing the decisions to the agent at all, regardless of how retrieval is triggered. On multi-session workloads, in-loop retrieval raises a coverage floor that passive retrieval cannot, roughly doubling the fraction of recall questions cleared at a quality threshold. We release the benchmark fixtures and result data publicly so independent groups can replicate the evaluation against alternative memory systems, including their own.
References
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
Platnick, D., Bengueddache, M. E., Alirezaie, M., Newman, D. J., Pentland, A. “Sandy,” & Rahnama, H. (2025). ID-RAG: Identity Retrieval-Augmented Generation for Long-Horizon Persona Coherence in Generative Agents. arXiv:2509.25299.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
Appendix A: Methods
This appendix documents the experimental setup in enough detail to replicate the numbers in §4. Query sets, prompt fixtures, raw result JSONs, and scoring scripts are released at github.com/Alienfader/continuity-benchmarks.
A.1 Project fixture
The benchmark project is paydash-api, a synthetic Python web service with 19 recorded architectural decisions. Each decision record specifies affected files (as explicit paths or glob patterns) and a rationale string. Example decisions:
- “The billing/ module must not import from experimental/ — the experimental package has no stability guarantees and billing requires auditable behavior.”
- “All database writes go through db/session.py::write_guarded, which enforces a transaction scope. Direct SQLAlchemy session.add calls in service code are a bug.”
- “The /v2/ API surface is frozen for external consumers. New endpoints go under /v3/.”
A.2 Conditions
All three conditions share the same base models (GPT-4o, GPT-4o-mini; temperature 0.2) and the same file-touching tool schemas (read_file, edit_file, write_file, bash).
- Baseline. No memory mechanism. The agent is given the task but no access to project decisions.
- Continuity (Passive). Decisions are retrieved once at session start and included in the agent's initial context. No retrieval tool is exposed during the session.
- Continuity (In-Loop). The middleware described in §3 is active. Decisions are injected into every file-touching tool result's metadata block, keyed on affected paths.
A.3 Single-prompt action alignment (§4.1)
30 prompts, each requesting a file-scoped action where a decision is relevant. Prompts are held constant across conditions and models. Judge: Claude Sonnet 4.6 at temperature 0. Rubric: 10 = action fully conforms; 7 = minor deviation but overall aligned; 4 = partial violation; 1 = direct violation. Per-run alignment is the mean across all 30 scored actions. 3 runs per condition per model with different random seeds. Inter-judge agreement validation is reported separately in §4.4.
A.4 Multi-session recall (§4.2)
Each run consists of 7 sessions. Each session: (1) a task prompt requesting a code change, (2) approximately 5,000 tokens of off-topic technical noise from a fixed pool, (3) a 20-question recall quiz at the session boundary probing retention of decision rationale. Sessions share a single context within a run; no reset between sessions. Decisions are not re-stated in the prompt. Agent responses are embedded with all-mpnet-base-v2 and scored by cosine similarity against ground-truth rationale. 3 runs per condition per model; one GPT-4o-mini run failed with a transient API error so that cell uses n=2.
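For concreteness, the following is a minimal TypeScript sketch of the scoring step, assuming the @xenova/transformers package pinned in Appendix A.8 and the Xenova/all-mpnet-base-v2 checkpoint; the function names are illustrative, and the released scoring scripts remain the source of truth for the reported numbers.

```typescript
import { pipeline } from '@xenova/transformers';

// Evaluation-side embedder named in A.4 / A.8 (downloaded at first run).
const embedder = await pipeline('feature-extraction', 'Xenova/all-mpnet-base-v2');

async function embed(text: string): Promise<number[]> {
  // Mean-pooled, L2-normalized sentence embedding.
  const output = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array);
}

function cosine(a: number[], b: number[]): number {
  // Inputs are already unit-normalized, so the dot product is the cosine similarity.
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

export async function scoreAnswer(answer: string, rationale: string): Promise<number> {
  const [a, b] = await Promise.all([embed(answer), embed(rationale)]);
  return cosine(a, b); // compared against the 0.7 threshold reported in §4.2
}
```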
A.5 Head-to-head vs. MemPalace (§4.3)
50 natural-language queries (full list in benchmarks/src/head-to-head.ts, released in the public benchmark repo) run against the Continuity production workspace: 1,894 architectural decisions plus approximately 1,560 indexed source files (.ts, .js, .json, .md, .yml, .yaml, ≤100 KB each, excluding node_modules, dist, .git, coverage). Roughly two-thirds of queries target decision rationale; the remaining third target code structure or implementation details.
Continuity uses its SemanticSearchService (RRF hybrid: semantic embeddings + keyword + tag fusion) plus a self-contained project file walker. MemPalace uses its default configuration (full-codebase chunking into ChromaDB) invoked via its CLI binary. Both systems run against the same workspace on the same hardware. Each system returns its top-5 results to the judge.
Judging is performed by a single LLM call per query (Anthropic Claude Sonnet) which returns relevance scores in [0,1] for each system and picks a winner (continuity / mempalace / tie). Wake-up latency is wall-clock time from query submission to first usable result; wake-up tokens are an estimate via Math.ceil(joinedResults.length / 4).
Caveats specific to this experiment: (1) single LLM judge per query, no inter-judge cross-validation on this run — given the Spearman ρ of 0.788 between Sonnet and Gemini-2.5-flash on the larger ID-RAG action-alignment set (§4.4), expect roughly ±1 query of judge-variance noise per 50-query batch; (2) the workspace is Continuity's own production codebase, which advantages Continuity's decision-store-centric architecture and disadvantages MemPalace's source-file-centric architecture on rationale-heavy query mixes; (3) token estimates are approximate.
A.6 Inter-judge validation (§4.4)
All 540 action-alignment responses (30 prompts × 3 runs × 3 conditions × 2 models) originally scored by Claude Sonnet 4.6 were re-scored by Gemini 2.5 Flash at temperature 0. Spearman ρ, Cohen's linear-weighted κ, and per-judge means are reported over the 540 paired scores. Caveats: Gemini saw the full 19-decision fixture per prompt; Sonnet saw top-5 retrieved per prompt. This is an “action quality given the question” agreement measurement, not strict judge-replaceability.
A.7 Replication
The public benchmark repo contains the artifacts required to verify the numbers in §4 and to run comparable evaluations against alternative memory systems: the paydash-api fixture and its 19-decision store, the 30 action-alignment prompts and ground-truth decisions, the 20-question recall quiz items and ground-truth rationales, the 50 head-to-head queries with per-query relevance rankings, raw per-run result JSONs, scoring scripts (cosine similarity, LLM-as-judge prompt templates, blinded A/B pair generation), and both LLM judges' raw output for the 540 inter-judge pairs.
A third party wishing to replicate the experiments against a different memory system can use the fixture, prompts, and scoring scripts unchanged. The Continuity middleware itself is proprietary; the pattern is described in §3 and Appendix B at a level sufficient for re-implementation. Submissions from alternative systems are welcomed via the public leaderboard at benchmarks/LEADERBOARD.md in the public benchmark repo — fork, run the matrix protocol against your system, add a row, and open a PR. Maintainers review for protocol-conformance, not for ranking.
Reproducibility scope
Not every result table in this paper is fully reproducible from the public benchmarks repo. The matrix below lists what a third party can verify directly versus what is currently internal-only:
| Section | Result | Public-repo artifacts | Reproducible? |
|---|---|---|---|
| §4.1, §4.2 | Paydash action alignment + multi-session recall | runners/action-alignment.ts, runners/recall-over-time.ts, reports/id-rag-parity/ | Yes |
| §4.3 | Head-to-head vs MemPalace (50 queries) | Runner vendored at runners/head-to-head.ts; imports @continuity/core's SemanticSearchService, which is not currently published | Partial — artifacts public; end-to-end rerun requires private dep |
| §4.4 | Inter-judge replication, paydash (n=540) + cross-corpus (n=1,080) | runners/re-judge.py, runners/re-judge-cross-corpus.py, both inter-judge JSONs | Yes |
| §4.6 | Project Chronos (RAGAS, RGB, O(1) scaling, LongMemEval, ANN latency) | Numbers from internal evaluation; raw artifacts and runners are not in the public repo | Internal only — vendoring is roadmap |
| §4.7 | v2 cross-corpus 24-cell matrix + M2 ablation + bootstrap CIs | runners/recall-over-time.ts (4 conditions wired in shared/retrieval.ts), reports/id-rag-parity-v2/, runners/bootstrap-ci.py | Yes |
| §4.8 | Production-middleware delivery replay (3 modes against a real workspace) | runners/middleware-replay.ts, runners/shared/{agent-client.ts,mcp-client.ts}, scripts/run-middleware-pilot.sh, scripts/analyze-middleware-pilot.py. auto-middleware mode requires the MCP server built with CONTINUITY_BENCHMARK_MODE=1 (the runner forwards this env var) and a workspace with code-links.json. | Yes |
The §4.3 and §4.6 partial / internal-only entries reflect real limits: §4.3 depends on the production @continuity/core package, and §4.6 (Project Chronos) was run on internal infrastructure that has not yet been packaged for public replication. Vendoring those into the public repo is on the roadmap but not in the current release.
For exact-environment reproduction, the public benchmark repository ships a Dockerfile and docker-compose.yml that pin Node 18, tsx 4.19.2, Python 3.10+, and all transitive npm dependencies. Run scripts/docker-run.sh smoke to verify the install end-to-end with no host-side dependencies beyond Docker. The container exposes every npm run bench:* script as a documented entrypoint.
A.8 Pinned runtime versions
All numbers in §4.1/§4.2 (paydash matrix), §4.4 (inter-judge replication), §4.7 (cross-corpus matrix + M2 ablation), and the §4.3 head-to-head contrast were generated under the following stack:
- Node v18.19.0 (any v18+ should reproduce within numerical noise)
- tsx 4.19.2 (replaces ts-node for runner compilation)
- Python 3.10.13 (for the analysis + re-judge scripts)
- @anthropic-ai/sdk 0.78.x, @xenova/transformers 2.17.x, tiktoken 1.0.22
- all-MiniLM-L6-v2 retrieval embeddings; all-mpnet-base-v2 evaluation embeddings (both via @xenova/transformers, downloaded at first run)
- macOS 25.x and Linux x86_64 both verified.
Appendix B: Pattern Description
The Continuity middleware implements the following high-level behavior. This description is sufficient for an independent re-implementation; the production code is not released.
Initialization
Load the decision store. Each decision has at minimum an identifier, one or more affected-path patterns (globs over the project's file tree), and a rationale string.
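A minimal decision-record shape consistent with this description, as a TypeScript sketch; the field names are illustrative, not the production schema.

```typescript
// Minimal decision-record shape implied by the description above.
// Field names are illustrative; the production schema is not published.
interface Decision {
  id: string;               // stable identifier, e.g. "DEC-0042"
  affectedPaths: string[];  // glob patterns over the project tree, e.g. ["billing/**"]
  rationale: string;        // the text injected into the agent's context
  updatedAt?: string;       // ISO timestamp; feeds the recency component of ranking
}

type DecisionStore = Decision[];
```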
On every file-touching tool call
Before the tool result is returned to the agent, extract the set of file paths the tool call references. For each decision in the store, test whether any of its affected-path patterns matches any of the touched paths. Collect all matched decisions.
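A sketch of the matching step, using the Decision shape above and assuming the minimatch glob library; the production matcher is not published.

```typescript
import { minimatch } from 'minimatch';

// Collect every decision whose affected-path patterns match at least one
// path referenced by the current tool call.
function matchDecisions(store: DecisionStore, touchedPaths: string[]): Decision[] {
  return store.filter((decision) =>
    decision.affectedPaths.some((pattern) =>
      touchedPaths.some((path) => minimatch(path, pattern))
    )
  );
}
```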
Ranking and budget
If the matched set is small (under a configurable threshold, ~50 decisions in the production implementation), include all matched decisions. Otherwise, rank by reciprocal rank fusion over (a) path-match specificity and (b) recency, and select a top-k that fits within a configurable token budget (default 500 tokens).
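A sketch of the ranking-and-budget step under stated assumptions: path-match specificity is approximated by glob length, recency by an updatedAt timestamp, token cost by the same chars/4 estimate used in Appendix A.5, and k = 60 is the conventional RRF constant rather than a documented production value.

```typescript
// Rank matched decisions by reciprocal rank fusion over path-match
// specificity and recency, then keep as many as fit the token budget.
// Called only when the matched set exceeds the include-all threshold (~50).
function rankAndBudget(matched: Decision[], tokenBudget = 500): Decision[] {
  const k = 60; // conventional RRF constant; not a documented production value

  // Specificity proxy: a longer glob is treated as more specific.
  const bySpecificity = [...matched].sort(
    (a, b) =>
      Math.max(...b.affectedPaths.map((p) => p.length)) -
      Math.max(...a.affectedPaths.map((p) => p.length))
  );
  const byRecency = [...matched].sort((a, b) =>
    (b.updatedAt ?? '').localeCompare(a.updatedAt ?? '')
  );

  const rrfScore = (d: Decision) =>
    1 / (k + bySpecificity.indexOf(d) + 1) + 1 / (k + byRecency.indexOf(d) + 1);

  const ranked = [...matched].sort((a, b) => rrfScore(b) - rrfScore(a));

  // Greedily fill the token budget using a rough chars/4 token estimate.
  const selected: Decision[] = [];
  let used = 0;
  for (const d of ranked) {
    const cost = Math.ceil(d.rationale.length / 4);
    if (used + cost > tokenBudget) break;
    selected.push(d);
    used += cost;
  }
  return selected;
}
```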
Injection
Prepend the selected decisions to the tool result in a metadata block that the agent's prompt template surfaces as context. Decisions appear with their identifier and rationale; other fields are reserved for tooling.
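A sketch of the injection step; the exact metadata-block markup the production prompt template surfaces is not published, so the delimiters below are illustrative.

```typescript
// Prepend selected decisions to the tool result as a metadata block that
// the agent's prompt template surfaces as context. The delimiters are
// illustrative; the production block format is not published.
function injectDecisions(toolResult: string, decisions: Decision[]): string {
  if (decisions.length === 0) return toolResult;
  const block = decisions.map((d) => `[${d.id}] ${d.rationale}`).join('\n');
  return `<project-decisions>\n${block}\n</project-decisions>\n\n${toolResult}`;
}
```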
Caching
Within a single agent session, cache lookups by touched-path set so repeated tool calls on the same paths do not re-rank.
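A sketch of the session cache, keyed on the sorted touched-path set and reusing the matching and ranking helpers above; the ~50-decision include-all threshold from the ranking step is applied before ranking.

```typescript
// Session-scoped cache keyed on the touched-path set, so repeated tool
// calls on the same paths skip matching and ranking.
const sessionCache = new Map<string, Decision[]>();

function lookupDecisions(store: DecisionStore, touchedPaths: string[]): Decision[] {
  const key = [...touchedPaths].sort().join('|');
  const cached = sessionCache.get(key);
  if (cached) return cached;

  const matched = matchDecisions(store, touchedPaths);
  // Below the include-all threshold, skip ranking entirely (see above).
  const result = matched.length < 50 ? matched : rankAndBudget(matched);
  sessionCache.set(key, result);
  return result;
}
```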
Tool coverage
The pattern applies to any tool call whose arguments reference file paths: read_file, edit_file, write_file, bash (with paths parsed from argv tokens that exist on disk or match known globs), and any equivalents in other tool taxonomies.
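A sketch of the bash-argument handling, keeping argv tokens that resolve to something on disk; whitespace tokenization and the leading-dash filter are simplifications, as the production parser (which also checks tokens against known globs) is not published.

```typescript
import { existsSync } from 'node:fs';
import { join } from 'node:path';

// Extract candidate file paths from a bash command by keeping argv tokens
// that resolve to something on disk. Whitespace tokenization and the
// leading-dash filter are simplifications of the production parser.
function pathsFromBash(command: string, projectRoot: string): string[] {
  return command
    .split(/\s+/)
    .filter((token) => token.length > 0 && !token.startsWith('-'))
    .filter((token) => existsSync(join(projectRoot, token)) || existsSync(token));
}
```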
The reference implementation in the commercial product is approximately 180 lines of TypeScript including the production-grade ranking, caching, and budget logic. A minimal proof-of-concept implementation following the description above can be written in roughly 50 lines of any modern language.