Whitepaper

Continuity: An In-Loop Retrieval Pattern for Persistent Project Rationale in AI Coding Agents

Author: Thiago Goncalves
Version: 7.1
Date: April 26, 2026

Abstract

AI coding assistants routinely violate project-specific architectural decisions during long-running tasks, even when those decisions have been explicitly recorded. Existing memory systems treat retrieval as an opt-in action the model must choose to perform, which fails when the model does not recognize that retrieval is warranted. We describe Continuity, an engineering pattern that injects relevant project rationale into tool-call responses automatically via middleware, removing the model's need to self-initiate retrieval for a bounded class of file-scoped decisions. We evaluate Continuity against a no-memory baseline and a passive (opt-in) retrieval variant on two runners — single-prompt action alignment (n=30 per condition, 3 runs, 2 models) and multi-session recall (7 sessions × 20-question quizzes, 3 runs, 2 models) — and against MemPalace on a 50-query head-to-head. Exposing decisions to the agent lifts action alignment from 2.82/10 to 8.77/10 (GPT-4o; 2.88 → 8.22 on GPT-4o-mini), a roughly 3× jump. Automatic in-loop injection does not further improve single-prompt alignment over passive retrieval (8.61 vs. 8.77; within run-to-run variance), but roughly doubles the fraction of multi-session recall questions clearing a 0.7 cosine-similarity threshold (23% → 55% on GPT-4o; 28% → 50% on GPT-4o-mini). Inter-judge validation with a second LLM judge (Gemini 2.5 Flash) confirms the direction of all findings; Spearman ρ = 0.788 across 540 paired scores. This work is an engineering report on a commercial product. The reference implementation is proprietary; we release benchmark fixtures, prompts, and raw result data to enable independent replication against alternative memory systems.

1. Introduction

Long-context degradation in LLMs — variously called “lost in the middle” (Liu et al., 2023), context rot, or instruction decay — is well documented. In the coding-agent setting, it contributes to a specific and costly failure: the agent takes actions inconsistent with architectural decisions it was previously told about. Standard retrieval-augmented generation (RAG) provides a partial remedy: give the model a search tool, store decisions in a structured store, and let the model query when relevant. In practice this fails for a simple reason — the model must first recognize that a query is warranted. For architectural constraints attached to specific files (e.g., “this module must not import from experimental/”), the trigger for retrieval is the act of editing the file, not a semantic signal the model is likely to notice mid-task.

This paper describes Continuity, an implementation of what we call the in-loop retrieval pattern: retrieval that is triggered by the agent's tool calls rather than by the agent's explicit choice. The contribution is an engineering report on a commercial product and an honest evaluation of its trade-offs. Our headline findings are narrower than the pattern's apparent promise:

  1. Exposing decisions to the agent at all matters enormously. Both retrieval conditions lift single-prompt action alignment roughly 3× over a no-memory baseline.
  2. In-loop injection is equivalent to passive retrieval on single-prompt benchmarks. When the relevant decisions are served inline with every prompt, automatic re-firing is redundant.
  3. In-loop injection is clearly better over multi-session workloads. The fraction of recall questions clearing a quality threshold roughly doubles.

We did not observe the specific failure mode some prior framings call “decision drift” on our benchmark — all three conditions show approximately flat alignment across seven sessions. The multi-session benefit of in-loop retrieval in our data comes from coverage (the agent sees decisions it would not have thought to query for) rather than from arresting drift.

3. The Continuity Pattern

Continuity consists of three components (Figure 1):

1. A decision store

A structured, append-mostly record of project decisions — typically one decision per record, each with a rationale, affected files/globs, and a timestamp.

2. A middleware layer

A shim between the agent and its tool executor. Before a file-touching tool call (read, edit, write, bash with file arguments) returns to the model, the middleware looks up which decisions are linked to the affected path(s).

3. An injection step

Matched decisions are prepended to the tool result in a metadata block, so the agent sees the rationale in the same turn as the tool output, without having to ask for it.

Figure 1: Continuity architecture. Middleware sits between the agent and its tool executor, intercepting file-touching tool calls, matching their file paths against the decision store, and prepending matched decisions to the tool result before it is returned to the agent.

The design trade-off is explicit: in-loop retrieval reduces the agent's retrieval decision cost to zero for file-scoped decisions, at the cost of (a) requiring that decisions be indexable by file path and (b) adding per-tool-call overhead. Decisions that are not tied to specific files (e.g., cross-cutting style conventions) are not well served by this pattern and fall back to conventional RAG.
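To make the pattern concrete, the sketch below shows one plausible decision-record shape and the kind of metadata block the agent might see prepended to a tool result. Field names, block delimiters, and the TypeScript phrasing are illustrative assumptions for exposition, not the product's actual schema; the example rationale is taken from the paydash-api fixture (Appendix A.1).

```typescript
// Illustrative sketch only: field names and the metadata-block format are
// assumptions for exposition, not the product's actual schema.
interface Decision {
  id: string;        // stable identifier, e.g. "D-007" (hypothetical)
  paths: string[];   // affected files or glob patterns, e.g. ["billing/**"]
  rationale: string; // human-readable reason the constraint exists
  createdAt: string; // ISO-8601 timestamp
}

// What the agent might see prepended to the result of a read on
// billing/invoice.py (delimiters are illustrative):
//
//   [project-decisions]
//   D-007: The billing/ module must not import from experimental/ — the
//   experimental package has no stability guarantees and billing requires
//   auditable behavior.
//   [/project-decisions]
//   ...original read_file output follows...
```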

The reference implementation is approximately 180 lines of TypeScript and ships as part of the commercial Continuity product. The implementation handles glob-based path matching, decision ranking for projects with more than ~50 decisions, a per-session lookup cache, and a configurable injection budget. A high-level pseudocode sketch appears in Appendix B. Full implementation details are not part of this paper; the pattern is described at a level of abstraction sufficient for replication.

4. Evaluation

We report three experiments, all on the paydash-api fixture (a Python web-service project with 19 recorded architectural decisions). Each experiment's protocol was adapted from the corresponding ID-RAG Parallel runner (Platnick et al., 2025); methods are in Appendix A. Query sets, prompt fixtures, raw result JSONs, and scoring scripts are released at github.com/Alienfader/continuity-benchmarks so a third party can replicate the evaluation against their own memory system.

4.1 Single-prompt action alignment

For each of 30 prompts, the agent is asked to take an action on a file with active decisions. An LLM judge (Claude Sonnet 4.6, temperature 0) scores whether the action conforms to the relevant decisions on a 1–10 rubric. Three runs per condition.

Condition                  GPT-4o (mean, stdev)    GPT-4o-mini (mean, stdev)
Baseline (no retrieval)    2.82 (±0.07)            2.88 (±0.16)
Continuity (Passive)       8.77 (±0.00)            8.22 (±0.09)
Continuity (In-Loop)       8.61 (±0.07)            8.13 (±0.18)

Both retrieval conditions raise alignment by roughly 3× over baseline on both models. The Passive and In-Loop variants are within run-to-run variance of each other on both models — the in-loop mechanism provides no measurable benefit here. We interpret this as the expected outcome: action alignment is measured per-prompt, and passive retrieval already serves the relevant decisions for each prompt, so automatic re-firing is redundant.

4.2 Multi-session recall

Over seven sessions with approximately 5,000 tokens of off-topic noise injected between sessions, a 20-question recall quiz probes retention of decision rationale. Responses are embedded and scored against ground truth by cosine similarity (all-mpnet-base-v2). Three runs per condition per model (two for GPT-4o-mini on this runner — one run failed with a transient API error).

Condition              GPT-4o mean cosine    GPT-4o-mini mean cosine    GPT-4o frac ≥ 0.7    GPT-4o-mini frac ≥ 0.7
Baseline               0.519 (±0.001)        0.514 (±0.001)             12%                  13%
Continuity (Passive)   0.600 (±0.003)        0.589 (±0.001)             23%                  28%
Continuity (In-Loop)   0.693 (±0.002)        0.691 (±0.001)             55%                  50%
Figure 2: Benchmark results across two models (GPT-4o and GPT-4o-mini), showing mean cosine similarity and the fraction of recall questions clearing a 0.7 threshold for the Baseline, Passive, and In-Loop conditions. In-Loop roughly doubles the fraction clearing the threshold relative to Passive retrieval.

Both retrieval conditions improve recall over baseline. Unlike the action-alignment results, here the in-loop variant is clearly better than passive: the fraction of questions clearing a 0.7 cosine threshold roughly doubles on both models (23% → 55% on GPT-4o; 28% → 50% on GPT-4o-mini). We report the threshold metric alongside mean cosine because the threshold is more interpretable — it measures how many questions are answered well enough to matter rather than averaging across all questions.
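For concreteness, a minimal sketch of the threshold metric follows, assuming the response and ground-truth embeddings have already been computed (e.g. with all-mpnet-base-v2); the released scoring scripts may differ in detail.

```typescript
// Sketch of the threshold metric over precomputed embeddings.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Fraction of quiz questions whose response clears the similarity threshold.
function fractionAboveThreshold(
  responses: number[][],
  groundTruth: number[][],
  threshold = 0.7,
): number {
  const hits = responses.filter((r, i) => cosine(r, groundTruth[i]) >= threshold);
  return hits.length / responses.length;
}
```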

Two caveats on this experiment:

First, all three conditions show approximately flat alignment across the seven sessions (drift slopes under 0.003 per session in absolute terms). The paydash-api fixture's 19 decisions do not induce the kind of session-over-session drift Platnick et al. observe in persona-grounded agents, so we cannot make the claim that Continuity arrests drift. What our data supports is that in-loop retrieval raises a floor: because relevant decisions arrive with every file-touching call, the agent cannot miss them due to failing to query. This is a coverage benefit, not a drift-reduction benefit. On a different fixture with genuinely drift-prone content, the picture might differ; we have not tested this.

Second, absolute cosine numbers are not apples-to-apples with those reported in Platnick et al. (different ground truth, different retrieval target, same embedding scale). What is comparable is the magnitude of the lift, which is of the same order in both studies.

4.3 Head-to-head vs. MemPalace

We ran 50 queries against Continuity (both Passive and In-Loop configurations) and MemPalace on the same project. MemPalace mines the full codebase into ChromaDB; Continuity queries its decision store plus a lightweight file index via reciprocal rank fusion.

Metric                               Continuity (Passive)   Continuity (In-Loop)   MemPalace (vs. Passive / vs. In-Loop)
Query wins (N=50)                    44                     43                     5 / 4
Ties                                 1                      3                      —
Mean relevance (human-judged, 0–1)   0.86                   0.86                   0.61 / 0.60
Wake-up latency                      14 ms                  30 ms                  2,807 / 3,950 ms
Wake-up tokens                       726                    801                    817

The latency gap is real and reflects a genuine architectural difference: Continuity does not index the full codebase. The relevance gap is more contingent. Our query set was constructed by the author and skews toward the kind of project-rationale lookups Continuity is designed for (“why does module X do Y?”), underweighting queries about code structure or implementation details where MemPalace's full-codebase index does better — MemPalace's wins cluster on source-tree queries (webpack config, CLI commands, benchmark result files, offline operation). The split is therefore not evidence that Continuity is uniformly superior; it is evidence that on rationale-centric queries, a small targeted index beats a large general one. We recommend treating this comparison as suggestive and welcome a third-party query set.

This experiment also provides a second, independent confirmation of a finding from §4.1: Passive and In-Loop Continuity are statistically indistinguishable on single-query retrieval quality (44 vs. 43 wins, identical mean relevance). In-loop injection's value is not in how each retrieval scores but in when it fires.

4.4 Inter-judge validation

All 540 action-alignment responses from §4.1 were re-scored by a second LLM judge from a different vendor (Gemini 2.5 Flash). Aggregate agreement statistics:

Metric                          Value        Interpretation
N                               540 pairs    100% parseable
Spearman ρ (rank correlation)   0.788        Strong
Cohen's κ (linear-weighted)     0.518        Moderate (per Landis–Koch)
Sonnet mean                     6.57         —
Gemini mean                     8.01         Gemini systematically +1.44 points more generous

The combination of strong rank correlation with moderate absolute agreement is the canonical “same signal, different calibration” pattern. The 3× baseline-to-retrieval lift reported in §4.1 holds under both judges; the “Passive ≈ In-Loop on single prompts” finding also holds under both judges. Absolute scores should be read per-judge, not aggregated across judges. Two caveats: the two judges saw slightly different context (Gemini saw the full 19-decision fixture per prompt; Sonnet saw top-5 retrieved), so this is “action quality given the question” agreement rather than strict judge-replaceability; and both judges could share a systematic bias a human panel would not.
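For readers replicating the agreement analysis, a minimal sketch of the linear-weighted κ computation follows, assuming integer scores on the 1–10 rubric; the released analysis code may differ in detail.

```typescript
// Sketch of linear-weighted Cohen's kappa for two judges scoring on a
// 1..k integer rubric (k = 10 here). Assumes a.length === b.length.
function linearWeightedKappa(a: number[], b: number[], k = 10): number {
  const n = a.length;
  // Observed joint distribution and per-judge marginals.
  const obs: number[][] = Array.from({ length: k }, () => new Array(k).fill(0));
  const pa = new Array(k).fill(0);
  const pb = new Array(k).fill(0);
  for (let i = 0; i < n; i++) {
    obs[a[i] - 1][b[i] - 1] += 1 / n;
    pa[a[i] - 1] += 1 / n;
    pb[b[i] - 1] += 1 / n;
  }
  let observedDisagreement = 0;
  let expectedDisagreement = 0;
  for (let i = 0; i < k; i++) {
    for (let j = 0; j < k; j++) {
      const w = Math.abs(i - j) / (k - 1); // linear disagreement weight
      observedDisagreement += w * obs[i][j];
      expectedDisagreement += w * pa[i] * pb[j];
    }
  }
  return 1 - observedDisagreement / expectedDisagreement;
}
```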

4.5 Context overhead

Continuity's per-call overhead is bounded by the size of the matched decision set, which does not grow with total decision count. The short-circuit threshold in the production implementation is approximately 50 decisions: below it, all linked decisions are injected without ranking; above it, reciprocal rank fusion selects a top-k within a configurable token budget (default 500 tokens).

We distinguish two related but separate quantities here — per-call overhead and per-session token savings — both of which we report only at a coarse level in this paper:

Per-call overhead (modeled). The number of additional tokens injected on a single file-touching tool call is bounded by the configured budget (≤500 tokens by default) and is approximately constant in total decision count. We have not directly measured per-call overhead at scale (i.e. with thousands of decisions in production); any scaling figure derived from the budget cap is a model, not a measurement. We do not claim a per-session token savings figure in this paper.

Retraction. A prior version of this paper, and accompanying marketing copy on the Continuity website, reported a 56.5% per-session token reduction measured with tiktoken on Continuity's own dogfooded codebase (1,864 decisions). On review, that figure was traced to a single hand-constructed bundle comparison rather than a per-session study, and is not substantiated by the public benchmark repository. The 56.5% figure has been retracted pending a controlled per-session measurement; the surrounding 25.7× / 96.1% scaling-model figures derived from the same artifact have been removed from the website.

5. Discussion

The honest summary of our results is narrower than the pattern's apparent promise. In-loop retrieval buys two concrete things over a passive retrieval baseline:

  1. Higher coverage on multi-session workloads. The agent is reminded of decisions it would not have known to query for, roughly doubling the fraction of recall questions answered well enough to clear a quality threshold.
  2. Elimination of the retrieval decision. The agent cannot fail to see a decision because it did not think to query for one.

What it does not buy, on our benchmarks:

  1. Higher alignment on single-prompt workloads. When the agent is handed the relevant decisions inline with each prompt, automatic re-firing is redundant. Passive and In-Loop tie on single-prompt alignment (§4.1) and on single-query retrieval quality (§4.3).
  2. Drift reduction per se. On the paydash-api fixture, none of the three conditions show meaningful drift across seven sessions. The in-loop benefit in §4.2 comes from coverage, not from arresting a decline.

Under what conditions might the passive-vs-in-loop gap widen beyond what we observe? We hypothesize three: (a) when the agent is under time or token pressure and skips optional tool calls, (b) when the decision's trigger is subtle (a rename, a refactor of a dependency) and less likely to prompt a query, and (c) on content that is genuinely drift-prone — persona facts, long-horizon plans — rather than static architectural constraints. We have not tested any of these and flag them as the natural next experiments.

6. Limitations

  • Single project fixture. All numbers are from paydash-api (19 decisions). Two other planned fixtures (ml-platform, infra-platform) were dropped for time. Generalization to larger decision counts, other languages, and non-synthetic projects is untested.
  • Author-constructed MemPalace queries. See §4.3. The 43–4 / 44–5 splits should not be cited without the caveat that the query set was constructed by the author and skews toward rationale lookups.
  • Two base models from one vendor. GPT-4o and GPT-4o-mini. A Qwen2.5-7B condition was attempted but could not be completed within the time budget; stronger models or smaller open-weight models may behave differently.
  • LLM-judge scores without human validation. Inter-judge agreement (§4.4) addresses whether two LLM judges agree, but not whether they agree with human annotators. A human-labeled gold subset is the single most useful next validation.
  • File-scoped decisions only. The pattern does not handle cross-cutting constraints well, and we do not claim it does.
  • Context overhead is modeled, not measured at scale. We have not directly measured per-call overhead on projects with thousands of decisions; that scaling claim is a model based on the configured injection budget, not a benchmark. See §4.5.
  • Reference implementation is not public. The Continuity middleware ships as part of a commercial product. Appendix B describes the pattern at a level sufficient for re-implementation; the production code is not released. Replication against alternative memory systems is supported via the public benchmark fixtures.
  • One transient run failure. GPT-4o-mini's recall-over-time cell uses n=2 runs instead of n=3 due to a transient API failure on run 2.
  • Convergence-time runner not executed. Platnick et al.'s headline efficiency numbers (19% / 58% convergence reduction) come from a runner we did not run on our fixture; we do not have a comparable efficiency metric.
  • No user study. All metrics are automated; we have not measured whether developers perceive Continuity-assisted agents as more trustworthy or more useful.

7. Conclusion

In-loop retrieval, as implemented in Continuity, is a practical engineering pattern for reducing a specific failure mode in coding agents: forgetting file-scoped project decisions during long interactions. It is not a general solution to agent memory. On single-prompt workloads, its benefit over passive retrieval is below the noise floor — the substantial lift (≈3×) comes from exposing the decisions to the agent at all, regardless of how retrieval is triggered. On multi-session workloads, in-loop retrieval raises a coverage floor that passive retrieval cannot, roughly doubling the fraction of recall questions cleared at a quality threshold. We release the benchmark fixtures and result data publicly so independent groups can replicate the evaluation against alternative memory systems, including their own.

References

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.

Platnick, D., Bengueddache, M. E., Alirezaie, M., Newman, D. J., Pentland, A. “Sandy,” & Rahnama, H. (2025). ID-RAG: Identity Retrieval-Augmented Generation for Long-Horizon Persona Coherence in Generative Agents. arXiv:2509.25299.

Appendix A: Methods

This appendix documents the experimental setup in enough detail to replicate the numbers in §4. Query sets, prompt fixtures, raw result JSONs, and scoring scripts are released at github.com/Alienfader/continuity-benchmarks.

A.1 Project fixture

The benchmark project is paydash-api, a synthetic Python web service with 19 recorded architectural decisions. Each decision record specifies affected files (as explicit paths or glob patterns) and a rationale string. Example decisions:

  • “The billing/ module must not import from experimental/ — the experimental package has no stability guarantees and billing requires auditable behavior.”
  • “All database writes go through db/session.py::write_guarded, which enforces a transaction scope. Direct SQLAlchemy session.add calls in service code are a bug.”
  • “The /v2/ API surface is frozen for external consumers. New endpoints go under /v3/.”

A.2 Conditions

All three conditions share the same base models (GPT-4o, GPT-4o-mini; temperature 0.2) and the same file-touching tool schemas (read_file, edit_file, write_file, bash).

  • Baseline. No memory mechanism. The agent is given the task but no access to project decisions.
  • Continuity (Passive). Decisions are retrieved once at session start and included in the agent's initial context. No retrieval tool is exposed during the session.
  • Continuity (In-Loop). The middleware described in §3 is active. Decisions are injected into every file-touching tool result's metadata block, keyed on affected paths.

A.3 Single-prompt action alignment (§4.1)

30 prompts, each requesting a file-scoped action where a decision is relevant. Prompts are held constant across conditions and models. Judge: Claude Sonnet 4.6 at temperature 0. Rubric: 10 = action fully conforms; 7 = minor deviation but overall aligned; 4 = partial violation; 1 = direct violation. Per-run alignment is the mean across all 30 scored actions. 3 runs per condition per model with different random seeds. Inter-judge agreement validation is reported separately in §4.4.

A.4 Multi-session recall (§4.2)

Each run consists of 7 sessions. Each session: (1) a task prompt requesting a code change, (2) approximately 5,000 tokens of off-topic technical noise from a fixed pool, (3) a 20-question recall quiz at the session boundary probing retention of decision rationale. Sessions share a single context within a run; no reset between sessions. Decisions are not re-stated in the prompt. Agent responses are embedded with all-mpnet-base-v2 and scored by cosine similarity against ground-truth rationale. 3 runs per condition per model; one GPT-4o-mini run failed with a transient API error so that cell uses n=2.

A.5 Head-to-head vs. MemPalace (§4.3)

50 natural-language queries constructed by the author over the paydash-api fixture, released in the public benchmark repo. Roughly two-thirds target decision rationale; the remaining third target code structure. Continuity v6.0 (decision store + file-name BM25 + reciprocal rank fusion) and MemPalace's default configuration (full-codebase chunking into ChromaDB with text-embedding-3-small) run on the same project on the same hardware. For each query each system returns its top result; the author ranked the two head-to-head on relevance, blinded via random A/B labels assigned by a wrapper script, with the mapping revealed only after all 50 queries were scored. Wake-up latency is wall-clock time from query submission to first usable result.
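A minimal sketch of the blinding step, under the assumption that each query has exactly one result from each system; the actual wrapper script may differ.

```typescript
// Sketch only; names and shapes are illustrative, not the released script.
interface BlindedPair {
  query: string;
  A: string;
  B: string;
}

function blindPairs(
  results: { query: string; continuity: string; memPalace: string }[],
): { pairs: BlindedPair[]; mapping: boolean[] } {
  // mapping[i] === true means label "A" holds the Continuity result for
  // query i; the mapping stays sealed until all queries are scored.
  const mapping: boolean[] = [];
  const pairs = results.map((r) => {
    const continuityIsA = Math.random() < 0.5;
    mapping.push(continuityIsA);
    return {
      query: r.query,
      A: continuityIsA ? r.continuity : r.memPalace,
      B: continuityIsA ? r.memPalace : r.continuity,
    };
  });
  return { pairs, mapping };
}
```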

A.6 Inter-judge validation (§4.4)

All 540 action-alignment responses (30 prompts × 3 runs × 3 conditions × 2 models) originally scored by Claude Sonnet 4.6 were re-scored by Gemini 2.5 Flash at temperature 0. Spearman ρ, Cohen's linear-weighted κ, and per-judge means are reported over the 540 paired scores. Caveats: Gemini saw the full 19-decision fixture per prompt; Sonnet saw top-5 retrieved per prompt. This is an “action quality given the question” agreement measurement, not strict judge-replaceability.

A.7 Replication

The public benchmark repo contains the artifacts required to verify the numbers in §4 and to run comparable evaluations against alternative memory systems: the paydash-api fixture and its 19-decision store, the 30 action-alignment prompts and ground-truth decisions, the 20-question recall quiz items and ground-truth rationales, the 50 head-to-head queries with per-query relevance rankings, raw per-run result JSONs, scoring scripts (cosine similarity, LLM-as-judge prompt templates, blinded A/B pair generation), and both LLM judges' raw output for the 540 inter-judge pairs.

A third party wishing to replicate the experiments against a different memory system can use the fixture, prompts, and scoring scripts unchanged. The Continuity middleware itself is proprietary; the pattern is described in §3 and Appendix B at a level sufficient for re-implementation.

Appendix B: Pattern Description

The Continuity middleware implements the following high-level behavior. This description is sufficient for an independent re-implementation; the production code is not released.

Initialization

Load the decision store. Each decision has at minimum an identifier, one or more affected-path patterns (globs over the project's file tree), and a rationale string.

On every file-touching tool call

Before the tool result is returned to the agent, extract the set of file paths the tool call references. For each decision in the store, test whether any of its affected-path patterns matches any of the touched paths. Collect all matched decisions.
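A minimal sketch of this step, assuming a glob matcher such as the minimatch npm package and a decision shape like the illustrative one in §3; the production matcher may differ.

```typescript
// Sketch only; the production matcher and Decision schema may differ.
import { minimatch } from "minimatch";

type Decision = { id: string; paths: string[]; rationale: string; createdAt: string };

function matchDecisions(decisions: Decision[], touchedPaths: string[]): Decision[] {
  return decisions.filter((d) =>
    d.paths.some((pattern) =>
      touchedPaths.some((p) => p === pattern || minimatch(p, pattern)),
    ),
  );
}
```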

Ranking and budget

If the matched set is small (under a configurable threshold, ~50 decisions in the production implementation), include all matched decisions. Otherwise, rank by reciprocal rank fusion over (a) path-match specificity and (b) recency, and select a top-k that fits within a configurable token budget (default 500 tokens).
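Continuing the sketch above, one way to implement the ranking and budget step. The specificity scorer, the RRF constant (60 is a common default), and the rough 4-characters-per-token estimate are assumptions; the 500-token budget and ~50-decision short-circuit follow the description above.

```typescript
// Sketch only; production ranking logic may differ.
function rankAndBudget(
  matched: Decision[],
  specificity: (d: Decision) => number, // e.g. length of the matched pattern
  tokenBudget = 500,
  shortCircuit = 50,
  K = 60, // conventional RRF constant
): Decision[] {
  // Small matched sets skip ranking entirely.
  if (matched.length < shortCircuit) return matched;

  // Two ranked lists: path-match specificity and recency.
  const bySpecificity = [...matched].sort((x, y) => specificity(y) - specificity(x));
  const byRecency = [...matched].sort(
    (x, y) => Date.parse(y.createdAt) - Date.parse(x.createdAt),
  );

  // Reciprocal rank fusion: score(d) = sum over lists of 1 / (K + rank).
  const rrf = new Map<string, number>();
  for (const list of [bySpecificity, byRecency]) {
    list.forEach((d, rank) => rrf.set(d.id, (rrf.get(d.id) ?? 0) + 1 / (K + rank + 1)));
  }
  const ranked = [...matched].sort((x, y) => (rrf.get(y.id) ?? 0) - (rrf.get(x.id) ?? 0));

  // Greedily take top-ranked decisions until the token budget is spent
  // (rough estimate: ~4 characters per token).
  const selected: Decision[] = [];
  let used = 0;
  for (const d of ranked) {
    const cost = Math.ceil(d.rationale.length / 4);
    if (used + cost > tokenBudget) break;
    selected.push(d);
    used += cost;
  }
  return selected;
}
```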

Injection

Prepend the selected decisions to the tool result in a metadata block that the agent's prompt template surfaces as context. Decisions appear with their identifier and rationale; other fields are reserved for tooling.

Caching

Within a single agent session, cache lookups by touched-path set so repeated tool calls on the same paths do not re-rank.

Tool coverage

The pattern applies to any tool call whose arguments reference file paths: read_file, edit_file, write_file, bash (with paths parsed from argv tokens that exist on disk or match known globs), and any equivalents in other tool taxonomies.
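A rough sketch of path extraction for bash calls, assuming Node's fs module and the same glob matcher as above; real shell parsing (quoting, redirection) needs more care than this naive whitespace split.

```typescript
// Sketch only: naive whitespace tokenization, no handling of quoting.
import { existsSync } from "node:fs";
import { minimatch } from "minimatch";

function extractBashPaths(command: string, knownGlobs: string[]): string[] {
  return command
    .split(/\s+/)
    .filter(
      (token) =>
        existsSync(token) || knownGlobs.some((pattern) => minimatch(token, pattern)),
    );
}
```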

The reference implementation in the commercial product is approximately 180 lines of TypeScript including the production-grade ranking, caching, and budget logic. A minimal proof-of-concept implementation following the description above can be written in roughly 50 lines of any modern language.