MCP Memory
Semantic memory server for AI agents
Situation
AI agents are stateless. Every conversation starts from zero — no memory of past decisions, no recall of corrections, no awareness of project constraints. The result is a frustrating loop: you correct the same mistake three times, re-explain architecture decisions, and lose hours to hallucinations that a simple boundary could prevent.
Existing solutions were either too heavy (full knowledge bases that required manual curation) or too naive (dumping entire conversation logs into vector stores with no quality filtering). Neither approach respected the signal-to-noise ratio that makes memory useful.
I needed a system that could distinguish between valuable observations and conversational noise, maintain temporal relevance so fresh knowledge wins over stale facts, and inject hard constraints that prevent recurring mistakes.
Task
Build a semantic memory server that operates via the Model Context Protocol (MCP), enabling any AI agent to save, search, and retrieve observations with automatic curation. The system needed to:
- Reject low-signal inputs before they pollute memory
- Deduplicate semantically similar observations
- Decay old memories unless actively accessed
- Support "technical anchors" — short imperative constraints injected into every search
- Run on the edge with sub-100ms search latency
Action
I designed the system using Domain-Driven Design with strict hexagonal architecture. The domain layer knows nothing about HTTP, MCP, or PostgreSQL — every external dependency is behind an adapter interface.
The curation pipeline is the core differentiator. When an observation arrives, it passes through a low-signal filter (rejecting "ok thanks" and "trying again"-style inputs), then gets embedded with OpenAI's text-embedding-3-small model. A cosine similarity check against existing memories handles deduplication: above a 0.92 threshold, the observation is a duplicate; below 0.85, it's clearly distinct. The ambiguous zone (0.85-0.92) triggers a two-tier review with a separate LLM call.
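The three-zone dedup decision can be sketched as follows. The 0.92 and 0.85 thresholds come from the description above; the function and type names are illustrative, not the project's actual API.

```typescript
type DedupDecision = "duplicate" | "distinct" | "needs_review";

// Plain cosine similarity over embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function classify(similarity: number): DedupDecision {
  if (similarity >= 0.92) return "duplicate";   // reject: already stored
  if (similarity < 0.85) return "distinct";     // store as new observation
  return "needs_review";                        // ambiguous zone: escalate to LLM review
}
```

Only the ambiguous zone pays for the extra LLM call; the clear cases are settled by a single vector comparison.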
Technical anchors solve the recurring-mistake problem. These are short imperative constraints (1-5 sentences) that bypass semantic search entirely — they're injected as-is into every search response. When an agent searches memory, it first receives relevant anchors (project-scoped, stack-scoped, or global), then ranked observations. The anchors cost ~30 tokens but prevent hours of corrections.
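A simplified sketch of the injection step, assuming minimal types: the real tool schemas and scoping rules aren't shown in the text, so the shapes below are assumptions.

```typescript
type AnchorScope = "project" | "stack" | "global";

interface Anchor {
  text: string;       // 1-5 imperative sentences, ~30 tokens
  scope: AnchorScope;
  projectId?: string; // set only for project-scoped anchors
}

interface RankedObservation {
  text: string;
  score: number;
}

interface SearchResponse {
  anchors: string[];                 // injected verbatim, never ranked
  observations: RankedObservation[];
}

// Anchors bypass semantic search entirely: filter by scope, prepend as-is.
function buildSearchResponse(
  allAnchors: Anchor[],
  ranked: RankedObservation[],
  projectId?: string,
): SearchResponse {
  const anchors = allAnchors
    .filter(a => a.scope !== "project" || a.projectId === projectId)
    .map(a => a.text);
  return { anchors, observations: ranked };
}
```

Because anchors skip ranking, a constraint like "never edit generated files" reaches the agent on every search, regardless of what the query was about.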
Hybrid ranking combines vector similarity (85%) with temporal relevance (15%). A fire-and-forget decay boost runs asynchronously on each access, so frequently referenced memories stay fresh without blocking search responses.
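The 85/15 blend can be written out directly. The exact temporal function isn't given in the text, so an exponential decay with an assumed half-life stands in for it here.

```typescript
// Hybrid score: 85% vector similarity, 15% temporal relevance.
function hybridScore(
  similarity: number,       // cosine similarity in [0, 1]
  daysSinceAccess: number,  // reset by the fire-and-forget decay boost
  halfLifeDays = 30,        // assumed decay half-life, not from the source
): number {
  const temporal = Math.pow(0.5, daysSinceAccess / halfLifeDays); // 1.0 when fresh
  return 0.85 * similarity + 0.15 * temporal;
}
```

A just-accessed memory gets the full 0.15 temporal weight; the asynchronous boost on each access keeps frequently referenced memories near that ceiling without adding latency to the search path.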
The stack is Cloudflare Workers (V8 isolates for global edge deployment), Hono framework (runs everywhere — Node, Deno, Bun, Workers), and PostgreSQL with pgvector HNSW indexes via Supabase. Authentication uses OAuth 2.1 + PKCE with RFC 9728 discovery — zero-config for end users.
Result
- 14 MCP tools covering the full memory lifecycle: save, search, timeline, anchors, project management
- Sub-100ms search latency on the edge via Cloudflare Workers + Hyperdrive connection pooling
- Automatic curation rejects ~40% of inputs as noise before storage
- Runtime-agnostic core via monorepo architecture (`core`, `cloudflare`, and `node` packages)
- Architecture that separates the memory model (the real value) from deployment details (Cloudflare is optional)