MCP Memory
Semantic memory server for AI agents
Situation
AI agents are stateless. Every conversation starts from zero — no memory of past decisions, no recall of corrections, no awareness of project constraints. The result is a frustrating loop: you correct the same mistake three times, re-explain architecture decisions, and lose hours to hallucinations that a simple boundary could prevent.
Existing solutions were either too heavy (full knowledge bases that required manual curation) or too naive (dumping entire conversation logs into vector stores with no quality filtering). Neither approach respected the signal-to-noise ratio that makes memory useful.
I needed a system that could distinguish between valuable observations and conversational noise, maintain temporal relevance so fresh knowledge wins over stale facts, and inject hard constraints that prevent recurring mistakes.
Task
Build a semantic memory server that operates via the Model Context Protocol (MCP), enabling any AI agent to save, search, and retrieve observations with automatic curation. The system needed to:
- Reject low-signal inputs before they pollute memory
- Deduplicate semantically similar observations
- Decay old memories unless actively accessed
- Support "technical anchors" — short imperative constraints injected into every search
- Run on the edge with sub-100ms search latency
Action
I designed the system using Domain-Driven Design with strict hexagonal architecture. The domain layer knows nothing about HTTP, MCP, or PostgreSQL — every external dependency is behind an adapter interface.
The curation pipeline is the core differentiator. When an observation arrives, it passes through a low-signal filter (rejecting "ok thanks" and "trying again"-style inputs), then gets embedded with OpenAI's text-embedding-3-small model. A cosine similarity check against existing memories handles deduplication: above a 0.92 threshold, the observation is a duplicate; below 0.85, it's clearly distinct. The ambiguous zone (0.85-0.92) triggers a two-tier review with a separate LLM call.
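The three-zone dedup decision can be sketched as follows. The 0.92 and 0.85 thresholds come from the description above; the function and type names are illustrative, not the project's actual API.

```typescript
type DedupDecision = "duplicate" | "distinct" | "needs_review";

// Plain cosine similarity over embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function classify(similarity: number): DedupDecision {
  if (similarity >= 0.92) return "duplicate";   // reject: already stored
  if (similarity < 0.85) return "distinct";     // store as new observation
  return "needs_review";                        // ambiguous zone: escalate to LLM review
}
```

Only the ambiguous zone pays for the extra LLM call; the clear cases are settled by a single vector comparison.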
Technical anchors solve the recurring-mistake problem. These are short imperative constraints (1-5 sentences) that bypass semantic search entirely — they're injected as-is into every search response. When an agent searches memory, it first receives relevant anchors (project-scoped, stack-scoped, or global), then ranked observations. The anchors cost ~30 tokens but prevent hours of corrections.
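A simplified sketch of the injection step, assuming minimal types: the real tool schemas and scoping rules aren't shown in the text, so the shapes below are assumptions.

```typescript
type AnchorScope = "project" | "stack" | "global";

interface Anchor {
  text: string;       // 1-5 imperative sentences, ~30 tokens
  scope: AnchorScope;
  projectId?: string; // set only for project-scoped anchors
}

interface RankedObservation {
  text: string;
  score: number;
}

interface SearchResponse {
  anchors: string[];                 // injected verbatim, never ranked
  observations: RankedObservation[];
}

// Anchors bypass semantic search entirely: filter by scope, prepend as-is.
function buildSearchResponse(
  allAnchors: Anchor[],
  ranked: RankedObservation[],
  projectId?: string,
): SearchResponse {
  const anchors = allAnchors
    .filter(a => a.scope !== "project" || a.projectId === projectId)
    .map(a => a.text);
  return { anchors, observations: ranked };
}
```

Because anchors skip ranking, a constraint like "never edit generated files" reaches the agent on every search, regardless of what the query was about.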
Hybrid ranking combines vector similarity (85%) with temporal relevance (15%). A fire-and-forget decay boost runs asynchronously on each access, so frequently referenced memories stay fresh without blocking search responses.
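The 85/15 blend can be written out directly. The exact temporal function isn't given in the text, so an exponential decay with an assumed half-life stands in for it here.

```typescript
// Hybrid score: 85% vector similarity, 15% temporal relevance.
function hybridScore(
  similarity: number,       // cosine similarity in [0, 1]
  daysSinceAccess: number,  // reset by the fire-and-forget decay boost
  halfLifeDays = 30,        // assumed decay half-life, not from the source
): number {
  const temporal = Math.pow(0.5, daysSinceAccess / halfLifeDays); // 1.0 when fresh
  return 0.85 * similarity + 0.15 * temporal;
}
```

A just-accessed memory gets the full 0.15 temporal weight; the asynchronous boost on each access keeps frequently referenced memories near that ceiling without adding latency to the search path.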
The stack is Cloudflare Workers (V8 isolates for global edge deployment), Hono framework (runs everywhere — Node, Deno, Bun, Workers), and PostgreSQL with pgvector HNSW indexes via Supabase. Authentication uses OAuth 2.1 + PKCE with RFC 9728 discovery — zero-config for end users.
Result
- 14 MCP tools covering the full memory lifecycle: save, search, timeline, anchors, project management
- Sub-100ms search latency on the edge via Cloudflare Workers + Hyperdrive connection pooling
- Automatic curation rejects ~40% of inputs as noise before storage
- Runtime-agnostic core via monorepo architecture (`core`, `cloudflare`, and `node` packages)
- Architecture that separates the memory model (the real value) from deployment details (Cloudflare is optional)