Amber IA

Conversational AI with modular pipeline

Head of Technology4 engineers

GPT-4o-miniRAGNode.jsStreaming

6-layer AI pipeline

Architecture highlights

Arch 01

6-layer pipeline

Arch 02

Content shield

Arch 03

RAG memory

Situation

The company needed a conversational AI assistant that could answer customer questions accurately while maintaining brand tone and safety guardrails. Existing chatbot solutions were either too rigid (decision-tree based) or too unpredictable (raw LLM with no moderation). Customer support volume was growing, and the team needed an AI layer that could handle common queries without human intervention — while ensuring harmful or off-topic content never reached users.

Task

Design and build a production-grade conversational AI platform with a modular architecture that separates concerns: input handling, content moderation, contextual memory, language model inference, and response delivery. The system needed to be extensible (new knowledge sources, different LLMs) and safe (content filtering before and after generation).

Action

I architected Amber IA as a 6-layer pipeline, where each layer has a single responsibility and can be developed, tested, and replaced independently.

Layer 1: User Input — Real-time chat interface with WebSocket connections for low-latency streaming. Messages are validated and sanitized before entering the pipeline.

Layer 2: Orchestration API — The /amber-chat endpoint routes messages through the pipeline. It manages conversation state, handles timeouts, and provides fallback responses when downstream services are unavailable.

Layer 3: Content Shield — A moderation layer that filters both incoming messages and outgoing responses. It detects prompt injection attempts, harmful content, and off-topic queries. Messages that fail moderation receive a pre-defined safe response without ever reaching the language model.

Layer 4: Memory & Context — RAG-powered contextual memory using embeddings. The system retrieves relevant knowledge base articles and conversation history to build a rich context window. This ensures responses are grounded in actual company data rather than hallucinated information.

Layer 5: Language Model — GPT-4o-mini processes the enriched context and generates responses. The model is configured with system prompts that enforce brand tone, response length, and behavioral guardrails. Temperature and top-p parameters are tuned for consistency over creativity.

Layer 6: Smart Response — Streaming delivery with human-like pacing. Responses are chunked and delivered progressively, giving users immediate feedback while the full response generates.

Result

Metric	Value
Pipeline layers	6 independent modules
Response latency	Sub-2s first token
Content filter accuracy	99.2% harmful content blocked
Knowledge base coverage	500+ articles indexed
Team size	4 engineers

Modular architecture that allowed the team to swap GPT-4o-mini for newer models without touching other layers
Content Shield blocked prompt injection attempts and off-topic queries before they reached the LLM
RAG memory reduced hallucination by grounding responses in indexed company knowledge
Streaming responses improved perceived performance and user engagement

Metric

Value

Pipeline layers

6 independent modules

Response latency

Sub-2s first token

Content filter accuracy

99.2% harmful content blocked

Knowledge base coverage

500+ articles indexed

Team size

4 engineers