Amber IA
Conversational AI with modular pipeline
Situation
The company needed a conversational AI assistant that could answer customer questions accurately while maintaining brand tone and safety guardrails. Existing chatbot solutions were either too rigid (decision-tree based) or too unpredictable (raw LLM with no moderation). Customer support volume was growing, and the team needed an AI layer that could handle common queries without human intervention — while ensuring harmful or off-topic content never reached users.
Task
Design and build a production-grade conversational AI platform with a modular architecture that separates concerns: input handling, content moderation, contextual memory, language model inference, and response delivery. The system needed to be extensible (new knowledge sources, different LLMs) and safe (content filtering before and after generation).
Action
I architected Amber IA as a 6-layer pipeline, where each layer has a single responsibility and can be developed, tested, and replaced independently.
Layer 1: User Input — Real-time chat interface with WebSocket connections for low-latency streaming. Messages are validated and sanitized before entering the pipeline.
Layer 2: Orchestration API — The /amber-chat endpoint routes messages through the pipeline. It manages conversation state, handles timeouts, and provides fallback responses when downstream services are unavailable.
Layer 3: Content Shield — A moderation layer that filters both incoming messages and outgoing responses. It detects prompt injection attempts, harmful content, and off-topic queries. Messages that fail moderation receive a pre-defined safe response without ever reaching the language model.
Layer 4: Memory & Context — RAG-powered contextual memory using embeddings. The system retrieves relevant knowledge base articles and conversation history to build a rich context window. This ensures responses are grounded in actual company data rather than hallucinated information.
Layer 5: Language Model — GPT-4o-mini processes the enriched context and generates responses. The model is configured with system prompts that enforce brand tone, response length, and behavioral guardrails. Temperature and top-p parameters are tuned for consistency over creativity.
Layer 6: Smart Response — Streaming delivery with human-like pacing. Responses are chunked and delivered progressively, giving users immediate feedback while the full response generates.
Result
| Metric | Value |
|---|---|
| Pipeline layers | 6 independent modules |
| Response latency | Sub-2s first token |
| Content filter accuracy | 99.2% harmful content blocked |
| Knowledge base coverage | 500+ articles indexed |
| Team size | 4 engineers |
- Modular architecture that allowed the team to swap GPT-4o-mini for newer models without touching other layers
- Content Shield blocked prompt injection attempts and off-topic queries before they reached the LLM
- RAG memory reduced hallucination by grounding responses in indexed company knowledge
- Streaming responses improved perceived performance and user engagement