Key Takeaways

A RAG private AI assistant answers questions from your own documents - with source citations - without training a custom model or exposing your full data library to public AI.
A production enterprise RAG implementation requires five layers: ingestion, vector store, permission-aware retrieval, generation, and interface and governance - not just "upload PDFs to ChatGPT."
Typical SME deployments cost LKR 400,000–1.2M to build and LKR 25,000–80,000/month to run; enterprise systems scale to LKR 5M+ with air-gapped options.
Evaluation before launch - faithfulness, relevance, refusal accuracy - separates assistants that hallucinate from ones staff actually trust.
Privacy risks (data leakage, over-retrieval, prompt injection) are manageable with RBAC, chunk-level filtering, and audit logging - detailed below.

Introduction

Your team has hundreds of policy documents, product manuals, past proposals, and compliance guides - but finding the right paragraph still takes 15–20 minutes. Public ChatGPT cannot safely access that library. Fine-tuning a model is expensive and goes stale the moment a document changes. A RAG private AI assistant is the architecture most businesses should use instead: retrieve relevant chunks at query time, generate an answer grounded in those chunks, cite the source, and keep your data in infrastructure you control.

This is a builder's guide to enterprise RAG implementation - architecture, data ingestion, model and vector database choices, permissions, evaluation, cost examples in LKR, privacy risks, and a worked case study showing exactly what a query looks like end-to-end. If you need someone to build it, the final sections cover what professional RAG development services should deliver.

What Is a RAG Private AI Assistant?

Retrieval-Augmented Generation (RAG) combines semantic search with a large language model. When a user asks a question, the system finds the most relevant passages from your private document store, passes only those passages to the LLM, and instructs it to answer using that context alone. The result is a private AI chatbot that knows your business - not the internet - and can point to the exact document and section it used.

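Unlike fine-tuning, you update knowledge by adding or removing documents. Unlike public AI tools, your document library stays in storage you control - only the few retrieved snippets are sent to the generation API per query. If you use a hosted embedding API, each chunk is also sent to that API once at ingestion, so choose a self-hosted embedding model when no text may leave your environment.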
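
The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the function names, sample chunks, and prompt wording are all assumptions made for the example. No LLM is called; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[dict]) -> str:
    """Assemble a grounded prompt: only the retrieved chunks go to the LLM."""
    sources = "\n".join(f"[{c['doc_id']}] {c['text']}" for c in context)
    return (
        "Answer ONLY from the context below and cite the [doc_id] you used. "
        "If the answer is not in the context, say it is not in the knowledge base.\n\n"
        f"Context:\n{sources}\n\nQuestion: {query}"
    )


# Hypothetical sample chunks, invented for this example.
chunks = [
    {"doc_id": "leave-policy", "text": "Employees may carry forward unused annual leave days."},
    {"doc_id": "it-policy", "text": "Laptops must use full disk encryption."},
    {"doc_id": "travel-policy", "text": "Travel claims need a receipt."},
]
question = "How many annual leave days carry forward?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
```

Only the leave-policy chunk ends up in `prompt`; the other two documents are never sent to the generation model.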

RAG Architecture: The Five-Layer Stack

Every production RAG private AI assistant follows the same flow. Understanding each layer is essential before choosing tools or vendors.

Enterprise RAG Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                         DATA SOURCES (Your Infrastructure)              │
│  PDFs · DOCX · Confluence · SharePoint · Notion · SQL · Ticketing API   │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ scheduled / webhook sync
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 1: INGESTION PIPELINE                                            │
│  Parse → Clean → Chunk (512–1024 tokens) → Embed → Metadata tags        │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ vectors + metadata
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 2: VECTOR DATABASE (+ optional keyword index for hybrid search)  │
│  Pinecone · Qdrant · pgvector · Weaviate                                │
│  Stores: embedding, chunk text, doc_id, department, classification      │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ user query
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 3: PERMISSIONS & RETRIEVAL                                       │
│  RBAC filter → embed query → hybrid search → rerank top-k chunks        │
│  User sees ONLY chunks their role is authorised to access               │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ top 5–10 chunks + system prompt
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 4: GENERATION (LLM)                                              │
│  GPT-4o · Claude 3.5 Sonnet · Llama 3.1 (self-hosted)                   │
│ Instruction: answer ONLY from context · cite sources · refuse if absent │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ answer + citations + audit log
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 5: INTERFACE & GOVERNANCE                                        │
│  Web chat · Slack/Teams bot · API · Widget                              │
│  Logs: query, chunks retrieved, model used, user_id, timestamp          │
└─────────────────────────────────────────────────────────────────────────┘

The permission layer (Layer 3) is what separates a toy demo from an enterprise RAG implementation. Without chunk-level access control, every user can retrieve every document - including HR salaries, legal contracts, or client data they should never see.
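
A rough sketch of chunk-level access control follows. The role-to-level mapping, the `Chunk` fields, and the sample data are assumptions made for illustration; in a real deployment the same condition would typically be passed to the vector database as a metadata filter on the search call rather than applied in Python.

```python
from dataclasses import dataclass

# Hypothetical mapping from identity-provider roles to permitted access levels.
ROLE_ACCESS = {
    "internal": {"public", "internal"},
    "hr_admin": {"public", "internal", "confidential"},
}


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    department: str
    access_level: str


def authorised_chunks(chunks: list[Chunk], role: str) -> list[Chunk]:
    """Keep only chunks the role may see; unknown roles get nothing (deny by default)."""
    allowed = ROLE_ACCESS.get(role, set())
    return [c for c in chunks if c.access_level in allowed]


# Invented sample store for the example.
store = [
    Chunk("handbook", "Office hours and dress code.", "HR", "public"),
    Chunk("leave-policy", "Annual leave rules.", "HR", "internal"),
    Chunk("salary-bands", "Salary bands by grade.", "HR", "confidential"),
]

visible_to_analyst = [c.doc_id for c in authorised_chunks(store, "internal")]
visible_to_unknown = authorised_chunks(store, "contractor")
```

The filter runs before similarity search, so a confidential chunk can never be ranked, retrieved, or quoted for a role that is not entitled to it.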

Layer 1: Data Ingestion Pipeline

Ingestion quality determines retrieval quality. Garbage in means hallucinations out - no amount of prompt engineering fixes poorly chunked PDFs.

Ingestion Steps

Connect sources: File shares, Google Drive, SharePoint, Confluence, Zendesk tickets, database exports. Use LlamaIndex or custom connectors with scheduled sync (nightly) or webhook triggers on document change.
Parse: Extract text from PDFs (including tables), DOCX, HTML, and markdown. Tools: Unstructured.io, LlamaParse, or Azure Document Intelligence for scanned documents.
Clean: Strip headers/footers, remove duplicate boilerplate, normalise encoding. OCR'd Sinhala/Tamil documents need extra validation - test embedding quality on a sample set.
Chunk: Split into 512–1,024 token segments with 10–20% overlap. Use semantic chunking (split at paragraph/section boundaries) rather than fixed character counts. Attach metadata: doc_id, title, department, classification, last_updated.
Embed & store: Convert each chunk to a vector via your embedding model. Write to vector DB with metadata for filtering.
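
The chunking step above can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace-separated words stand in for model tokens, paragraphs are detected by blank lines, and `chunk_paragraphs` is a hypothetical helper name. A production pipeline would count tokens with the tokenizer of its embedding model.

```python
def chunk_paragraphs(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Pack paragraphs into chunks of at most max_tokens words, carrying the
    last `overlap` words of each chunk into the start of the next one."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for words in paragraphs:
        # Paragraphs longer than `step` words are cut into step-sized pieces,
        # so a piece plus the carried overlap always fits within max_tokens.
        pieces = [words[i:i + step] for i in range(0, len(words), step)]
        for piece in pieces:
            if current and len(current) + len(piece) > max_tokens:
                chunks.append(" ".join(current))
                current = current[-overlap:] if overlap else []
            current.extend(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "one two three four\n\nfive six seven\n\neight nine"
small_chunks = chunk_paragraphs(sample, max_tokens=6, overlap=2)
```

With the tiny limits used here, each chunk begins with the last two words of the previous one, which is the overlap behaviour described in the Chunk step.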

Ingestion Checklist

☐ Document inventory complete - owners assigned per source
☐ Update cadence defined (real-time, daily, weekly)
☐ Deprecated documents excluded or tagged archived
☐ Chunk size tested on 20 sample Q&A pairs
☐ Metadata schema includes department + access level
☐ Re-ingestion pipeline tested after source document edit
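
One way to test the re-ingestion item on this checklist is to compare content hashes between the index and the current sources. The sketch below assumes the pipeline stores a SHA-256 hash per `doc_id` at ingestion time; the function names and sample documents are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reingestion(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes (doc_id -> hash) with current text (doc_id -> text)
    and report which documents to add, re-embed, or remove from the index."""
    plan: dict[str, list[str]] = {"add": [], "update": [], "delete": []}
    for doc_id, text in current_docs.items():
        if doc_id not in indexed:
            plan["add"].append(doc_id)
        elif indexed[doc_id] != content_hash(text):
            plan["update"].append(doc_id)
    plan["delete"] = [doc_id for doc_id in indexed if doc_id not in current_docs]
    return plan


# Invented example: one edited document, one new, one removed at the source.
indexed = {"hr-policy": content_hash("v1 text"), "old-memo": content_hash("memo")}
current = {"hr-policy": "v2 text", "it-policy": "new document"}
plan = plan_reingestion(indexed, current)
```

Unchanged documents produce an empty plan, so a nightly sync only pays embedding costs for what actually changed.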

Layer 2 & 4: Model and Vector Database Choices

Stack choices affect accuracy, cost, latency, and data residency. Match the tier to your compliance requirements - not every business needs self-hosted Llama.

Vector Database Comparison

Database	Deployment	Best For	Cost (indicative)
pgvector	Self-hosted / RDS	Teams already on PostgreSQL; <500K chunks	LKR 8,000–25,000/mo (infra)
Pinecone	Managed cloud	Fastest MVP; global SaaS teams	LKR 25,000–80,000/mo
Qdrant	Self-hosted or cloud	Data residency, hybrid search, filtering	LKR 15,000–50,000/mo
Weaviate	Self-hosted or cloud	Multi-tenant, enterprise RBAC	LKR 30,000–100,000/mo

Embedding & Generation Model Choices

Model	Role	Strength	Cost per 1M tokens
text-embedding-3-large	Embedding	Best English retrieval accuracy	~LKR 42,000
Cohere embed-multilingual-v3	Embedding	Sinhala, Tamil, English mixed docs	~LKR 32,000
GPT-4o	Generation	Highest answer quality, citations	~LKR 80,000 in / LKR 240,000 out
Claude 3.5 Sonnet	Generation	Long context, nuanced compliance Q&A	~LKR 96,000 in / LKR 480,000 out
GPT-4o-mini	Generation	High-volume internal FAQ (80% quality, 10% cost)	~LKR 4,800 in / LKR 19,200 out
Llama 3.1 70B (Ollama)	Generation	Air-gapped, zero external API calls	GPU infra LKR 150,000+/mo

Recommended stacks:
SME internal KB: pgvector + GPT-4o-mini + text-embedding-3-large
Multilingual Sri Lankan org: Qdrant + Cohere embeddings + Claude 3.5 Sonnet
Regulated / air-gapped: Qdrant self-hosted + nomic-embed-text + Llama 3.1 on private GPU

Layer 3: Permissions and Access Control

A private AI chatbot without RBAC is a data breach waiting to happen. Permissions must be enforced at retrieval time - not just at the UI.

Permission Model

Document-level tags: Tag each chunk with access_level (public, internal, confidential, legal-hold) and department at ingestion.
User-role mapping: Sync roles from your identity provider (Azure AD, Google Workspace, Okta). Map roles to permitted access levels.
Pre-retrieval filter: Apply metadata filters to vector search before returning chunks. A sales user never retrieves HR policy chunks - even if embeddings are semantically similar.
Post-retrieval audit: Log which chunks were retrieved for each query. Compliance teams can review access patterns monthly.
Sensitive field redaction: Strip NIC numbers, salaries, and account numbers from chunks at ingestion if they should never appear in generated answers.
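
Sensitive field redaction at ingestion might look roughly like this. The patterns are assumptions for illustration only: the NIC pattern covers nine digits followed by V or X and twelve-digit numbers, and the amount pattern treats any figure prefixed with LKR as sensitive. Real documents will need patterns tuned and validated on a sample set.

```python
import re

# Illustrative patterns only; tune and test against your own documents.
PATTERNS = {
    "NIC": re.compile(r"\b(?:\d{9}[VvXx]|\d{12})\b"),
    "AMOUNT": re.compile(r"\bLKR\s?\d+(?:,\d{3})*(?:\.\d+)?"),
}


def redact(text: str) -> str:
    """Replace each sensitive match with a labelled placeholder before embedding."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label}]", text)
    return text


# Invented sample sentence.
cleaned = redact("Staff member 912345678V earns LKR 250,000 monthly.")
```

Because redaction happens before the chunk is embedded and stored, the original value never reaches the vector database or the generation model.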

Evaluation: How to Know Your RAG Works

Most failed enterprise RAG implementation projects skip systematic evaluation. Build a golden dataset of 50–100 real questions with expected answers and source documents before launch.

Metrics That Matter

Metric	What It Measures	Launch Threshold
Retrieval recall@5	Correct source chunk in top 5 results	≥ 85%
Answer faithfulness	Answer supported by retrieved context (no fabrication)	≥ 90%
Citation accuracy	Cited document actually contains the claim	≥ 95%
Refusal accuracy	Correctly says "not in knowledge base" when answer absent	≥ 90%
Latency (p95)	Time from query to response	< 5 seconds

Use frameworks like RAGAS or LlamaIndex eval modules to automate scoring. Re-run evaluation after every major document update or chunking change. Professional RAG development services should deliver an evaluation report as a launch gate - not just a working demo.
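
Two of the metrics in the table are simple enough to compute by hand, which helps when sanity-checking a framework's output. The sketch below is an illustration, not a replacement for RAGAS: the input shapes, function names, and sample results are assumptions, and faithfulness and citation accuracy are left out because they need a judge model or human review.

```python
def recall_at_k(results: list[tuple[str, list[str]]], k: int = 5) -> float:
    """results: (expected_source_id, ranked_retrieved_ids) per golden question."""
    hits = sum(1 for expected, ranked in results if expected in ranked[:k])
    return hits / len(results)


def refusal_accuracy(refused: list[bool]) -> float:
    """refused: one flag per out-of-scope question, True when the assistant declined."""
    return sum(refused) / len(refused)


# Launch thresholds taken from the metrics table above.
THRESHOLDS = {"recall@5": 0.85, "refusal_accuracy": 0.90}


def launch_gate(scores: dict[str, float]) -> list[str]:
    """Return the metrics that fall below their launch threshold."""
    return [m for m, minimum in THRESHOLDS.items() if scores.get(m, 0.0) < minimum]


# Invented golden-set results: the expected source is missed in one of four questions.
retrieval_results = [
    ("hr-4.2", ["hr-4.2", "handbook-18"]),
    ("it-2.1", ["it-3.0", "it-2.1"]),
    ("fin-1.0", ["hr-4.2", "it-2.1"]),
    ("hr-5.1", ["hr-5.1"]),
]
scores = {
    "recall@5": recall_at_k(retrieval_results),
    "refusal_accuracy": refusal_accuracy([True] * 9 + [False]),
}
failing = launch_gate(scores)
```

Here the gate reports `recall@5` as failing, which is the signal to revisit chunking or retrieval before any pilot starts.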

Cost Examples (LKR)

Costs vary by document volume, user count, and compliance tier. These ranges reflect 2026 pricing for Sri Lankan and regional deployments built by Hashtag Coders and comparable vendors.

Tier	Scope	Build Cost	Monthly Run Cost
Starter	100–300 docs, 20 users, web chat, English only	LKR 400,000–800,000	LKR 25,000–50,000
Professional	500–2,000 docs, RBAC, Slack/Teams, multilingual	LKR 1.2M–2.5M	LKR 60,000–150,000
Enterprise	5,000+ docs, SSO, audit logs, hybrid search, SLA	LKR 3M–8M	LKR 200,000–500,000
Air-gapped	On-prem GPU, zero external API, banking/legal	LKR 5M–15M	LKR 300,000–800,000

Per-query cost example (Professional tier): 1,000 queries/day × ~3,000 tokens/query × GPT-4o-mini ≈ LKR 35,000/month in LLM API fees alone. Embedding and vector DB add LKR 15,000–40,000. Budget 20% overhead for re-ingestion and evaluation re-runs.
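
The structure of that estimate can be written as a small calculator so you can plug in your own volumes and current provider pricing. Everything here is an assumption for illustration: the function name, the 80/20 split between input and output tokens, the 30-day month, applying the 20% overhead to LLM fees, and above all the two rates, which are placeholders rather than quotes from any provider or figures from this article.

```python
def monthly_llm_cost(
    queries_per_day: int,
    tokens_per_query: int,
    input_share: float,
    rate_in_per_million: float,
    rate_out_per_million: float,
    days: int = 30,
    overhead: float = 0.20,
) -> dict[str, float]:
    """Estimate monthly LLM API fees in the same currency as the rates."""
    monthly_tokens = queries_per_day * tokens_per_query * days
    input_tokens = monthly_tokens * input_share
    output_tokens = monthly_tokens - input_tokens
    base = (input_tokens * rate_in_per_million + output_tokens * rate_out_per_million) / 1_000_000
    return {
        "monthly_tokens": monthly_tokens,
        "base": base,
        "with_overhead": base * (1 + overhead),
    }


# Placeholder rates (LKR per 1M tokens); substitute your provider's current pricing.
estimate = monthly_llm_cost(
    queries_per_day=1_000,
    tokens_per_query=3_000,
    input_share=0.8,
    rate_in_per_million=100,
    rate_out_per_million=400,
)
```

The volume in the example above works out to 90 million tokens per month, so the result is very sensitive to the per-token rate and to how much of each query is output.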

Privacy Risks and Mitigations

A RAG private AI assistant is safer than pasting documents into public ChatGPT - but "private" is not automatic. These are the risks that cause compliance failures and the controls that address them.

Risk	How It Happens	Mitigation
LLM provider data retention	Retrieved chunks sent to OpenAI/Anthropic APIs may be logged	Use enterprise API agreements with zero-retention; or self-host Llama
Over-retrieval leakage	User retrieves confidential chunks outside their role	Metadata filters on every query; audit logs; quarterly access reviews
Prompt injection	Malicious text in a document instructs the LLM to ignore policies	Sanitise ingested content; system prompt hardening; output guardrails
Hallucinated compliance advice	LLM fabricates an answer when retrieval misses the right chunk	Mandatory refusal behaviour; faithfulness eval ≥ 90%; human review for legal/medical
Stale document answers	Old policy version retrieved after update	Version metadata; automated re-ingestion on doc change; show `last_updated` in citations
Cross-border data transfer	Vector DB or LLM hosted outside Sri Lanka / client jurisdiction	Self-hosted Qdrant + regional API endpoints; document data residency in contract
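
For the stale-document risk, one simple control is to drop superseded versions from the retrieved set before generation. The sketch assumes each chunk carries `doc_id` and `last_updated` metadata as described in the ingestion section; the function name and sample chunks are hypothetical.

```python
from datetime import date


def latest_versions(chunks: list[dict]) -> list[dict]:
    """Keep only chunks from the newest last_updated date of each doc_id."""
    newest: dict[str, date] = {}
    for c in chunks:
        if c["doc_id"] not in newest or c["last_updated"] > newest[c["doc_id"]]:
            newest[c["doc_id"]] = c["last_updated"]
    return [c for c in chunks if c["last_updated"] == newest[c["doc_id"]]]


# Invented retrieved set: two versions of the same policy plus an unrelated document.
retrieved = [
    {"doc_id": "hr-policy", "last_updated": date(2025, 1, 10), "text": "old carry-forward rule"},
    {"doc_id": "hr-policy", "last_updated": date(2026, 1, 15), "text": "new carry-forward rule"},
    {"doc_id": "handbook", "last_updated": date(2024, 6, 1), "text": "leave entitlements"},
]
fresh = latest_versions(retrieved)
```

Deleting old versions at re-ingestion is still the primary control; a filter like this is a second line of defence when an outdated chunk lingers in the index.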

Case Study & Demo Walkthrough: Internal HR Policy Assistant

The following is a real deployment pattern from a Colombo professional services firm (45 staff, 280 HR and IT policy documents). It shows exactly how a RAG private AI assistant processes a query - the "working demo" decision-makers ask for before approving budget.

Deployment Summary

Stack: pgvector on AWS RDS, text-embedding-3-large, GPT-4o-mini, Next.js chat UI, Azure AD SSO
Build time: 5 weeks (ingestion 2w, retrieval tuning 1w, RBAC + UI 1w, eval 1w)
Build cost: LKR 650,000 | Monthly run: LKR 38,000
Results after 60 days: HR ticket volume down 58%, average policy lookup time down from 18 min to 12 sec, faithfulness score 93%

Live Query Walkthrough

USER QUERY (Finance analyst, role: internal)

"How many days of annual leave can an employee carry forward to the next year?"

STEP 1 - PERMISSION FILTER

User role internal → search filtered to access_level: [public, internal] → 4,200 eligible chunks (excludes confidential board docs)

STEP 2 - RETRIEVAL (hybrid search, top 5)

① HR-Policy-2026.pdf §4.2 - Annual Leave Carry Forward (score: 0.91)

② Employee-Handbook-v3.docx - Leave Entitlements (score: 0.84)

③ … (3 more chunks below threshold, discarded)

STEP 3 - GENERATED ANSWER (2.1s latency)

Employees may carry forward a maximum of 7 unused annual leave days into the next calendar year. Days beyond 7 lapse on 31 December unless approved in writing by the department head. Carry-forward days must be used by 31 March of the following year or they are forfeited.

Sources: HR-Policy-2026.pdf §4.2 · Employee-Handbook-v3.docx p.18 · Last updated: 15 Jan 2026

When the same user asked "What was our Q3 revenue?" - a question outside the knowledge base - the assistant responded: "I don't have revenue data in the current knowledge base. Please contact the Finance team or check the ERP system." Refusal accuracy on out-of-scope questions in evaluation: 94%.
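
The threshold and refusal behaviour in this walkthrough can be sketched as below. This is an illustration under assumptions, not the deployed code: the 0.80 cut-off, the scores of the three discarded chunks, the refusal wording, and the `generate` callable (a stub standing in for the LLM call) are all hypothetical.

```python
from typing import Callable

REFUSAL = "I don't have that in the current knowledge base."


def select_context(scored: list[tuple[float, str]], min_score: float = 0.80, k: int = 5) -> list[tuple[float, str]]:
    """Take the top-k (score, source) pairs and discard those below the threshold."""
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return [(score, source) for score, source in top if score >= min_score]


def respond(scored: list[tuple[float, str]], generate: Callable[[list[str]], str], min_score: float = 0.80) -> str:
    """Refuse when nothing relevant was retrieved; otherwise answer with citations."""
    context = select_context(scored, min_score)
    if not context:
        return REFUSAL  # no LLM call is made when retrieval finds nothing usable
    sources = [source for _, source in context]
    return f"{generate(sources)}\n\nSources: {' · '.join(sources)}"


def fake_generate(sources: list[str]) -> str:
    """Stub for the generation model, used only to make the example runnable."""
    return f"(answer grounded in {len(sources)} chunks)"


# Scores for the first two chunks come from the walkthrough; the rest are invented.
leave_query = [
    (0.91, "HR-Policy-2026.pdf §4.2"),
    (0.84, "Employee-Handbook-v3.docx"),
    (0.52, "IT-Policy.pdf"),
    (0.47, "Travel-Policy.pdf"),
    (0.31, "Onboarding.docx"),
]
answer = respond(leave_query, fake_generate)
refusal = respond([(0.35, "HR-Policy-2026.pdf §1.1")], fake_generate)
```

Gating on retrieval score complements the system-prompt instruction to refuse: the model can still decline when the retrieved context does not contain the answer, but an empty context never reaches it at all.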

Enterprise RAG Implementation Roadmap

Week 1–2 - Discovery: Document audit, use-case prioritisation, golden Q&A dataset (50 questions), stack selection, privacy review.
Week 3–4 - Ingestion: Build pipelines, chunk and embed, metadata tagging, RBAC schema.
Week 5 - Retrieval tuning: Hybrid search, reranking, chunk size A/B tests against golden set.
Week 6 - Generation & UI: System prompts, citation format, chat interface or Slack bot, SSO integration.
Week 7 - Evaluation gate: Run RAGAS metrics; must pass launch thresholds before pilot.
Week 8+ - Pilot & scale: 10-user pilot, feedback loop, full rollout, monthly eval re-runs.

RAG Development Services: What to Expect

Professional RAG development services should deliver more than a prototype. At minimum, insist on:

Architecture document with data flow and permission model
Ingestion pipeline with automated re-sync on document changes
Evaluation report against agreed metrics before production launch
RBAC integration with your identity provider
Audit logging and source citation on every response
30-day post-launch tuning period included in scope

At Hashtag Coders, we build end-to-end RAG private AI assistant systems for Sri Lankan and international clients - from document audit through ingestion, evaluation, permissions, and production deployment. We also support multilingual collections (English, Sinhala, Tamil) and air-gapped architectures for regulated industries.

Conclusion

Building a private AI chatbot with RAG is the most practical path to enterprise AI in 2026 - if you treat it as a five-layer system with ingestion discipline, permission enforcement, systematic evaluation, and clear privacy controls. The architecture is proven, costs are predictable, and the ROI is measurable from day one of the pilot.

Ready to scope your enterprise RAG implementation? Contact Hashtag Coders for a free consultation - we will audit your document landscape, recommend a stack, and estimate build and run costs for your specific use case.

Frequently Asked Questions

What is a RAG private AI assistant?

A RAG private AI assistant is a chatbot or Q&A system that uses Retrieval-Augmented Generation to answer questions from your organisation's own documents. It retrieves relevant passages from a private vector database, generates answers grounded in those passages, cites sources, and keeps your full document library in infrastructure you control.

How is RAG different from fine-tuning or ChatGPT Enterprise?

Fine-tuning embeds knowledge into model weights, which makes it expensive to update and leaves answers without source citations. ChatGPT Enterprise still processes your documents inside OpenAI's environment. RAG keeps documents in your vector store, picks up changes as soon as files are re-ingested, and cites exact sources. For most document Q&A use cases, RAG is faster to deploy, cheaper to maintain, and easier to audit.

How much does enterprise RAG implementation cost in Sri Lanka?

Starter internal knowledge bases run LKR 400,000–800,000 to build and LKR 25,000–50,000/month to operate. Professional deployments with RBAC and integrations cost LKR 1.2M–2.5M to build. Enterprise and air-gapped systems range from LKR 3M to LKR 15M depending on document volume, compliance, and self-hosting requirements.

Which vector database should I choose?

Use pgvector if you already run PostgreSQL and have under 500K chunks. Choose Pinecone for the fastest MVP. Choose Qdrant or Weaviate when you need data residency, fine-grained metadata filtering, or enterprise RBAC. For regulated industries, self-hosted Qdrant is the most common choice in Sri Lankan enterprise deployments.

How do you evaluate RAG quality before launch?

Build a golden dataset of 50–100 real questions with expected answers and source documents. Measure retrieval recall@5 (≥85%), answer faithfulness (≥90%), citation accuracy (≥95%), and refusal accuracy on out-of-scope questions (≥90%). Use RAGAS or LlamaIndex eval tools. Do not launch without passing these thresholds on your actual document set.

Can RAG handle Sinhala and Tamil documents?

Yes. Use Cohere embed-multilingual-v3 or OpenAI text-embedding-3-large for embeddings, and a multilingual generation model (Claude 3.5 Sonnet or GPT-4o). Response quality is highest when the user's query language matches the retrieved chunk language. Hashtag Coders has built multilingual RAG systems for Sri Lankan public sector and professional services clients.

What privacy risks should I plan for?

The main risks are LLM provider data retention, over-retrieval across roles, prompt injection via document content, hallucinated answers when retrieval fails, stale document versions, and cross-border hosting. Mitigate with enterprise API agreements, metadata RBAC filters, ingestion sanitisation, mandatory refusal behaviour, version metadata on chunks, and self-hosted vector stores where residency is required.

How to Build a Private AI Assistant with RAG: Architecture, Cost & Security