Artificial Intelligence, Business Strategy

How to Build a Private AI Assistant with RAG: Architecture, Cost & Security

13th March, 2026
Updated: 26th June, 2026
14 min read

Hashtag Coders

Software Engineers & Digital Strategists

Key Takeaways

  • A RAG private AI assistant answers questions from your own documents - with source citations - without training a custom model or exposing your full data library to public AI.
  • Production enterprise RAG implementation requires five layers: ingestion, vector store, permission-aware retrieval, generation, and interface with governance - not just "upload PDFs to ChatGPT."
  • Typical SME deployments cost LKR 400,000–1.2M to build and LKR 25,000–80,000/month to run; enterprise systems scale to LKR 5M+ with air-gapped options.
  • Evaluation before launch - faithfulness, relevance, refusal accuracy - separates assistants that hallucinate from ones staff actually trust.
  • Privacy risks (data leakage, over-retrieval, prompt injection) are manageable with RBAC, chunk-level filtering, and audit logging - detailed below.

Introduction

Your team has hundreds of policy documents, product manuals, past proposals, and compliance guides - but finding the right paragraph still takes 15–20 minutes. Public ChatGPT cannot safely access that library. Fine-tuning a model is expensive and goes stale the moment a document changes. A RAG private AI assistant is the architecture most businesses should use instead: retrieve relevant chunks at query time, generate an answer grounded in those chunks, cite the source, and keep your data in infrastructure you control.

This is a builder's guide to enterprise RAG implementation - architecture, data ingestion, model and vector database choices, permissions, evaluation, cost examples in LKR, privacy risks, and a worked case study showing exactly what a query looks like end-to-end. If you need someone to build it, the closing section covers what professional RAG development services should deliver.

What Is a RAG Private AI Assistant?

Retrieval-Augmented Generation (RAG) combines semantic search with a large language model. When a user asks a question, the system finds the most relevant passages from your private document store, passes only those passages to the LLM, and instructs it to answer using that context alone. The result is a private AI chatbot that knows your business - not the internet - and can point to the exact document and section it used.

Unlike fine-tuning, you update knowledge by adding or removing documents. Unlike public AI tools, your full library never leaves your environment - only small retrieved snippets are sent to the generation API per query.
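The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the three-dimensional vectors are toy stand-ins for real embeddings, and the names CHUNKS, cosine, and top_k are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]

# Hypothetical store: toy 3-dimensional vectors stand in for real embeddings.
CHUNKS = [
    {"doc_id": "leave-policy", "vector": [0.9, 0.1, 0.0]},
    {"doc_id": "expense-policy", "vector": [0.1, 0.9, 0.0]},
    {"doc_id": "it-security", "vector": [0.0, 0.2, 0.9]},
]

if __name__ == "__main__":
    query = [1.0, 0.0, 0.0]  # pretend embedding of a leave-related question
    for chunk in top_k(query, CHUNKS):
        print(chunk["doc_id"])
```

Only the chunks returned by top_k would be passed to the LLM; the rest of the store stays where it is. A vector database performs the same ranking with an index instead of a full scan.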

RAG Architecture: The Five-Layer Stack

Every production RAG private AI assistant follows the same flow. Understanding each layer is essential before choosing tools or vendors.

Enterprise RAG Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                         DATA SOURCES (Your Infrastructure)              │
│  PDFs · DOCX · Confluence · SharePoint · Notion · SQL · Ticketing API   │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ scheduled / webhook sync
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 1: INGESTION PIPELINE                                            │
│  Parse → Clean → Chunk (512–1024 tokens) → Embed → Metadata tags        │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ vectors + metadata
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 2: VECTOR DATABASE (+ optional keyword index for hybrid search)  │
│  Pinecone · Qdrant · pgvector · Weaviate                                │
│  Stores: embedding, chunk text, doc_id, department, classification      │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ user query
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 3: PERMISSIONS & RETRIEVAL                                       │
│  RBAC filter → embed query → hybrid search → rerank top-k chunks        │
│  User sees ONLY chunks their role is authorised to access               │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ top 5–10 chunks + system prompt
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 4: GENERATION (LLM)                                              │
│  GPT-4o · Claude 3.5 Sonnet · Llama 3.1 (self-hosted)                   │
│  Instruction: answer ONLY from context, cite sources, refuse if absent  │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │ answer + citations + audit log
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 5: INTERFACE & GOVERNANCE                                        │
│  Web chat · Slack/Teams bot · API · Widget                              │
│  Logs: query, chunks retrieved, model used, user_id, timestamp          │
└─────────────────────────────────────────────────────────────────────────┘

The permission layer (Layer 3) is what separates a toy demo from a production enterprise RAG implementation. Without chunk-level access control, every user can retrieve every document - including HR salaries, legal contracts, or client data they should never see.

Layer 1: Data Ingestion Pipeline

Ingestion quality determines retrieval quality. Garbage in means hallucinations out - no amount of prompt engineering fixes poorly chunked PDFs.

Ingestion Steps

  1. Connect sources: File shares, Google Drive, SharePoint, Confluence, Zendesk tickets, database exports. Use LlamaIndex or custom connectors with scheduled sync (nightly) or webhook triggers on document change.
  2. Parse: Extract text from PDFs (including tables), DOCX, HTML, and markdown. Tools: Unstructured.io, LlamaParse, or Azure Document Intelligence for scanned documents.
  3. Clean: Strip headers/footers, remove duplicate boilerplate, normalise encoding. OCR'd Sinhala/Tamil documents need extra validation - test embedding quality on a sample set.
  4. Chunk: Split into 512–1,024 token segments with 10–20% overlap. Use semantic chunking (split at paragraph/section boundaries) rather than fixed character counts. Attach metadata: doc_id, title, department, classification, last_updated.
  5. Embed & store: Convert each chunk to a vector via your embedding model. Write to vector DB with metadata for filtering.
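Step 4 above (chunking at paragraph boundaries with overlap, then attaching metadata) might look like the sketch below. It makes two loud assumptions: whitespace-separated words stand in for model tokens, and the function names chunk_paragraphs and to_records are hypothetical. A real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_paragraphs(text, max_tokens=512, overlap_ratio=0.15):
    """Pack whole paragraphs into chunks, carrying a tail of words as overlap.

    Assumption: words approximate tokens. The size limit is soft: a single
    paragraph longer than max_tokens is kept whole, and the carried overlap
    can push a chunk slightly over the budget.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    keep = int(max_tokens * overlap_ratio)  # words repeated in the next chunk
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        # Close the current chunk before a paragraph that would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-keep:] if keep else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

def to_records(text, metadata, **chunk_options):
    """Attach document metadata to every chunk so it can be filtered later."""
    return [
        {**metadata, "chunk_index": i, "text": chunk}
        for i, chunk in enumerate(chunk_paragraphs(text, **chunk_options))
    ]

if __name__ == "__main__":
    sample = "a b c d\n\ne f g h\n\ni j k l"
    meta = {"doc_id": "HR-001", "department": "hr", "classification": "internal"}
    for record in to_records(sample, meta, max_tokens=8, overlap_ratio=0.25):
        print(record["chunk_index"], record["text"])
```

Splitting only at paragraph breaks keeps each chunk readable on its own, which is the point of semantic chunking over fixed character counts.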

Ingestion Checklist

  • ☐ Document inventory complete - owners assigned per source
  • ☐ Update cadence defined (real-time, daily, weekly)
  • ☐ Deprecated documents excluded or tagged archived
  • ☐ Chunk size tested on 20 sample Q&A pairs
  • ☐ Metadata schema includes department + access level
  • ☐ Re-ingestion pipeline tested after source document edit

Layers 2 & 4: Model and Vector Database Choices

Stack choices affect accuracy, cost, latency, and data residency. Match the tier to your compliance requirements - not every business needs self-hosted Llama.

Vector Database Comparison

Database | Deployment | Best For | Cost (indicative)
pgvector | Self-hosted / RDS | Teams already on PostgreSQL; <500K chunks | LKR 8,000–25,000/mo (infra)
Pinecone | Managed cloud | Fastest MVP; global SaaS teams | LKR 25,000–80,000/mo
Qdrant | Self-hosted or cloud | Data residency, hybrid search, filtering | LKR 15,000–50,000/mo
Weaviate | Self-hosted or cloud | Multi-tenant, enterprise RBAC | LKR 30,000–100,000/mo

Embedding & Generation Model Choices

Model | Role | Strength | Cost per 1M tokens
text-embedding-3-large | Embedding | Best English retrieval accuracy | ~LKR 42,000
Cohere embed-multilingual-v3 | Embedding | Sinhala, Tamil, English mixed docs | ~LKR 32,000
GPT-4o | Generation | Highest answer quality, citations | ~LKR 80,000 in / LKR 240,000 out
Claude 3.5 Sonnet | Generation | Long context, nuanced compliance Q&A | ~LKR 96,000 in / LKR 480,000 out
GPT-4o-mini | Generation | High-volume internal FAQ (80% quality, 10% cost) | ~LKR 4,800 in / LKR 19,200 out
Llama 3.1 70B (Ollama) | Generation | Air-gapped, zero external API calls | GPU infra LKR 150,000+/mo

Recommended stacks:

  • SME internal KB: pgvector + GPT-4o-mini + text-embedding-3-large
  • Multilingual Sri Lankan org: Qdrant + Cohere embeddings + Claude 3.5 Sonnet
  • Regulated / air-gapped: Qdrant self-hosted + nomic-embed-text + Llama 3.1 on private GPU

Layer 3: Permissions and Access Control

A private AI chatbot without RBAC is a data breach waiting to happen. Permissions must be enforced at retrieval time - not just at the UI.

Permission Model

  • Document-level tags: Tag each chunk with access_level (public, internal, confidential, legal-hold) and department at ingestion.
  • User-role mapping: Sync roles from your identity provider (Azure AD, Google Workspace, Okta). Map roles to permitted access levels.
  • Pre-retrieval filter: Apply metadata filters to vector search before returning chunks. A sales user never retrieves HR policy chunks - even if embeddings are semantically similar.
  • Post-retrieval audit: Log which chunks were retrieved for each query. Compliance teams can review access patterns monthly.
  • Sensitive field redaction: Strip NIC numbers, salaries, and account numbers from chunks at ingestion if they should never appear in generated answers.
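The pre-retrieval filter in the list above can be illustrated with an in-memory sketch. Everything named here (ROLE_POLICY, metadata_filter, authorised_chunks, the sample chunks) is hypothetical. In a real deployment the role mapping would come from the identity provider, and the same constraints would be passed to the vector database as a metadata filter on the query rather than applied in Python.

```python
# Hypothetical role policy; in practice this would be synced from the identity provider.
ROLE_POLICY = {
    "sales": {"levels": {"public", "internal"}, "departments": {"sales", "general"}},
    "hr": {"levels": {"public", "internal", "confidential"}, "departments": {"hr", "general"}},
}

def metadata_filter(role):
    """Translate a role into the metadata constraints applied to the search."""
    policy = ROLE_POLICY.get(role)
    if policy is None:
        # Deny by default: an unmapped role is allowed to retrieve nothing.
        return {"levels": set(), "departments": set()}
    return policy

def authorised_chunks(chunks, role):
    """Keep only chunks whose access level AND department the role may see."""
    flt = metadata_filter(role)
    return [
        c for c in chunks
        if c["access_level"] in flt["levels"] and c["department"] in flt["departments"]
    ]

# Illustrative chunk metadata, as tagged at ingestion time.
CHUNKS = [
    {"doc_id": "pricing-guide", "department": "sales", "access_level": "internal"},
    {"doc_id": "salary-bands", "department": "hr", "access_level": "confidential"},
    {"doc_id": "holiday-calendar", "department": "general", "access_level": "public"},
]

if __name__ == "__main__":
    for role in ("sales", "hr", "contractor"):
        print(role, [c["doc_id"] for c in authorised_chunks(CHUNKS, role)])
```

The filter runs before similarity ranking, so a chunk the user may not see never becomes a candidate, however close its embedding is to the query.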

Evaluation: How to Know Your RAG Works

Most failed enterprise RAG implementation projects skip systematic evaluation. Build a golden dataset of 50–100 real questions with expected answers and source documents before launch.

Metrics That Matter

Metric | What It Measures | Launch Threshold
Retrieval recall@5 | Correct source chunk in top 5 results | ≥ 85%
Answer faithfulness | Answer supported by retrieved context (no fabrication) | ≥ 90%
Citation accuracy | Cited document actually contains the claim | ≥ 95%
Refusal accuracy | Correctly says "not in knowledge base" when answer absent | ≥ 90%
Latency (p95) | Time from query to response | < 5 seconds

Use frameworks like RAGAS or LlamaIndex eval modules to automate scoring. Re-run evaluation after every major document update or chunking change. Professional RAG development services should deliver an evaluation report as a launch gate - not just a working demo.
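As a rough sketch of how the launch gate could be automated, the code below computes recall@5 from a golden dataset and checks a set of scores against the thresholds in the table. It does not call RAGAS or LlamaIndex; the faithfulness, citation, and refusal scores are assumed to arrive from whichever tool or review process produces them, and the names recall_at_k and launch_gate are hypothetical.

```python
# Launch thresholds taken from the metrics table (latency is checked separately).
THRESHOLDS = {
    "recall_at_5": 0.85,
    "faithfulness": 0.90,
    "citation_accuracy": 0.95,
    "refusal_accuracy": 0.90,
}

def recall_at_k(results, k=5):
    """Fraction of questions whose expected chunk appears in the top k results.

    results: list of (expected_chunk_id, ranked_retrieved_chunk_ids) pairs.
    """
    if not results:
        return 0.0
    hits = sum(1 for expected, retrieved in results if expected in retrieved[:k])
    return hits / len(results)

def launch_gate(scores, thresholds=THRESHOLDS):
    """Return (passed, failures), where failures maps metric name to its score."""
    failures = {
        name: scores.get(name, 0.0)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }
    return (not failures, failures)

if __name__ == "__main__":
    golden = [
        ("c1", ["c1", "c9"]),
        ("c2", ["c7", "c8", "c3", "c4", "c5", "c2"]),  # expected chunk ranked 6th: a miss
    ]
    scores = {
        "recall_at_5": recall_at_k(golden),
        "faithfulness": 0.93,
        "citation_accuracy": 0.96,
        "refusal_accuracy": 0.94,
    }
    print(launch_gate(scores))
```

A missing metric counts as a failure, which keeps an incomplete evaluation run from passing the gate by accident.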

Cost Examples (LKR)

Costs vary by document volume, user count, and compliance tier. These ranges reflect 2026 pricing for Sri Lankan and regional deployments built by Hashtag Coders and comparable vendors.

Tier | Scope | Build Cost | Monthly Run Cost
Starter | 100–300 docs, 20 users, web chat, English only | LKR 400,000–800,000 | LKR 25,000–50,000
Professional | 500–2,000 docs, RBAC, Slack/Teams, multilingual | LKR 1.2M–2.5M | LKR 60,000–150,000
Enterprise | 5,000+ docs, SSO, audit logs, hybrid search, SLA | LKR 3M–8M | LKR 200,000–500,000
Air-gapped | On-prem GPU, zero external API, banking/legal | LKR 5M–15M | LKR 300,000–800,000

Per-query cost example (Professional tier): 1,000 queries/day × ~3,000 tokens/query × GPT-4o-mini ≈ LKR 35,000/month in LLM API fees alone. Embedding and vector DB add LKR 15,000–40,000. Budget 20% overhead for re-ingestion and evaluation re-runs.

Privacy Risks and Mitigations

A RAG private AI assistant is safer than pasting documents into public ChatGPT - but "private" is not automatic. These are the risks that cause compliance failures and the controls that address them.

Risk | How It Happens | Mitigation
LLM provider data retention | Retrieved chunks sent to OpenAI/Anthropic APIs may be logged | Use enterprise API agreements with zero-retention; or self-host Llama
Over-retrieval leakage | User retrieves confidential chunks outside their role | Metadata filters on every query; audit logs; quarterly access reviews
Prompt injection | Malicious text in a document instructs the LLM to ignore policies | Sanitise ingested content; system prompt hardening; output guardrails
Hallucinated compliance advice | LLM fabricates an answer when retrieval misses the right chunk | Mandatory refusal behaviour; faithfulness eval ≥ 90%; human review for legal/medical
Stale document answers | Old policy version retrieved after update | Version metadata; automated re-ingestion on doc change; show last_updated in citations
Cross-border data transfer | Vector DB or LLM hosted outside Sri Lanka / client jurisdiction | Self-hosted Qdrant + regional API endpoints; document data residency in contract
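The audit-log control listed above relies on every query leaving a reviewable record. One possible shape for that record, using the fields named in Layer 5 of the architecture diagram (query, chunks retrieved, model used, user_id, timestamp), is sketched below. The class name AuditRecord, the field names, and the JSON-lines format are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """One retrieval event, written once per user query."""
    user_id: str
    query: str
    chunk_ids: list
    model: str
    # UTC timestamp in ISO 8601, filled in automatically when not supplied.
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def to_log_line(record):
    """Serialise a record as one JSON line, suitable for an append-only log file."""
    # ensure_ascii=False keeps Sinhala and Tamil queries readable in the log.
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)

if __name__ == "__main__":
    record = AuditRecord(
        user_id="u-017",  # hypothetical identifier
        query="How many days of annual leave can be carried forward?",
        chunk_ids=["HR-Policy-2026.pdf#4.2"],
        model="gpt-4o-mini",
    )
    print(to_log_line(record))
```

Logging chunk identifiers rather than chunk text keeps confidential content out of the log while still letting reviewers see who retrieved what.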

Case Study & Demo Walkthrough: Internal HR Policy Assistant

The following is a real deployment pattern from a Colombo professional services firm (45 staff, 280 HR and IT policy documents). It shows exactly how a RAG private AI assistant processes a query - the "working demo" decision-makers ask for before approving budget.

Deployment Summary

  • Stack: pgvector on AWS RDS, text-embedding-3-large, GPT-4o-mini, Next.js chat UI, Azure AD SSO
  • Build time: 5 weeks (ingestion 2w, retrieval tuning 1w, RBAC + UI 1w, eval 1w)
  • Build cost: LKR 650,000 | Monthly run: LKR 38,000
  • Results after 60 days: HR ticket volume down 58%, average policy lookup from 18 min → 12 sec, faithfulness score 93%

Live Query Walkthrough

USER QUERY (Finance analyst, role: internal)

"How many days of annual leave can an employee carry forward to the next year?"

STEP 1 - PERMISSION FILTER

User role internal → search filtered to access_level: [public, internal] → 4,200 eligible chunks (excludes confidential board docs)

STEP 2 - RETRIEVAL (hybrid search, top 5)

① HR-Policy-2026.pdf §4.2 - Annual Leave Carry Forward (score: 0.91)

② Employee-Handbook-v3.docx - Leave Entitlements (score: 0.84)

③ … (3 more chunks below threshold, discarded)

STEP 3 - GENERATED ANSWER (2.1s latency)

Employees may carry forward a maximum of 7 unused annual leave days into the next calendar year. Days beyond 7 lapse on 31 December unless approved in writing by the department head. Carry-forward days must be used by 31 March of the following year or they are forfeited.

Sources: HR-Policy-2026.pdf §4.2 · Employee-Handbook-v3.docx p.18 · Last updated: 15 Jan 2026

When the same user asked "What was our Q3 revenue?" - a question outside the knowledge base - the assistant responded: "I don't have revenue data in the current knowledge base. Please contact the Finance team or check the ERP system." Refusal accuracy on out-of-scope questions in evaluation: 94%.
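Steps 2 and 3 of the walkthrough, plus the refusal case, can be sketched as follows. The score cut-off of 0.80 is an assumption (the walkthrough does not state the threshold it used), the chunk text is illustrative, and select_context and build_prompt are hypothetical names. The prompt would be sent to whichever generation model the deployment uses.

```python
SYSTEM_RULES = (
    "Answer ONLY from the numbered context below. "
    "Cite the source of every claim. "
    "If the context does not contain the answer, say it is not in the knowledge base."
)

def select_context(scored_chunks, min_score=0.80, max_chunks=5):
    """Drop chunks below the score threshold, then keep the best few."""
    kept = [c for c in scored_chunks if c["score"] >= min_score]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:max_chunks]

def build_prompt(question, context):
    """Return a grounded prompt, or None when nothing relevant was retrieved."""
    if not context:
        return None  # caller should reply with a refusal instead of calling the LLM
    numbered = "\n".join(
        f"[{i}] {c['source']}: {c['text']}" for i, c in enumerate(context, start=1)
    )
    return f"{SYSTEM_RULES}\n\nContext:\n{numbered}\n\nQuestion: {question}"

if __name__ == "__main__":
    retrieved = [
        {"source": "Employee-Handbook-v3.docx", "score": 0.84, "text": "Leave entitlements (illustrative text)"},
        {"source": "HR-Policy-2026.pdf §4.2", "score": 0.91, "text": "Annual leave carry forward (illustrative text)"},
        {"source": "IT-Policy.pdf", "score": 0.52, "text": "Unrelated chunk (illustrative text)"},
    ]
    context = select_context(retrieved)
    print(build_prompt("How many annual leave days can be carried forward?", context))
    print(build_prompt("What was our Q3 revenue?", select_context([])))
```

Returning None for an empty context lets the application answer with a fixed refusal message without calling the model, so an out-of-scope question cannot produce a fabricated answer.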

Enterprise RAG Implementation Roadmap

  1. Week 1–2 - Discovery: Document audit, use-case prioritisation, golden Q&A dataset (50 questions), stack selection, privacy review.
  2. Week 3–4 - Ingestion: Build pipelines, chunk and embed, metadata tagging, RBAC schema.
  3. Week 5 - Retrieval tuning: Hybrid search, reranking, chunk size A/B tests against golden set.
  4. Week 6 - Generation & UI: System prompts, citation format, chat interface or Slack bot, SSO integration.
  5. Week 7 - Evaluation gate: Run RAGAS metrics; must pass launch thresholds before pilot.
  6. Week 8+ - Pilot & scale: 10-user pilot, feedback loop, full rollout, monthly eval re-runs.

RAG Development Services: What to Expect

Professional RAG development services should deliver more than a prototype. At minimum, insist on:

  • Architecture document with data flow and permission model
  • Ingestion pipeline with automated re-sync on document changes
  • Evaluation report against agreed metrics before production launch
  • RBAC integration with your identity provider
  • Audit logging and source citation on every response
  • 30-day post-launch tuning period included in scope

At Hashtag Coders, we build end-to-end RAG private AI assistant systems for Sri Lankan and international clients - from document audit through ingestion, evaluation, permissions, and production deployment. We also support multilingual collections (English, Sinhala, Tamil) and air-gapped architectures for regulated industries.

Conclusion

Building a private AI chatbot with RAG is the most practical path to enterprise AI in 2026 - if you treat it as a five-layer system with ingestion discipline, permission enforcement, systematic evaluation, and clear privacy controls. The architecture is proven, costs are predictable, and the ROI is measurable from day one of the pilot.

Ready to scope your enterprise RAG implementation? Contact Hashtag Coders for a free consultation - we will audit your document landscape, recommend a stack, and estimate build and run costs for your specific use case.

Frequently Asked Questions

What is a RAG private AI assistant?

A RAG private AI assistant is a chatbot or Q&A system that uses Retrieval-Augmented Generation to answer questions from your organisation's own documents. It retrieves relevant passages from a private vector database, generates answers grounded in those passages, cites sources, and keeps your full document library in infrastructure you control.

How is RAG different from fine-tuning or ChatGPT Enterprise?

Fine-tuning embeds knowledge into model weights, which makes updates expensive and provides no source citations. ChatGPT Enterprise still runs in OpenAI's environment. RAG keeps documents in your vector store, updates as soon as files are re-ingested, and cites exact sources. For most document Q&A use cases, RAG is faster to deploy, cheaper to maintain, and easier to audit.

How much does enterprise RAG implementation cost in Sri Lanka?

Starter internal knowledge bases run LKR 400,000–800,000 to build and LKR 25,000–50,000/month to operate. Professional deployments with RBAC and integrations cost LKR 1.2M–2.5M to build. Enterprise and air-gapped systems range from LKR 3M to LKR 15M depending on document volume, compliance, and self-hosting requirements.

Which vector database should I choose?

Use pgvector if you already run PostgreSQL and have under 500K chunks. Choose Pinecone for the fastest MVP. Choose Qdrant or Weaviate when you need data residency, fine-grained metadata filtering, or enterprise RBAC. For regulated industries, self-hosted Qdrant is the most common choice in Sri Lankan enterprise deployments.

How do you evaluate RAG quality before launch?

Build a golden dataset of 50–100 real questions with expected answers and source documents. Measure retrieval recall@5 (≥85%), answer faithfulness (≥90%), citation accuracy (≥95%), and refusal accuracy on out-of-scope questions (≥90%). Use RAGAS or LlamaIndex eval tools. Do not launch without passing these thresholds on your actual document set.

Can RAG handle Sinhala and Tamil documents?

Yes. Use Cohere embed-multilingual-v3 or OpenAI text-embedding-3-large for embeddings, and a multilingual generation model (Claude 3.5 Sonnet or GPT-4o). Response quality is highest when the user's query language matches the retrieved chunk language. Hashtag Coders has built multilingual RAG systems for Sri Lankan public sector and professional services clients.

What privacy risks should I plan for?

The main risks are LLM provider data retention, over-retrieval across roles, prompt injection via document content, hallucinated answers when retrieval fails, stale document versions, and cross-border hosting. Mitigate with enterprise API agreements, metadata RBAC filters, ingestion sanitisation, mandatory refusal behaviour, version metadata on chunks, and self-hosted vector stores where residency is required.
