RAG Evaluation

Background

Implementing RAG in enterprises, especially with proprietary data, introduces considerable complexity. It’s common to see fine-tuned embedding models, diverse retrieval techniques, and multiple LLMs integrated within a comprehensive evaluation framework to meet accuracy standards.

Instead of viewing evaluation as a fixed, one-time process, we can make it an adaptive learning cycle that progressively refines accuracy. Arfniia Router facilitates this by experimenting with different RAG configurations and learning the optimal setups for the given context.

Cascading Error Propagation

RAG is a multi-step pipeline, and accuracy can drift at every stage, from chunking and embedding to retrieval and generation. Small errors at each step compound, causing a significant drop in end-to-end accuracy.
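
As a back-of-the-envelope illustration (the per-stage accuracies below are assumed for the sake of the example, not measured): if chunking, embedding, retrieval, and generation each reach 95% accuracy, end-to-end accuracy already falls to roughly 81%.

# Assumed per-stage accuracies: chunking, embedding, retrieval, generation
stage_accuracies = [0.95, 0.95, 0.95, 0.95]

end_to_end = 1.0
for stage_accuracy in stage_accuracies:
    end_to_end *= stage_accuracy

print(f"End-to-end accuracy: {end_to_end:.2%}")  # ~81.45%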

Optimized LLM routing can do more than match prompts to models: it can also identify the best configurations for context length and embedding alignment, reducing semantic drift between retrieval and generation and improving response quality.

Self-Optimizing RAG Evaluation

By incorporating Arfniia Router, we can turn standalone RAG evaluation into a self-optimizing process that systematically explores combinations of chunking strategies, embedding models, retrieval techniques, and LLMs to identify the best-performing configurations and improve overall accuracy.
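
As a rough sketch of what "systematically exploring combinations" can look like, the configuration space can be enumerated as a simple grid. The strategy names below are illustrative placeholders, not part of the Arfniia API:

from itertools import product

# Hypothetical component choices; swap in your own chunkers, embedders, retrievers, and LLMs
chunking_strategies = ["fixed_512", "semantic"]
embedding_models = ["emb_1", "emb_2"]
retrieval_techniques = ["dense", "hybrid"]
generation_models = ["model_a", "model_b", "eval-router"]

# Every combination is one candidate configuration to evaluate
configurations = list(product(chunking_strategies, embedding_models,
                              retrieval_techniques, generation_models))
print(len(configurations))  # 2 * 2 * 2 * 3 = 24 configurations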

Implementation Guide

Below is a simple RAG evaluation that loops over embedding models and LLMs to generate a QA accuracy report. It can easily be extended to cover chunking and retrieval strategies as well.

Embedding models:

  • emb_1
  • emb_2

LLMs:

  • model_a
  • model_b
  • eval-router (with model_a and model_b)
from collections import defaultdict

def update_accuracy(config_key, is_correct, accuracy_tracker):
    # Record one result for this configuration and return its running accuracy
    accuracy_tracker[config_key]["total"] += 1
    if is_correct:
        accuracy_tracker[config_key]["correct"] += 1
    correct = accuracy_tracker[config_key]["correct"]
    total = accuracy_tracker[config_key]["total"]
    accuracy = correct / total if total > 0 else 0
    return accuracy

# Evaluation setup
evaluation_dataset = [
    {"question": "What is the capital of France?", "golden_answer": "Paris"},
    {"question": "Who wrote '1984'?", "golden_answer": "George Orwell"},
    # ...
]
router_name = "eval-router"  # NOTE: router for [model_a, model_b]

# Lists of components to loop through
embedding_models = ["emb_1", "emb_2"]
generation_models = ["model_a", "model_b", router_name]

# KnowledgeAssistant wraps retrieval and generation behind the router
assistant = KnowledgeAssistant(router_name)
accuracy_tracker = defaultdict(lambda: {"correct": 0, "total": 0})

for item in evaluation_dataset:
    question = item['question']
    golden_answer = item['golden_answer']
    for emb_model in embedding_models:
        for llm in generation_models:
            config_key = (emb_model, llm)
            # Retrieve supporting documents with the current embedding model
            docs = assistant.retrieve_docs(question, emb_model)
            # Generate an answer with the current (embedding model, LLM) pair
            answer = assistant.answer(question, docs, emb_model, llm)
            # Exact-match scoring against the golden answer
            is_correct = (answer == golden_answer)
            accuracy = update_accuracy(config_key, is_correct, accuracy_tracker)
            # Send feedback to the router only for its own responses
            if llm == router_name:
                assistant.handle_feedback(answer.id, float(is_correct))
                assistant.handle_feedback("sparse", accuracy)
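
To turn the tracker into the QA accuracy report mentioned above, a minimal reporting step (assuming only the accuracy_tracker populated by the loop) could rank configurations by their final accuracy:

def print_accuracy_report(accuracy_tracker):
    # Rank (embedding model, LLM) configurations from best to worst
    rows = []
    for (emb_model, llm), counts in accuracy_tracker.items():
        accuracy = counts["correct"] / counts["total"] if counts["total"] else 0
        rows.append((accuracy, emb_model, llm, counts["total"]))
    for accuracy, emb_model, llm, total in sorted(rows, reverse=True):
        print(f"{emb_model} + {llm}: {accuracy:.2%} over {total} questions")

print_accuracy_report(accuracy_tracker)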

Key Takeaways

Enterprise-grade RAG applications require a robust evaluation framework to meet accuracy standards. Integrating Arfniia Router transforms standalone evaluation into an adaptive learning process with these advantages:

  • Context Beyond Prompts: Leverages custom context alongside evaluation prompts.
  • Self-Optimizing: Continuously improves accuracy by learning from each evaluation cycle.
  • Built-in LLM API Proxy: Allows seamless side-by-side LLM comparison with a unified API.