Imagine asking your company’s smartest employee a question—only to find they’ve forgotten everything since their last training. This analogy captures a core limitation of large language models (LLMs): while they excel at generating language, their built-in knowledge is fixed at the time of training, and their memory for new or dynamic information is inherently limited.
LLMs such as GPT-4 Turbo, Claude 3 Opus, Gemini 1.5, and Llama 3 are trained on vast text datasets, enabling them to generate fluent, relevant responses for a wide variety of prompts. They can write emails, summarize documents, and answer questions. However, their knowledge is static—limited to what was available at their last training cut-off—and they cannot access new or proprietary information unless you explicitly provide that context.
While many LLMs support context windows of 4,000 to 32,000 tokens (roughly a few to a few dozen pages), state-of-the-art models as of 2025 now offer context windows of up to 1 million tokens or more. This means that, in some cases, you can provide much more information in a single interaction. However, even with these advances, context windows are not a substitute for true long-term memory: you still need to explicitly supply relevant information, and practical limitations—like latency, cost, and retrieval efficiency—remain, especially at enterprise scale.
These limitations translate into real-world challenges.
Consider a practical scenario: you’re building a customer support assistant for a SaaS company. If a customer asks about a feature released last month, the LLM won’t know about it—unless you include the release notes or relevant documentation in every prompt. With larger context windows, you can supply more information, but this quickly becomes unwieldy and inefficient as your knowledge base grows.
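To make "unwieldy and inefficient" concrete, here is a rough sketch that counts how many tokens a stuffed prompt would consume, using OpenAI's tiktoken tokenizer. The document list, the 128,000-token context limit, and the per-token price are illustrative assumptions, not figures from any particular provider.

# A rough sketch of why "just stuff everything into the prompt" doesn't scale.
# Assumes the tiktoken package is installed; the documents, context limit,
# and pricing below are illustrative placeholders.
import tiktoken

# Pretend knowledge base: every release note, FAQ, and policy document you have.
knowledge_base = [
    "Release notes, March: added audit logs, SSO improvements, and a new API.",
    "Refund policy, updated January 2024: full refunds within 30 days.",
    # ...hundreds more documents in a real deployment
]

encoder = tiktoken.get_encoding("cl100k_base")  # tokenizer used by many OpenAI models

total_tokens = sum(len(encoder.encode(doc)) for doc in knowledge_base)
context_window = 128_000        # example context limit; varies by model
price_per_1k_tokens = 0.01      # illustrative input price in USD

print(f"Knowledge base size: {total_tokens:,} tokens")
print(f"Fits in a single prompt: {total_tokens <= context_window}")
print(f"Approximate cost to resend it with every request: "
      f"${total_tokens / 1000 * price_per_1k_tokens:.4f}")

Even when everything technically fits, you pay the latency and cost of re-sending the entire knowledge base with every single request, and that bill grows linearly with your documentation.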
How do we solve this? Retrieval-Augmented Generation (RAG) provides a dynamic "memory upgrade" for LLMs. With RAG, models can fetch relevant, up-to-date information from external sources on demand—much like a human searching a knowledge base before answering a tough question.
RAG consists of two main components: a retriever, which searches an external knowledge base (often a vector database) for information relevant to the user's query, and a generator, the LLM itself, which composes its answer from the query plus the retrieved context.
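Before wiring these together, here is a minimal sketch of what each component might look like in Python. It assumes the sentence-transformers and openai packages and an OPENAI_API_KEY environment variable, and it uses a toy in-memory document list in place of a real vector database; the model names and top_k value are illustrative choices, not requirements.

# Minimal sketch of the two RAG components; all names and models here are
# illustrative assumptions, not a prescribed implementation.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# --- Retriever: embed documents once, then find the closest matches to a query ---
documents = [
    "Refunds are available within 30 days of purchase for annual plans.",
    "The new audit-log feature shipped in the March release.",
    "Support is available 24/7 via chat and email.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents)  # shape: (num_docs, embedding_dim)

def retrieve_documents(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the query (cosine similarity)."""
    query_embedding = embedder.encode(query)
    scores = doc_embeddings @ query_embedding / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

# --- Generator: send the retrieved context plus the question to an LLM ---
client = OpenAI()

def call_llm(prompt: str) -> str:
    """Send the prompt to a chat model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

In a real deployment, the in-memory list and brute-force cosine similarity would typically be replaced by a dedicated vector database, but the interface stays the same: a query goes in, a handful of relevant snippets comes out.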
Let’s see RAG in action with a simple, modern pipeline:
# Step 1: The user asks a question
user_query = "What is our company's refund policy for 2024?"

# Step 2: Retrieve relevant documents from the knowledge base
# (e.g., with the vector-search helper sketched above)
retrieved_docs = retrieve_documents(user_query)

# Step 3: Pass both the query and the retrieved documents to the LLM.
# Modern LLM APIs accept large context windows, but we still supply only
# the most relevant snippets for efficiency.
context = "\n\n".join(retrieved_docs)
generated_answer = call_llm(
    prompt=f"Context: {context}\nQuestion: {user_query}"
)

print(generated_answer)
Step by step: