Every organization has a mountain of documents—contracts, invoices, forms, reports—packed with business-critical details. Yet much of this information is locked away in PDFs or scans. Searching for a clause or reconciling invoices by hand is slow, costly, and error-prone.
Think about your own workplace: How much time is spent just finding information buried in documents? What if you could turn this chaos into a living knowledge base or automate entire workflows—with minimal manual effort?
That’s where Amazon Textract and Amazon Bedrock come in. Textract acts as your high-precision extractor, now supporting advanced features such as QUERIES for targeted field extraction, LAYOUT for structural elements, and SIGNATURES for signature detection. Bedrock’s foundation models—like Titan, Claude, Cohere, and Llama—transform this raw output into insights: summarizing, answering questions, or triggering automations.
In this chapter, you’ll learn how to build end-to-end pipelines that:
- Extract text, structure, and targeted fields from documents with Textract
- Chunk and embed the extracted text for semantic search
- Index content in OpenSearch for hybrid (keyword + vector) retrieval
- Answer questions over your documents with Bedrock LLMs, citing sources
Let’s break down these building blocks step by step, grounding each with practical code and business examples.
Documents are like unmined gold. Invoices and contracts contain key details—totals, deadlines, obligations—but these are buried in unstructured formats. Manual extraction doesn’t scale. The challenge: unlock these insights quickly, accurately, and securely.
AWS provides a solution. Textract specializes in Optical Character Recognition (OCR) and structured extraction, reading not just plain text but also tables, key-value pairs, signatures, and layout elements. Its QUERIES feature lets you directly extract answers to business questions (e.g., “What is the payment term?”) with high precision. Bedrock gives you access to state-of-the-art foundation models (like Titan, Claude, Cohere, and Llama) for advanced reasoning, summarization, and workflow automation.
By connecting these tools, you can automate tasks like compliance checks, contract review, or invoice routing. Imagine a system that reads contracts, flags risky clauses, summarizes obligations, or answers, “What is the payment term in Vendor X’s latest invoice?”—with source citations. Modern pipelines also ensure data security (encryption, IAM least-privilege, VPC endpoints) and cost efficiency (prompt caching, batch processing) for production environments.
A typical document intelligence pipeline now leverages advanced extraction, flexible embedding, and modern retrieval patterns in four core steps:
# 1. Extract text, structure, and targeted fields from a document using advanced Textract features
import boto3
import json

textract = boto3.client("textract")

# Example: using QUERIES and LAYOUT for targeted extraction
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "contract.pdf"}},
    FeatureTypes=["FORMS", "TABLES", "QUERIES", "LAYOUT", "SIGNATURES"],
    QueriesConfig={
        "Queries": [
            {"Text": "What is the payment term?", "Alias": "PaymentTerm"},
            {"Text": "What is the contract effective date?", "Alias": "EffectiveDate"}
        ]
    }
)

# extract_text_from_blocks: parses Textract response blocks into plain text
text = extract_text_from_blocks(response["Blocks"])
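Here, extract_text_from_blocks is assumed rather than defined. One minimal way to implement it is sketched below, together with a hypothetical companion, extract_query_answers (a name introduced here for illustration), that collects the answer Textract returns for each QUERY alias:

# Minimal sketch: join the text of Textract LINE blocks into a single string.
def extract_text_from_blocks(blocks):
    # LINE blocks already aggregate WORD blocks, so using both would duplicate text
    return "\n".join(b["Text"] for b in blocks if b["BlockType"] == "LINE")

# Hypothetical companion helper: map each QUERY alias to its detected answer.
def extract_query_answers(blocks):
    by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for b in blocks:
        if b["BlockType"] != "QUERY":
            continue
        alias = b["Query"].get("Alias", b["Query"]["Text"])
        for rel in b.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for rid in rel["Ids"]:
                    # ANSWER relationships point at QUERY_RESULT blocks
                    answers[alias] = by_id[rid].get("Text", "")
    return answers

With these helpers in place, a targeted field such as the payment term is available directly via extract_query_answers(response["Blocks"])["PaymentTerm"].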
# 2. Chunk and embed the text for semantic search (benchmark multiple embedding models)
from textractsplitter import chunk_text  # Or use a maintained alternative

chunks = chunk_text(text, chunk_size=512, overlap=128)

bedrock = boto3.client("bedrock-runtime")

# Choose the embedding model best suited for your use case (e.g., Titan Text Embeddings V2 or Cohere Embed)
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"  # Example: Titan Text Embeddings V2

def embed_chunk(chunk):
    resp = bedrock.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": chunk})  # Titan expects a single string, not a list
    )
    # Parse the embedding vector from the response (field names are model-specific)
    return json.loads(resp["body"].read())["embedding"]

embeddings = [embed_chunk(c) for c in chunks]
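If textractsplitter isn’t available in your environment, a character-based sliding-window chunker captures the same idea. This is a sketch only; token-aware chunking is preferable in production:

# Minimal sketch: fixed-size character windows with overlap between neighbors.
def chunk_text(text, chunk_size=512, overlap=128):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]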
# 3. Index in OpenSearch for hybrid (keyword + vector) search using the latest APIs
from opensearchpy import OpenSearch

os_client = OpenSearch(...)

# OpenSearch now supports native hybrid search and RAG pipelines
for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
    doc = {"text": chunk, "embedding": emb, "chunk": i}
    os_client.index(index="doc-kb", body=doc)
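One detail the loop above assumes: the doc-kb index must already exist with a knn_vector mapping, or vector queries will fail. A minimal sketch, assuming Titan Text Embeddings V2’s default dimension of 1024:

# Minimal sketch: create the index with k-NN enabled before indexing documents.
os_client.indices.create(
    index="doc-kb",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "chunk": {"type": "integer"},
                # Dimension must match your embedding model (1024 for Titan Text Embeddings V2)
                "embedding": {"type": "knn_vector", "dimension": 1024},
            }
        },
    },
)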
# 4. Use a Bedrock LLM to answer questions with citations (enable prompt caching for cost optimization)
def build_prompt(question, passages):
    context = "\n\n".join([f"[Chunk {p['chunk']}]\n{p['text']}" for p in passages])
    return (
        "Answer the question using ONLY the provided passages. Cite sources.\n\n"
        f"Question: {question}\nPassages:\n{context}"
    )

# ...Retrieve top passages using OpenSearch's hybrid search, call the Bedrock LLM, and return an answer with citations.
# Retrieval and answer generation will be detailed in the next section.
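As a compact preview of what the next section covers in depth, the retrieval-and-generation step might look like the sketch below. The Claude model ID is one example, and vector-only retrieval stands in for full hybrid search here:

# Sketch: retrieve top passages by vector similarity, then generate an answer.
def answer_question(question, k=3):
    # k-NN retrieval; hybrid search would add a keyword clause alongside this
    hits = os_client.search(
        index="doc-kb",
        body={"size": k, "query": {"knn": {"embedding": {"vector": embed_chunk(question), "k": k}}}},
    )["hits"]["hits"]
    passages = [h["_source"] for h in hits]

    # Ask the LLM to answer from those passages only
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": build_prompt(question, passages)}],
        }),
    )
    return json.loads(resp["body"].read())["content"][0]["text"]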
How this works: extract_text_from_blocks converts Textract’s response into plain text. Retrieval and answer generation logic will be explored in detail in the next section and in Chapter 6 (Retrieval-Augmented Generation).
These pipelines aren’t just technical exercises; they solve real business problems such as compliance checks, contract review and obligation tracking, and invoice routing.
Production deployments should always secure data in transit and at rest (using KMS encryption), restrict access via IAM least-privilege, and leverage VPC endpoints for sensitive workloads.
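As a hedged illustration of the least-privilege principle, the pipeline role’s policy can be scoped to exactly the calls used in this chapter. The resource ARNs below are placeholders for this example:

# Hypothetical least-privilege policy for the pipeline role: read one bucket,
# analyze documents with Textract, and invoke a single Bedrock model.
PIPELINE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-bucket/*"},
        {"Effect": "Allow", "Action": "textract:AnalyzeDocument",
         "Resource": "*"},
        {"Effect": "Allow", "Action": "bedrock:InvokeModel",
         "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"},
    ],
}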
By automating document intelligence, organizations reduce manual effort, minimize errors, and unlock insights for better decisions. With AWS, these solutions scale securely from a handful of documents to millions, while modern features and cost controls (like prompt caching and batch processing) ensure efficiency.