Every organization has a mountain of documents—contracts, invoices, forms, reports—packed with business-critical details. Yet much of this information is locked away in PDFs or scans. Searching for a clause or reconciling invoices by hand is slow, costly, and error-prone.
Think about your own workplace: How much time is spent just finding information buried in documents? What if you could turn this chaos into a living knowledge base or automate entire workflows—with minimal manual effort?
That’s where Amazon Textract and Amazon Bedrock come in. Textract acts as your high-precision extractor, now supporting advanced features such as QUERIES for targeted field extraction, LAYOUT for structural elements, and SIGNATURES for signature detection. Bedrock’s foundation models—like Titan, Claude, Cohere, and Llama—transform this raw output into insights: summarizing, answering questions, or triggering automations.
In this chapter, you’ll learn how to build end-to-end pipelines that:
- Extract text, structure, and targeted fields from documents with Textract
- Chunk and embed the extracted text for semantic search
- Index content in OpenSearch for hybrid (keyword + vector) retrieval
- Answer questions over your documents with Bedrock LLMs, citing sources
Let’s break down these building blocks step by step, grounding each with practical code and business examples.
Documents are like unmined gold. Invoices and contracts contain key details—totals, deadlines, obligations—but these are buried in unstructured formats. Manual extraction doesn’t scale. The challenge: unlock these insights quickly, accurately, and securely.
AWS provides a solution. Textract specializes in Optical Character Recognition (OCR) and structured extraction, reading not just plain text but also tables, key-value pairs, signatures, and layout elements. Its QUERIES feature lets you directly extract answers to business questions (e.g., “What is the payment term?”) with high precision. Bedrock gives you access to state-of-the-art foundation models (like Titan, Claude, Cohere, and Llama) for advanced reasoning, summarization, and workflow automation.
By connecting these tools, you can automate tasks like compliance checks, contract review, or invoice routing. Imagine a system that reads contracts, flags risky clauses, summarizes obligations, or answers, “What is the payment term in Vendor X’s latest invoice?”—with source citations. Modern pipelines also ensure data security (encryption, IAM least-privilege, VPC endpoints) and cost efficiency (prompt caching, batch processing) for production environments.
A typical document intelligence pipeline now leverages advanced extraction, flexible embedding, and modern retrieval patterns in four core steps:
# 1. Extract text, structure, and targeted fields from a document using advanced Textract features
import boto3
import json

textract = boto3.client("textract")

# Example: using QUERIES and LAYOUT for targeted extraction
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "contract.pdf"}},
    FeatureTypes=["FORMS", "TABLES", "QUERIES", "LAYOUT", "SIGNATURES"],
    QueriesConfig={
        "Queries": [
            {"Text": "What is the payment term?", "Alias": "PaymentTerm"},
            {"Text": "What is the contract effective date?", "Alias": "EffectiveDate"}
        ]
    }
)

# extract_text_from_blocks: parses Textract response blocks into plain text
text = extract_text_from_blocks(response["Blocks"])
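Here, extract_text_from_blocks is assumed rather than defined. One minimal way to implement it is sketched below, together with a hypothetical companion, extract_query_answers (a name introduced here for illustration), that collects the answer Textract returns for each QUERY alias:

# Minimal sketch: join the text of Textract LINE blocks into a single string.
def extract_text_from_blocks(blocks):
    # LINE blocks already aggregate WORD blocks, so using both would duplicate text
    return "\n".join(b["Text"] for b in blocks if b["BlockType"] == "LINE")

# Hypothetical companion helper: map each QUERY alias to its detected answer.
def extract_query_answers(blocks):
    by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for b in blocks:
        if b["BlockType"] != "QUERY":
            continue
        alias = b["Query"].get("Alias", b["Query"]["Text"])
        for rel in b.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for rid in rel["Ids"]:
                    # ANSWER relationships point at QUERY_RESULT blocks
                    answers[alias] = by_id[rid].get("Text", "")
    return answers

With these helpers in place, a targeted field such as the payment term is available directly via extract_query_answers(response["Blocks"])["PaymentTerm"].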
# 2. Chunk and embed the text for semantic search (benchmark multiple embedding models)
from textractsplitter import chunk_text  # Or use a maintained alternative

chunks = chunk_text(text, chunk_size=512, overlap=128)

bedrock = boto3.client("bedrock-runtime")

# Choose the embedding model best suited for your use case (e.g., Titan Text Embeddings V2 or Cohere Embed)
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"  # Example: Titan Text Embeddings V2

def embed_chunk(chunk):
    resp = bedrock.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": chunk})  # Titan expects a single string, not a list
    )
    # Parse the embedding vector from the response (field names are model-specific)
    return json.loads(resp["body"].read())["embedding"]

embeddings = [embed_chunk(c) for c in chunks]
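If textractsplitter isn’t available in your environment, a character-based sliding-window chunker captures the same idea. This is a sketch only; token-aware chunking is preferable in production:

# Minimal sketch: fixed-size character windows with overlap between neighbors.
def chunk_text(text, chunk_size=512, overlap=128):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]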
# 3. Index in OpenSearch for hybrid (keyword + vector) search using the latest APIs
from opensearchpy import OpenSearch

os_client = OpenSearch(...)

# OpenSearch now supports native hybrid search and RAG pipelines
for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
    doc = {"text": chunk, "embedding": emb, "chunk": i}
    os_client.index(index="doc-kb", body=doc)
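One detail the loop above assumes: the doc-kb index must already exist with a knn_vector mapping, or vector queries will fail. A minimal sketch, assuming Titan Text Embeddings V2’s default dimension of 1024:

# Minimal sketch: create the index with k-NN enabled before indexing documents.
os_client.indices.create(
    index="doc-kb",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "chunk": {"type": "integer"},
                # Dimension must match your embedding model (1024 for Titan Text Embeddings V2)
                "embedding": {"type": "knn_vector", "dimension": 1024},
            }
        },
    },
)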
# 4. Use a Bedrock LLM to answer questions with citations (enable prompt caching for cost optimization)
def build_prompt(question, passages):
    context = "\n\n".join([f"[Chunk {p['chunk']}]\n{p['text']}" for p in passages])
    return (
        "Answer the question using ONLY the provided passages. Cite sources.\n\n"
        f"Question: {question}\nPassages:\n{context}"
    )

# ...Retrieve top passages using OpenSearch's hybrid search, call the Bedrock LLM, and return an answer with citations.
# Retrieval and answer generation will be detailed in the next section.
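As a compact preview of what the next section covers in depth, the retrieval-and-generation step might look like the sketch below. The Claude model ID is one example, and vector-only retrieval stands in for full hybrid search here:

# Sketch: retrieve top passages by vector similarity, then generate an answer.
def answer_question(question, k=3):
    # k-NN retrieval; hybrid search would add a keyword clause alongside this
    hits = os_client.search(
        index="doc-kb",
        body={"size": k, "query": {"knn": {"embedding": {"vector": embed_chunk(question), "k": k}}}},
    )["hits"]["hits"]
    passages = [h["_source"] for h in hits]

    # Ask the LLM to answer from those passages only
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": build_prompt(question, passages)}],
        }),
    )
    return json.loads(resp["body"].read())["content"][0]["text"]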
How this works: extract_text_from_blocks converts Textract’s response into plain text. Retrieval and answer generation logic will be explored in detail in the next section and in Chapter 6 (Retrieval-Augmented Generation).
These pipelines aren’t just technical exercises; they solve real business problems such as compliance checks, contract review and obligation tracking, and invoice routing.
Production deployments should always secure data in transit and at rest (using KMS encryption), restrict access via IAM least-privilege, and leverage VPC endpoints for sensitive workloads.
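As a hedged illustration of the least-privilege principle, the pipeline role’s policy can be scoped to exactly the calls used in this chapter. The resource ARNs below are placeholders for this example:

# Hypothetical least-privilege policy for the pipeline role: read one bucket,
# analyze documents with Textract, and invoke a single Bedrock model.
PIPELINE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-bucket/*"},
        {"Effect": "Allow", "Action": "textract:AnalyzeDocument",
         "Resource": "*"},
        {"Effect": "Allow", "Action": "bedrock:InvokeModel",
         "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"},
    ],
}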
By automating document intelligence, organizations reduce manual effort, minimize errors, and unlock insights for better decisions. With AWS, these solutions scale securely from a handful of documents to millions, while modern features and cost controls (like prompt caching and batch processing) ensure efficiency.