Imagine you're a detective facing three locked rooms full of evidence: PDFs full of clues, spreadsheets of stats, and a database of interconnected facts. Python offers three keys to these rooms: pypdf for cracking open PDF documents, Pandas for taming wild data tables, and DuckDB for querying information at lightning speed. Together, they form a data processing toolkit that can turn overwhelming information overload into clear insights with just a few lines of code.
In this chapter, we'll explore how these libraries can help you extract text from PDFs automatically, manipulate complex datasets with intuitive commands, and run lightning-fast SQL queries without complex database setup. By the end, you'll be equipped to handle real-world data challenges that would otherwise require hours of manual work—or specialized knowledge you may not have time to acquire.
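To give you a taste of that last point, here is a minimal sketch of DuckDB running SQL directly against an in-memory Pandas DataFrame, with no database server or setup at all. It assumes the pandas and duckdb packages are installed, and the sales figures are invented purely for illustration:

import pandas as pd
import duckdb

# A tiny, made-up table standing in for data you might load or extract
sales = pd.DataFrame({
    "region": ["North", "South", "North", "West"],
    "revenue": [1200, 950, 1430, 780],
})

# DuckDB can query the DataFrame by its Python variable name,
# so there is no server to start and no schema to define
result = duckdb.sql(
    "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region"
).df()
print(result)

Here duckdb.sql returns a relation object, and calling .df() converts the query result back into a Pandas DataFrame, so the two libraries hand data back and forth with almost no ceremony.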
PDFs are everywhere in business and academia, but they're notoriously difficult to work with programmatically. Think of them as beautifully wrapped packages that need special tools to open without damaging the contents. The pypdf library (the successor to the older PyPDF2) is that specialized tool, letting you unwrap PDF content safely and extract the text inside.
Reading PDFs with pypdf is similar to opening and reading any file in Python, but with extra steps to handle the PDF structure. Let's look at how this works:
Before we dive into the code, imagine you have a PDF report that you'd normally open and read manually. Instead, we'll have Python open it and read the content for us—perfect for when you need to process multiple documents or extract specific information repeatedly.
from pypdf import PdfReader

with open('sample.pdf', 'rb') as file:
    reader = PdfReader(file)

    # Check whether the PDF is encrypted
    if reader.is_encrypted:
        # Try an empty password first
        reader.decrypt('')

    # Extract the text from the first page
    page = reader.pages[0]
    text = page.extract_text()

    # Handle the case where text extraction fails
    if not text:
        text = "No text found (possibly a scanned PDF)"

    print(text)
This simple pattern forms the foundation for any PDF text extraction task. Now, let's scale this up to handle multiple PDFs:
Imagine you have a folder full of research papers or financial reports, and you need to extract text from all of them. The following code shows how to process every PDF in a folder:
from pathlib import Path
from pypdf import PdfReader

pdf_files = Path('pdf_docs/').glob('*.pdf')

for pdf_file in pdf_files:
    with open(pdf_file, 'rb') as file:
        reader = PdfReader(file)
        for page_number, page in enumerate(reader.pages, start=1):
            text = page.extract_text()
            print(f"File: {pdf_file.name}, Page: {page_number}")
            # Print the first 100 characters as a preview
            print(text[:100] + "...\n")