PDF Files In Python – A PyPDF Example

PDF files are widely used for reports, invoices, contracts, and documentation, but automating tasks such as reading them, extracting text, or combining files can be challenging without the right tools. PyPDF, a lightweight and actively maintained Python library, makes these tasks much easier. This article explains how to read, write, and manipulate PDF files with it.

1. Prerequisites

  • Python 3.8+
  • Basic Python knowledge.
  • One or more PDF files to test with.

2. What Is PyPDF?

PyPDF is a pure Python library for reading, writing, and manipulating PDF files programmatically. It allows us to extract text, split and merge documents, rotate or reorder pages, and create new PDFs without relying on external tools or system dependencies. Because it is lightweight, actively maintained, and easy to integrate, PyPDF is well-suited for automation scripts, backend services, and document processing workflows in Python applications.

3. Installing PyPDF

Install PyPDF using pip:

pip install pypdf

On some systems, especially macOS and Linux, the pip command may be missing or tied to a different Python installation, while pip3 targets Python 3. In that case, use:

pip3 install pypdf

3.1 How to Use PyPDF

Using PyPDF typically involves three simple steps: loading a PDF, performing an operation, and saving the result. A PDF file is opened with PdfReader, which provides access to its pages and metadata. Any changes or new documents are handled with PdfWriter, where pages can be added, removed, or reordered. Once the desired operations are complete, the output is written to a new PDF file.

This straightforward workflow makes PyPDF easy to use for common tasks such as text extraction, splitting pages, and merging multiple documents.

4. Reading a PDF File

Before performing any operations, you need to load the PDF and inspect its structure.

from pypdf import PdfReader

# Load the PDF
reader = PdfReader("javafxmobile.pdf")

# Print basic information
print("Total pages:", len(reader.pages))

Output

Total pages: 10

This confirms that the PDF has been loaded successfully and shows how many pages it contains.
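
Beyond the page count, PdfReader also exposes document metadata. The helper below is a small sketch (the name summarize_pdf is hypothetical) that gathers the page count, title, and author. It allows for the metadata being absent, in which case pypdf returns None.

```python
def summarize_pdf(reader):
    """Return the page count and basic metadata of a PdfReader-like object."""
    meta = reader.metadata  # None when the file carries no metadata
    return {
        "pages": len(reader.pages),
        "title": meta.title if meta is not None else None,
        "author": meta.author if meta is not None else None,
    }


# Example usage, assuming pypdf is installed and the file exists:
# from pypdf import PdfReader
# print(summarize_pdf(PdfReader("javafxmobile.pdf")))
```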

5. Extracting Text from a PDF

Extracting text is one of the most common tasks when working with PDFs, whether for search indexing, data analysis, or content processing.

Extract Text from a Single Page

from pypdf import PdfReader

reader = PdfReader("javafxmobile.pdf")

# Extract text from the first page
page = reader.pages[0]
text = page.extract_text()

print(text)

Extract Text from All Pages

Sometimes you need to extract text from every page in a PDF, for example, to analyze the entire document or prepare it for search indexing. PyPDF makes it easy to loop through all pages and retrieve their content systematically.

from pypdf import PdfReader

reader = PdfReader("javafxmobile.pdf")

for i, page in enumerate(reader.pages):
    text = page.extract_text()
    print(f"--- Page {i + 1} ---")
    print(text)

Note
Text extraction from PDFs is not always perfect. PDFs store content according to layout rather than reading order, so extracted text can sometimes appear jumbled or have missing spaces. However, for most well-structured documents, PyPDF provides reliable and usable results.
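
One way to soften these quirks, for example before search indexing, is to normalize whitespace and skip pages that yield no text. The helper below is a sketch with a hypothetical name (collect_text). It assumes that line breaks within a page do not matter for the task at hand, since it collapses them into single spaces.

```python
def collect_text(pages):
    """Join the text of all pages, normalizing whitespace and skipping blank pages."""
    chunks = []
    for number, page in enumerate(pages, start=1):
        raw = page.extract_text() or ""  # guard against an empty result
        cleaned = " ".join(raw.split())  # collapse runs of spaces and newlines
        if cleaned:
            chunks.append(f"--- Page {number} ---\n{cleaned}")
    return "\n\n".join(chunks)


# Example usage, assuming pypdf is installed and the file exists:
# from pypdf import PdfReader
# print(collect_text(PdfReader("javafxmobile.pdf").pages))
```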

6. Splitting PDF Pages

Splitting PDFs is useful when extracting individual pages or separating large documents.

Split Each Page into a Separate PDF

from pypdf import PdfReader, PdfWriter

reader = PdfReader("javafxmobile.pdf")

for index, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)

    output_filename = f"page_{index + 1}.pdf"
    with open(output_filename, "wb") as output_file:
        writer.write(output_file)

    print(f"Created {output_filename}")

Each page is now saved as its own PDF file.

Extract a Specific Page Range

If you only need certain pages from a PDF, like a chapter from a report or selected invoices, you can extract a specific range of pages and save them as a new PDF.

from pypdf import PdfReader, PdfWriter

reader = PdfReader("javafxmobile.pdf")
writer = PdfWriter()

# Extract pages 2 to 4: indices 1 to 3, since pages are 0-based and range() excludes the end value
for i in range(1, 4):
    writer.add_page(reader.pages[i])

with open("pages_2_to_4.pdf", "wb") as output_file:
    writer.write(output_file)

print("pages_2_to_4.pdf created")
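
Because page numbers shown to readers are 1-based while reader.pages is 0-based, off-by-one mistakes are easy to make here. A small helper, sketched below under the hypothetical name page_indices, converts an inclusive 1-based range into the indices to pass to reader.pages and rejects ranges that fall outside the document.

```python
def page_indices(first_page, last_page, total_pages):
    """Convert a 1-based, inclusive page range into 0-based indices."""
    if not 1 <= first_page <= last_page <= total_pages:
        raise ValueError(
            f"Invalid range {first_page}-{last_page} for a {total_pages}-page document"
        )
    return range(first_page - 1, last_page)


# Example usage with the reader and writer from the script above:
# for i in page_indices(2, 4, len(reader.pages)):
#     writer.add_page(reader.pages[i])
```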

7. Merging Multiple PDF Files

Merging PDFs is ideal for combining reports, invoices, or documents generated by different systems.

from pypdf import PdfReader, PdfWriter

files_to_merge = ["document1.pdf", "document2.pdf", "document3.pdf"]

writer = PdfWriter()

for file_name in files_to_merge:
    reader = PdfReader(file_name)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output_file:
    writer.write(output_file)

print("merged.pdf created successfully")

Pages appear in the merged file in the order the source files are listed, and each file's own page order is preserved.

8. Encryption and Decryption of PDFs

PyPDF allows us to secure PDFs with a password or unlock encrypted PDFs for reading and processing.

Encrypting a PDF

from pypdf import PdfReader, PdfWriter

# Load the PDF
reader = PdfReader("javafxmobile.pdf")
writer = PdfWriter()

# Copy all pages
for page in reader.pages:
    writer.add_page(page)

# Encrypt the PDF with a password
writer.encrypt(user_password="mypassword", owner_password="ownerpassword")

# Save the encrypted PDF
with open("encrypted_sample.pdf", "wb") as file:
    writer.write(file)

print("encrypted_sample.pdf created with password protection")

Decrypting a PDF

from pypdf import PdfReader

# Load the encrypted PDF
reader = PdfReader("encrypted_sample.pdf")

# Provide the password to decrypt
reader.decrypt("mypassword")

# Access pages after decryption
print("Total pages after decryption:", len(reader.pages))

Encryption is useful for protecting sensitive documents, and decryption allows us to process secured files programmatically.
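
The decryption script above does not check whether the password was accepted. The sketch below (the helper name unlock is hypothetical) first tests is_encrypted and then inspects the value returned by decrypt(). It relies on pypdf reporting a failed attempt as PasswordType.NOT_DECRYPTED, which equals 0 and is therefore falsy.

```python
def unlock(reader, password):
    """Decrypt a PdfReader-like object if needed; raise if the password is rejected."""
    if reader.is_encrypted:
        # A falsy result (NOT_DECRYPTED == 0) means the password did not match.
        if not reader.decrypt(password):
            raise ValueError("Incorrect password: the PDF could not be decrypted")
    return reader


# Example usage, assuming pypdf is installed and the file exists:
# from pypdf import PdfReader
# reader = unlock(PdfReader("encrypted_sample.pdf"), "mypassword")
# print(len(reader.pages))
```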

9. Adding a Watermark to a PDF

You can merge a page from an existing PDF (such as a watermark or company logo) into every page of another PDF. In the script below, over=False places it beneath the existing page content.

from pypdf import PdfReader, PdfWriter

# Load the original PDF and the watermark PDF
writer = PdfWriter(clone_from="javafxmobile.pdf")
watermark = PdfReader("page_1.pdf").pages[0]

# Apply watermark to every page
for page in writer.pages:
    page.merge_page(watermark, over=False)

# Save the watermarked PDF
with open("watermarked_sample.pdf", "wb") as file:
    writer.write(file)

print("watermarked_sample.pdf created with watermark applied")

Tip
The watermark PDF can be semi-transparent text or a logo, designed to appear on every page.
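
The over argument of merge_page decides the layering: False draws the merged page beneath the existing content, like a watermark, while True draws it on top, like a stamp. A small helper, sketched here under the hypothetical name apply_layer, makes that choice explicit and reports how many pages were changed.

```python
def apply_layer(pages, layer_page, on_top=False):
    """Merge layer_page into each page; return how many pages were changed."""
    count = 0
    for page in pages:
        # over=False: beneath the page content (watermark).
        # over=True: above the page content (stamp).
        page.merge_page(layer_page, over=on_top)
        count += 1
    return count


# Example usage with the writer and watermark from the script above:
# apply_layer(writer.pages, watermark, on_top=True)
```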

Adding a Watermark or Stamp to a PDF (Using an Image)

We can also overlay an image on every page of a PDF as a watermark or stamp. First, the image must be converted to a PDF page, which can be done with the Pillow library (installed with pip install pillow).

from io import BytesIO

from PIL import Image
from pypdf import PdfReader, PdfWriter, Transformation


pdf_path = "javafxmobile.pdf"
image_path = "logo.png"
output_pdf = "watermarked_sample.pdf"

# Convert the image to a PDF in memory
img = Image.open(image_path)
img_as_pdf = BytesIO()
img.save(img_as_pdf, "pdf")

# Load the image PDF as a stamp page
stamp_pdf = PdfReader(img_as_pdf)
stamp_page = stamp_pdf.pages[0]

# Load the PDF you want to watermark
reader = PdfReader(pdf_path)
writer = PdfWriter()

# Merge the stamp page on every page of the PDF
for page in reader.pages:
    page.merge_transformed_page(
        stamp_page,
        Transformation()
    )
    writer.add_page(page)

# Save the new PDF
with open(output_pdf, "wb") as f:
    writer.write(f)

print(f"{output_pdf} created successfully with image watermark")

The script imports the necessary libraries: BytesIO for in-memory streams, Pillow (Image) for handling images, and PyPDF (PdfReader, PdfWriter, Transformation) for PDF manipulation. It defines file paths for the original PDF, the watermark image, and the output PDF.

The image is opened with Pillow and saved as a PDF in memory using BytesIO(). This avoids creating a temporary file. The script loops through each page, merging the watermark using merge_transformed_page with no transformations applied, and adds the modified pages to the writer.

Finally, the watermarked pages are saved to a new PDF. This efficiently applies an image watermark to every page of the PDF.
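
With an empty Transformation(), the image lands at its original size in the bottom-left corner of the page. To shrink a logo and move it elsewhere, a scale and a translation can be chained. The helper below is a sketch with a hypothetical name and illustrative defaults. It computes the values for a bottom-right placement, assuming the page's coordinate origin is its bottom-left corner.

```python
def bottom_right_placement(page_width, stamp_width, target_width=100.0, margin=20.0):
    """Return (scale, tx, ty) for a stamp scaled to target_width in the bottom-right corner."""
    scale = target_width / stamp_width
    tx = page_width - margin - stamp_width * scale  # left edge of the scaled stamp
    ty = margin  # distance from the bottom edge of the page
    return scale, tx, ty


# Example usage inside the loop from the script above:
# scale, tx, ty = bottom_right_placement(
#     float(page.mediabox.width), float(stamp_page.mediabox.width)
# )
# page.merge_transformed_page(
#     stamp_page, Transformation().scale(scale, scale).translate(tx, ty)
# )
```

The scale is applied before the translation, so tx and ty are expressed in page units rather than in units of the original image.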

10. Conclusion

In this article, we explored how to work with PDF files in Python using PyPDF. We covered reading PDFs, extracting text, splitting pages, merging multiple files, encrypting and decrypting documents, and adding watermarks from both PDF pages and images. By combining PyPDF with Pillow for image handling, Python developers can automate PDF workflows efficiently, making document processing, reporting, and content management much easier.

Omozegie Aziegbe

Omos Aziegbe is a technical writer and web/application developer with a BSc in Computer Science and Software Engineering from the University of Bedfordshire. Specializing in Java enterprise applications with the Jakarta EE framework, Omos also works with HTML5, CSS, and JavaScript for web development. As a freelance web developer, Omos combines technical expertise with research and writing on topics such as software engineering, programming, web application development, computer science, and technology.