pypdf2 | Python Adventures

extract text from a PDF file

October 26, 2020 Jabba Laci

Problem

You have a PDF file and you want to extract text from it.

Solution

You can use the PyPDF2 module for this purpose.

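import PyPDF2

# Method names below are those of PyPDF2 1.x/2.x; PyPDF2 3.0 removed them
# in favor of PdfReader, reader.pages and page.extract_text().
def main():
    # Open in binary mode; the with-block closes the file when done.
    with open('book.pdf', 'rb') as book:
        pdfReader = PyPDF2.PdfFileReader(book)
        pages = pdfReader.numPages     # total number of pages
        page = pdfReader.getPage(0)    # 1st page
        text = page.extractText()
        print(text)

if __name__ == '__main__':
    main()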

Note that page indexing starts at 0. So if you open your PDF in Adobe Reader, for instance, and locate page 20, you must use getPage(19) in the source code.
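
The off-by-one conversion is easy to get wrong, so it can help to isolate it in a small helper. The sketch below is only an illustration: page_index is a made-up name, not part of PyPDF2, and it assumes the viewer numbers pages starting at 1 (a PDF with custom page labels, such as roman-numeral front matter, may show different numbers).

```python
def page_index(page_number, num_pages):
    """Convert a 1-based page number (as shown in a PDF viewer) to the
    0-based index expected by getPage(). Hypothetical helper, not PyPDF2 API."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(f"page {page_number} is outside 1..{num_pages}")
    return page_number - 1


# Page 20 in the viewer corresponds to index 19.
print(page_index(20, 100))   # prints 19
```

With the reader from the solution, this could be used as pdfReader.getPage(page_index(20, pdfReader.numPages)).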

Links

PyPDF2 on GitHub

Exercise

Write a program that extracts all pages of a PDF and saves the content of the pages to separate files, e.g. page0.txt, page1.txt, etc.
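
If you get stuck, here is one possible approach, offered as a rough sketch and not the definitive answer. It splits the work into two steps: collect the text of every page, then write each text to its own file. The function names extract_page_texts and save_pages are invented for this illustration; the reader calls are the same PyPDF2 ones used in the solution above.

```python
import os


def extract_page_texts(pdf_path):
    """Return a list with the text of each page (same PyPDF2 calls as above)."""
    # Imported here so that save_pages() can be used without PyPDF2 installed.
    import PyPDF2

    with open(pdf_path, 'rb') as book:
        reader = PyPDF2.PdfFileReader(book)
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]


def save_pages(texts, out_dir='.'):
    """Write texts[i] to page<i>.txt inside out_dir; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        path = os.path.join(out_dir, f'page{i}.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        paths.append(path)
    return paths
```

It could be called as save_pages(extract_page_texts('book.pdf')). The file names follow the 0-based indexing, so page0.txt holds the first page.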

Categories: python Tags: extract, pdf, pdf2text, pypdf2
