extract text from a PDF file

October 26, 2020

Problem

You have a PDF file and you want to extract text from it.

Solution

You can use the PyPDF2 module for this purpose.

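import PyPDF2

def main():
    # PdfFileReader, numPages, getPage and extractText are the PyPDF2 1.x/2.x names.
    # PyPDF2 3.0 replaced them with PdfReader, len(reader.pages), reader.pages[0] and extract_text().
    with open('book.pdf', 'rb') as book:    # binary mode; the file is closed automatically
        pdfReader = PyPDF2.PdfFileReader(book)
        pages = pdfReader.numPages          # total number of pages (not used below)
        page = pdfReader.getPage(0)         # 1st page
        text = page.extractText()
        print(text)

if __name__ == '__main__':
    main()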

Note that page indexing starts at 0. So if you open your PDF in a viewer such as Adobe Reader and locate page 20, you must use getPage(19) in the source code.
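
The off-by-one conversion is easy to get wrong, so it can be wrapped in a small helper. This is only a sketch: page_index is a hypothetical name, not part of PyPDF2, and it assumes you pass in the total page count yourself (for example the value of numPages).

```python
def page_index(page_number, num_pages):
    """Convert a 1-based page number (as shown in a PDF viewer)
    to the 0-based index expected by getPage().

    Hypothetical helper, not part of PyPDF2.
    """
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            f"page {page_number} is out of range (1..{num_pages})"
        )
    return page_number - 1


if __name__ == "__main__":
    # Page 20 in the viewer corresponds to index 19 in the code.
    print(page_index(20, 100))
```

With the reader from the solution, the call would then look like pdfReader.getPage(page_index(20, pdfReader.numPages)).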

Exercise

Write a program that extracts all pages of a PDF and saves the content of the pages to separate files, e.g. page0.txt, page1.txt, etc.
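
As a possible starting point, here is a sketch of the file-writing half only. save_pages is a hypothetical helper name. It assumes the page texts have already been extracted into a sequence of strings, and it leaves the extraction loop to you.

```python
from pathlib import Path


def save_pages(texts, out_dir="."):
    """Write each string in `texts` to page0.txt, page1.txt, ...

    Hypothetical helper; returns the list of paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)   # create the target folder if needed
    written = []
    for i, text in enumerate(texts):         # i starts at 0, matching the page index
        path = out / f"page{i}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

The texts argument could be built by calling getPage(i).extractText() for every i below numPages while the PDF file is still open.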

Categories: python