extract text from a PDF file
October 26, 2020
Problem
You have a PDF file and you want to extract text from it.
Solution
You can use the PyPDF2 module for this purpose.
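import PyPDF2

def main():
    # Open in binary mode; the with block closes the file when we are done.
    with open('book.pdf', 'rb') as book:
        # These are the PyPDF2 1.x/2.x names. PyPDF2 3.x renamed them to
        # PdfReader, len(reader.pages), reader.pages[i] and extract_text().
        pdfReader = PyPDF2.PdfFileReader(book)
        pages = pdfReader.numPages     # total number of pages
        page = pdfReader.getPage(0)    # 1st page
        text = page.extractText()
        print(text)

if __name__ == "__main__":
    main()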
Note that page indexing starts at 0. So if you open your PDF in a viewer such as Adobe Reader and want the text of page 20, you must call getPage(19) in the source code.
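To avoid off-by-one mistakes, the conversion can be wrapped in a tiny helper. This is only a sketch: page_index is a hypothetical name, not part of PyPDF2, and it assumes the page number you read in the viewer is the physical page count starting at 1.

```python
def page_index(page_number: int, num_pages: int) -> int:
    """Convert a 1-based page number (as shown in a viewer) to a 0-based index.

    page_index is a hypothetical helper, not a PyPDF2 function.
    """
    if not 1 <= page_number <= num_pages:
        raise ValueError(f"page {page_number} is outside 1..{num_pages}")
    return page_number - 1


print(page_index(20, 100))  # 19
```

With the reader from the solution above, this would be used as pdfReader.getPage(page_index(20, pdfReader.numPages)).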
Links
- PyPDF2 on GitHub
Exercise
Write a program that extracts all pages of a PDF and saves the content of the pages to separate files, e.g. page0.txt, page1.txt, etc.
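If you get stuck, here is one possible starting point, not the only way to solve it. The function names write_pages and extract_page_texts and the output folder "pages" are my own assumptions. The PyPDF2 calls are the same ones used in the solution above, so on PyPDF2 3.x they would need the renamed equivalents.

```python
from pathlib import Path


def write_pages(page_texts, out_dir="."):
    """Write each text to page0.txt, page1.txt, ... and return the paths.

    Hypothetical helper; it only needs the standard library.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, text in enumerate(page_texts):
        target = out / f"page{index}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


def extract_page_texts(pdf_path):
    """Return the text of every page, using the same calls as the post."""
    import PyPDF2  # imported here so write_pages works without PyPDF2

    with open(pdf_path, "rb") as book:
        reader = PyPDF2.PdfFileReader(book)
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]


# Example call (assumes book.pdf exists in the current directory):
# write_pages(extract_page_texts("book.pdf"), "pages")
```

Keeping the file writing separate from the PDF reading makes it easy to try the first part with plain strings before touching a real PDF.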
