Extract text from a PDF file
Problem
You have a PDF file and you want to extract text from it.
Solution
You can use the PyPDF2 module for this purpose.
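import PyPDF2

# Note: this is the PyPDF2 API before version 3.0. In 3.0 and later the
# equivalents are PdfReader, len(reader.pages), reader.pages[0], extract_text().
def main():
    with open('book.pdf', 'rb') as book:  # binary mode; closed automatically
        pdfReader = PyPDF2.PdfFileReader(book)
        pages = pdfReader.numPages  # total number of pages
        page = pdfReader.getPage(0)  # 1st page
        text = page.extractText()
    print(text)

if __name__ == '__main__':
    main()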
Note that indexing starts at 0. So if you open your PDF with Adobe Reader for instance and you locate page 20, in the source code you must use getPage(19).
Links
- PyPDF2 on GitHub
Exercise
Write a program that extracts all pages of a PDF and saves the content of the pages to separate files, e.g. page0.txt, page1.txt, etc.
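One possible approach to the exercise, as a rough sketch rather than a reference solution. It assumes the same pre-3.0 PyPDF2 calls used above; the function names extract_page_texts and save_pages are made up for this illustration.

```python
from pathlib import Path


def extract_page_texts(pdf_path):
    """Return a list with the text of every page (assumes PyPDF2 < 3.0)."""
    import PyPDF2  # imported here so save_pages() can be used without it

    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]


def save_pages(texts, out_dir='.'):
    """Write each text to page0.txt, page1.txt, ... and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        path = out / ('page%d.txt' % i)
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    return paths
```

Calling save_pages(extract_page_texts('book.pdf')) would then write one text file per page into the current directory. Keeping the two steps separate makes the file-writing part easy to try without a PDF at hand.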
pdfmanip
Today I wrote a simple PDF manipulation CLI tool. You can find it here: https://github.com/jabbalaci/pdfmanip .
Python tutorials of Full Circle Magazine in a single PDF
On my other blog, I wrote a post on how to extract the Python tutorials from Full Circle Magazine and join them in a single PDF.
For the lazy pigs, here is the PDF (6 MB). Get it while it’s hot :)
Create a temporary file with unique name
Problem
I wanted to download an HTML file with Python, store it in a temporary file, then convert this file to PDF by calling an external program.
Solution #1
#!/usr/bin/env python

import os
import tempfile

# delete=False keeps the file on disk after it is closed
temp = tempfile.NamedTemporaryFile(prefix='report_', suffix='.html', dir='/tmp', delete=False)
html_file = temp.name

(dirName, fileName) = os.path.split(html_file)
fileBaseName = os.path.splitext(fileName)[0]
pdf_file = os.path.join(dirName, fileBaseName + '.pdf')

print(html_file)  # /tmp/report_kWKEp5.html
print(pdf_file)   # /tmp/report_kWKEp5.pdf

# calling of HTML to PDF converter is omitted
See the documentation of tempfile.NamedTemporaryFile here.
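In Python 3, the path juggling can also be sketched with pathlib. This is only an illustration: the helper name make_report_paths is hypothetical, and it uses the system default temp directory unless you pass one.

```python
import tempfile
from pathlib import Path


def make_report_paths(directory=None):
    """Create an empty report_*.html temp file and return (html_path, pdf_path).

    Only the HTML file is created; the PDF path is just the matching name
    that a converter could write to later.
    """
    with tempfile.NamedTemporaryFile(prefix='report_', suffix='.html',
                                     dir=directory, delete=False) as temp:
        html_path = Path(temp.name)
    # with_suffix() swaps '.html' for '.pdf' and keeps directory and base name
    return html_path, html_path.with_suffix('.pdf')
```

Because delete=False is used, the HTML file stays on disk after the with block, so remember to remove it yourself once the conversion is done.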
Solution #2 (update 20110303)
I had a problem with the previous solution. It works well from the command line, but when I tried to call that script from crontab, it stopped at the line “tempfile.NamedTemporaryFile”. No exception, nothing… So I had to use a different approach:
from time import time

temp = "report.%.7f.html" % time()
print(temp)  # report.1299188541.3830960.html
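The function time() returns the current time as a floating-point number of seconds since the epoch. Two threads or processes that call it at almost the same moment could get the same name, so it may not be suitable in a multithreaded environment, but that was not the case for me. This version works fine when called from crontab.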
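If a collision between threads or processes is a concern, one alternative (not what the script above used) is to build the name from a random UUID instead of the clock. A minimal sketch, where unique_report_name is a made-up helper name:

```python
import uuid


def unique_report_name(prefix='report', suffix='.html'):
    """Return a file name such as report.<32 hex digits>.html."""
    # uuid4() is random, so two calls at the same instant still differ
    return '%s.%s%s' % (prefix, uuid.uuid4().hex, suffix)
```

This only generates a name; unlike tempfile it does not create the file, so the caller still decides where and how to open it.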
Learn more
- tempfile – Create temporary filesystem resources (post by Doug Hellmann with lots of examples)
- Python doc on tempfile
Update (20150712): if you need a temp. file name in the current directory:
>>> import tempfile
>>> tempfile.NamedTemporaryFile(dir='.').name
'/home/jabba/tmpKrBzoY'
Update (20150910): if you need a temp. directory:
import tempfile
import shutil

dirpath = tempfile.mkdtemp()  # the temp dir. is created
# ... do stuff with dirpath
shutil.rmtree(dirpath)
This tip is from here.
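In Python 3.2 and later, tempfile.TemporaryDirectory does the same create-then-remove dance as a context manager, so the cleanup also happens if an exception is raised. A small sketch; the function name work_in_temp_dir and the file name scratch.txt are just for illustration:

```python
import os
import tempfile


def work_in_temp_dir():
    """Create a temp dir, write a file into it, and let the with block clean up."""
    with tempfile.TemporaryDirectory() as dirpath:
        marker = os.path.join(dirpath, 'scratch.txt')
        with open(marker, 'w') as f:
            f.write('temporary data')
        existed_inside = os.path.isdir(dirpath)
    # at this point the directory and everything in it has been removed
    return dirpath, existed_inside
```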
