pdf | Python Adventures

extract text from a PDF file

October 26, 2020 Jabba Laci

Problem

You have a PDF file and you want to extract text from it.

Solution

You can use the PyPDF2 module for this purpose.

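import PyPDF2

# Note: newer PyPDF2 releases (3.x) replace these calls with
# PdfReader, reader.pages and extract_text().

def main():
    # open in binary mode; the with block closes the file when we are done
    with open('book.pdf', 'rb') as book:
        pdfReader = PyPDF2.PdfFileReader(book)
        pages = pdfReader.numPages     # total number of pages
        page = pdfReader.getPage(0)    # 1st page
        text = page.extractText()
        print(text)

if __name__ == "__main__":
    main()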

Note that indexing starts at 0. So if you open your PDF with Adobe Reader, for instance, and locate page 20, you must use getPage(19) in the source code.

Links

PyPDF2 on GitHub

Exercise

Write a program that extracts all pages of a PDF and saves the content of the pages to separate files, e.g. page0.txt, page1.txt, etc.
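
One possible shape for the file-writing half of this exercise, as a minimal sketch. It assumes the page texts have already been extracted into a list of strings; the helper name save_pages and its out_dir parameter are made up for illustration, and the PDF reading part is left to you.

```python
import os

def save_pages(texts, out_dir="."):
    """Write each string in texts to page0.txt, page1.txt, ... inside out_dir."""
    paths = []
    for index, text in enumerate(texts):
        # enumerate() starts at 0, which matches the page0.txt naming
        path = os.path.join(out_dir, "page{}.txt".format(index))
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths
```

The function returns the list of paths it wrote, which makes it easy to check the result afterwards.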

Categories: python Tags: extract, pdf, pdf2text, pypdf2

pdfmanip

January 6, 2019 Jabba Laci

Today I wrote a simple PDF manipulation CLI tool. You can find it here: https://github.com/jabbalaci/pdfmanip

Categories: python Tags: manipulation, pdf

Python tutorials of Full Circle Magazine in a single PDF

February 21, 2011 Jabba Laci

On my other blog, I wrote a post on how to extract the Python tutorials from Full Circle Magazine and join them in a single PDF.

For the lazy pigs, here is the PDF (6 MB). Get it while it’s hot :)

Categories: python Tags: full circle magazine, pdf, tutorial

Create a temporary file with unique name

February 19, 2011 Jabba Laci

Problem

I wanted to download an HTML file with Python, store it in a temporary file, then convert this file to PDF by calling an external program.

Solution #1

#!/usr/bin/env python3

import os
import tempfile

# delete=False keeps the file on disk after it is closed
temp = tempfile.NamedTemporaryFile(prefix='report_', suffix='.html', dir='/tmp', delete=False)
temp.close()    # the external converter will open the file by name

html_file = temp.name
(dirName, fileName) = os.path.split(html_file)
fileBaseName = os.path.splitext(fileName)[0]
pdf_file = os.path.join(dirName, fileBaseName + '.pdf')

print(html_file)   # e.g. /tmp/report_kWKEp5.html
print(pdf_file)    # e.g. /tmp/report_kWKEp5.pdf
# calling of HTML to PDF converter is omitted

See the documentation of tempfile.NamedTemporaryFile here.
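
The extension swap above can also be written with pathlib. This is only a sketch of an alternative, not the approach used in the post; the helper name pdf_path_for is hypothetical.

```python
from pathlib import Path

def pdf_path_for(html_file):
    """Return the path of html_file with its extension replaced by .pdf."""
    # with_suffix() replaces only the last extension and keeps the directory
    return Path(html_file).with_suffix(".pdf")

print(pdf_path_for("/tmp/report_kWKEp5.html"))    # /tmp/report_kWKEp5.pdf on a POSIX system
```

with_suffix() avoids splitting and re-joining the path by hand, so there is no hard-coded path separator.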

Solution #2 (update 20110303)

I had a problem with the previous solution. It works well from the command line, but when I tried to call that script from crontab, it stopped at the line “tempfile.NamedTemporaryFile”. No exception, nothing… So I had to use a different approach:

from time import time

temp = "report.%.7f.html" % time()
print(temp)    # e.g. report.1299188541.3830960.html

The function time() returns the current time as a floating point number. The resulting name may not be unique in a multithreaded environment, where two threads can call time() at nearly the same instant, but that was not a concern for me. This version works fine when called from crontab.
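
If uniqueness across threads does matter, one hedged alternative is to build the name from a random UUID instead of the clock. This is a different technique from the timestamp above, and the helper name unique_report_name is made up for this sketch.

```python
import uuid

def unique_report_name(prefix="report", suffix=".html"):
    """Build a file name from a random UUID instead of the current time."""
    # uuid4() is based on random bits, so two calls made at the same
    # instant are still extremely unlikely to produce the same name
    return "{}.{}{}".format(prefix, uuid.uuid4().hex, suffix)

print(unique_report_name())
```

Note that this only generates a name; like the time() version, it does not create the file.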

Learn more

tempfile – Create temporary filesystem resources (post by Doug Hellmann with lots of examples)
Python doc on tempfile

Update (20150712): if you need a temp. file name in the current directory:

>>> import tempfile
>>> tempfile.NamedTemporaryFile(dir='.').name
'/home/jabba/tmpKrBzoY'
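
Keep in mind that with the default delete=True, the file behind that name is removed as soon as the object is closed, so the one-liner above gives you a name rather than a lasting file. A small sketch to illustrate; the function name temp_file_lifetime is hypothetical.

```python
import os
import tempfile

def temp_file_lifetime(directory="."):
    """Report whether a NamedTemporaryFile exists while open and after closing."""
    with tempfile.NamedTemporaryFile(dir=directory) as tmp:
        name = tmp.name
        existed_while_open = os.path.exists(name)
    # leaving the with block closes the file, which also deletes it
    existed_after_close = os.path.exists(name)
    return name, existed_while_open, existed_after_close
```

Pass delete=False, as in Solution #1, if the file has to survive after it is closed.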

Update (20150910): if you need a temp. directory:

import tempfile
import shutil

dirpath = tempfile.mkdtemp()    # the temp dir. is created
# ... do stuff with dirpath
shutil.rmtree(dirpath)

This tip is from here.
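
In Python 3 the same create-and-remove pattern can be written with tempfile.TemporaryDirectory, which cleans up on its own. A minimal sketch; the function name and the scratch.txt file are just for illustration.

```python
import os
import tempfile

def work_in_temp_dir():
    """Create a temp dir, write a file into it, and let the context manager clean up."""
    with tempfile.TemporaryDirectory() as dirpath:
        path = os.path.join(dirpath, "scratch.txt")
        with open(path, "w") as f:
            f.write("hello")
        file_existed = os.path.exists(path)
    # the directory and everything in it are removed when the with block ends
    dir_exists_after = os.path.exists(dirpath)
    return file_existed, dir_exists_after
```

The advantage over mkdtemp() plus shutil.rmtree() is that the cleanup also happens when the code inside the block raises an exception.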

Categories: python Tags: cron, crontab, html, pdf, temp, temporary, time

Generators

October 19, 2010 Jabba Laci

“Generators are a simple and powerful tool for creating iterators. They are written like regular functions but use the yield statement whenever they want to return data. Each time next() is called, the generator resumes where it left-off (it remembers all the data values and which statement was last executed).”

Let’s rewrite our Fibonacci function using generators. In the previous approach, we specified how many Fibonacci numbers we want to get. The function calculated all of them and returned a list containing all the elements. With generators, we can calculate the numbers one by one. The new function will calculate a number, return it, and suspend its execution. When we call it again, it will resume where it left off and it runs until it computes another number, etc.
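
The earlier post is not reproduced on this page, so the following is only an assumed reconstruction of that list-based approach, for comparison. The name fib_list is hypothetical.

```python
def fib_list(n):
    """Return the first n Fibonacci numbers as a list (everything computed up front)."""
    result = []
    a, b = 0, 1
    for _ in range(n):
        result.append(a)
        a, b = b, a + b
    return result

print(fib_list(10))    # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Here the caller has to decide in advance how many numbers are needed, and all of them are held in memory at once.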

First let’s see a Fibonacci function that calculates the numbers in an infinite loop:

#!/usr/bin/env python3

def fib():
    a, b = 0, 1
    while True:
        print(a)    # the current number is here
        a, b = b, a+b

fib()    # runs forever; stop it with Ctrl+C

In order to rewrite it in the form of a generator, we need to locate the part where the current value is produced. This is the line with print(a). We only need to replace it with yield a. The function then hands this value back to the caller and suspends its execution until the next value is requested.

So, with generators it will look like this:

#!/usr/bin/env python3

def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a+b

f = fib()
for i in range(10):    # print the first ten Fibonacci numbers
    print(next(f), end=' ')    # 0 1 1 2 3 5 8 13 21 34

It is also possible to get a slice from the values of a generator. For instance, we want the 5th, 6th, and 7th Fibonacci numbers (counting from 0):

#!/usr/bin/env python3

from itertools import islice

def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a+b

# islice(iterable, start, stop) yields the items at indexes start..stop-1
for i in islice(fib(), 5, 8):
    print(i)    # 5, then 8, then 13

More info on islice is here. For this post I used tips from here.
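
islice is also a handy way to take the first n values of an infinite generator as a list. A short sketch; the helper name take is chosen here for illustration and is not part of the code above.

```python
from itertools import islice

def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def take(n, iterable):
    """Return the first n items of iterable as a list."""
    return list(islice(iterable, n))

print(take(10, fib()))              # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print(list(islice(fib(), 5, 8)))    # [5, 8, 13]
```

Calling list() directly on fib() would never finish, because the generator is infinite; islice puts a bound on it first.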

Update (20110406)

Here is a presentation in PDF entitled “Generator Tricks For Systems Programmers” by David Beazley (presented at PyCon 2008). (Reddit thread is here.)

Categories: python Tags: David Beazley, fibonacci, generators, pdf, presentation, pycon 2008, slides
