
PDF Explanation

This document is a guide to extracting text from PDF files with three libraries: pdfminer.six, PyPDF2, and pdfplumber. For each library it walks through importing the library, opening a PDF, and extracting its text, with code examples. It stresses handling file paths correctly and making sure files are closed after use.

Uploaded by

aitscserd

pdfminer.six
Step 1: Import the extract_text function
from pdfminer.high_level import extract_text
• This line imports the extract_text() function from the pdfminer.high_level module.
• pdfminer.six is a library designed specifically for extracting and analyzing text data
from PDFs.
• This function is the high-level API that hides all the complexity behind parsing pages,
layouts, and fonts.

Step 2: Extract text from a PDF


text = extract_text('sample.pdf')
• extract_text() takes the path of a PDF file (here 'sample.pdf') and returns all readable
text from the PDF as a string.
• Works well for digitally generated PDFs. Scanned PDFs contain page images rather than a text layer, so they yield little or no text without OCR.
• The text is now stored in the variable text.

Step 3: Print the extracted text


print(text)
• Displays the text content that was extracted from the PDF file.
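
The three steps above can be combined into one small helper. This is a minimal sketch rather than part of the original guide: the function name read_pdf_text and the explicit path check are assumptions added for illustration, and the pdfminer.six package must be installed for the extraction itself to run.

```python
from pathlib import Path


def read_pdf_text(pdf_path):
    """Return the text of a PDF as one string (hypothetical helper)."""
    path = Path(pdf_path)
    # Fail early with a clear message if the path is wrong.
    if not path.is_file():
        raise FileNotFoundError(f"No PDF found at: {path}")

    # Imported here so the path check above runs first; needs pdfminer.six.
    from pdfminer.high_level import extract_text

    text = extract_text(str(path))
    # Scanned (image-only) PDFs usually yield an empty or whitespace-only string.
    if not text.strip():
        print("No extractable text found; the PDF may be a scanned image.")
    return text
```

Calling print(read_pdf_text('sample.pdf')) would then reproduce Steps 2 and 3 in one line.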

PyPDF2
1. Import PyPDF2

import PyPDF2

• Imports the PyPDF2 library to enable reading and extracting text from PDF files.

2. Define the PDF File Path

file_path = r"C:\Users\sivan\Desktop\sample.pdf"

• Stores the path of the PDF file you want to read.

• The r prefix makes it a raw string, so the backslashes (\) in a Windows path are kept as literal characters instead of being read as escape sequences.
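
To see why the r prefix matters, the sketch below writes the same path three ways and checks that they agree. Without the prefix, the \U in "C:\Users" would be read as the start of a Unicode escape and rejected as a syntax error. The variable names are illustrative only, and PureWindowsPath is used so the comparison behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Three spellings of the same Windows path.
raw_path = r"C:\Users\sivan\Desktop\sample.pdf"         # raw string
escaped_path = "C:\\Users\\sivan\\Desktop\\sample.pdf"  # doubled backslashes
forward_path = "C:/Users/sivan/Desktop/sample.pdf"      # forward slashes

# The raw and escaped forms are exactly the same string.
same_string = raw_path == escaped_path

# PureWindowsPath accepts both separators, so all forms name one file.
same_path = PureWindowsPath(raw_path) == PureWindowsPath(forward_path)

print(same_string, same_path)
```

Any of the three spellings can be passed to open(); the raw string is simply the least error-prone to type.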
3. Open the PDF File in Binary Read Mode

with open(file_path, "rb") as file:

• Opens the PDF file in binary read mode ("rb").

• The with statement ensures the file is safely closed after use.

4. Read PDF Content Using PdfReader

    reader = PyPDF2.PdfReader(file)

• PdfReader() creates a reader object that allows access to PDF pages, metadata, and text.

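• It parses the document's structure so that individual pages can be accessed. Because it reads from the open file object, this line and the loop in the next step belong inside the with block.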

5. Loop Through Pages and Extract Text

    for page_num, page in enumerate(reader.pages):
        print(f"--- Page {page_num + 1} ---")
        print(page.extract_text())

• reader.pages is a list-like sequence of page objects, one per page of the PDF.

• enumerate() yields a zero-based index together with each page object, which is why the code prints page_num + 1.

• page.extract_text() extracts the text from each individual page.

• print() displays the extracted text page by page.
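
Put together with the indentation Python requires, the five steps might look like the sketch below. The split into two functions and the names format_pages and extract_page_texts are assumptions made for illustration, not the original author's layout. The PyPDF2 package must be installed for the extraction function to run.

```python
def format_pages(page_texts):
    """Join per-page text under '--- Page N ---' headers, numbered from 1."""
    sections = []
    for page_num, text in enumerate(page_texts, start=1):
        sections.append(f"--- Page {page_num} ---\n{text}")
    return "\n".join(sections)


def extract_page_texts(file_path):
    """Return a list holding the text of each page (hypothetical helper)."""
    import PyPDF2  # requires the PyPDF2 package

    # Binary read mode; the with statement closes the file afterwards.
    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Extract while the file is still open; fall back to "" if a page
        # yields no text.
        return [page.extract_text() or "" for page in reader.pages]
```

With these in place, print(format_pages(extract_page_texts(file_path))) would produce the same page-by-page output as the loop in step 5.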

pdfplumber
Explanation:

1. Import pdfplumber:
The script starts by importing the pdfplumber library.

2. Define the PDF Path:


Specify the file path. A raw string (r"...") keeps Windows backslashes from being treated as escape characters.

3. Open the PDF File:


pdfplumber.open(pdf_path) opens the file in read mode. The with statement ensures that
the file is properly closed after processing.

4. Extract Text from Pages:


The pdf.pages attribute is a list of page objects, one per page of the PDF. The code loops through them, and page.extract_text() returns the text content of each page.

5. Print the Output:


The code prints the total page count and then prints the text extracted from each page.
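
The script that this explanation describes is not shown here, so the following is only a sketch of what the five steps could look like, not the original code. The function names build_report and extract_with_pdfplumber are hypothetical, and the pdfplumber package must be installed for the extraction function to run.

```python
def build_report(page_texts):
    """Format the page count and each page's text (hypothetical helper)."""
    lines = [f"Total pages: {len(page_texts)}"]
    for page_num, text in enumerate(page_texts, start=1):
        lines.append(f"--- Page {page_num} ---")
        lines.append(text)
    return "\n".join(lines)


def extract_with_pdfplumber(pdf_path):
    """Return a list with the text of every page in the PDF."""
    import pdfplumber  # requires the pdfplumber package

    # The with statement closes the PDF when the block ends.
    with pdfplumber.open(pdf_path) as pdf:
        # Use "" when a page yields no text, so the report never prints None.
        return [page.extract_text() or "" for page in pdf.pages]
```

Under these assumptions, print(build_report(extract_with_pdfplumber(pdf_path))) would print the total page count followed by the text of each page, with pdf_path set to a raw string as in step 2.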
