pdfminer.six
Step 1: Import the extract_text function
from pdfminer.high_level import extract_text
• This line imports the extract_text() function from the pdfminer.high_level module.
• pdfminer.six is a library designed specifically for extracting and analyzing text data
from PDFs.
• This function is the high-level API that hides all the complexity behind parsing pages,
layouts, and fonts.
Step 2: Extract text from a PDF
text = extract_text('sample.pdf')
• extract_text() takes the path of a PDF file (here 'sample.pdf') and returns all readable
text from the PDF as a string.
• Works well for digitally generated PDFs. Scanned pages are images with no text layer, so they usually yield little or no text unless OCR is applied first.
• The text is now stored in the variable text.
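Since a scanned PDF can come back as an empty or whitespace-only string, it can help to check the result before using it. The following is a minimal sketch of such a check; the helper name has_extractable_text is hypothetical and not part of pdfminer.six, and the sample strings are stand-ins for real extract_text() output.

```python
def has_extractable_text(text):
    """Return True if the extracted string has any non-whitespace character."""
    # text is assumed to be the string returned by extract_text().
    return bool(text and text.strip())


# Stand-in strings for illustration; in practice pass the result of extract_text().
print(has_extractable_text("Invoice 2024\nTotal: 10"))  # True
print(has_extractable_text("\n\x0c\n"))                 # False
```

A False result suggests the file may be a scan and needs OCR rather than plain text extraction.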
Step 3: Print the extracted text
print(text)
• Displays the text content that was extracted from the PDF file.
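The three steps can be combined into one small function. This is a sketch under stated assumptions rather than a definitive implementation: the function name read_pdf_text and the early file check are additions for illustration, and only extract_text() comes from pdfminer.six.

```python
from pathlib import Path


def read_pdf_text(pdf_path):
    """Return the text of a PDF as one string, using pdfminer.six."""
    path = Path(pdf_path)
    if not path.is_file():
        # Fail early with a clear message instead of a parser error.
        raise FileNotFoundError(f"PDF not found: {path}")

    # Imported inside the function so the path check above runs first
    # (requires: pip install pdfminer.six).
    from pdfminer.high_level import extract_text

    return extract_text(str(path))
```

Calling print(read_pdf_text('sample.pdf')) then reproduces Steps 1 to 3 for a file named sample.pdf in the working directory.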
PyPDF2
1. Import PyPDF2
import PyPDF2
• Imports the PyPDF2 library to enable reading and extracting text from PDF files.
2. Define the PDF File Path
file_path = r"C:\Users\sivan\Desktop\sample.pdf"
• Stores the path of the PDF file you want to read.
• The r prefix makes it a raw string, so the backslashes (\) in the Windows path are kept as literal characters instead of being read as escape sequences.
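To see what the r prefix changes, compare a raw string with a normal one. The path C:\temp\notes.pdf below is a hypothetical example chosen because it contains \t and \n; it does not appear in the original script.

```python
# A raw string keeps every backslash as a literal character.
raw_path = r"C:\Users\sivan\Desktop\sample.pdf"

# The same path as a normal string needs each backslash doubled.
escaped_path = "C:\\Users\\sivan\\Desktop\\sample.pdf"

print(raw_path == escaped_path)  # True

# Without the r prefix, \t and \n are turned into a tab and a newline.
broken = "C:\temp\notes.pdf"
print("\t" in broken, "\n" in broken)          # True True
print(len(r"C:\temp\notes.pdf"), len(broken))  # 17 15
```

Writing "C:\Users\..." without the r prefix is worse still: Python 3 treats \U as the start of a Unicode escape and rejects the string with a syntax error.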
3. Open the PDF File in Binary Read Mode
with open(file_path, "rb") as file:
• Opens the PDF file in binary read mode ("rb").
• The with statement ensures the file is safely closed after use.
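The effect of "rb" mode and of the with statement can be checked without a real PDF. The sketch below writes a small stand-in file that only contains the PDF header bytes (it is not a valid PDF) and then opens it the same way as the step above; the temporary file is an assumption made so the example runs anywhere.

```python
import os
import tempfile

# Create a tiny stand-in file that starts with the PDF header bytes.
# It is not a valid PDF; it only lets the example run without a real document.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    file_path = tmp.name

with open(file_path, "rb") as file:
    header = file.read(5)  # bytes, not str, because of "rb"
    print(header)          # b'%PDF-'
    print(file.closed)     # False: still open inside the block

print(file.closed)         # True: closed automatically on exit

os.remove(file_path)       # clean up the stand-in file
```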
4. Read PDF Content Using PdfReader
    reader = PyPDF2.PdfReader(file)
• PdfReader() creates a reader object that allows access to PDF pages, metadata, and text.
• It reads the document's structure from the open file, so this line and the loop in the next step must be indented inside the with block.
5. Loop Through Pages and Extract Text
    for page_num, page in enumerate(reader.pages):
        print(f"--- Page {page_num + 1} ---")
        print(page.extract_text())
• reader.pages is a list-like sequence of page objects, one per page of the PDF.
• enumerate() yields a zero-based index together with each page object, which is why the header prints page_num + 1.
• page.extract_text() extracts the text from each individual page.
• print() displays the extracted text page by page.
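Put together with their indentation, the five steps might look like the sketch below. Splitting the work into extract_pages and format_pages is a choice made here for illustration (both names are hypothetical), and the "or ''" fallback assumes extract_text() can return an empty or missing value for a page without text.

```python
def extract_pages(file_path):
    """Return a list with the extracted text of each page, in order."""
    # Imported here so format_pages() below works even without PyPDF2
    # installed (requires: pip install PyPDF2).
    import PyPDF2

    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Extraction happens inside the with block, while the file is open.
        return [page.extract_text() or "" for page in reader.pages]


def format_pages(page_texts):
    """Join page texts under '--- Page N ---' headers, numbering from 1."""
    lines = []
    for page_num, text in enumerate(page_texts):
        lines.append(f"--- Page {page_num + 1} ---")
        lines.append(text)
    return "\n".join(lines)
```

With the path from step 2, print(format_pages(extract_pages(file_path))) prints the same page-by-page output as the loop above.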
pdfplumber
Explanation:
1. Import pdfplumber:
The script starts by importing the pdfplumber library.
2. Define the PDF Path:
Specify the file path. Writing it as a raw string (r"...") keeps Windows backslashes from being treated as escape characters.
3. Open the PDF File:
pdfplumber.open(pdf_path) opens the file in read mode. The with statement ensures that
the file is properly closed after processing.
4. Extract Text from Pages:
The pdf.pages attribute returns a list of page objects in the PDF. The loop calls
page.extract_text() on each one, which returns that page's text content.
5. Print the Output:
The code prints the total page count and then prints the text extracted from each page.
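A minimal sketch consistent with the five steps above is shown here. It is an illustration under assumptions, not the original script: the function names extract_with_pdfplumber and format_report and the exact output layout are hypothetical, while pdfplumber.open(), pdf.pages, and page.extract_text() are the calls named in the explanation.

```python
def extract_with_pdfplumber(pdf_path):
    """Return (page_count, page_texts) for the PDF at pdf_path."""
    # Imported here so format_report() below works even without pdfplumber
    # installed (requires: pip install pdfplumber).
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        pages = pdf.pages
        # Fall back to "" in case a page yields no text.
        texts = [page.extract_text() or "" for page in pages]
        return len(pages), texts


def format_report(page_count, page_texts):
    """Build the output: total page count first, then each page's text."""
    lines = [f"Total pages: {page_count}"]
    for number, text in enumerate(page_texts, start=1):
        lines.append(f"--- Page {number} ---")
        lines.append(text)
    return "\n".join(lines)
```

For a path stored in pdf_path, print(format_report(*extract_with_pdfplumber(pdf_path))) prints the page count followed by the text of each page.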