pdf-fmt

A PDF Text Extractor, Processor, and Formatter.

pdf-fmt is a powerful utility designed to extract text from PDF documents and then clean, filter, and structure the output.

It is useful for converting raw PDF dumps into clean, formatted text.

Note that pdf-fmt is under active development, so you might encounter bugs and issues.

Project Status

pdf-fmt is currently undergoing a major rewrite. Stay tuned.

The script installer in the main branch will not work. Use the compiled binary from the releases page instead.

Features

  • Raw text extraction
    • Copy to clipboard and/or write to file
  • Extensive configuration schema
  • Supports numerous formats
  • Image extraction
    • PNG, WEBP, SVG, etc. supported
  • Table extraction
    • Experimental, will add a configuration file entry to configure behaviour
  • and many others to come...

Why I made this

There is plenty of PDF tooling out there, but most of it seems to be geared towards OCR and generally does not help with extracting and processing the output text.

Personally, I use it to collate lecture slides for note taking and knowledge management. I hope that it will be useful for you as well.

What pdf-fmt is not

This is not an OCR (Optical Character Recognition) tool. It only processes selectable text (text you can highlight with your cursor) found in the PDF structure. It can also extract images and tables, though the output might not be perfect every time.

If your file contains images of text, you can use the image extraction feature and then pass the output images to your OCR tool.

Handling non-PDF formats

For converting non-PDF files (like .docx, .pptx, .odt) to PDF before extraction, either dependency needs to be installed and accessible in your $PATH:

Configuration

The configuration options available are documented in the pdf-fmt.yaml file.

  • filters: Regex rules for character exclusion and pattern-based filtering
    • Excludes footers matching a regex pattern
    • Includes optional spelling enforcement (UK or US English)
  • conversion: Lists supported non-PDF formats (see Handling non-PDF formats)
  • formatting: Controls line re-wrapping and indentation conversion
    • Converts single-space indents to Markdown lists
    • Enforces capitalisation at the start of each line
  • actions: Defines post-extraction behaviour
    • Copies to the system clipboard and/or writes to an output file
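
To make the filters and formatting options more concrete, here is a minimal Python sketch of the kind of processing they describe. It is not pdf-fmt's actual implementation: the function names, the footer pattern, and the sample lines are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical footer rule: drop lines such as "Page 3 of 12".
FOOTER_PATTERN = re.compile(r"^\s*Page \d+ of \d+\s*$")


def drop_footers(lines, pattern=FOOTER_PATTERN):
    """Return only the lines that do not match the footer pattern."""
    return [line for line in lines if not pattern.match(line)]


def indent_to_list_item(line):
    """Turn a line indented by exactly one space into a Markdown list item."""
    if line.startswith(" ") and not line.startswith("  "):
        return "- " + line.strip()
    return line


raw = ["Lecture 1: Graphs", " a graph has nodes and edges", "Page 3 of 12"]
cleaned = [indent_to_list_item(line) for line in drop_footers(raw)]
print("\n".join(cleaned))
```

The real behaviour is controlled by the entries in pdf-fmt.yaml, so treat the pattern above as a stand-in for whatever regex you configure.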

For extensive customisation, consider creating your own configuration file. If you do, ensure that it is named pdf-fmt.yaml.

Where to place the configuration file

pdf-fmt looks for the configuration file in the following locations:

  • $PDF_FMT_CONFIG_PATH environment variable
  • Default configuration directory
    • APPDATA if you are on Windows
    • $XDG_CONFIG_HOME or ~/.config if you are on Linux
  • The current working directory of the script
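
The lookup described above can be sketched in Python as follows. This is an illustration under assumptions, not the tool's own code: it assumes the locations are tried in the order listed, that $PDF_FMT_CONFIG_PATH holds the path of the file itself, and that the file sits directly inside the default configuration directory. The function names are hypothetical.

```python
import os
import sys
from pathlib import Path

CONFIG_NAME = "pdf-fmt.yaml"


def candidate_config_paths(env=None, platform=None, cwd=None, home=None):
    """List the places a configuration file might live, in assumed priority order."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)

    candidates = []
    # 1. Explicit override (assumed to point at the file itself).
    override = env.get("PDF_FMT_CONFIG_PATH")
    if override:
        candidates.append(Path(override))
    # 2. Default configuration directory for the platform.
    if platform.startswith("win"):
        appdata = env.get("APPDATA")
        if appdata:
            candidates.append(Path(appdata) / CONFIG_NAME)
    else:
        base = env.get("XDG_CONFIG_HOME") or str(home / ".config")
        candidates.append(Path(base) / CONFIG_NAME)
    # 3. Current working directory.
    candidates.append(cwd / CONFIG_NAME)
    return candidates


def find_config(**kwargs):
    """Return the first candidate that exists on disk, or None."""
    for path in candidate_config_paths(**kwargs):
        if path.is_file():
            return path
    return None
```

Calling find_config() with no arguments uses the real environment. The keyword arguments exist only so the lookup can be exercised without touching it.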

Known issues

Inaccurate locale enforcement, e.g. localization -> localization (left unchanged) even with UK locale enforcement enabled.

Upstream locale enforcement libraries may yield inaccurate words. I am working on adding a configuration option to define your own locale mappings to override Breame's.
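
As a rough idea of how user-defined mappings could take precedence over upstream ones, consider the Python sketch below. The option does not exist yet, so everything here is hypothetical: the dictionaries stand in for an upstream word list and for user overrides, and the function names are invented for illustration.

```python
import re

# Hypothetical stand-in for the word list an upstream library provides.
UPSTREAM_US_TO_UK = {"color": "colour", "center": "centre"}

# Hypothetical user overrides: add a word the upstream list is missing.
USER_OVERRIDES = {"localization": "localisation"}


def build_mapping(upstream, overrides):
    """Merge two mappings; entries in overrides win over upstream ones."""
    return {**upstream, **overrides}


def enforce_locale(text, mapping):
    """Replace each whole word that appears in the mapping (exact, case-sensitive match)."""
    return re.sub(
        r"[A-Za-z]+",
        lambda match: mapping.get(match.group(0), match.group(0)),
        text,
    )


mapping = build_mapping(UPSTREAM_US_TO_UK, USER_OVERRIDES)
print(enforce_locale("color localization", mapping))
```

This sketch ignores capitalised and inflected forms, which a real implementation would need to handle.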

Quick Start

Prerequisites

  • You need Git and Python 3.10 or above installed
    • To confirm, run which git and which python in a Linux/macOS terminal
    • On Windows, run where git and where python in Command Prompt

If you are only downloading the compiled binaries, you can ignore this part.

These prerequisites also apply to compiling from source.
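
If you already have some Python interpreter available, the same prerequisite check can be sketched in a cross-platform way. The function name below is hypothetical, and the version check only covers the interpreter that runs the snippet, which may differ from the python found on your $PATH.

```python
import shutil
import sys


def check_prerequisites(min_version=(3, 10)):
    """Report whether git is on PATH and whether this interpreter is new enough."""
    return {
        "git": shutil.which("git") is not None,
        "python": sys.version_info[:2] >= min_version,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing or too old'}")
```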

Install with uv

Requires uv.

uv tool install git+https://github.com/bladeacer/pdf-fmt
pdf-fmt

Or, if you prefer a specific version:

uv tool install git+https://github.com/bladeacer/pdf-fmt@<version-tag>
pdf-fmt

This should work for most platforms and architectures which are supported by uv.

Download from Release Page

You can get the compiled binary from the latest release.

We recommend also downloading the associated .sha256 files to verify checksums. Place these and the executable in the same folder.

After downloading, open PowerShell on Windows or a terminal on Linux/macOS.

On Windows, run:

cd ~/Downloads
CertUtil -hashfile pdf-fmt-<arch>-<version-no>.exe SHA256
mv pdf-fmt-<arch>-<version-no>.exe pdf-fmt.exe
./pdf-fmt.exe

After running CertUtil, open the .sha256 file in your favourite text editor. If the hash printed in the terminal matches the hash in the file, the download is verified.

On Linux, run:

cd ~/Downloads
sha256sum --check pdf-fmt-<arch>-<version-no>.sha256
chmod +x pdf-fmt-<arch>-<version-no>
mv pdf-fmt-<arch>-<version-no> pdf-fmt
./pdf-fmt

If you see OK after calling sha256sum, the file is verified.

On macOS, run:

cd ~/Downloads
shasum -a 256 --check pdf-fmt-<arch>-<version-no>.sha256
chmod +x pdf-fmt-<arch>-<version-no>
mv pdf-fmt-<arch>-<version-no> pdf-fmt
xattr -d com.apple.quarantine pdf-fmt
./pdf-fmt

If you see OK after calling shasum, the file is verified.
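
The checksum comparison above can also be done the same way on every platform with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes the .sha256 file starts with the hex digest, as in the output format of sha256sum.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(binary_path, checksum_path):
    """Compare the file's digest with the first token of the .sha256 file."""
    expected = Path(checksum_path).read_text().split()[0].lower()
    return sha256_of(binary_path) == expected
```

For example, matches_checksum_file("pdf-fmt", "pdf-fmt.sha256") returns True only when the digests agree. Substitute the actual file names you downloaded.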

You can also choose to do the following after this step:

  • Add it to your system $PATH
  • Set an alias pointing to the binary, or rename it manually
  • Create the configuration file

Available architectures for binaries

| Platform | Architecture |
| --- | --- |
| Windows | x86-64 |
| Linux | x86-64 |
| macOS | arm64 |

For other platforms or architectures, we recommend using uv tool install, the script installer or compiling from source.

About Downloaded Binaries

  • Choose the binary corresponding to your operating system
  • macOS is not supported.

If you wish to update the executable, download the latest version and remove the old executable file.

If you wish to use pdf-fmt on macOS, you can use the other methods.

About Versioning

The version number might be different from the one in the above example.

  • We encourage using the latest version, especially when major new features are added

Script Installer

You can also use pdf-fmt via the script installer, which sets up an isolated Python virtual environment to manage all dependencies.

Reviewing the scripts

  • The script will prompt for confirmation before starting the installation

Before running scripts, please review their contents by opening the URL they call in a browser. E.g. https://raw.githubusercontent.com/...

  • Alternatively, you can view them here

Windows

Open PowerShell and set the execution policy to RemoteSigned.

Then, run:

Invoke-RestMethod -Uri 'https://raw.githubusercontent.com/bladeacer/pdf-fmt/refs/heads/main/scripts/install.ps1' -OutFile install.ps1
Get-Content install.ps1

.\install.ps1

Linux or macOS

Open a terminal.

curl -o install.sh https://raw.githubusercontent.com/bladeacer/pdf-fmt/refs/heads/main/scripts/install.sh
cat install.sh

chmod +x install.sh
./install.sh

Using the Script Installer

The installer places the Python script inside your new .venv folder. Activate the environment and run the script:

For Linux or macOS

source .venv/bin/activate
chmod +x ./pdf-fmt.py
./pdf-fmt.py

You might find the Makefile helpful for this.

For Windows

.venv\Scripts\activate
pdf-fmt

The output is printed to the terminal and copied to your clipboard by default.

To update the script, run git pull in the repository the script creates under the pdf-fmt directory.

Compile from Source

Requires running the script installer or the following commands. This example assumes Linux. See the script usage example for how to activate the virtual environment on each OS.

It is recommended to use pyenv to manage different versions of Python, and to install ccache so that compiled binaries are cached. You also need the following Nuitka requirements.

You might find the Makefile helpful for this.

Pyenv setup (optional)

After installing pyenv, follow its instructions on configuring with pyenv init.

Then, run the following immediately after you change directory into the cloned repository.

pyenv install 3.11
pyenv local 3.11

You can use any other target Python version, though pdf-fmt primarily supports Python 3.10 or above.

Linux/macOS

# Either clone the repository or change directory to it if you have used the
# script installer prior
git clone --depth 1 https://github.com/bladeacer/pdf-fmt
cd pdf-fmt
chmod +x ./scripts/compile.sh
./scripts/compile.sh

The script creates a separate virtual environment for compiling from source. It outputs the binary to the build/ directory once compilation is done.

Compilation too slow? Increase the number specified in the jobs count. Only do this if you have sufficient CPU cores and hardware. Remove the --low-memory flag at your own risk.

If the compilation takes up too much memory, it will crash and exit without completing.

Compilation logs are written to nuitka-build.log. Crash reports are written to nuitka-crash-report.xml.

Alternatively, you can call this script on Linux or macOS.

Development status

Note: the configuration schema in this repository reflects the development branch.

The released binaries might not support some options yet. These are indicated with [DEV].

Supported platforms

This table documents the currently supported platforms for pdf-fmt and highlights platforms where we are seeking community confirmation of functionality.

  • Primarily, we aim to support the latest, most widely used version of each platform
  • This means that LTS or stable versions of a platform are sometimes preferred when testing for compatibility

We welcome your contributions! Please help us by:

  • Opening a pull request (PR) to confirm that pdf-fmt works on your platform, noting any specific setup caveats or workarounds.
  • Creating an issue if you encounter problems with the installer script or compiling from source.

| Platform | Display Protocol | C Standard Library | Known to work? | Comments |
| --- | --- | --- | --- | --- |
| Alpine Linux x64 (musl-based) | X11 | musl | Untested | Contributions are welcome |
| Arch Linux x64 | Wayland | glibc | Untested | Contributions are welcome |
| Arch Linux x64 | X11 | glibc | Untested | Contributions are welcome |
| Debian x64 (glibc) | Wayland | glibc | Untested | Contributions are welcome |
| Debian x86 (glibc) | X11 | glibc | Untested | Contributions are welcome |
| EndeavourOS x64 (Arch-based) | Wayland | glibc | Partial | Script works out of the box. Contributions are welcome for binary/compiling from source. |
| EndeavourOS x64 (Arch-based) | X11 | glibc | Yes | uv install/binary/script/compiling from source works. |
| Fedora x64 (RPM-based) | Wayland | glibc | Partial | Binary works out of the box. Contributions are welcome for script/compiling from source. |
| Fedora x64 (RPM-based) | X11 | glibc | Untested | Contributions are welcome |
| FreeBSD stable x64 | X11 | BSD libc | Untested | Contributions are welcome |
| NetBSD x64 | X11 | BSD libc | Untested | Contributions are welcome |
| OpenBSD x64 | X11 | BSD libc | Untested | Contributions are welcome |
| Ubuntu LTS x64 (Debian-based) | Wayland | glibc | Untested | Contributions are welcome |
| Ubuntu LTS x64 (Debian-based) | X11 | glibc | Untested | Contributions are welcome |
| macOS | N/A | libSystem (BSD libc) | Partial | uv install works. Contributions are welcome for binary/script/compiling from source. |
| Windows 10 x64 | N/A | MSVCRT (via MSVC/MinGW) | Untested | Contributions are welcome |
| Windows 11 x64 | N/A | MSVCRT (via MSVC/MinGW) | Partial | Binary works out of the box. Contributions are welcome for script/compiling from source. |
| Windows Subsystem for Linux (WSL) 2 x64 | N/A | glibc/musl | Untested | Contributions are welcome |

Note: Linux users

To check the C Standard Library used on Linux, run ldd --version.

To check the Display Protocol currently used on Linux, run echo $XDG_SESSION_TYPE.

You may need to install patchelf.
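
When reporting that pdf-fmt works (or does not) on your platform, a short Python sketch like the one below can gather the columns used in the table above. The function name is hypothetical. platform.libc_ver() is a best-effort check that may return empty strings on some systems (non-glibc ones in particular), so fall back to ldd --version if the result looks wrong.

```python
import os
import platform


def describe_platform():
    """Collect OS, architecture, display protocol and libc details as strings."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "system": platform.system(),
        "machine": platform.machine(),
        # XDG_SESSION_TYPE is usually "wayland" or "x11" on Linux desktops.
        "display_protocol": os.environ.get("XDG_SESSION_TYPE", "N/A"),
        "libc": f"{libc_name} {libc_version}".strip() or "unknown",
    }


for key, value in describe_platform().items():
    print(f"{key}: {value}")
```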

Supported Python Versions

| Python Version | Known to work? | Comments |
| --- | --- | --- |
| 3.10 | Yes | Compiling from source, script works. Used as default compilation/script version. |
| 3.11 | Yes | Compiling from source, script works. |
| 3.12 | Yes | Compiling from source, script works. Used in GitHub Actions. |
| 3.13 | Partial | Compiling from source, script works. |
| 3.14 | Untested | PRs welcome |

Contributing

Create your own fork or clone the repository. The example below shows cloning this repository on Linux.

Do note that this repository has its own Code of Conduct and Contributing Guide.

Setup

git clone https://github.com/bladeacer/pdf-fmt
# Change into the cloned repository so the relative script paths resolve
cd pdf-fmt
chmod +x scripts/setup.sh
./scripts/dev.sh

Benchmarks

TBC

A note on Compatibility

The script, compiled binaries and compiling from source should work for all major operating systems that support Git, Python, pdfplumber and pyperclip.

Note: These dependencies are slightly larger than their C equivalents, though this is a calculated trade-off.

Tests

Unit Tests

The tests use unittest, which is part of Python's standard library. You can use the script installer to clone the repository.

python -m unittest discover -sv tests

Alternatively, you can run the script.

License

GPLv3. See the license file for details.

License Notice

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Credits

Existing PDF tooling for inspiration, LibreOffice CLI. Nuitka for compilation, GitHub for hosting and CI.

My friend Potato for testing the binary on Windows.

My friend Floodlight for testing the binary on Fedora.

My friend rori for testing the uv install method on macOS.

The code of conduct was adopted from the Contributor Covenant.

The contributing guide was adopted from conduct.
