doc-convert

A CLI tool for converting documents (especially scanned PDFs) to Markdown, EPUB, HTML, DOCX, or plain text using GPU-powered neural OCR. Optional auto-translation into any language Argos Translate supports.

Everything runs locally. Nothing leaves your machine.

What it does

OCR — Extracts text from PDFs (including scanned/image-only) using Marker with Surya neural OCR on GPU
Translate (optional) — Auto-detects source language and translates to your target language using Argos Translate (local neural MT)
Convert — Outputs to your desired format via Pandoc with auto-detected metadata (title, author), table of contents, and proper styling

Quick start

# PDF → Markdown (default)
doc-convert paper.pdf

# Scanned book → EPUB with OCR
doc-convert book.pdf -f epub --ocr

# Russian PDF → English markdown
doc-convert russian-paper.pdf --translate

# Auto-detect language → English EPUB
doc-convert article.pdf -f epub --translate

# Force OCR + LLM-enhanced accuracy
doc-convert old-scan.pdf -f epub --ocr --llm
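
The single-file commands above extend to a whole folder with a small loop. The following is a minimal sketch, not part of doc-convert itself: the `convert_all` function and the `DOC_CONVERT_BIN` override variable are hypothetical names introduced here for illustration. It uses only the `-f` flag documented in this README.

```shell
# Convert every file given on the command line to one output format.
# Usage: convert_all <format> <file>...
# DOC_CONVERT_BIN is a hypothetical override so the loop can be dry-run
# (e.g. DOC_CONVERT_BIN=echo); it defaults to the installed doc-convert.
convert_all() {
    if [ "$#" -lt 2 ]; then
        echo "Usage: convert_all <format> <file>..." >&2
        return 1
    fi
    local fmt="$1"
    shift
    local f
    for f in "$@"; do
        # Keep going if one file fails, but report it on stderr.
        "${DOC_CONVERT_BIN:-doc-convert}" "$f" -f "$fmt" || echo "Failed: $f" >&2
    done
}
```

For example, `convert_all epub ~/scans/*.pdf` would convert each PDF in turn and report any failures without stopping the batch.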

Installation

Prerequisites

Python 3.10+
NVIDIA GPU with CUDA (for Marker/Surya OCR). CPU-only is possible but very slow.
Pandoc (apt install pandoc or brew install pandoc)
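
A quick way to confirm the prerequisites before installing is a small check script. This is a sketch under assumptions: the `have` and `check_prereqs` function names are made up for this example, and the presence of `nvidia-smi` is used only as a rough hint that an NVIDIA driver is installed, not as proof that CUDA works with Marker.

```shell
# Return 0 if the named command is on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Print one line per prerequisite; return non-zero if a required one is missing.
check_prereqs() {
    local missing=0

    if have python3 && python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)'; then
        echo "ok: Python 3.10+"
    else
        echo "missing: Python 3.10+"
        missing=1
    fi

    if have pandoc; then
        echo "ok: pandoc"
    else
        echo "missing: pandoc"
        missing=1
    fi

    # The GPU is not strictly required: CPU-only works but is very slow.
    if have nvidia-smi; then
        echo "ok: nvidia-smi found"
    else
        echo "warning: nvidia-smi not found, expect slow CPU-only OCR"
    fi

    return "$missing"
}
```

Run `check_prereqs` in the shell you plan to install from; a non-zero exit status means Python or Pandoc still needs to be installed.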

Setup

# Create a virtualenv for marker-pdf
python3 -m venv ~/marker-env
source ~/marker-env/bin/activate

# Install dependencies
pip install marker-pdf pypandoc PyMuPDF

# Optional: translation support
pip install argos-translate langdetect

# Install the script (make sure ~/.local/bin is on your PATH)
chmod +x doc-convert.sh
mkdir -p ~/.local/bin
cp doc-convert.sh ~/.local/bin/doc-convert

By default, the script looks for the marker venv at ~/marker-env. Override with:

export DOC_CONVERT_VENV=/path/to/your/marker-env
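
The lookup described above can be pictured as a plain shell default expansion. The snippet below is an illustrative sketch, assuming the script resolves the path this way; `resolve_venv` and `activate_venv` are hypothetical names, and the real doc-convert.sh may be organized differently.

```shell
# Print the virtualenv path: DOC_CONVERT_VENV if set, else ~/marker-env.
resolve_venv() {
    printf '%s\n' "${DOC_CONVERT_VENV:-$HOME/marker-env}"
}

# Activate the resolved virtualenv, failing clearly if it does not exist.
activate_venv() {
    local venv
    venv=$(resolve_venv)
    if [ ! -f "$venv/bin/activate" ]; then
        echo "No virtualenv found at $venv" >&2
        return 1
    fi
    . "$venv/bin/activate"
}
```

Checking for `bin/activate` first gives a readable error when the environment variable points at the wrong directory, instead of a failure later when Marker cannot be imported.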

Options

Flag	Description
`-f, --format FORMAT`	Output format: `md`, `epub`, `html`, `txt`, `docx` (default: `md`)
`--translate`	Auto-detect the source language and translate to the default target language (`en` unless changed in the config)
`--translate-to LANG`	Set target language (ISO 639-1 code, e.g. `ru`, `es`, `fr`)
`-o, --output PATH`	Output file path (default: auto-named next to input)
`--ocr`	Force OCR on all pages (for scanned/image PDFs)
`--llm`	Use LLM for higher OCR accuracy (slower)
`--title TITLE`	Set document title (auto-detected from PDF metadata if not set)
`--author AUTHOR`	Set author name (auto-detected if not set)
`--config`	Show/edit persistent settings
`-h, --help`	Show help
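
When `-o` is omitted, the output is "auto-named next to input". One plausible reading of that rule is "same directory, same base name, new extension", sketched below. This is an assumption about the naming scheme rather than the script's actual code, and `default_output_path` is a hypothetical helper name.

```shell
# Print the assumed default output path for an input file and a format:
# the input's directory, its name without the last extension, plus ".<format>".
default_output_path() {
    local input="$1" fmt="$2"
    local dir name
    dir=$(dirname -- "$input")
    name=$(basename -- "$input")
    printf '%s/%s.%s\n' "$dir" "${name%.*}" "$fmt"
}
```

Splitting the directory from the file name before stripping the extension avoids mangling paths whose directories contain dots.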

Configuration

Persistent settings live at ~/.config/doc-convert/settings.conf:

# Default target language for --translate
DEFAULT_TRANSLATE_TO="en"

# Default output format
DEFAULT_FORMAT="md"

Run doc-convert --config to create or view the config file.
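
Because the settings file uses plain `KEY="value"` shell syntax, a shell script can load it by setting defaults and then sourcing the file. The sketch below shows that pattern under the assumption that doc-convert works this way; `load_settings` is a hypothetical function name.

```shell
# Set built-in defaults, then let the config file (if readable) override them.
# Usage: load_settings [path-to-settings.conf]
load_settings() {
    local conf="${1:-$HOME/.config/doc-convert/settings.conf}"

    # Defaults matching the values documented in this README.
    DEFAULT_TRANSLATE_TO="en"
    DEFAULT_FORMAT="md"

    # Sourcing runs the file as shell code, so only use a file you trust.
    if [ -r "$conf" ]; then
        . "$conf"
    fi
}
```

After `load_settings`, `$DEFAULT_FORMAT` and `$DEFAULT_TRANSLATE_TO` hold either the file's values or the defaults, so a missing config file is not an error.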

How it works

Marker is the heavy lifter — it uses Surya's neural OCR models to extract text from PDFs, handling scanned documents, complex layouts, tables, and equations. It runs on GPU (CUDA) for speed.

Pandoc handles format conversion. For EPUBs, doc-convert auto-detects title and author from PDF metadata (or first-page text), generates a table of contents, and applies clean typography.

Argos Translate provides local neural machine translation between 30+ language pairs. No API keys, no cloud services.
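
To make the Pandoc stage concrete, here is roughly what an equivalent command-line call could look like for EPUB output. It is a sketch only: doc-convert goes through pypandoc and its exact arguments are not shown in this README. `md_to_epub` and the `PANDOC_BIN` override are hypothetical names; `-o`, `--toc`, and `--metadata` are standard Pandoc options.

```shell
# Convert a Markdown file to EPUB with a table of contents and metadata.
# Usage: md_to_epub <input.md> <output.epub> <title> <author>
# PANDOC_BIN is a hypothetical override for dry runs (e.g. PANDOC_BIN=echo).
md_to_epub() {
    local input="$1" output="$2" title="$3" author="$4"
    "${PANDOC_BIN:-pandoc}" "$input" -o "$output" --toc \
        --metadata "title=$title" --metadata "author=$author"
}
```

Setting `PANDOC_BIN=echo` prints the arguments instead of running Pandoc, which is a cheap way to check quoting for titles that contain spaces.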

Large scanned PDFs

For books over ~100 pages, Marker may run out of VRAM. The workaround is chunked processing:

# Split into 25-page chunks with qpdf (pdftk can split PDFs too)
qpdf --split-pages=25 big-book.pdf chunk_%d.pdf

# Convert each chunk
for f in chunk_*.pdf; do
    doc-convert "$f" --ocr
done

# Merge the markdown
cat chunk_*.md > full-book.md

# Convert merged output to EPUB
doc-convert full-book.md -f epub --title "Book Title" --author "Author Name"
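
A plain `cat` can glue the last line of one chunk onto the first line of the next when a chunk lacks a trailing newline, which breaks Markdown headings and paragraphs. The helper below is a small sketch of a safer merge; `merge_chunks` is a hypothetical name and not part of doc-convert.

```shell
# Concatenate Markdown chunks into one file, with a blank line after each
# chunk so paragraphs and headings from adjacent chunks never run together.
# Usage: merge_chunks <output.md> <chunk.md>...
merge_chunks() {
    local out="$1"
    shift
    : > "$out"    # start from an empty output file
    local f
    for f in "$@"; do
        cat -- "$f" >> "$out"
        printf '\n\n' >> "$out"
    done
}
```

Used in place of the `cat` step, this becomes `merge_chunks full-book.md chunk_*.md`; the shell expands the glob in sorted order, so the chunks need consistently zero-padded names to stay in page order.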

Remote usage

If you have a GPU server but work from a laptop, you can wrap doc-convert in a shell function that handles the round trip:

# Add to your .bashrc / .zshrc
nconvert() {
    local file="$1"
    local fmt="${2:-epub}"

    # Validate arguments before deriving any names from them
    if [[ -z "$file" ]]; then
        echo "Usage: nconvert <file> [format]"
        echo "Formats: epub, md, html, txt, docx"
        return 1
    fi

    # Declare first, assign second, so a failing basename is not masked by `local`
    local name base
    name=$(basename "$file")
    base="${name%.*}"

    echo "📤 Sending $name to server..."
    scp -r "$file" myserver:~/drop/ || { echo "Failed to send file"; return 1; }

    echo "⚙️  Converting to $fmt..."
    ssh myserver "bash ~/doc-convert.sh \"\$HOME/drop/$name\" -f $fmt" || { echo "Conversion failed"; return 1; }

    echo "📥 Fetching result..."
    scp "myserver:~/drop/${base}.${fmt}" . || { echo "Failed to fetch result"; return 1; }

    echo "✅ Done: $base.$fmt"
}

Replace myserver with your SSH host alias. Then from your laptop:

nconvert "Scanned Book.pdf" epub

The PDF goes up, gets OCR'd on the GPU, and the EPUB comes back — all in one command.

Supported formats

Input	Output
PDF (native text)	Markdown
PDF (scanned/image)	EPUB
PPTX, DOCX	HTML
Any Marker-supported format	Plain text
	DOCX

Dependencies

Package	Purpose	Required?
marker-pdf	OCR + text extraction	Yes
pypandoc	Format conversion	Yes
PyMuPDF	PDF metadata extraction	Yes
argos-translate	Local neural translation	Only for `--translate`
langdetect	Language detection	Only for `--translate`

Made by

Anna Brezgis and Claude — brezgis.com

License

MIT
