model : add PaddleOCR #16701

ngxson · 2025-10-21T14:11:37Z

This is a very early WIP

Progress:

~~Only the language model is working now. The vision encoder is not yet implemented~~
~~Vision encoder is added, but not yet numerically correct~~
The model generates hallucinated text, likely because the projector is incorrect

TalonBvV · 2025-10-29T10:28:15Z

@ngxson thanks for the great work on this. I was really looking forward to benchmarking this model until I saw its limitations. On your point that the hallucinated text is likely caused by an incorrect projector: I don't think the projector is the cause. I cloned your branch to see why it's hallucinating, and it seems to be due to the missing input pre-processing that the "PP-DocLayoutV2" model performs. PaddleOCR-VL is not an end-to-end VLM; it relies on "PP-DocLayoutV2" for detection. It's basically a glorified version of LayoutLM.
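
For illustration only, the two-stage flow described above might look roughly like the following. This is a minimal sketch, not PaddleOCR's actual code: `detect_layout` and `recognize` are hypothetical stand-ins for PP-DocLayoutV2 and the PaddleOCR-VL model, and the stubs at the bottom return made-up values so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, end-exclusive
Image = List[List[int]]          # toy grayscale image: rows of pixel values


@dataclass
class Region:
    box: Box
    label: str  # e.g. "text", "table", "formula"


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangle described by `box` out of `image`."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def run_pipeline(
    image: Image,
    detect_layout: Callable[[Image], List[Region]],
    recognize: Callable[[Image, str], str],
) -> List[Tuple[Region, str]]:
    """Stage 1 finds regions; stage 2 recognizes each cropped region."""
    results = []
    for region in detect_layout(image):
        text = recognize(crop(image, region.box), region.label)
        results.append((region, text))
    return results


# Hypothetical stubs standing in for the real models.
def fake_detect(image: Image) -> List[Region]:
    return [Region((0, 0, 4, 2), "text"), Region((0, 2, 2, 4), "table")]


def fake_recognize(patch: Image, label: str) -> str:
    # Report the label and patch size instead of running a model.
    return f"{label}:{len(patch)}x{len(patch[0])}"


page = [[0] * 4 for _ in range(4)]
outputs = [text for _, text in run_pipeline(page, fake_detect, fake_recognize)]
```

The point of the sketch is that the recognizer only ever sees cropped regions, so feeding it a full page without the detection stage skips a step the model expects.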

ngxson · 2025-11-03T11:33:06Z

@TalonBvV thanks for the info. Yes, I had come to almost the same conclusion. The main issue is that PaddleOCR is not one monolithic model like Qwen or DeepSeek-OCR; it's more like a pipeline of multiple models glued together. Therefore, I don't think we currently have the infrastructure to bring it into llama.cpp.

I'll close this PR for now, as it's not giving any meaningful results. For users who need to do OCR tasks, I would recommend having a look at the latest Qwen3-VL series or LightOnOCR-1B.

predict-woo · 2025-12-10T17:24:37Z

@TalonBvV Hi, I managed to convert the PP-DocLayoutV2 part of the pipeline to ONNX format by adding an ONNX conversion mapping for index_put to paddle2onnx.
Do you think sharing the patch here would reopen this PR, or is this thread closed for good?

I've been looking into DeepSeek-OCR, but its accuracy is actually lower than PaddleOCR-VL's in real-world use. PaddleOCR-VL is also only 0.9B, which makes it runnable on basically any device.
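
For readers unfamiliar with the operator, here is a rough pure-Python sketch of what an `index_put`-style op does, based on the usual Paddle/PyTorch semantics as I understand them. It is not the paddle2onnx patch, and how the patch maps the op onto ONNX operators is not shown here. The function name and the 2-D list representation are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]


def index_put(
    x: Matrix,
    indices: Tuple[Sequence[int], Sequence[int]],
    values: Sequence[int],
    accumulate: bool = False,
) -> Matrix:
    """Return a copy of `x` with `values` written at (row, col) positions.

    With accumulate=True the values are added to the existing entries,
    so repeated positions sum up. Real frameworks may leave the result
    for repeated positions unspecified when accumulate is False; this
    sketch simply applies the writes in order.
    """
    rows, cols = indices
    out = [row[:] for row in x]  # copy so the input is left untouched
    for r, c, v in zip(rows, cols, values):
        out[r][c] = out[r][c] + v if accumulate else v
    return out


zeros = [[0, 0, 0] for _ in range(3)]

# Plain scatter-style assignment with unique positions.
written = index_put(zeros, ([0, 2], [1, 0]), [5, 7])

# Accumulation: position (2, 0) appears twice, so 7 and 9 are summed.
summed = index_put(zeros, ([0, 2, 2], [1, 0, 0]), [5, 7, 9], accumulate=True)
```

An op like this is a data-dependent scatter, which is why an exporter needs an explicit mapping for it rather than a one-to-one rename.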

wip paddleocr

366abe7

ngxson linked an issue Oct 21, 2025 that may be closed by this pull request

Feature Request: support PaddleOCR-VL #16627

Open

4 tasks

github-actions bot added the python python script changes label Oct 21, 2025

ngxson added 4 commits October 22, 2025 11:54

convert vision encoder ok

e7a485c

fix chat template

ac41b54

rm V_RESMPL_ATTN_NORM

be80289

model load ok

ddfaca7

github-actions bot added the examples label Oct 22, 2025

output text but gibberish

ea2fbb8

ngxson mentioned this pull request Oct 22, 2025

mtmd-cli : allow using --jinja #16718

Merged

ngxson added 5 commits October 23, 2025 15:01

Merge branch 'master' into xsn/paddleocr

030f1b2

Merge branch 'master' into xsn/paddleocr

03a4a49

correct projector

a342f52

fix conversion script

bd38d7f

fix model load

70e7312

ngxson closed this Nov 3, 2025

ngxson mentioned this pull request Dec 9, 2025

mtmd: Add DeepSeekOCR Support #17400

Open


3 participants
