Prevent Skipping?  #67

@Ryandonofrio3

Hello. I am trying to process every page of my PDF, but about 50% of the pages are skipped because repetitions are detected. I have manually confirmed that those pages are not repeats. For instance, for my 6-page PDF of an academic text, only the methods section came out. Is there a way to disable this check and "force" the full output?

(.venv) PS C:\Users\--\Desktop\Nougat> nougat .\t3.pdf -o .\output\
c:\Users\---\Desktop\Nougat\.venv\lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ..\aten\src\ATen\native\TensorShape.cpp:3484.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
  0%| | 0/2 [00:00<?, ?it/s]WARNING:root:Found repetitions in sample 0
INFO:root:Processing file t3.pdf with 6 pages
WARNING:root:Skipping page 1 due to repetitions.
 50%|███████████████████████████████████████████████████████▌ | 1/2 [00:21<00:21, 21.34s/it]WARNING:root:Found repetitions in sample 1
WARNING:root:Skipping page 5 due to repetitions.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:32<00:00, 16.44s/it]
(.venv) PS C:\Users\---\Desktop\Nougat>
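To make it easier to report exactly which pages were dropped, here is a minimal Python sketch that scans captured console output for the warning lines shown in the log above. It only assumes the two message formats visible there ("Processing file ... with N pages" and "Skipping page N due to repetitions"); the helper names `skipped_pages` and `page_count` are hypothetical and not part of Nougat.

```python
import re

# Patterns copied from the messages in the log above; other versions of
# the tool may word these lines differently (assumption).
SKIP_RE = re.compile(r"Skipping page (\d+) due to repetitions")
TOTAL_RE = re.compile(r"Processing file (\S+) with (\d+) pages")


def skipped_pages(log_text):
    """Return the page numbers reported as skipped, in log order."""
    return [int(n) for n in SKIP_RE.findall(log_text)]


def page_count(log_text):
    """Return the total page count reported in the log, or None if absent."""
    match = TOTAL_RE.search(log_text)
    return int(match.group(2)) if match else None


# Excerpt of the log from this issue.
log = (
    "INFO:root:Processing file t3.pdf with 6 pages "
    "WARNING:root:Skipping page 1 due to repetitions. "
    "WARNING:root:Skipping page 5 due to repetitions."
)

print("pages in file:", page_count(log))
print("skipped pages:", skipped_pages(log))
```

Note that this only lists the pages the log explicitly reports as skipped; it does not explain why other pages might be missing from the output.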
