Parsing file with lots of dictionaries is extremely slow #775

@fancycode

I experienced this with a PDF that is a 9-page export of CAD drawings containing lots of small lines and numbers (created using "PDFTron PDFNet, V6.40292"). Unfortunately I can't share the document.

Tested with latest master (b89d7b1) and the test script below.

package main

import (
	"log"
	"os"

	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu/model"
)

func main() {
	log.SetFlags(log.Flags() | log.Lmicroseconds)
	fp, err := os.Open("slow-parse.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer fp.Close()

	conf := model.NewDefaultConfiguration()
	log.Printf("Parsing ...")
	pdf, err := pdfcpu.Read(fp, conf)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("Done")

	if err := pdf.EnsurePageCount(); err != nil {
		log.Fatal(err)
	}

	log.Printf("Parsed %d pages", pdf.PageCount)
}

Parsing doesn't finish (I killed it after 5 minutes at 100% CPU). I have a patch ready that brings parsing of this file down to below 4 seconds.