feat: add embedded font extraction API with TTF round-trip support #71

Merged
kolkov merged 2 commits into main from feat/font-extraction on May 7, 2026

kolkov commented May 7, 2026

Contributor

Summary

  • Add Document.GetEmbeddedFonts() and GetEmbeddedFontsForPage() for extracting embedded font binary data (TTF/OTF) from parsed PDFs
  • Add fonts.LoadTTFFromBytes() constructor for round-trip font workflows (extract → re-embed)
  • Add shared decodeStreamData() utility for FlateDecode stream decompression
  • Support TrueType (simple) and Type0/CIDFontType2 (composite) font extraction
  • Graceful skip for Standard 14 fonts (not embedded) and unsupported font types

Closes #67
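
The FlateDecode step named in the summary can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `decodeFlate` and `encodeFlate` are hypothetical names, and the real `decodeStreamData()` may handle filters, predictors, or errors differently. It relies only on the fact that a FlateDecode stream is zlib-wrapped deflate data.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// decodeFlate is a hypothetical stand-in for the PR's decodeStreamData.
// It inflates a zlib-wrapped stream, which is what /Filter /FlateDecode uses.
func decodeFlate(raw []byte) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, fmt.Errorf("open zlib stream: %w", err)
	}
	defer zr.Close()

	out, err := io.ReadAll(zr)
	if err != nil {
		return nil, fmt.Errorf("inflate stream: %w", err)
	}
	return out, nil
}

// encodeFlate compresses data so the sketch can exercise a full round trip.
func encodeFlate(data []byte) []byte {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	_, _ = zw.Write(data) // writes to a bytes.Buffer do not fail
	_ = zw.Close()
	return buf.Bytes()
}

func main() {
	original := []byte("\x00\x01\x00\x00 placeholder font program bytes")
	decoded, err := decodeFlate(encodeFlate(original))
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", bytes.Equal(original, decoded))
}
```

Returning wrapped errors rather than partial data keeps a corrupt font stream from being handed to a TTF parser downstream.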

Test plan

  • go fmt ./... clean
  • go vet ./... clean
  • go test ./... — 23 packages, all passing
  • golangci-lint run — no new issues
  • 36 new tests: stream decoding, font walk path, round-trip integration (create PDF with embedded font → extract → verify metrics match), LoadTTFFromBytes metric equality vs LoadTTF, invalid/empty data rejection
  • No breaking changes to existing API
  • CHANGELOG, README updated
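
As an illustration of the invalid/empty data rejection listed above, the sketch below checks the 12-byte sfnt offset table before any parsing. `checkSFNTHeader` is a hypothetical helper; how `LoadTTFFromBytes` actually validates its input is not shown in this PR conversation. The version tags used (`0x00010000` for TrueType outlines, `OTTO` for CFF-based OpenType, `true` for Apple TrueType) are the standard sfnt signatures.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errNotSFNT = errors.New("data is not a TrueType/OpenType font")

// checkSFNTHeader is a hypothetical pre-check, not the library's API.
// It rejects empty or truncated input and unknown sfnt version tags.
func checkSFNTHeader(data []byte) error {
	const offsetTableSize = 12 // sfntVersion + numTables + 3 search fields
	if len(data) < offsetTableSize {
		return fmt.Errorf("%w: only %d bytes", errNotSFNT, len(data))
	}
	switch tag := binary.BigEndian.Uint32(data[:4]); tag {
	case 0x00010000, // TrueType outlines
		0x4F54544F, // "OTTO": CFF-based OpenType
		0x74727565: // "true": Apple TrueType
		return nil
	default:
		return fmt.Errorf("%w: unknown version tag 0x%08X", errNotSFNT, tag)
	}
}

func main() {
	ttfHeader := []byte{0x00, 0x01, 0x00, 0x00, 0, 0, 0, 0, 0, 0, 0, 0}
	fmt.Println("ttf header:", checkSFNTHeader(ttfHeader))
	fmt.Println("empty:", checkSFNTHeader(nil))
	fmt.Println("pdf bytes:", checkSFNTHeader([]byte("%PDF-1.7 not a font")))
}
```

A check like this only filters obviously wrong input; a full parser still has to validate the table directory that follows the header.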

kolkov added 2 commits May 7, 2026 20:28
- Add LoadTTFFromBytes to parse TTF/OTF fonts from in-memory byte slices,
  enabling round-trip use of fonts extracted from existing PDFs
- Add FontExtractor with ExtractFromPage/ExtractFromDocument, walking the
  PDF font dictionary chain (Font→FontDescriptor→FontFile2/FontFile/FontFile3)
  with FlateDecode decompression and deduplication across pages
- Add decodeStreamData shared utility in extractor package for reusable
  zlib stream decoding without breaking existing private methods
- Add GetEmbeddedFonts and GetEmbeddedFontsForPage to public Document API
- Add 36 tests covering stream decoding, font walking, round-trip extraction,
  empty-font PDFs, and out-of-range page error handling
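
The dictionary walk and cross-page deduplication described in the commit message can be sketched as follows. Every type and function here (`dict`, `stream`, `embeddedFont`, `descriptorOf`, `extractFonts`) is hypothetical and stands in for the PR's unshown `FontExtractor`; deduplicating by indirect object number is also an assumption. What the sketch does take from the PDF format is the chain itself: a Type0 font reaches its descriptor through `DescendantFonts`, and the descriptor holds the program under `FontFile2`, `FontFile`, or `FontFile3`.

```go
package main

import "fmt"

// dict and stream are hypothetical stand-ins for parsed PDF objects.
type dict map[string]any

type stream struct {
	ObjNum int    // indirect object number, used here as the dedup key
	Data   []byte // font program bytes, assumed already decompressed
}

type embeddedFont struct {
	Name   string
	Format string
	Data   []byte
}

// fontFileKeys lists the FontDescriptor entries in the order they are tried.
var fontFileKeys = []struct{ key, format string }{
	{"FontFile2", "TrueType"},
	{"FontFile", "Type1"},
	{"FontFile3", "CFF/OpenType"},
}

// descriptorOf follows Font -> FontDescriptor, going through the first
// descendant CIDFont when the font is a composite (Type0) font.
func descriptorOf(font dict) (dict, bool) {
	if sub, _ := font["Subtype"].(string); sub == "Type0" {
		kids, ok := font["DescendantFonts"].([]dict)
		if !ok || len(kids) == 0 {
			return nil, false
		}
		font = kids[0]
	}
	fd, ok := font["FontDescriptor"].(dict)
	return fd, ok
}

// extractFonts walks every page's fonts and returns each font program once.
func extractFonts(pages [][]dict) []embeddedFont {
	seen := map[int]bool{}
	var out []embeddedFont
	for _, fonts := range pages {
		for _, f := range fonts {
			fd, ok := descriptorOf(f)
			if !ok {
				continue // e.g. Standard 14 fonts carry no embedded program
			}
			for _, k := range fontFileKeys {
				s, ok := fd[k.key].(stream)
				if !ok {
					continue
				}
				if !seen[s.ObjNum] {
					seen[s.ObjNum] = true
					name, _ := f["BaseFont"].(string)
					out = append(out, embeddedFont{Name: name, Format: k.format, Data: s.Data})
				}
				break
			}
		}
	}
	return out
}

func main() {
	shared := stream{ObjNum: 12, Data: []byte{0, 1, 0, 0}}
	ttf := dict{"Subtype": "TrueType", "BaseFont": "ABCDEF+Demo",
		"FontDescriptor": dict{"FontFile2": shared}}
	helv := dict{"Subtype": "Type1", "BaseFont": "Helvetica"}

	// The same font appears on two pages but is reported once.
	for _, f := range extractFonts([][]dict{{ttf, helv}, {ttf}}) {
		fmt.Printf("%s (%s), %d bytes\n", f.Name, f.Format, len(f.Data))
	}
}
```

Skipping fonts without a descriptor, instead of returning an error, mirrors the graceful handling of Standard 14 fonts that the PR describes.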
kolkov merged commit 63e56b1 into main May 7, 2026
7 checks passed
kolkov deleted the feat/font-extraction branch May 7, 2026 17:32
codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 57.20165% with 104 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/extractor/font_extractor.go | 58.72% | 49 Missing and 22 partials ⚠️ |
| fonts.go | 0.00% | 25 Missing ⚠️ |
| internal/extractor/stream_decoder.go | 81.08% | 5 Missing and 2 partials ⚠️ |
| internal/fonts/ttf_parser.go | 88.88% | 1 Missing ⚠️ |

Development

Successfully merging this pull request may close these issues.

Feature Request: Font data extraction for round-trip preservation

1 participant