🐛 fix(mq-crawler): improve HTML to Markdown conversion for pages with sidebars and JS #1469

Merged
harehare merged 9 commits into main from fix/html-to-markdown-content-extraction on Mar 18, 2026

Conversation

@harehare (Owner) commented Mar 18, 2026

  • Skip <nav>, <aside>, and <noscript> elements during conversion to avoid polluting Markdown output with navigation menus, sidebars, and JS fallback content
  • Add smart content extraction: prefer <main> → [role="main"] → <article>
    before falling back to the full document
  • Output Markdown only to stdout; move Filename header and tracing logs to stderr
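
To illustrate the first bullet, here is a minimal sketch of subtree skipping on a toy tree. The `Node` type and the `el`, `text`, and `collect_text` helpers are hypothetical and exist only for this example; they are not the types or functions used in mq-markdown's converter.

```rust
/// Toy DOM node, assumed for illustration only (not the mq-markdown types).
#[derive(Debug, Clone)]
enum Node {
    Element { tag: String, children: Vec<Node> },
    Text(String),
}

/// Tags whose whole subtree is dropped, as listed in the PR description.
const NOISY_TAGS: [&str; 3] = ["nav", "aside", "noscript"];

/// Hypothetical helper: build an element node.
fn el(tag: &str, children: Vec<Node>) -> Node {
    Node::Element { tag: tag.to_string(), children }
}

/// Hypothetical helper: build a text node.
fn text(s: &str) -> Node {
    Node::Text(s.to_string())
}

/// Collects non-empty text in document order, skipping noisy subtrees.
fn collect_text(node: &Node, out: &mut Vec<String>) {
    match node {
        Node::Text(t) => {
            let trimmed = t.trim();
            if !trimmed.is_empty() {
                out.push(trimmed.to_string());
            }
        }
        Node::Element { tag, children } => {
            // Tag names are compared case-insensitively.
            if NOISY_TAGS.iter().any(|t| t.eq_ignore_ascii_case(tag)) {
                return; // drop the element together with everything inside it
            }
            for child in children {
                collect_text(child, out);
            }
        }
    }
}
```

Returning early at the element level means nested content inside a `<nav>` or `<aside>` never reaches the output, which is the behavior the bullet describes.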

… sidebars and JS

- Skip <nav>, <aside>, and <noscript> elements during conversion to avoid
  polluting Markdown output with navigation menus, sidebars, and JS fallback content
- Add smart content extraction: prefer <main> → [role="main"] → <article>
  before falling back to the full document
- Output Markdown only to stdout; move Filename header and tracing logs to stderr
- Add integration tests for noisy element skipping and smart content extraction
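
The extraction order in the second bullet can be sketched as a chain of fallbacks. This is a simplified model under stated assumptions: `Element`, `el`, `find_first`, and `select_content_root` are hypothetical names, and the depth-first "first match wins" rule is an assumption, not something the PR text confirms.

```rust
/// Toy element with only the fields this sketch needs (assumed, not the real DOM type).
#[derive(Debug)]
struct Element {
    tag: String,
    role: Option<String>,
    children: Vec<Element>,
}

/// Hypothetical helper: build an element.
fn el(tag: &str, role: Option<&str>, children: Vec<Element>) -> Element {
    Element {
        tag: tag.to_string(),
        role: role.map(String::from),
        children,
    }
}

/// Depth-first search for the first element matching `pred`.
fn find_first<'a>(
    root: &'a Element,
    pred: &dyn Fn(&Element) -> bool,
) -> Option<&'a Element> {
    if pred(root) {
        return Some(root);
    }
    root.children.iter().find_map(|c| find_first(c, pred))
}

/// Prefers <main>, then [role="main"], then <article>, else the full document.
fn select_content_root(root: &Element) -> &Element {
    find_first(root, &|e: &Element| e.tag == "main")
        .or_else(|| find_first(root, &|e: &Element| e.role.as_deref() == Some("main")))
        .or_else(|| find_first(root, &|e: &Element| e.tag == "article"))
        .unwrap_or(root)
}
```

Each selector is tried over the whole document before the next one is considered, so an `<article>` that appears earlier in the page does not win over a later `[role="main"]` container.
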
Copilot AI review requested due to automatic review settings March 18, 2026 06:10

Copilot AI left a comment

Pull request overview

This PR improves mq-crawler’s HTML→Markdown pipeline for real-world pages (sidebars/nav/JS), and adds crawler capabilities for JS-rendered pages and multi-domain crawling while keeping Markdown output clean for piping.

Changes:

  • Enhance HTML→Markdown conversion by skipping noisy elements (nav/aside/noscript) and extracting content from main / [role="main"] / article preferentially.
  • Update crawler CLI + runtime to support headless Chrome (CDP via chromiumoxide), allowed-domain crawling, stderr-only logs/headers, and per-domain rate limiting.
  • Add/extend tests for noisy element skipping, smart extraction, and allowed-domain filtering.
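
As a rough illustration of allowed-domain filtering, the sketch below checks a URL's host against an allowlist using only the standard library. The function names `host_of` and `is_allowed` are hypothetical. The matching rule (exact host or any subdomain) is an assumption, since the PR summary does not say how `--allowed-domains` matches. A real crawler would more likely use a proper URL parser than this string handling.

```rust
/// Extracts the host from an absolute http(s) URL with plain string handling.
/// Limitation of this sketch: bracketed IPv6 hosts are not handled.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Returns true if the URL's host equals an allowed domain or is a subdomain of one.
/// (Assumed semantics; the actual flag may match differently.)
fn is_allowed(url: &str, allowed_domains: &[&str]) -> bool {
    let Some(host) = host_of(url) else {
        return false;
    };
    let host = host.to_ascii_lowercase();
    allowed_domains.iter().any(|d| {
        let d = d.to_ascii_lowercase();
        host == d || host.ends_with(&format!(".{d}"))
    })
}
```

The leading dot in the suffix check keeps `notexample.com` from matching an allowlist entry of `example.com`.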

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Adds integration tests for skipping noisy elements and smart extraction. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips noisy tags during conversion; adds unit tests. |
| crates/mq-markdown/src/html_to_markdown.rs | Implements “smart extraction” of main content before conversion. |
| crates/mq-crawler/src/main.rs | Adds --headless, --chrome-path, --allowed-domains; routes logs to stderr; wires new domain options. |
| crates/mq-crawler/src/http_client.rs | Adds a Chromium HTTP client variant using chromiumoxide. |
| crates/mq-crawler/src/crawler.rs | Adds allowed-domain filtering, new parallel scheduler, per-domain rate limiting, and stdout/stderr separation. |
| crates/mq-crawler/Cargo.toml | Adds chromiumoxide and rustls, adjusts fantoccini features. |
| Cargo.toml | Adds workspace deps for chromiumoxide and rustls. |
| Cargo.lock | Locks new transitive dependencies introduced by Chromium support. |

Copilot AI review requested due to automatic review settings March 18, 2026 09:10

Copilot AI left a comment

Pull request overview

This PR improves mq-crawler’s HTML→Markdown output quality for real-world pages (sidebars/nav/JS fallbacks) and enhances crawling capabilities for JS-heavy and multi-domain sites, while keeping stdout clean for piping.

Changes:

  • Improve HTML extraction/conversion: smart content selection (main → [role="main"] → article), skip noisy elements (nav/aside/noscript), and handle <title> as optional H1 at the top level.
  • Add headless Chrome (CDP via chromiumoxide) and domain filtering support; introduce per-domain rate limiting for parallel crawling.
  • Route filename headers and tracing output to stderr so stdout contains only Markdown.
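
One way to picture per-domain rate limiting for parallel crawling is a map from domain to the next instant a request may start. The sketch below is a minimal model under that assumption. `DomainRateLimiter` and `reserve` are hypothetical names, and the scheduler in `crawler.rs` may be structured quite differently.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks, per domain, the earliest instant at which the next request may start.
struct DomainRateLimiter {
    min_interval: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainRateLimiter {
    fn new(min_interval: Duration) -> Self {
        Self {
            min_interval,
            next_allowed: HashMap::new(),
        }
    }

    /// Reserves the next request slot for `domain` and returns how long the
    /// caller should wait (relative to `now`) before sending the request.
    fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        // Use the stored slot if it is still in the future, otherwise start now.
        let slot = match self.next_allowed.get(domain) {
            Some(&t) if t > now => t,
            _ => now,
        };
        // The following request to this domain must wait one more interval.
        self.next_allowed
            .insert(domain.to_string(), slot + self.min_interval);
        slot.duration_since(now)
    }
}
```

Because slots are keyed by domain, workers fetching from different hosts never delay each other, while concurrent requests to the same host are spaced `min_interval` apart. Passing `now` explicitly keeps the logic deterministic and easy to test.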

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| docs/books/src/start/crawler.md | Updates crawler docs (options, examples, install command, new features). |
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Updates/extends HTML→Markdown tests for new extraction and skipping behavior. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips nav/aside/noscript; moves title handling out of node conversion. |
| crates/mq-markdown/src/html_to_markdown.rs | Adds smart extraction and top-level title extraction for use_title_as_h1. |
| crates/mq-crawler/src/main.rs | Adds CLI flags (--headless, --chrome-path, --allowed-domains) and stderr logging. |
| crates/mq-crawler/src/http_client.rs | Adds Chromium client (headless Chrome) and multi-domain reqwest builder. |
| crates/mq-crawler/src/crawler.rs | Adds domain filtering, per-domain rate limiting, and new parallel scheduling loop. |
| crates/mq-crawler/README.md | Documents new flags/features and updates CLI help snippet. |
| crates/mq-crawler/Cargo.toml | Adds chromiumoxide/rustls deps and adjusts fantoccini TLS features. |
| Cargo.toml | Adds workspace deps for chromiumoxide and rustls. |
| Cargo.lock | Records new transitive dependencies (notably chromiumoxide + additional reqwest). |

Copilot AI review requested due to automatic review settings March 18, 2026 09:21

Copilot AI left a comment

Pull request overview

This PR improves mq-crawler’s content quality and crawling capabilities by enhancing HTML→Markdown extraction (removing common “noise” sections and preferring main content) and adding a built-in headless Chrome mode for JavaScript-rendered pages, while keeping stdout clean for Markdown piping.

Changes:

  • Improve HTML→Markdown conversion by skipping <nav>/<aside>/<noscript> and adding “smart extraction” (<main> → [role="main"] → <article>).
  • Add built-in headless Chrome crawling (Chromiumoxide) plus domain filtering support.
  • Route logs/filename headers to stderr and expand user docs/README to reflect new CLI behavior.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| docs/books/src/start/crawler.md | Updates crawler docs, options, and examples (now includes headless + domain filtering). |
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Updates expectations and adds tests for smart extraction / noisy element skipping. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips <nav>/<aside>/<noscript> and ignores <title> in body conversion. |
| crates/mq-markdown/src/html_to_markdown.rs | Adds smart content extraction and title handling for --use-title-as-h1. |
| crates/mq-crawler/src/main.rs | Adds CLI flags for headless Chrome + allowed domains; routes tracing to stderr. |
| crates/mq-crawler/src/http_client.rs | Implements Chromiumoxide-based headless Chrome client. |
| crates/mq-crawler/src/crawler.rs | Adds allowed-domain filtering, per-domain rate limiting, and stderr filename header output. |
| crates/mq-crawler/README.md | Documents new features/options (headless chrome + allowed domains) and revised CLI help. |
| crates/mq-crawler/Cargo.toml | Adds chromiumoxide + rustls dependencies and rustls-tls config for fantoccini. |
| Cargo.toml | Adds workspace dependencies for chromiumoxide and rustls. |
| Cargo.lock | Locks new dependencies pulled in by chromiumoxide/rustls changes. |

Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Copilot AI review requested due to automatic review settings March 18, 2026 09:39

Copilot AI left a comment

Pull request overview

This PR enhances mq-crawler’s web extraction pipeline to produce cleaner Markdown from modern HTML pages (sidebars/nav/JS-heavy layouts), and adds a built-in headless Chrome fetching mode for dynamic content.

Changes:

  • Improve HTML→Markdown conversion by skipping noisy elements (nav, aside, noscript) and by extracting content from <main> / [role="main"] / <article> before falling back to the full document.
  • Add headless Chrome fetching via chromiumoxide, plus --allowed-domains domain filtering support.
  • Route logs/headers to stderr while keeping Markdown-only output on stdout.
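
The stdout/stderr split in the last bullet can be sketched with two generic writers, which keeps the routing testable without touching the real streams. `emit_page` and the `--- name ---` header format are hypothetical; the PR text does not show how mq-crawler formats its filename header.

```rust
use std::io::{self, Write};

/// Writes the page's Markdown to `out` and its filename header to `err`.
/// In a binary, `out` would be stdout and `err` would be stderr.
fn emit_page<O: Write, E: Write>(
    out: &mut O,
    err: &mut E,
    filename: &str,
    markdown: &str,
) -> io::Result<()> {
    // Diagnostics and headers go to the error stream (header format is assumed).
    writeln!(err, "--- {filename} ---")?;
    // Only Markdown reaches the output stream, terminated by a newline.
    out.write_all(markdown.as_bytes())?;
    if !markdown.ends_with('\n') {
        out.write_all(b"\n")?;
    }
    Ok(())
}
```

Called with `io::stdout().lock()` and `io::stderr().lock()`, a downstream pipe receives only Markdown while the headers remain visible in the terminal.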

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| docs/books/src/start/crawler.md | Expands user documentation (options, examples) for new crawler capabilities. |
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Updates expectations for title handling and adds coverage for smart extraction/noisy-element skipping. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips nav/aside/noscript during conversion; treats title as metadata. |
| crates/mq-markdown/src/html_to_markdown.rs | Adds “smart extraction” and top-level <title>→H1 support independent of selected content subtree. |
| crates/mq-crawler/src/main.rs | Adds CLI flags for headless Chrome and allowed domains; switches tracing output to stderr. |
| crates/mq-crawler/src/http_client.rs | Introduces a Chromium-based client and a multi-domain-tuned reqwest client builder. |
| crates/mq-crawler/src/crawler.rs | Adds allowed-domain filtering, per-domain rate limiting, new parallel scheduling, and stderr filename headers. |
| crates/mq-crawler/README.md | Documents new features/flags and updates CLI option descriptions. |
| crates/mq-crawler/Cargo.toml | Adds dependencies/features needed for Chromium mode and rustls unification. |
| Cargo.toml | Adds workspace deps for chromiumoxide and rustls configuration. |
| Cargo.lock | Locks new dependency graph additions (Chromium + transitive deps). |

Copilot AI review requested due to automatic review settings March 18, 2026 10:53

Copilot AI left a comment

Pull request overview

This PR improves mq-crawler’s crawl output quality and JS-site support by enhancing HTML→Markdown extraction and adding a built-in headless Chrome fetch path, while also adjusting logging/output streams for better piping behavior.

Changes:

  • Improve HTML→Markdown conversion by skipping <nav>/<aside>/<noscript> and extracting content from <main> / [role="main"] / <article> first.
  • Add built-in headless Chrome (CDP via chromiumoxide) plus domain allowlisting and per-domain rate limiting improvements.
  • Update docs/README to reflect new CLI options and output behavior.

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| docs/books/src/start/crawler.md | Updates crawler docs, adds expanded options/examples for headless/WebDriver/domain filtering. |
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Adjusts expectations and adds tests for smart extraction and noisy element skipping. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips noisy tags during conversion and adjusts title handling. |
| crates/mq-markdown/src/html_to_markdown.rs | Implements smart extraction + extracts <title> separately for optional H1 prefix. |
| crates/mq-crawler/src/main.rs | Adds CLI flags for headless Chrome and allowed domains; routes tracing to stderr. |
| crates/mq-crawler/src/http_client.rs | Adds a Chromium-backed HTTP client using chromiumoxide. |
| crates/mq-crawler/src/crawler.rs | Adds allowed-domain filtering, per-domain rate limiting, new parallel runner structure, and stderr filename headers. |
| crates/mq-crawler/README.md | Documents new features and CLI flags (domain filtering, headless Chrome). |
| crates/mq-crawler/Cargo.toml | Adds chromiumoxide dependency and adjusts HTTP/WebDriver dependency features. |
| Cargo.toml | Adds workspace dependencies for Chromium support and bumps reqwest. |
| Cargo.lock | Locks new dependencies and updated transitive graph. |

@harehare harehare merged commit 8acb567 into main Mar 18, 2026
4 checks passed
@harehare harehare deleted the fix/html-to-markdown-content-extraction branch March 18, 2026 12:27

2 participants