🐛 fix(mq-crawler): improve HTML to Markdown conversion for pages with sidebars and JS by harehare · Pull Request #1469 · harehare/mq

harehare · 2026-03-18T06:10:27Z

Skip <nav>, <aside>, and <noscript> elements during conversion to avoid polluting Markdown output with navigation menus, sidebars, and JS fallback content
Add smart content extraction: prefer <main> → [role="main"] → <article> before falling back to the full document
Output Markdown only to stdout; move Filename header and tracing logs to stderr

… sidebars and JS

- Skip <nav>, <aside>, and <noscript> elements during conversion to avoid polluting Markdown output with navigation menus, sidebars, and JS fallback content
- Add smart content extraction: prefer <main> → [role="main"] → <article> before falling back to the full document
- Output Markdown only to stdout; move Filename header and tracing logs to stderr
- Add integration tests for noisy element skipping and smart content extraction
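The extraction order and the noisy-element skipping described above can be sketched roughly as follows. This is a minimal illustration, not the code in `html_to_markdown.rs`: the `Node` type, `select_content_root`, and `collect_text` are hypothetical names, and the tree is a simplified stand-in for a real parsed HTML document.

```rust
// Hypothetical, simplified DOM for illustration only; the crate's real
// HTML tree types are not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Element { tag: String, role: Option<String>, children: Vec<Node> },
    Text(String),
}

impl Node {
    pub fn element(tag: &str, role: Option<&str>, children: Vec<Node>) -> Node {
        Node::Element { tag: tag.to_string(), role: role.map(str::to_string), children }
    }

    pub fn text(content: &str) -> Node {
        Node::Text(content.to_string())
    }
}

/// Tags whose whole subtree is dropped during conversion.
const NOISY_TAGS: [&str; 3] = ["nav", "aside", "noscript"];

type Matcher = fn(&str, Option<&str>) -> bool;

fn is_main_tag(tag: &str, _role: Option<&str>) -> bool { tag == "main" }
fn has_main_role(_tag: &str, role: Option<&str>) -> bool { role == Some("main") }
fn is_article_tag(tag: &str, _role: Option<&str>) -> bool { tag == "article" }

/// Depth-first search for the first element accepted by `pred`.
fn find_first(node: &Node, pred: Matcher) -> Option<&Node> {
    match node {
        Node::Element { tag, role, children } => {
            if pred(tag.as_str(), role.as_deref()) {
                Some(node)
            } else {
                children.iter().find_map(|child| find_first(child, pred))
            }
        }
        Node::Text(_) => None,
    }
}

/// Pick the content root: <main>, then [role="main"], then <article>,
/// falling back to the whole document when none is present.
pub fn select_content_root(doc: &Node) -> &Node {
    find_first(doc, is_main_tag)
        .or_else(|| find_first(doc, has_main_role))
        .or_else(|| find_first(doc, is_article_tag))
        .unwrap_or(doc)
}

/// Collect trimmed text below `node`, skipping noisy subtrees entirely.
pub fn collect_text(node: &Node, out: &mut Vec<String>) {
    match node {
        Node::Text(text) => {
            let trimmed = text.trim();
            if !trimmed.is_empty() {
                out.push(trimmed.to_string());
            }
        }
        Node::Element { tag, children, .. } => {
            if NOISY_TAGS.contains(&tag.as_str()) {
                return;
            }
            for child in children {
                collect_text(child, out);
            }
        }
    }
}
```

Each candidate is searched across the whole document before the next one is tried, so a `<main>` that appears late in the page still wins over an earlier `<article>`. Whether the real implementation searches in exactly this way is an assumption.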

Copilot

Pull request overview

This PR improves mq-crawler’s HTML→Markdown pipeline for real-world pages (sidebars/nav/JS), and adds crawler capabilities for JS-rendered pages and multi-domain crawling while keeping Markdown output clean for piping.

Changes:

Enhance HTML→Markdown conversion by skipping noisy elements (nav/aside/noscript) and extracting content from main / [role="main"] / article preferentially.
Update crawler CLI + runtime to support headless Chrome (CDP via chromiumoxide), allowed-domain crawling, stderr-only logs/headers, and per-domain rate limiting.
Add/extend tests for noisy element skipping, smart extraction, and allowed-domain filtering.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.

File	Description
crates/mq-markdown/tests/html_to_markdown_tests.rs	Adds integration tests for skipping noisy elements and smart extraction.
crates/mq-markdown/src/html_to_markdown/converter.rs	Skips noisy tags during conversion; adds unit tests.
crates/mq-markdown/src/html_to_markdown.rs	Implements “smart extraction” of main content before conversion.
crates/mq-crawler/src/main.rs	Adds `--headless`, `--chrome-path`, `--allowed-domains`; routes logs to stderr; wires new domain options.
crates/mq-crawler/src/http_client.rs	Adds a `Chromium` HTTP client variant using `chromiumoxide`.
crates/mq-crawler/src/crawler.rs	Adds allowed-domain filtering, new parallel scheduler, per-domain rate limiting, and stdout/stderr separation.
crates/mq-crawler/Cargo.toml	Adds `chromiumoxide` and `rustls`, adjusts `fantoccini` features.
Cargo.toml	Adds workspace deps for `chromiumoxide` and `rustls`.
Cargo.lock	Locks new transitive dependencies introduced by Chromium support.

Copilot

Pull request overview

This PR improves mq-crawler’s HTML→Markdown output quality for real-world pages (sidebars/nav/JS fallbacks) and enhances crawling capabilities for JS-heavy and multi-domain sites, while keeping stdout clean for piping.

Changes:

Improve HTML extraction/conversion: smart content selection (main → [role="main"] → article), skip noisy elements (nav/aside/noscript), and handle <title> as optional H1 at the top level.
Add headless Chrome (CDP via chromiumoxide) and domain filtering support; introduce per-domain rate limiting for parallel crawling.
Route filename headers and tracing output to stderr so stdout contains only Markdown.
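As a rough illustration of the optional title handling mentioned above, the sketch below prefixes already converted Markdown with the page title as an H1. The function name `with_title_h1` is hypothetical; only the `use_title_as_h1` option name comes from the file summary in this review, and the exact output format is an assumption.

```rust
/// Prefix `body` with the document title as an H1 when requested.
///
/// `title` is assumed to have been read from the top-level document,
/// independently of whichever content subtree was selected.
pub fn with_title_h1(title: Option<&str>, body: &str, use_title_as_h1: bool) -> String {
    // Treat a missing or whitespace-only title as "no title".
    let title = title.map(str::trim).filter(|t| !t.is_empty());
    match title {
        Some(t) if use_title_as_h1 => format!("# {}\n\n{}", t, body),
        _ => body.to_string(),
    }
}
```

Keeping the title outside node conversion means it is still available when smart extraction narrows the document to `<main>` or `<article>`, which would otherwise leave `<head>` out of the selected subtree.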

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 10 comments.

File	Description
docs/books/src/start/crawler.md	Updates crawler docs (options, examples, install command, new features).
crates/mq-markdown/tests/html_to_markdown_tests.rs	Updates/extends HTML→Markdown tests for new extraction and skipping behavior.
crates/mq-markdown/src/html_to_markdown/converter.rs	Skips `nav`/`aside`/`noscript`; moves `title` handling out of node conversion.
crates/mq-markdown/src/html_to_markdown.rs	Adds smart extraction and top-level title extraction for `use_title_as_h1`.
crates/mq-crawler/src/main.rs	Adds CLI flags (`--headless`, `--chrome-path`, `--allowed-domains`) and stderr logging.
crates/mq-crawler/src/http_client.rs	Adds `Chromium` client (headless Chrome) and multi-domain reqwest builder.
crates/mq-crawler/src/crawler.rs	Adds domain filtering, per-domain rate limiting, and new parallel scheduling loop.
crates/mq-crawler/README.md	Documents new flags/features and updates CLI help snippet.
crates/mq-crawler/Cargo.toml	Adds `chromiumoxide`/`rustls` deps and adjusts fantoccini TLS features.
Cargo.toml	Adds workspace deps for `chromiumoxide` and `rustls`.
Cargo.lock	Records new transitive dependencies (notably chromiumoxide + additional reqwest).

Copilot

Pull request overview

This PR improves mq-crawler’s content quality and crawling capabilities by enhancing HTML→Markdown extraction (removing common “noise” sections and preferring main content) and adding a built-in headless Chrome mode for JavaScript-rendered pages, while keeping stdout clean for Markdown piping.

Changes:

Improve HTML→Markdown conversion by skipping <nav>/<aside>/<noscript> and adding “smart extraction” (<main> → [role="main"] → <article>).
Add built-in headless Chrome crawling (Chromiumoxide) plus domain filtering support.
Route logs/filename headers to stderr and expand user docs/README to reflect new CLI behavior.
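The domain filtering mentioned above could look something like the sketch below. It is an assumption-laden illustration: `DomainFilter` is a hypothetical type, hosts are compared by exact match after lowercasing (a later commit in this PR mentions normalizing allowed domains and the start URL domain to lowercase), and how the real crawler treats subdomains is not stated on this page.

```rust
use std::collections::HashSet;

/// Hypothetical allowlist of hosts the crawler may visit.
pub struct DomainFilter {
    allowed: HashSet<String>,
}

impl DomainFilter {
    /// Build the allowlist from the start URL's domain plus any extra
    /// domains, normalizing every entry to lowercase.
    pub fn new(start_domain: &str, allowed_domains: &[&str]) -> Self {
        let allowed = std::iter::once(start_domain)
            .chain(allowed_domains.iter().copied())
            .map(|domain| domain.trim().to_ascii_lowercase())
            .filter(|domain| !domain.is_empty())
            .collect();
        Self { allowed }
    }

    /// Exact, case-insensitive host match (subdomains are NOT implied here).
    pub fn allows(&self, host: &str) -> bool {
        self.allowed.contains(&host.to_ascii_lowercase())
    }
}
```

Normalizing once at construction keeps the per-link check to a single set lookup.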

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.

File	Description
docs/books/src/start/crawler.md	Updates crawler docs, options, and examples (now includes headless + domain filtering).
crates/mq-markdown/tests/html_to_markdown_tests.rs	Updates expectations and adds tests for smart extraction / noisy element skipping.
crates/mq-markdown/src/html_to_markdown/converter.rs	Skips `<nav>/<aside>/<noscript>` and ignores `<title>` in body conversion.
crates/mq-markdown/src/html_to_markdown.rs	Adds smart content extraction and title handling for `--use-title-as-h1`.
crates/mq-crawler/src/main.rs	Adds CLI flags for headless Chrome + allowed domains; routes tracing to stderr.
crates/mq-crawler/src/http_client.rs	Implements Chromiumoxide-based headless Chrome client.
crates/mq-crawler/src/crawler.rs	Adds allowed-domain filtering, per-domain rate limiting, and stderr filename header output.
crates/mq-crawler/README.md	Documents new features/options (headless chrome + allowed domains) and revised CLI help.
crates/mq-crawler/Cargo.toml	Adds chromiumoxide + rustls dependencies and rustls-tls config for fantoccini.
Cargo.toml	Adds workspace dependencies for chromiumoxide and rustls.
Cargo.lock	Locks new dependencies pulled in by chromiumoxide/rustls changes.

Copilot

Pull request overview

This PR enhances mq-crawler’s web extraction pipeline to produce cleaner Markdown from modern HTML pages (sidebars/nav/JS-heavy layouts), and adds a built-in headless Chrome fetching mode for dynamic content.

Changes:

Improve HTML→Markdown conversion by skipping noisy elements (nav, aside, noscript) and by extracting content from <main> / [role="main"] / <article> before falling back to the full document.
Add headless Chrome fetching via chromiumoxide, plus --allowed-domains domain filtering support.
Route logs/headers to stderr while keeping Markdown-only output on stdout.
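The stdout/stderr split can be illustrated with a small writer-based sketch. The function name `emit_page` and the `Filename: …` header format are assumptions made for the example; the PR only states that the filename header and tracing logs go to stderr while Markdown goes to stdout.

```rust
use std::io::{self, Write};

/// Write the Markdown body to `out` and the filename header to `err`.
///
/// In a CLI, `out` would be stdout and `err` stderr, so that piping
/// stdout into another tool yields Markdown only.
pub fn emit_page<O: Write, E: Write>(
    out: &mut O,
    err: &mut E,
    filename: &str,
    markdown: &str,
) -> io::Result<()> {
    // Diagnostic header goes to the error stream (assumed format).
    writeln!(err, "Filename: {}", filename)?;
    // Content goes to the output stream, terminated by a newline.
    out.write_all(markdown.as_bytes())?;
    if !markdown.ends_with('\n') {
        out.write_all(b"\n")?;
    }
    Ok(())
}
```

A caller could pass `&mut io::stdout().lock()` and `&mut io::stderr().lock()`; taking generic writers also makes the behaviour easy to check against in-memory buffers.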

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 7 comments.

File	Description
docs/books/src/start/crawler.md	Expands user documentation (options, examples) for new crawler capabilities.
crates/mq-markdown/tests/html_to_markdown_tests.rs	Updates expectations for title handling and adds coverage for smart extraction/noisy-element skipping.
crates/mq-markdown/src/html_to_markdown/converter.rs	Skips `nav`/`aside`/`noscript` during conversion; treats `title` as metadata.
crates/mq-markdown/src/html_to_markdown.rs	Adds “smart extraction” and top-level `<title>`→H1 support independent of selected content subtree.
crates/mq-crawler/src/main.rs	Adds CLI flags for headless Chrome and allowed domains; switches tracing output to `stderr`.
crates/mq-crawler/src/http_client.rs	Introduces a Chromium-based client and a multi-domain-tuned reqwest client builder.
crates/mq-crawler/src/crawler.rs	Adds allowed-domain filtering, per-domain rate limiting, new parallel scheduling, and stderr filename headers.
crates/mq-crawler/README.md	Documents new features/flags and updates CLI option descriptions.
crates/mq-crawler/Cargo.toml	Adds dependencies/features needed for Chromium mode and rustls unification.
Cargo.toml	Adds workspace deps for `chromiumoxide` and `rustls` configuration.
Cargo.lock	Locks new dependency graph additions (Chromium + transitive deps).

Copilot

Pull request overview

This PR improves mq-crawler’s crawl output quality and JS-site support by enhancing HTML→Markdown extraction and adding a built-in headless Chrome fetch path, while also adjusting logging/output streams for better piping behavior.

Changes:

Improve HTML→Markdown conversion by skipping <nav>/<aside>/<noscript> and extracting content from <main> / [role="main"] / <article> first.
Add built-in headless Chrome (CDP via chromiumoxide) plus domain allowlisting and per-domain rate limiting improvements.
Update docs/README to reflect new CLI options and output behavior.
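One way to picture per-domain rate limiting is a table of "next allowed request time" keyed by domain. The sketch below is a hypothetical illustration of that idea, not the scheduler in `crawler.rs`: `DomainRateLimiter` and `reserve` are invented names, and the current time is passed in explicitly so the logic stays deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical limiter enforcing a minimum interval between requests
/// to the same domain, while leaving other domains unaffected.
pub struct DomainRateLimiter {
    min_interval: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainRateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, next_allowed: HashMap::new() }
    }

    /// Reserve the next request slot for `domain` and return how long
    /// the caller should wait (from `now`) before sending the request.
    pub fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        let key = domain.to_ascii_lowercase();
        // The slot is either "right now" or the earliest free time.
        let slot = match self.next_allowed.get(&key) {
            Some(&next) if next > now => next,
            _ => now,
        };
        self.next_allowed.insert(key, slot + self.min_interval);
        slot.saturating_duration_since(now)
    }
}
```

In an async crawler the returned duration would typically be handed to a sleep before the fetch; requests to different domains reserve independent slots, so parallel workers are only delayed when they target the same host.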

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 9 comments.

File	Description
docs/books/src/start/crawler.md	Updates crawler docs, adds expanded options/examples for headless/WebDriver/domain filtering.
crates/mq-markdown/tests/html_to_markdown_tests.rs	Adjusts expectations and adds tests for smart extraction and noisy element skipping.
crates/mq-markdown/src/html_to_markdown/converter.rs	Skips noisy tags during conversion and adjusts title handling.
crates/mq-markdown/src/html_to_markdown.rs	Implements smart extraction + extracts `<title>` separately for optional H1 prefix.
crates/mq-crawler/src/main.rs	Adds CLI flags for headless Chrome and allowed domains; routes tracing to stderr.
crates/mq-crawler/src/http_client.rs	Adds a Chromium-backed HTTP client using `chromiumoxide`.
crates/mq-crawler/src/crawler.rs	Adds allowed-domain filtering, per-domain rate limiting, new parallel runner structure, and stderr filename headers.
crates/mq-crawler/README.md	Documents new features and CLI flags (domain filtering, headless Chrome).
crates/mq-crawler/Cargo.toml	Adds `chromiumoxide` dependency and adjusts HTTP/WebDriver dependency features.
Cargo.toml	Adds workspace dependencies for Chromium support and bumps `reqwest`.
Cargo.lock	Locks new dependencies and updated transitive graph.

Copilot AI review requested due to automatic review settings March 18, 2026 06:10

Copilot started reviewing on behalf of harehare March 18, 2026 06:10

Copilot AI reviewed Mar 18, 2026

harehare added 2 commits March 18, 2026 15:26

📝 docs: update README and crawler documentation to enhance clarity on features and usage

44423b5

🐛 fix(mq-markdown): enhance HTML to Markdown conversion by extracting title and adjusting front matter handling

1a01fdb

Copilot AI review requested due to automatic review settings March 18, 2026 09:10

Copilot started reviewing on behalf of harehare March 18, 2026 09:11

Copilot AI reviewed Mar 18, 2026

harehare added 2 commits March 18, 2026 18:21

🐛 fix(mq-crawler): enhance Crawler concurrency by adding notification mechanism for task management

5a222f4

🐛 fix(mq-markdown): simplify title extraction logic in HTML to Markdown conversion

74d446d

Copilot AI review requested due to automatic review settings March 18, 2026 09:21

Copilot started reviewing on behalf of harehare March 18, 2026 09:21

Copilot AI reviewed Mar 18, 2026

Potential fix for pull request finding

b93c385

Co-authored-by: Copilot Autofix powered by AI <[email protected]>

Copilot AI review requested due to automatic review settings March 18, 2026 09:39

Copilot started reviewing on behalf of harehare March 18, 2026 09:39

Copilot AI reviewed Mar 18, 2026

harehare added 2 commits March 18, 2026 19:49

🐛 fix(mq-crawler): update reqwest version and remove rustls dependency to resolve conflicts

1f90421

🐛 fix(mq-crawler): normalize allowed domains and start URL domain to lowercase

4194a49

Copilot AI review requested due to automatic review settings March 18, 2026 10:53

Copilot started reviewing on behalf of harehare March 18, 2026 10:53

Copilot AI reviewed Mar 18, 2026

🐛 fix(http-client): remove unnecessary no-sandbox argument from Chromium config

53aa48a

harehare merged commit 8acb567 into main Mar 18, 2026
4 checks passed

harehare deleted the fix/html-to-markdown-content-extraction branch March 18, 2026 12:27

BrewTestBot mentioned this pull request Mar 18, 2026

mq 0.5.20 Homebrew/homebrew-core#272991

Merged
