… sidebars and JS
- Skip <nav>, <aside>, and <noscript> elements during conversion to avoid polluting Markdown output with navigation menus, sidebars, and JS fallback content
- Add smart content extraction: prefer <main> → [role="main"] → <article> before falling back to the full document
- Output Markdown only to stdout; move Filename header and tracing logs to stderr
- Add integration tests for noisy element skipping and smart content extraction
Pull request overview
This PR improves mq-crawler’s HTML→Markdown pipeline for real-world pages (sidebars/nav/JS), and adds crawler capabilities for JS-rendered pages and multi-domain crawling while keeping Markdown output clean for piping.
Changes:
- Enhance HTML→Markdown conversion by skipping noisy elements (`nav/aside/noscript`) and extracting content from `main/[role="main"]/article` preferentially.
- Update crawler CLI + runtime to support headless Chrome (CDP via `chromiumoxide`), allowed-domain crawling, stderr-only logs/headers, and per-domain rate limiting.
- Add/extend tests for noisy element skipping, smart extraction, and allowed-domain filtering.
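As a rough illustration of the allowed-domain filtering mentioned above, here is a minimal sketch. The function name `is_allowed_host` and the decision to accept subdomains are assumptions made for this example; the PR's actual matching rules are not shown in this summary and may differ (for instance, exact host matching only).

```rust
/// Returns true when `host` equals an allowed domain or is a subdomain of one.
/// Assumption: subdomains are accepted; the real crawler may match hosts exactly.
fn is_allowed_host(host: &str, allowed: &[&str]) -> bool {
    // Normalize case and a trailing root dot before comparing.
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    allowed.iter().any(|domain| {
        let domain = domain.trim_end_matches('.').to_ascii_lowercase();
        // The leading dot prevents "notexample.com" from matching "example.com".
        host == domain || host.ends_with(&format!(".{domain}"))
    })
}

fn main() {
    let allowed = ["example.com", "docs.example.org"];
    for host in ["example.com", "blog.example.com", "notexample.com", "example.org"] {
        println!("{host}: {}", is_allowed_host(host, &allowed));
    }
}
```

Requiring the `.` separator before the suffix is the important detail: a plain suffix check would let look-alike hosts through.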
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Adds integration tests for skipping noisy elements and smart extraction. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips noisy tags during conversion; adds unit tests. |
| crates/mq-markdown/src/html_to_markdown.rs | Implements “smart extraction” of main content before conversion. |
| crates/mq-crawler/src/main.rs | Adds --headless, --chrome-path, --allowed-domains; routes logs to stderr; wires new domain options. |
| crates/mq-crawler/src/http_client.rs | Adds a Chromium HTTP client variant using chromiumoxide. |
| crates/mq-crawler/src/crawler.rs | Adds allowed-domain filtering, new parallel scheduler, per-domain rate limiting, and stdout/stderr separation. |
| crates/mq-crawler/Cargo.toml | Adds chromiumoxide and rustls, adjusts fantoccini features. |
| Cargo.toml | Adds workspace deps for chromiumoxide and rustls. |
| Cargo.lock | Locks new transitive dependencies introduced by Chromium support. |
… features and usage
… title and adjusting front matter handling
Pull request overview
This PR improves mq-crawler’s HTML→Markdown output quality for real-world pages (sidebars/nav/JS fallbacks) and enhances crawling capabilities for JS-heavy and multi-domain sites, while keeping stdout clean for piping.
Changes:
- Improve HTML extraction/conversion: smart content selection (`main→[role="main"]→article`), skip noisy elements (`nav/aside/noscript`), and handle `<title>` as optional H1 at the top level.
- Add headless Chrome (CDP via `chromiumoxide`) and domain filtering support; introduce per-domain rate limiting for parallel crawling.
- Route filename headers and tracing output to stderr so stdout contains only Markdown.
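The stdout/stderr split described above can be sketched with plain `std::io` writers. This is only an illustration: `emit_page` and the `--- name ---` header format are hypothetical, not taken from the crawler's source.

```rust
use std::io::{self, Write};

/// Writes the Markdown body to `out` and the filename header to `diag`.
/// The header format here is an assumption made for the example.
fn emit_page<O: Write, D: Write>(
    out: &mut O,
    diag: &mut D,
    filename: &str,
    markdown: &str,
) -> io::Result<()> {
    // Diagnostics go to the second writer so they never mix into the Markdown.
    writeln!(diag, "--- {filename} ---")?;
    writeln!(out, "{markdown}")?;
    Ok(())
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    let stderr = io::stderr();
    let mut out = stdout.lock();
    let mut diag = stderr.lock();
    emit_page(&mut out, &mut diag, "index.md", "# Hello")
}
```

Taking generic writers instead of calling `println!`/`eprintln!` directly keeps the routing testable with in-memory buffers, and a downstream pipe then sees only what was written to `out`.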
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| docs/books/src/start/crawler.md | Updates crawler docs (options, examples, install command, new features). |
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Updates/extends HTML→Markdown tests for new extraction and skipping behavior. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips nav/aside/noscript; moves title handling out of node conversion. |
| crates/mq-markdown/src/html_to_markdown.rs | Adds smart extraction and top-level title extraction for use_title_as_h1. |
| crates/mq-crawler/src/main.rs | Adds CLI flags (--headless, --chrome-path, --allowed-domains) and stderr logging. |
| crates/mq-crawler/src/http_client.rs | Adds Chromium client (headless Chrome) and multi-domain reqwest builder. |
| crates/mq-crawler/src/crawler.rs | Adds domain filtering, per-domain rate limiting, and new parallel scheduling loop. |
| crates/mq-crawler/README.md | Documents new flags/features and updates CLI help snippet. |
| crates/mq-crawler/Cargo.toml | Adds chromiumoxide/rustls deps and adjusts fantoccini TLS features. |
| Cargo.toml | Adds workspace deps for chromiumoxide and rustls. |
| Cargo.lock | Records new transitive dependencies (notably chromiumoxide + additional reqwest). |
… mechanism for task management
Pull request overview
This PR improves mq-crawler’s content quality and crawling capabilities by enhancing HTML→Markdown extraction (removing common “noise” sections and preferring main content) and adding a built-in headless Chrome mode for JavaScript-rendered pages, while keeping stdout clean for Markdown piping.
Changes:
- Improve HTML→Markdown conversion by skipping `<nav>/<aside>/<noscript>` and adding “smart extraction” (`<main>→[role="main"]→<article>`).
- Add built-in headless Chrome crawling (Chromiumoxide) plus domain filtering support.
- Route logs/filename headers to stderr and expand user docs/README to reflect new CLI behavior.
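The “smart extraction” priority order can be illustrated on a toy tree. The `Node` type, `find`, and `content_root` below are hypothetical stand-ins written for this sketch; the real implementation in `html_to_markdown.rs` works on a parsed HTML document and is not reproduced here.

```rust
/// Toy DOM node; a stand-in for whatever HTML parser the crate actually uses.
struct Node {
    tag: &'static str,
    role: Option<&'static str>,
    text: &'static str,
    children: Vec<Node>,
}

fn node(tag: &'static str, role: Option<&'static str>, text: &'static str, children: Vec<Node>) -> Node {
    Node { tag, role, text, children }
}

/// Depth-first search for the first node satisfying `pred`.
fn find<'a>(node: &'a Node, pred: fn(&Node) -> bool) -> Option<&'a Node> {
    if pred(node) {
        return Some(node);
    }
    node.children.iter().find_map(|child| find(child, pred))
}

/// Picks the content root in priority order:
/// <main>, then [role="main"], then <article>, else the whole document.
fn content_root(doc: &Node) -> &Node {
    find(doc, |n| n.tag == "main")
        .or_else(|| find(doc, |n| n.role == Some("main")))
        .or_else(|| find(doc, |n| n.tag == "article"))
        .unwrap_or(doc)
}

fn main() {
    let doc = node("html", None, "", vec![
        node("nav", None, "menu", vec![]),
        node("div", Some("main"), "page body", vec![node("article", None, "post", vec![])]),
    ]);
    let root = content_root(&doc);
    println!("selected <{}>: {}", root.tag, root.text);
}
```

Each selector is tried over the whole document before the next one is considered, so a `[role="main"]` container wins over an `<article>` nested inside it.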
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| docs/books/src/start/crawler.md | Updates crawler docs, options, and examples (now includes headless + domain filtering). |
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Updates expectations and adds tests for smart extraction / noisy element skipping. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips <nav>/<aside>/<noscript> and ignores <title> in body conversion. |
| crates/mq-markdown/src/html_to_markdown.rs | Adds smart content extraction and title handling for --use-title-as-h1. |
| crates/mq-crawler/src/main.rs | Adds CLI flags for headless Chrome + allowed domains; routes tracing to stderr. |
| crates/mq-crawler/src/http_client.rs | Implements Chromiumoxide-based headless Chrome client. |
| crates/mq-crawler/src/crawler.rs | Adds allowed-domain filtering, per-domain rate limiting, and stderr filename header output. |
| crates/mq-crawler/README.md | Documents new features/options (headless chrome + allowed domains) and revised CLI help. |
| crates/mq-crawler/Cargo.toml | Adds chromiumoxide + rustls dependencies and rustls-tls config for fantoccini. |
| Cargo.toml | Adds workspace dependencies for chromiumoxide and rustls. |
| Cargo.lock | Locks new dependencies pulled in by chromiumoxide/rustls changes. |
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Pull request overview
This PR enhances mq-crawler’s web extraction pipeline to produce cleaner Markdown from modern HTML pages (sidebars/nav/JS-heavy layouts), and adds a built-in headless Chrome fetching mode for dynamic content.
Changes:
- Improve HTML→Markdown conversion by skipping noisy elements (`nav`, `aside`, `noscript`) and by extracting content from `<main>/[role="main"]/<article>` before falling back to the full document.
- Add headless Chrome fetching via `chromiumoxide`, plus `--allowed-domains` domain filtering support.
- Route logs/headers to `stderr` while keeping Markdown-only output on `stdout`.
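To make the noisy-element skipping concrete, here is a small sketch on a toy tree. `Node`, `NOISY_TAGS`, and `visible_text` are names invented for this example; the converter in `converter.rs` operates on real parsed HTML and emits Markdown rather than a list of strings.

```rust
/// Toy DOM node used only for this illustration.
struct Node {
    tag: &'static str,
    text: &'static str,
    children: Vec<Node>,
}

/// Tags whose entire subtree is dropped during conversion.
const NOISY_TAGS: [&str; 3] = ["nav", "aside", "noscript"];

/// Collects text in document order, skipping noisy subtrees entirely.
fn render(node: &Node, out: &mut Vec<String>) {
    if NOISY_TAGS.contains(&node.tag) {
        return; // descendants of a noisy element are skipped as well
    }
    if !node.text.is_empty() {
        out.push(node.text.to_string());
    }
    for child in &node.children {
        render(child, out);
    }
}

fn visible_text(doc: &Node) -> Vec<String> {
    let mut out = Vec::new();
    render(doc, &mut out);
    out
}

fn main() {
    let doc = Node {
        tag: "body",
        text: "",
        children: vec![
            Node { tag: "nav", text: "Home | About", children: vec![] },
            Node { tag: "p", text: "Actual content", children: vec![] },
            Node { tag: "noscript", text: "Enable JavaScript", children: vec![] },
        ],
    };
    println!("{:?}", visible_text(&doc));
}
```

Returning early at the noisy element, rather than filtering text afterwards, is what keeps nested menu items and fallback markup out of the output.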
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| docs/books/src/start/crawler.md | Expands user documentation (options, examples) for new crawler capabilities. |
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Updates expectations for title handling and adds coverage for smart extraction/noisy-element skipping. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips nav/aside/noscript during conversion; treats title as metadata. |
| crates/mq-markdown/src/html_to_markdown.rs | Adds “smart extraction” and top-level <title>→H1 support independent of selected content subtree. |
| crates/mq-crawler/src/main.rs | Adds CLI flags for headless Chrome and allowed domains; switches tracing output to stderr. |
| crates/mq-crawler/src/http_client.rs | Introduces a Chromium-based client and a multi-domain-tuned reqwest client builder. |
| crates/mq-crawler/src/crawler.rs | Adds allowed-domain filtering, per-domain rate limiting, new parallel scheduling, and stderr filename headers. |
| crates/mq-crawler/README.md | Documents new features/flags and updates CLI option descriptions. |
| crates/mq-crawler/Cargo.toml | Adds dependencies/features needed for Chromium mode and rustls unification. |
| Cargo.toml | Adds workspace deps for chromiumoxide and rustls configuration. |
| Cargo.lock | Locks new dependency graph additions (Chromium + transitive deps). |
Pull request overview
This PR improves mq-crawler’s crawl output quality and JS-site support by enhancing HTML→Markdown extraction and adding a built-in headless Chrome fetch path, while also adjusting logging/output streams for better piping behavior.
Changes:
- Improve HTML→Markdown conversion by skipping `<nav>/<aside>/<noscript>` and extracting content from `<main>/[role="main"]/<article>` first.
- Add built-in headless Chrome (CDP via `chromiumoxide`) plus domain allowlisting and per-domain rate limiting improvements.
- Update docs/README to reflect new CLI options and output behavior.
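One possible shape for per-domain rate limiting is sketched below. `DomainLimiter` and `reserve` are hypothetical names, and the fixed minimum interval per domain is an assumption; the summary does not say how the scheduler in `crawler.rs` actually tracks or enforces delays.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks, per domain, the earliest instant at which the next request may start.
struct DomainLimiter {
    min_interval: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, next_allowed: HashMap::new() }
    }

    /// Reserves the next slot for `domain` and returns how long the caller
    /// should wait before sending. `now` is passed in to keep this deterministic.
    fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        let slot = match self.next_allowed.get(domain) {
            Some(&t) if t > now => t, // domain is still cooling down
            _ => now,                 // first request, or interval already elapsed
        };
        self.next_allowed.insert(domain.to_string(), slot + self.min_interval);
        slot.duration_since(now)
    }
}

fn main() {
    let mut limiter = DomainLimiter::new(Duration::from_millis(500));
    let now = Instant::now();
    for domain in ["a.example", "a.example", "b.example"] {
        println!("{domain}: wait {:?}", limiter.reserve(domain, now));
    }
}
```

Because the delay is keyed by domain, parallel workers hitting different hosts do not slow each other down, while repeated requests to one host are spaced out.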
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| docs/books/src/start/crawler.md | Updates crawler docs, adds expanded options/examples for headless/WebDriver/domain filtering. |
| crates/mq-markdown/tests/html_to_markdown_tests.rs | Adjusts expectations and adds tests for smart extraction and noisy element skipping. |
| crates/mq-markdown/src/html_to_markdown/converter.rs | Skips noisy tags during conversion and adjusts title handling. |
| crates/mq-markdown/src/html_to_markdown.rs | Implements smart extraction + extracts <title> separately for optional H1 prefix. |
| crates/mq-crawler/src/main.rs | Adds CLI flags for headless Chrome and allowed domains; routes tracing to stderr. |
| crates/mq-crawler/src/http_client.rs | Adds a Chromium-backed HTTP client using chromiumoxide. |
| crates/mq-crawler/src/crawler.rs | Adds allowed-domain filtering, per-domain rate limiting, new parallel runner structure, and stderr filename headers. |
| crates/mq-crawler/README.md | Documents new features and CLI flags (domain filtering, headless Chrome). |
| crates/mq-crawler/Cargo.toml | Adds chromiumoxide dependency and adjusts HTTP/WebDriver dependency features. |
| Cargo.toml | Adds workspace dependencies for Chromium support and bumps reqwest. |
| Cargo.lock | Locks new dependencies and updated transitive graph. |