- Add scripts/translation-dictionary.ts with 50+ Swedish parliamentary terms translated into all 13 non-Swedish target languages (en, da, no, fi, de, fr, es, nl, ar, he, ja, ko, zh). Exports translateTerm (exact match), translatePhrase (prefix + exact match), and DICTIONARIES.
- Add translateSwedishContent(html, lang) in generate-news-enhanced.ts, which processes all <span data-translate="true" lang="sv"> spans: for 'sv' it retains the text and removes the marker; for other languages it translates via the dictionary, then removes the marker. Handles both attribute orderings.
- Apply post-processing in writeSingleArticle (week-ahead, committee-reports, propositions, motions) and via the new writeArticleWithTranslation wrapper passed to the monthly-review, weekly-review, month-ahead, and breaking-news sub-generators.
- Add tests/translation-dictionary.test.ts with 25 unit tests.
Co-authored-by: pethers <[email protected]>
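The exact-match and prefix-match lookups described in this commit could look roughly like the sketch below. The names translateTerm, translatePhrase, and DICTIONARIES come from the commit message; the signatures, the two-language subset, and the sample entries are assumptions made for illustration, not the contents of the real module.

```typescript
// Hypothetical miniature of scripts/translation-dictionary.ts.
// Only two target languages and a handful of illustrative entries are shown.
type TargetLang = 'en' | 'de';

const DICTIONARIES: Record<TargetLang, Record<string, string>> = {
  en: {
    'Finansutskottet': 'Committee on Finance',
    'med anledning av': 'in response to',
    'med anledning av prop.': 'in response to Govt. Bill',
  },
  de: {
    'Finansutskottet': 'Finanzausschuss',
  },
};

// Exact match: returns undefined when the term is not in the dictionary.
function translateTerm(term: string, lang: TargetLang): string | undefined {
  return DICTIONARIES[lang][term.trim()];
}

// Exact match first, then the longest dictionary key that prefixes the phrase.
// The untranslated remainder (document numbers, names) is kept verbatim.
function translatePhrase(phrase: string, lang: TargetLang): string {
  const text = phrase.trim();
  const exact = translateTerm(text, lang);
  if (exact !== undefined) return exact;

  const longestFirst = Object.entries(DICTIONARIES[lang])
    .sort(([a], [b]) => b.length - a.length);
  for (const [sv, translated] of longestFirst) {
    if (text.startsWith(sv + ' ')) {
      return translated + text.slice(sv.length);
    }
  }
  return text; // no match: fall back to the original Swedish
}
```

Sorting keys longest-first means a more specific prefix wins over a shorter one that it contains, which is one plausible reading of "longest-prefix match" in the PR description.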
… titles
- #454: Fix motion author/party extraction when enrichDocumentsWithContent returns 'Unknown' sentinel. Guard changed from `!authorName` to `!authorName || authorName === UNKNOWN_SENTINEL` in both generateMotionsContent and generateEnhancedSummary. Added UNKNOWN_SENTINEL constant to centralise comparison.
- #462: Deduplicate motions with identical titles in generateMotionsContent using a Set<string>, eliminating repeated same-proposition entries.
- #456: Replace static 'Battle Lines / Policy Priorities' article titles with content-aware titles via extractTopicFromDocs() helper. Strips common Swedish boilerplate prefixes and appends first significant topic to committee-reports, propositions and motions titles.
- Tests: 4 new motion-quality tests in data-transformers.test.ts, 8 new extractTopicFromDocs tests in translation-dictionary.test.ts. Total: 1635 passing tests.
Co-authored-by: pethers <[email protected]>
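As a rough illustration of the #454 guard: the sentinel value 'Unknown' and the guard expression come from the commit message, while resolveAuthor and its fallback callback are hypothetical names standing in for whatever fallback parsing the real code performs.

```typescript
// Sentinel returned by the enrichment step when no author could be found
// (value taken from the commit message).
const UNKNOWN_SENTINEL = 'Unknown';

// Hypothetical helper: prefer the enriched author, otherwise try a fallback
// parser, and only then give up with the sentinel.
function resolveAuthor(
  enriched: string | null | undefined,
  parseFallback: () => string | undefined,
): string {
  // The old guard `!authorName` let the literal string 'Unknown' through,
  // because a non-empty string is truthy.
  if (enriched && enriched !== UNKNOWN_SENTINEL) return enriched;
  return parseFallback() ?? UNKNOWN_SENTINEL;
}
```

Centralising the comparison in one constant avoids the string literal drifting between the two call sites named in the commit.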
…n gate
#452: --require-mcp flag (default: true) is parsed from CLI args. getSharedClient() now throws a descriptive Error when the MCP warm-up fails and requireMcpArg is true, aborting the pipeline before any articles are generated from empty data. --require-mcp=false restores previous degraded-mode behaviour for local testing.
#468: validateArticleQuality(html, lang, type, hint) exported function scores articles 0-100:
- Word count ≤ 50 pts (proportional at 300-word target)
- Analytical h2 sections ≤ 30 pts (proportional at 3-section target)
- Each untranslated span costs 2 pts, capped at 20 pts total
Scores tracked in stats.qualityScores across the run. End-of-run quality summary printed to console. Process exits with code 2 if ALL generated articles score below QUALITY_THRESHOLD (40).
CodeQL: replaced two-phase style/script strip (which triggered js/bad-tag-filter + js/incomplete-multi-character-sanitization) with a single-pass tag stripper, appropriate since HTML is always from our own pipeline, never user input.
Tests: 7 new unit tests in generate-news-enhanced-part2.test.ts covering empty HTML, rich articles, proportional section scoring, translation penalty (including sv exemption), and QUALITY_THRESHOLD export.
Total: 1642 passing, 0 CodeQL alerts.
Co-authored-by: pethers <[email protected]>
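The scoring rules in #468 can be sketched as follows. The point caps, targets, and threshold are taken from the commit message; everything else is an assumption. In particular, this sketch assumes the translation component starts at 20 points and loses 2 per residual span (so the three components sum to 100), and the names scoreArticle and shouldFailRun are hypothetical rather than the real exports.

```typescript
const QUALITY_THRESHOLD = 40; // from the commit message
const WORD_TARGET = 300;      // full 50 pts at 300 words
const SECTION_TARGET = 3;     // full 30 pts at 3 <h2> sections

interface QualityReport {
  wordCount: number;
  h2Count: number;
  untranslatedSpans: number;
  qualityScore: number;
}

// Hypothetical stand-in for validateArticleQuality(html, lang, ...).
function scoreArticle(html: string, lang: string): QualityReport {
  // Single-pass tag strip; acceptable only because the HTML is pipeline-generated.
  const text = html.replace(/<[^>]*>/g, ' ');
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  const h2Count = (html.match(/<h2[\s>]/gi) ?? []).length;
  // Swedish output is exempt from the translation penalty.
  const untranslatedSpans = lang !== 'sv'
    ? (html.match(/<span[^>]*data-translate="true"[^>]*>/gi) ?? []).length
    : 0;

  const wordPts = Math.min(50, (wordCount * 50) / WORD_TARGET);
  const sectionPts = Math.min(30, (h2Count * 30) / SECTION_TARGET);
  // Assumption: 20-point translation component minus the capped penalty.
  const translationPts = 20 - Math.min(20, untranslatedSpans * 2);

  return {
    wordCount,
    h2Count,
    untranslatedSpans,
    qualityScore: Math.round(wordPts + sectionPts + translationPts),
  };
}

// The run fails (exit code 2 in the commit) only if every article is below
// the threshold. Treating an empty run as "not failed" is an assumption.
function shouldFailRun(scores: readonly number[]): boolean {
  return scores.length > 0 && scores.every(s => s < QUALITY_THRESHOLD);
}
```

Under these assumptions an article with 150 words, two h2 sections, and no residual spans scores 25 + 20 + 20 = 65.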
Pull request overview
This pull request addresses translation quality issues and implements quality gates for the news generation pipeline. It eliminates 131 article files containing untranslated Swedish data-translate spans by introducing a translation dictionary and post-processing layer. Additionally, it fixes motion article quality issues (duplicate entries, "Unknown" authors, generic titles) and adds MCP fail-fast validation with quality scoring.
Changes:
- Introduces comprehensive Swedish-to-multi-language translation dictionary covering 50+ parliamentary terms, committee names, and document prefixes
- Implements translation post-processing that runs before article writing, processing all `data-translate` spans
- Adds content-based article titles extracted from document topics instead of static boilerplate
- Implements quality validation scoring (0-100) based on word count, analytical sections, and translation completeness
- Adds `--require-mcp` flag (default true) to abort pipeline on MCP server failures, preventing empty placeholder articles
- Fixes motion author/party "Unknown" sentinel handling with fallback parsing and deduplication by title
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| scripts/translation-dictionary.ts | New module providing static translations for 50+ Swedish terms into 13 languages with exact-match and prefix-match logic |
| scripts/generate-news-enhanced.ts | Adds translateSwedishContent, extractTopicFromDocs, validateArticleQuality, requireMcpArg flag, quality gate exit codes, and integrates translation post-processing into article workflow |
| scripts/data-transformers.ts | Introduces UNKNOWN_SENTINEL constant, motion deduplication by title, and improved author/party fallback parsing |
| tests/translation-dictionary.test.ts | 25 tests covering translation dictionary exact/prefix matching and translateSwedishContent for all languages |
| tests/generate-news-enhanced-part2.test.ts | 7 tests for quality validation scoring, span penalties, proportional metrics, and threshold enforcement |
| tests/data-transformers.test.ts | 4 tests for Unknown sentinel handling and motion deduplication behavior |
scripts/data-transformers.ts
Outdated
| // Deduplicate: keep only the first motion per unique title. | ||
| // Motions referencing the same parent proposition (e.g. "med anledning av prop. X") | ||
| // share an identical title and would otherwise repeat in the article. | ||
| const seenTitles = new Set<string>(); | ||
| const motions = rawMotions.filter(m => { | ||
| const key = (m.titel || m.title || m.dok_id || '').trim().toLowerCase(); | ||
| if (!key || seenTitles.has(key)) return false; | ||
| seenTitles.add(key); |
The motion deduplication logic uses the entire title (lowercased and trimmed) as the deduplication key. However, different motions could legitimately have the same title if they address the same topic but from different perspectives or parties. The deduplication keeps only the first motion, potentially losing important alternative viewpoints.
Consider including additional fields in the deduplication key (e.g., author name or party) to only deduplicate truly identical entries while preserving distinct motions with the same title, or document why complete title-based deduplication is the desired behavior.
| // Deduplicate: keep only the first motion per unique title. | |
| // Motions referencing the same parent proposition (e.g. "med anledning av prop. X") | |
| // share an identical title and would otherwise repeat in the article. | |
| const seenTitles = new Set<string>(); | |
| const motions = rawMotions.filter(m => { | |
| const key = (m.titel || m.title || m.dok_id || '').trim().toLowerCase(); | |
| if (!key || seenTitles.has(key)) return false; | |
| seenTitles.add(key); | |
| // Deduplicate: keep only the first occurrence of each unique motion. | |
| // Use a composite key (dok_id, title, party, author) so that motions | |
| // from different parties/authors with the same title are preserved, | |
| // while true duplicates are collapsed. | |
| const seenMotions = new Set<string>(); | |
| const motions = rawMotions.filter(m => { | |
| const title = ((m as any).titel || (m as any).title || '').trim().toLowerCase(); | |
| const party = ((m as any).parti || (m as any).party || '').trim().toLowerCase(); | |
| const author = ((m as any).huvud_forfattare || (m as any).author || (m as any).intressent || '').trim().toLowerCase(); | |
| const dokId = ((m as any).dok_id || '').trim().toLowerCase(); | |
| const keyParts: string[] = []; | |
| if (dokId) keyParts.push(`id:${dokId}`); | |
| if (title) keyParts.push(`t:${title}`); | |
| if (party) keyParts.push(`p:${party}`); | |
| if (author) keyParts.push(`a:${author}`); | |
| // If we have no identifying information at all, do not deduplicate on this entry. | |
| if (keyParts.length === 0) { | |
| return true; | |
| } | |
| const key = keyParts.join('|'); | |
| if (seenMotions.has(key)) { | |
| return false; | |
| } | |
| seenMotions.add(key); |
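For comparison, a typed variant of the composite-key approach without `(m as any)` casts might look like this. The MotionLike interface is a hypothetical stand-in for the repository's RawDocument type, with field names copied from the suggestion above.

```typescript
// Hypothetical subset of the raw document shape; the real RawDocument type
// in the repository may differ.
interface MotionLike {
  dok_id?: string;
  titel?: string;
  title?: string;
  parti?: string;
  party?: string;
  huvud_forfattare?: string;
  author?: string;
}

function norm(value: string | undefined): string {
  return (value ?? '').trim().toLowerCase();
}

// Build "id:…|t:…|p:…|a:…" from whichever identifying fields are present.
function motionKey(m: MotionLike): string {
  const parts: string[] = [];
  const dokId = norm(m.dok_id);
  const title = norm(m.titel || m.title);
  const party = norm(m.parti || m.party);
  const author = norm(m.huvud_forfattare || m.author);
  if (dokId) parts.push(`id:${dokId}`);
  if (title) parts.push(`t:${title}`);
  if (party) parts.push(`p:${party}`);
  if (author) parts.push(`a:${author}`);
  return parts.join('|');
}

function dedupeMotions<T extends MotionLike>(rawMotions: readonly T[]): T[] {
  const seen = new Set<string>();
  return rawMotions.filter(m => {
    const key = motionKey(m);
    if (key === '') return true; // nothing to identify it by: never drop
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Because dok_id is part of the key, two entries collapse only when they also share the same dok_id (or both lack one), so this removes repeated records rather than merging distinct motions.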
scripts/generate-news-enhanced.ts
Outdated
| // Match <span> that has both data-translate="true" and lang="sv" in either attribute order. | ||
| // The inner text may contain HTML entities but no nested tags (escapeHtml is applied upstream). | ||
| return html.replace( | ||
| /<span\s+(?=[^>]*data-translate="true")(?=[^>]*lang="sv")[^>]*>([\s\S]*?)<\/span>/g, |
The regex pattern uses a non-greedy match ([\s\S]*?) for the span's inner content, but this could fail if the Swedish text contains nested HTML tags or special characters. While the comment on line 528 states "The inner text may contain HTML entities but no nested tags (escapeHtml is applied upstream)", if any code path creates spans with nested HTML, this regex will only capture content up to the first closing tag.
Consider adding validation or documentation to ensure that all code paths creating data-translate spans properly escape their content first, or update the regex to handle potential edge cases more robustly.
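A small sketch makes both properties of this pattern concrete: the two lookaheads accept either attribute order, and the lazy capture stops at the first closing tag. The regex is copied from the snippet above; translateSpans and its callback are hypothetical, and replacing the whole span with the translated text is a simplification of what the real function does with the wrapper.

```typescript
// Pattern copied from the reviewed code.
const TRANSLATABLE_SV_SPAN =
  /<span\s+(?=[^>]*data-translate="true")(?=[^>]*lang="sv")[^>]*>([\s\S]*?)<\/span>/g;

// Hypothetical helper: replace each matching span with translate(innerText).
function translateSpans(html: string, translate: (sv: string) => string): string {
  return html.replace(TRANSLATABLE_SV_SPAN, (_match, inner: string) => translate(inner));
}

const upper = (s: string): string => s.toUpperCase();

// Both attribute orders match, because each lookahead scans the whole tag.
const orderA = translateSpans('<span data-translate="true" lang="sv">hej</span>', upper);
const orderB = translateSpans('<span lang="sv" data-translate="true">hej</span>', upper);

// A span without the marker is left alone.
const plain = translateSpans('<span lang="sv">hej</span>', upper);

// Nested markup: the lazy group ends at the FIRST </span>, so the tail
// " c</span>" is left behind untranslated and the markup becomes unbalanced.
const nested = translateSpans(
  '<span data-translate="true" lang="sv">a <span>b</span> c</span>',
  upper,
);
```

The nested case shows why the escapeHtml invariant matters: the pattern is only safe while the inner text cannot contain a literal closing span tag.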
scripts/generate-news-enhanced.ts
Outdated
| const requireMcpArg: boolean = (() => { | ||
| const raw = args.find(arg => arg.startsWith('--require-mcp')); | ||
| if (raw === undefined) return true; // default: enforce | ||
| if (raw === '--require-mcp=false') return false; | ||
| return true; // --require-mcp or --require-mcp=true | ||
| })(); |
The requireMcpArg constant is not exported, although the repository memory fact "MCP fail-fast" states that "requireMcpArg is exported for testability." As a result, tests cannot verify the fail-fast behavior with and without the flag.
Add requireMcpArg to the exports list at the end of the file so tests can import and verify its value and behavior.
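One way to make the flag logic testable independently of the process arguments is to factor the IIFE into a pure function. The body below mirrors the snippet above; the name parseRequireMcp is hypothetical and does not appear in the PR.

```typescript
// Hypothetical pure version of the requireMcpArg IIFE: same rules, but the
// argument list is passed in, so tests need not touch the real CLI args.
function parseRequireMcp(args: readonly string[]): boolean {
  const raw = args.find(arg => arg.startsWith('--require-mcp'));
  if (raw === undefined) return true;              // default: enforce
  if (raw === '--require-mcp=false') return false; // explicit opt-out
  return true;                                     // --require-mcp or --require-mcp=true
}
```

Note that only the exact string `--require-mcp=false` disables the gate; any other argument starting with `--require-mcp`, such as `--require-mcp=0`, still enforces it under this logic.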
scripts/generate-news-enhanced.ts
Outdated
|
|
||
| // Count residual data-translate spans (should be 0 after post-processing) | ||
| const untranslatedSpans = lang !== 'sv' | ||
| ? (html.match(/data-translate="true"/g) ?? []).length |
The regex pattern /data-translate="true"/g on line 600 will match the attribute anywhere in the HTML, not just in span elements. If other elements (like div, p, etc.) ever have this attribute, they will incorrectly contribute to the untranslatedSpans count.
For consistency with the translateSwedishContent function which only processes span elements, consider using a more specific pattern like /<span[^>]*data-translate="true"[^>]*>/g to only count span elements.
| ? (html.match(/data-translate="true"/g) ?? []).length | |
| ? (html.match(/<span[^>]*data-translate="true"[^>]*>/gi) ?? []).length |
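The difference between the two patterns shows up as soon as a non-span element carries the attribute. This is a minimal sketch; countMatches and the sample HTML are illustrative only.

```typescript
// Original pattern: matches the attribute on any element.
const ANY_ELEMENT = /data-translate="true"/g;
// Suggested pattern: matches only opening <span> tags carrying the attribute.
const SPAN_ONLY = /<span[^>]*data-translate="true"[^>]*>/gi;

function countMatches(html: string, pattern: RegExp): number {
  return (html.match(pattern) ?? []).length;
}

// Illustrative input: one <div> and one <span> with the marker.
const sampleHtml =
  '<div data-translate="true">x</div>' +
  '<span data-translate="true" lang="sv">y</span>';

const anyCount = countMatches(sampleHtml, ANY_ELEMENT); // counts both elements
const spanCount = countMatches(sampleHtml, SPAN_ONLY);  // counts the span only
```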
scripts/generate-news-enhanced.ts
Outdated
| const report = validateArticleQuality(processedHtml, lang, typeHint, filename); | ||
| stats.qualityScores.push(report.qualityScore); | ||
| return writeArticle(processedHtml, filename); |
The quality report is logged before the writeArticle call, but if writeArticle fails (e.g., disk full, permissions), the quality score will still be tracked in stats.qualityScores. This means the final quality summary could include scores for articles that were never actually written.
Consider moving the stats.qualityScores.push() call after the writeArticle() call succeeds, or handling the case where writing fails.
scripts/generate-news-enhanced.ts
Outdated
| const report = validateArticleQuality(processedHtml, lang, typeHint, filename); | ||
| stats.qualityScores.push(report.qualityScore); | ||
| await writeArticle(processedHtml, filename); |
Same issue as writeArticleWithTranslation: the quality score is pushed to stats before writeArticle is called. If the write operation fails, the score is still recorded.
Consider moving line 676 (stats.qualityScores.push) after line 677 (await writeArticle) to ensure scores are only recorded for successfully written articles.
scripts/generate-news-enhanced.ts
Outdated
| if (!primary) return ''; | ||
|
|
The function returns empty string when primary is undefined (line 489), but this can never happen because the nullish coalescing operator on line 487 ensures primary is always set to docs[0] if the find returns undefined. This check is redundant.
The real edge case is when docs[0] itself is undefined (empty array), which is already handled on line 481. Consider removing the redundant check on line 489.
| if (!primary) return ''; |
|
@copilot apply changes based on the comments in this thread |
- data-transformers.ts: replace title-only motion dedup key with composite
key (dok_id + title + party + author) using typed RawDocument fields
(no (m as any) casts). Same-title motions from different parties/authors
are now correctly preserved as distinct entries.
- generate-news-enhanced.ts:
* export requireMcpArg for testability
* validateArticleQuality: use /<span[^>]*data-translate="true"[^>]*>/gi
instead of /data-translate="true"/g to count only span elements
* writeArticleWithTranslation + writeSingleArticle: push quality score
AFTER writeArticle() returns, only when write succeeded
* extractTopicFromDocs: remove unreachable `if (!primary) return ''`
* translateSwedishContent: add upstream escapeHtml invariant to JSDoc
- tests/data-transformers.test.ts: update #462 dedup test fixtures —
true-duplicate test uses identical dok_id/title/party/author; added
a second test verifying same-title motions from different parties are
preserved (count ≥ 2).
Co-authored-by: pethers <[email protected]>
Applied all six reviewer suggestions in commit
All 1643 tests pass, 0 CodeQL alerts.
… clone needed) String.prototype.replace resets lastIndex on a global regex before each call, so creating `new RegExp(source, flags)` on every invocation was unnecessary overhead. Use the named constant directly. Co-authored-by: pethers <[email protected]>
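The behaviour this commit relies on can be checked in isolation: for a global regex, String.prototype.replace starts from index 0 regardless of a stale lastIndex, whereas RegExp.prototype.test is stateful. The pattern and helper below are illustrative and unrelated to the project's actual constant.

```typescript
// Illustrative shared global regex, reused across calls like a named constant.
const BOLD = /<b>(.*?)<\/b>/g;

function unbold(html: string): string {
  // No `new RegExp(BOLD.source, BOLD.flags)` clone is needed here:
  // replace() resets lastIndex to 0 before matching a global regex.
  return html.replace(BOLD, '$1');
}

BOLD.lastIndex = 4; // simulate stale state left by some earlier exec()/test()
const firstPass = unbold('<b>x</b> <b>y</b>');
const indexAfter = BOLD.lastIndex;
const secondPass = unbold('<b>x</b> <b>y</b>');

// Contrast: test() on a global regex advances lastIndex between calls,
// so repeating the same call can flip from true to false.
const probe = /a/g;
const probeFirst = probe.test('a');
const probeSecond = probe.test('a');
```

The clone would only be needed if the shared constant were used with stateful methods such as test() or exec().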
🔍 Lighthouse Performance Audit
📥 Download full Lighthouse report Budget Compliance: Performance budgets enforced via |
scripts/generate-news-enhanced.ts
Outdated
| await writeArticle(translatedHtml, filename); | ||
| stats.qualityScores.push(qualityScore); |
The quality score is computed after translation but pushed to stats array after the file write. If writeArticle() throws an exception, the quality score won't be recorded. Consider moving stats.qualityScores.push(qualityScore) immediately after validation to ensure consistent statistics tracking regardless of write success.
| await writeArticle(translatedHtml, filename); | |
| stats.qualityScores.push(qualityScore); | |
| stats.qualityScores.push(qualityScore); | |
| await writeArticle(translatedHtml, filename); |
Fixed in e5dab26 — stats.qualityScores.push(qualityScore) now runs immediately after validateArticleQuality() and before await writeArticle(...), so quality statistics are recorded even if the write throws.
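The practical effect of the two orderings discussed in this thread can be seen in a simplified model. This is a synchronous sketch with hypothetical names (RunStats, Writer, recordThenWrite, writeThenRecord); the real writeArticle is awaited, but the ordering argument is the same.

```typescript
interface RunStats {
  qualityScores: number[];
}

// Hypothetical synchronous stand-in for writeArticle(html, filename).
type Writer = (html: string, filename: string) => void;

// Ordering adopted in e5dab26: the score is recorded even if the write throws.
function recordThenWrite(
  stats: RunStats, score: number, html: string, filename: string, write: Writer,
): void {
  stats.qualityScores.push(score);
  write(html, filename);
}

// Earlier ordering: the score is recorded only after a successful write.
function writeThenRecord(
  stats: RunStats, score: number, html: string, filename: string, write: Writer,
): void {
  write(html, filename);
  stats.qualityScores.push(score);
}
```

Which ordering is preferable depends on what the end-of-run summary should describe: every article that was generated and validated, or only the articles that reached disk.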
|
@copilot apply changes based on the comments in this thread |
Move stats.qualityScores.push(qualityScore) to immediately after validateArticleQuality() and before await writeArticle(), so quality statistics are recorded even if the file write throws. Co-authored-by: pethers <[email protected]>
…anslated data-translate spans Co-authored-by: pethers <[email protected]>
Signed-off-by: James Pether Sörling <[email protected]>
131 article files contained raw `<span data-translate="true" lang="sv">` spans rendering verbatim Swedish in German, French, Japanese, Arabic, and other language versions. This adds a static dictionary-based post-processing layer that resolves those spans at article-write time.

`scripts/translation-dictionary.ts` (new)
- `translateTerm` (exact match), `translatePhrase` (longest-prefix match), `DICTIONARIES` exported
- `TRANSLATABLE_SV_SPAN_REGEX`: named constant with JSDoc covering both attribute orderings and capture-group semantics
- `translateSwedishContent(html, targetLang: Language)`: single-pass span replacement preserving the `<span lang="sv">` wrapper (removes only the `data-translate` marker)

`scripts/generate-news-enhanced.ts`
- `translateSwedishContent` called inside `writeSingleArticle` before `validateArticleQuality`, so the quality score reflects translated content
- Score pushed to `stats.qualityScores` immediately after validation (before write), so statistics are captured even if `writeArticle` throws
- `requireMcpArg` exported for testability; duplicate export and missing comma removed from export list
- `translateSwedishContent` re-exported for downstream consumers

`tests/translation-dictionary.test.ts` (new)
- `sv` passthrough verifying `lang="sv"` is preserved and no over-translation occurs