- Add extractPolicyDomains() (private) to data-transformers.ts
- Add generateContentTitle() (exported) to data-transformers.ts
- Wire generateContentTitle into getTitles() in motions.ts, propositions.ts, and committee-reports.ts with static fallback
- Add 35 unit tests in tests/data-transformers.test.ts covering all 14 languages, all 3 article types, fallback, and frequency ranking

Co-authored-by: pethers <[email protected]>
Copilot AI changed the title from "[WIP] Update article titles to be content-aware and dynamic" to "feat: generate content-based article titles from top policy domains" on Feb 23, 2026
…ouping

- Fix #454: Skip 'Unknown' sentinel in generateMotionsContent and generateEnhancedSummary; fall back to parseMotionAuthorParty on motion.titel/rubrik before displaying 'Unknown'. Also treat 'Unknown' parti as ungrouped in the party-breakdown section.
- Fix #458: Add svSpan(text, lang) helper: emits lang="sv" only for non-Swedish articles (no data-translate="true" marker), preventing validation failures in the 131 non-Swedish articles. Swedish articles still carry the data-translate marker. Replace all 14 call sites. Add undertitel field to RawDocument interface.
- Fix #462: Add extractPropRef() + groupMotionsByProposition() helpers. generateMotionsContent now groups motions by parent proposition under a 'Responses to Government Propositions' section header, eliminating repeated duplication of the same prop title.
- Update 6 existing tests and add 10 new tests (1630 total, all green)

Co-authored-by: pethers <[email protected]>
Copilot AI changed the title from "feat: generate content-based article titles from top policy domains" to "fix: Unknown author sentinel, data-translate span leakage, motion proposition grouping" on Feb 23, 2026
Pull request overview
This pull request fixes three content-quality bugs in the article generation pipeline that were causing poor quality output across all 14 language versions:
Changes:
- Fixed #454: Unknown author sentinel bypass where 'Unknown' string sentinels from enrichDocumentsWithContent were treated as valid data, preventing fallback author/party parsing
- Fixed #458: Eliminated data-translate="true" spans from non-Swedish articles by introducing an svSpan() helper that conditionally adds the marker only for Swedish articles while preserving the lang="sv" accessibility attribute for all languages
- Fixed #462: Grouped motions by parent proposition reference to eliminate repetitive section headers where multiple motions responding to the same proposition each rendered separate h3 headers
- Added #456: Implemented generateContentTitle() for content-based article titles derived from policy domain keywords, replacing generic static titles with dynamic summaries of actual document themes
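
The #454 guard in the list above can be illustrated with a minimal sketch. The field names and the `'Unknown'` sentinel come from this PR; `MotionLike`, `isMissing`, `parseAuthorParty`, and `resolveAuthor` are hypothetical stand-ins, and the "av Name (P)" pattern is only an assumption about what the raw title text looks like, not the real parseMotionAuthorParty logic.

```typescript
// Sentinel written by the enrichment step on failure (value taken from the PR text).
const UNKNOWN = 'Unknown';

// Hypothetical, trimmed-down document shape; the real RawDocument has more fields.
interface MotionLike {
  intressent_namn?: string;
  parti?: string;
  titel?: string;
  rubrik?: string;
}

// True when a field is absent, empty, or carries the sentinel.
function isMissing(value: string | undefined): boolean {
  return !value || value === UNKNOWN;
}

// Hypothetical stand-in for parseMotionAuthorParty: reads "av Name (P)" from raw text.
function parseAuthorParty(text: string): { author: string; party: string } | null {
  const m = /av\s+(.+?)\s+\(([^)]+)\)/.exec(text);
  return m ? { author: m[1]!, party: m[2]! } : null;
}

// Strip sentinels first, then fall back to parsing the raw title text.
function resolveAuthor(motion: MotionLike): { author: string; party: string } {
  let author = isMissing(motion.intressent_namn) ? undefined : motion.intressent_namn;
  let party = isMissing(motion.parti) ? undefined : motion.parti;
  if (!author || !party) {
    const parsed = parseAuthorParty(`${motion.titel ?? ''} ${motion.rubrik ?? ''}`);
    author = author ?? parsed?.author;
    party = party ?? parsed?.party;
  }
  // Only display the sentinel when every source has failed.
  return { author: author ?? UNKNOWN, party: party ?? UNKNOWN };
}
```

The point of the sketch is the ordering: a plain `!authorName` check lets the truthy sentinel through, so the sentinel has to be treated as missing before the fallback parse is attempted.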
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/data-transformers.ts | Core implementation: svSpan() helper (14 call sites updated), Unknown sentinel checks in generateMotionsContent/generateEnhancedSummary, extractPropRef()/groupMotionsByProposition() functions, generateContentTitle() with policy domain extraction and 14-language translation tables |
| scripts/news-types/motions.ts | Integrated generateContentTitle() into getTitles() with fallback to static titles |
| scripts/news-types/propositions.ts | Integrated generateContentTitle() into getTitles() with fallback to static titles |
| scripts/news-types/committee-reports.ts | Integrated generateContentTitle() into getTitles() with fallback to static titles |
| tests/data-transformers.test.ts | Updated 11 existing tests to expect no data-translate in non-SV articles, added 3 new test suites: generateContentTitle (78 tests), Unknown sentinel fix (2 tests), motion grouping (3 tests) |
Comments suppressed due to low confidence (3)
scripts/data-transformers.ts:2844
- In extractPolicyDomains(), the function scans document text for Swedish keywords to determine policy domains. However, it uses simple string inclusion checks (text.includes(kw)) without word boundaries. This means keywords like "energi" will match "energiminister", "förnybar" will match "förnybartdirektivet", and so on. This is acceptable for broad domain detection, but could lead to false positives if a keyword appears as a substring in an unrelated term. Consider whether this fuzziness is intentional for catching variations, or if word-boundary matching would be more precise.

    function extractPolicyDomains(documents: RawDocument[]): string[] {
      const counts: Record<string, number> = {};
      for (const doc of documents) {
        const text = `${doc.titel ?? ''} ${doc.rubrik ?? ''} ${doc.summary ?? ''} ${doc.notis ?? ''}`.toLowerCase();
        for (const [domain, keywords] of Object.entries(POLICY_DOMAIN_KEYWORDS)) {
          if (keywords.some(kw => text.includes(kw))) {
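
One way to make the trade-off in this comment concrete is sketched below. It is not the PR's code: `escapeRegExp`, `startsWordMatcher`, and `wholeWordMatcher` are hypothetical helpers, and the keyword "val" is an illustrative assumption. The matchers use Unicode property escapes rather than `\b`, because JavaScript's `\b` is defined over ASCII word characters and does not treat å, ä, or ö as letters.

```typescript
// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Keyword must start a word but may continue into a compound,
// so "energi" still matches "energiminister".
function startsWordMatcher(keyword: string): RegExp {
  return new RegExp(`(?<![\\p{L}\\p{N}])${escapeRegExp(keyword)}`, 'u');
}

// Stricter variant: keyword must be a whole word on both sides.
function wholeWordMatcher(keyword: string): RegExp {
  return new RegExp(
    `(?<![\\p{L}\\p{N}])${escapeRegExp(keyword)}(?![\\p{L}\\p{N}])`,
    'u',
  );
}

// Plain substring matching accepts a keyword buried inside an unrelated word.
const text = 'hög kvalitet i vården';
const substringHit = text.includes('val');              // true: "kvalitet" contains "val"
const wordStartHit = startsWordMatcher('val').test(text); // false: "val" does not start a word here
```

A start-of-word match keeps the compound-word behaviour the comment calls acceptable while rejecting hits in the middle of a word; the whole-word variant is stricter and would drop compounds as well.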
scripts/data-transformers.ts:2881
- In generateContentTitle(), when fewer than 2 domains are detected, the function returns null and callers fall back to static titles. However, there's no fallback when exactly 2 domains are found but the translation lookup fails (lines 2877-2881). If either DOMAIN_TRANSLATIONS[domains[0]] or DOMAIN_TRANSLATIONS[domains[1]] is undefined, or if the language key doesn't exist in those translations, d1 or d2 will be undefined, causing the null check at line 2881 to return null. While this is safe, it means translation table completeness is critical. Consider adding a comment warning that all domains in POLICY_DOMAIN_KEYWORDS must have complete translations in DOMAIN_TRANSLATIONS for all 14 languages.

    const domains = extractPolicyDomains(documents);
    if (domains.length < 2) return null;
    const langKey = lang as string;
    const trans0 = DOMAIN_TRANSLATIONS[domains[0]!];
    const trans1 = DOMAIN_TRANSLATIONS[domains[1]!];
    const d1 = trans0?.[langKey] ?? trans0?.['en'];
    const d2 = trans1?.[langKey] ?? trans1?.['en'];
    if (!d1 || !d2) return null;
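
Beyond a warning comment, the completeness requirement could be enforced by a small check run in a unit test. The sketch below assumes the two tables have the shapes implied by the excerpt (domain to keyword list, domain to per-language strings); `findMissingTranslations` and the sample data are hypothetical.

```typescript
// Assumed table shapes, inferred from how the excerpt indexes them.
type KeywordTable = Record<string, string[]>;
type TranslationTable = Record<string, Record<string, string>>;

// Returns "domain:lang" for every missing or empty translation entry.
function findMissingTranslations(
  keywords: KeywordTable,
  translations: TranslationTable,
  langs: readonly string[],
): string[] {
  const missing: string[] = [];
  for (const domain of Object.keys(keywords)) {
    const entry = translations[domain];
    for (const lang of langs) {
      if (!entry?.[lang]) missing.push(`${domain}:${lang}`);
    }
  }
  return missing;
}

// Illustrative data only; the real tables cover 14 languages.
const sampleKeywords: KeywordTable = {
  energy: ['energi'],
  health: ['vård'],
};
const sampleTranslations: TranslationTable = {
  energy: { en: 'Energy', sv: 'Energi' },
  health: { en: 'Health' }, // "sv" deliberately missing
};
const gaps = findMissingTranslations(sampleKeywords, sampleTranslations, ['en', 'sv']);
// gaps is ['health:sv']
```

Asserting that the result is empty in the test suite would turn a silent fallback to static titles into a visible failure when a domain is added without its translations.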
scripts/data-transformers.ts:2003
- In groupMotionsByProposition(), a motion is added to the grouped map only when extractPropRef() returns a truthy ref (line 1998). If extractPropRef() ever returned an empty string (which is falsy in JavaScript), the motion would be treated as independent. The regex pattern /med anledning av prop\.\s+(\S+)/i should always capture at least one non-whitespace character if it matches, but consider explicitly checking for empty strings if defensive programming is desired, or document that extractPropRef() returns either a non-empty string or null.

    for (const motion of motions) {
      const title = motion.titel || motion.title || '';
      const ref = extractPropRef(title);
      if (ref) {
        if (!grouped.has(ref)) grouped.set(ref, []);
        grouped.get(ref)!.push(motion);
      } else {
        independent.push(motion);
      }
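
The "non-empty string or null" contract suggested here can be made explicit in the types. The following is a minimal sketch rather than the merged code: it assumes the strict `\d{4}\/\d{2}:\d+` ID pattern that this PR's later commits report adopting, and `MotionLike` is a hypothetical trimmed-down shape.

```typescript
// Hypothetical minimal motion shape; both title fields appear in the excerpt.
interface MotionLike {
  titel?: string;
  title?: string;
}

// Strict ID format (YYYY/YY:NNN), so a match can never capture an empty string.
const PROP_REF = /med anledning av prop\.\s+(\d{4}\/\d{2}:\d+)/i;

// Returns a non-empty proposition reference, or null when the title has none.
function extractPropRef(title: string): string | null {
  const m = PROP_REF.exec(title);
  return m?.[1] ?? null;
}

// Partition motions into per-proposition groups and a list of independent motions.
function groupMotionsByProposition<T extends MotionLike>(
  motions: T[],
): { grouped: Map<string, T[]>; independent: T[] } {
  const grouped = new Map<string, T[]>();
  const independent: T[] = [];
  for (const motion of motions) {
    const ref = extractPropRef(motion.titel || motion.title || '');
    if (ref === null) {
      independent.push(motion);
      continue;
    }
    const bucket = grouped.get(ref) ?? [];
    bucket.push(motion);
    grouped.set(ref, bucket);
  }
  return { grouped, independent };
}
```

Comparing against `null` instead of relying on truthiness makes the empty-string case a non-issue by construction, and the `grouped.get(ref) ?? []` pattern avoids the non-null assertion used in the excerpt.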
Signed-off-by: James Pether Sörling <[email protected]>
pethers approved these changes on Feb 23, 2026
Contributor
🔍 Lighthouse Performance Audit
📥 Download full Lighthouse report Budget Compliance: Performance budgets enforced via |
Copilot AI added a commit that referenced this pull request on Feb 23, 2026
- Take origin/main as base for data-transformers.ts, types/content.ts and tests/data-transformers.test.ts (which had landed PRs #453, #457, etc.)
- Apply our PR changes on top:
  * extractPropRef: tighten regex to \d{4}\/\d{2}:\d+ (strict ID format)
  * groupMotionsByProposition: add export keyword; keep {grouped,independent} return type from main; replace inline lang-switches with L() labels
  * generateMotionsContent: use L(lang,'responsesToProp') and L(lang,'independentMotions') instead of 14-branch ternary chains
  * groupPropositionsByCommittee: new exported helper (Map<committeeKey,[...]>)
  * generatePropositionsContent: committee-grouped rendering (h3 per committee, h4 per prop when multi-committee; h3 per prop + Referred-to line otherwise); reuses byCommitteeGroup.size for policyImplicationsContext domainCount
  * ContentLabelSet: add responsesToProp and independentMotions fields
  * CONTENT_LABELS: add both new fields to all 14 language objects
- Tests: add unit tests for exported helpers ({grouped,independent} API), label presence tests (28 tests for 2 keys × 14 langs), proposition grouping rendering tests; all 1674 tests passing across 39 test files

Co-authored-by: pethers <[email protected]>
Copilot AI added a commit that referenced this pull request on Feb 23, 2026
… take-main + apply)

Take origin/main as clean base for data-transformers.ts, types/content.ts, tests/data-transformers.test.ts and sitemap.xml (main had landed PRs #453, #457, #469 etc.), then re-apply every PR-specific change on top:

data-transformers.ts:
- svSpan(): fix data-translate direction per translation workflow (lang!='sv' → data-translate="true"; lang='sv' → no marker)
- PROP_REFERENCE_REGEX: strict \d{4}\/\d{2}:\d+ (was \S+)
- PROP_FULL_REF_REGEX: non-greedy [^<]+? with HTML-tag lookahead
- export groupMotionsByProposition (was private)
- generateMotionsContent: replace 14-branch ternaries with L(lang,'responsesToProp') / L(lang,'independentMotions')
- use PROP_FULL_REF_REGEX + svSpan() for proposition group heading
- export groupPropositionsByCommittee (new)
- generatePropositionsContent: committee-grouped rendering (h3+h4 when multi-committee; h3+referredLine when single-committee)
- Add responsesToProp + independentMotions to all 14 language objects

scripts/types/content.ts:
- Add responsesToProp: string; independentMotions: string to ContentLabelSet

tests/data-transformers.test.ts:
- Import groupMotionsByProposition and groupPropositionsByCommittee
- Fix 8 existing svSpan tests: flip data-translate expectations to match corrected svSpan() semantics
- Add 5 unit tests for groupMotionsByProposition (strict-ID regex, fallback to title field, empty input, total-count invariant)
- Add 5 unit tests for groupPropositionsByCommittee (organ/committee fallback, empty-string key, empty input, total-count invariant)
- Add 28 label-presence tests (2 keys × 14 langs)
- Add 4 rendering tests (single-committee h3, multi-committee h4, referredLine shown/hidden)

All 1672 tests pass across 39 test files.

Co-authored-by: pethers <[email protected]>
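
The committee grouping mentioned in this commit can be sketched roughly as follows. Only the helper name, the `Map` return shape, the organ/committee fallback, and the empty-string key are taken from the commit message; the order of the fallback (organ first) and the `PropositionLike` shape are assumptions.

```typescript
// Hypothetical minimal proposition shape; field names follow the commit message.
interface PropositionLike {
  organ?: string;
  committee?: string;
  titel?: string;
}

// Group propositions by committee key; documents without one share the '' key.
function groupPropositionsByCommittee<T extends PropositionLike>(
  props: T[],
): Map<string, T[]> {
  const byCommittee = new Map<string, T[]>();
  for (const prop of props) {
    // Assumed precedence: organ, then committee, then the empty-string key.
    const key = prop.organ || prop.committee || '';
    const bucket = byCommittee.get(key) ?? [];
    bucket.push(prop);
    byCommittee.set(key, bucket);
  }
  return byCommittee;
}

// A renderer could then pick the heading layout from the number of groups.
function isMultiCommittee<T extends PropositionLike>(groups: Map<string, T[]>): boolean {
  return groups.size > 1;
}
```

The total-count invariant listed among the tests follows directly from this shape: every input lands in exactly one bucket, so the bucket lengths sum to the input length.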
Three related content-quality bugs in scripts/data-transformers.ts causing "Filed by: Unknown (Unknown)", false validation failures on 131 non-Swedish articles, and repetitive motion section headers.

#454: Unknown author/party sentinel bypass

enrichDocumentsWithContent sets intressent_namn='Unknown' / parti='Unknown' as sentinels on enrichment failure. The old if (!authorName) guard was skipped because 'Unknown' is truthy, so parseMotionAuthorParty was never invoked.

- generateMotionsContent: guard now checks !authorName || authorName === 'Unknown'; sentinel values stripped before the fallback parse runs; motion.titel/rubrik added as raw-text sources
- generateEnhancedSummary: same sentinel guard for the motion author/party metadata line
- 'Unknown' parti now buckets into 'other' in the party-breakdown section

#458: data-translate="true" leaking into non-Swedish articles

All Swedish API content was unconditionally wrapped in <span data-translate="true" lang="sv">. No client-side translator exists, so this marker appears 20+ times per article in non-Swedish files and fails validation.

Added an svSpan(escapedText, lang) helper and replaced all 14 call sites. Swedish articles retain the marker; non-Swedish articles get lang="sv" only (screen-reader accessibility preserved).

#462: Motions referencing the same proposition rendered under separate headers

Multiple motions responding to the same government proposition each rendered their own section, duplicating the proposition title repeatedly.

Added extractPropRef(title) (extracts YYYY/YY:NNN from "med anledning av prop. …" prefixes) and groupMotionsByProposition(motions), which partitions into a Map<propRef, motions[]> and an independent[] list. generateMotionsContent now renders:

- <h3>Prop. 2025/26:118: [title]</h3> per proposition, grouping all related motions beneath it

Section headers are translated for all 14 languages.
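
For the #458 fix, the helper's behaviour as described in this summary can be sketched as below. The signature svSpan(escapedText, lang) is from the description; the body is an assumption, and it expects the caller to have HTML-escaped the text already. Note that a later commit on this PR reports flipping the condition (lang!='sv' → data-translate="true"; lang='sv' → no marker), so the branch shown here reflects the description only, not necessarily the merged state.

```typescript
// Wrap already-escaped Swedish source text in a span.
// lang is the language of the article being generated, not of the text itself.
function svSpan(escapedText: string, lang: string): string {
  // Per this description: marker only in Swedish articles.
  // A later commit in the PR reportedly reverses this condition.
  const marker = lang === 'sv' ? ' data-translate="true"' : '';
  // lang="sv" is always emitted so screen readers pronounce the text as Swedish.
  return `<span${marker} lang="sv">${escapedText}</span>`;
}

const inSwedishArticle = svSpan('Ny energilag', 'sv');
// '<span data-translate="true" lang="sv">Ny energilag</span>'
const inEnglishArticle = svSpan('Ny energilag', 'en');
// '<span lang="sv">Ny energilag</span>'
```

Routing every call site through one helper is what makes the later direction change a one-line edit rather than 14 separate ones.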