fix(news): 225 articles report wordCount: 0 in news-articles.json metadata (broken word count pipeline) #490

@pethers

📋 Issue Type

Bug Fix / SEO Quality

🎯 Objective

Fix the wordCount: 0 metadata bug in data/news-articles.json where 225 out of 668 articles (34%) report zero word count. This affects SEO structured data (Schema.org wordCount), quality metrics reporting, and article quality scoring.

📊 Current State: Evidence

Measured: 225 articles in data/news-articles.json have "wordCount": 0

{
  "slug": "2026-02-24-legislative-push",
  "file": "2026-02-24-legislative-push-en.html",
  "lang": "en",
  "headline": "Sweden Bolsters Civilian Defence...",
  "wordCount": 0,   // ❌ Should be ~1500-2000
  ...
}

All 14 language variants of recent breaking news articles (#485) have wordCount: 0.

The actual articles have content: news/2026-02-24-committee-reports-en.html has ~2239 words, but data/news-articles.json reports 0.
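The measurement above could be reproduced with a small helper like the one below. This is a minimal sketch, assuming data/news-articles.json parses to an array of entries shaped like the sample; the `WordCountEntry` type and `summarizeZeroWordCounts` function are hypothetical names, not code from the repository.

```typescript
// Assumed entry shape, based only on fields visible in the sample entry.
interface WordCountEntry {
  slug: string;
  lang: string;
  wordCount: number;
}

// Hypothetical helper: summarizes how many entries report a zero word count.
function summarizeZeroWordCounts(entries: WordCountEntry[]): {
  total: number;
  zero: number;
  percent: number;
} {
  const zero = entries.filter((entry) => entry.wordCount === 0).length;
  const total = entries.length;
  // Guard against division by zero for an empty index.
  const percent = total === 0 ? 0 : Math.round((zero / total) * 100);
  return { total, zero, percent };
}
```

With the figures reported in this issue, 225 of 668 rounds to 34%.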

Root cause: The word count is calculated correctly in scripts/generate-news-enhanced.ts (line ~505) during article quality scoring, but this value is not persisted to data/news-articles.json. The news index generator (scripts/generate-news-indexes.ts) creates the JSON entries but doesn't read back the word count from the HTML files.

For agentic workflow-generated articles (breaking news from realtime monitor), the article HTML is written directly by the agentic engine without going through generate-news-enhanced.ts at all, so word count is never calculated.

🚀 Desired State

  1. Calculate word count during news index generation (generate-news-indexes.ts) by:

    • Reading each HTML file
    • Stripping HTML tags
    • Counting whitespace-delimited tokens
    • Writing the result to data/news-articles.json
  2. Propagate word count to Schema.org structured data in the article HTML:

    { "@type": "NewsArticle", "wordCount": 1500 }
  3. Update quality metrics to use actual word counts for quality scoring.
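The counting steps in item 1 (strip tags, then count whitespace-delimited tokens) might look roughly like this. It is a sketch only: `countWords` is a hypothetical helper, and dropping script and style blocks before counting is an added assumption so that embedded JSON-LD and CSS are not counted as prose.

```typescript
// Hypothetical helper, not taken from the repository.
function countWords(html: string): number {
  const text = html
    // Drop script/style blocks entirely so embedded code is not counted as prose.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Replace remaining tags with a space so adjacent blocks do not merge into one token.
    .replace(/<[^>]+>/g, " ")
    // Treat non-breaking-space entities as whitespace; other entities stay attached to their word.
    .replace(/&nbsp;/gi, " ")
    .trim();
  // An empty document has zero words; otherwise count whitespace-delimited tokens.
  return text === "" ? 0 : text.split(/\s+/).length;
}
```

A regex-based strip is approximate. If the repository already depends on an HTML parser, extracting the text content through it would likely be more robust than this sketch.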

🔧 Implementation Approach

  1. Update scripts/generate-news-indexes.ts to read HTML files and calculate word count
  2. Populate wordCount field in data/news-articles.json entries
  3. Add word count calculation for agentic-workflow-generated articles
  4. Run npx vitest run and verify that the existing tests still pass
  5. Regenerate data/news-articles.json with correct word counts
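Steps 1 to 3 could share one code path if the index generator derives wordCount from the HTML file on disk, regardless of whether the article came from generate-news-enhanced.ts or from the agentic workflow. The sketch below assumes that approach; `ArticleEntry`, `populateWordCounts`, and the injected `readHtml` and `countWords` callbacks are hypothetical names, and the entry shape is inferred from the sample entry only.

```typescript
// Assumed entry shape, inferred from the sample entry in this issue.
interface ArticleEntry {
  slug: string;
  file: string;
  lang: string;
  headline: string;
  wordCount: number;
}

// Hypothetical helper: returns new entries with wordCount recomputed from each HTML file.
// File access and counting are injected so the function stays pure and easy to test.
function populateWordCounts(
  entries: ArticleEntry[],
  readHtml: (file: string) => string | null,
  countWords: (html: string) => number
): ArticleEntry[] {
  return entries.map((entry) => {
    const html = readHtml(entry.file);
    // Leave the entry untouched when the HTML file cannot be read.
    if (html === null) return entry;
    return { ...entry, wordCount: countWords(html) };
  });
}
```

Injecting the reader keeps the sketch independent of the file layout. In the real script the callback would presumably read from the news/ directory.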

🤖 Recommended Agent

code-quality-engineer: fix metadata generation pipeline

✅ Acceptance Criteria

  • data/news-articles.json has accurate wordCount for all articles (not 0)
  • Word count calculation strips HTML tags before counting
  • Agentic workflow articles also get word count populated
  • Schema.org wordCount in article HTML matches JSON metadata
  • Existing tests pass (npx vitest run)
  • Quality metrics reporting reflects actual word counts
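The criterion that Schema.org wordCount matches the JSON metadata could be checked with something like the following. This is a sketch under the assumption that each article embeds its structured data as a JSON-LD script block (type application/ld+json); `extractSchemaWordCount` and `wordCountsAgree` are hypothetical helpers, not existing code.

```typescript
// Hypothetical helper: returns the wordCount of the first NewsArticle JSON-LD object, or null.
function extractSchemaWordCount(html: string): number | null {
  const pattern =
    /<script\b[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const data: unknown = JSON.parse(match[1]);
      // A JSON-LD block may hold a single object or an array of objects.
      const candidates: unknown[] = Array.isArray(data) ? data : [data];
      for (const item of candidates) {
        if (typeof item !== "object" || item === null) continue;
        const record = item as Record<string, unknown>;
        if (record["@type"] === "NewsArticle" && typeof record["wordCount"] === "number") {
          return record["wordCount"] as number;
        }
      }
    } catch {
      // Ignore malformed JSON-LD blocks and keep scanning.
    }
  }
  return null;
}

// Hypothetical check: true when the HTML's Schema.org wordCount equals the index value.
function wordCountsAgree(indexWordCount: number, html: string): boolean {
  return extractSchemaWordCount(html) === indexWordCount;
}
```

A test could run this over every entry and fail on any article where the two values differ or where the index value is 0.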
