feat: restructure as framework, add feedback loop + ranker + GitHub importer #1

Merged
hallelx2 merged 2 commits into main from feat/framework-restructure
Apr 17, 2026

Conversation

@pathfindermilan pathfindermilan commented Apr 17, 2026

Collaborator

Summary

Restructures Context8 from a flat module layout into a focused subpackage framework, and adds the four production capabilities that turn it from a hackathon demo into a credible submission.

What's new

  • Framework layout: embeddings/, search/, ingest/, benchmark/, mcp/, cli/commands/ (one responsibility per module)
  • GitHub Issues importer: context8 import-github vercel/next.js --label bug --max-issues 50
  • Agent feedback loop: context8_rate MCP tool; worked_ratio feeds the ranker
  • Per-strategy attribution: every result shows which named vector / sparse path surfaced it, and at what rank
  • Quality ranker: final = retrieval × confidence × recency_decay × worked_ratio (configurable floors)
  • CLI: bench (Recall@K ablation across 5 configs), demo (4 scripted scenarios)
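
A minimal sketch of how the quality ranker formula above could be computed, assuming an exponential half-life for recency_decay and a minimum-sample gate on worked_ratio. The constant values, the floor names, and the function signatures are illustrative assumptions; the PR only states that such tuning constants (for example RECENCY_HALF_LIFE_DAYS) live in config.py.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tuning values; the real ones live in src/context8/config.py.
RECENCY_HALF_LIFE_DAYS = 180.0
RECENCY_FLOOR = 0.3
CONFIDENCE_FLOOR = 0.5
FEEDBACK_FLOOR = 0.5
MIN_FEEDBACK_SAMPLES = 3


def recency_decay(last_seen: datetime, now: datetime) -> float:
    """Halve the factor every RECENCY_HALF_LIFE_DAYS, never dropping below the floor."""
    age_days = max((now - last_seen).total_seconds() / 86400.0, 0.0)
    return max(0.5 ** (age_days / RECENCY_HALF_LIFE_DAYS), RECENCY_FLOOR)


def feedback_factor(applied: int, worked: int) -> float:
    """Use worked/applied only once enough ratings exist; otherwise stay neutral."""
    if applied < MIN_FEEDBACK_SAMPLES:
        return 1.0
    return max(worked / applied, FEEDBACK_FLOOR)


def final_score(
    retrieval: float,
    confidence: float,
    last_seen: datetime,
    applied: int,
    worked: int,
    now: datetime,
) -> float:
    """final = retrieval x confidence x recency_decay x worked_ratio, each factor floored."""
    conf = max(min(confidence, 1.0), CONFIDENCE_FLOOR)
    return retrieval * conf * recency_decay(last_seen, now) * feedback_factor(applied, worked)


now = datetime(2026, 4, 17, tzinfo=timezone.utc)
fresh = final_score(0.8, 1.0, now, applied=0, worked=0, now=now)
stale = final_score(0.8, 1.0, now - timedelta(days=180), applied=4, worked=2, now=now)
```

The floors keep a single weak signal from zeroing out an otherwise strong retrieval hit, which appears to be the intent of "configurable floors".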

Bug fixes

  • StorageService.client now calls connect() — fixes 503 from the Actian sync wrapper
  • context8 doctor asserts hybrid / named / sparse / filter are actually live (no silent degradation)
  • docker-compose.yml uses fully-qualified docker.io/... image (Podman compatibility)

Test plan

  • 79 unit tests pass (was 29) — added test_ranking, test_attribution, test_github_importer, test_models_extended, test_benchmark
  • tests/test_e2e.py covers hybrid retrieval, filter isolation, feedback persistence, quality boost (live DB, auto-skips when unreachable)
  • ruff check src/ tests/ clean
  • context8 --help shows all 12 commands; context8 doctor reports green on a live DB
  • Run context8 bench against live DB and paste numbers into RESULTS.md

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GitHub issue ingestion with import-github command
    • Added benchmarking and evaluation with bench and demo commands
    • Added per-strategy attribution to search results
    • Added feedback rating system (context8_rate) with worked/applied counters
    • Added solution-based search capability
    • Added quality ranking based on recency, confidence, and user feedback
  • Documentation

    • Updated CLI documentation with new commands and capabilities
    • Added comprehensive submission results documentation
  • Chores

    • Reorganized CLI into modular command structure
    • Refactored codebase into logical packages for search, ingestion, and benchmarking

…orter

- Reorganize into subpackages: embeddings/, search/, ingest/, benchmark/, mcp/, cli/commands/
- Add `context8 import-github`, `context8_rate` MCP tool, per-strategy attribution, and confidence/recency/worked-ratio quality ranker
- Wire previously-unused solution named vector into search
- Add `bench` (Recall@K ablation) and `demo` CLI commands
- Fix StorageService not calling connect() on the Actian sync wrapper
- Harden `context8 doctor` to verify hybrid/named/sparse are actually live

@sourcery-ai sourcery-ai Bot left a comment

Sorry @pathfindermilan, your pull request is larger than the review limit of 150000 diff characters

coderabbitai Bot commented Apr 17, 2026


Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ae40c5cd-9b83-476c-9ebf-4b4a6683087f

📥 Commits

Reviewing files that changed from the base of the PR and between 46ed687 and 13a053c.

📒 Files selected for processing (49)
  • README.md
  • RESULTS.md
  • docker-compose.yml
  • pyproject.toml
  • src/context8/__init__.py
  • src/context8/__main__.py
  • src/context8/agents.py
  • src/context8/benchmark/__init__.py
  • src/context8/benchmark/ground_truth.py
  • src/context8/benchmark/runner.py
  • src/context8/cli.py
  • src/context8/cli/__init__.py
  • src/context8/cli/commands/__init__.py
  • src/context8/cli/commands/bench.py
  • src/context8/cli/commands/ingest.py
  • src/context8/cli/commands/integrations.py
  • src/context8/cli/commands/lifecycle.py
  • src/context8/cli/commands/ops.py
  • src/context8/cli/commands/serve.py
  • src/context8/cli/main.py
  • src/context8/cli/ui.py
  • src/context8/config.py
  • src/context8/embeddings/__init__.py
  • src/context8/embeddings/service.py
  • src/context8/embeddings/tokenizer.py
  • src/context8/feedback.py
  • src/context8/ingest/__init__.py
  • src/context8/ingest/github.py
  • src/context8/ingest/pipeline.py
  • src/context8/ingest/seed.py
  • src/context8/mcp/__init__.py
  • src/context8/mcp/server.py
  • src/context8/mcp/tools.py
  • src/context8/models.py
  • src/context8/search.py
  • src/context8/search/__init__.py
  • src/context8/search/analyzer.py
  • src/context8/search/attribution.py
  • src/context8/search/engine.py
  • src/context8/search/ranking.py
  • src/context8/storage.py
  • tests/test_agents.py
  • tests/test_attribution.py
  • tests/test_benchmark.py
  • tests/test_e2e.py
  • tests/test_embeddings.py
  • tests/test_github_importer.py
  • tests/test_models_extended.py
  • tests/test_ranking.py

Walkthrough

Context8 undergoes a comprehensive reorganization from flat structure to modular packages: CLI commands split into separate files under cli/commands/, search logic refactored into search/ submodules with new ranking and attribution tracking, and ingest operations consolidated into ingest/ with GitHub importer. Adds feedback rating loop, quality ranking with configurable boosts, per-strategy attribution in results, GitHub issue ingestion pipeline, and benchmark/evaluation artifacts. Expands test coverage significantly with e2e, benchmark, and component-specific test modules.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Documentation & Configuration | `README.md`, `RESULTS.md`, `docker-compose.yml` | Added RESULTS.md for submission documentation, expanded README with feature descriptions and benchmark instructions, updated Docker image registry reference. |
| Version & Package Setup | `src/context8/__init__.py`, `src/context8/__main__.py` | Bumped version 0.1.1 → 0.2.0, added `from __future__ import annotations`, simplified relative imports. |
| Lint Configuration | `pyproject.toml` | Updated Ruff ignore rules: relocated E501 exemptions to reflect new seed/tools file locations. |
| CLI Core Reorganization | `src/context8/cli.py` (removed), `src/context8/cli/__init__.py`, `src/context8/cli/main.py`, `src/context8/cli/ui.py` | Refactored monolithic CLI into modular package: removed 542-line cli.py, created main.py entrypoint with Click group, added ui.py helpers for Docker/SDK/DB checks, reorganized command handling. |
| CLI Commands - Lifecycle | `src/context8/cli/commands/__init__.py`, `src/context8/cli/commands/lifecycle.py` | Created lifecycle command module: start, stop, init with Docker orchestration, database initialization, and optional seeding. |
| CLI Commands - Operations | `src/context8/cli/commands/ops.py` | Added health/status commands: stats (knowledge base metrics), doctor (comprehensive health checks including vector/filter support), search (interactive retrieval with attribution display). |
| CLI Commands - Integration & Ingestion | `src/context8/cli/commands/integrations.py`, `src/context8/cli/commands/ingest.py` | New MCP agent integration (add/remove commands) and GitHub issue ingestion (import-github) with label filtering, state control, and resolution handling. |
| CLI Commands - Benchmarking & Server | `src/context8/cli/commands/bench.py`, `src/context8/cli/commands/serve.py` | Added bench command with configuration comparison and recall/MRR reporting, demo command with scripted scenarios, and serve entrypoint for MCP stdio server. |
| Agent Management | `src/context8/agents.py` | Removed docstrings, refactored conditional logic to eliminate redundant else blocks, maintained JSON/agent integration functionality. |
| Config & Constants | `src/context8/config.py` | Added ranking-related constants (RECENCY_HALF_LIFE_DAYS, boost floors, min feedback samples), new continue_config_path(), updated MCP server path to context8.mcp.server. |
| Models - Feedback & Attribution | `src/context8/models.py` | Extended with FeedbackStats (applied/worked counters), updated ResolutionRecord with feedback field, added Attribution/StrategyContribution for ranking source tracking, expanded SearchResult with raw_score and boost_factors. |
| Embeddings Package Split | `src/context8/embeddings/__init__.py`, `src/context8/embeddings/service.py`, `src/context8/embeddings/tokenizer.py` | Extracted sparse tokenization logic into new BM25Tokenizer class, delegated embed_sparse() to tokenizer, created embeddings package API surface. |
| Feedback & Rating Loop | `src/context8/feedback.py` | New service for feedback persistence: FeedbackService.rate() increments applied/worked counters, updates record tags/timestamps, and re-embeds for storage. |
| Ingest Package Refactoring | `src/context8/ingest/__init__.py`, `src/context8/ingest/seed.py`, `src/context8/ingest/pipeline.py`, `src/context8/ingest/github.py` | Reorganized seeding with UUID-based slug-to-ID mapping, created IngestPipeline for batch processing with stats tracking, implemented GitHubIssueImporter for authenticated fetching/conversion with label/state/resolution filters. |
| Search Engine Refactoring | `src/context8/search.py` (removed), `src/context8/search/__init__.py`, `src/context8/search/engine.py`, `src/context8/search/analyzer.py`, `src/context8/search/attribution.py`, `src/context8/search/ranking.py` | Decomposed monolithic 295-line search.py into modular package: engine handles hybrid/dense/sparse fusion with quality boosting, analyzer determines strategy weights, attribution tracks per-strategy contributions, ranking applies confidence/recency/feedback boosts. |
| Storage Service Enhancements | `src/context8/storage.py` | Added update_record() for feedback persistence, enhanced get_collection_info() to discover named/sparse vectors dynamically with metadata flags, ensured client connection initialization. |
| MCP Server Restructuring | `src/context8/mcp/__init__.py`, `src/context8/mcp/server.py`, `src/context8/mcp/tools.py` | Moved server bootstrap to dedicated module, separated concerns: server.py handles stdio/MCP setup, tools.py provides tool list/dispatch, added feedback rating tool (context8_rate) and solution search tool (context8_search_solutions). |
| Benchmark & Evaluation | `src/context8/benchmark/__init__.py`, `src/context8/benchmark/ground_truth.py`, `src/context8/benchmark/runner.py` | New evaluation framework: ground truth dataset with 50+ problem/solution pairs, configurable benchmark runner testing named/sparse/filter/quality-boost combinations with recall@K/MRR/latency metrics. |
| Test Suite - Core Models | `tests/test_models_extended.py` | New comprehensive model tests: FeedbackStats serialization, Attribution selection logic, SearchResult defaults. |
| Test Suite - Search Components | `tests/test_attribution.py`, `tests/test_ranking.py`, `tests/test_embeddings.py` | New attribution tracker tests validating multi-strategy contribution tracking, ranking tests verifying confidence/recency/feedback boost application, refactored tokenizer tests to use public BM25Tokenizer API. |
| Test Suite - Ingestion | `tests/test_github_importer.py` | New GitHub importer tests: slug-to-ID determinism, error/code/language/framework detection, resolution inference, record construction. |
| Test Suite - Benchmark & E2E | `tests/test_benchmark.py`, `tests/test_e2e.py` | New benchmark validation tests (ground truth completeness, metric correctness across scenarios), comprehensive e2e integration tests covering collection shape, hybrid/sparse/filtered/named-vector search, feedback loop, attribution, quality boosting. |
| Test Suite - Configuration | `tests/test_agents.py` | Updated MCP server path assertion from context8.server to context8.mcp.server. |
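
For readers unfamiliar with the recall@K and MRR metrics named in the Benchmark & Evaluation row, the following is a small sketch of how they are conventionally computed. The function names and the shape of the inputs (ranked ID lists against a set of relevant IDs) are assumptions for illustration and are not taken from benchmark/runner.py.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant records that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, record_id in enumerate(ranked_ids, start=1):
        if record_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple[Sequence[str], Iterable[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


runs = [(["a", "b", "c"], ["b"]), (["x", "y"], ["x"])]
mrr = mean_reciprocal_rank(runs)
```

An ablation in this style would run the same query set under each configuration and compare the resulting recall@K and MRR values side by side.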

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as CLI (import-github)
    participant GitHub as GitHub API
    participant Importer as GitHubIssueImporter
    participant Embeddings as EmbeddingService
    participant Pipeline as IngestPipeline
    participant Storage as StorageService
    
    User->>CLI: context8 import-github owner/repo --label bug
    CLI->>CLI: check_db_connection()
    CLI->>Importer: fetch(repo, labels, max_issues, state)
    Importer->>GitHub: GET /repos/owner/repo/issues?labels=bug
    GitHub-->>Importer: issues + comments
    Importer-->>CLI: FetchResult
    CLI->>Importer: to_records(FetchResult, require_resolution)
    Importer-->>CLI: list[ResolutionRecord]
    CLI->>Pipeline: ingest(records, skip_existing=True)
    Pipeline->>Storage: get_record(id) [check duplicates]
    Storage-->>Pipeline: existing or None
    Pipeline->>Embeddings: embed_record(problem, solution, code)
    Embeddings-->>Pipeline: vectors dict
    Pipeline->>Storage: store_record(record, vectors)
    Storage-->>Pipeline: record_id
    Pipeline-->>CLI: IngestStats(attempted=N, stored=M, duplicates=X)
    CLI->>User: Display summary table
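
The duplicate check and stats accounting in the diagram above can be sketched roughly as follows. This is not the actual IngestPipeline: the in-memory storage class, the dict-shaped records, and the `embed` callable are hypothetical stand-ins, and only the IngestStats field names (attempted, stored, duplicates) come from the diagram.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IngestStats:
    attempted: int = 0
    stored: int = 0
    duplicates: int = 0


class InMemoryStorage:
    """Hypothetical stand-in for StorageService, keyed by record id."""

    def __init__(self) -> None:
        self._records: dict[str, tuple[dict, dict]] = {}

    def get_record(self, record_id: str):
        return self._records.get(record_id)

    def store_record(self, record: dict, vectors: dict) -> str:
        self._records[record["id"]] = (record, vectors)
        return record["id"]


def ingest(
    records: list[dict],
    storage: InMemoryStorage,
    embed: Callable[[dict], dict],
    skip_existing: bool = True,
) -> IngestStats:
    """Embed and store each record, skipping ids that are already present."""
    stats = IngestStats()
    for record in records:
        stats.attempted += 1
        if skip_existing and storage.get_record(record["id"]) is not None:
            stats.duplicates += 1
            continue  # do not pay for an embedding we will not store
        storage.store_record(record, embed(record))
        stats.stored += 1
    return stats


storage = InMemoryStorage()
stats = ingest(
    [{"id": "a"}, {"id": "a"}, {"id": "b"}],
    storage,
    embed=lambda record: {"problem": [0.0]},
)
```

Checking for the existing record before embedding keeps re-imports of the same repository cheap, since the embedding call is usually the expensive step.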
sequenceDiagram
    participant User
    participant SearchEngine as SearchEngine
    participant Embeddings as EmbeddingService
    participant Attribution as AttributionTracker
    participant Ranking as QualityRanker
    participant CLI as CLI (search)
    
    User->>CLI: context8 search "error message"
    CLI->>SearchEngine: search(query, language=..., limit=5)
    SearchEngine->>Embeddings: embed_query("error message")
    Embeddings-->>SearchEngine: dense vector
    SearchEngine->>SearchEngine: _search_named("problem", vector, filter, 5)
    SearchEngine-->>SearchEngine: problem_results[]
    SearchEngine->>SearchEngine: _search_sparse(sparse_vector, filter, 5)
    SearchEngine-->>SearchEngine: sparse_results[]
    SearchEngine->>SearchEngine: RRF fusion(problem_results, sparse_results)
    SearchEngine-->>SearchEngine: fused_results[]
    SearchEngine->>Attribution: record(strategy="dense", results)
    SearchEngine->>Attribution: record(strategy="sparse", results)
    SearchEngine->>Ranking: boost(fused_results)
    Ranking->>Ranking: Apply confidence/recency/feedback factors
    Ranking-->>SearchEngine: boosted_results[]
    SearchEngine->>Attribution: build_for(record_id) for each result
    SearchEngine-->>CLI: list[SearchResult with attribution]
    CLI->>User: Display results with source tracking
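
The RRF fusion and per-strategy attribution steps in the diagram above can be illustrated with a short sketch of weighted Reciprocal Rank Fusion. The constant k = 60 is the value commonly used for RRF, not one confirmed by this PR, and the function name, strategy keys, and return shape are assumptions rather than the SearchEngine or AttributionTracker API.

```python
from collections import defaultdict
from typing import Mapping, Optional, Sequence

RRF_K = 60  # conventional RRF constant; assumed, not taken from the PR


def rrf_fuse(
    ranked_lists: Mapping[str, Sequence[str]],
    weights: Optional[Mapping[str, float]] = None,
    k: int = RRF_K,
):
    """Fuse best-first id lists from several strategies.

    Returns (fused, attribution): fused is a list of (record_id, score) sorted
    by descending score, and attribution maps each record_id to the
    (strategy, rank) pairs that surfaced it.
    """
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    attribution: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for strategy, ids in ranked_lists.items():
        weight = weights.get(strategy, 1.0)
        for rank, record_id in enumerate(ids, start=1):
            scores[record_id] += weight / (k + rank)
            attribution[record_id].append((strategy, rank))
    # Ties are broken by id so the ordering is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused, dict(attribution)


fused, attribution = rrf_fuse({"problem": ["a", "b"], "sparse": ["b", "c"]})
```

Because RRF works on ranks rather than raw scores, dense and sparse results can be combined without normalizing their score scales, and the rank recorded per strategy is exactly what an attribution display needs.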
sequenceDiagram
    participant User
    participant CLI as CLI (rate tool)
    participant MCP as MCP Handler
    participant FeedbackService as FeedbackService
    participant Storage as StorageService
    participant Embeddings as EmbeddingService
    
    User->>MCP: Call context8_rate(record_id, worked=true)
    MCP->>FeedbackService: rate(record_id, worked=true)
    FeedbackService->>Storage: get_record(record_id)
    Storage-->>FeedbackService: ResolutionRecord
    FeedbackService->>FeedbackService: Increment applied_count
    FeedbackService->>FeedbackService: Increment worked_count (if worked=true)
    FeedbackService->>FeedbackService: Update last_seen timestamp
    FeedbackService->>Embeddings: embed_record(updated record)
    Embeddings-->>FeedbackService: vectors dict
    FeedbackService->>Storage: update_record(record, vectors)
    Storage-->>FeedbackService: success
    FeedbackService-->>MCP: FeedbackOutcome(accepted=true, worked_ratio=0.5)
    MCP-->>User: "Feedback recorded: 1/2 worked"
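
The counter arithmetic behind the "1/2 worked" message above is simple enough to show directly. This sketch assumes a FeedbackStats shape with applied_count and worked_count fields, as named in the diagram; the `rate` method on the dataclass and the zero-sample default are illustrative choices, and persistence and re-embedding are omitted.

```python
from dataclasses import dataclass


@dataclass
class FeedbackStats:
    applied_count: int = 0
    worked_count: int = 0

    @property
    def worked_ratio(self) -> float:
        """Share of applications that worked; 0.0 when nothing has been rated yet."""
        if self.applied_count == 0:
            return 0.0
        return self.worked_count / self.applied_count

    def rate(self, worked: bool) -> None:
        """Record one application of the solution and whether it worked."""
        self.applied_count += 1
        if worked:
            self.worked_count += 1


stats = FeedbackStats()
stats.rate(worked=True)
stats.rate(worked=False)
message = f"Feedback recorded: {stats.worked_count}/{stats.applied_count} worked"
```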

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Poem

A rabbit hops through refactored ground,
Where modular packages abound!
🐰 New feedback loops keep records keen,
Rankings bloom with signals unseen,
GitHub seeds the knowledge base bright,
Attribution shines—a guiding light!

@pathfindermilan

Collaborator Author

docs: add ROADMAP — DO NOT MERGE; pluggable ingestion sources next

Flags branch as not-ready-to-merge and lists the highest-leverage
follow-ups, chiefly: refactor ingest/ into a pluggable sources/ registry
so users can context8 import <source> from agent transcripts, git logs,
shell history, Stack Overflow, URLs, JSONL, and markdown — not just GitHub.

@hallelx2 hallelx2 marked this pull request as ready for review April 17, 2026 21:50
Copilot AI review requested due to automatic review settings April 17, 2026 21:50

@sourcery-ai sourcery-ai Bot left a comment

Sorry @hallelx2, your pull request is larger than the review limit of 150000 diff characters

@hallelx2 hallelx2 merged commit 79519cf into main Apr 17, 2026
6 of 7 checks passed

Copilot AI left a comment

Pull request overview

This PR restructures Context8 into a capability-oriented framework package layout and adds production features around ingestion, ranking, attribution, benchmarking, and MCP/CLI operations.

Changes:

  • Replaces the previous flat module layout (search.py, cli.py, etc.) with subpackages (search/, ingest/, benchmark/, mcp/, cli/commands/, embeddings/).
  • Adds GitHub Issues ingestion, an agent feedback loop, per-strategy attribution, and a quality re-ranker.
  • Adds a benchmark harness + expanded unit/e2e tests and updates docs/artifacts (RESULTS.md, README.md).

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| `tests/test_ranking.py` | Adds unit tests for quality ranker factors and boosting behavior |
| `tests/test_models_extended.py` | Adds unit tests for feedback + attribution model behavior |
| `tests/test_github_importer.py` | Adds unit tests for GitHub importer parsing/detection helpers |
| `tests/test_embeddings.py` | Updates tests to target BM25Tokenizer + QueryAnalyzer in new layout |
| `tests/test_e2e.py` | Adds live-DB end-to-end coverage for hybrid/filter/feedback/boosting |
| `tests/test_benchmark.py` | Adds unit tests for benchmark math + ground-truth integrity |
| `tests/test_attribution.py` | Adds unit tests for attribution tracking logic |
| `tests/test_agents.py` | Updates MCP entrypoint assertion after module move |
| `src/context8/storage.py` | Connect-on-demand client + collection introspection + update_record |
| `src/context8/search/ranking.py` | Implements confidence/recency/feedback-based score boosting |
| `src/context8/search/engine.py` | New hybrid search engine with attribution + quality boost hooks |
| `src/context8/search/attribution.py` | Tracks per-strategy rank/score contributions for results |
| `src/context8/search/analyzer.py` | QueryAnalyzer extracted to its own module |
| `src/context8/search/__init__.py` | Exposes new search package surface |
| `src/context8/search.py` | Removes legacy monolithic search module |
| `src/context8/models.py` | Adds feedback stats + attribution + raw_score/boost_factors |
| `src/context8/mcp/tools.py` | Adds MCP tools for rating + solution-approach search + formatting |
| `src/context8/mcp/server.py` | New MCP server entrypoint wrapping tools module |
| `src/context8/mcp/__init__.py` | Exposes MCP app/run_server API |
| `src/context8/ingest/seed.py` | Adds deterministic seed slugs + routes seeding via ingest pipeline |
| `src/context8/ingest/pipeline.py` | Adds generic ingest pipeline + ingest stats |
| `src/context8/ingest/github.py` | Adds GitHub issue importer and extraction heuristics |
| `src/context8/ingest/__init__.py` | Exposes ingest package API (seed/pipeline/importer) |
| `src/context8/feedback.py` | Adds FeedbackService to persist agent success/failure ratings |
| `src/context8/embeddings/tokenizer.py` | Extracts BM25 tokenizer used for sparse vectors |
| `src/context8/embeddings/service.py` | Refactors embeddings to use BM25Tokenizer and new package layout |
| `src/context8/embeddings/__init__.py` | Exposes embeddings package surface |
| `src/context8/config.py` | Adds ranker tuning constants + updates MCP server command path |
| `src/context8/cli/ui.py` | Adds shared CLI helpers (docker compose selection, DB checks) |
| `src/context8/cli/main.py` | New Click CLI group wiring commands from cli/commands/ |
| `src/context8/cli/commands/serve.py` | Adds context8 serve command to run MCP server |
| `src/context8/cli/commands/ops.py` | Adds/updates stats/doctor/search CLI operations |
| `src/context8/cli/commands/lifecycle.py` | Adds start/stop/init commands for DB lifecycle + seeding |
| `src/context8/cli/commands/integrations.py` | Adds agent integration commands (add/remove) with aliases |
| `src/context8/cli/commands/ingest.py` | Adds import-github CLI command to ingest GitHub issues |
| `src/context8/cli/commands/bench.py` | Adds bench and demo CLI commands |
| `src/context8/cli/commands/__init__.py` | Exports CLI commands for main registration |
| `src/context8/cli/__init__.py` | Exposes CLI main entrypoint |
| `src/context8/cli.py` | Removes legacy monolithic CLI module |
| `src/context8/benchmark/runner.py` | Adds benchmark runner with ablation configurations |
| `src/context8/benchmark/ground_truth.py` | Adds ground-truth query set for benchmark evaluation |
| `src/context8/benchmark/__init__.py` | Exposes benchmark package surface |
| `src/context8/agents.py` | Updates agent config writer to new MCP module command |
| `src/context8/__main__.py` | Updates module entry to run new CLI main |
| `src/context8/__init__.py` | Bumps version to 0.2.0 |
| `pyproject.toml` | Updates ruff per-file ignores to new file paths |
| `docker-compose.yml` | Switches to fully-qualified docker.io image reference |
| `RESULTS.md` | Adds submission results template + reproduction steps |
| `README.md` | Updates docs for new capabilities, commands, and layout |

Comment thread src/context8/storage.py
Comment on lines +194 to +203
         named_vectors = self._discover_named_vectors(info)
         sparse_vectors = self._discover_sparse_vectors(info)
         return {
             "status": str(getattr(info, "status", "unknown")),
             "points": getattr(info, "points_count", 0),
-            "vectors": ["problem", "solution", "code_context"],
+            "vectors": named_vectors or ["problem", "solution", "code_context"],
             "named_vector_count": len(named_vectors),
             "sparse_vectors": sparse_vectors,
             "sparse_supported": bool(sparse_vectors),
             "hybrid_enabled": len(named_vectors) >= 2 and bool(sparse_vectors),

Copilot AI Apr 17, 2026

get_collection_info() returns a fallback list of vector names when discovery fails, but named_vector_count is still len(named_vectors) (0). Downstream checks like context8 doctor treat named_vector_count < 3 as a hard failure even though vectors reports the fallback 3. Consider making named_vector_count consistent with the vectors you return (or add a separate flag like named_vectors_discovered).
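
One way to address the inconsistency described above is sketched below, under the assumption that the reported count should describe the list actually returned and that a separate flag should say whether discovery succeeded. The helper name `summarize_vectors` and the `named_vectors_discovered` key (the latter suggested in the review comment) are hypothetical and not part of the merged code.

```python
from typing import Sequence

FALLBACK_VECTORS = ["problem", "solution", "code_context"]


def summarize_vectors(named_vectors: Sequence[str], sparse_vectors: Sequence[str]) -> dict:
    """Build the vector-related part of the collection info dict.

    The count always matches the returned list, while a separate flag records
    whether the names were really discovered or are only the fallback.
    """
    discovered = bool(named_vectors)
    vectors = list(named_vectors) if discovered else list(FALLBACK_VECTORS)
    return {
        "vectors": vectors,
        "named_vector_count": len(vectors),
        "named_vectors_discovered": discovered,
        "sparse_vectors": list(sparse_vectors),
        "sparse_supported": bool(sparse_vectors),
        # Hybrid mode is only claimed when discovery actually found the vectors.
        "hybrid_enabled": len(named_vectors) >= 2 and bool(sparse_vectors),
    }


info = summarize_vectors([], [])
```

With this shape, a health check such as context8 doctor could fail on `named_vectors_discovered` being false instead of on a count that contradicts the listed names.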

Copilot uses AI. Check for mistakes.
Comment on lines +24 to +38
    def __init__(
        self,
        storage: StorageService,
        embeddings: EmbeddingService,
        ranker: QualityRanker | None = None,
        dense_weight: float = DEFAULT_DENSE_WEIGHT,
        code_weight: float = DEFAULT_CODE_WEIGHT,
        sparse_weight: float = DEFAULT_SPARSE_WEIGHT,
    ):
        self.storage = storage
        self.embeddings = embeddings
        self.ranker = ranker or QualityRanker()
        self.dense_weight = dense_weight
        self.code_weight = code_weight
        self.sparse_weight = sparse_weight

Copilot AI Apr 17, 2026

SearchEngine accepts dense_weight/code_weight/sparse_weight and stores them, but they are not used when building fusion_weights (QueryAnalyzer weights are used directly). This makes the constructor parameters misleading and prevents config-level tuning. Either incorporate these weights into fusion_weights (e.g., multiply or use as defaults when QueryAnalyzer doesn't apply) or remove the parameters/fields.
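
The "multiply" option mentioned above could look roughly like this. It is a sketch of one possible policy, not the project's behavior: the function name, the strategy keys ("dense", "code", "sparse"), and the choice to normalize the result are all assumptions.

```python
from typing import Mapping


def build_fusion_weights(
    analyzer_weights: Mapping[str, float],
    engine_weights: Mapping[str, float],
) -> dict[str, float]:
    """Scale per-query analyzer weights by engine-level weights, then normalize.

    Strategies missing from the analyzer output keep the engine-level weight,
    so constructor parameters act as defaults as well as multipliers.
    """
    combined = {
        name: engine_weight * analyzer_weights.get(name, 1.0)
        for name, engine_weight in engine_weights.items()
    }
    total = sum(combined.values())
    if total <= 0:
        return combined  # nothing sensible to normalize
    return {name: value / total for name, value in combined.items()}


weights = build_fusion_weights(
    analyzer_weights={"dense": 0.5, "sparse": 0.5},
    engine_weights={"dense": 2.0, "code": 0.0, "sparse": 1.0},
)
```

Setting an engine-level weight to zero then disables a strategy regardless of what the analyzer proposes, which gives the config-level control the comment asks for.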

Comment on lines +45 to +49
    table.add_row("Status", "[green]HEALTHY[/]")

    if collection_info:
        table.add_row("Vector spaces", ", ".join(collection_info.get("vectors", [])))
        table.add_row("Status", collection_info.get("status", "unknown"))

Copilot AI Apr 17, 2026

The stats command adds a hard-coded "Status: HEALTHY" row and then (when collection_info is present) adds another "Status" row from the collection metadata. This produces duplicate/conflicting metrics in the output; consider renaming one (e.g., "DB health" vs "Collection status") or removing the hard-coded row.

Suggested change
-table.add_row("Status", "[green]HEALTHY[/]")
-if collection_info:
-    table.add_row("Vector spaces", ", ".join(collection_info.get("vectors", [])))
-    table.add_row("Status", collection_info.get("status", "unknown"))
+table.add_row("DB health", "[green]HEALTHY[/]")
+if collection_info:
+    table.add_row("Vector spaces", ", ".join(collection_info.get("vectors", [])))
+    table.add_row("Collection status", collection_info.get("status", "unknown"))

Comment on lines +42 to +44
        for token, freq in sorted(term_freqs.items()):
            idx = abs(hash(token)) % self.vocab_size
            weight = freq / (freq + 1.0)

Copilot AI Apr 17, 2026

BM25Tokenizer.encode() derives sparse indices via Python's built-in hash(), which is salted per process (PYTHONHASHSEED). That makes stored sparse vectors and query-time sparse vectors inconsistent across restarts/processes, effectively breaking sparse retrieval and hybrid fusion.
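
A process-independent alternative is to derive the index from a cryptographic or checksum digest instead of the built-in `hash()`. The sketch below uses `hashlib.blake2b` from the standard library; the function names, the default vocab_size, and the collision policy (keep the larger weight) are assumptions for illustration, while the `freq / (freq + 1.0)` weighting is copied from the snippet under review.

```python
import hashlib
from collections import Counter
from typing import Sequence


def stable_token_index(token: str, vocab_size: int) -> int:
    """Map a token to a bucket that is identical across processes and restarts."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % vocab_size


def encode(tokens: Sequence[str], vocab_size: int = 30000) -> tuple[list[int], list[float]]:
    """Return (indices, values) for a sparse vector built from term frequencies."""
    term_freqs = Counter(tokens)
    weights: dict[int, float] = {}
    for token, freq in sorted(term_freqs.items()):
        idx = stable_token_index(token, vocab_size)
        weight = freq / (freq + 1.0)
        # Assumed collision policy: keep the larger weight for a shared bucket.
        weights[idx] = max(weights.get(idx, 0.0), weight)
    indices = sorted(weights)
    return indices, [weights[i] for i in indices]


indices, values = encode(["typeerror", "typeerror", "undefined"])
```

Unlike `hash()`, the digest does not depend on PYTHONHASHSEED, so vectors written at ingest time and vectors computed at query time land in the same buckets. Any records indexed with the old scheme would need to be re-embedded after such a change.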

Comment thread src/context8/storage.py
Comment on lines 102 to 110
     @property
     def sparse_supported(self) -> bool:
         """Check if the collection supports sparse vectors."""
         if self._sparse_supported is None:
             try:
                 self.client.collections.get_info(COLLECTION_NAME)
-                self._sparse_supported = False  # Safe default
+                self._sparse_supported = False
             except Exception:
                 self._sparse_supported = False
         return self._sparse_supported

Copilot AI Apr 17, 2026

StorageService.sparse_supported always resolves to False when _sparse_supported is None (even if the existing collection supports sparse vectors). This will disable sparse search paths on fresh processes that connect to a pre-existing hybrid collection (e.g., when initialize() returns False because the collection already exists). Consider introspecting collection info and setting _sparse_supported based on discovered sparse vector config instead of hard-coding False.
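
The introspection suggested above might be sketched like this. The probe class, the `get_info` callable, and the `sparse_vectors` attribute on the returned info object are hypothetical; the real Actian client response may expose sparse configuration under a different name, so treat this only as the shape of the fix.

```python
from types import SimpleNamespace
from typing import Callable, Optional


class SparseSupportProbe:
    """Caches whether the existing collection exposes any sparse vector config."""

    def __init__(self, get_info: Callable[[], object]) -> None:
        self._get_info = get_info
        self._sparse_supported: Optional[bool] = None

    @property
    def sparse_supported(self) -> bool:
        if self._sparse_supported is None:
            try:
                info = self._get_info()
                # Assumed attribute name; empty or missing config means unsupported.
                self._sparse_supported = bool(getattr(info, "sparse_vectors", None))
            except Exception:
                # An unreachable DB or missing collection is treated as unsupported.
                self._sparse_supported = False
        return self._sparse_supported


hybrid = SparseSupportProbe(lambda: SimpleNamespace(sparse_vectors={"keywords": {}}))
dense_only = SparseSupportProbe(lambda: SimpleNamespace(sparse_vectors={}))
```

The point of the change is that a fresh process attached to a pre-existing hybrid collection reports True, instead of falling back to False unconditionally.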

@pathfindermilan pathfindermilan deleted the feat/framework-restructure branch April 25, 2026 14:42
hallelx2 added a commit that referenced this pull request May 2, 2026
README rewritten around the new SQLite-first install (single command:
pip install context8 && context8 init --seed && context8 add claude-code).
The Actian path is preserved as an "Optional: Actian VectorAI DB
backend" section with the hackathon-era install. The "Hackathon:
Advanced Features Used" section becomes "Capabilities (and how each
backend delivers them)" — the same three capabilities (named vectors,
hybrid fusion, filtered search) framed as backend-portable.

Architecture diagram redrawn: pluggable Protocol fanning out to two
concrete backends (SQLite vec0+FTS5 below, Actian gRPC container right).
Tech-stack table promotes sqlite-vec and FTS5 to primary, demotes
Actian to optional. Project-structure tree updated to reflect the new
storage/ package and search/fusion.py. v0.5.0 changelog entry added.

CLAUDE.md fully rewritten — new project overview, structure, key
design decisions (#1 is now "pluggable storage backend"), commands,
plus distinct SQLite Backend Notes / Actian Backend Notes sections.

RESULTS.md kept as the hackathon submission narrative but reframed:
the Actian-feature table is rewritten as a backend-portability table
(SQLite delivers each capability via vec0/FTS5/SQL+JSON1; Actian
delivers them via named vectors / sparse vectors / FilterBuilder).
Benchmark section now has placeholders for both backends so the
ablation can be run side-by-side.

All twelve docs/*.md design docs (CONCEPT, ARCHITECTURE, BOTTLENECKS,
PLAN-01..08, Hackathon Demo Video — Script) prepended with a
historical-artifact banner pointing readers to README.md / CLAUDE.md.
The hackathon design narrative is preserved in place; it just stops
being the canonical source for the current architecture.

tests/test_e2e.py:
- pytestmark adds pytest.mark.actian + a new skipif on
  CONTEXT8_BACKEND != "actian", so the Actian e2e suite skips
  cleanly under the default SQLite install (15 tests skip).
- Fixed pre-existing FeedbackService(storage, embeddings) arity
  mismatch on lines 281 and 300 — production constructor takes
  storage only (mcp/tools.py:44 confirms). Drop the second arg.
- isolated_collection fixture now patches actian_backend.COLLECTION_NAME
  (the captured-at-import-time copy) rather than the package module
  attribute — the module's attribute is no longer load-time-bound to
  COLLECTION_NAME after the storage package split.

Verification: 127 passed, 15 actian e2e skipped, ruff clean,
context8 init/doctor/stats/search/bench/export/import all green
under SQLite.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>