AndreaBozzo/Ceres

Ceres

Semantic search engine for open data portals

Quick Start | Features | Usage | Roadmap


Ceres harvests dataset metadata from CKAN open data portals and indexes it with vector embeddings, enabling semantic search across fragmented data sources.

Named after the Roman goddess of harvest and agriculture.

Why Ceres?

Open Data Galaxy — ML-generated visualization
354,000+ datasets (about 270,000 after deduplication) from 22 portals, embedded with all-MiniLM-L6-v2, projected to 3D via UMAP, and clustered with HDBSCAN. Each color is a portal; nearby points are semantically similar.

Open data portals are everywhere, but finding the right dataset is still painful:

  • Keyword search fails: "public transport" won't find "mobility data" or "bus schedules"
  • Portals are fragmented: Italy alone has 20+ regional portals with different interfaces
  • No cross-portal search: You can't query Milano and Roma datasets together

Ceres solves this by creating a unified semantic index. Search by meaning, not just keywords.

$ ceres search "trasporto pubblico" --limit 3

Found 3 matching datasets:

1. [████████░░] [78%] TPL - Percorsi linee di superficie
   📍 https://dati.comune.milano.it
   🔗 https://dati.comune.milano.it/dataset/ds534-tpl-percorsi-linee-di-superficie

2. [████████░░] [76%] TPL - Fermate linee di superficie
   📍 https://dati.comune.milano.it
   🔗 https://dati.comune.milano.it/dataset/ds535-tpl-fermate-linee-di-superficie

3. [███████░░░] [72%] Mobilità: flussi veicolari rilevati dai spire
   📍 https://dati.comune.milano.it
   🔗 https://dati.comune.milano.it/dataset/ds418-mobilita-flussi-veicolari
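
To make "search by meaning" concrete, here is a minimal sketch of ranking by vector similarity. It assumes cosine similarity as the metric (this README does not say which distance Ceres uses with pgvector), and the names `cosine_similarity` and `rank` are hypothetical, not part of the Ceres API. Tiny hand-written vectors stand in for real embeddings.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 when either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Rank (title, embedding) pairs by similarity to the query, best first,
/// keeping at most `limit` results. Hypothetical helper for illustration.
fn rank<'a>(
    query: &[f32],
    docs: &'a [(&'a str, Vec<f32>)],
    limit: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .map(|(title, emb)| (*title, cosine_similarity(query, emb)))
        .collect();
    // Sort descending by score; total_cmp gives a total order on floats.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

Because the ranking depends only on the embedding vectors, a query and a dataset title can match without sharing any keyword, provided the embedding model places them close together.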

Features

  • CKAN Harvester — Fetch datasets from any CKAN-compatible portal, including multilingual portals
  • Multi-portal Batch Harvest — Configure multiple portals in portals.toml and harvest them all at once
  • Streaming Harvest — Memory-efficient streaming pipeline for large portals (100k+ datasets)
  • Delta Detection — Only regenerate embeddings for changed datasets (99.8% API cost savings). See Harvesting Architecture
  • Batch Embeddings — Batched embedding API calls for higher throughput during harvest
  • Persistent Jobs — Recoverable database-backed job queue with automatic retries and exponential backoff
  • Graceful Shutdown — Safely interrupt harvesting to ensure data consistency and release in-progress jobs back to the queue
  • Real-time Progress — Live progress reporting during harvest with batch timestamp updates
  • Dry Run Mode — Preview what a harvest would do without writing to DB or calling embedding APIs
  • Semantic Search — Find datasets by meaning using vector embeddings
  • Pluggable Embeddings — Switchable embedding backend via trait (Gemini, OpenAI)
  • Bearer Token Auth — Protected admin endpoints with configurable API key authentication
  • Docker Support — Production-ready multi-stage Docker image and Docker Compose setup
  • Multi-format Export — Export to JSON, JSON Lines, CSV, or Parquet
  • Custom URL Templates — Support portals with non-standard frontend URLs
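
The retry behavior mentioned under Persistent Jobs can be illustrated with a small sketch of exponential backoff. The function name `backoff_delay`, the base delay, and the cap are assumptions chosen for illustration; the README does not state the values or formula Ceres actually uses.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// capped at `max`. Hypothetical helper, not the Ceres implementation.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_pow and checked_mul return None on overflow;
    // in that case fall back to the cap.
    2u32.checked_pow(attempt)
        .and_then(|factor| base.checked_mul(factor))
        .map_or(max, |delay| delay.min(max))
}
```

With a 1 second base and a 60 second cap, successive attempts wait 1 s, 2 s, 4 s, 8 s, and so on, until the cap is reached. Production systems often add random jitter on top so that many failed jobs do not retry at the same instant.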

Pre-configured Portals

Ceres comes with 25 verified CKAN portals ready to use, covering 354,000+ datasets:

Portal              Region       Datasets
Australia           Australia    ~109,440
Italy (National)    Italy        ~70,141
Ukraine             Ukraine      ~39,790
HDX (Humanitarian)  Global       ~26,654
NRW                 Germany      ~22,849
Ireland             Ireland      ~21,855
Switzerland         Switzerland  ~14,559
Toscana             Italy        ~12,886
Tokyo               Japan        ~9,707
Marche              Italy        ~5,440
Romania             Romania      ~5,038
Chile               Chile        ~2,897
Aragón              Spain        ~2,881
Emilia-Romagna      Italy        ~2,871
Milano              Italy        ~2,580
Puglia              Italy        ~1,801
Trentino            Italy        ~1,388
Umbria              Italy        ~457
Lazio               Italy        ~407
Roma                Italy        ~365
Campania            Italy        ~332
Sicilia             Italy        ~186
Genova              Italy        ~171
Liguria             Italy        ~124
Napoli              Italy        ~33

See examples/portals.toml for the full configuration. Want to add more? Check issue #19.

Tech Stack

Component         Technology
Language          Rust (async with Tokio)
Database          PostgreSQL 16+ with pgvector
Embeddings        Pluggable: Google Gemini, OpenAI (trait-based)
Portal Protocol   CKAN API v3
REST API          Axum with OpenAPI/Swagger UI

Quick Start

Prerequisites

  • Rust 1.87+
  • Docker & Docker Compose
  • Google Gemini API key (get one free) or OpenAI API key

Docker (recommended)

# Clone and configure
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
cp .env.example .env
# Edit .env with your API key and settings

# Start everything (DB + server)
docker compose up -d

From source

# Install from crates.io
cargo install ceres-search

# Or build from source
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
cargo build --release

Setup (from source)

# Start PostgreSQL with pgvector
docker compose up db -d

# Configure environment
cp .env.example .env
# Edit .env with your API key

# Run database migrations
make migrate

Tip: Run make help to see all available Makefile shortcuts.

Usage

Harvest datasets from a CKAN portal

ceres harvest https://dati.comune.milano.it

Tip: Running a harvest command for the first time without a config generates a pre-configured portals.toml automatically.

Search indexed datasets

ceres search "trasporto pubblico" --limit 10

Export datasets

# JSON Lines (default)
ceres export > datasets.jsonl

# JSON array
ceres export --format json > datasets.json

# CSV
ceres export --format csv > datasets.csv

# Filter by portal
ceres export --portal https://dati.comune.milano.it

View statistics

ceres stats

CLI Reference

ceres <COMMAND>

Commands:
  harvest  Harvest datasets from a CKAN portal or batch harvest from portals.toml
  search   Search indexed datasets using semantic similarity
  export   Export indexed datasets to various formats
  stats    Show database statistics
  help     Print help information

Harvest Flags:
  --full-sync       Force full sync even if incremental is available
  --dry-run         Preview what would be harvested without writing to DB
  --portal <NAME>   Harvest a specific portal by name from config
  --config <PATH>   Use custom portals.toml location

Environment Variables:
  DATABASE_URL          PostgreSQL connection string
  EMBEDDING_PROVIDER    Embedding backend: gemini or openai (default: gemini)
  GEMINI_API_KEY        Google Gemini API key (when using gemini provider)
  OPENAI_API_KEY        OpenAI API key (when using openai provider)
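
As a rough illustration of how the environment variables above fit together, the sketch below maps an EMBEDDING_PROVIDER value to a backend and to the API key variable it requires. The `Provider` enum and both function names are hypothetical; only the variable names, the accepted values, and the gemini default come from the reference above.

```rust
/// Embedding backends named in the environment-variable reference.
#[derive(Debug, PartialEq)]
enum Provider {
    Gemini,
    OpenAi,
}

/// Interpret the value of EMBEDDING_PROVIDER. `None` means the variable
/// is unset, in which case the documented default (gemini) applies.
fn parse_provider(value: Option<&str>) -> Result<Provider, String> {
    match value {
        None | Some("gemini") => Ok(Provider::Gemini),
        Some("openai") => Ok(Provider::OpenAi),
        Some(other) => Err(format!("unknown EMBEDDING_PROVIDER: {other}")),
    }
}

/// Name of the API-key variable that the chosen provider needs.
fn api_key_var(provider: &Provider) -> &'static str {
    match provider {
        Provider::Gemini => "GEMINI_API_KEY",
        Provider::OpenAi => "OPENAI_API_KEY",
    }
}
```

In a real program the input would come from `std::env::var("EMBEDDING_PROVIDER").ok()`, passed in via `as_deref()`. Keeping the parsing separate from the environment lookup makes the logic easy to test.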

REST API

Start the server:

ceres-server

Available endpoints:

  • GET /api/v1/health — Health check
  • GET /api/v1/stats — Database statistics
  • GET /api/v1/search — Semantic search
  • GET /api/v1/portals — List configured portals
  • GET /api/v1/portals/:name/stats — Portal-specific statistics
  • POST /api/v1/portals/:name/harvest — Trigger harvest for a portal
  • POST /api/v1/harvest — Trigger harvest for all portals
  • GET /api/v1/harvest/status — Check harvest job status
  • GET /api/v1/export — Export datasets
  • GET /api/v1/datasets/:id — Get dataset by ID
  • GET /swagger-ui — Interactive API docs

Server environment variables:

PORT                   Server port (default: 3000)
HOST                   Server host (default: 0.0.0.0)
EMBEDDING_PROVIDER     Embedding backend: gemini or openai (default: gemini)
EMBEDDING_MODEL        Model name (uses provider default if unset)
ADMIN_API_KEY          Bearer token for protected endpoints (harvest, etc.)
PORTALS_CONFIG         Path to portals.toml (optional)
CORS_ALLOWED_ORIGINS   Comma-separated allowed origins (default: *)
RATE_LIMIT_RPS         Requests per second per IP (default: 10)
RATE_LIMIT_BURST       Burst size for rate limiting (default: 30)
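
One common way to read a requests-per-second setting combined with a burst size is as a token bucket. The sketch below assumes that model; the README does not say which algorithm the server uses, and `TokenBucket` is a hypothetical type. Time is passed in explicitly so the behavior is deterministic.

```rust
/// Token bucket for a single client: holds at most `burst` tokens and
/// refills at `rps` tokens per second. Illustrative only.
struct TokenBucket {
    rps: f64,
    burst: f64,
    tokens: f64,
    last_secs: f64,
}

impl TokenBucket {
    /// Start with a full bucket at time `now_secs`.
    fn new(rps: f64, burst: f64, now_secs: f64) -> Self {
        Self { rps, burst, tokens: burst, last_secs: now_secs }
    }

    /// Refill for the elapsed time, then try to spend one token.
    /// Returns true if the request is allowed.
    fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rps).min(self.burst);
        // Never move the clock backwards.
        self.last_secs = self.last_secs.max(now_secs);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Under this model, the defaults (10 requests per second, burst of 30) would let a client send 30 requests at once and then sustain 10 per second. Per-IP limiting would keep one bucket per client address, for example in a `HashMap<IpAddr, TokenBucket>`.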

Architecture

Ceres Architecture Diagram
High-level architecture of Ceres components and data flow.

Harvesting Internals

Harvesting Flow Diagram
Two-tier optimization flow: incremental sync + delta detection.
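
Delta detection can be sketched as comparing a fingerprint of each dataset's content against the one stored from the previous harvest. This is an assumption about the general technique, not a description of Ceres internals: the function names are hypothetical, the choice of hashed fields (title and description) is illustrative, and `DefaultHasher` is only a stand-in for a stable digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Fingerprint of the fields that feed the embedding.
/// DefaultHasher output is not guaranteed stable across Rust releases,
/// so a persisted index would need a stable digest instead.
fn content_hash(title: &str, description: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    title.hash(&mut hasher);
    description.hash(&mut hasher);
    hasher.finish()
}

/// Returns true when the dataset is new or its content changed, i.e. when
/// its embedding must be (re)generated. Stores the latest hash either way.
fn needs_embedding(
    stored: &mut HashMap<String, u64>,
    id: &str,
    title: &str,
    description: &str,
) -> bool {
    let hash = content_hash(title, description);
    // insert returns the previous value for this id, if any.
    stored.insert(id.to_string(), hash) != Some(hash)
}
```

Only datasets for which `needs_embedding` returns true would be sent to the embedding API, so a re-harvest of a mostly unchanged portal makes very few embedding calls.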
Circuit Breaker Diagram
Circuit breaker states and recovery behavior for embedding requests.
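
For readers unfamiliar with the pattern, the following is a minimal sketch of a classic three-state circuit breaker. The state names, failure threshold, and cooldown are assumptions based on the general pattern; the actual states and recovery rules used by Ceres are those shown in the diagram, which this sketch does not claim to reproduce.

```rust
/// Classic three-state circuit breaker.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,   // requests flow normally
    Open,     // requests are rejected without calling the API
    HalfOpen, // trial requests are allowed to probe recovery
}

struct CircuitBreaker {
    state: State,
    failures: u32,
    failure_threshold: u32,
    opened_at_secs: f64,
    cooldown_secs: f64,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, cooldown_secs: f64) -> Self {
        Self {
            state: State::Closed,
            failures: 0,
            failure_threshold,
            opened_at_secs: 0.0,
            cooldown_secs,
        }
    }

    /// Should a request be attempted at time `now_secs`?
    fn allow(&mut self, now_secs: f64) -> bool {
        if self.state == State::Open
            && now_secs - self.opened_at_secs >= self.cooldown_secs
        {
            self.state = State::HalfOpen; // cooldown elapsed: start probing
        }
        self.state != State::Open
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.state = State::Closed;
    }

    fn record_failure(&mut self, now_secs: f64) {
        self.failures += 1;
        // A failed probe, or too many consecutive failures, opens the circuit.
        if self.state == State::HalfOpen || self.failures >= self.failure_threshold {
            self.state = State::Open;
            self.opened_at_secs = now_secs;
        }
    }
}
```

A caller would check `allow` before each embedding request and report the outcome with `record_success` or `record_failure`, so that a failing provider is not hammered with requests while it recovers.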

Roadmap

For past releases, see the CHANGELOG.

v0.4.0 — Scale & Ecosystem

  • Parquet export endpoint in REST API (#98)
  • HNSW index tuning for production (#92)
  • Multi-tenancy support (#91)
  • Local embeddings via Ollama (#79)
  • Schema-level search (#68)
  • Socrata / DCAT-AP portal support (#61)

Backlog

  • Standalone library support (#35)
  • data.europa.eu integration

Related Projects

  • databricks-ceres-pipeline — A Databricks medallion architecture pipeline that provides batch analytics, ML features, and dashboards on top of the same open data index.

Contributing

Contributions are welcome! See CONTRIBUTING.md for setup instructions and guidelines.

License

Apache-2.0 — see LICENSE.


Built with pgvector, Google Gemini, and CKAN.