Ceres harvests metadata from CKAN open data portals and indexes it with vector embeddings, enabling semantic search across fragmented data sources.
Named after the Roman goddess of harvest and agriculture.
354,000+ datasets (dedup. to 270k) from 22 portals, embedded with all-MiniLM-L6-v2, projected to 3D via UMAP, and clustered with HDBSCAN. Each color is a portal — nearby points are semantically similar.
Open data portals are everywhere, but finding the right dataset is still painful:
- Keyword search fails: "public transport" won't find "mobility data" or "bus schedules"
- Portals are fragmented: Italy alone has 20+ regional portals with different interfaces
- No cross-portal search: You can't query Milano and Roma datasets together
Ceres solves this by creating a unified semantic index. Search by meaning, not just keywords.
$ ceres search "trasporto pubblico" --limit 3
Found 3 matching datasets:
1. [████████░░] [78%] TPL - Percorsi linee di superficie
📍 https://dati.comune.milano.it
🔗 https://dati.comune.milano.it/dataset/ds534-tpl-percorsi-linee-di-superficie
2. [████████░░] [76%] TPL - Fermate linee di superficie
📍 https://dati.comune.milano.it
🔗 https://dati.comune.milano.it/dataset/ds535-tpl-fermate-linee-di-superficie
3. [███████░░░] [72%] Mobilità: flussi veicolari rilevati dai spire
📍 https://dati.comune.milano.it
🔗 https://dati.comune.milano.it/dataset/ds418-mobilita-flussi-veicolari
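The match percentages above come from vector similarity rather than keyword overlap. A toy sketch of the idea, using hand-made 3-dimensional vectors in place of real 384-dimensional MiniLM embeddings (the vectors are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: semantically related phrases end up close together
# in vector space, even with zero keyword overlap.
vectors = {
    "public transport": [0.9, 0.2, 0.1],
    "bus schedules":    [0.8, 0.3, 0.1],
    "soil chemistry":   [0.1, 0.1, 0.9],
}

query = vectors["public transport"]
for phrase, vec in vectors.items():
    print(f"{phrase}: {cosine_similarity(query, vec):.2f}")
```

"bus schedules" scores far higher than "soil chemistry" against the query, which is exactly why a search for "public transport" surfaces mobility datasets a keyword match would miss.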
- CKAN Harvester — Fetch datasets from any CKAN-compatible portal, including multilingual portals
- Multi-portal Batch Harvest — Configure multiple portals in `portals.toml` and harvest them all at once
- Streaming Harvest — Memory-efficient streaming pipeline for large portals (100k+ datasets)
- Delta Detection — Only regenerate embeddings for changed datasets (99.8% API cost savings). See Harvesting Architecture
- Batch Embeddings — Batched embedding API calls for higher throughput during harvest
- Persistent Jobs — Recoverable database-backed job queue with automatic retries and exponential backoff
- Graceful Shutdown — Safely interrupt harvesting to ensure data consistency and release in-progress jobs back to the queue
- Real-time Progress — Live progress reporting during harvest with batch timestamp updates
- Dry Run Mode — Preview what a harvest would do without writing to DB or calling embedding APIs
- Semantic Search — Find datasets by meaning using vector embeddings
- Pluggable Embeddings — Switchable embedding backend via trait (Gemini, OpenAI)
- Bearer Token Auth — Protected admin endpoints with configurable API key authentication
- Docker Support — Production-ready multi-stage Docker image and Docker Compose setup
- Multi-format Export — Export to JSON, JSON Lines, CSV, or Parquet
- Custom URL Templates — Support portals with non-standard frontend URLs
Ceres comes with 25 verified CKAN portals ready to use, covering 354,000+ datasets:
| Portal | Region | Datasets |
|---|---|---|
| Australia | Australia | ~109,440 |
| Italy (National) | Italy | ~70,141 |
| Ukraine | Ukraine | ~39,790 |
| HDX (Humanitarian) | Global | ~26,654 |
| NRW | Germany | ~22,849 |
| Ireland | Ireland | ~21,855 |
| Switzerland | Switzerland | ~14,559 |
| Toscana | Italy | ~12,886 |
| Tokyo | Japan | ~9,707 |
| Marche | Italy | ~5,440 |
| Romania | Romania | ~5,038 |
| Chile | Chile | ~2,897 |
| Aragón | Spain | ~2,881 |
| Emilia-Romagna | Italy | ~2,871 |
| Milano | Italy | ~2,580 |
| Puglia | Italy | ~1,801 |
| Trentino | Italy | ~1,388 |
| Umbria | Italy | ~457 |
| Lazio | Italy | ~407 |
| Roma | Italy | ~365 |
| Campania | Italy | ~332 |
| Sicilia | Italy | ~186 |
| Genova | Italy | ~171 |
| Liguria | Italy | ~124 |
| Napoli | Italy | ~33 |
See examples/portals.toml for the full configuration. Want to add more? Check issue #19.
| Component | Technology |
|---|---|
| Language | Rust (async with Tokio) |
| Database | PostgreSQL 16+ with pgvector |
| Embeddings | Pluggable: Google Gemini, OpenAI (trait-based) |
| Portal Protocol | CKAN API v3 |
| REST API | Axum with OpenAPI/Swagger UI |
- Rust 1.87+
- Docker & Docker Compose
- Google Gemini API key (get one free) or OpenAI API key
# Clone and configure
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
cp .env.example .env
# Edit .env with your API key and settings
# Start everything (DB + server)
docker compose up -d
# Install from crates.io
cargo install ceres-search
# Or build from source
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
cargo build --release
# Start PostgreSQL with pgvector
docker compose up db -d
# Configure environment
cp .env.example .env
# Edit .env with your API key
# Run database migrations
make migrate
Tip: Run `make help` to see all available Makefile shortcuts.
ceres harvest https://dati.comune.milano.it
Tip: Running a harvest command for the first time without a config generates a pre-configured `portals.toml` automatically.
ceres search "trasporto pubblico" --limit 10
# JSON Lines (default)
ceres export > datasets.jsonl
# JSON array
ceres export --format json > datasets.json
# CSV
ceres export --format csv > datasets.csv
# Filter by portal
ceres export --portal https://dati.comune.milano.it
ceres stats
ceres <COMMAND>
Commands:
harvest Harvest datasets from a CKAN portal or batch harvest from portals.toml
search Search indexed datasets using semantic similarity
export Export indexed datasets to various formats
stats Show database statistics
help Print help information
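The `export` command defaults to JSON Lines: one JSON object per line, which streams well and is trivial to consume. A minimal consumer sketch (the field names in the sample records are illustrative, not the tool's documented schema):

```python
import json

# Stand-in for the contents of a `ceres export > datasets.jsonl` file.
sample = """\
{"title": "TPL - Percorsi linee di superficie", "portal": "https://dati.comune.milano.it"}
{"title": "Mobilita: flussi veicolari", "portal": "https://dati.comune.milano.it"}
"""

# Parse line by line; blank lines are skipped, so files stream safely.
datasets = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(len(datasets), "datasets loaded")
```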
Harvest Flags:
--full-sync Force full sync even if incremental is available
--dry-run Preview what would be harvested without writing to DB
--portal <NAME> Harvest a specific portal by name from config
--config <PATH> Use custom portals.toml location
Environment Variables:
DATABASE_URL PostgreSQL connection string
EMBEDDING_PROVIDER Embedding backend: gemini or openai (default: gemini)
GEMINI_API_KEY Google Gemini API key (when using gemini provider)
OPENAI_API_KEY OpenAI API key (when using openai provider)
Start the server:
ceres-server
Available endpoints:
- `GET /api/v1/health` — Health check
- `GET /api/v1/stats` — Database statistics
- `GET /api/v1/search` — Semantic search
- `GET /api/v1/portals` — List configured portals
- `GET /api/v1/portals/:name/stats` — Portal-specific statistics
- `POST /api/v1/portals/:name/harvest` — Trigger harvest for a portal
- `POST /api/v1/harvest` — Trigger harvest for all portals
- `GET /api/v1/harvest/status` — Check harvest job status
- `GET /api/v1/export` — Export datasets
- `GET /api/v1/datasets/:id` — Get dataset by ID
- `GET /swagger-ui` — Interactive API docs
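A minimal client sketch for the search endpoint. The query parameter names (`q`, `limit`) and the localhost base URL are assumptions; `/swagger-ui` documents the actual contract:

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:3000"  # assumes the default PORT

def search_request(query: str, limit: int = 10) -> Request:
    # Build a GET request; parameter names are guesses, check /swagger-ui.
    url = f"{BASE}/api/v1/search?" + urlencode({"q": query, "limit": limit})
    return Request(url, headers={"Accept": "application/json"})

req = search_request("trasporto pubblico", limit=3)
print(req.full_url)
# Send with urllib.request.urlopen(req) against a running ceres-server.
```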
Server environment variables:
PORT Server port (default: 3000)
HOST Server host (default: 0.0.0.0)
EMBEDDING_PROVIDER Embedding backend: gemini or openai (default: gemini)
EMBEDDING_MODEL Model name (uses provider default if unset)
ADMIN_API_KEY Bearer token for protected endpoints (harvest, etc.)
PORTALS_CONFIG Path to portals.toml (optional)
CORS_ALLOWED_ORIGINS Comma-separated allowed origins (default: *)
RATE_LIMIT_RPS Requests per second per IP (default: 10)
RATE_LIMIT_BURST Burst size for rate limiting (default: 30)
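`RATE_LIMIT_RPS` and `RATE_LIMIT_BURST` describe the classic token-bucket pattern: a steady refill rate plus a burst allowance. A toy model of how the two settings interact (not Ceres's actual implementation):

```python
class TokenBucket:
    def __init__(self, rps: float, burst: int):
        self.rate = rps           # tokens refilled per second (RATE_LIMIT_RPS)
        self.capacity = burst     # maximum tokens held (RATE_LIMIT_BURST)
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rps=10, burst=30)  # the documented defaults
# 31 back-to-back requests at t=0: the first 30 pass (burst), the 31st is rejected.
results = [bucket.allow(0.0) for _ in range(31)]
print(results.count(True), "allowed")
```

After a one-second pause the bucket has refilled 10 tokens, so traffic at up to 10 requests/second sustains indefinitely while short spikes up to 30 are absorbed.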
For past releases, see the CHANGELOG.
- Parquet export endpoint in REST API (#98)
- HNSW index tuning for production (#92)
- Multi-tenancy support (#91)
- Local embeddings via Ollama (#79)
- Schema-level search (#68)
- Socrata / DCAT-AP portal support (#61)
- Standalone library support (#35)
- data.europa.eu integration
- databricks-ceres-pipeline — A Databricks medallion architecture pipeline that provides batch analytics, ML features, and dashboards on top of the same open data index.
Contributions are welcome! See CONTRIBUTING.md for setup instructions and guidelines.
Apache-2.0 — see LICENSE.



