Ceres

Semantic search engine for open data portals

Quick Start • Features • Usage • Roadmap

Ceres harvests dataset metadata from CKAN open data portals and indexes it with vector embeddings, enabling semantic search across fragmented data sources.

Named after the Roman goddess of harvest and agriculture.

Why Ceres?

Open Data Galaxy — ML-generated visualization

_{354,000+ datasets (deduplicated to 270k) from 22 portals, embedded with all-MiniLM-L6-v2, projected to 3D via UMAP, and clustered with HDBSCAN. Each color is a portal; nearby points are semantically similar.}

Open data portals are everywhere, but finding the right dataset is still painful:

Keyword search fails: "public transport" won't find "mobility data" or "bus schedules"
Portals are fragmented: Italy alone has 20+ regional portals with different interfaces
No cross-portal search: You can't query Milano and Roma datasets together

Ceres solves this by creating a unified semantic index. Search by meaning, not just keywords.

$ ceres search "trasporto pubblico" --limit 3

Found 3 matching datasets:

1. [████████░░] [78%] TPL - Percorsi linee di superficie
   📍 https://dati.comune.milano.it
   🔗 https://dati.comune.milano.it/dataset/ds534-tpl-percorsi-linee-di-superficie

2. [████████░░] [76%] TPL - Fermate linee di superficie
   📍 https://dati.comune.milano.it
   🔗 https://dati.comune.milano.it/dataset/ds535-tpl-fermate-linee-di-superficie

3. [███████░░░] [72%] Mobilità: flussi veicolari rilevati dai spire
   📍 https://dati.comune.milano.it
   🔗 https://dati.comune.milano.it/dataset/ds418-mobilita-flussi-veicolari
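The percentages in the output above are similarity scores between the query and each dataset. As a rough illustration of the idea (not Ceres's actual code), the sketch below assumes the score is the cosine similarity between two embedding vectors and that the bar rounds the score to ten cells. The function names `cosine_similarity` and `score_bar` are hypothetical.

```rust
/// Cosine similarity between two embedding vectors.
/// Returns 0.0 if the lengths differ or either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Render a score in [0, 1] as a ten-cell bar followed by a percentage.
/// Assumption: the bar shows the score rounded to the nearest tenth.
fn score_bar(score: f32) -> String {
    let clamped = score.clamp(0.0, 1.0);
    let filled = (clamped * 10.0).round() as usize;
    let percent = (clamped * 100.0).round() as u32;
    format!(
        "[{}{}] [{}%]",
        "█".repeat(filled),
        "░".repeat(10 - filled),
        percent
    )
}
```

In a deployment backed by pgvector the similarity would normally be computed inside PostgreSQL rather than in application code; the function here only shows what the number means.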

Features

CKAN Harvester — Fetch datasets from any CKAN-compatible portal, including multilingual portals
Multi-portal Batch Harvest — Configure multiple portals in portals.toml and harvest them all at once
Streaming Harvest — Memory-efficient streaming pipeline for large portals (100k+ datasets)
Delta Detection — Only regenerate embeddings for changed datasets (99.8% API cost savings). See Harvesting Architecture
Batch Embeddings — Batched embedding API calls for higher throughput during harvest
Persistent Jobs — Recoverable database-backed job queue with automatic retries and exponential backoff
Graceful Shutdown — Safely interrupt harvesting to ensure data consistency and release in-progress jobs back to the queue
Real-time Progress — Live progress reporting during harvest with batch timestamp updates
Dry Run Mode — Preview what a harvest would do without writing to DB or calling embedding APIs
Semantic Search — Find datasets by meaning using vector embeddings
Pluggable Embeddings — Switchable embedding backend via trait (Gemini, OpenAI)
Bearer Token Auth — Protected admin endpoints with configurable API key authentication
Docker Support — Production-ready multi-stage Docker image and Docker Compose setup
Multi-format Export — Export to JSON, JSON Lines, CSV, or Parquet
Custom URL Templates — Support portals with non-standard frontend URLs
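The Persistent Jobs feature mentions automatic retries with exponential backoff. A minimal sketch of that retry schedule follows, assuming a doubling delay with an upper cap. The base delay, the cap, and the function name `backoff_delay` are illustrative and not taken from the Ceres source.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
/// Overflow in either the power or the multiplication falls back to the cap.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}
```

With a one-second base and a sixty-second cap, the delays grow as 1s, 2s, 4s, 8s and so on until they reach the cap. Production job queues often add random jitter on top so that many failed jobs do not retry at the same instant.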

Pre-configured Portals

Ceres comes with 25 verified CKAN portals ready to use, covering 354,000+ datasets:

Portal	Region	Datasets
Australia	Australia	~109,440
Italy (National)	Italy	~70,141
Ukraine	Ukraine	~39,790
HDX (Humanitarian)	Global	~26,654
NRW	Germany	~22,849
Ireland	Ireland	~21,855
Switzerland	Switzerland	~14,559
Toscana	Italy	~12,886
Tokyo	Japan	~9,707
Marche	Italy	~5,440
Romania	Romania	~5,038
Chile	Chile	~2,897
Aragón	Spain	~2,881
Emilia-Romagna	Italy	~2,871
Milano	Italy	~2,580
Puglia	Italy	~1,801
Trentino	Italy	~1,388
Umbria	Italy	~457
Lazio	Italy	~407
Roma	Italy	~365
Campania	Italy	~332
Sicilia	Italy	~186
Genova	Italy	~171
Liguria	Italy	~124
Napoli	Italy	~33

See examples/portals.toml for the full configuration. Want to add more? Check issue #19.

Tech Stack

Component	Technology
Language	Rust (async with Tokio)
Database	PostgreSQL 16+ with pgvector
Embeddings	Pluggable: Google Gemini, OpenAI (trait-based)
Portal Protocol	CKAN API v3
REST API	Axum with OpenAPI/Swagger UI

Quick Start

Prerequisites

Rust 1.87+
Docker & Docker Compose
Google Gemini API key (get one free) or OpenAI API key

Docker (recommended)

# Clone and configure
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
cp .env.example .env
# Edit .env with your API key and settings

# Start everything (DB + server)
docker compose up -d

From source

# Install from crates.io
cargo install ceres-search

# Or build from source
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
cargo build --release

Setup (from source)

# Start PostgreSQL with pgvector
docker compose up db -d

# Configure environment
cp .env.example .env
# Edit .env with your API key

# Run database migrations
make migrate

Tip: Run make help to see all available Makefile shortcuts.

Usage

Harvest datasets from a CKAN portal

ceres harvest https://dati.comune.milano.it

Tip: Running a harvest command for the first time without a config generates a pre-configured portals.toml automatically.

Search indexed datasets

ceres search "trasporto pubblico" --limit 10

Export datasets

# JSON Lines (default)
ceres export > datasets.jsonl

# JSON array
ceres export --format json > datasets.json

# CSV
ceres export --format csv > datasets.csv

# Filter by portal
ceres export --portal https://dati.comune.milano.it

View statistics

ceres stats

CLI Reference

ceres <COMMAND>

Commands:
  harvest  Harvest datasets from a CKAN portal or batch harvest from portals.toml
  search   Search indexed datasets using semantic similarity
  export   Export indexed datasets to various formats
  stats    Show database statistics
  help     Print help information

Harvest Flags:
  --full-sync       Force full sync even if incremental is available
  --dry-run         Preview what would be harvested without writing to DB
  --portal <NAME>   Harvest a specific portal by name from config
  --config <PATH>   Use custom portals.toml location

Environment Variables:
  DATABASE_URL          PostgreSQL connection string
  EMBEDDING_PROVIDER    Embedding backend: gemini or openai (default: gemini)
  GEMINI_API_KEY        Google Gemini API key (when using gemini provider)
  OPENAI_API_KEY        OpenAI API key (when using openai provider)
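To make the relationship between these variables concrete, here is a hedged sketch of how a program could resolve them: `EMBEDDING_PROVIDER` defaults to `gemini`, and the matching API key must then be present. The types `Provider` and `EmbeddingConfig` and both functions are hypothetical; Ceres's own configuration code may differ.

```rust
use std::env;

#[derive(Debug, PartialEq)]
enum Provider {
    Gemini,
    OpenAi,
}

#[derive(Debug, PartialEq)]
struct EmbeddingConfig {
    provider: Provider,
    api_key: String,
}

/// Resolve the embedding settings from any variable lookup function.
/// Taking the lookup as a parameter keeps the logic testable without
/// touching the real process environment.
fn resolve_embedding_config<F>(lookup: F) -> Result<EmbeddingConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Documented default: gemini.
    let name = lookup("EMBEDDING_PROVIDER").unwrap_or_else(|| "gemini".to_string());
    let (provider, key_var) = match name.as_str() {
        "gemini" => (Provider::Gemini, "GEMINI_API_KEY"),
        "openai" => (Provider::OpenAi, "OPENAI_API_KEY"),
        other => return Err(format!("unknown EMBEDDING_PROVIDER: {}", other)),
    };
    let api_key = lookup(key_var).ok_or_else(|| format!("{} is not set", key_var))?;
    Ok(EmbeddingConfig { provider, api_key })
}

/// Convenience wrapper that reads the real process environment.
fn from_process_env() -> Result<EmbeddingConfig, String> {
    resolve_embedding_config(|key| env::var(key).ok())
}
```

Failing early with a message that names the missing variable tends to be friendlier than a later authentication error from the embedding API.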

REST API

Start the server:

ceres-server

Available endpoints:

GET /api/v1/health — Health check
GET /api/v1/stats — Database statistics
GET /api/v1/search — Semantic search
GET /api/v1/portals — List configured portals
GET /api/v1/portals/:name/stats — Portal-specific statistics
POST /api/v1/portals/:name/harvest — Trigger harvest for a portal
POST /api/v1/harvest — Trigger harvest for all portals
GET /api/v1/harvest/status — Check harvest job status
GET /api/v1/export — Export datasets
GET /api/v1/datasets/:id — Get dataset by ID
GET /swagger-ui — Interactive API docs
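The harvest endpoints are protected by a bearer token (see ADMIN_API_KEY under the server environment variables). The sketch below builds the raw HTTP/1.1 text of such a request so the header layout is visible. It assumes the endpoint accepts an empty body and that the portal name is already URL-safe; neither is confirmed by this README, and `build_harvest_request` is a hypothetical helper.

```rust
/// Build a raw HTTP/1.1 request that triggers a harvest for one portal.
/// Assumptions: empty request body, portal name needs no percent-encoding.
fn build_harvest_request(host: &str, port: u16, portal: &str, admin_key: &str) -> String {
    format!(
        "POST /api/v1/portals/{portal}/harvest HTTP/1.1\r\n\
         Host: {host}:{port}\r\n\
         Authorization: Bearer {key}\r\n\
         Content-Length: 0\r\n\
         Connection: close\r\n\
         \r\n",
        portal = portal,
        host = host,
        port = port,
        key = admin_key
    )
}
```

In practice an HTTP client library or the Swagger UI listed above would send this for you; the point is only that protected endpoints expect an `Authorization: Bearer <ADMIN_API_KEY>` header.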

Server environment variables:

PORT                   Server port (default: 3000)
HOST                   Server host (default: 0.0.0.0)
EMBEDDING_PROVIDER     Embedding backend: gemini or openai (default: gemini)
EMBEDDING_MODEL        Model name (uses provider default if unset)
ADMIN_API_KEY          Bearer token for protected endpoints (harvest, etc.)
PORTALS_CONFIG         Path to portals.toml (optional)
CORS_ALLOWED_ORIGINS   Comma-separated allowed origins (default: *)
RATE_LIMIT_RPS         Requests per second per IP (default: 10)
RATE_LIMIT_BURST       Burst size for rate limiting (default: 30)
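RATE_LIMIT_RPS and RATE_LIMIT_BURST read like the two parameters of a token bucket, a common way to implement per-client rate limiting. The sketch below illustrates that technique under the assumption that the server works this way; the README does not say which algorithm is used, and `TokenBucket` is a hypothetical type. Time is passed in explicitly (seconds on a monotonic clock) to keep the example deterministic.

```rust
/// Token bucket: refills at `rate` tokens per second, holding at most `burst`.
struct TokenBucket {
    rate: f64,
    burst: f64,
    tokens: f64,
    last: f64,
}

impl TokenBucket {
    /// Start with a full bucket at time 0.
    fn new(rate: f64, burst: f64) -> Self {
        TokenBucket { rate, burst, tokens: burst, last: 0.0 }
    }

    /// Try to spend one token at time `now` (seconds, monotonic).
    /// Returns true if the request is allowed.
    fn try_acquire(&mut self, now: f64) -> bool {
        let elapsed = (now - self.last).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last = self.last.max(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With the documented defaults (10 requests per second, burst of 30), a client could send 30 requests at once and would then be held to 10 per second. A per-IP limiter would keep one bucket per client address, for example in a map keyed by IP.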

Architecture

_{High-level architecture of Ceres components and data flow.}

Harvesting Internals

_{Two-tier optimization flow: incremental sync + delta detection.}
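Delta detection can be pictured as comparing a content hash of each harvested dataset against the hash stored from the previous run, and re-embedding only when they differ. The sketch below is one way to do that and makes several assumptions: the hashed fields (title and description), the `Dataset` struct, and both function names are hypothetical, and `DefaultHasher` stands in for whatever stable hash a real system would persist (its output is not guaranteed stable across Rust releases).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified view of a harvested dataset.
struct Dataset {
    id: String,
    title: String,
    description: String,
}

/// Hash the fields that feed the embedding (assumed: title and description).
fn content_hash(d: &Dataset) -> u64 {
    let mut hasher = DefaultHasher::new();
    d.title.hash(&mut hasher);
    d.description.hash(&mut hasher);
    hasher.finish()
}

/// Return the ids of datasets that are new or changed, and record their
/// new hashes in `stored`. Unchanged datasets are skipped, so no embedding
/// call is needed for them.
fn datasets_to_embed(stored: &mut HashMap<String, u64>, harvested: &[Dataset]) -> Vec<String> {
    let mut changed = Vec::new();
    for d in harvested {
        let hash = content_hash(d);
        if stored.get(&d.id) != Some(&hash) {
            stored.insert(d.id.clone(), hash);
            changed.push(d.id.clone());
        }
    }
    changed
}
```

On a second harvest of an unchanged portal this returns an empty list, which is where the savings in embedding API calls come from.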

_{Circuit breaker states and recovery behavior for embedding requests.}
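For readers unfamiliar with the pattern, a circuit breaker usually moves between three states: closed (requests flow), open (requests are rejected for a cooldown period), and half-open (a trial request decides whether to close again). The sketch below is a generic, simplified version of that state machine and not the Ceres implementation. The threshold, the cooldown, and all names are assumptions, and time is passed in as seconds to keep it deterministic.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { since: f64 },
    HalfOpen,
}

struct CircuitBreaker {
    state: State,
    failures: u32,
    threshold: u32,
    cooldown: f64,
}

impl CircuitBreaker {
    fn new(threshold: u32, cooldown: f64) -> Self {
        CircuitBreaker { state: State::Closed, failures: 0, threshold, cooldown }
    }

    /// Should a request be attempted at time `now` (seconds)?
    /// Simplification: while half-open, every caller is let through.
    fn allow(&mut self, now: f64) -> bool {
        match self.state {
            State::Closed | State::HalfOpen => true,
            State::Open { since } => {
                if now - since >= self.cooldown {
                    self.state = State::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    /// A successful call resets the failure count and closes the circuit.
    fn record_success(&mut self) {
        self.failures = 0;
        self.state = State::Closed;
    }

    /// A failure while half-open, or reaching the threshold, opens the circuit.
    fn record_failure(&mut self, now: f64) {
        self.failures += 1;
        if self.state == State::HalfOpen || self.failures >= self.threshold {
            self.state = State::Open { since: now };
        }
    }
}
```

Wrapping embedding requests this way stops a harvest from hammering a provider that is already failing, and lets it resume automatically once a trial request succeeds.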

Roadmap

For past releases, see the CHANGELOG.

v0.4.0 — Scale & Ecosystem

Parquet export endpoint in REST API (#98)
HNSW index tuning for production (#92)
Multi-tenancy support (#91)
Local embeddings via Ollama (#79)
Schema-level search (#68)
Socrata / DCAT-AP portal support (#61)

Backlog

Standalone library support (#35)
data.europa.eu integration

Related Projects

databricks-ceres-pipeline — A Databricks medallion architecture pipeline that provides batch analytics, ML features, and dashboards on top of the same open data index.

Contributing

Contributions are welcome! See CONTRIBUTING.md for setup instructions and guidelines.

License

Apache-2.0 — see LICENSE.

_{Built with pgvector, Google Gemini, and CKAN.}
