MoSPI Microdata SQL API Gateway — Architecture & MVP
1) What we’re building (problem → solution)
Problem. MoSPI’s microdata (PLFS, HCES, etc.) are exposed as full file downloads (CSV/Excel/SPSS/Stata).
Users must wrangle the files locally; developers have no API; and there is no role-based access, rate limiting,
or unified schema. Each dataset differs structurally, which blocks quick, reproducible analysis.
Solution. A secure API Gateway that:
• Lets users run parameterized SQL (or templated queries) against standardized, columnar, versioned
copies of the survey datasets.
• Returns results in JSON (and optional CSV/Parquet) with pagination, caching, and metadata.
• Enforces RBAC, quotas/rate limits, per-dataset access tiers, and comprehensive auditing.
• Ships with an explorable catalog + schema registry + query templates for common tabulations.
2) High-level architecture
┌──────────────────────────────────────────────────────────┐
│                Public / Partner Consumers                │
│ - Web apps   - Jupyter / R   - ETL tools   - BI   - curl │
└───────────▲──────────────────────────────────────────────┘
            │ HTTPS (JWT/OAuth2), API keys, rate limiting, mTLS (partners)
┌───────────┴─────────────────────────┐
│             API GATEWAY             │
│  - AuthN / AuthZ (RBAC)             │
│  - Quotas & rate limiting           │
│  - Request validation (OpenAPI)     │
│  - Caching (hot queries)            │
│  - Observability (metrics, traces)  │
└───────────┬─────────────────────────┘
            │ gRPC/HTTP
┌───────────┴──────────────────────────────────────────┐
│              QUERY SERVICE (FastAPI/Go)              │
│  - SQL validation/sandboxing                         │
│  - Query rewrites (row/col filters)                  │
│  - Templated queries (safe)                          │
│  - Pagination/streaming to JSON                      │
└───────────┬──────────────────────────────────────────┘
            │ ANSI SQL → execution
┌───────────┴──────────────────────────────────────────┐
│                ANALYTIC ENGINE LAYER                 │
│  Option A: DuckDB (embedded) / MotherDuck            │
│  Option B: Trino/Presto cluster (scale-out)          │
│  Option C: ClickHouse (OLAP)                         │
│  - Reads Parquet in object storage                   │
│  - Pushdown filters/projections                      │
└───────────┬──────────────────────────────────────────┘
            │ Parquet + metadata
┌───────────┴──────────────────────────────────────────┐
│         DATA LAKE (S3-compatible / on-prem)          │
│  - Versioned Parquet (Delta/Iceberg optional)        │
│  - Dataset versions + partitions (year/round/state)  │
│  - Encryption at rest                                │
└───────────┬──────────────────────────────────────────┘
            │ Ingestion/standardization
┌───────────┴──────────────────────────────────────────┐
│         INGESTION PIPELINE (Airflow/Dagster)         │
│  - Read CSV/SPSS/Stata (pyreadstat/Arrow)            │
│  - Normalize schema + codebooks to YAML              │
│  - Encode categorical maps, nulls, types             │
│  - Write partitioned Parquet + stats                 │
└──────────────────────────────────────────────────────┘
3) Data model & catalog
• Dataset: dataset_id, title, round, year, geography, access_tier (open/restricted/confidential),
license, source_url.
• Table: table_id, dataset_id, name, description, row_count, bytes, partitions.
• Column: column_id, table_id, name, type, nullable, labels (value mapping), description.
• Versioning: dataset_version with semantic tags: vYYYY.RR (e.g., PLFS.v2023.R4).
• Codebooks: YAML/JSON describing labels (e.g., employment status codes) and missing-value
semantics.
The Schema Registry publishes a stable logical schema in the API even if the physical layout changes (e.g.,
new partitions, additional columns).
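For concreteness, the catalog entities above can be sketched as Pydantic models. The field names follow the lists in this section; the models themselves are illustrative, not a fixed schema.

from enum import Enum
from pydantic import BaseModel

class AccessTier(str, Enum):
    open = "open"
    restricted = "restricted"
    confidential = "confidential"

class Dataset(BaseModel):
    dataset_id: str
    title: str
    round: str
    year: int
    geography: str
    access_tier: AccessTier
    license: str
    source_url: str

class Table(BaseModel):
    table_id: str
    dataset_id: str
    name: str
    description: str | None = None
    row_count: int
    bytes: int
    partitions: list[str] = []

class Column(BaseModel):
    column_id: str
    table_id: str
    name: str
    type: str
    nullable: bool
    labels: dict[str, str] | None = None  # value code → human-readable label
    description: str | None = None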
4) Access model (RBAC + data governance)
• Roles: public, registered, researcher, partner, admin.
• Scopes: datasets:read, queries:run, meta:read, admin:*.
• Binding: per dataset/table/column; e.g., researchers can query restricted columns after DUA
approval, while the public role sees suppressed columns or coarser geography.
• Row-level security: automatic predicates injected by the Query Service (e.g., drop districts for
public, allow only state-level for registered); see the sketch after this list.
• Column-level security: masking/redaction for quasi-identifiers.
• Query timeouts & limits: max rows, CPU seconds, memory; disallow CREATE/ALTER/DROP; allow
only SELECT.
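One way to implement the row-level rewrite is to parse the query and extend its WHERE clause. A minimal sketch using the sqlglot parsing library; the geo_level column and the exact predicates are hypothetical:

import sqlglot
from sqlglot import exp

# Hypothetical role → predicate map.
ROLE_PREDICATES = {
    "public": "geo_level <> 'district'",   # drop districts for public
    "registered": "geo_level = 'state'",   # state-level only for registered
}

def inject_row_filter(sql: str, role: str) -> str:
    tree = sqlglot.parse_one(sql)
    if not isinstance(tree, exp.Select):
        raise ValueError("only plain SELECT statements are accepted")
    predicate = ROLE_PREDICATES.get(role)
    # .where() ANDs the predicate into any existing WHERE clause.
    return tree.where(predicate).sql() if predicate else sql

# inject_row_filter("SELECT state, n FROM tab WHERE year = 2023", "public")
# → "SELECT state, n FROM tab WHERE year = 2023 AND geo_level <> 'district'"

Rewriting the parsed tree, rather than concatenating strings, keeps the injected predicate safe from operator-precedence and injection mistakes.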
5) Rate limiting & quotas
• Per-API-key and per-IP rate limits (token bucket; see the sketch after this list), with burst and
sustained thresholds.
• Daily/monthly result-row quotas per tier.
• Cost controls: caps on partitions scanned and on result size.
• Cached results keyed by canonicalized SQL + dataset version; TTLs vary with dataset update cadence.
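A minimal token-bucket sketch (illustrative; in production the Gateway’s rate-limit plugin or a Redis-backed limiter would enforce this across pods):

import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float       # tokens refilled per second (sustained rate)
    capacity: float   # bucket size (burst threshold)
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# e.g., 5 requests/second sustained with bursts of up to 20:
# bucket = TokenBucket(rate=5, capacity=20)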
6) API design (REST with OpenAPI)
Base URL: /v1/
6.1 Auth
• POST /auth/token → OAuth2 password/authorization_code (first-party) or API key provisioning
(partners).
6.2 Catalog & discovery
• GET /datasets → list with filters (survey=PLFS, year=2023).
• GET /datasets/{id} → metadata, versions, tables.
• GET /datasets/{id}/schema → normalized columns with labels.
• GET /tables/{id}/preview?limit=50 → sample rows.
• GET /columns/search?q=employment&dataset=PLFS → semantic search.
6.3 Query endpoints
• POST /sql/execute → body: {sql, params, format=json|csv|parquet, pagination: {limit, cursor}}.
• POST /sql/template/{name}/run → vetted, parameterized templates (safe for public apps).
• GET /jobs/{id} → status & result URL (for long-running queries, if async mode is enabled).
6.4 Responses
• JSON default envelope:
{
  "request_id": "uuid",
  "dataset_version": "PLFS.v2023.R4",
  "stats": {"rows": 1234, "elapsed_ms": 210, "bytes_scanned": 945000},
  "result": [{"state": "KA", "employment_rate": 0.51}, ...],
  "next_cursor": null
}
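A hypothetical client call against /sql/execute, matching the request body in 6.3 and the envelope above (the host, token, and plfs_person table name are placeholders):

import requests

resp = requests.post(
    "https://api.microdata.gov.in/v1/sql/execute",
    headers={"Authorization": "Bearer <token>"},
    json={
        "sql": ("SELECT state, AVG(employed) AS employment_rate "
                "FROM plfs_person WHERE year = $year GROUP BY state"),
        "params": {"year": 2023},   # bound server-side, never interpolated
        "format": "json",
        "pagination": {"limit": 1000},
    },
    timeout=30,
)
resp.raise_for_status()
page = resp.json()
print(page["dataset_version"], page["stats"]["rows"], page["next_cursor"])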
7) Query engine choices
MVP: DuckDB embedded in the Query Service reading Parquet from object storage; excellent for columnar
scans, filter/projection pushdown, and returns Arrow/JSON fast. Concurrency can be horizontally scaled with
multiple stateless pods behind the Gateway.
Scale-out: Trino/Presto with S3 connector if concurrency or data size grows; or ClickHouse for heavy OLAP
aggregations and materialized views.
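A sketch of the MVP pattern, assuming DuckDB’s httpfs extension and a hypothetical bucket layout; S3 endpoint and credentials are deployment-specific:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")  # S3-compatible object storage access
con.execute("LOAD httpfs;")
# (Endpoint/credentials are configured via SET s3_* options, omitted here.)
tbl = con.execute("""
    SELECT state, COUNT(*) AS n
    FROM read_parquet('s3://mospi-lake/plfs/**/*.parquet',
                      hive_partitioning = true)
    WHERE year = 2023            -- pruned to the year=2023 partitions
    GROUP BY state
""").fetch_arrow_table()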
8) Ingestion & standardization
• Use pyreadstat and Apache Arrow to load SPSS/Stata files into Arrow tables (see the sketch after
this list).
• Normalize types and harmonize categorical codes using the codebooks.
• Partition by survey/year/round/state (or PSU/SS); write Parquet with statistics and ZSTD
compression.
• Maintain Delta Lake / Apache Iceberg table metadata for ACID versioning and time travel (optional
but recommended).
• Produce dataset manifests (YAML) consumed by the API for schema and labels.
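A minimal ingestion sketch with pyreadstat and Arrow; the file and lake paths are hypothetical, and production code would write a properly partitioned dataset rather than a single file:

import pyreadstat
import pyarrow as pa
import pyarrow.parquet as pq

# SPSS input; use pyreadstat.read_dta for Stata files.
df, meta = pyreadstat.read_sav("raw/plfs_2023_r4.sav")
# meta.variable_value_labels holds the code → label maps for the codebook.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(
    table,
    "lake/plfs/year=2023/round=4/part-0.parquet",  # hive-style layout
    compression="zstd",
    write_statistics=True,  # per-column min/max stats enable pruning
)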
9) Safety & privacy controls
• Release-approved columns only in public tiers; sensitive columns are either masked or dropped.
• Small-cell suppression for tabulations: automatic suppression when counts < k (configurable, e.g.,
k=10), plus complementary suppression to prevent inference; see the sketch after this list.
• Noise addition (optional) for public endpoints: calibrated Laplace noise for differentially private
counts when required.
• A query-syntax linter blocks re-identification patterns (e.g., too many group-by keys).
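A sketch of primary small-cell suppression; complementary suppression (masking additional cells so suppressed values cannot be recovered by subtraction) needs further logic and is omitted:

def suppress_small_cells(rows: list[dict], count_key: str = "n", k: int = 10) -> list[dict]:
    """Null out cells whose count falls below the suppression threshold k."""
    out = []
    for row in rows:
        cell = dict(row)
        if cell.get(count_key) is not None and cell[count_key] < k:
            cell[count_key] = None  # primary suppression
        out.append(cell)
    return out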
10) Observability & governance
• Audit log: who queried what and when, row/byte counts, result retention (a sample record shape
follows this list).
• Metrics: P95 latency, QPS, error rates, top queries, datasets hit.
• Tracing: distributed tracing across Gateway, Query Service, and storage.
• Catalog UI: simple web app for browsing datasets, trying queries, and saving templates.
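One possible shape for an audit-log record, mirroring the bullets above (illustrative; field names are not prescribed):

import json, time, uuid

def audit_record(user_id: str, sql: str, rows: int, bytes_scanned: int) -> str:
    """Serialize one audit entry; ship to an append-only store in production."""
    return json.dumps({
        "ts": time.time(),
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "sql": sql,
        "rows": rows,
        "bytes_scanned": bytes_scanned,
    })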
11) Deployment
• Containerized services; deploy on Kubernetes (or VM auto-scaling).
• API Gateway: Kong/NGINX/Envoy; JWT verification plugin; rate limit plugin.
• Secrets: KMS/HSM-backed; rotate keys; mTLS to storage.
• Backups: object storage versioning + immutable backups.
12) Sample OpenAPI (excerpt)
openapi: 3.0.3
info:
  title: MoSPI Microdata SQL API
  version: 1.0.0
servers:
  - url: https://api.microdata.gov.in/v1
paths:
  /sql/execute:
    post:
      security: [{ bearerAuth: [] }]
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                sql: { type: string }
                params: { type: object, additionalProperties: true }
                format: { type: string, enum: [json, csv, parquet], default: json }
                pagination:
                  type: object
                  properties:
                    limit: { type: integer, default: 1000, maximum: 10000 }
                    cursor: { type: string }
      responses:
        '200':
          description: Query result
          content:
            application/json:
              schema:
                type: object
                properties:
                  request_id: { type: string }
                  dataset_version: { type: string }
                  stats:
                    type: object
                    properties:
                      rows: { type: integer }
                      elapsed_ms: { type: integer }
                      bytes_scanned: { type: integer }
                  result:
                    type: array
                    items: { type: object }
                  next_cursor: { type: string, nullable: true }
components:
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
13) MVP server skeleton (Python / FastAPI + DuckDB)
The following illustrates the key control points; production code adds full validation, metrics,
and hardening.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import duckdb, time, uuid

app = FastAPI()

class Pagination(BaseModel):
    limit: int = 1000
    cursor: str | None = None

class SqlRequest(BaseModel):
    sql: str
    params: dict[str, object] | None = None
    format: str = "json"
    pagination: Pagination | None = None

def enforce_policies(sql: str, role: str) -> str:
    s = sql.strip().lower()
    if not s.startswith("select"):
        raise HTTPException(400, "Only SELECT queries are allowed")
    # Example: inject a row-level predicate for the public role. The wrap
    # below is an illustrative string rewrite over a hypothetical geo_level
    # column; production code should rewrite the parsed AST (see section 4).
    if role == "public":
        sql = f"SELECT * FROM ({sql}) t WHERE t.geo_level = 'state'"
    return sql
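Continuing the skeleton, a minimal /sql/execute handler might look like the following sketch; role extraction is stubbed (real code derives it from the verified JWT), and connection pooling, caching, and quota checks are omitted:

@app.post("/v1/sql/execute")
def execute(req: SqlRequest, role: str = "public"):  # role stub for illustration
    sql = enforce_policies(req.sql, role)
    limit = (req.pagination or Pagination()).limit
    t0 = time.time()
    con = duckdb.connect()  # sketch: per-request connection, no pooling
    tbl = con.execute(sql, req.params or None).fetch_arrow_table()
    return {
        "request_id": str(uuid.uuid4()),
        "stats": {
            "rows": tbl.num_rows,
            "elapsed_ms": int((time.time() - t0) * 1000),
        },
        "result": tbl.slice(0, limit).to_pylist(),
        "next_cursor": None,
    }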