
MoSPI Microdata SQL API Gateway — Architecture & MVP
1) What we’re building (problem → solution)
Problem. MoSPI’s microdata (PLFS, HCES, etc.) are exposed as full file downloads (CSV/Excel/SPSS/Stata).
Users must wrangle the files locally; developers have no API; and there is no role-based access, rate limiting, or
unified schema. Each dataset differs structurally, blocking quick, reproducible analysis.

Solution. A secure API Gateway that:

• Lets users run parameterized SQL (or templated queries) against standardized, columnar, versioned
copies of the survey datasets.
• Returns results in JSON (and optional CSV/Parquet) with pagination, caching, and metadata.
• Enforces RBAC, quotas/rate limits, per-dataset access tiers, and comprehensive auditing.
• Ships with an explorable catalog + schema registry + query templates for common tabulations.

2) High-level architecture

┌──────────────────────────────────────────────────────┐
│ Public / Partner Consumers                           │
│ - Web apps - Jupyter / R - ETL tools - BI - curl     │
└───────────▲──────────────────────────────────────────┘
            │ HTTPS (JWT/OAuth2), API Keys, Rate Limit, mTLS (partners)
┌───────────┴──────────────────────────────────────────┐
│ API GATEWAY                                          │
│ - AuthN / AuthZ (RBAC)                               │
│ - Quotas & Rate limiting                             │
│ - Request validation (OpenAPI)                       │
│ - Caching (hot queries)                              │
│ - Observability (metrics, traces)                    │
└───────────┬──────────────────────────────────────────┘
            │ gRPC/HTTP
┌───────────┴──────────────────────────────────────────┐
│ QUERY SERVICE (FastAPI/Go)                           │
│ - SQL validation/sandboxing                          │
│ - Query rewrites (row/col filters)                   │
│ - Templated queries (safe)                           │
│ - Pagination/Streaming to JSON                       │
└───────────┬──────────────────────────────────────────┘
            │ ANSI SQL → execution
┌───────────┴──────────────────────────────────────────┐
│ ANALYTIC ENGINE LAYER                                │
│ Option A: DuckDB (embedded) / MotherDuck             │
│ Option B: Trino/Presto cluster (scale-out)           │
│ Option C: ClickHouse (OLAP)                          │
│ - Reads Parquet in object storage                    │
│ - Pushdown filters/projections                       │
└───────────┬──────────────────────────────────────────┘
            │ Parquet + metadata
┌───────────┴──────────────────────────────────────────┐
│ DATA LAKE (S3-compatible / on-prem)                  │
│ - Versioned Parquet (Delta/Iceberg optional)         │
│ - Dataset versions + partitions (year/round/state)   │
│ - Encryption at rest                                 │
└───────────┬──────────────────────────────────────────┘
            │ Ingestion/standardization
┌───────────┴──────────────────────────────────────────┐
│ INGESTION PIPELINE (Airflow/Dagster)                 │
│ - Read CSV/SPSS/Stata (pyreadstat/Arrow)             │
│ - Normalize schema + codebooks to YAML               │
│ - Encode categorical maps, nulls, types              │
│ - Write partitioned Parquet + stats                  │
└──────────────────────────────────────────────────────┘

3) Data model & catalog


• Dataset: dataset_id, title, round, year, geography, access_tier (open/restricted/confidential), license, source_url.
• Table: table_id, dataset_id, name, description, row_count, bytes, partitions.
• Column: column_id, table_id, name, type, nullable, labels (value mapping), description.
• Versioning: dataset_version with semantic tags vYYYY.RR (e.g., PLFS.v2023.R4).
• Codebooks: YAML/JSON describing labels (e.g., employment status codes) and missing-value semantics.

The Schema Registry publishes a stable logical schema in the API even if the physical layout changes (e.g.,
new partitions, additional columns).
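
To make the catalog concrete, here is a minimal sketch (assuming Pydantic, which the MVP skeleton in section 13 already uses) of how these entities could be modelled; the field names follow the list above, everything else is illustrative:

from pydantic import BaseModel

class Column(BaseModel):
    column_id: str
    table_id: str
    name: str
    type: str                              # logical type exposed by the API, e.g. "int32", "string"
    nullable: bool = True
    labels: dict[str, str] | None = None   # value mapping taken from the codebook
    description: str | None = None

class Table(BaseModel):
    table_id: str
    dataset_id: str
    name: str
    description: str | None = None
    row_count: int
    bytes: int
    partitions: list[str] = []             # e.g. ["year", "round", "state"]

class Dataset(BaseModel):
    dataset_id: str
    title: str
    round: str
    year: int
    geography: str
    access_tier: str                       # open / restricted / confidential
    license: str
    source_url: str
    versions: list[str] = []               # e.g. ["PLFS.v2023.R4"]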

4) Access model (RBAC + data governance)


• Roles: public, registered, researcher, partner, admin.
• Scopes: datasets:read, queries:run, meta:read, admin:*.
• Binding: per dataset/table/column; e.g., researchers can query restricted columns after DUA approval; the public role sees suppressed columns or coarser geography.
• Row-level security: automatic predicates injected by the Query Service (e.g., drop districts for public, allow only state-level for registered); see the sketch after this list.
• Column-level security: masking/redaction for quasi-identifiers.
• Query timeouts & limits: max rows, CPU seconds, memory; disallow CREATE/ALTER/DROP; allow only SELECT.
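
A minimal sketch of how the Query Service might encode these bindings; the roles and scopes come from the list above, while the geo_level column, the specific predicates, and the outer-query rewrite are assumptions:

from fastapi import HTTPException

# Scopes granted to each role (roles and scopes as listed above).
ROLE_SCOPES = {
    "public":     {"datasets:read", "meta:read", "queries:run"},
    "registered": {"datasets:read", "meta:read", "queries:run"},
    "researcher": {"datasets:read", "meta:read", "queries:run"},
    "partner":    {"datasets:read", "meta:read", "queries:run"},
    "admin":      {"datasets:read", "meta:read", "queries:run", "admin:*"},
}

# Row-level security: predicate injected per role (geo_level column is assumed).
ROW_FILTERS = {
    "public":     "geo_level <> 'district'",   # drop district-level rows for public
    "registered": "geo_level = 'state'",       # state-level only for registered
}

def authorize(role: str, scope: str) -> None:
    # Reject the request before any SQL is executed if the role lacks the scope.
    if scope not in ROLE_SCOPES.get(role, set()):
        raise HTTPException(403, f"Role '{role}' lacks scope '{scope}'")

def apply_row_filter(sql: str, role: str) -> str:
    """Wrap an already-validated SELECT in an outer query carrying the role's predicate."""
    predicate = ROW_FILTERS.get(role)
    if predicate is None:
        return sql
    return f"SELECT * FROM ({sql}) AS q WHERE {predicate}"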

5) Rate limiting & quotas


• Per-API-key and per-IP rate limits (token bucket; see the sketch after this list), with burst and sustained thresholds.
• Daily/monthly result-row quotas per tier.
• Cost controls: max partitions scanned and max result size.
• Cached results keyed by canonicalized SQL + dataset version; TTL tied to the dataset update cadence.
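
A minimal in-process token-bucket sketch for the per-key limits above (a real deployment would keep this state in Redis or in the gateway's rate-limit plugin; the rates are illustrative):

import time

class TokenBucket:
    """Allows `rate` requests/second on average, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per API key, e.g. 5 req/s sustained with bursts of 20.
buckets: dict[str, TokenBucket] = {}

def check_rate_limit(api_key: str) -> bool:
    bucket = buckets.setdefault(api_key, TokenBucket(rate=5.0, capacity=20))
    return bucket.allow()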

6) API design (REST with OpenAPI)


Base URL: /v1/

6.1 Auth

• POST /auth/token → OAuth2 password/authorization_code (first-party) or API key provisioning (partners).

6.2 Catalog & discovery

• GET /datasets → list with filters (survey=PLFS, year=2023).
• GET /datasets/{id} → metadata, versions, tables.
• GET /datasets/{id}/schema → normalized columns with labels.
• GET /tables/{id}/preview?limit=50 → sample rows.
• GET /columns/search?q=employment&dataset=PLFS → semantic search.

6.3 Query endpoints

• POST /sql/execute → body: { sql, params, format=json|csv|parquet, pagination: {limit, cursor} }.
• POST /sql/template/{name}/run → vetted, parameterized templates (safe for public apps).
• GET /jobs/{id} → status & result URL (for long-running queries, if async mode enabled).

6.4 Responses

• JSON default envelope:

{
  "request_id": "uuid",
  "dataset_version": "PLFS.v2023.R4",
  "stats": {"rows": 1234, "elapsed_ms": 210, "bytes_scanned": 945000},
  "result": [{"state": "KA", "employment_rate": 0.51}, ...],
  "next_cursor": null
}
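
For illustration, a client call against /sql/execute might look as follows (the host, the table name plfs_person, the :year parameter style, and the token are placeholders, not part of the published API):

import requests

resp = requests.post(
    "https://api.microdata.gov.in/v1/sql/execute",
    headers={"Authorization": "Bearer <token>"},
    json={
        "sql": (
            "SELECT state, AVG(employed) AS employment_rate "
            "FROM plfs_person WHERE year = :year GROUP BY state"
        ),
        "params": {"year": 2023},
        "format": "json",
        "pagination": {"limit": 1000},
    },
    timeout=60,
)
resp.raise_for_status()
body = resp.json()

# Envelope fields as shown above.
print(body["dataset_version"], body["stats"]["rows"])
for row in body["result"]:
    print(row["state"], row["employment_rate"])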

7) Query engine choices
MVP: DuckDB embedded in the Query Service, reading Parquet from object storage. It excels at columnar
scans and filter/projection pushdown, and returns Arrow/JSON quickly. Concurrency scales horizontally by
running multiple stateless pods behind the Gateway.

Scale-out: Trino/Presto with S3 connector if concurrency or data size grows; or ClickHouse for heavy OLAP
aggregations and materialized views.
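
A minimal sketch of the MVP path: DuckDB querying Parquet directly from S3-compatible storage via the httpfs extension. The endpoint, bucket layout, and column names are assumptions:

import duckdb

con = duckdb.connect()  # in-process, stateless; one per Query Service worker
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint='minio.internal:9000'; SET s3_use_ssl=false;")
# Credentials via SET s3_access_key_id / s3_secret_access_key (omitted here).

# Partition pruning + projection pushdown: only year=2023 files and two columns are read.
result = con.execute("""
    SELECT state, AVG(employed) AS employment_rate
    FROM read_parquet('s3://mospi-lake/plfs/v2023.R4/year=2023/*/*.parquet')
    GROUP BY state
""").arrow()   # Arrow table; cheap to serialize to JSON/CSV/Parquet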

8) Ingestion & standardization


• Use pyreadstat and Apache Arrow to load SPSS/Stata into Arrow tables (see the sketch after this list).
• Normalize types, harmonize categorical codes using codebooks.
• Partition by survey/year/round/state (or PSU/SS), write Parquet with statistics and ZSTD
compression.
• Maintain Delta Lake / Apache Iceberg table metadata for ACID versioning and time travel (optional
but recommended).
• Produce dataset manifests (YAML) consumed by the API for schema and labels.
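
A hedged sketch of one ingestion step under the approach above; the file paths, partition columns, and codebook layout are assumptions, and survey-specific harmonization logic is omitted:

import pyreadstat
import pyarrow as pa
import pyarrow.parquet as pq
import yaml

# 1) Read a Stata file; `meta` carries variable labels and value labels (codebook material).
df, meta = pyreadstat.read_dta("raw/plfs_2023_r4_person.dta")

# 2) Normalize types and add partition columns (a `state` column is assumed to exist).
df["year"] = 2023
df["round"] = "R4"

# 3) Write partitioned Parquet with statistics and ZSTD compression.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(
    table,
    root_path="lake/plfs/v2023.R4",
    partition_cols=["year", "round", "state"],
    compression="zstd",
)

# 4) Emit a codebook manifest consumed by the API's schema registry.
codebook = {
    "dataset": "PLFS.v2023.R4",
    "columns": [
        {"name": col,
         "label": meta.column_names_to_labels.get(col),
         "values": meta.variable_value_labels.get(col)}
        for col in df.columns
    ],
}
with open("lake/plfs/v2023.R4/codebook.yaml", "w") as f:
    yaml.safe_dump(codebook, f, sort_keys=False)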

9) Safety & privacy controls


• Public tiers expose only release-approved columns; sensitive columns are masked or dropped.
• Small-cell suppression for tabulations: automatic suppression when counts < k (configurable, e.g., k=10) plus complementary suppression to prevent inference (see the sketch after this list).
• Noise addition (optional) for public endpoints: calibrated Laplace noise for DP counts when required.
• Query syntax linter blocks re-identification patterns (e.g., too many group-by keys).
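
A minimal sketch of primary small-cell suppression and Laplace noise as described above (k and epsilon are illustrative; complementary suppression in production requires a proper suppression algorithm and is not shown):

import numpy as np

K_MIN = 10          # suppress cells with fewer than k observations
EPSILON = 1.0       # differential-privacy budget for noisy public counts (illustrative)

def suppress_small_cells(rows: list[dict], count_key: str = "count") -> list[dict]:
    """Replace counts below the threshold with None (primary suppression)."""
    out = []
    for row in rows:
        row = dict(row)
        if row[count_key] is not None and row[count_key] < K_MIN:
            row[count_key] = None   # suppressed cell
        out.append(row)
    return out

def add_laplace_noise(count: int, sensitivity: float = 1.0) -> int:
    """Calibrated Laplace mechanism for a counting query (scale = sensitivity / epsilon)."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / EPSILON)
    return max(0, int(round(count + noise)))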

10) Observability & governance


• Audit log: who queried what, when, row/byte counts, result retention.
• Metrics: P95 latency, QPS, error rates, top queries, datasets hit.
• Tracing: distributed tracing across Gateway, Query Service, and storage.
• Catalog UI: simple web app for browsing datasets, trying queries, and saving templates.

11) Deployment
• Containerized services; deploy on Kubernetes (or VM auto-scaling).
• API Gateway: Kong/NGINX/Envoy; JWT verification plugin; rate limit plugin.
• Secrets: KMS/HSM backed; rotate keys; mTLS to storage.
• Backups: object storage versioning + immutable backups.

12) Sample OpenAPI (excerpt)

openapi: 3.0.3
info:
  title: MoSPI Microdata SQL API
  version: 1.0.0
servers:
  - url: https://api.microdata.gov.in/v1
paths:
  /sql/execute:
    post:
      security: [{ bearerAuth: [] }]
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                sql: { type: string }
                params: { type: object, additionalProperties: true }
                format: { type: string, enum: [json, csv, parquet], default: json }
                pagination:
                  type: object
                  properties:
                    limit: { type: integer, default: 1000, maximum: 10000 }
                    cursor: { type: string }
      responses:
        '200':
          description: Query result
          content:
            application/json:
              schema:
                type: object
                properties:
                  request_id: { type: string }
                  dataset_version: { type: string }
                  stats:
                    type: object
                    properties:
                      rows: { type: integer }
                      elapsed_ms: { type: integer }
                      bytes_scanned: { type: integer }
                  result:
                    type: array
                    items: { type: object }
                  next_cursor: { type: string, nullable: true }
components:
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT

13) MVP server skeleton (Python / FastAPI + DuckDB)


The following illustrates the key control points; production code adds full validation, metrics,
and hardening.

from fastapi import FastAPI, HTTPException, Depends
from pydantic import BaseModel
import duckdb, json, time, uuid

app = FastAPI()

class Pagination(BaseModel):
    limit: int = 1000
    cursor: str | None = None

class SqlRequest(BaseModel):
    sql: str
    params: dict[str, object] | None = None
    format: str = "json"
    pagination: Pagination | None = None

def enforce_policies(sql: str, role: str) -> str:
    s = sql.strip().lower()
    if not s.startswith("select"):
        raise HTTPException(400, "Only SELECT queries are allowed")
    # Example: inject row-level predicate for public role
    if role == "public":
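        # --- The uploaded document is truncated at this point. What follows is a
        # --- hedged, illustrative continuation (not the original code): the
        # --- geo_level column, the wrapped-query rewrite, and the stubbed role
        # --- dependency are assumptions.
        sql = f"SELECT * FROM ({sql}) AS q WHERE q.geo_level = 'state'"
    return sql

def current_role() -> str:
    # Stub: a real deployment would derive the role from the verified JWT / API key.
    return "public"

@app.post("/v1/sql/execute")
def execute_sql(req: SqlRequest, role: str = Depends(current_role)):
    safe_sql = enforce_policies(req.sql, role)
    limit = req.pagination.limit if req.pagination else 1000
    started = time.time()
    con = duckdb.connect()                      # stateless, per-request; reads Parquet
    cur = con.execute(safe_sql)
    rows = cur.fetchmany(limit)
    cols = [d[0] for d in cur.description]
    return {
        "request_id": str(uuid.uuid4()),
        "dataset_version": "PLFS.v2023.R4",
        "stats": {"rows": len(rows), "elapsed_ms": int((time.time() - started) * 1000)},
        "result": [dict(zip(cols, r)) for r in rows],
        "next_cursor": None,
    }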
