
MediSchema — Natural Language to Structured Query Engine

A schema-aware translation engine that converts plain English questions into structured JSON filter objects for healthcare databases. Built as a proof of concept for intelligent cohort discovery in medical research platforms.


The Problem It Solves

Healthcare researchers need to find specific patient populations — for example, "patients with metastatic tumors on the skin from the INRG consortium." To query a database for this, they would traditionally need to write complex structured queries by hand, requiring deep knowledge of the database schema and query syntax.

This is a real bottleneck in medical research. Researchers who understand medicine often don't understand query languages. And database engineers who can write the queries often don't understand the clinical context.

MediSchema sits in the middle. It takes a researcher's question in plain English, understands the structure of their database, and generates a valid, schema-conforming filter object — ready to be consumed by a GraphQL API endpoint.

The hard part isn't calling an LLM. Any wrapper can do that. The hard part is preventing hallucination — ensuring the LLM doesn't invent field names that don't exist in the database, and validating the output before it reaches the database layer. That's what this engine does.


How It Works

Plain English Question
        ↓
Schema Parser       → flattens raw JSON schema into readable LLM context
        ↓
Prompt Builder      → injects schema context + strict field constraints
        ↓
LLM (Groq/OpenAI)   → generates structured filter object
        ↓
Validator           → recursively extracts all field names from output
                    → cross-references against authoritative schema
                    → rejects any hallucinated fields
        ↓
Validated JSON Filter Object

The output filter object uses AND, IN, and nested keys — a structure directly compatible with the PCDC (Pediatric Cancer Data Commons) GraphQL API and similar healthcare data platforms.
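The first stage can be sketched roughly as follows. The function name, output formatting, and fallback text below are illustrative assumptions, not the repository's exact schemaParser.js:

```javascript
// Illustrative sketch of the schema-flattening step (names are
// hypothetical; the actual schemaParser.js may differ). The goal: turn
// nested schema JSON into a compact, LLM-readable listing of every
// entity, its fields, and its nested relationships.
function flattenSchema(schema) {
  const lines = [];
  for (const [entity, def] of Object.entries(schema)) {
    lines.push(`Entity: ${entity} - ${def.description ?? "no description"}`);
    lines.push(`  Fields: ${(def.fields ?? []).join(", ")}`);
    for (const [rel, relDef] of Object.entries(def.nested_relationships ?? {})) {
      lines.push(`  Nested "${rel}": ${relDef.fields.join(", ")}`);
    }
  }
  return lines.join("\n");
}

// Minimal PCDC-style schema for demonstration:
const context = flattenSchema({
  subject: {
    description: "a patient enrolled in a consortium",
    fields: ["consortium", "sex"],
    nested_relationships: {
      tumor_assessments: { fields: ["tumor_classification", "tumor_site"] },
    },
  },
});
console.log(context);
```

The flattened text is what gets injected into the prompt, so the model only ever sees field names that actually exist.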


Example

Input:

node src/index.js --schema ./schema/medical.json --query "Find metastatic tumor patients from the INRG consortium where tumor site is skin"

Output:

{
  "AND": [
    {
      "IN": {
        "consortium": ["INRG"]
      }
    },
    {
      "nested": {
        "path": "tumor_assessments",
        "AND": [
          { "IN": { "tumor_classification": ["Metastatic"] } },
          { "IN": { "tumor_site": ["Skin"] } }
        ]
      }
    }
  ]
}

If the user asks for a field that doesn't exist in the schema:

{ "error": "field not available" }

Architecture

medischema/
├── src/
│   ├── engine/
│   │   ├── schemaParser.js      # Converts raw schema JSON into LLM-readable context
│   │   ├── promptBuilder.js     # Constructs schema-aware system prompt
│   │   ├── queryGenerator.js    # Calls LLM API, returns raw output
│   │   └── validator.js         # Validates output against schema
│   ├── evaluation/
│   │   └── testRunner.js        # Runs test suite, scores accuracy
│   └── index.js                 # CLI entry point
├── schema/
│   └── medical.json             # Sample PCDC-style healthcare schema
├── tests/
│   └── testCases.json           # Evaluation test cases with expected outputs
├── .env
└── package.json

Each layer has a single responsibility. The validator is the most critical — it contains a recursive field extractor that walks the entire generated JSON tree and cross-references every field name against the authoritative schema before the output is returned.


The Validation Layer (The Hard Part)

Most LLM wrappers print whatever the model returns. This engine doesn't.

After generation, the validator runs three checks:

1. JSON validity — If the LLM returns malformed JSON, the engine catches the parse error and returns a structured failure instead of crashing.

2. Recursive field extraction — The validator walks the entire nested output tree, collecting every field name used inside IN objects. This handles arbitrarily deep nesting.

3. Schema cross-reference — Every extracted field name is checked against the complete list of valid fields from the user's schema — including fields inside nested relationships. Any field not in the schema is flagged as invalid and the query is rejected.

This prevents the most common LLM failure mode in data applications: confidently generating queries with field names that don't exist.
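The three checks can be sketched together like this. Function names and the exact error messages are hypothetical; the repository's validator.js may be organized differently:

```javascript
// Recursively collect every field name used inside an IN clause,
// at any nesting depth (check 2).
function collectInFields(node, found = new Set()) {
  if (node === null || typeof node !== "object") return found;
  for (const [key, value] of Object.entries(node)) {
    if (key === "IN" && value && typeof value === "object") {
      Object.keys(value).forEach((f) => found.add(f));
    } else {
      collectInFields(value, found);
    }
  }
  return found;
}

// Run all three checks in order on the raw LLM output string.
function validateOutput(raw, validFields) {
  let filter;
  try {
    filter = JSON.parse(raw); // check 1: JSON validity
  } catch {
    return { error: "model returned malformed JSON" };
  }
  for (const field of collectInFields(filter)) {
    if (!validFields.has(field)) { // check 3: schema cross-reference
      return { error: "field not available" };
    }
  }
  return filter;
}

const valid = new Set(["consortium", "sex"]);
console.log(validateOutput('{"IN": {"insurance_number": ["X"]}}', valid));
// → { error: "field not available" }
```

Note that iterating with Object.entries also walks arrays, so the extractor handles AND lists and nested blocks without special cases.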


Tech Stack

  • Runtime: Node.js with ES Modules
  • LLM: Groq API (llama3-8b-8192) or OpenAI (gpt-3.5-turbo)
  • Schema format: JSON (bring your own)
  • Dependencies: groq-sdk / openai, dotenv, minimist

Installation

1. Clone the repository

git clone https://github.com/yourusername/medischema.git
cd medischema

2. Install dependencies

npm install

3. Set up your API key

Create a .env file in the root:

API_KEY=your_groq_or_openai_key_here

Get a free Groq API key at console.groq.com — no credit card required.


Usage

Query against the sample medical schema

node src/index.js --schema ./schema/medical.json --query "Find all male patients"

Use your own schema

node src/index.js --schema ./path/to/yourschema.json --query "Your question here"

Run the evaluation suite

node src/evaluation/testRunner.js

Bring Your Own Schema

MediSchema is schema-agnostic. Point it at any JSON schema file that follows this structure:

{
  "entityName": {
    "description": "what this entity represents",
    "fields": ["field1", "field2", "field3"],
    "nested_relationships": {
      "relationName": {
        "fields": ["nested_field1", "nested_field2"]
      }
    }
  }
}

The engine will automatically parse the schema, inject it into the prompt as authoritative context, and validate all generated output against it.
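For validation, the engine needs the complete list of legal field names, including those inside nested relationships. A minimal sketch of that collection step, assuming the schema shape shown above (helper name is hypothetical):

```javascript
// Collect every valid field name from a user schema, including fields
// declared inside nested_relationships (illustrative helper).
function collectValidFields(schema) {
  const valid = new Set();
  for (const def of Object.values(schema)) {
    for (const f of def.fields ?? []) valid.add(f);
    for (const rel of Object.values(def.nested_relationships ?? {})) {
      for (const f of rel.fields ?? []) valid.add(f);
    }
  }
  return valid;
}

// Using the example schema structure from above:
const valid = collectValidFields({
  entityName: {
    description: "what this entity represents",
    fields: ["field1", "field2", "field3"],
    nested_relationships: {
      relationName: { fields: ["nested_field1", "nested_field2"] },
    },
  },
});
console.log([...valid]); // all five field names
```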


Evaluation

The test runner executes a suite of test cases covering three scenarios:

  • Simple flat queries (single entity, single condition)
  • Nested queries (conditions across related entities)
  • Invalid field queries (fields that don't exist — engine should return error)

Run the suite and check the score:

node src/evaluation/testRunner.js

Expected output:

Test 1: PASSED — "Find all male patients"
Test 2: PASSED — "Find metastatic tumors on skin from INRG consortium"
Test 3: PASSED — "Find patients by their insurance number" (invalid field correctly rejected)

Results: 3/3 tests passed
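The loop behind that output can be sketched as below. The comparison and reporting details are assumptions, not testRunner.js verbatim, and the demo stubs out the LLM call:

```javascript
// Illustrative evaluation loop: deep-compare each generated filter with
// the expected output and report a pass/fail score.
function deepEqual(a, b) {
  // Adequate for test fixtures with stable key order; a real runner
  // might use a structural comparison instead.
  return JSON.stringify(a) === JSON.stringify(b);
}

function runSuite(cases, generate) {
  let passed = 0;
  for (const { query, expected } of cases) {
    const actual = generate(query);
    const ok = deepEqual(actual, expected);
    if (ok) passed++;
    console.log(`${ok ? "PASSED" : "FAILED"} - "${query}"`);
  }
  console.log(`Results: ${passed}/${cases.length} tests passed`);
  return passed;
}

// Demo with a stub generator in place of a live LLM call:
const passedCount = runSuite(
  [{ query: "Find all male patients", expected: { IN: { sex: ["Male"] } } }],
  () => ({ IN: { sex: ["Male"] } })
);
```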

Motivation

Healthcare data platforms face a common bottleneck — researchers who understand the clinical questions often can't write structured database queries, and engineers who can write queries often lack the clinical context.

This engine was built to explore that problem: how do you build a translation layer that is not just a generic LLM wrapper, but a schema-aware, validated pipeline that can be trusted with real data?

The output filter format (AND, IN, nested) is designed to be directly consumable by GraphQL API endpoints used in healthcare data platforms.
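As a sketch of that consumption step, a validated filter could be attached to a GraphQL request as a variable. The query text and endpoint below are illustrative assumptions, not the PCDC API's actual contract:

```javascript
// Hypothetical example of wrapping a validated filter in a GraphQL
// request payload (query shape and endpoint are assumptions).
const filter = {
  AND: [{ IN: { consortium: ["INRG"] } }],
};

const payload = {
  query: `query ($filter: JSON) {
    _aggregation { subject(filter: $filter) { _totalCount } }
  }`,
  variables: { filter },
};

// With Node 18+ this could be POSTed directly, e.g.:
// await fetch("https://example.org/graphql", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(payload),
// });
console.log(JSON.stringify(payload.variables.filter));
```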


Known Limitations and Future Work

  • Currently supports IN filters only — GTE/LTE numeric range filters are not yet implemented
  • Intent routing (general inquiry vs query generation) is not yet implemented
  • Schema must be provided as a local JSON file — URL-based schema loading would be a useful addition
  • Evaluation scoring is binary (pass/fail) — a partial credit scoring system would give more useful signal

License

MIT
