side-quest

HackED 2026 project

PDF → Markdown → Structured JSON Pipeline

Prereqs

Python 3.9+ on PATH
Dependencies from requirements.txt

1) Convert PDFs to Markdown

python -m pip install -r requirements.txt
python scripts/pdf_to_md.py --input-dir references --output-dir markdown

Outputs one .md per PDF in markdown/, with page headers and cleaned whitespace.
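As a rough illustration of what "page headers and cleaned whitespace" could mean, the sketch below joins per-page text into Markdown. It is not the code in scripts/pdf_to_md.py: the function name pages_to_markdown is hypothetical, the exact cleanup rules are assumed, and the pages are taken to be already extracted as plain strings.

```python
import re


def pages_to_markdown(pages):
    """Join per-page text into Markdown with a '## Page N' header per page.

    Hypothetical helper: assumes `pages` is a list of plain-text strings,
    one per PDF page, in reading order.
    """
    parts = []
    for number, text in enumerate(pages, start=1):
        # Collapse runs of spaces/tabs and trim each line.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
        body = "\n".join(lines)
        # Collapse three or more newlines into a single blank line.
        body = re.sub(r"\n{3,}", "\n\n", body).strip()
        parts.append(f"## Page {number}\n\n{body}")
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    print(pages_to_markdown(["Hello   world  \n\n\n\nNext", "Second"]))
```

The "## Page N" header format matches what the later steps expect when they derive page_numbers.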

2) Add your OpenAI API key

Create .env in the repository root (skip this if the file is already present) and set:

OPENAI_API_KEY=sk-...
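For readers curious how a KEY=value file like this can be read, here is a minimal standard-library sketch. How the project's scripts actually load the file passed via --env-file is not shown in this README, so load_env_file and its behavior (existing variables win, quotes stripped) are assumptions for illustration only.

```python
import os


def load_env_file(path, environ=None):
    """Read KEY=value lines from `path` into `environ` (os.environ by default).

    Hypothetical helper: blank lines, comments, and lines without '=' are
    skipped, and values already present in `environ` are not overwritten.
    """
    environ = os.environ if environ is None else environ
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip surrounding whitespace and optional quotes from the value.
            cleaned = value.strip().strip('"').strip("'")
            environ.setdefault(key.strip(), cleaned)
    return environ
```

Passing a plain dict as environ makes it easy to inspect the parsed values without touching the real process environment.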

3) Extract structured requirements with OpenAI

python scripts/extract_md_to_structured.py \
  --markdown-dir markdown \
  --output-dir extracted_information \
  --ground-truth ground_truth_accessibility.json \
  --model gpt-4.1 \
  --env-file .env

Produces {source}_extracted.json in extracted_information/.
Use --dry-run to write prompts without calling the API.
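The naming rule and the dry-run behavior described above can be sketched as follows. This is an assumption-laden illustration, not the contents of scripts/extract_md_to_structured.py: run_one, the call_model callable, the prompt wording, and the {source}_prompt.txt file name used for dry runs are all hypothetical.

```python
import json
from pathlib import Path


def extracted_path(markdown_file, output_dir):
    """Map markdown/<source>.md to <output_dir>/<source>_extracted.json."""
    return Path(output_dir) / f"{Path(markdown_file).stem}_extracted.json"


def run_one(markdown_file, output_dir, call_model, dry_run=False):
    """Process one Markdown file; `call_model` stands in for the API call."""
    md = Path(markdown_file)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder prompt; the real prompt is defined by the project's script.
    prompt = "Extract structured requirements from:\n\n" + md.read_text(encoding="utf-8")
    if dry_run:
        # Hypothetical file name: write the prompt and skip the API entirely.
        prompt_path = out_dir / f"{md.stem}_prompt.txt"
        prompt_path.write_text(prompt, encoding="utf-8")
        return prompt_path
    result = call_model(prompt)
    target = extracted_path(md, out_dir)
    target.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return target
```

Injecting call_model keeps the sketch runnable without network access or an API key.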

3b) One-liner via bash helper

chmod +x scripts/run_extraction.sh
./scripts/run_extraction.sh

Optionally override the model: MODEL_NAME=gpt-4.1-mini ./scripts/run_extraction.sh

4) Normalize extracted JSON (adds provenance + stable IDs)

python scripts/normalize_json.py \
  --input-dir extracted_information \
  --output-dir normalized_information \
  --building-type commercial_interiors \
  --include-pattern ""   # optional substring filter (case-insensitive)
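One common way to get stable IDs is to hash a canonical form of each record, so the same requirement from the same source always receives the same ID across runs. The sketch below illustrates that idea together with provenance fields and the case-insensitive substring filter. The field names (category, text, provenance) and the hashing scheme are assumptions, not the actual behavior of scripts/normalize_json.py.

```python
import hashlib


def stable_id(source, category, text):
    """Derive a short, repeatable ID from the fields that identify a record."""
    canonical = "|".join(part.strip().lower() for part in (source, category, text))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def normalize(items, source, building_type):
    """Return copies of `items` with an `id` and a `provenance` block added."""
    normalized = []
    for item in items:
        record = dict(item)
        record["id"] = stable_id(source, item.get("category", ""), item.get("text", ""))
        record["provenance"] = {"source": source, "building_type": building_type}
        normalized.append(record)
    return normalized


def matches(filename, pattern):
    """Case-insensitive substring filter; an empty pattern matches every file."""
    return pattern.lower() in filename.lower()
```

Because the ID depends only on content, re-running normalization on unchanged input yields identical output, which keeps the later merge step reproducible.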

5) Merge normalized JSON deterministically + capture conflicts

python scripts/merge_json.py \
  --input-dir normalized_information \
  --output merged/merged.json \
  --conflicts merged/conflicts.json \
  --doc-priority "leed,standard" \
  --include-pattern ""   # optional substring filter (case-insensitive)
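A deterministic merge with conflict capture can be sketched like this: records are sorted by ID and document priority so the result does not depend on input order, the highest-priority record for each ID is kept, and any lower-priority record with a different value is logged as a conflict. The record fields (id, doc, value) and the tie-breaking rules are assumptions; scripts/merge_json.py may work differently.

```python
def merge(records, doc_priority):
    """Merge records by `id`, preferring documents earlier in `doc_priority`.

    Returns (merged_records, conflicts). Hypothetical record shape:
    {"id": ..., "doc": ..., "value": ...}.
    """
    rank = {name: index for index, name in enumerate(doc_priority)}

    def sort_key(record):
        # Unlisted documents sort after listed ones; ties break on doc name
        # and then on the value's string form, so input order never matters.
        return (
            record["id"],
            rank.get(record["doc"], len(rank)),
            record["doc"],
            str(record["value"]),
        )

    merged = {}
    conflicts = []
    for record in sorted(records, key=sort_key):
        kept = merged.get(record["id"])
        if kept is None:
            merged[record["id"]] = record
        elif kept["value"] != record["value"]:
            conflicts.append({"id": record["id"], "kept": kept, "dropped": record})
    return [merged[key] for key in sorted(merged)], conflicts
```

With --doc-priority "leed,standard", this sketch would keep the "leed" value whenever both documents define the same ID and record the "standard" value in the conflicts list.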

6) Render concise rubric-style Markdown (includes building type)

python scripts/render_doc.py \
  --input merged/merged.json \
  --output reports/commercial_interiors.md \
  --per-category 12 \
  --min-confidence 0.0 \
  --top-n 0
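The three selection flags can be pictured as successive filters. The sketch below is one plausible reading, with loudly labeled assumptions: --top-n 0 is taken to mean "no overall limit", items are ranked by confidence, and each item carries id, category, and confidence fields. None of this is confirmed by the README, so check scripts/render_doc.py for the real semantics.

```python
from collections import defaultdict


def select_items(items, per_category, min_confidence, top_n):
    """Pick which merged items to render, grouped by category.

    Assumed semantics: drop items below `min_confidence`, rank the rest by
    confidence (ties broken by id), keep the best `top_n` overall (0 = no
    limit), then keep at most `per_category` items in each category.
    """
    kept = [item for item in items if item.get("confidence", 0.0) >= min_confidence]
    kept.sort(key=lambda item: (-item.get("confidence", 0.0), item["id"]))
    if top_n > 0:
        kept = kept[:top_n]
    grouped = defaultdict(list)
    for item in kept:
        bucket = grouped[item["category"]]
        if len(bucket) < per_category:
            bucket.append(item)
    return dict(grouped)
```

Under this reading, the defaults shown above (--min-confidence 0.0, --top-n 0) only apply the per-category cap of 12.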

6b) One-liner for steps 4–6 (normalize → merge → render)

chmod +x scripts/run_postprocess.sh
bash scripts/run_postprocess.sh

Defaults now target housing: BUILDING_TYPE=housing, OUTPUT_FILE=reports/housing.md.
Override example for commercial interiors:

BUILDING_TYPE=commercial_interiors \
  OUTPUT_FILE=reports/commercial_interiors.md \
  DOC_PRIORITY="leed,standard" \
  PER_CATEGORY=12 \
  MIN_CONFIDENCE=0.9 \
  TOP_N=40 \
  bash scripts/run_postprocess.sh

Use INCLUDE_PATTERN=<substring> to run the pipeline on a subset of files (case-insensitive filename match).

7) Chunk-level vector comparison + DB persistence

Apply the latest migration (adds chunk embeddings and comparisons tables):

sqlite3 db/assessment.db ".read migrations/003_chunk_vectors.sql"

Compare extracted Markdown to the rubric, store embeddings and coverage results:

python scripts/compare_md_vectors.py \
  --candidate extracted_output/ilovepdf_merged_organized_smart.md \
  --rubric-housing reports/housing.md \
  --rubric-commercial reports/commercial_interiors.md \
  --db db/assessment.db \
  --model text-embedding-3-small \
  --write-assessment          # optional: also writes projects/assessments rows

To run on all Markdown files in a folder: python scripts/compare_md_vectors.py --candidate extracted_output
Building type is auto-detected (housing if it looks like a personal/home doc; otherwise commercial).
Rubric weights are prioritised: high-impact accessibility markers (e.g., tactile/contrast/signage/egress cues) get a 3× weight bump before weights are normalised.
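The weighting rule above (a 3× bump for high-impact markers, applied before normalisation) can be worked through in a few lines. This is a sketch under stated assumptions: the marker list contains only the four examples named above, matching is a plain case-insensitive substring test, and rubric_weights is a hypothetical name rather than a function from scripts/compare_md_vectors.py.

```python
# Only the four examples named in the README; the real marker list may be longer.
HIGH_IMPACT_MARKERS = ("tactile", "contrast", "signage", "egress")


def rubric_weights(items, bump=3.0):
    """Return normalised weights for (text, base_weight) rubric items.

    Items whose text mentions a high-impact marker get `bump` times their
    base weight; all weights are then scaled so they sum to 1.
    """
    raw = []
    for text, base in items:
        lowered = text.lower()
        is_high_impact = any(marker in lowered for marker in HIGH_IMPACT_MARKERS)
        raw.append(base * bump if is_high_impact else base)
    total = sum(raw)
    if total == 0:
        return raw
    return [weight / total for weight in raw]


if __name__ == "__main__":
    # One bumped item (3.0) and one plain item (1.0) normalise to 0.75 and 0.25.
    print(rubric_weights([("Tactile warning strips", 1.0), ("Door width", 1.0)]))
```

Applying the bump before normalising means a high-impact item gains share at the expense of every other item, rather than inflating the total score.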

Notes

Ground truth schema lives in ground_truth_accessibility.json; update it to change categories/ids.
Scripts assume Markdown pages are labeled with ## Page N (added by the PDF→MD step) to derive page_numbers.
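To make the "## Page N" convention concrete, the sketch below splits a Markdown document on those headers and reports which pages mention a phrase. The header format comes from the note above; the function names and the substring-based lookup are hypothetical and only illustrate how page_numbers could be derived.

```python
import re

# Matches the headers added by the PDF-to-Markdown step, e.g. "## Page 12".
PAGE_HEADER = re.compile(r"^## Page (\d+)[ \t]*$", re.MULTILINE)


def split_pages(markdown):
    """Return {page_number: page_text} for a document with '## Page N' headers."""
    matches = list(PAGE_HEADER.finditer(markdown))
    pages = {}
    for index, match in enumerate(matches):
        # A page runs from its header to the next header (or end of document).
        end = matches[index + 1].start() if index + 1 < len(matches) else len(markdown)
        pages[int(match.group(1))] = markdown[match.end():end].strip()
    return pages


def page_numbers(markdown, phrase):
    """List the pages whose text contains `phrase` (case-insensitive)."""
    needle = phrase.lower()
    return [
        number
        for number, text in sorted(split_pages(markdown).items())
        if needle in text.lower()
    ]
```

If a Markdown file lacks these headers, split_pages returns an empty dict, which is one reason hand-written input may produce no page numbers.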

How to run smart_extract.py

Inside your virtual environment, install the dependencies:

pip install -r requirements.txt

Then run smart_extract.py on an input file: either a .pdf containing images, or a .txt/.md file containing information on the plan. The example command reads the input from inputs/:

python scripts/smart_extract.py --input inputs/<input file in .pdf, .txt, or .md format>

WaltonColaco/side-quest
