EngDesign is a benchmark of 101 structured engineering‑design tasks spanning multiple domains. This repository supports our NeurIPS Datasets & Benchmarks track submission, "Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs."
Of these 101 tasks, 34 rely on proprietary scientific software (e.g., MATLAB or Cadence) and may not run on every system. We still provide the complete input datasets and evaluation scripts for these tasks; simply follow the detailed setup instructions to configure the required environments and run them.
The remaining 67 tasks have no license restrictions and can be evaluated using our hand‑authored scripts. To remove licensing barriers, we’ve extracted these into EngDesign-Open, a standalone subset whose repository includes evaluation scripts for all 67 tasks without any proprietary dependencies.
Our evaluation framework currently integrates with twelve LLM variants: GPT‑4o, o1, o3, o3‑high, o4‑mini, o4‑mini‑high, Gemini‑2.0‑flash, Gemini‑2.5‑pro‑preview‑05‑06, DeepSeek‑Chat, DeepSeek‑Reasoner, Claude‑3‑7‑Sonnet, and Claude‑3‑7‑Sonnet (Extended Reasoning Mode).
EngDesign-Open contains all 67 tasks without license restrictions. You can run them by following these steps:
- Register at hub.docker.com and verify your email.
- Download and install Docker Desktop on your machine.
- Launch Docker Desktop and log in to your account.
- Make sure Docker Desktop has access to your drive (check settings).
In a terminal, run:
```bash
docker login -u your_dockerhub_username
```

Run the following command in the root directory of this project:

```bash
docker build -t engdesign-sim .
```

Mount your local project directory and start a bash session in the container:

```bash
docker run -it --rm -v /path/to/your/local/directory:/app --entrypoint bash engdesign-sim
```

Once inside the container (you'll see a prompt like `root@xxxxxxxxxxxx:/app#`), you can run benchmark tasks using the following commands.
To run all tasks:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model gpt-4o \
--api_key your_openai_api_key \
--task_dir ./EngDesign-Open \
--k 1
```

To run only specific tasks:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model gpt-4o \
--api_key your_openai_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The command-line parameters are described below:

| Parameter | Description |
|---|---|
| `--task_dir` | Directory containing the task folders (e.g., `./EngDesign-Open`) |
| `--task_list` | (Optional) Names of specific tasks to run (e.g., `AB_01 AB_02`). If not set, all tasks will run |
| `--model` | Model to use: `gpt-4o`, `o1`, `o3`, `o4-mini`, `gemini-2.0-flash`, `gemini-2.5-pro-preview-05-06`, `deepseek-chat`, `deepseek-reasoner`, `claude-3-7`, or `claude-3-7-thinking`. For the o3-high and o4-mini-high variants, pass `--model o3` or `--model o4-mini` together with `--reasoning_effort high` |
| `--api_key` | Your API key for the corresponding provider (OpenAI, Google, DeepSeek, Anthropic, etc.) |
| `--k` | Number of repetitions per task |
| `--reasoning_effort` | (Optional) Set to `high` to run the o3 or o4-mini models in high-effort reasoning mode |
(1) GPT-4o:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model gpt-4o \
--api_key your_openai_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by GPT-4o will be saved in the `{task_id}_log_gpt-4o_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(2) o1 - OpenAI reasoning model:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model o1 \
--api_key your_openai_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by o1 will be saved in the `{task_id}_log_o1_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(3) o3 - OpenAI reasoning model (default reasoning effort):

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model o3 \
--api_key your_openai_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by o3 will be saved in the `{task_id}_log_o3_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(4) o3-high - o3 with high-effort reasoning mode:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model o3 \
--reasoning_effort high \
--api_key your_openai_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by o3-high will be saved in the `{task_id}_log_o3_high_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(5) o4-mini - lightweight OpenAI reasoning model:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model o4-mini \
--api_key your_openai_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by o4-mini will be saved in the `{task_id}_log_o4-mini_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(6) o4-mini-high - o4-mini with high-effort reasoning mode:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model o4-mini \
--reasoning_effort high \
--api_key your_openai_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by o4-mini-high will be saved in the `{task_id}_log_o4-mini_high_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(1) Gemini‑2.0‑flash:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model gemini-2.0-flash \
--api_key your_gemini_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by Gemini‑2.0‑flash will be saved in the `{task_id}_log_gemini-2.0-flash_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(2) Gemini‑2.5‑pro‑preview‑05‑06:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model gemini-2.5-pro-preview-05-06 \
--api_key your_gemini_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by Gemini‑2.5‑pro‑preview‑05‑06 will be saved in the `{task_id}_log_gemini-2.5-pro-preview-05-06_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(1) DeepSeek‑Chat:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model deepseek-chat \
--api_key your_deepseek_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by DeepSeek‑Chat will be saved in the `{task_id}_log_deepseek-chat_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(2) DeepSeek‑Reasoner:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model deepseek-reasoner \
--api_key your_deepseek_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by DeepSeek‑Reasoner will be saved in the `{task_id}_log_deepseek-reasoner_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(1) Claude‑3‑7‑Sonnet:

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model claude-3-7 \
--api_key your_anthropic_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by Claude‑3‑7‑Sonnet will be saved in the `{task_id}_log_claude-3-7_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
(2) Claude‑3‑7‑Sonnet (Extended Reasoning Mode):

```bash
xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
python3 evaluation/evaluate_llm.py \
--model claude-3-7-thinking \
--api_key your_anthropic_api_key \
--task_dir ./EngDesign-Open \
--task_list AB_01 AB_02 \
--k 1
```

The results and scores generated by Claude‑3‑7‑Sonnet (Extended Reasoning Mode) will be saved in the `{task_id}_log_claude-3-7-thinking_{i}.jsonl` file located in the `logs` folder within each task directory, where `i` indicates the trial number.
Make sure to replace the `--api_key` value in the commands with your actual API key for the corresponding provider.
You can find the model's corresponding output files in the logs folder within each task directory, where you can view the scores and the model's generated outputs. See Section 6 (Example Commands for All 12 Supported Models) for details on the output file names.
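If you prefer to collect scores programmatically rather than opening each log by hand, a minimal Python sketch along the following lines may help. It assumes the `{task_id}_log_{model}_{i}.jsonl` naming scheme described above and that each JSONL record may carry a numeric `score` field; adjust the paths and field name to match what the task evaluation scripts actually emit.

```python
import json
from pathlib import Path

# Minimal sketch: gather scores for one model across all EngDesign-Open tasks.
# Assumes log files follow the {task_id}_log_{model}_{i}.jsonl naming scheme and
# that some records contain a "score" field; adapt both to your actual logs.
MODEL = "gpt-4o"
ROOT = Path("./EngDesign-Open")

for log_file in sorted(ROOT.glob(f"*/logs/*_log_{MODEL}_*.jsonl")):
    scores = []
    with log_file.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if isinstance(record, dict) and "score" in record:
                scores.append(record["score"])
    print(f"{log_file.parent.parent.name}/{log_file.name}: {scores}")
```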
Please note that each task folder in this GitHub repository contains a logs folder with pre-generated results from our own runs. We strongly recommend renaming the original logs folder before running your own experiments to avoid confusion with our provided outputs.
In particular, we renamed the output log files for certain models. As a result, some of the log files you generate may share filenames with files already present in the repository, while others may differ. Please be cautious when interpreting or comparing results. (This is another reason why we recommend renaming the original logs folder.)
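For example, one way to set the pre-generated folders aside is the following minimal Python sketch (assuming the open tasks live under `./EngDesign-Open`; the new name `logs_reference` is arbitrary):

```python
from pathlib import Path

# Sketch: move each task's pre-generated logs folder out of the way before your own runs.
# Assumes tasks live under ./EngDesign-Open; "logs_reference" is an arbitrary new name.
root = Path("./EngDesign-Open")
for logs_dir in sorted(root.glob("*/logs")):
    target = logs_dir.with_name("logs_reference")
    if not target.exists():
        logs_dir.rename(target)
        print(f"{logs_dir} -> {target}")
```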
Type exit to quit the container shell.
Remove the image if needed:

```bash
docker image rm engdesign-sim
```

The repository is organized as follows:

```
├── tasks/                      # 101 individual task folders
│   ├── <task_id>/              # e.g. XG_01
│   │   ├── LLM_prompt.txt      # Prompt presented to the LLM
│   │   ├── output_structure.py # Defines the expected JSON/Python output schema via instructor
│   │   ├── evaluate.py         # Runs simulations & computes evaluation results
│   │   ├── images/             # (Optional) Input images for multimodal tasks
│   │   └── logs/               # Our evaluation logs
│   └── ...
├── EngDesign-Open/             # The task folders without license restrictions
│   ├── <task_id>/
│   └── ...
├── iterative_result/           # Logs from iterative design runs with GPT-4o, o1, o3, o4-mini
├── evaluation/                 # The driver script for running the benchmark
│   └── evaluate_llm.py
├── Dockerfile                  # Docker configuration for containerized benchmarking
└── docker_requirements.txt     # Dependency list for installing in the Docker environment
```
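For reference, here is a hypothetical sketch of what a task's `output_structure.py` might look like. The repository only states that this file defines the expected output schema via `instructor` (which builds structured outputs on top of pydantic models), so the class and field names below are invented for illustration and will differ per task.

```python
# Hypothetical example of a task's output_structure.py (names are illustrative only).
# A real file likely defines a pydantic BaseModel describing the fields the LLM
# must return, which instructor then uses to parse and validate the response.
from pydantic import BaseModel, Field

class DesignResponse(BaseModel):
    """Illustrative response schema an LLM answer would be parsed into."""
    reasoning: str = Field(description="Step-by-step design rationale")
    design_parameters: dict[str, float] = Field(
        description="Named design variables and their chosen values"
    )
```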