EngDesign

EngDesign is a benchmark of 101 structured engineering‑design tasks spanning multiple domains. This repository supports our NeurIPS Datasets & Benchmarks track submission, "Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs."

Of these 101 tasks, 34 rely on proprietary scientific software (e.g., MATLAB or Cadence) and may not run on every system. We still provide the complete input datasets and evaluation scripts for these tasks; simply follow the detailed setup instructions to configure the required environments and run them.

The remaining 67 tasks have no license restrictions and can be evaluated using our hand‑authored scripts. To remove licensing barriers, we’ve extracted these into EngDesign-Open, a standalone subset whose repository includes evaluation scripts for all 67 tasks without any proprietary dependencies.

Our evaluation framework currently integrates with twelve LLM variants: GPT‑4o, o1, o3, o3‑high, o4‑mini, o4‑mini‑high, Gemini‑2.0‑flash, Gemini‑2.5‑pro‑preview‑05‑06, DeepSeek‑Chat, DeepSeek‑Reasoner, Claude‑3‑7‑Sonnet, and Claude‑3‑7‑Sonnet (Extended Reasoning Mode).


🚀 Run EngDesign-Open

EngDesign-Open contains all 67 tasks without license restrictions. You can run them by following these steps:

1. Install and Log in to Docker

  • Register at hub.docker.com and verify your email.
  • Download and install Docker Desktop on your machine.
  • Launch Docker Desktop and log in to your account.
  • Make sure Docker Desktop has access to your drive (check its settings); a quick way to verify your setup is shown below.
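
To confirm that Docker is installed and the daemon is running before continuing, you can run the following standard Docker CLI commands in a terminal (optional sanity check, not specific to this repository):

docker --version   # prints the installed Docker version
docker info        # fails with an error if the Docker daemon is not running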

2. Authenticate via CLI

In a terminal, run:

docker login -u your_dockerhub_username

3. Build the Docker Image

Run the following command in the root directory of this project:

docker build -t engdesign-sim .
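
The build is driven by the Dockerfile in the repository root (see the repository layout below). To confirm that the image was created, an optional check with the standard Docker CLI:

docker images engdesign-sim   # should list the freshly built engdesign-sim image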

4. Start a Docker Container

Mount your local project directory and start a bash session in the container:

docker run -it --rm -v /path/to/your/local/directory:/app --entrypoint bash engdesign-sim
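
If your terminal is already in the repository root on a Unix-like shell, you can mount the current directory instead of typing an absolute path; an equivalent variant of the command above:

docker run -it --rm -v "$(pwd)":/app --entrypoint bash engdesign-sim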

5. Run the Benchmark Tasks

Once inside the container (you'll see a prompt like root@xxxxxxxxxxxx:/app#), you can run benchmark tasks using the following commands. The xvfb-run wrapper starts a virtual X display (Xvfb) so that tasks which render figures can run headlessly inside the container.

(1) Run All Tasks with a Given Model

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model gpt-4o \
  --api_key your_openai_api_key \
  --task_dir ./EngDesign-Open \
  --k 1

(2) Run Specific Tasks with a Given Model

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model gpt-4o \
  --api_key your_openai_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

Parameter Descriptions

Parameter Description
--task_dir Directory containing the task folders (e.g. ./EngDesign-Open)
--task_list (Optional) Names of specific tasks to run (e.g. AB_01 AB_02). If not set, all tasks will run
--model Model name as used in the commands: gpt-4o, o1, o3, o4-mini, gemini-2.0-flash, gemini-2.5-pro-preview-05-06, deepseek-chat, deepseek-reasoner, claude-3-7, or claude-3-7-thinking. The o3-high and o4-mini-high variants are selected by combining --model o3 or --model o4-mini with --reasoning_effort high
--api_key Your API key for the corresponding provider (OpenAI, Google, DeepSeek, Anthropic, etc.)
--k Number of repetitions per task
--reasoning_effort (Optional) Set to high to run o3 or o4-mini in high-effort reasoning mode
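
For example, to repeat the two tasks AB_01 and AB_02 three times each with o4-mini in high-effort reasoning mode (this simply combines the --k and --reasoning_effort flags with the same command shown above):

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model o4-mini \
  --reasoning_effort high \
  --api_key your_openai_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 3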

6. Example Commands for All 12 Supported Models

OpenAI Models

(1) GPT-4o:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model gpt-4o \
  --api_key your_openai_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by GPT-4o will be saved in the {task_id}_log_gpt-4o_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.
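
To inspect a finished run, you can list a task's logs folder and pretty-print the last record of a log file. This is a minimal sketch, assuming task AB_01 was evaluated, the trial index i starts at 1, and each line of the .jsonl file is a standalone JSON object:

ls ./EngDesign-Open/AB_01/logs/
tail -n 1 ./EngDesign-Open/AB_01/logs/AB_01_log_gpt-4o_1.jsonl | python3 -m json.tool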

(2) o1 - OpenAI reasoning model:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model o1 \
  --api_key your_openai_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by o1 will be saved in the {task_id}_log_o1_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

(3) o3 - OpenAI reasoning model:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model o3 \
  --api_key your_openai_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by o3 will be saved in the {task_id}_log_o3_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

(4) o3-high - o3 with high-effort reasoning mode:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model o3 \
  --reasoning_effort high \
  --api_key your_openai_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by o3-high will be saved in the {task_id}_log_o3_high_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

(5) o4-mini - OpenAI lightweight reasoning model:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model o4-mini \
  --api_key your_openai_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by o4-mini will be saved in the {task_id}_log_o4-mini_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

(6) o4-mini-high - o4-mini with high-effort reasoning mode:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model o4-mini \
  --reasoning_effort high \
  --api_key your_openai_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by o4-mini-high will be saved in the {task_id}_log_o4-mini_high_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

Gemini Models (Google)

(1) Gemini‑2.0‑flash:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model gemini-2.0-flash \
  --api_key your_gemini_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by Gemini‑2.0‑flash will be saved in the {task_id}_log_gemini-2.0-flash_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

(2) Gemini‑2.5‑pro‑preview‑05‑06:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model gemini-2.5-pro-preview-05-06 \
  --api_key your_gemini_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by Gemini‑2.5‑pro‑preview‑05‑06 will be saved in the {task_id}_log_gemini-2.5-pro-preview-05-06_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

DeepSeek Models

(1) DeepSeek‑Chat:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model deepseek-chat \
  --api_key your_deepseek_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by DeepSeek‑Chat will be saved in the {task_id}_log_deepseek-chat_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

(2) DeepSeek‑Reasoner:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model deepseek-reasoner \
  --api_key your_deepseek_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by DeepSeek‑Reasoner will be saved in the {task_id}_log_deepseek-reasoner_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

Claude Models (Anthropic)

(1) Claude‑3‑7‑Sonnet:

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model claude-3-7 \
  --api_key your_anthropic_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by Claude‑3‑7‑Sonnet will be saved in the {task_id}_log_claude-3-7_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

(2) Claude‑3‑7‑Sonnet (Extended Reasoning Mode):

xvfb-run -a -e /dev/stdout --server-args="-screen 0 1024x768x24" \
  python3 evaluation/evaluate_llm.py \
  --model claude-3-7-thinking \
  --api_key your_anthropic_api_key \
  --task_dir ./EngDesign-Open \
  --task_list AB_01 AB_02 \
  --k 1

The results and scores generated by Claude‑3‑7‑Sonnet (Extended Reasoning Mode) will be saved in the {task_id}_log_claude-3-7-thinking_{i}.jsonl file located in the logs folder within each task directory, where i indicates the trial number.

7. Other Important Information

(1) Replace API Keys

Make sure to replace the --api_key placeholders (e.g. your_openai_api_key) in the commands with your actual API key for the corresponding provider.

(2) Find the Task Outputs

You can find the model's corresponding output files in the logs folder within each task directory, where you can view the scores and the model's generated outputs. See Section 6 (Example Commands for All 12 Supported Models) for details on the output file names.

Please note that each task folder in this GitHub repository contains a logs folder with pre-generated results from our own runs. We strongly recommend renaming the original logs folder before running your own experiments to avoid confusion with our provided outputs.

In particular, we have renamed the output log files for certain models, so the filenames you generate may partially match those already present in the repository while others differ. Be careful when interpreting or comparing results; this is another reason we recommend renaming the original logs folder first.
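
One way to do this in a Unix-like shell is to rename every pre-generated logs folder in a single pass before your first run. A minimal sketch, assuming the tasks live under ./EngDesign-Open and using logs_original as an arbitrary new name:

# Rename each pre-generated logs folder so your own runs start from empty logs
for d in ./EngDesign-Open/*/logs; do
  mv "$d" "${d%/logs}/logs_original"
done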

8. Exit the Container

Type exit to quit the container shell.

Optional Cleanup

Remove the image if needed:

docker image rm engdesign-sim

📂 Repository Layout

├── tasks/                       # 101 individual task folders
│   ├── <task_id>/               # e.g. XG_01
│   │   ├── LLM_prompt.txt       # Prompt presented to the LLM
│   │   ├── output_structure.py  # Defines the expected JSON/Python output schema via instructor
│   │   ├── evaluate.py          # Runs simulations & computes evaluation results
│   │   ├── images/              # (Optional) Input images for multimodal tasks
│   │   └── logs/                # Our evaluation logs
│   └── ...
├── EngDesign-Open/              # The task folders without license restrictions
│   ├── <task_id>/
│   └── ...
├── iterative_result/            # Logs from iterative design runs with GPT‑4o, o1, o3, o4‑mini
├── evaluation/                  # The driver script for running the benchmark
│   └── evaluate_llm.py
├── Dockerfile                   # Docker configuration for containerized benchmarking
└── docker_requirements.txt      # Dependency list for installing in the Docker environment
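
As a quick orientation, you can count the open tasks and preview one task prompt directly from this layout (paths assume you are in the repository root; AB_01 is just an example task id):

ls EngDesign-Open | wc -l                        # number of task folders in EngDesign-Open
head -n 20 EngDesign-Open/AB_01/LLM_prompt.txt   # first lines of one task's prompt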
