**docs/source/workflows/evaluate.md** (+10 −1)

@@ -142,9 +142,18 @@ eval:
 A judge LLM is used to evaluate the trajectory produced by the workflow, taking into account the tools available during execution. It returns a floating-point score between 0 and 1, where 1.0 indicates a perfect trajectory.
 
+To configure the judge LLM, define it in the `llms` section of the configuration file, and reference it in the evaluator configuration using the `llm_name` key.
+
 It is recommended to set `max_tokens` to 1024 for the judge LLM to ensure sufficient context for evaluation.
 
-To configure the judge LLM, define it in the `llms` section of the configuration file, and reference it in the evaluator configuration using the `llm_name` key.
+Note: Trajectory evaluation may result in frequent LLM API calls. If you encounter rate-limiting errors (such as `[429] Too Many Requests`), you can reduce the number of concurrent requests by adjusting the `max_concurrency` parameter in your config. For example:
+
+```yaml
+eval:
+  general:
+    max_concurrency: 2
+```
+This setting reduces the number of concurrent requests to avoid overwhelming the LLM endpoint.
 
 ## Workflow Output
 
 The `aiq eval` command runs the workflow on all the entries in the `dataset`. The output of these runs is stored in a file named `workflow_output.json` under the `output_dir` specified in the configuration file.
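For reference, the judge LLM wiring these lines describe can be sketched as follows. This is illustrative only: the `judge_llm` name, the `nim` provider type, the model name, and the `trajectory_eval` evaluator entry are placeholders, not values taken from this diff.

```yaml
# Illustrative sketch only; all names and the provider type are assumptions.
llms:
  judge_llm:
    _type: nim                               # assumed LLM provider type
    model_name: meta/llama-3.1-70b-instruct  # placeholder model
    max_tokens: 1024                         # recommended judge LLM setting (see above)

eval:
  general:
    max_concurrency: 2                       # lower this if you hit [429] errors
  evaluators:
    trajectory_eval:
      _type: trajectory                      # assumed evaluator type
      llm_name: judge_llm                    # references the judge LLM defined above
```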
**examples/alert_triage_agent/README.md** (+76 −29)
@@ -34,14 +34,17 @@ This example demonstrates how to build an intelligent alert triage system using
 - [Functions](#functions)
 - [Workflow](#workflow)
 - [LLMs](#llms)
+- [Evaluation](#evaluation)
+- [General](#general)
+- [Evaluators](#evaluators)
 - [Installation and setup](#installation-and-setup)
 - [Install this workflow](#install-this-workflow)
 - [Set up environment variables](#set-up-environment-variables)
 - [Example Usage](#example-usage)
 - [Running in a live environment](#running-in-a-live-environment)
 - [Note on credentials and access](#note-on-credentials-and-access)
 - [Running live with an HTTP server listening for alerts](#running-live-with-a-http-server-listening-for-alerts)
-- [Running in test mode](#running-in-test-mode)
+- [Running in offline mode](#running-in-offline-mode)
 
 
 ## Use case description
@@ -149,20 +152,20 @@ The triage agent may call one or more of the following tools based on the alert
 
 #### Functions
 
-Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in test mode, using mocked data for simulation.
+Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in offline mode, using mocked data for simulation.
 
 Example:
 
 ```yaml
 hardware_check:
   _type: hardware_check
   llm_name: tool_reasoning_llm
-  test_mode: true
+  offline_mode: true
 ```
 
 * `_type`: Identifies the name of the tool (matching the name in the tool's Python file).
 * `llm_name`: LLM used to support the tool's reasoning over the raw fetched data.
-* `test_mode`: If `true`, the tool uses predefined mock results for offline testing.
+* `offline_mode`: If `true`, the tool uses predefined mock results for offline testing.
 
 Some entries, like `telemetry_metrics_analysis_agent`, are sub-agents that coordinate multiple tools:
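The sub-agent example itself lies outside this hunk. As a rough sketch of what such an entry might look like, based on the rest of this PR (the `llm_name` value, the `offline_mode` key, and the `tool_names` list below are assumptions, not lines from the diff):

```yaml
# Hypothetical sub-agent entry; llm_name, offline_mode, and tool_names are assumptions.
telemetry_metrics_analysis_agent:
  _type: telemetry_metrics_analysis_agent
  llm_name: tool_reasoning_llm        # assumed; mirrors the tools above
  offline_mode: true                  # assumed; mirrors the tools above
  tool_names:                         # tools this sub-agent coordinates (illustrative)
    - telemetry_metrics_host_heartbeat_check
    - telemetry_metrics_host_performance_check
```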
@@ -185,19 +188,17 @@ workflow:
     - hardware_check
     - ...
   llm_name: ata_agent_llm
-  test_mode: true
-  test_data_path: ...
+  offline_mode: true
+  offline_data_path: ...
   benign_fallback_data_path: ...
-  test_output_path: ...
 ```
 
 * `_type`: The name of the agent (matching the agent's name in `register.py`).
 * `tool_names`: List of tools (from the `functions` section) used in the triage process.
 * `llm_name`: Main LLM used by the agent for reasoning, tool-calling, and report generation.
-* `test_mode`: Enables test execution using predefined input/output instead of real systems.
-* `test_data_path`: CSV file containing test alerts and their corresponding mocked tool responses.
+* `offline_mode`: Enables offline execution using predefined input/output instead of real systems.
+* `offline_data_path`: CSV file containing offline test alerts and their corresponding mocked tool responses.
 * `benign_fallback_data_path`: JSON file with baseline healthy system responses for tools not explicitly mocked.
-* `test_output_path`: Output CSV file path where the agent writes triage results. Each processed alert adds a new `output` column with the generated report.
 
 #### LLMs
@@ -219,6 +220,50 @@ ata_agent_llm:
 
 Each tool or agent can use a dedicated LLM tailored for its task.
 
+#### Evaluation
+
+The `eval` section defines how the system evaluates pipeline outputs using predefined metrics. It includes the location of the dataset used for evaluation and the configuration of evaluation metrics.
+
+##### General
+
+...
+
+* `output_dir`: Directory where outputs (e.g., pipeline output texts, evaluation scores, agent traces) are saved.
+* `dataset.file_path`: Path to the JSON dataset used for evaluation.
+
+##### Evaluators
+
+Each entry under `evaluators` defines a specific metric to evaluate the pipeline's output. All listed evaluators use the `ragas` (Retrieval-Augmented Generation Assessment) framework.
+
+* `metric`: The specific `ragas` metric used to assess the output.
+  * `AnswerAccuracy`: Measures whether the agent's response matches the expected answer.
+  * `ResponseGroundedness`: Assesses whether the response is supported by retrieved context.
+  * `ContextRelevance`: Evaluates whether the retrieved context is relevant to the query.
+* `llm_name`: The name of an LLM listed in the `llms` section above that performs the evaluation. This LLM should be capable of understanding both the context and the generated responses in order to make accurate assessments.
+
+The list of evaluators can be extended or swapped out depending on your evaluation goals.
+
 ## Installation and setup
 
 If you have not already done so, follow the instructions in the [Install Guide](../../docs/source/quick-start/installing.md) to create the development environment and install AIQ toolkit.
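Most of the YAML example added in the Evaluation hunk above is elided. As a purely hypothetical sketch of an `eval` section consistent with the keys it describes (the paths, the evaluator entry names, the `eval_llm` reference, and `_type: ragas` are all assumptions):

```yaml
# Hypothetical sketch; paths, names, and _type values are assumptions.
eval:
  general:
    output_dir: .tmp/alert_triage_eval     # placeholder output directory
    dataset:
      file_path: data/offline_alerts.json  # placeholder dataset path
  evaluators:
    answer_accuracy:
      _type: ragas                         # assumed evaluator type
      metric: AnswerAccuracy
      llm_name: eval_llm                   # placeholder LLM defined under `llms`
    response_groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: eval_llm
    context_relevance:
      _type: ragas
      metric: ContextRelevance
      llm_name: eval_llm
```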
@@ ... @@
-You can run the agent in [test mode](#running-in-test-mode) or [live mode](#running-live-with-a-http-server-listening-for-alerts). Test mode allows you to evaluate the agent in a controlled, offline environment using synthetic data. Live mode allows you to run the agent in a real environment.
+You can run the agent in [offline mode](#running-in-offline-mode) or [live mode](#running-live-with-a-http-server-listening-for-alerts). Offline mode allows you to evaluate the agent in a controlled environment using synthetic data. Live mode allows you to run the agent in a real environment.
 
 ### Running in a live environment
 
 In live mode, each tool used by the triage agent connects to real systems to collect data. These systems can include:
@@ -262,11 +307,11 @@ To run the agent live, follow these steps:
 
    If your environment includes unique systems or data sources, you can define new tools or modify existing ones. This allows your triage agent to pull in the most relevant data for your alerts and infrastructure.
 
-3. **Disable test mode**
+3. **Disable offline mode**
 
-   Set `test_mode: false` in the workflow section and for each tool in the functions section of your config file to ensure the agent uses real data instead of synthetic test datasets.
+   Set `offline_mode: false` in the `workflow` section and for each tool in the `functions` section of your config file to ensure the agent uses real data instead of offline datasets.
 
-   You can also selectively keep some tools in test mode by leaving their `test_mode: true` for more granular testing.
+   You can also selectively keep some tools in offline mode by leaving their `offline_mode: true` for more granular testing.
 
 4. **Run the agent with a real alert**
@@ -371,33 +416,35 @@ To use this mode, first ensure you have configured your live environment as described above.
 
 You can monitor the progress of the triage process through these logs and the generated reports.
 
-### Running in test mode
-Test mode lets you evaluate the triage agent in a controlled, offline environment using synthetic data. Instead of calling real systems, the agent uses predefined inputs to simulate alerts and tool outputs, ideal for development, debugging, and tuning.
+### Running in offline mode
+Offline mode lets you evaluate the triage agent in a controlled, offline environment using synthetic data. Instead of calling real systems, the agent uses predefined inputs to simulate alerts and tool outputs, which makes it ideal for development, debugging, and tuning.
 
-To run in test mode:
+To run in offline mode:
 
 1. **Set required environment variables**
 
-   Make sure `test_mode: true` is set in both the `workflow` section and individual tool sections of your config file (see [Understanding the config](#understanding-the-config) section).
+   Make sure `offline_mode: true` is set in both the `workflow` section and individual tool sections of your config file (see the [Understanding the config](#understanding-the-config) section).
 
-1. **How it works**
-   - The **main test CSV** provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
-   - The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's test data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
+2. **How it works**
+   - The **main CSV offline dataset** (`offline_data_path`) provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
+   - The **JSON offline dataset** (`eval.general.dataset.file_path` in the config) contains a subset of the information in the main CSV: the alert inputs and their associated ground-truth root causes. It is used to run `aiq eval`, focusing only on the essential data needed for running the workflow, while the full CSV retains the complete mock-environment context. A hypothetical entry is sketched after this section.
+   - At runtime, the system links each alert in the JSON dataset to its corresponding context in the CSV using the unique host IDs included in both datasets.
+   - The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's offline data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
 
-3. **Run the agent in test mode**
+3. **Run the agent in offline mode**
 
   Run the agent with:
   ```bash
-  aiq run --config_file=examples/alert_triage_agent/configs/config_test_mode.yml --input "test_mode"
+  ...
  ```
   Note: The `--input` value is ignored in offline mode.
 
   The agent will:
-   - Load alerts from the test dataset specified in `test_data_path` in the workflow config
-   - Simulate an investigation using predefined tool results
-   - Iterate through all the alerts in the dataset
-   - Save reports as a new column in a copy of the test CSV file to the path specified in `test_output_path` in the workflow config
+   - Load alerts from the JSON dataset specified in the config `eval.general.dataset.file_path`
+   - Investigate the alerts using the predefined tool responses in the CSV file (path set in the config `workflow.offline_data_path`)
+   - Process all alerts in the dataset in parallel
+   - Run evaluation for the metrics specified in the config `eval.evaluators`
+   - Save the pipeline output along with the evaluation results to the path specified by `eval.general.output_dir`
 
-2. **Understanding the output**
+4. **Understanding the output**
 
   The output file will contain a new column named `output`, which includes the markdown report generated by the agent for each data point (i.e., each row in the CSV). Navigate to the rightmost `output` column to view the report for each entry.
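To make the JSON/CSV linkage described under "How it works" concrete, here is a purely hypothetical JSON offline dataset entry. The field names (`host_id`, `input`, `expected_output`) and all values are illustrative, since the actual schema is not shown in this diff:

```json
[
  {
    "host_id": "host-042",
    "input": "InstanceDown alert fired for host-042: heartbeat missing for 5 minutes",
    "expected_output": "Root cause: hardware fault on the host (failed power supply)"
  }
]
```

The `host_id` field is what would tie this entry back to the matching row of mocked tool responses in the CSV offline dataset.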
**examples/alert_triage_agent/src/aiq_alert_triage_agent/configs/config_offline_mode.yml** (+36 −11)
@@ -20,28 +20,28 @@ functions:
   hardware_check:
     _type: hardware_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
   host_performance_check:
     _type: host_performance_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
   monitoring_process_check:
     _type: monitoring_process_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
   network_connectivity_check:
     _type: network_connectivity_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
   telemetry_metrics_host_heartbeat_check:
     _type: telemetry_metrics_host_heartbeat_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
     metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
   telemetry_metrics_host_performance_check:
     _type: telemetry_metrics_host_performance_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
     metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
   telemetry_metrics_analysis_agent:
     _type: telemetry_metrics_analysis_agent

@@ -66,11 +66,10 @@ workflow:
     - network_connectivity_check
     - telemetry_metrics_analysis_agent
   llm_name: ata_agent_llm
-  test_mode: true
-  # The below paths are only used if test_mode is true