**docs/source/workflows/evaluate.md** (+10 −1)

@@ -142,9 +142,18 @@ eval:
 A judge LLM is used to evaluate the trajectory produced by the workflow, taking into account the tools available during execution. It returns a floating-point score between 0 and 1, where 1.0 indicates a perfect trajectory.
 
+To configure the judge LLM, define it in the `llms` section of the configuration file, and reference it in the evaluator configuration using the `llm_name` key.
+
 It is recommended to set `max_tokens` to 1024 for the judge LLM to ensure sufficient context for evaluation.
 
-To configure the judge LLM, define it in the `llms` section of the configuration file, and reference it in the evaluator configuration using the `llm_name` key.
+Note: Trajectory evaluation may result in frequent LLM API calls. If you encounter rate-limiting errors (such as `[429] Too Many Requests`), you can reduce the number of concurrent requests by adjusting the `max_concurrency` parameter in your config. For example:
+
+```yaml
+eval:
+  general:
+    max_concurrency: 2
+```
+This setting reduces the number of concurrent requests to avoid overwhelming the LLM endpoint.
 
 ## Workflow Output
 
 The `aiq eval` command runs the workflow on all the entries in the `dataset`. The output of these runs is stored in a file named `workflow_output.json` under the `output_dir` specified in the configuration file.
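For reference, the judge LLM wiring these lines describe can be sketched as follows. This is illustrative only: the `judge_llm` name, the `nim` provider type, the model name, and the `trajectory_eval` evaluator entry are placeholders, not values taken from this diff.

```yaml
# Illustrative sketch only; all names and the provider type are assumptions.
llms:
  judge_llm:
    _type: nim                               # assumed LLM provider type
    model_name: meta/llama-3.1-70b-instruct  # placeholder model
    max_tokens: 1024                         # recommended judge LLM setting (see above)

eval:
  general:
    max_concurrency: 2                       # lower this if you hit [429] errors
  evaluators:
    trajectory_eval:
      _type: trajectory                      # assumed evaluator type
      llm_name: judge_llm                    # references the judge LLM defined above
```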
**examples/alert_triage_agent/README.md** (+76 −29)
@@ -34,14 +34,17 @@ This example demonstrates how to build an intelligent alert triage system using
 - [Functions](#functions)
 - [Workflow](#workflow)
 - [LLMs](#llms)
+- [Evaluation](#evaluation)
+- [General](#general)
+- [Evaluators](#evaluators)
 - [Installation and setup](#installation-and-setup)
 - [Install this workflow](#install-this-workflow)
 - [Set up environment variables](#set-up-environment-variables)
 - [Example Usage](#example-usage)
 - [Running in a live environment](#running-in-a-live-environment)
 - [Note on credentials and access](#note-on-credentials-and-access)
 - [Running live with an HTTP server listening for alerts](#running-live-with-a-http-server-listening-for-alerts)
-- [Running in test mode](#running-in-test-mode)
+- [Running in offline mode](#running-in-offline-mode)
 
 
 ## Use case description
@@ -149,20 +152,20 @@ The triage agent may call one or more of the following tools based on the alert
 
 #### Functions
 
-Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in test mode, using mocked data for simulation.
+Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in offline mode, using mocked data for simulation.
 
 Example:
 
 ```yaml
 hardware_check:
   _type: hardware_check
   llm_name: tool_reasoning_llm
-  test_mode: true
+  offline_mode: true
 ```
 
 * `_type`: Identifies the name of the tool (matching the name in the tool's Python file).
 * `llm_name`: LLM used to support the tool's reasoning over the raw fetched data.
-* `test_mode`: If `true`, the tool uses predefined mock results for offline testing.
+* `offline_mode`: If `true`, the tool uses predefined mock results for offline testing.
 
 Some entries, like `telemetry_metrics_analysis_agent`, are sub-agents that coordinate multiple tools:
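The sub-agent example itself lies outside this hunk. As a rough sketch of what such an entry might look like, based on the rest of this PR (the `llm_name` value, the `offline_mode` key, and the `tool_names` list below are assumptions, not lines from the diff):

```yaml
# Hypothetical sub-agent entry; llm_name, offline_mode, and tool_names are assumptions.
telemetry_metrics_analysis_agent:
  _type: telemetry_metrics_analysis_agent
  llm_name: tool_reasoning_llm        # assumed; mirrors the tools above
  offline_mode: true                  # assumed; mirrors the tools above
  tool_names:                         # tools this sub-agent coordinates (illustrative)
    - telemetry_metrics_host_heartbeat_check
    - telemetry_metrics_host_performance_check
```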
@@ -185,19 +188,17 @@ workflow:
     - hardware_check
     - ...
   llm_name: ata_agent_llm
-  test_mode: true
-  test_data_path: ...
+  offline_mode: true
+  offline_data_path: ...
   benign_fallback_data_path: ...
-  test_output_path: ...
 ```
 
 * `_type`: The name of the agent (matching the agent's name in `register.py`).
 * `tool_names`: List of tools (from the `functions` section) used in the triage process.
 * `llm_name`: Main LLM used by the agent for reasoning, tool-calling, and report generation.
-* `test_mode`: Enables test execution using predefined input/output instead of real systems.
-* `test_data_path`: CSV file containing test alerts and their corresponding mocked tool responses.
+* `offline_mode`: Enables offline execution using predefined input/output instead of real systems.
+* `offline_data_path`: CSV file containing offline test alerts and their corresponding mocked tool responses.
 * `benign_fallback_data_path`: JSON file with baseline healthy system responses for tools not explicitly mocked.
-* `test_output_path`: Output CSV file path where the agent writes triage results. Each processed alert adds a new `output` column with the generated report.
 
 #### LLMs
@@ -219,6 +220,50 @@ ata_agent_llm:
 
 Each tool or agent can use a dedicated LLM tailored for its task.
 
+#### Evaluation
+
+The `eval` section defines how the system evaluates pipeline outputs using predefined metrics. It includes the location of the dataset used for evaluation and the configuration of evaluation metrics.
+
+##### General
+
+...
+
+* `output_dir`: Directory where outputs (e.g., pipeline output texts, evaluation scores, agent traces) are saved.
+* `dataset.file_path`: Path to the JSON dataset used for evaluation.
+
+##### Evaluators
+
+Each entry under `evaluators` defines a specific metric to evaluate the pipeline's output. All listed evaluators use the `ragas` (Retrieval-Augmented Generation Assessment) framework.
+
+* `metric`: The specific `ragas` metric used to assess the output.
+  * `AnswerAccuracy`: Measures whether the agent's response matches the expected answer.
+  * `ResponseGroundedness`: Assesses whether the response is supported by retrieved context.
+  * `ContextRelevance`: Evaluates whether the retrieved context is relevant to the query.
+* `llm_name`: The name of an LLM listed in the `llms` section above that performs the evaluation. This LLM should be capable of understanding both the context and the generated responses in order to make accurate assessments.
+
+The list of evaluators can be extended or swapped out depending on your evaluation goals.
+
 ## Installation and setup
 
 If you have not already done so, follow the instructions in the [Install Guide](../../docs/source/quick-start/installing.md) to create the development environment and install AIQ toolkit.
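Most of the YAML example added in the Evaluation hunk above is elided. As a purely hypothetical sketch of an `eval` section consistent with the keys it describes (the paths, the evaluator entry names, the `eval_llm` reference, and `_type: ragas` are all assumptions):

```yaml
# Hypothetical sketch; paths, names, and _type values are assumptions.
eval:
  general:
    output_dir: .tmp/alert_triage_eval     # placeholder output directory
    dataset:
      file_path: data/offline_alerts.json  # placeholder dataset path
  evaluators:
    answer_accuracy:
      _type: ragas                         # assumed evaluator type
      metric: AnswerAccuracy
      llm_name: eval_llm                   # placeholder LLM defined under `llms`
    response_groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: eval_llm
    context_relevance:
      _type: ragas
      metric: ContextRelevance
      llm_name: eval_llm
```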
@@ ... @@
-You can run the agent in [test mode](#running-in-test-mode) or [live mode](#running-live-with-a-http-server-listening-for-alerts). Test mode allows you to evaluate the agent in a controlled, offline environment using synthetic data. Live mode allows you to run the agent in a real environment.
+You can run the agent in [offline mode](#running-in-offline-mode) or [live mode](#running-live-with-a-http-server-listening-for-alerts). Offline mode allows you to evaluate the agent in a controlled environment using synthetic data. Live mode allows you to run the agent in a real environment.
 
 ### Running in a live environment
 
 In live mode, each tool used by the triage agent connects to real systems to collect data. These systems can include:
@@ -262,11 +307,11 @@ To run the agent live, follow these steps:
 
    If your environment includes unique systems or data sources, you can define new tools or modify existing ones. This allows your triage agent to pull in the most relevant data for your alerts and infrastructure.
 
-3. **Disable test mode**
+3. **Disable offline mode**
 
-   Set `test_mode: false` in the workflow section and for each tool in the functions section of your config file to ensure the agent uses real data instead of synthetic test datasets.
+   Set `offline_mode: false` in the `workflow` section and for each tool in the `functions` section of your config file to ensure the agent uses real data instead of offline datasets.
 
-   You can also selectively keep some tools in test mode by leaving their `test_mode: true` for more granular testing.
+   You can also selectively keep some tools in offline mode by leaving their `offline_mode: true` for more granular testing.
 
 4. **Run the agent with a real alert**
@@ -371,33 +416,35 @@ To use this mode, first ensure you have configured your live environment as described above.
 
 You can monitor the progress of the triage process through these logs and the generated reports.
 
-### Running in test mode
-Test mode lets you evaluate the triage agent in a controlled, offline environment using synthetic data. Instead of calling real systems, the agent uses predefined inputs to simulate alerts and tool outputs, ideal for development, debugging, and tuning.
+### Running in offline mode
+Offline mode lets you evaluate the triage agent in a controlled, offline environment using synthetic data. Instead of calling real systems, the agent uses predefined inputs to simulate alerts and tool outputs, which makes it ideal for development, debugging, and tuning.
 
-To run in test mode:
+To run in offline mode:
 
 1. **Set required environment variables**
 
-   Make sure `test_mode: true` is set in both the `workflow` section and individual tool sections of your config file (see [Understanding the config](#understanding-the-config) section).
+   Make sure `offline_mode: true` is set in both the `workflow` section and individual tool sections of your config file (see the [Understanding the config](#understanding-the-config) section).
 
-1. **How it works**
-   - The **main test CSV** provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
-   - The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's test data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
+2. **How it works**
+   - The **main CSV offline dataset** (`offline_data_path`) provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
+   - The **JSON offline dataset** (`eval.general.dataset.file_path` in the config) contains a subset of the information in the main CSV: the alert inputs and their associated ground-truth root causes. It is used to run `aiq eval`, focusing only on the essential data needed for running the workflow, while the full CSV retains the complete mock-environment context. A hypothetical entry is sketched after this section.
+   - At runtime, the system links each alert in the JSON dataset to its corresponding context in the CSV using the unique host IDs included in both datasets.
+   - The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's offline data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
 
-3. **Run the agent in test mode**
+3. **Run the agent in offline mode**
 
   Run the agent with:
   ```bash
-  aiq run --config_file=examples/alert_triage_agent/configs/config_test_mode.yml --input "test_mode"
+  ...
  ```
   Note: The `--input` value is ignored in offline mode.
 
   The agent will:
-   - Load alerts from the test dataset specified in `test_data_path` in the workflow config
-   - Simulate an investigation using predefined tool results
-   - Iterate through all the alerts in the dataset
-   - Save reports as a new column in a copy of the test CSV file to the path specified in `test_output_path` in the workflow config
+   - Load alerts from the JSON dataset specified in the config `eval.general.dataset.file_path`
+   - Investigate the alerts using the predefined tool responses in the CSV file (path set in the config `workflow.offline_data_path`)
+   - Process all alerts in the dataset in parallel
+   - Run evaluation for the metrics specified in the config `eval.evaluators`
+   - Save the pipeline output along with the evaluation results to the path specified by `eval.general.output_dir`
 
-2. **Understanding the output**
+4. **Understanding the output**
 
   The output file will contain a new column named `output`, which includes the markdown report generated by the agent for each data point (i.e., each row in the CSV). Navigate to the rightmost `output` column to view the report for each entry.
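To make the JSON/CSV linkage described under "How it works" concrete, here is a purely hypothetical JSON offline dataset entry. The field names (`host_id`, `input`, `expected_output`) and all values are illustrative, since the actual schema is not shown in this diff:

```json
[
  {
    "host_id": "host-042",
    "input": "InstanceDown alert fired for host-042: heartbeat missing for 5 minutes",
    "expected_output": "Root cause: hardware fault on the host (failed power supply)"
  }
]
```

The `host_id` field is what would tie this entry back to the matching row of mocked tool responses in the CSV offline dataset.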
**examples/alert_triage_agent/src/aiq_alert_triage_agent/configs/config_offline_mode.yml** (+36 −11)
@@ -20,28 +20,28 @@ functions:
   hardware_check:
     _type: hardware_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
   host_performance_check:
     _type: host_performance_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
   monitoring_process_check:
     _type: monitoring_process_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
   network_connectivity_check:
     _type: network_connectivity_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
   telemetry_metrics_host_heartbeat_check:
     _type: telemetry_metrics_host_heartbeat_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
     metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
   telemetry_metrics_host_performance_check:
     _type: telemetry_metrics_host_performance_check
     llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
     metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
   telemetry_metrics_analysis_agent:
     _type: telemetry_metrics_analysis_agent

@@ -66,11 +66,10 @@ workflow:
     - network_connectivity_check
     - telemetry_metrics_analysis_agent
   llm_name: ata_agent_llm
-  test_mode: true
-  # The below paths are only used if test_mode is true