
Commit aff0ae1

Merge branch 'develop' into second_pr
2 parents 68434ca + 7f6fedb commit aff0ae1

34 files changed (+2974, -2678 lines)

docs/source/quick-start/installing.md

Lines changed: 1 addition & 1 deletion
@@ -100,7 +100,7 @@ Before you begin using AIQ toolkit, ensure that you meet the following software

In addition to plugins, there are optional dependencies needed for profiling. To install these dependencies, run the following:

```bash
-uv pip install -e .[profiling]
+uv pip install -e '.[profiling]'
```

1. Verify that you've installed the AIQ toolkit library.
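For context on the quoting change above: in shells such as zsh, an unquoted `.[profiling]` is parsed as a glob pattern, and the command can fail with a "no matches found" error before `uv` ever runs. Quoting the argument passes the extras specifier through literally. A minimal sketch (illustrative only):

```shell
# Unquoted, zsh would attempt glob expansion on '.[profiling]'.
# Single quotes keep the extras specifier literal for the installer.
arg='.[profiling]'
echo "$arg"
```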

docs/source/workflows/evaluate.md

Lines changed: 10 additions & 1 deletion
@@ -142,9 +142,18 @@ eval:

A judge LLM is used to evaluate the trajectory produced by the workflow, taking into account the tools available during execution. It returns a floating-point score between 0 and 1, where 1.0 indicates a perfect trajectory.

+To configure the judge LLM, define it in the `llms` section of the configuration file, and reference it in the evaluator configuration using the `llm_name` key.
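As a sketch of what that wiring might look like (the LLM name, model, and evaluator `_type` below are placeholders for illustration, not taken from this commit):

```yaml
llms:
  trajectory_judge_llm:        # hypothetical name
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    max_tokens: 1024

eval:
  evaluators:
    trajectory_eval:
      _type: trajectory        # evaluator type assumed for illustration
      llm_name: trajectory_judge_llm
```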
It is recommended to set `max_tokens` to 1024 for the judge LLM to ensure sufficient context for evaluation.

-To configure the judge LLM, define it in the `llms` section of the configuration file, and reference it in the evaluator configuration using the `llm_name` key.

+Note: Trajectory evaluation may result in frequent LLM API calls. If you encounter rate-limiting errors (such as a `[429] Too Many Requests` error), you can reduce the number of concurrent requests by adjusting the `max_concurrency` parameter in your config. For example:
+
+```yaml
+eval:
+  general:
+    max_concurrency: 2
+```
+
+This setting reduces the number of concurrent requests to avoid overwhelming the LLM endpoint.
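Conceptually, a concurrency cap like `max_concurrency` behaves like a semaphore around the judge calls. A minimal sketch (not AIQ's actual implementation; `judge` is a stand-in for the real LLM request):

```python
import asyncio

async def evaluate_all(items, max_concurrency=2):
    # Cap the number of judge-LLM requests in flight at once.
    sem = asyncio.Semaphore(max_concurrency)

    async def judge(item):
        async with sem:  # at most max_concurrency calls run concurrently
            await asyncio.sleep(0)  # stand-in for the real LLM API request
            return f"scored:{item}"

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(judge(i) for i in items))

results = asyncio.run(evaluate_all(["a", "b", "c"]))
```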
## Workflow Output

The `aiq eval` command runs the workflow on all the entries in the `dataset`. The output of these runs is stored in a file named `workflow_output.json` under the `output_dir` specified in the configuration file.

examples/alert_triage_agent/README.md

Lines changed: 76 additions & 29 deletions
@@ -34,14 +34,17 @@ This example demonstrates how to build an intelligent alert triage system using

- [Functions](#functions)
- [Workflow](#workflow)
- [LLMs](#llms)
+- [Evaluation](#evaluation)
+  - [General](#general)
+  - [Evaluators](#evaluators)
- [Installation and setup](#installation-and-setup)
- [Install this workflow](#install-this-workflow)
- [Set up environment variables](#set-up-environment-variables)
- [Example Usage](#example-usage)
- [Running in a live environment](#running-in-a-live-environment)
- [Note on credentials and access](#note-on-credentials-and-access)
- [Running live with a HTTP server listening for alerts](#running-live-with-a-http-server-listening-for-alerts)
-- [Running in test mode](#running-in-test-mode)
+- [Running in offline mode](#running-in-offline-mode)
## Use case description
@@ -149,20 +152,20 @@ The triage agent may call one or more of the following tools based on the alert
#### Functions

-Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in test mode, using mocked data for simulation.
+Each entry in the `functions` section defines a tool or sub-agent that can be invoked by the main workflow agent. Tools can operate in offline mode, using mocked data for simulation.

Example:

```yaml
hardware_check:
  _type: hardware_check
  llm_name: tool_reasoning_llm
-  test_mode: true
+  offline_mode: true
```

* `_type`: Identifies the name of the tool (matching the names in the tools' Python files).
* `llm_name`: LLM used to support the tool's reasoning over the raw fetched data.
-* `test_mode`: If `true`, the tool uses predefined mock results for offline testing.
+* `offline_mode`: If `true`, the tool uses predefined mock results for offline testing.

Some entries, like `telemetry_metrics_analysis_agent`, are sub-agents that coordinate multiple tools:
@@ -185,19 +188,17 @@ workflow:
    - hardware_check
    - ...
  llm_name: ata_agent_llm
-  test_mode: true
-  test_data_path: ...
+  offline_mode: true
+  offline_data_path: ...
  benign_fallback_data_path: ...
-  test_output_path: ...
```

* `_type`: The name of the agent (matching the agent's name in `register.py`).
* `tool_names`: List of tools (from the `functions` section) used in the triage process.
* `llm_name`: Main LLM used by the agent for reasoning, tool-calling, and report generation.
-* `test_mode`: Enables test execution using predefined input/output instead of real systems.
-* `test_data_path`: CSV file containing test alerts and their corresponding mocked tool responses.
+* `offline_mode`: Enables offline execution using predefined input/output instead of real systems.
+* `offline_data_path`: CSV file containing offline test alerts and their corresponding mocked tool responses.
* `benign_fallback_data_path`: JSON file with baseline healthy system responses for tools not explicitly mocked.
-* `test_output_path`: Output CSV file path where the agent writes triage results. Each processed alert adds a new `output` column with the generated report.

#### LLMs

@@ -219,6 +220,50 @@ ata_agent_llm:
Each tool or agent can use a dedicated LLM tailored for its task.

+#### Evaluation
+
+The `eval` section defines how the system evaluates pipeline outputs using predefined metrics. It includes the location of the dataset used for evaluation and the configuration of evaluation metrics.
+
+```yaml
+eval:
+  general:
+    output_dir: .tmp/aiq/examples/alert_triage_agent/output/
+    dataset:
+      _type: json
+      file_path: examples/alert_triage_agent/data/offline_data.json
+  evaluators:
+    rag_accuracy:
+      _type: ragas
+      metric: AnswerAccuracy
+      llm_name: nim_rag_eval_llm
+    rag_groundedness:
+      _type: ragas
+      metric: ResponseGroundedness
+      llm_name: nim_rag_eval_llm
+    rag_relevance:
+      _type: ragas
+      metric: ContextRelevance
+      llm_name: nim_rag_eval_llm
+```
+
+##### General
+
+* `output_dir`: Directory where outputs (e.g., pipeline output texts, evaluation scores, agent traces) are saved.
+* `dataset.file_path`: Path to the JSON dataset used for evaluation.
+
+##### Evaluators
+
+Each entry under `evaluators` defines a specific metric to evaluate the pipeline's output. All listed evaluators use the `ragas` (Retrieval-Augmented Generation Assessment) framework.
+
+* `metric`: The specific `ragas` metric used to assess the output.
+  * `AnswerAccuracy`: Measures whether the agent's response matches the expected answer.
+  * `ResponseGroundedness`: Assesses whether the response is supported by retrieved context.
+  * `ContextRelevance`: Evaluates whether the retrieved context is relevant to the query.
+* `llm_name`: The name of an LLM defined in the `llms` section above that performs the evaluation. This LLM should be capable of understanding both the context and the generated responses to make accurate assessments.
+
+The list of evaluators can be extended or swapped out depending on your evaluation goals.

## Installation and setup

If you have not already done so, follow the instructions in the [Install Guide](../../docs/source/quick-start/installing.md) to create the development environment and install AIQ toolkit.
@@ -240,7 +285,7 @@ export $(grep -v '^#' .env | xargs)
```

## Example Usage

-You can run the agent in [test mode](#running-in-test-mode) or [live mode](#running-live-with-a-http-server-listening-for-alerts). Test mode allows you to evaluate the agent in a controlled, offline environment using synthetic data. Live mode allows you to run the agent in a real environment.
+You can run the agent in [offline mode](#running-in-offline-mode) or [live mode](#running-live-with-a-http-server-listening-for-alerts). Offline mode allows you to evaluate the agent in a controlled, offline environment using synthetic data. Live mode allows you to run the agent in a real environment.

### Running in a live environment

In live mode, each tool used by the triage agent connects to real systems to collect data. These systems can include:
@@ -262,11 +307,11 @@ To run the agent live, follow these steps:
If your environment includes unique systems or data sources, you can define new tools or modify existing ones. This allows your triage agent to pull in the most relevant data for your alerts and infrastructure.

-3. **Disable test mode**
+3. **Disable offline mode**

-   Set `test_mode: false` in the workflow section and for each tool in the functions section of your config file to ensure the agent uses real data instead of synthetic test datasets.
+   Set `offline_mode: false` in the workflow section and for each tool in the functions section of your config file to ensure the agent uses real data instead of offline datasets.

-   You can also selectively keep some tools in test mode by leaving their `test_mode: true` for more granular testing.
+   You can also selectively keep some tools in offline mode by leaving their `offline_mode: true` for more granular testing.

4. **Run the agent with a real alert**
@@ -371,33 +416,35 @@ To use this mode, first ensure you have configured your live environment as desc
You can monitor the progress of the triage process through these logs and the generated reports.

-### Running in test mode
+### Running in offline mode

-Test mode lets you evaluate the triage agent in a controlled, offline environment using synthetic data. Instead of calling real systems, the agent uses predefined inputs to simulate alerts and tool outputs, ideal for development, debugging, and tuning.
+Offline mode lets you evaluate the triage agent in a controlled, offline environment using synthetic data. Instead of calling real systems, the agent uses predefined inputs to simulate alerts and tool outputs, ideal for development, debugging, and tuning.

-To run in test mode:
+To run in offline mode:

1. **Set required environment variables**

-   Make sure `test_mode: true` is set in both the `workflow` section and individual tool sections of your config file (see [Understanding the config](#understanding-the-config) section).
+   Make sure `offline_mode: true` is set in both the `workflow` section and individual tool sections of your config file (see the [Understanding the config](#understanding-the-config) section).

-1. **How it works**
-   - The **main test CSV** provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
-   - The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's test data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
+2. **How it works**
+   - The **main CSV offline dataset** (`offline_data_path`) provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
+   - The **JSON offline dataset** (`eval.general.dataset.file_path` in the config) contains a subset of the information from the main CSV: the alert inputs and their associated ground-truth root causes. It is used to run `aiq eval`, focusing only on the essential data needed for running the workflow, while the full CSV retains the complete mock environment context.
+   - At runtime, the system links each alert in the JSON dataset to its corresponding context in the CSV using the unique host IDs included in both datasets.
+   - The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's offline data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
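The host-ID linkage described in the steps above amounts to a simple keyed join between the two datasets. A minimal sketch (the field names `host_id` and `hardware_check` are illustrative, not the example's actual schema):

```python
import csv
import io
import json

# Hypothetical offline datasets sharing a host ID key.
json_alerts = json.loads('[{"host_id": "h1", "alert": "InstanceDown"}]')
csv_text = "host_id,hardware_check\nh1,PSU status OK\n"

# Index the CSV mock environment by host ID, then link each JSON
# alert to its mocked tool responses.
mock_env = {row["host_id"]: row for row in csv.DictReader(io.StringIO(csv_text))}
linked = [(a["alert"], mock_env[a["host_id"]]["hardware_check"]) for a in json_alerts]
```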

386-
3. **Run the agent in test mode**
433+
3. **Run the agent in offline mode**
387434

388435
Run the agent with:
389436
```bash
390-
aiq run --config_file=examples/alert_triage_agent/configs/config_test_mode.yml --input "test_mode"
437+
aiq eval --config_file=examples/alert_triage_agent/configs/config_offline_mode.yml
391438
```
392-
Note: The `--input` value is ignored in test mode.
393439

394440
The agent will:
395-
- Load alerts from the test dataset specified in `test_data_path` in the workflow config
396-
- Simulate an investigation using predefined tool results
397-
- Iterate through all the alerts in the dataset
398-
- Save reports as a new column in a copy of the test CSV file to the path specified in `test_output_path` in the workflow config
441+
- Load alerts from the JSON dataset specified in the config `eval.general.dataset.filepath`
442+
- Investigate the alerts using predefined tool responses in the CSV file (path set in the config `workflow.offline_data_path`)
443+
- Process all alerts in the dataset in parallel
444+
- Run evaluation for the metrics specified in the config `eval.evaluators`
445+
- Save the pipeline output along with the evaluation results to the path specified by `eval.output_dir`
399446

400-
2. **Understanding the output**
447+
4. **Understanding the output**
401448

402449
The output file will contain a new column named `output`, which includes the markdown report generated by the agent for each data point (i.e., each row in the CSV). Navigate to that rightmost `output` column to view the report for each test entry.
403450

examples/alert_triage_agent/src/aiq_alert_triage_agent/categorizer.py

Lines changed: 4 additions & 5 deletions
@@ -28,13 +28,13 @@

from aiq.data_models.function import FunctionBaseConfig

from . import utils
-from .prompts import PipelineNodePrompts
+from .prompts import CategorizerPrompts


class CategorizerToolConfig(FunctionBaseConfig, name="categorizer"):
-    description: str = Field(default="This is a categorization tool used at the end of the pipeline.",
-                             description="Description of the tool.")
+    description: str = Field(default=CategorizerPrompts.TOOL_DESCRIPTION, description="Description of the tool.")
    llm_name: LLMRef
+    prompt: str = Field(default=CategorizerPrompts.PROMPT, description="Main prompt for the categorization task.")


def _extract_markdown_heading_level(report: str) -> str:

@@ -48,8 +48,7 @@ def _extract_markdown_heading_level(report: str) -> str:

async def categorizer_tool(config: CategorizerToolConfig, builder: Builder):
    # Set up LLM and chain
    llm = await builder.get_llm(config.llm_name, wrapper_type=LLMFrameworkEnum.LANGCHAIN)
-    prompt_template = ChatPromptTemplate([("system", PipelineNodePrompts.CATEGORIZER_PROMPT),
-                                          MessagesPlaceholder("msgs")])
+    prompt_template = ChatPromptTemplate([("system", config.prompt), MessagesPlaceholder("msgs")])
    categorization_chain = prompt_template | llm

    async def _arun(report: str) -> str:

examples/alert_triage_agent/src/aiq_alert_triage_agent/configs/config_live_mode.yml

Lines changed: 9 additions & 10 deletions
@@ -20,27 +20,27 @@ functions:
  hardware_check:
    _type: hardware_check
    llm_name: tool_reasoning_llm
-    test_mode: false
+    offline_mode: false
  host_performance_check:
    _type: host_performance_check
    llm_name: tool_reasoning_llm
-    test_mode: false
+    offline_mode: false
  monitoring_process_check:
    _type: monitoring_process_check
    llm_name: tool_reasoning_llm
-    test_mode: false
+    offline_mode: false
  network_connectivity_check:
    _type: network_connectivity_check
    llm_name: tool_reasoning_llm
-    test_mode: false
+    offline_mode: false
  telemetry_metrics_host_heartbeat_check:
    _type: telemetry_metrics_host_heartbeat_check
    llm_name: tool_reasoning_llm
-    test_mode: false
+    offline_mode: false
  telemetry_metrics_host_performance_check:
    _type: telemetry_metrics_host_performance_check
    llm_name: tool_reasoning_llm
-    test_mode: false
+    offline_mode: false
  telemetry_metrics_analysis_agent:
    _type: telemetry_metrics_analysis_agent
    tool_names:

@@ -64,11 +64,10 @@ workflow:
    - network_connectivity_check
    - telemetry_metrics_analysis_agent
  llm_name: ata_agent_llm
-  test_mode: false
-  # The below paths are only used if test_mode is true
-  test_data_path: null
+  offline_mode: false
+  # The below paths are only used if offline_mode is true
+  offline_data_path: null
  benign_fallback_data_path: null
-  test_output_path: null

llms:
  ata_agent_llm:

examples/alert_triage_agent/src/aiq_alert_triage_agent/configs/config_test_mode.yml renamed to examples/alert_triage_agent/src/aiq_alert_triage_agent/configs/config_offline_mode.yml

Lines changed: 36 additions & 11 deletions
@@ -20,28 +20,28 @@ functions:
  hardware_check:
    _type: hardware_check
    llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
  host_performance_check:
    _type: host_performance_check
    llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
  monitoring_process_check:
    _type: monitoring_process_check
    llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
  network_connectivity_check:
    _type: network_connectivity_check
    llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
  telemetry_metrics_host_heartbeat_check:
    _type: telemetry_metrics_host_heartbeat_check
    llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
    metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
  telemetry_metrics_host_performance_check:
    _type: telemetry_metrics_host_performance_check
    llm_name: tool_reasoning_llm
-    test_mode: true
+    offline_mode: true
    metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
  telemetry_metrics_analysis_agent:
    _type: telemetry_metrics_analysis_agent

@@ -66,11 +66,10 @@ workflow:
    - network_connectivity_check
    - telemetry_metrics_analysis_agent
  llm_name: ata_agent_llm
-  test_mode: true
-  # The below paths are only used if test_mode is true
-  test_data_path: examples/alert_triage_agent/data/test_data.csv
-  benign_fallback_data_path: examples/alert_triage_agent/data/benign_fallback_test_data.json
-  test_output_path: .tmp/aiq/examples/alert_triage_agent/output/test_output.csv
+  offline_mode: true
+  # The below paths are only used if offline_mode is true
+  offline_data_path: examples/alert_triage_agent/data/offline_data.csv
+  benign_fallback_data_path: examples/alert_triage_agent/data/benign_fallback_offline_data.json

llms:
  ata_agent_llm:

@@ -103,3 +102,29 @@ llms:
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0
    max_tokens: 2048
+
+  nim_rag_eval_llm:
+    _type: nim
+    model_name: meta/llama-3.3-70b-instruct
+    max_tokens: 8
+
+eval:
+  general:
+    output_dir: .tmp/aiq/examples/alert_triage_agent/output/
+    dataset:
+      _type: json
+      # JSON representation of the offline CSV data (including just the alerts, the expected output, and the label)
+      file_path: examples/alert_triage_agent/data/offline_data.json
+  evaluators:
+    rag_accuracy:
+      _type: ragas
+      metric: AnswerAccuracy
+      llm_name: nim_rag_eval_llm
+    rag_groundedness:
+      _type: ragas
+      metric: ResponseGroundedness
+      llm_name: nim_rag_eval_llm
+    rag_relevance:
+      _type: ragas
+      metric: ContextRelevance
+      llm_name: nim_rag_eval_llm

examples/alert_triage_agent/src/aiq_alert_triage_agent/data/benign_fallback_test_data.json renamed to examples/alert_triage_agent/src/aiq_alert_triage_agent/data/benign_fallback_offline_data.json

File renamed without changes.

examples/alert_triage_agent/src/aiq_alert_triage_agent/data/test_data.csv renamed to examples/alert_triage_agent/src/aiq_alert_triage_agent/data/offline_data.csv

File renamed without changes.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6ec4ad0f439d4c7c8682e3deb8ab7d10190c19ec23af5f61e5c2a0bce7f7a51f
+size 2490
