[metrics] Add several observability metrics #3868
K11OntheBoat does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

Thanks for your contribution!
docs/zh/online_serving/metrics.md
| `fastdeploy:request_success_total` | Counter | Number of successfully processed requests | Count |
| `fastdeploy:cache_config_info` | Gauge | Cache configuration information of the inference engine | Count |
| `fastdeploy:available_batch_size` | Gauge | Number of requests the system can still accept | Count |
Change this to "Number of requests that can still be inserted during the decode phase".
fastdeploy/metrics/metrics.py
"available_batch_size": {
    "type": Gauge,
    "name": "fastdeploy:available_batch_size",
    "description": "Number of how many new requests the system can still accept.",
Please update the corresponding English text as well.
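For illustration, a spec entry like the one above can be rendered as Prometheus text exposition output using only the standard library. This is a minimal sketch, not FastDeploy's code: it assumes `"type"` holds the string `"gauge"` rather than the `Gauge` class, uses the reviewer's suggested wording as the description, and the `METRICS` table and `render_metric` helper are hypothetical.

```python
# Hypothetical spec table mirroring the structure of the metrics.py snippet;
# "type" is a plain string here instead of the prometheus_client Gauge class.
METRICS = {
    "available_batch_size": {
        "type": "gauge",
        "name": "fastdeploy:available_batch_size",
        "description": "Number of requests that can continue to be inserted during the decode phase",
    },
}


def escape_help(text: str) -> str:
    """Escape HELP text as the text exposition format requires (backslash, newline)."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def render_metric(key: str, value: float) -> str:
    """Render HELP, TYPE and one unlabeled sample line for a spec entry (hypothetical helper)."""
    spec = METRICS[key]
    name = spec["name"]
    lines = [
        f"# HELP {name} {escape_help(spec['description'])}",
        f"# TYPE {name} {spec['type']}",
        f"{name} {value}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric("available_batch_size", 12), end="")
```

In practice a client library such as `prometheus_client` generates this output itself; the sketch only shows how the `name`, `description` and `type` fields map onto the exposed lines.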
docs/online_serving/metrics.md
| `fastdeploy:request_success_total` | Counter | Number of successfully processed requests | Count |
| `fastdeploy:cache_config_info` | Gauge | Information of the engine's CacheConfig | Count |
| `fastdeploy:available_batch_size` | Gauge | Number of how many new requests the system can still accept | Count |
Same as for the Chinese version: please revise the meaning of the description, e.g., "Number of requests that can continue to be inserted during the decode phase".
* Add several observability metrics
* [wenxin-tools-584] [Observability] Support viewing this node's concurrency, remaining block_size, number of queued requests, and other information
* adjust some metrics and md files
* trigger ci
* adjust ci file
* trigger ci
* trigger ci

---------

Co-authored-by: K11OntheBoat <[email protected]>
Co-authored-by: Jiang-Jia-Jun <[email protected]>
* [metrics] Add several observability metrics (#3868)
  * Add several observability metrics
  * [wenxin-tools-584] [Observability] Support viewing this node's concurrency, remaining block_size, number of queued requests, and other information
  * adjust some metrics and md files
  * trigger ci
  * adjust ci file
  * trigger ci
  * trigger ci

  ---------

  Co-authored-by: K11OntheBoat <[email protected]>
  Co-authored-by: Jiang-Jia-Jun <[email protected]>
* version adjust

---------

Co-authored-by: K11OntheBoat <[email protected]>
Co-authored-by: Jiang-Jia-Jun <[email protected]>
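fastdeploy:cache_config_info is a Gauge that records information about the current node's cache configuration (CacheConfig). It is recorded when the engine initializes and contains the following fields: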
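block_size: The size of each block in the KV cache, in tokens.
bytes_per_block: The number of bytes occupied by each KV cache block.
bytes_per_layer_per_block: The number of bytes each block occupies in a single model layer.
cache_dtype: The data type used to store key-value (KV) pairs in the cache.
cache_queue_port: The port number used for cache queue communication.
cache_transfer_protocol: The protocol used to transfer cache data, e.g., ipc (inter-process communication).
dec_token_num: The number of tokens in the decode phase.
each_token_cache_space: The amount of KV cache space occupied by each token.
enable_chunked_prefill: Whether chunked prefill is enabled.
enable_hierarchical_cache: Whether hierarchical caching is enabled.
enable_prefix_caching: Whether prefix caching is enabled.
enable_ssd_cache: Whether SSD (solid-state drive) caching is enabled.
enc_dec_block_num: The number of blocks for encoder-decoder models.
gpu_memory_utilization: The fraction of total GPU memory that may be used.
kv_cache_ratio: The fraction of memory occupied by the KV cache.
max_block_num_per_seq: The maximum number of blocks that can be allocated to each sequence (request).
model_cfg: The model configuration object.
num_cpu_blocks: The number of cache blocks allocated on the CPU.
num_gpu_blocks_override: The number of cache blocks allocated on the GPU, overriding the default value.
pd_comm_port: The port number used for communication.
prealloc_dec_block_slot_num_threshold: The threshold for pre-allocating decode block slots.
prefill_kvcache_block_num: The number of KV cache blocks used in the prefill phase.
rdma_comm_ports: The port numbers used for RDMA (Remote Direct Memory Access) communication.
swap_space: The size of the swap space, used to handle insufficient memory.
total_block_num: The total number of blocks in the KV cache.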
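A Gauge carries only a number, so configuration like this is commonly exposed through the Prometheus "info" pattern: a single sample with a constant value of 1 whose labels hold the settings. The source does not show how FastDeploy encodes these fields; the sketch below assumes that pattern, and `render_info_gauge` and the sample values are hypothetical.

```python
def escape_label_value(value: object) -> str:
    """Escape a label value per the text exposition format (backslash, quote, newline)."""
    text = str(value)
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_info_gauge(name: str, fields: dict) -> str:
    """Render one sample whose labels carry config fields and whose value is a constant 1."""
    # Sort label names so the rendered line is stable between scrapes.
    labels = ",".join(
        f'{key}="{escape_label_value(fields[key])}"' for key in sorted(fields)
    )
    return f"{name}{{{labels}}} 1\n"


if __name__ == "__main__":
    # Illustrative values only; real values would come from the engine's CacheConfig.
    cache_fields = {
        "block_size": 64,
        "cache_dtype": "bfloat16",
        "enable_prefix_caching": True,
    }
    print(render_info_gauge("fastdeploy:cache_config_info", cache_fields), end="")
```

With the illustrative fields this prints `fastdeploy:cache_config_info{block_size="64",cache_dtype="bfloat16",enable_prefix_caching="True"} 1`. Because the value never changes, dashboards read the labels rather than the number.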
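fastdeploy:available_batch_size is a Gauge that records the available batch size of the current node, i.e., how many new requests it can still accept. It is recorded whenever resources are allocated or released.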
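Because the gauge is updated on allocation and release, its value can be read as the number of free batch slots. A minimal sketch assuming a fixed maximum batch size; the `BatchSlotTracker` class and its methods are hypothetical and do not reflect FastDeploy's resource manager:

```python
class BatchSlotTracker:
    """Track how many more requests can be admitted into a fixed-size batch."""

    def __init__(self, max_batch_size: int) -> None:
        self.max_batch_size = max_batch_size
        self.active = set()  # IDs of requests currently holding a slot

    @property
    def available_batch_size(self) -> int:
        # The gauge value: free slots remaining in the batch.
        return self.max_batch_size - len(self.active)

    def allocate(self, request_id: str) -> bool:
        """Admit a request if a slot is free; return whether it was admitted."""
        if request_id in self.active or self.available_batch_size == 0:
            return False
        self.active.add(request_id)
        return True

    def release(self, request_id: str) -> None:
        """Free the slot held by a finished request (no-op if the ID is unknown)."""
        self.active.discard(request_id)
```

A real engine would set the gauge to `available_batch_size` right after each `allocate` or `release` call, which matches the "recorded on allocation and release" timing described above.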
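fastdeploy:hit_req_rate is a Gauge that records the request-level prefix cache hit rate. It is recorded when CacheMetrics updates its hit metrics.
fastdeploy:hit_token_rate is a Gauge that records the token-level prefix cache hit rate. It is recorded when CacheMetrics updates its hit metrics.
fastdeploy:cpu_hit_token_rate is a Gauge that records the token-level CPU prefix cache hit rate. It is recorded when CacheMetrics updates its hit metrics.
fastdeploy:gpu_hit_token_rate is a Gauge that records the token-level GPU prefix cache hit rate. It is recorded when CacheMetrics updates its hit metrics.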
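One way to read the four rates is as cumulative ratios: requests with any prefix hit over all requests, and matched tokens (total, CPU, GPU) over all prompt tokens. A minimal sketch under that assumption; the `PrefixCacheHitStats` class and its field names are hypothetical and may not match how FastDeploy's CacheMetrics computes them:

```python
from dataclasses import dataclass


@dataclass
class PrefixCacheHitStats:
    """Cumulative counters from which the four hit-rate gauges can be derived."""

    total_reqs: int = 0
    hit_reqs: int = 0
    total_tokens: int = 0
    gpu_hit_tokens: int = 0
    cpu_hit_tokens: int = 0

    def record(self, prompt_tokens: int, gpu_hit: int, cpu_hit: int) -> None:
        """Account for one request whose prefix matched gpu_hit + cpu_hit tokens."""
        self.total_reqs += 1
        self.total_tokens += prompt_tokens
        self.gpu_hit_tokens += gpu_hit
        self.cpu_hit_tokens += cpu_hit
        if gpu_hit + cpu_hit > 0:
            self.hit_reqs += 1

    @staticmethod
    def _ratio(num: int, den: int) -> float:
        # Report 0.0 instead of dividing by zero before any traffic arrives.
        return num / den if den else 0.0

    def rates(self) -> dict:
        """Return the current value of each hit-rate gauge, keyed by metric name."""
        return {
            "fastdeploy:hit_req_rate": self._ratio(self.hit_reqs, self.total_reqs),
            "fastdeploy:hit_token_rate": self._ratio(
                self.gpu_hit_tokens + self.cpu_hit_tokens, self.total_tokens
            ),
            "fastdeploy:cpu_hit_token_rate": self._ratio(self.cpu_hit_tokens, self.total_tokens),
            "fastdeploy:gpu_hit_token_rate": self._ratio(self.gpu_hit_tokens, self.total_tokens),
        }
```

Cumulative ratios like these are smooth but slow to react; a sliding window over recent requests would surface changes in cache effectiveness sooner, at the cost of extra bookkeeping.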