GuideLLM provides a comprehensive set of metrics to evaluate and optimize the performance of large language model (LLM) deployments. These metrics are designed to help users understand the behavior of their models under various conditions, identify bottlenecks, and make informed decisions about scaling and resource allocation. Below, we outline the key metrics measured by GuideLLM, their definitions, use cases, and how they can be interpreted.
- Successful Requests: The number of requests that were completed successfully without any errors.
- Incomplete Requests: The number of requests that were started but not completed, often due to timeouts or interruptions.
- Error Requests: The number of requests that failed due to errors, such as invalid inputs or server issues.
These metrics provide a breakdown of the overall request statuses, helping users identify the reliability and stability of their LLM deployment.
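As a quick illustration, this kind of status breakdown can be computed from a list of per-request outcomes. The `statuses` list below is hypothetical sample data, not actual GuideLLM output:

```python
from collections import Counter

# Hypothetical per-request outcomes collected during a benchmark run.
statuses = ["successful", "successful", "incomplete", "error", "successful"]

counts = Counter(statuses)
total = len(statuses)

# Report each status with its share of the total workload.
for status in ("successful", "incomplete", "error"):
    share = counts[status] / total
    print(f"{status}: {counts[status]} ({share:.0%})")
```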
**Total Requests**

- Definition: The total number of requests made during a benchmark run, broken down by status (successful, incomplete, error).
- Use Case: Helps gauge the workload handled by the system and the proportion of requests that succeeded versus those that failed or were left incomplete.

**Prompt Tokens**

- Definition: The number of tokens in the input prompts sent to the LLM.
- Use Case: Useful for understanding the complexity of the input data and its impact on model performance.

**Output Tokens**

- Definition: The number of tokens generated by the LLM in response to the input prompts.
- Use Case: Helps evaluate the model's output length and its correlation with latency and resource usage.

**Requests per Second**

- Definition: The number of requests processed per second.
- Use Case: Indicates the throughput of the system and its ability to handle concurrent workloads.

**Request Concurrency**

- Definition: The number of requests being processed simultaneously.
- Use Case: Helps evaluate the system's capacity to handle parallel workloads.

**Output Tokens per Second**

- Definition: The average rate at which output tokens are generated, measured as a throughput metric across all requests.
- Use Case: Provides insight into the server's efficiency in generating output tokens.

**Total Tokens per Second**

- Definition: The combined rate at which prompt and output tokens are processed, measured as a throughput metric across all requests.
- Use Case: Provides insight into the server's overall throughput in processing both prompt and output tokens.

**Request Latency**

- Definition: The time taken to process a single request, from start to finish.
- Use Case: A critical metric for evaluating the responsiveness of the system.

**Time to First Token (TTFT)**

- Definition: The time taken to generate the first token of the output.
- Use Case: Indicates the initial response time of the model, which is crucial for user-facing applications.

**Inter-Token Latency (ITL)**

- Definition: The average time between generating consecutive tokens in the output, excluding the first token.
- Use Case: Helps assess the smoothness and speed of token generation.

**Time per Output Token (TPOT)**

- Definition: The average time taken to generate each output token, including the first token.
- Use Case: Provides a detailed view of the model's token generation efficiency.
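The per-token latency metrics above can all be derived from the timestamps at which tokens complete. The sketch below uses hypothetical timings for a single request to show how the three quantities relate:

```python
# Hypothetical per-token completion timestamps (seconds) for one request,
# measured relative to when the request was sent.
token_times = [0.25, 0.30, 0.36, 0.41, 0.47]

# Time to first token: latency until the first output token arrives.
ttft = token_times[0]

# Inter-token latency: average gap between consecutive tokens,
# which excludes the first token's latency by construction.
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
itl = sum(gaps) / len(gaps)

# Time per output token: total generation time divided by all tokens,
# which folds the first token's latency into the average.
tpot = token_times[-1] / len(token_times)

print(f"TTFT={ttft:.3f}s ITL={itl:.3f}s TPOT={tpot:.3f}s")
```

Note that the two averages differ: the inter-token figure describes steady-state streaming speed, while the per-token figure is pulled up by the initial response time.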
GuideLLM provides detailed statistical summaries for each of the above metrics using its `StatusDistributionSummary` and `DistributionSummary` models. These summaries include the following statistics:
- Mean: The average value of the metric.
- Median: The middle value of the metric when sorted.
- Mode: The most frequently occurring value of the metric.
- Variance: The measure of how much the values of the metric vary.
- Standard Deviation (Std Dev): The square root of the variance, indicating the spread of the values.
- Min: The minimum value of the metric.
- Max: The maximum value of the metric.
- Count: The total number of data points for the metric.
- Total Sum: The sum of all values for the metric.
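All of these summary statistics can be reproduced with Python's standard `statistics` module. The latency samples below are hypothetical, and the dictionary is only a sketch of the kind of summary described above, not GuideLLM's actual data model:

```python
import statistics

# Hypothetical request-latency samples (seconds) from a benchmark run.
latencies = [0.8, 1.1, 0.9, 1.4, 1.0, 0.9]

summary = {
    "mean": statistics.mean(latencies),
    "median": statistics.median(latencies),
    "mode": statistics.mode(latencies),         # most frequent value
    "variance": statistics.variance(latencies), # sample variance
    "std_dev": statistics.stdev(latencies),     # sqrt of the variance
    "min": min(latencies),
    "max": max(latencies),
    "count": len(latencies),
    "total_sum": sum(latencies),
}
print(summary)
```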
GuideLLM calculates a comprehensive set of percentiles for each metric, including:
- 0.1th Percentile (p001): The value below which 0.1% of the data falls.
- 1st Percentile (p01): The value below which 1% of the data falls.
- 5th Percentile (p05): The value below which 5% of the data falls.
- 10th Percentile (p10): The value below which 10% of the data falls.
- 25th Percentile (p25): The value below which 25% of the data falls.
- 75th Percentile (p75): The value below which 75% of the data falls.
- 90th Percentile (p90): The value below which 90% of the data falls.
- 95th Percentile (p95): The value below which 95% of the data falls.
- 99th Percentile (p99): The value below which 99% of the data falls.
- 99.9th Percentile (p999): The value below which 99.9% of the data falls.
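The same set of percentiles can be approximated from raw samples with `statistics.quantiles`; with `n=1000` it returns 999 cut points, so the value at index `k - 1` is the `k/1000` quantile. The data below is synthetic, purely to make the sketch runnable:

```python
import statistics

# Synthetic latency samples; real benchmark runs would supply these.
data = [0.5 + 0.01 * i for i in range(1000)]

# 999 cut points dividing the distribution into 1000 equal intervals.
cuts = statistics.quantiles(data, n=1000)

percentiles = {
    "p001": cuts[0],    # 0.1th percentile
    "p01": cuts[9],
    "p05": cuts[49],
    "p10": cuts[99],
    "p25": cuts[249],
    "p75": cuts[749],
    "p90": cuts[899],
    "p95": cuts[949],
    "p99": cuts[989],
    "p999": cuts[998],  # 99.9th percentile
}
for name, value in percentiles.items():
    print(f"{name}: {value:.4f}")
```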
These statistics can be interpreted as follows:

- Mean and Median: Indicate the central tendency of the metric values.
- Variance and Std Dev: Indicate the variability and consistency of the metric.
- Min and Max: Highlight the range of the metric values.
- Percentiles: Offer a detailed view of the distribution, helping identify outliers and performance at different levels of service.
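To see why percentiles matter alongside the mean, consider a hypothetical latency distribution where most requests are fast but a few are very slow:

```python
import statistics

# Hypothetical latencies: 97 fast requests plus three slow outliers.
latencies = [0.2] * 97 + [3.0, 4.0, 5.0]

mean = statistics.mean(latencies)
# quantiles(n=100) yields 99 cut points; index 98 is the 99th percentile.
p99 = statistics.quantiles(latencies, n=100)[98]

# The mean hides the tail; the 99th percentile exposes it.
print(f"mean={mean:.2f}s p99={p99:.2f}s")
```

Here the mean suggests sub-second latency, while the p99 reveals that the slowest 1% of users wait several seconds, which is exactly the kind of tail behavior percentile summaries are meant to surface.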
By combining these metrics and statistical summaries, GuideLLM enables users to gain a deep understanding of their LLM deployments, optimize performance, and ensure scalability and cost-effectiveness.