GuideLLM provides a comprehensive set of metrics to evaluate and optimize the performance of large language model (LLM) deployments. These metrics are designed to help users understand the behavior of their models under various conditions, identify bottlenecks, and make informed decisions about scaling and resource allocation. Below, we outline the key metrics measured by GuideLLM, their definitions, use cases, and how they can be interpreted.
- Successful Requests: The number of requests that were completed successfully without any errors.
- Incomplete Requests: The number of requests that were started but not completed, often due to timeouts or interruptions.
- Error Requests: The number of requests that failed due to errors, such as invalid inputs or server issues.
These metrics provide a breakdown of the overall request statuses, helping users identify the reliability and stability of their LLM deployment.
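As a quick illustration, this kind of status breakdown can be computed from a list of per-request outcomes. The `statuses` list below is hypothetical sample data, not actual GuideLLM output:

```python
from collections import Counter

# Hypothetical per-request outcomes collected during a benchmark run.
statuses = ["successful", "successful", "incomplete", "error", "successful"]

counts = Counter(statuses)
total = len(statuses)

# Report each status with its share of the total workload.
for status in ("successful", "incomplete", "error"):
    share = counts[status] / total
    print(f"{status}: {counts[status]} ({share:.0%})")
```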
**Total Requests**

- Definition: The total number of requests made during a benchmark run, broken down by status (successful, incomplete, error).
- Use Case: Helps gauge the workload handled by the system and the proportion of requests that succeeded versus those that failed or were left incomplete.

**Prompt Tokens**

- Definition: The number of tokens in the input prompts sent to the LLM.
- Use Case: Useful for understanding the complexity of the input data and its impact on model performance.

**Output Tokens**

- Definition: The number of tokens generated by the LLM in response to the input prompts.
- Use Case: Helps evaluate the model's output length and its correlation with latency and resource usage.

**Requests per Second**

- Definition: The number of requests processed per second.
- Use Case: Indicates the throughput of the system and its ability to handle concurrent workloads.

**Request Concurrency**

- Definition: The number of requests being processed simultaneously.
- Use Case: Helps evaluate the system's capacity to handle parallel workloads.

**Output Tokens per Second**

- Definition: The average rate at which output tokens are generated, measured as a throughput metric across all requests.
- Use Case: Provides insight into the server's efficiency in generating output tokens.

**Total Tokens per Second**

- Definition: The combined rate at which prompt and output tokens are processed, measured as a throughput metric across all requests.
- Use Case: Provides insight into the server's overall throughput in processing both prompt and output tokens.

**Request Latency**

- Definition: The time taken to process a single request, from start to finish.
- Use Case: A critical metric for evaluating the responsiveness of the system.

**Time to First Token (TTFT)**

- Definition: The time taken to generate the first token of the output.
- Use Case: Indicates the initial response time of the model, which is crucial for user-facing applications.

**Inter-Token Latency (ITL)**

- Definition: The average time between generating consecutive tokens in the output, excluding the first token.
- Use Case: Helps assess the smoothness and speed of token generation.

**Time per Output Token (TPOT)**

- Definition: The average time taken to generate each output token, including the first token.
- Use Case: Provides a detailed view of the model's token generation efficiency.
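The per-token latency metrics above can all be derived from the timestamps at which tokens complete. The sketch below uses hypothetical timings for a single request to show how the three quantities relate:

```python
# Hypothetical per-token completion timestamps (seconds) for one request,
# measured relative to when the request was sent.
token_times = [0.25, 0.30, 0.36, 0.41, 0.47]

# Time to first token: latency until the first output token arrives.
ttft = token_times[0]

# Inter-token latency: average gap between consecutive tokens,
# which excludes the first token's latency by construction.
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
itl = sum(gaps) / len(gaps)

# Time per output token: total generation time divided by all tokens,
# which folds the first token's latency into the average.
tpot = token_times[-1] / len(token_times)

print(f"TTFT={ttft:.3f}s ITL={itl:.3f}s TPOT={tpot:.3f}s")
```

Note that the two averages differ: the inter-token figure describes steady-state streaming speed, while the per-token figure is pulled up by the initial response time.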
GuideLLM provides detailed statistical summaries for each of the above metrics using its `StatusDistributionSummary` and `DistributionSummary` models. These summaries include the following statistics:
- Mean: The average value of the metric.
- Median: The middle value of the metric when sorted.
- Mode: The most frequently occurring value of the metric.
- Variance: The measure of how much the values of the metric vary.
- Standard Deviation (Std Dev): The square root of the variance, indicating the spread of the values.
- Min: The minimum value of the metric.
- Max: The maximum value of the metric.
- Count: The total number of data points for the metric.
- Total Sum: The sum of all values for the metric.
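All of these summary statistics can be reproduced with Python's standard `statistics` module. The latency samples below are hypothetical, and the dictionary is only a sketch of the kind of summary described above, not GuideLLM's actual data model:

```python
import statistics

# Hypothetical request-latency samples (seconds) from a benchmark run.
latencies = [0.8, 1.1, 0.9, 1.4, 1.0, 0.9]

summary = {
    "mean": statistics.mean(latencies),
    "median": statistics.median(latencies),
    "mode": statistics.mode(latencies),         # most frequent value
    "variance": statistics.variance(latencies), # sample variance
    "std_dev": statistics.stdev(latencies),     # sqrt of the variance
    "min": min(latencies),
    "max": max(latencies),
    "count": len(latencies),
    "total_sum": sum(latencies),
}
print(summary)
```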
GuideLLM calculates a comprehensive set of percentiles for each metric, including:
- 0.1th Percentile (p001): The value below which 0.1% of the data falls.
- 1st Percentile (p01): The value below which 1% of the data falls.
- 5th Percentile (p05): The value below which 5% of the data falls.
- 10th Percentile (p10): The value below which 10% of the data falls.
- 25th Percentile (p25): The value below which 25% of the data falls.
- 75th Percentile (p75): The value below which 75% of the data falls.
- 90th Percentile (p90): The value below which 90% of the data falls.
- 95th Percentile (p95): The value below which 95% of the data falls.
- 99th Percentile (p99): The value below which 99% of the data falls.
- 99.9th Percentile (p999): The value below which 99.9% of the data falls.
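The same set of percentiles can be approximated from raw samples with `statistics.quantiles`; with `n=1000` it returns 999 cut points, so the value at index `k - 1` is the `k/1000` quantile. The data below is synthetic, purely to make the sketch runnable:

```python
import statistics

# Synthetic latency samples; real benchmark runs would supply these.
data = [0.5 + 0.01 * i for i in range(1000)]

# 999 cut points dividing the distribution into 1000 equal intervals.
cuts = statistics.quantiles(data, n=1000)

percentiles = {
    "p001": cuts[0],    # 0.1th percentile
    "p01": cuts[9],
    "p05": cuts[49],
    "p10": cuts[99],
    "p25": cuts[249],
    "p75": cuts[749],
    "p90": cuts[899],
    "p95": cuts[949],
    "p99": cuts[989],
    "p999": cuts[998],  # 99.9th percentile
}
for name, value in percentiles.items():
    print(f"{name}: {value:.4f}")
```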
These statistics can be interpreted as follows:

- Mean and Median: Indicate the central tendency of the metric values.
- Variance and Std Dev: Indicate the variability and consistency of the metric.
- Min and Max: Highlight the range of the metric values.
- Percentiles: Offer a detailed view of the distribution, helping identify outliers and performance at different levels of service.
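To see why percentiles matter alongside the mean, consider a hypothetical latency distribution where most requests are fast but a few are very slow:

```python
import statistics

# Hypothetical latencies: 97 fast requests plus three slow outliers.
latencies = [0.2] * 97 + [3.0, 4.0, 5.0]

mean = statistics.mean(latencies)
# quantiles(n=100) yields 99 cut points; index 98 is the 99th percentile.
p99 = statistics.quantiles(latencies, n=100)[98]

# The mean hides the tail; the 99th percentile exposes it.
print(f"mean={mean:.2f}s p99={p99:.2f}s")
```

Here the mean suggests sub-second latency, while the p99 reveals that the slowest 1% of users wait several seconds, which is exactly the kind of tail behavior percentile summaries are meant to surface.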
By combining these metrics and statistical summaries, GuideLLM enables users to gain a deep understanding of their LLM deployments, optimize performance, and ensure scalability and cost-effectiveness.