INFRASTRUCTURE CONTROL PLANE FEATURES
A Complete Control Plane for AI Resource Management at Scale
Build Smarter AI Workflows Faster
Accelerate model development and deployment while maintaining IT control over the AI lifecycle. IT departments can establish robust resource allocation policies with hierarchical structures and logic, enabling AI builders to launch remote sessions and use ClearML’s job scheduler for self-service task management on approved computing resources. Our hardware-agnostic architecture supports model training, fine-tuning, and deployment across a wide range of cluster types, including Kubernetes, Slurm, PBS, and bare metal, while ensuring compatibility with hardware platforms such as NVIDIA™, AMD™, Arm™, and Intel™.
Resource Allocation Policy Manager
Get full visibility on all your compute clusters and how they are used
ClearML’s Resource Allocation Policy Manager streamlines resource allocation, allowing administrators to prioritize resources effectively and monitor team usage in real time. By letting admins define quotas and guarantee compute capacity for each team, it maximizes cluster utilization and provides flexibility when critical resource needs arise. This ensures every team has access to the compute it needs while simplifying the management of dynamic demands.
- Full Visibility: The ClearML Resource Allocation Policy Manager provides a big-picture view of all available resources, displaying quota allocations across teams. With this visibility, admins can set priorities to ensure clusters are load balanced or favor newer, faster, or on-prem machines over others to lower operational costs.
- Maximize Utilization: Admins can set quotas and over-quotas for projects and teams, allowing them to access idle machines even when those machines are reserved for others. By preventing machines from sitting idle, organizations maximize the value of their investments.
- Add or Remove Clusters Without Downtime: Admins can easily add new clusters or take existing ones offline without impacting AI builders. With the ability to ensure teams and projects are always mapped to available resources, IT can conduct maintenance on infrastructure without affecting AI development.
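As a rough illustration of the priority logic described above, the sketch below shows how an allocation policy might prefer one cluster over another (for example, on-prem before cloud) while still respecting capacity. This is plain Python with hypothetical names, not ClearML’s actual API.

```python
# Hypothetical sketch: choosing a cluster by admin-defined priority.
# Names and fields are illustrative only, not ClearML's actual API.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    priority: int   # lower value = preferred (e.g., on-prem before cloud)
    free_gpus: int

def pick_cluster(clusters, gpus_needed):
    """Return the highest-priority cluster with enough free GPUs, or None."""
    candidates = [c for c in clusters if c.free_gpus >= gpus_needed]
    return min(candidates, key=lambda c: c.priority, default=None)

clusters = [
    Cluster("on-prem-a", priority=0, free_gpus=2),
    Cluster("cloud-b", priority=1, free_gpus=8),
]
print(pick_cluster(clusters, gpus_needed=4).name)  # cloud-b: on-prem lacks capacity
```

Because the preferred cluster is only skipped when it lacks capacity, jobs drift back to the cheaper resource as soon as it frees up.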
Self-serve Compute
Empower AI builders and drive faster development
- Efficiency and Scalability: Self-serve compute eliminates bottlenecks by giving AI builders direct access to resources, while orchestration dynamically allocates pre-approved resources for tasks.
- Streamlined Workflows: Integrated orchestration with established resource policies and stored credentials eliminates the need for manual provisioning, enabling faster prototyping and deployment of AI models.
- Reduced Overhead: Eliminating the need for manual provisioning reduces overhead of IT teams and admins, allowing them to focus on higher value-add tasks.
Job Scheduling
Optimize and streamline processing AI workloads
- Prioritized Job Management: Construct and oversee prioritized task queues for effective resource allocation. Enterprise clients can enhance GPU utilization with sophisticated features like fractional GPUs and team-specific resource limits.
- Resource Tracking and Workload Balancing: Observe GPU and CPU consumption by user group or queue and effortlessly redistribute tasks across machines or scale up resources to meet processing demands.
- Over-quota Utilization: Capitalize on idle resources by employing over-quota functionalities, ensuring full resource utilization, even during high-demand periods.
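The prioritized queue behavior above can be sketched with Python’s standard heap utilities. This is an illustrative simulation, not ClearML’s scheduler; job names and priority values are made up.

```python
# Hypothetical sketch: a prioritized job queue in the spirit of the
# scheduler described above. Plain Python, not ClearML's actual API.
import heapq
import itertools

class JobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a priority

    def enqueue(self, job, priority=0):
        # Lower number = higher priority; counter keeps insertion order stable.
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = JobQueue()
q.enqueue("nightly-retrain", priority=5)
q.enqueue("prod-hotfix", priority=0)
q.enqueue("experiment-42", priority=5)
print(q.next_job())  # prod-hotfix
```

Jobs of equal priority are served in submission order, so urgent work jumps the line without starving routine tasks of their turn.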
HPC Compatibility
See what’s happening inside your Slurm and PBS clusters
- Full Visibility into Your HPC Cluster: ClearML creates transparency into your Slurm and PBS clusters, allowing you to monitor all jobs on those clusters, understand the resources utilized per job, and access the aggregated results.
- Out-of-the-box Automation: Write custom code on ClearML or use our built-in pipelines and task scheduling capabilities to create custom automations without resorting to Bash scripts.
- Heterogeneous Clusters: With ClearML, creating and maintaining heterogeneous clusters is simplified. ClearML makes it possible to extend your HPC compute into the cloud with cloud spillover and autoscalers for AWS, GCP, or Azure.
Interactive Development Environments (IDEs)
Work from anywhere by launching an IDE with a single click
- Flexible Development: Launch JupyterLab, VS Code, or SSH sessions on any remote machine with a single click, regardless of where you are.
- Workspace Syncing: Automatically store and restore interactive session workspaces for continuity across sessions.
- Secure Collaboration: Access remote resources securely via encrypted connections, ensuring seamless team collaboration and resource sharing.
Control and Optimize Your Entire AI Compute Infrastructure
Monitor and manage all of your CPU and GPU clusters effortlessly with a complete view of your infrastructure’s status and performance. Enable secure shared compute and dynamic fractional GPUs to run multiple jobs per GPU with precision control to boost utilization. Increase accessibility by pooling resources, applying quotas with over-quota capabilities, and allocating on-premises and cloud resources effectively to control cloud spending.
Secure Multi-tenancy
Increase cluster utilization through secure shared computing
- Efficient Resource Utilization: Multi-tenancy dynamically re-allocates unused resources, achieving maximum hardware utilization.
- Scalability: This feature allows seamless scaling of infrastructure by adding clusters or cloud providers without disrupting existing workloads.
- Enhanced Security: Isolated networks provide data privacy and prevent cross-tenant interference, safeguarding sensitive projects.
Dynamic Fractional GPUs
Do more with the GPUs you already have
- Efficient Resource Utilization: Dynamic fractional GPUs allow multiple workloads to share a single GPU, ensuring that GPU resources are fully utilized without being idle or wasted.
- Cost Savings: By optimizing GPU usage and running multiple tasks on the same hardware, organizations can reduce the number of GPUs needed to sustain AI development, cutting costs or delaying additional GPU purchases.
- Flexibility and Scalability: Dynamic GPU allocation seamlessly adapts to changing workload demands, supporting AI development and deployment with improved scalability.
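To make the sharing idea concrete, here is a minimal first-fit packing sketch showing how several fractional workloads can share a small number of GPUs. It is an illustration only; the fractions, names, and algorithm are hypothetical, not ClearML’s actual allocator.

```python
# Hypothetical sketch: first-fit packing of fractional workloads onto GPUs,
# illustrating how GPU sharing raises utilization. Not ClearML's allocator.
def pack_jobs(job_fractions, num_gpus):
    """Assign each job (a GPU fraction, 0 < f <= 1) to the first GPU with room.
    Returns, per GPU, the list of fractions placed on it."""
    gpus = [[] for _ in range(num_gpus)]
    for frac in job_fractions:
        for gpu in gpus:
            if sum(gpu) + frac <= 1.0:
                gpu.append(frac)
                break
        else:
            raise RuntimeError(f"no GPU can fit a {frac} fraction")
    return gpus

# Six fractional jobs fit on two GPUs instead of requiring six whole ones.
layout = pack_jobs([0.5, 0.25, 0.5, 0.25, 0.25, 0.25], num_gpus=2)
print(layout)  # [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]
```

In this toy example, six jobs that would otherwise each reserve a full GPU are packed onto two, which is the utilization gain fractional GPUs aim for.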
Multi-cloud / Multi-cluster
Freedom to build your infrastructure in the most effective or cost-efficient way
- Flexibility: ClearML’s cloud-agnostic design supports multi-cloud deployments (AWS, Azure, GCP) and integrates with Kubernetes, Slurm/PBS, and bare-metal clusters for diverse infrastructure compatibility.
- Centralized Management: Offers a centralized control plane to monitor, allocate, and optimize resources across multiple clusters with detailed dashboards and granular controls.
- Scalability: Dynamically scales workloads across heterogeneous clusters with features like fractional GPUs, hierarchical quotas/over-quotas, and hybrid cloud spillover to maximize resource utilization.
Cloud Spillover
Maximize on-site infrastructure usage to reduce unnecessary cloud expenditure
- Policy-driven Resource Allocation: Set up access rules with RBAC to determine when cloud resources should be employed, such as during busy times when on-prem infrastructure is at maximum capacity.
- Effortless Cloud Integration: Automatically spin up cloud instances to handle additional workloads without disrupting tasks or users, thanks to ClearML’s smooth scheduling and orchestration.
- Cloud Cost Control: Optimize cloud usage to manage expenses. Establish guidelines and restrictions for cloud resource access and allocate cloud budgets to individual projects or teams for precise utilization monitoring.
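A spillover rule like the one described above boils down to a simple placement decision: stay on-prem while capacity remains, and only then (if policy allows) burst to cloud. The sketch below is illustrative plain Python with hypothetical names, not ClearML’s policy engine.

```python
# Hypothetical sketch of a spillover rule: use on-prem until saturated,
# then allow cloud if policy permits. Illustrative only.
def placement(on_prem_used, on_prem_capacity, cloud_allowed=True):
    """Return 'on-prem' while capacity remains, else 'cloud' (if allowed)."""
    if on_prem_used < on_prem_capacity:
        return "on-prem"
    if cloud_allowed:
        return "cloud"
    return "queued"  # wait for on-prem capacity to free up

print(placement(6, 8))                        # on-prem
print(placement(8, 8))                        # cloud
print(placement(8, 8, cloud_allowed=False))   # queued
```

The `cloud_allowed` flag stands in for the RBAC-style access rules mentioned above: teams without cloud entitlement simply queue until on-prem capacity frees up.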
Autoscalers
Automatically launch and spin down cloud instances as needed to manage cloud costs
- Hassle-free Cloud Expansion: Integrate compute resources from AWS, GCP, or Azure into your existing infrastructure with minimal effort, and set up policies with logic that determines when they are available to projects and teams.
- Eliminate Idle Instances: Automate the scaling up and down of cloud machines to maximize efficiency, eliminating unnecessary expenses on idle resources.
- Flexible Hybrid Models: Dynamically blend on-premises and cloud resources (Kubernetes, Slurm, bare metal) to create robust hybrid computing solutions, delivering scalability without being tied to a single vendor.
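The idle spin-down behavior above can be pictured as a periodic sweep that flags instances whose idle time has exceeded a configured window. This is a hypothetical sketch with made-up instance IDs and thresholds, not ClearML’s autoscaler implementation.

```python
# Hypothetical sketch: terminating cloud instances after a pre-set idle
# period, as an autoscaler's scale-down step might. Illustrative only.
def instances_to_terminate(idle_seconds_by_instance, max_idle_seconds=900):
    """Return the instance IDs whose idle time exceeds the allowed window."""
    return sorted(
        iid for iid, idle in idle_seconds_by_instance.items()
        if idle > max_idle_seconds
    )

idle = {"i-aaa": 1200, "i-bbb": 30, "i-ccc": 3600}
print(instances_to_terminate(idle))  # ['i-aaa', 'i-ccc']
```

Running a check like this on a schedule is what turns "eliminate idle instances" from a policy statement into a concrete cost control.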
Maximize AI Performance and Minimize Costs
Resource Dashboard
A single pane of glass for monitoring your infrastructure
- Real-time Visibility: Admins gain immediate insights into cluster performance (available vs. utilized resources) and team quota usage for better decision-making on resource allocation.
- Performance: The dashboard provides views into the queues for each resource as well as resource utilization over time, event history logs, and detailed performance metrics.
- Effective Resource Management: Organizations can monitor utilization trends, enforce compute usage policies, and correlate usage with budgets for optimized resource distribution.
Usage Reporting
Get real-time reporting on user activity
- User Activity Tracking: ClearML’s detailed reporting helps track usage to accurately reflect real-time consumption of computing hours, data storage, API calls, and other chargeable metrics.
- Improved Resource Allocation: Better forecast compute demand with granular insights and trends. Enable better workload distribution, ensuring GPUs are neither idle nor overloaded.
- Transparency and Accountability: Organizations can use usage reports for internal billing and to monitor tenant-specific resource consumption in shared environments.
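The internal billing use case above amounts to rolling raw usage records up into per-team totals. The sketch below shows one way to do that for GPU-hours; the record fields are hypothetical, not ClearML’s report schema.

```python
# Hypothetical sketch: aggregating GPU-hours per team for chargeback
# reporting. Record fields are illustrative, not ClearML's report schema.
from collections import defaultdict

def gpu_hours_by_team(records):
    """Sum (gpus * hours) for each team from raw usage records."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["team"]] += rec["gpus"] * rec["hours"]
    return dict(totals)

records = [
    {"team": "nlp", "gpus": 4, "hours": 2.5},
    {"team": "vision", "gpus": 8, "hours": 1.0},
    {"team": "nlp", "gpus": 2, "hours": 3.0},
]
print(gpu_hours_by_team(records))  # {'nlp': 16.0, 'vision': 8.0}
```

The same aggregation pattern extends to storage, API calls, or any other chargeable metric by swapping the fields being summed.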
Quota Management
Control compute consumption while ensuring high utilization
- Fair Resource Allocation: Quotas ensure that no single team or project monopolizes cluster resources, promoting equitable access for all users.
- Improved Utilization: By redistributing idle resources, quota management maximizes the efficiency of cluster operations and reduces waste.
- Scalability and Flexibility: Quotas enable dynamic adjustments to resource limits, allowing clusters to scale seamlessly as workloads evolve.
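Putting the three bullets together: each team is guaranteed up to its quota, and idle capacity is lent to teams that want more. The sketch below is a simplified plain-Python illustration of that idea, not ClearML’s actual policy engine; team names and numbers are made up.

```python
# Hypothetical sketch: quota plus over-quota sharing. Each team is
# guaranteed min(demand, quota); leftover units are lent round-robin
# to teams that still want more. Not ClearML's actual policy engine.
def allocate(quotas, demands, total):
    """Allocate `total` GPUs across teams given quotas and demands."""
    alloc = {t: min(demands[t], quotas[t]) for t in quotas}
    idle = total - sum(alloc.values())
    while idle > 0:
        hungry = [t for t in quotas if demands[t] > alloc[t]]
        if not hungry:
            break
        for t in hungry:
            if idle == 0:
                break
            alloc[t] += 1
            idle -= 1
    return alloc

quotas = {"nlp": 4, "vision": 4}
print(allocate(quotas, {"nlp": 8, "vision": 1}, total=8))  # {'nlp': 7, 'vision': 1}
```

Here the nlp team exceeds its quota of 4 only because the vision team’s reserved capacity is sitting idle; in a real system the lender could reclaim those units on demand.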
Cloud Resource Budget Control
Maximize your cloud budget with efficient cloud usage
- Operational Expense Predictability: Set cloud limits to ensure compute consumption does not exceed allocated budgets, avoiding unexpected costs.
- Reduce Spending on Idle Resources: Automatically spin down cloud instances after a pre-set idle period to eliminate unnecessary spending on idle instances.
- Prioritize On-prem Resources: Create resource allocation policies and rules to prioritize on-prem resources, and spillover to cloud only once on-prem infrastructure is at full utilization.
Optimize Your Infrastructure for AI Workloads
ClearML’s AI Infrastructure Control Plane powers AI orchestration that streamlines user access to resources, compute optimization that ensures full utilization of compute clusters, and cost reduction that maximizes the ROI on your infrastructure.