INFRASTRUCTURE CONTROL PLANE FEATURES
A Complete Control Plane for AI Resource Management at Scale
Build Smarter AI Workflows Faster
Accelerate model development and deployment while maintaining IT control over the AI lifecycle. IT departments can establish robust resource allocation policies with hierarchical structures and logic, enabling AI builders to launch remote sessions and use ClearML’s job scheduler for self-service task management on approved computing resources. Our hardware-agnostic architecture supports model training, fine-tuning, and deployment across a wide range of cluster types, including Kubernetes, Slurm, PBS, and bare metal, while ensuring compatibility with hardware platforms such as NVIDIA™, AMD™, Arm™, and Intel™.
Resource Allocation Policy Manager
Get full visibility on all your compute clusters and how they are used
ClearML’s Resource Allocation Policy Manager streamlines resource allocation, allowing administrators to prioritize resources effectively and monitor team usage in real time. By letting admins define quotas and guarantee compute capacity for each team, it maximizes cluster utilization and provides flexibility when critical resource needs arise. This ensures every team has access to the compute it needs while simplifying the management of dynamic demands.
- Full Visibility: The ClearML Resource Allocation Policy Manager provides a big-picture view of all available resources, displaying quota allocations across teams. With this visibility, admins can set priorities to ensure clusters are load balanced or favor newer, faster, or on-prem machines over others to lower operational costs.
- Maximize Utilization: Admins can set quotas and over-quotas for projects and teams, allowing them to access idle machines even when those machines are reserved for others. By preventing machines from sitting idle, organizations maximize the value of their investments.
- Add or Remove Clusters Without Downtime: Admins can easily add new clusters or take existing ones offline without impacting AI builders. With the ability to ensure teams and projects are always mapped to available resources, IT can conduct maintenance on infrastructure without affecting AI development.
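As a rough illustration of the priority logic described above, the sketch below shows how an allocation policy might prefer one cluster over another (for example, on-prem before cloud) while still respecting capacity. This is plain Python with hypothetical names, not ClearML’s actual API.

```python
# Hypothetical sketch: choosing a cluster by admin-defined priority.
# Names and fields are illustrative only, not ClearML's actual API.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    priority: int   # lower value = preferred (e.g., on-prem before cloud)
    free_gpus: int

def pick_cluster(clusters, gpus_needed):
    """Return the highest-priority cluster with enough free GPUs, or None."""
    candidates = [c for c in clusters if c.free_gpus >= gpus_needed]
    return min(candidates, key=lambda c: c.priority, default=None)

clusters = [
    Cluster("on-prem-a", priority=0, free_gpus=2),
    Cluster("cloud-b", priority=1, free_gpus=8),
]
print(pick_cluster(clusters, gpus_needed=4).name)  # cloud-b: on-prem lacks capacity
```

Because the preferred cluster is only skipped when it lacks capacity, jobs drift back to the cheaper resource as soon as it frees up.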
Self-serve Compute
Empower AI builders and drive faster development
- Efficiency and Scalability: Self-serve compute eliminates bottlenecks by giving AI builders direct access to resources, while orchestration dynamically allocates pre-approved resources for tasks.
- Streamlined Workflows: Integrated orchestration with established resource policies and stored credentials eliminates the need for manual provisioning, enabling faster prototyping and deployment of AI models.
- Reduced Overhead: Eliminating the need for manual provisioning reduces overhead of IT teams and admins, allowing them to focus on higher value-add tasks.
Job Scheduling
Optimize and streamline processing AI workloads
- Prioritized Job Management: Construct and oversee prioritized task queues for effective resource allocation. Enterprise clients can enhance GPU utilization with sophisticated features like fractional GPUs and team-specific resource limits.
- Resource Tracking and Workload Balancing: Observe GPU and CPU consumption by user group or queue and effortlessly redistribute tasks across machines or scale up resources to meet processing demands.
- Over-quota Utilization: Capitalize on idle resources by employing over-quota functionalities, ensuring full resource utilization, even during high-demand periods.
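The prioritized queue behavior above can be sketched with Python’s standard heap utilities. This is an illustrative simulation, not ClearML’s scheduler; job names and priority values are made up.

```python
# Hypothetical sketch: a prioritized job queue in the spirit of the
# scheduler described above. Plain Python, not ClearML's actual API.
import heapq
import itertools

class JobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a priority

    def enqueue(self, job, priority=0):
        # Lower number = higher priority; counter keeps insertion order stable.
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = JobQueue()
q.enqueue("nightly-retrain", priority=5)
q.enqueue("prod-hotfix", priority=0)
q.enqueue("experiment-42", priority=5)
print(q.next_job())  # prod-hotfix
```

Jobs of equal priority are served in submission order, so urgent work jumps the line without starving routine tasks of their turn.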
HPC Compatibility
See what’s happening inside your Slurm and PBS clusters
- Full Visibility into Your HPC Cluster: ClearML creates transparency into your Slurm and PBS clusters, allowing you to monitor all jobs on those clusters, understand the resources utilized per job, and access the aggregated results.
- Out-of-the-box Automation: Write custom code on ClearML or use our built-in pipelines and task scheduling capabilities to create custom automations without resorting to Bash scripts.
- Heterogeneous Clusters: With ClearML, creating and maintaining heterogeneous clusters is simplified. ClearML makes it possible to extend your HPC compute into the cloud with cloud spillover and autoscalers for AWS, GCP, or Azure.
Interactive Development Environments (IDEs)
Work from anywhere by launching an IDE with a single click
- Flexible Development: Launch JupyterLab, VS Code, or SSH sessions on any remote machine with a single click, regardless of where you are.
- Workspace Syncing: Automatically store and restore interactive session workspaces for continuity across sessions.
- Secure Collaboration: Access remote resources securely via encrypted connections, ensuring seamless team collaboration and resource sharing.
Control and Optimize Your Entire AI Compute Infrastructure
Monitor and manage all of your CPU and GPU clusters effortlessly with a complete view of your infrastructure’s status and performance. Enable secure shared compute and dynamic fractional GPUs to run multiple jobs per GPU with precision control to boost utilization. Increase accessibility by pooling resources, applying quotas with over-quota capabilities, and allocating on-premises and cloud resources effectively to control cloud spending.
Secure Multi-tenancy
Increase cluster utilization through secure shared computing
- Efficient Resource Utilization: Multi-tenancy dynamically re-allocates unused resources, achieving maximum hardware utilization.
- Scalability: This feature allows seamless scaling of infrastructure by adding clusters or cloud providers without disrupting existing workloads.
- Enhanced Security: Isolated networks provide data privacy and prevent cross-tenant interference, safeguarding sensitive projects.
Dynamic Fractional GPUs
Do more with the GPUs you already have
- Efficient Resource Utilization: Dynamic fractional GPUs allow multiple workloads to share a single GPU, ensuring that GPU resources are fully utilized without being idle or wasted.
- Cost Savings: By optimizing GPU usage and running multiple tasks on the same hardware, organizations can reduce the number of GPUs needed to sustain AI development, cutting costs or delaying additional GPU purchases.
- Flexibility and Scalability: Dynamic GPU allocation seamlessly adapts to changing workload demands, supporting AI development and deployment with improved scalability.
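To make the sharing idea concrete, here is a minimal first-fit packing sketch showing how several fractional workloads can share a small number of GPUs. It is an illustration only; the fractions, names, and algorithm are hypothetical, not ClearML’s actual allocator.

```python
# Hypothetical sketch: first-fit packing of fractional workloads onto GPUs,
# illustrating how GPU sharing raises utilization. Not ClearML's allocator.
def pack_jobs(job_fractions, num_gpus):
    """Assign each job (a GPU fraction, 0 < f <= 1) to the first GPU with room.
    Returns, per GPU, the list of fractions placed on it."""
    gpus = [[] for _ in range(num_gpus)]
    for frac in job_fractions:
        for gpu in gpus:
            if sum(gpu) + frac <= 1.0:
                gpu.append(frac)
                break
        else:
            raise RuntimeError(f"no GPU can fit a {frac} fraction")
    return gpus

# Six fractional jobs fit on two GPUs instead of requiring six whole ones.
layout = pack_jobs([0.5, 0.25, 0.5, 0.25, 0.25, 0.25], num_gpus=2)
print(layout)  # [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]
```

In this toy example, six jobs that would otherwise each reserve a full GPU are packed onto two, which is the utilization gain fractional GPUs aim for.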
Multi-cloud / Multi-cluster
Freedom to build your infrastructure in the most effective or cost-efficient way
- Flexibility: ClearML’s cloud-agnostic design supports multi-cloud deployments (AWS, Azure, GCP) and integrates with Kubernetes, Slurm/PBS, and bare-metal clusters for diverse infrastructure compatibility.
- Centralized Management: Offers a centralized control plane to monitor, allocate, and optimize resources across multiple clusters with detailed dashboards and granular controls.
- Scalability: Dynamically scales workloads across heterogeneous clusters with features like fractional GPUs, hierarchical quotas/over-quotas, and hybrid cloud spillover to maximize resource utilization.
Cloud Spillover
Maximize on-site infrastructure usage to reduce unnecessary cloud expenditure
- Policy-driven Resource Allocation: Set up access rules with RBAC to determine when cloud resources should be employed, such as during busy times when on-prem infrastructure is at maximum capacity.
- Effortless Cloud Integration: Automatically spin up cloud instances to handle additional workloads without disrupting tasks or users, thanks to ClearML’s smooth scheduling and orchestration.
- Cloud Cost Control: Optimize cloud usage to manage expenses. Establish guidelines and restrictions for cloud resource access and allocate cloud budgets to individual projects or teams for precise utilization monitoring.
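A spillover rule like the one described above boils down to a simple placement decision: stay on-prem while capacity remains, and only then (if policy allows) burst to cloud. The sketch below is illustrative plain Python with hypothetical names, not ClearML’s policy engine.

```python
# Hypothetical sketch of a spillover rule: use on-prem until saturated,
# then allow cloud if policy permits. Illustrative only.
def placement(on_prem_used, on_prem_capacity, cloud_allowed=True):
    """Return 'on-prem' while capacity remains, else 'cloud' (if allowed)."""
    if on_prem_used < on_prem_capacity:
        return "on-prem"
    if cloud_allowed:
        return "cloud"
    return "queued"  # wait for on-prem capacity to free up

print(placement(6, 8))                        # on-prem
print(placement(8, 8))                        # cloud
print(placement(8, 8, cloud_allowed=False))   # queued
```

The `cloud_allowed` flag stands in for the RBAC-style access rules mentioned above: teams without cloud entitlement simply queue until on-prem capacity frees up.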
Autoscalers
Automatically launch and spin down cloud instances as needed to manage cloud costs
- Hassle-free Cloud Expansion: Integrate compute resources from AWS, GCP, or Azure into your existing infrastructure with minimal effort, and set up policies with logic that determines when they are available to projects and teams.
- Eliminate Idle Instances: Automate the scaling up and down of cloud machines to maximize efficiency, eliminating unnecessary expenses on idle resources.
- Flexible Hybrid Models: Dynamically blend on-premises and cloud resources (Kubernetes, Slurm, bare metal) to create robust hybrid computing solutions, delivering scalability without being tied to a single vendor.
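The idle spin-down behavior above can be pictured as a periodic sweep that flags instances whose idle time has exceeded a configured window. This is a hypothetical sketch with made-up instance IDs and thresholds, not ClearML’s autoscaler implementation.

```python
# Hypothetical sketch: terminating cloud instances after a pre-set idle
# period, as an autoscaler's scale-down step might. Illustrative only.
def instances_to_terminate(idle_seconds_by_instance, max_idle_seconds=900):
    """Return the instance IDs whose idle time exceeds the allowed window."""
    return sorted(
        iid for iid, idle in idle_seconds_by_instance.items()
        if idle > max_idle_seconds
    )

idle = {"i-aaa": 1200, "i-bbb": 30, "i-ccc": 3600}
print(instances_to_terminate(idle))  # ['i-aaa', 'i-ccc']
```

Running a check like this on a schedule is what turns "eliminate idle instances" from a policy statement into a concrete cost control.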
Maximize AI Performance and Minimize Costs
Resource Dashboard
A single pane of glass for monitoring your infrastructure
- Real-time Visibility: Admins gain immediate insights into cluster performance (available vs. utilized resources) and team quota usage for better decision-making on resource allocation.
- Performance: The dashboard provides views into the queues for each resource as well as resource utilization over time, event history logs, and detailed performance metrics.
- Effective Resource Management: Organizations can monitor utilization trends, enforce compute usage policies, and correlate usage with budgets for optimized resource distribution.
Usage Reporting
Get real-time reporting on user activity
- User Activity Tracking: ClearML’s detailed reporting helps track usage to accurately reflect real-time consumption of computing hours, data storage, API calls, and other chargeable metrics.
- Improved Resource Allocation: Better forecast compute demand with granular insights and trends. Enable better workload distribution, ensuring GPUs are neither idle nor overloaded.
- Transparency and Accountability: Organizations can use usage reports for internal billing and to monitor tenant-specific resource consumption in shared environments.
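The internal billing use case above amounts to rolling raw usage records up into per-team totals. The sketch below shows one way to do that for GPU-hours; the record fields are hypothetical, not ClearML’s report schema.

```python
# Hypothetical sketch: aggregating GPU-hours per team for chargeback
# reporting. Record fields are illustrative, not ClearML's report schema.
from collections import defaultdict

def gpu_hours_by_team(records):
    """Sum (gpus * hours) for each team from raw usage records."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["team"]] += rec["gpus"] * rec["hours"]
    return dict(totals)

records = [
    {"team": "nlp", "gpus": 4, "hours": 2.5},
    {"team": "vision", "gpus": 8, "hours": 1.0},
    {"team": "nlp", "gpus": 2, "hours": 3.0},
]
print(gpu_hours_by_team(records))  # {'nlp': 16.0, 'vision': 8.0}
```

The same aggregation pattern extends to storage, API calls, or any other chargeable metric by swapping the fields being summed.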
Quota Management
Control compute consumption while ensuring high utilization
- Fair Resource Allocation: Quotas ensure that no single team or project monopolizes cluster resources, promoting equitable access for all users.
- Improved Utilization: By redistributing idle resources, quota management maximizes the efficiency of cluster operations and reduces waste.
- Scalability and Flexibility: Quotas enable dynamic adjustments to resource limits, allowing clusters to scale seamlessly as workloads evolve.
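Putting the three bullets together: each team is guaranteed up to its quota, and idle capacity is lent to teams that want more. The sketch below is a simplified plain-Python illustration of that idea, not ClearML’s actual policy engine; team names and numbers are made up.

```python
# Hypothetical sketch: quota plus over-quota sharing. Each team is
# guaranteed min(demand, quota); leftover units are lent round-robin
# to teams that still want more. Not ClearML's actual policy engine.
def allocate(quotas, demands, total):
    """Allocate `total` GPUs across teams given quotas and demands."""
    alloc = {t: min(demands[t], quotas[t]) for t in quotas}
    idle = total - sum(alloc.values())
    while idle > 0:
        hungry = [t for t in quotas if demands[t] > alloc[t]]
        if not hungry:
            break
        for t in hungry:
            if idle == 0:
                break
            alloc[t] += 1
            idle -= 1
    return alloc

quotas = {"nlp": 4, "vision": 4}
print(allocate(quotas, {"nlp": 8, "vision": 1}, total=8))  # {'nlp': 7, 'vision': 1}
```

Here the nlp team exceeds its quota of 4 only because the vision team’s reserved capacity is sitting idle; in a real system the lender could reclaim those units on demand.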
Cloud Resource Budget Control
Maximize your cloud budget with efficient cloud usage
- Operational Expense Predictability: Set cloud limits to ensure compute consumption does not exceed allocated budgets, avoiding unexpected costs.
- Reduce Spending on Idle Resources: Automatically spin down cloud instances after a pre-set idle period to eliminate unnecessary spending on idle instances.
- Prioritize On-prem Resources: Create resource allocation policies and rules to prioritize on-prem resources, and spillover to cloud only once on-prem infrastructure is at full utilization.
Optimize Your Infrastructure for AI Workloads
ClearML’s AI Infrastructure Control Plane powers AI orchestration that streamlines user access to resources, compute optimization that ensures full utilization of compute clusters, and cost reduction that maximizes the ROI on your infrastructure.