Private Inference
When you use traditional AI services, your data passes through systems controlled by cloud providers and AI companies. Your prompts, the AI's responses, and even the processing of your requests are all visible to these third parties. This creates serious security concerns for sensitive applications.
Private inference solves this problem. It ensures that AI computations happen in a completely isolated environment where no one—not the cloud provider, not the model provider, not even NEAR AI—can access your data. At the same time, you can independently verify that your requests were actually processed in this secure environment through cryptographic attestation.
This guide explains how NEAR AI Cloud implements private inference using Trusted Execution Environments (TEEs), the architecture that protects your data, and the security guarantees you can rely on.
What is Private Inference?
Private inference is a method of running AI models where both your input data and the model's outputs remain hidden from everyone except you, even while the computation happens on remote servers you don't control.
Traditional cloud AI services require you to trust that providers won't access your data. Private inference eliminates this need for trust by using hardware-based security that makes it technically impossible for anyone to see your data, even with physical access to the servers.
NEAR AI Cloud's private inference provides three core guarantees:
Complete Privacy
Your prompts, model weights, and outputs are encrypted and isolated in hardware-secured environments. Infrastructure providers, model providers, and NEAR cannot access your data at any point in the process.
Cryptographic Verification
Every computation generates cryptographic proof that it occurred inside a genuine, secure TEE. You can independently verify that your AI requests were processed in a protected environment without trusting any third party.
Production Performance
Hardware-accelerated TEEs with NVIDIA Confidential Computing deliver high-throughput inference with minimal latency overhead, making private inference practical for real-world applications.
How Private Inference Works
Trusted Execution Environment (TEE)
NEAR AI Cloud combines Intel TDX and NVIDIA TEE technologies to create isolated, secure environments for AI computation:
- Intel TDX (Trust Domain Extensions): Creates confidential virtual machines (CVMs) that isolate your AI workloads from the host system, preventing unauthorized access to data in memory.
- NVIDIA TEE: Provides GPU-level isolation for model inference, ensuring model weights and computations remain completely private during processing.
- Cryptographic Attestation: Each TEE environment generates cryptographic proofs of its integrity and configuration, enabling independent verification of the secure execution environment.
Client-Side Encryption
If you're using a standard OpenAI SDK or curl, your prompts are automatically protected by TLS encryption—no additional setup required.
NEAR AI Cloud supports two connection modes — both provide TLS termination inside a TEE:
Gateway mode routes through cloud-api.near.ai, which runs in its own TEE before forwarding to the model TEE.
Direct completions mode connects you straight to the model's TEE — one hop, one TEE to verify.
Key insight: In both modes, TLS termination happens inside the TEE, not at an external load balancer. Your prompts remain encrypted until they reach the secure enclave.
Here's why this works:
- Standard HTTPS = TLS encryption: When you make API calls using the OpenAI SDK or curl, you're connecting via HTTPS. This means TLS encryption is applied automatically by your client before any data leaves your machine.
- TLS terminates inside the TEE: Unlike traditional cloud services where TLS terminates at an external load balancer, NEAR AI Cloud terminates TLS connections inside the Trusted Execution Environment. Your encrypted data travels from your laptop directly into the secure enclave before being decrypted.
- No plaintext exposure: Because TLS termination happens within the TEE, your prompts are never exposed in plaintext outside the hardware-secured environment—not to network infrastructure, not to cloud providers, not to anyone.
This means you get the same seamless developer experience as any OpenAI-compatible API, while your data remains encrypted from your machine all the way into the secure TEE.
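As a concrete sketch of what this looks like from the client side, the snippet below builds a chat-completions request with Python's standard library. The model ID "deepseek-v31" is illustrative — query GET /v1/models for the real list — and the request is only constructed, not sent, so you can inspect exactly what travels over TLS:

```python
import json
import urllib.request

GATEWAY = "https://cloud-api.near.ai/v1"

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request.

    TLS is applied automatically when the request is sent over HTTPS, and on
    NEAR AI Cloud it terminates inside the TEE rather than at a load balancer.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{GATEWAY}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("deepseek-v31", "Hello from a private client", "YOUR_API_KEY")
# urllib.request.urlopen(req)  # sends the encrypted request; the plaintext
#                              # exists only in your process and inside the TEE
```

The same request works unchanged with the OpenAI SDK by setting its base_url to the gateway; nothing in the client changes to get TEE-terminated TLS.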
Direct Completions
With direct completions, each model is available at its own subdomain under completions.near.ai. Your request goes straight to the model's TEE — no gateway hop needed.
Base URL format:
https://{slug}.completions.near.ai/v1
Available endpoints (see completions.near.ai/endpoints for the full list):
| Subdomain | Model | Input | Output |
|---|---|---|---|
| qwen35-122b.completions.near.ai | Qwen3.5-122B | Text | Text |
| deepseek-v31.completions.near.ai | DeepSeek-V3.1 | Text | Text |
| qwen3-30b.completions.near.ai | Qwen3-30B | Text | Text |
| gpt-oss-120b.completions.near.ai | GPT-OSS-120B | Text | Text |
| glm-5.completions.near.ai | GLM-5 | Text | Text |
| qwen3-vl-30b.completions.near.ai | Qwen3-VL-30B | Text, Image | Text |
| flux2-klein.completions.near.ai | FLUX.2-klein-4B | Text | Image |
| whisper-large-v3.completions.near.ai | Whisper Large V3 | Audio | Text |
| qwen3-embedding.completions.near.ai | Qwen3-Embedding-0.6B | Text | Embedding |
| qwen3-reranker.completions.near.ai | Qwen3-Reranker-0.6B | Text | Score |
Each subdomain exposes the same OpenAI-compatible API as the gateway, plus model-specific endpoints:
- POST /v1/chat/completions — Chat completions (with signing)
- POST /v1/completions — Text completions
- POST /v1/embeddings — Embeddings
- POST /v1/images/generations — Image generation
- POST /v1/audio/transcriptions — Audio transcription
- POST /v1/rerank — Reranking
- GET /v1/models — Available models
- GET /v1/attestation/report — TEE attestation (TDX + GPU)
- GET /v1/signature/{chat_id} — Retrieve cached signatures
Benefits of direct completions:
- Fewer hops — Your request reaches the model TEE directly, reducing latency
- Simpler trust model — Only the model TEE needs to be verified (no gateway verification required)
- TLS binds to attestation — The include_tls_fingerprint=true parameter on the attestation endpoint binds the TLS certificate to the attestation report
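One way to check that binding, as a sketch: fingerprint the certificate your own TLS stack observed and compare it to the fingerprint in the attestation report. The report's exact field name and fingerprint formatting are assumptions here; adjust the normalization to whatever the report actually returns.

```python
import hashlib
import socket
import ssl

def cert_sha256(der_cert: bytes) -> str:
    """Hex SHA-256 fingerprint of a DER-encoded TLS certificate."""
    return hashlib.sha256(der_cert).hexdigest()

def fingerprints_match(reported: str, observed: str) -> bool:
    """Compare fingerprints ignoring case and ':' separators
    (the report's exact formatting is an assumption)."""
    norm = lambda s: s.replace(":", "").lower()
    return norm(reported) == norm(observed)

def observed_fingerprint(host: str, port: int = 443) -> str:
    """Connect to `host`, grab the live certificate, and fingerprint it."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return cert_sha256(tls.getpeercert(binary_form=True))

# Sketch of the full check (requires network access):
# fp = observed_fingerprint("deepseek-v31.completions.near.ai")
# report = ...  # GET /v1/attestation/report?include_tls_fingerprint=true
# assert fingerprints_match(report["tls_fingerprint"], fp)  # field name is an assumption
```

If the fingerprints match, the certificate that encrypted your session is the one the attested TEE holds, so no intermediary can have terminated your TLS connection.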
The Inference Process
When you make a request to NEAR AI Cloud, your data flows through a secure pipeline designed to maintain privacy at every step. The exact flow depends on whether you use the gateway or a direct completions endpoint.
Via Gateway (cloud-api.near.ai):
1. Request Initiation: You send chat completion requests via HTTPS to the LLM Gateway. TLS encryption protects your data in transit, and the TLS connection terminates inside the TEE, ensuring your prompts are decrypted only within the secure environment.
2. Secure Request Routing: The LLM Gateway routes your request to the appropriate Private LLM Node based on the requested model, availability, and load balancing requirements.
3. Secure Inference: AI inference computations execute inside the Private LLM Node's TEE, where all data and model weights are protected by hardware-enforced isolation.
4. Attestation Generation: The TEE generates CPU and GPU attestation reports that provide cryptographic proof of the environment's integrity and configuration.
5. Cryptographic Signing: The TEE cryptographically signs both your original request and the inference results to ensure authenticity and prevent tampering.
6. Verifiable Response: You receive the AI response along with cryptographic signatures and attestation data for independent verification.
Via Direct Completions ({slug}.completions.near.ai):
1. Direct Request: You send chat completion requests via HTTPS directly to the model's subdomain. TLS terminates inside the model's TEE — there is no intermediate gateway.
2. Secure Inference: The model processes your request entirely within its TEE, where all data and model weights are protected by hardware-enforced isolation.
3. Attestation & Signing: The TEE generates attestation reports and cryptographically signs both your request and the inference results.
4. Verifiable Response: You receive the response with cryptographic signatures. Only model verification is needed — no gateway attestation to check.
Architecture Overview
NEAR AI Cloud operates through a distributed architecture consisting of an LLM Gateway and a network of Private LLM Nodes.
Private LLM Nodes
Each Private LLM Node provides secure, isolated AI inference capabilities:
- Standardized Hardware: 8x NVIDIA H200 GPUs per node, optimized for high-performance inference
- Intel TDX-enabled CPUs: Enable secure virtualization with hardware-enforced isolation
- Private-ML-SDK: Manages secure model execution, attestation generation, and cryptographic signing
- Health Monitoring: Automated liveness checks and monitoring ensure continuous availability
LLM Gateway
The LLM Gateway serves as the central orchestration layer:
- Model Management: Registers and manages available models across the Private LLM Node network
- Request Routing: Intelligently routes requests to appropriate nodes based on model availability and load
- Attestation Verification: Validates and stores TEE attestation reports for audit and verification
- Access Control: Manages API keys, authentication, and usage tracking for billing and monitoring
Security Guarantees
Defense in Depth
NEAR AI Cloud's private inference implements multiple layers of security to protect your data:
- Hardware-Level Isolation: TEEs create isolated execution environments enforced at the hardware level, preventing unauthorized access to memory and computation even from privileged system administrators or cloud providers.
- Secure Communication: All communication between your applications and the LLM infrastructure uses TLS encryption. Critically, TLS termination occurs inside the TEE—not at an external load balancer—so your prompts remain encrypted until they reach the secure enclave.
- Cryptographic Attestation: Every TEE environment generates cryptographic proofs that verify the integrity of the execution environment, allowing you to independently confirm your computations occurred in a genuine, unmodified TEE.
- Result Authentication: All AI outputs are cryptographically signed inside the TEE before leaving the secure environment, ensuring the authenticity and integrity of responses.
Threat Protection
NEAR AI Cloud's architecture protects against common attack vectors:
- Malicious Infrastructure Providers: Hardware-enforced TEE isolation prevents cloud infrastructure providers from accessing your prompts, model weights, or inference results, even with physical access to servers.
- Network-Based Attacks: End-to-end encryption protects your data during transmission, preventing man-in-the-middle attacks and network eavesdropping.
- Model Extraction Attempts: Model weights remain encrypted and isolated within the TEE, making extraction computationally infeasible even for attackers with privileged system access.
- Result Tampering: Cryptographic signatures generated inside the TEE ensure that responses cannot be modified in transit without detection, maintaining the integrity of AI outputs.
Next Steps
Verification
Understand how to verify and validate secure interactions with AI models
TLS Attestation Verification
Verify that your HTTPS connection terminates inside the TEE using hardware-backed TLS attestation
E2EE Chat Completions
Add client-side encryption for defense-in-depth protection of your messages