LLM Self-Hosting for Enterprise - Azure, GCP, On-Premise
Self-host language models: DeepSeek, Llama, Mistral in your own infrastructure. Deployment options: Azure, GCP, on-premise, hybrid
Why Self-Hosting?
For many enterprise clients, the question is not whether AI will be deployed, but where the data is processed. When using cloud APIs (OpenAI, Anthropic, Google), data leaves the organization’s own infrastructure. For regulated industries - finance, healthcare, public sector - data residency can be a disqualifying factor.
Self-hosting means: the language model runs in the client’s infrastructure. No data leaves the corporate network. No third party processes the requests. Full control over model, data, and processing.
Which Models Can You Self-Host?
Open-source models can be operated in your own infrastructure:
Llama (Meta): Various sizes (8B, 70B, 405B parameters). Powerful, well-documented, large community.
Mistral: European provider. Models include Mistral 7B and the mixture-of-experts Mixtral 8x7B. Efficient, with a strong price-performance ratio.
DeepSeek: Various variants, including DeepSeek-R1 for reasoning tasks. Notably cost-efficient for the capability delivered.
gpt-oss (OpenAI): OpenAI’s first open-weight models since GPT-2, released under the Apache 2.0 license. gpt-oss-120b (117B parameters, mixture-of-experts, runs on a single 80 GB GPU) and gpt-oss-20b for edge scenarios.
Proprietary models (Claude, ChatGPT, Gemini) are not available for self-hosting but can be used via API with EU-based processing.
In a model-agnostic architecture, an agent can use multiple models: self-hosted for sensitive data, cloud API for non-critical tasks. The routing is rule-based and configured in the Decision Layer.
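Such rule-based routing can be sketched in a few lines. This is an illustrative example, not the actual Decision Layer: the model endpoint names, the `contains_pii` flag, and the category list are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    text: str
    contains_pii: bool   # flagged upstream, e.g. by a PII scanner
    category: str        # e.g. "hr", "legal", "marketing"

# Illustrative: which business domains count as sensitive.
SENSITIVE_CATEGORIES = {"hr", "legal", "finance"}

def route(task: Task) -> str:
    """Return the model endpoint a task should be sent to."""
    # Rule 1: personal data never leaves the corporate network.
    if task.contains_pii:
        return "self-hosted/llama-3-70b"
    # Rule 2: sensitive business domains stay on self-hosted models too.
    if task.category in SENSITIVE_CATEGORIES:
        return "self-hosted/llama-3-70b"
    # Default: non-critical work may use a cloud API with EU processing.
    return "cloud/eu-api"
```

The point of keeping the rules explicit and declarative is that governance can review them: every routing decision is traceable to a rule, not to a model's judgment.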
Deployment Options
Azure: LLMs can be deployed on Azure ML or operated on dedicated GPU VMs (NC-Series, ND-Series). Integration with Microsoft Entra ID (formerly Azure AD) for authentication and access control. Processing in EU data centers (West Europe, North Europe).
GCP: Deployment via Vertex AI or on dedicated GPU VMs (A2, G2). Integration with Google Cloud IAM. Processing in EU data centers (europe-west1, europe-west4).
On-Premise: Dedicated servers with NVIDIA GPUs (A100, H100, RTX 4000 Ada). Operation in certified data centers. Maximum control, no cloud dependency.
Hybrid: Combination of self-hosted and cloud. Sensitive workloads run locally, non-critical workloads in the cloud. Unified governance across both environments.
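On any of these targets, a self-hosted model is typically exposed through an OpenAI-compatible inference server. As one possible sketch (using vLLM; the model name and flag values are illustrative, not a prescribed setup):

```shell
# Serve an open-weight model behind an OpenAI-compatible API
# on a dedicated GPU server; shard across 4 GPUs via tensor parallelism.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 --port 8000
```

Because the API surface matches the OpenAI format, applications can switch between self-hosted and cloud endpoints by changing a base URL, which is what makes the hybrid setup above practical.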
Architecture Considerations
GPU Sizing: Model size determines GPU requirements. A 7B model runs on a single GPU; a 70B model requires multiple GPUs or quantization. The right sizing depends on the use case, expected load, and context lengths.
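A back-of-the-envelope estimate for the weight memory alone makes these numbers concrete (real requirements add KV cache, activations, and runtime overhead):

```python
# Bytes needed per parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate GPU memory needed just to hold the weights, in GB."""
    return params_billion * BYTES_PER_PARAM[dtype]

# A 7B model in fp16 needs ~14 GB and fits a single 24 GB GPU.
# A 70B model in fp16 needs ~140 GB (multiple GPUs), or ~35 GB at 4-bit.
```

This is why quantization is often the deciding factor between a single-GPU and a multi-GPU deployment.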
Inference Optimization: Techniques such as quantization (4-bit, 8-bit), batching, and KV cache optimization reduce resource requirements with acceptable quality trade-offs.
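The KV cache is often the hidden cost here: it grows linearly with context length and batch size. A minimal sketch of the standard sizing formula (the example values approximate Llama-2-7B's architecture: 32 layers, 32 KV heads, head dimension 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_val: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim
    x tokens, at `bytes_per_val` bytes per value (2 for fp16)."""
    values = 2 * layers * kv_heads * head_dim * seq_len * batch
    return values * bytes_per_val / 1e9

# ~2.1 GB per sequence at a 4,096-token context for a Llama-2-7B-like
# model in fp16; a batch of 16 such sequences already needs ~34 GB.
```

This is why batching and KV cache optimizations (paged attention, cache quantization) matter as much as weight quantization for serving throughput.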
High Availability: For production systems: redundant GPU servers, load balancing, automatic failover. No single point of failure.
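The failover logic itself is simple; a minimal client-side sketch (endpoint names and the health predicate are illustrative assumptions):

```python
import itertools

class FailoverBalancer:
    """Round-robin over redundant inference servers, skipping
    endpoints that fail the health check."""

    def __init__(self, endpoints, is_healthy):
        self.endpoints = list(endpoints)
        self.is_healthy = is_healthy  # callable: endpoint -> bool
        self._rr = itertools.cycle(range(len(self.endpoints)))

    def pick(self) -> str:
        # Try each endpoint at most once per call.
        for _ in range(len(self.endpoints)):
            endpoint = self.endpoints[next(self._rr)]
            if self.is_healthy(endpoint):
                return endpoint
        raise RuntimeError("no healthy inference endpoint available")
```

In production this role is usually filled by a dedicated load balancer in front of the GPU servers, but the principle is the same: every request has at least one alternative path.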
Model Updates: New model versions must be tested before going into production. A staging environment for model testing is part of the infrastructure.
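The staging gate can be made explicit as a promotion rule: a candidate model version is promoted only if it holds up against the production model on a fixed evaluation set. A minimal sketch (the scoring and margin are illustrative assumptions, not a prescribed evaluation method):

```python
def should_promote(candidate_scores: list[float],
                   production_scores: list[float],
                   margin: float = 0.0) -> bool:
    """Promote the candidate only if its mean evaluation score is
    not worse than production's by more than `margin`."""
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    production_mean = sum(production_scores) / len(production_scores)
    return candidate_mean >= production_mean - margin
```

Keeping the gate automated means a new model version cannot reach production by accident; it either passes the same evaluation suite as its predecessor or stays in staging.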
More on this: AI Infrastructure
Book a consultation - We will show you the optimal hosting strategy for your requirements.