MiniMax AI models run on NVIDIA A100 and H100 GPU clusters, delivering high-throughput inference with CUDA-optimized kernels and sub-200ms time-to-first-token.
MiniMax inference infrastructure runs entirely on NVIDIA data center GPUs — A100 for standard workloads, H100 for large-scale video and frontier models.
The MiniMax platform operates GPU clusters across three geographic regions, each equipped with NVIDIA A100 80GB and H100 Tensor Core GPUs interconnected via NVLink and NVSwitch fabrics. The A100 fleet handles the majority of production inference — chat completions, embeddings, and moderate-scale video generation — with 80GB of HBM2e memory per GPU enabling large batch sizes and high throughput. The H100 cluster, built on the Hopper architecture, targets the most demanding workloads: 4K video generation, the largest MiniMax language models, and enterprise customers with dedicated capacity requirements.
Each GPU node runs NVIDIA's GPU Operator for automated driver management, CUDA toolkit 12.x, and a hardened container runtime. The physical infrastructure spans Tier III and Tier IV data centers with N+1 redundancy on power and cooling. All inter-GPU traffic within a node travels over NVLink (up to 900 GB/s per GPU on H100 nodes, 600 GB/s on A100); cross-node communication uses InfiniBand HDR at 200 Gbps to support tensor parallelism across multiple GPUs for the largest model deployments.
MiniMax deploys A100 (80GB HBM2e) for general inference and H100 (80GB HBM3) for large-scale workloads. The H100's Transformer Engine with FP8 delivers up to 9x faster training and 30x faster inference for transformer models versus the previous-generation A100.
MiniMax engineering applies custom CUDA kernels, FlashAttention-2, and TensorRT compilation to squeeze every teraflop of usable performance from NVIDIA hardware.
The optimization pipeline starts at the kernel level. MiniMax maintains a library of CUDA kernels tuned for specific model architectures and input shapes. FlashAttention-2 replaces standard attention implementations, reducing GPU memory reads by up to 7x for long-context generations. Custom matrix multiplication kernels use Tensor Cores with FP16/BF16 inputs and FP32 accumulation. For production serving, models pass through NVIDIA TensorRT compilation, which applies layer fusion, kernel auto-tuning, and precision calibration, yielding inference throughput gains of 30-60% compared to eager PyTorch execution.
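To make the kernel-level substitution concrete, here is a minimal sketch (illustrative only, using the open-source flash-attn package with arbitrary tensor shapes, not MiniMax internal code) of swapping standard attention for the fused FlashAttention-2 kernel:

```python
# Illustrative only: a drop-in FlashAttention-2 call via the open-source
# flash-attn package (pip install flash-attn). Shapes are arbitrary examples.
import torch
from flash_attn import flash_attn_func

# q, k, v: (batch, seqlen, num_heads, head_dim) in fp16/bf16 on a CUDA device
q = torch.randn(4, 8192, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# The fused kernel computes softmax(q @ k^T / sqrt(d)) @ v tile by tile in
# on-chip SRAM, never materializing the full seqlen x seqlen attention
# matrix in HBM -- the source of the memory-traffic savings described above.
out = flash_attn_func(q, k, v, causal=True)  # (4, 8192, 16, 64)
```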
The H100 Hopper architecture brings additional acceleration through the Transformer Engine. This library dynamically adjusts precision between FP8 and FP16 during inference, maintaining output quality while doubling throughput for transformer models. Combined with H100's 25% larger L2 cache (50 MB vs 40 MB on A100) and fourth-generation Tensor Cores, the largest MiniMax models achieve time-to-first-token consistently under 200ms for prompts up to 8K tokens.
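A minimal sketch of that dynamic precision management, assuming NVIDIA's open-source Transformer Engine Python package and placeholder layer sizes:

```python
# Hedged sketch: FP8 inference with NVIDIA Transformer Engine on Hopper.
# Layer dimensions are placeholders, not MiniMax model sizes.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 fwd, E5M2 bwd
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# Inside fp8_autocast, matmuls run on FP8 Tensor Cores with per-tensor
# scaling factors managed by the recipe; outputs stay in higher precision.
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```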
MiniMax uses a distributed serving stack — Triton Inference Server, Kubernetes orchestration, and continuous batching — to maximize GPU utilization without sacrificing latency.
Models are packaged as Triton model repositories, containerized with Docker, and deployed on Kubernetes clusters spanning GPU nodes. Each model variant runs as a horizontally scalable deployment with pod autoscaling triggered by request queue depth. The serving layer implements continuous batching: incoming requests are dynamically grouped into batches that fill GPU memory to optimal occupancy, dispatched immediately rather than waiting for fixed batch windows. This approach keeps GPU utilization above 80% during peak traffic while maintaining per-request latency SLAs.
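The scheduling idea behind continuous batching can be sketched in a few lines of Python. This is a toy model, not the production scheduler; a single token budget stands in for KV-cache capacity:

```python
# Toy continuous-batching scheduler (illustrative, not MiniMax's code).
# Requests join the running batch as soon as the token budget allows,
# and finished sequences free their slots immediately.
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

class ContinuousBatcher:
    def __init__(self, token_budget: int):
        self.token_budget = token_budget  # stand-in for KV-cache capacity
        self.waiting = deque()
        self.running = []

    def _tokens_in_use(self):
        return sum(r.prompt_len + r.generated for r in self.running)

    def submit(self, req):
        self.waiting.append(req)

    def step(self):
        # Admit waiting requests whenever the budget has room; no fixed window.
        while self.waiting and (
            self._tokens_in_use() + self.waiting[0].prompt_len <= self.token_budget
        ):
            self.running.append(self.waiting.popleft())
        # One decode iteration for the whole batch (the GPU kernel stand-in).
        finished = []
        for r in self.running:
            r.generated += 1
            if r.generated >= r.max_new_tokens:
                finished.append(r.rid)
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]
        return finished

# Usage: requests join and leave the batch mid-flight.
b = ContinuousBatcher(token_budget=64)
b.submit(Request(rid=1, prompt_len=20, max_new_tokens=3))
b.submit(Request(rid=2, prompt_len=30, max_new_tokens=5))
for step in range(6):
    done = b.step()
    if done:
        print(f"step {step}: finished {done}")
```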
NVIDIA NVLink and NVSwitch provide the inter-GPU bandwidth needed for tensor parallelism across 4 or 8 GPUs for the largest MiniMax models. The Triton model configuration specifies tensor-parallel size per deployment, and the Kubernetes scheduler ensures pods land on nodes with sufficient NVLink-connected GPUs. For inference that fits within a single GPU's memory, model replicas are distributed across all available GPUs with round-robin load balancing.
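The core mechanic of tensor parallelism is easy to see in miniature: shard a weight matrix column-wise across devices, compute partial outputs locally, and gather the shards over NVLink. A single-process illustration (real deployments would use NCCL collectives via torch.distributed):

```python
# Single-process illustration of column-parallel tensor parallelism.
# Each "rank" holds one column shard of the weight; NVLink would carry
# the all-gather in a real multi-GPU deployment.
import torch

def column_parallel_linear(x: torch.Tensor, weight: torch.Tensor, tp: int) -> torch.Tensor:
    shards = weight.chunk(tp, dim=1)    # one column shard per GPU rank
    partials = [x @ w for w in shards]  # local matmul on each rank
    return torch.cat(partials, dim=-1)  # all-gather of output shards

x = torch.randn(2, 1024)
w = torch.randn(1024, 4096)
assert torch.allclose(column_parallel_linear(x, w, tp=4), x @ w, atol=1e-5)
```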
The MiniMax-NVIDIA collaboration translates directly into customer benefits — faster responses, lower per-token costs, and infrastructure that scales without cold starts.
Faster inference: CUDA-optimized serving delivers chat completions at sub-200ms time-to-first-token and streaming throughput exceeding 150 tokens per second. Video generation for 10-second 1080p clips completes in under 90 seconds on H100 clusters. Embeddings for batches of 1,000 documents return in under 500ms.
Lower cost: GPU efficiency gains reduce the compute cost per inference, and those savings flow to customers through competitive per-token pricing. Continuous batching maximizes hardware utilization so MiniMax can serve more customers per GPU, keeping marginal costs down. Enterprise customers on reserved capacity plans lock in predictable pricing decoupled from spot-market GPU fluctuations.
Enterprise reliability: Dedicated GPU instances run on isolated partitions with guaranteed throughput. Auto-scaling policies add GPU capacity before queue depth impacts latency. Redundant power, cooling, and networking in each data center region means no single point of failure. The infrastructure achieves 99.95% availability for standard API endpoints and 99.9% for video generation workloads.
MiniMax is actively qualifying B200 GPUs for next-generation serving with expanded memory capacity and even higher throughput for frontier AI models.
The engineering roadmap includes B200 qualification for model serving, adoption of NVIDIA NIM microservices for standardized model packaging, and expanded regional GPU capacity in Asia-Pacific. MiniMax is also evaluating Grace Hopper Superchip deployments for CPU-GPU memory coherency that eliminates host-to-device copies during inference. These hardware investments align with MiniMax's commitment to keeping inference fast and affordable as model capabilities advance and context windows grow into the millions of tokens.
This table compares the NVIDIA GPU models deployed across MiniMax infrastructure, including memory, compute throughput, and recommended use cases.
| GPU Model | VRAM | FP16 Tensor Core TFLOPS (dense) | Use Case | Availability |
|---|---|---|---|---|
| NVIDIA A100 80GB | 80 GB HBM2e | 312 | Chat, embeddings, standard video generation | All regions, all plans |
| NVIDIA H100 80GB | 80 GB HBM3 | 989 | Large models, 4K video, high-throughput chat | All regions, pay-as-you-go & enterprise |
| NVIDIA A100 40GB | 40 GB HBM2e | 312 | Lightweight models, embedding workloads | Select regions, free tier |
| NVIDIA L40S | 48 GB GDDR6 | 362 | Video transcoding, small model inference | North America, enterprise only |
| NVIDIA H200 | 141 GB HBM3e | 989 | Frontier models, ultra-long context | Limited preview |
| NVIDIA B200 | 192 GB HBM3e | 2,250 | Next-gen serving, qualification in progress | Roadmap — 2026 |
"We run inference workloads that spike from zero to thousands of concurrent requests in under a minute. MiniMax H100 cluster handles those transitions without dropping a single request. The Triton serving stack plus continuous batching means we pay for actual GPU utilization — not idle capacity. When our team benchmarked MiniMax against three other providers, the NVIDIA-optimized inference stack delivered 40% better throughput per dollar."
— Henrik O. Dahl, Systems Engineer, Arctic Compute, Minneapolis
MiniMax deploys NVIDIA A100 80GB (Ampere architecture) for standard inference workloads including chat, embeddings, and moderate-scale video generation. The H100 80GB (Hopper architecture) serves the largest MiniMax language models and 4K video generation pipelines. Both GPU models connect via NVLink and NVSwitch for multi-GPU tensor parallelism. Additionally, L40S GPUs handle video transcoding in North American data centers, and H200 GPUs with 141GB HBM3e are in limited preview for frontier models requiring expanded memory. B200 qualification is underway for the next generation of MiniMax model serving infrastructure.
MiniMax employs a layered CUDA optimization strategy. At the kernel level, custom CUDA kernels implement FlashAttention-2 for memory-efficient attention computation and custom GEMM operations tuned for MiniMax model dimensions. At the compiler level, NVIDIA TensorRT applies layer fusion, kernel auto-tuning, and precision calibration to produce optimized inference engines. At the architecture level, the H100 Transformer Engine dynamically manages FP8/FP16 precision during inference, doubling throughput without quality loss. Together, these optimizations reduce per-token latency by 30-60% versus unoptimized PyTorch serving, while cutting GPU memory footprint enough to serve larger batch sizes or longer context windows on the same hardware.
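As a hedged illustration of the compiler-level step, the following sketch builds an FP16 TensorRT engine from an ONNX export. The file names are placeholders and the TensorRT 8.x-style Python API shown here is an assumption, not MiniMax's actual build pipeline:

```python
# Hedged sketch: building an FP16 TensorRT engine from an ONNX export.
# "model.onnx" is a placeholder; API shown is TensorRT 8.x-style.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels during auto-tuning

# Layer fusion, kernel auto-tuning, and precision selection all happen
# inside this call; the result is a serialized inference engine.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```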
Three concrete benefits reach every MiniMax customer. First, faster response times: chat completions arrive in under 200ms time-to-first-token, streaming throughput exceeds 150 tokens per second, and 10-second 1080p video generation completes in under 90 seconds on H100. Second, lower costs: GPU efficiency gains and continuous batching minimize idle hardware, translating into competitive per-token pricing and predictable enterprise bills. Third, scalability without compromise: auto-scaling Kubernetes deployments add GPU capacity before latency degrades, and enterprise dedicated instances provide isolation from multi-tenant traffic. Customers on reserved capacity plans lock in pricing decoupled from volatile GPU spot markets while receiving priority access to new NVIDIA hardware generations as they qualify.
The MiniMax serving architecture layers Triton Inference Server on Kubernetes with continuous batching at its core. Models are packaged as Triton model repositories: container images containing the model weights, configuration, and serving logic. Kubernetes deployments manage replica counts with horizontal pod autoscaling driven by request queue depth: when the queue exceeds a threshold, new replicas spin up from warm container images in under 30 seconds. Continuous batching dynamically groups incoming requests to fill GPU memory to optimal occupancy, issuing batches immediately rather than waiting for fixed windows. For the largest models requiring tensor parallelism across 4 or 8 GPUs, NVLink and NVSwitch provide up to 900 GB/s of GPU-to-GPU bandwidth on H100 nodes (600 GB/s on A100). Load balancing distributes single-GPU replicas across all available GPUs in a round-robin pattern. Prometheus metrics track GPU utilization, queue depth, and per-request latency for operational visibility.
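From the client side, a request into this stack could look like the sketch below. The endpoint, model name, and tensor names are hypothetical placeholders, not MiniMax's public API:

```python
# Hypothetical Triton HTTP client call; model and tensor names are made up
# for illustration (pip install tritonclient[http]).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

inp = httpclient.InferInput("INPUT_IDS", [1, 16], "INT64")
inp.set_data_from_numpy(np.zeros((1, 16), dtype=np.int64))

# Triton's scheduler merges this request into the current continuous batch.
result = client.infer(model_name="example-chat-model", inputs=[inp])
print(result.as_numpy("OUTPUT_IDS"))
```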
Yes, enterprise customers can reserve dedicated GPU capacity on MiniMax NVIDIA clusters. Dedicated instances run on isolated GPU partitions — either full GPUs or MIG (Multi-Instance GPU) slices on A100 hardware — with guaranteed throughput and zero noisy-neighbor interference. Customers choose between A100 and H100 SKUs based on model size and latency requirements. Auto-scaling policies for dedicated pools can be configured to maintain a minimum GPU count during off-peak hours and burst up to a negotiated ceiling. Pricing is monthly reserved with annual commitment discounts. Dedicated customers receive priority access to new NVIDIA GPU generations during the qualification period before general availability, plus a dedicated Slack channel with MiniMax infrastructure engineers for performance tuning and capacity planning. Contact enterprise sales at support@minimax.gr.com for a capacity assessment.
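For illustration, one way to confirm that a node's A100s are partitioned into MIG mode, assuming the nvidia-ml-py (pynvml) bindings; this is a generic NVML query, not a MiniMax tool:

```python
# Illustrative MIG-mode check via NVML bindings (pip install nvidia-ml-py).
# GPUs without MIG support raise NVMLError on the query.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        try:
            current, pending = pynvml.nvmlDeviceGetMigMode(handle)
            print(f"GPU {i} ({name}): MIG enabled = {bool(current)}")
        except pynvml.NVMLError:
            print(f"GPU {i} ({name}): MIG not supported")
finally:
    pynvml.nvmlShutdown()
```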