MiniMax AI Models & Capabilities

MiniMax produces two large language models engineered for distinct workloads. MiniMax-Text-01 handles text-only tasks at 380B parameters. MiniMax-VL-01 adds vision support at 250B parameters. Both run on a unified inference stack with regional API endpoints.

Language Model Architecture

MiniMax models use a dense transformer architecture optimized for throughput and output quality across text and multimodal tasks.

MiniMax-Text-01 runs a 380-billion-parameter dense transformer with grouped-query attention, SwiGLU activation, and rotary position embeddings. The model was pre-trained on 12 trillion tokens spanning code, scientific literature, web text, and multilingual corpora. Its 256K context window handles document sets that break smaller models. Inference speeds reach 65 tokens per second on the US West endpoint at a typical load of 32 concurrent requests. The tokenizer vocabulary covers 150,000 entries, with byte-level fallback encoding for rare characters.

MiniMax-VL-01 extends the architecture with a vision encoder that processes images up to 8,192 pixels on the longest edge. The vision tower uses a ViT-G architecture with 2 billion parameters, connected to the text backbone through a learned projection layer. Image tokens interleave with text tokens in the sequence, so the model reasons across modalities within a single forward pass. File formats accepted include JPEG, PNG, WebP, and TIFF. GIF frames are treated as sequential images. The model pinpoints regions in charts, reads handwritten text, and compares visual elements across multiple uploaded images.

Model Specifications Overview:

MiniMax-Text-01 delivers 380B parameters with 256K context. MiniMax-VL-01 provides 250B parameters with 128K context and full vision support. Both models share the same tokenizer, API contract, and regional deployment infrastructure. Choose Text-01 for pure language tasks; add VL-01 when images drive decisions.

MiniMax-Text-01

The text-only flagship handles extended reasoning, code generation, and document analysis at industrial scale.

MiniMax-Text-01 processes up to 256,000 tokens in a single forward pass. That covers full-length books, complete codebases up to roughly 40,000 lines, or multi-day chat logs. The model uses sliding-window attention with a window size of 8,192 tokens and 32 attention heads across 96 transformer layers. The embedding dimension is 16,384. Pre-training data comprises 65% English-language sources, 15% code repositories, 12% multilingual text, and 8% technical documentation. Fine-grained instruction tuning adds 800,000 curated prompt-response pairs covering 140 task categories.

Performance highlights: MMLU 87.2%, HumanEval pass@1 84.1%, GSM8K 91.5%, MATH 64.3%. These benchmarks were measured with greedy decoding at temperature 0. Latency stays under 800 ms for prompts under 1,000 tokens with generations of up to 500 tokens. The model supports structured JSON output via a constrained decoding grammar. Function calling follows the standard tool-use schema, with parallel tool invocation for up to 8 simultaneous function calls per generation step.
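To make the JSON-mode and function-calling features above concrete, here is a minimal sketch of a request body combining both. The field names ("tools", "response_format") follow the common tool-use schema the text mentions, but the exact MiniMax API contract is an assumption here; consult the official API reference for the real field names.

```python
import json

def build_tool_request(prompt: str, tools: list, model: str = "MiniMax-Text-01") -> dict:
    """Assemble a hypothetical chat request with tool definitions.

    Wraps each tool spec in the common {"type": "function", ...} envelope
    and asks for constrained JSON output via "response_format".
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{"type": "function", "function": t} for t in tools],
        "response_format": {"type": "json_object"},
    }

# A single tool definition using JSON Schema for its parameters.
weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

request = build_tool_request("What's the weather in Oslo?", [weather_tool])
print(json.dumps(request, indent=2))
```

Because the model supports up to 8 parallel tool calls per generation step, a single response may contain several tool invocations; your client should iterate over all of them rather than assume one.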

MiniMax-VL-01

MiniMax-VL-01 adds a 2B-parameter vision encoder to the text backbone for multimodal reasoning and image understanding.

MiniMax-VL-01 accepts text and image inputs together and produces text outputs. The vision encoder processes images at native resolution up to 8,192 pixels on the long edge, automatically downscaling larger inputs while preserving aspect ratio. A moderate-complexity image occupies roughly the same sequence budget as 512 text tokens, so each attached image shrinks the effective text-only context window by about 512 tokens. The model handles multi-image conversations: you can attach up to 20 images per request and ask comparative questions.
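The context-budget arithmetic above can be sketched directly from the stated figures: a 128K-token window, roughly 512 tokens per image, and a 20-image cap per request. This is back-of-the-envelope planning code, not an official accounting of the tokenizer's actual per-image cost.

```python
# Figures taken from the section above; actual per-image cost varies
# with image complexity, so treat 512 as a planning estimate.
CONTEXT_WINDOW = 128_000
TOKENS_PER_IMAGE = 512
MAX_IMAGES = 20

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after attaching num_images images."""
    if not 0 <= num_images <= MAX_IMAGES:
        raise ValueError(f"requests accept 0-{MAX_IMAGES} images")
    return CONTEXT_WINDOW - num_images * TOKENS_PER_IMAGE

print(remaining_text_budget(20))  # 117760 tokens left for text at the image cap
```

Even a maximally image-heavy request leaves well over 100K tokens for text, so the image budget rarely constrains prompt design in practice.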

Benchmark results for multimodal performance: MMBench 83.6%, ChartQA 81.9%, DocVQA 90.2%, MathVista 62.8%. The model performs well on infographics, scanned documents, photographs with embedded text, and medical imaging tasks. MiniMax-VL-01 also anchors the video understanding pipeline: video frames are extracted at 1 frame per second, processed as image sequences, and analyzed with frame-level reasoning before producing a summary. This pipeline supports up to 3-minute video clips submitted through the API.
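The video pipeline's frame budget follows from the numbers above: 1 frame per second, a 3-minute clip cap, and roughly one image's worth of sequence budget (~512 tokens) per frame. A quick sketch of that math, under those stated assumptions:

```python
FPS = 1.0                  # frames extracted per second, per the docs
MAX_CLIP_SECONDS = 3 * 60  # 3-minute clip cap
TOKENS_PER_FRAME = 512     # same planning estimate as a single image

def frame_token_cost(duration_s: int) -> tuple[int, int]:
    """Return (frame_count, estimated_token_cost) for a clip."""
    if duration_s > MAX_CLIP_SECONDS:
        raise ValueError("clips are capped at 3 minutes")
    frames = int(duration_s * FPS)
    return frames, frames * TOKENS_PER_FRAME

frames, tokens = frame_token_cost(180)
print(frames, tokens)  # 180 frames, 92160 tokens
```

A maximum-length clip costs about 92K tokens of the 128K window, which explains the 3-minute cap: longer clips at 1 fps would crowd out the prompt and the model's frame-level reasoning.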

Inference Performance and Fine-Tuning

MiniMax models deliver production throughput via optimized inference and offer supervised fine-tuning for domain adaptation.

Inference throughput peaks at 65 tokens per second per request for MiniMax-Text-01 and 48 tokens per second for MiniMax-VL-01 under typical load. The platform scales horizontally across GPU clusters, so throughput remains consistent as concurrent requests rise. Rate limits vary by plan: the free tier caps at 60 requests per minute, pay-as-you-go at 600 RPM, and enterprise plans offer dedicated throughput with no hard rate cap. The p95 response time stays under 1.2 seconds for prompts under 2,000 tokens across all supported regions.
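A client that respects the published per-plan caps can derive its minimum request spacing directly from the RPM figures above. The plan names and the pacing strategy are illustrative; only the 60 and 600 RPM limits come from the documentation.

```python
# RPM caps from the rate-limit table above (enterprise has no hard cap,
# so it is omitted from this simple pacing helper).
RPM_LIMITS = {"free": 60, "pay_as_you_go": 600}

def min_interval_s(plan: str) -> float:
    """Smallest spacing between requests that stays under the plan's cap."""
    return 60.0 / RPM_LIMITS[plan]

print(min_interval_s("free"))           # 1.0 second between requests
print(min_interval_s("pay_as_you_go"))  # 0.1 seconds between requests
```

In production you would typically pair this with retry-after handling on 429 responses rather than relying on client-side pacing alone.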

Supervised fine-tuning accepts datasets in JSONL format with up to 50,000 training examples. The platform handles learning rate scheduling automatically, using cosine decay from an initial value you specify. Training runs on dedicated GPU clusters isolated from inference traffic, so fine-tuning jobs never degrade production response times. A typical 10,000-example dataset completes training in 4 to 6 hours. Fine-tuned models deploy as private endpoints accessible only to your organization. You can compare base and fine-tuned outputs side by side in the platform hub before routing production traffic.
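Before uploading a fine-tuning dataset, it is worth validating the JSONL file locally. The sketch below assumes a per-line schema of a "messages" array with role/content pairs, which is a common convention but not confirmed by the text; only the JSONL format and the 50,000-example cap come from the documentation.

```python
import json

MAX_EXAMPLES = 50_000  # documented dataset cap

def validate_jsonl(lines: list[str]) -> int:
    """Check size and per-line structure; return the example count."""
    if len(lines) > MAX_EXAMPLES:
        raise ValueError(f"dataset exceeds {MAX_EXAMPLES} examples")
    for i, line in enumerate(lines):
        record = json.loads(line)  # raises on malformed JSON
        if "messages" not in record:
            raise ValueError(f"line {i}: missing 'messages' field")
    return len(lines)

sample = [json.dumps({"messages": [
    {"role": "user", "content": "Summarize this ticket."},
    {"role": "assistant", "content": "Customer reports a login failure."},
]})]
print(validate_jsonl(sample))  # 1
```

Catching a malformed line locally is much cheaper than discovering it hours into a 4-to-6-hour training run.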

Quantized variants of both models run at 8-bit precision for applications where latency matters more than maximum accuracy. Quantized MiniMax-Text-01 delivers 120 tokens per second with a 1.8% relative drop on MMLU. Quantized MiniMax-VL-01 reaches 85 tokens per second with a 2.1% drop on MMBench. Both quantized models use the same API interface as their full-precision counterparts; swap the model name parameter to switch.

Benchmark Methodology

All MiniMax model benchmarks use standardized evaluation protocols with public test sets and greedy decoding.

MiniMax publishes benchmark results using the EleutherAI evaluation harness configured for 5-shot prompting on MMLU, 0-shot on HumanEval, 8-shot on GSM8K, and 4-shot on MATH. Vision benchmarks use the official evaluation scripts and datasets published by each benchmark's maintainers. No task-specific prompt engineering, retrieval augmentation, or ensembling was applied. MiniMax benchmarks are reproducible: the evaluation configurations and prompt templates are included in the technical report available through the developer resources section.
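The few-shot settings above map naturally onto an lm-evaluation-harness invocation. The flags below (`--model`, `--model_args`, `--tasks`, `--num_fewshot`) exist in the harness's CLI, but the task identifiers and the model-loading arguments are assumptions; check the harness's task registry for the exact names.

```python
# Few-shot counts as published in the methodology above.
FEWSHOT = {
    "mmlu": 5,       # 5-shot
    "humaneval": 0,  # 0-shot
    "gsm8k": 8,      # 8-shot
    "math": 4,       # 4-shot (the MATH benchmark)
}

def eval_command(model_path: str, task: str) -> str:
    """Build an lm-evaluation-harness command line for one task."""
    return (
        f"lm_eval --model hf --model_args pretrained={model_path} "
        f"--tasks {task} --num_fewshot {FEWSHOT[task]}"
    )

print(eval_command("your-org/your-model", "mmlu"))
```

Matching these shot counts is the main prerequisite for comparing your own evaluation runs against the published numbers.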

Choosing Between Models

Select MiniMax-Text-01 for maximum language quality; add MiniMax-VL-01 when images, charts, or documents drive your workflow.

For text-only applications — chatbots, summarization pipelines, code assistants, translation services — MiniMax-Text-01 provides the strongest results and highest throughput. For applications with visual inputs — document processing, chart interpretation, UI screenshot analysis, medical image review — MiniMax-VL-01 is the appropriate choice. Many teams deploy both models: Text-01 handles user-facing chat and text generation, while VL-01 processes uploaded files and images in a parallel pipeline. The unified API contract means switching between models requires changing a single model identifier parameter in your request.
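One way to implement the dual-deployment pattern above is a small router that sends image-bearing requests to VL-01 and everything else to Text-01. The `-8bit` identifier suffix is a hypothetical naming convention, not a confirmed API string; check the model list your account exposes for the exact identifiers.

```python
def pick_model(has_images: bool, low_latency: bool = False) -> str:
    """Choose a model identifier based on modality and latency needs.

    Per the spec table, 8-bit variants trade ~2% benchmark accuracy for
    roughly double the decode throughput (US West / US East only).
    """
    base = "MiniMax-VL-01" if has_images else "MiniMax-Text-01"
    return base + ("-8bit" if low_latency else "")

print(pick_model(has_images=True))                     # MiniMax-VL-01
print(pick_model(has_images=False, low_latency=True))  # MiniMax-Text-01-8bit
```

Because the API contract is shared, this one function is the only place the model choice needs to live; the rest of the request-building code stays identical.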

Model Specification Table

Detailed specifications for each MiniMax model, including parameter counts, context windows, recommended use cases, and regional availability.

Model Name              | Parameters       | Context Window | Use Case                                               | Availability
MiniMax-Text-01         | 380B             | 256K tokens    | Text generation, code, reasoning, summarization        | All regions
MiniMax-VL-01           | 250B             | 128K tokens    | Image understanding, document analysis, visual QA      | All regions
MiniMax-Text-01 (8-bit) | 380B (quantized) | 256K tokens    | Low-latency text generation, high-throughput pipelines | US West, US East
MiniMax-VL-01 (8-bit)   | 250B (quantized) | 128K tokens    | Real-time image analysis, batch document processing    | US West, US East
