Inference providers — who hosts what, and at what price
A working index of where you can rent a model — Anthropic, OpenAI, Google, the open-weight hosts, and the cloud platforms behind them
First-party vs third-party — the distinction that actually matters
The index — major inference providers as of June 2026
Pricing is per million tokens (MTok) in USD, pulled from each provider's public pricing page. First-party hosts list flagship and budget tiers. Third-party hosts list one or two representative open-weight prices — they all have full catalogs. Always verify on the provider's live pricing page before committing a budget; this market reprices quarterly.
Anthropic API — first-party Claude
OpenAI API and Azure OpenAI
Google — AI Studio and Vertex AI
Third-party hosts of open-weight models
These providers host Llama, Qwen, DeepSeek, GPT-OSS, Gemma, and Mistral on their own hardware. The model checkpoint is the same across hosts; the price, latency, throughput, and compliance posture differ. Pricing below is illustrative — see the linked official pages in citations for full catalogs.
Groq
Llama 3.3 70B: $0.59 / $0.79 per MTok
LPU inference hardware. Optimized for very low time-to-first-token on open-weight LLMs. Public catalog includes Llama 3.x and Llama 4 variants, GPT-OSS 20B and 120B, Qwen3. Pricing is published per model. Free API key available with rate limits. Compliance certifications not publicly advertised on the pricing page — ask sales for the current posture.
Cerebras
Per-token rates on cerebras.ai/pricing
Wafer-scale inference. Hosts open-weight models including GPT-OSS-120B, Llama 4 Scout, and the GLM family. Tiered offering: free trial, self-serve Developer (add funds from $10), Enterprise. Per-token rates are listed on the pricing page rather than the inference landing page. Compliance posture is documented at the Cerebras trust center.
SambaNova Cloud
10 models on the pricing page
Reconfigurable dataflow hardware (RDU). Hosts DeepSeek (R1-Distill, V3.1, V3.2), Llama 3.3 70B, Llama 4 Maverick 17B, GPT-OSS-120B, Gemma 3 12B and 4 31B, MiniMax-M2.7. Per-million-token pricing is published per model — entry tier around $0.15 / $0.75, premium DeepSeek V3.x around $3.00 / $4.50. Trial credits referenced.
Together AI
25+ chat models, full price list public
Broad catalog of open-weight chat, image, and embedding models on standard GPU infrastructure. Serverless inference starts around $0.10 / $0.10 per MTok (Llama 3 8B Instruct Lite tier) and runs up to ~$1.40 / $4.40 for premium-tier models like GLM-5.1. 'Start for free' messaging on the landing page; compliance not on the pricing page itself.
Fireworks AI
$1 starter credit; cache and batch built in
Open-weight LLM hosting with cache and batch discounts (50% off cached input and batch inference by default). Embeddings from $0.008/MTok for small models. Per-model text and vision prices live in their docs rather than the pricing landing page. $1 free credit on signup.
Replicate
Hybrid: GPU-second, per-token, per-output
Mixed model: most models bill by hardware-seconds (GPU time × duration), some LLMs bill by token, image and video models bill per output unit. Useful when you want pinned model versions and per-version reproducibility. Free tier and compliance certifications not stated on the public pricing page.
Hugging Face Inference Endpoints
$0.033/hr starting; SOC 2 for endpoints
Closer to managed model hosting than a token marketplace. You deploy any Hub model to a dedicated endpoint on AWS, Azure, or GCP hardware and pay per GPU-hour. CPU instances from ~$0.03/hour. NVIDIA T4 from $0.50/hour. H100 from $4.50/hour. B200 from $9.25/hour. SOC 2 referenced for the Inference Endpoints product.
Modal
$30/mo free credits; SOC 2 (Starter), HIPAA (Enterprise)
Serverless GPU compute. You bring your own container; Modal handles cold starts, autoscaling, and per-second billing. A100 80GB at ~$0.000694/sec. H100 at ~$0.001097/sec. B200 at ~$0.001736/sec. Starter plan includes $30/month in free credits and SOC 2. Enterprise tier adds HIPAA and audit logs.
Anyscale
$100 starter credit; pay-as-you-go GPU
Ray-native compute platform. Pay-as-you-go GPU instances: T4 ~$0.57/hr, A100 ~$4.96/hr, H100 ~$9.29/hr, H200 ~$10.68/hr. $100 in starter credits for new accounts. Anyscale-hosted endpoints product has evolved — confirm the current managed-endpoint offering before committing.
Lambda
Inference API winding down — GPU rental active
GPU cloud. As of mid-2026 Lambda's inference API product is being wound down; the on-demand GPU instance offering remains. If you were planning to use Lambda Inference API, verify current status before integrating; the GPU rental product is still active.
Cloud platforms — Bedrock, Vertex, Azure
OpenRouter — the router layer
OpenRouter is not a host. It is a unified endpoint that routes your request to whichever provider serves the model you ask for, with automatic fallbacks if a primary provider degrades. The economics: free plan gives access to 25+ free models and 50 requests per day with no credit card. Pay-as-you-go adds a 5.5% platform fee on top of the underlying provider's rate. OpenRouter explicitly does not mark up provider pricing — what you see in the model catalog is the provider's actual rate. This makes OpenRouter useful for two things: cross-provider price discovery without holding accounts at all of them, and graceful fallback when one provider has an outage. The trade-off is that the compliance posture inherits from whichever underlying provider serves your request, so for regulated workloads you still need to pin to providers whose BAA you trust.
Perplexity API — search-augmented inference
How to choose — a minimum-effective-dose rule
Most teams overspend on inference by defaulting to the most expensive flagship for every call. The cheap dominant strategy is to route most traffic to a cheap tier and reserve flagship for the small fraction of calls that actually need it. A working heuristic:
- Default to the smallest model on the smallest budget that passes your eval. Haiku 4.5 ($1/$5), Gemini 2.5 Flash-Lite ($0.10/$0.40), or a small Llama variant on Groq ($0.05/$0.08) is the right first stop, not Opus.
- Only escalate to flagship (Opus, GPT-5.5, Gemini 3.1 Pro) when the smaller tier visibly fails on real tasks, and only for the calls that need it. Route, don't replace.
- Turn on prompt caching the moment your system prompt or shared context exceeds a few hundred tokens. Cache reads at 0.1x base input price are the single largest cost lever on long-context workloads.
- Use batch APIs (50% off on Anthropic, OpenAI, and most third-party hosts) for anything that can tolerate asynchronous turnaround — bulk classification, eval grading, content backfills.
- For regulated workloads, pick the cloud envelope first, then the model. AWS BAA + Bedrock + Claude is one paper; Anthropic enterprise HIPAA-ready is another; Azure BAA + Azure OpenAI is a third. Pick one and don't sprawl.
- For raw cost discovery across the open-weight ecosystem, route through OpenRouter for a quarter and read the per-model usage report. You'll learn which hosts your workload actually likes before signing direct contracts.
- Verify pricing the week you sign a contract. Every number on this page is dated June 2026 and was pulled from official sources; this market reprices on a quarterly cadence.
What this page does not promise
Three honest caveats. First: BAA / HIPAA / SOC 2 status is more subtle than a yes/no column can capture. Several providers (Together, Fireworks, Groq, SambaNova, Replicate) do not display compliance certifications on their public pricing pages even when they hold them — you have to ask sales. Treat 'not publicly advertised' as 'unknown without a sales call,' not as 'does not exist.' Second: this page does not list every available model at every provider. The catalogs are too large and they change weekly. The point of the index is to show the structure of the market and let you go to the canonical pricing page for the specific model you need. Third: latency, throughput, time-to-first-token, and reliability matter as much as $/MTok and are not captured here. Two providers can serve the same Llama 4 checkpoint at the same price and one of them will be three times faster in practice. Benchmark on your actual workload before you commit.
Sources
- [01]
Claude API per-token pricing for Opus 4.5–4.7 ($5/$25), Sonnet 4.5–4.6 ($3/$15), and Haiku 4.5 ($1/$5), plus caching and batch multipliers.
https://platform.claude.com/docs/en/about-claude/pricing ↗ - [02]
Anthropic's public pricing landing page, source of truth for Claude API rates and enterprise HIPAA-ready offering.
https://claude.com/pricing ↗ - [03]
OpenAI API per-token pricing for GPT-5.5, 5.4, 5.4-mini, 5.4-nano, cached input discount, batch and regional uplift policy.
https://developers.openai.com/api/docs/pricing ↗ - [04]
Google AI Studio pricing for Gemini 2.5 Flash, 2.5 Flash-Lite, 3.1 Flash-Lite, and 3.5 Flash, plus free tier data-use terms.
https://ai.google.dev/pricing ↗ - [05]
Vertex AI Gemini pricing tiers including 2.5 Pro ($1.25/$10.00, step up above 200k) and 3.x Flash variants.
https://cloud.google.com/vertex-ai/generative-ai/pricing ↗ - [06]
AWS Bedrock foundation model pricing across Anthropic Claude, Meta Llama, Mistral, and Amazon Titan with batch discount notes.
https://aws.amazon.com/bedrock/pricing/ ↗ - [07]
Together AI serverless inference pricing range from $0.10/$0.10 (Llama 3 8B Lite tier) to $1.40/$4.40 (GLM premium tier).
https://www.together.ai/pricing ↗ - [08]
Groq per-model pricing for Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout, GPT-OSS 20B and 120B, Qwen3 32B.
https://groq.com/pricing ↗ - [09]
Fireworks pricing structure with 50% cache and batch discounts and embedding tier pricing; $1 free credit on signup.
https://fireworks.ai/pricing ↗ - [10]
SambaNova Cloud catalog of 10 open-weight models with separate input/output per-million-token pricing per model.
https://cloud.sambanova.ai/pricing ↗ - [11]
Cerebras inference tiers (free trial, Developer self-serve from $10, Enterprise) and hosted models including GPT-OSS-120B and Llama 4 Scout.
https://www.cerebras.ai/inference/ ↗ - [12]
Replicate's hybrid pricing — hardware-time, per-token for LLMs, per-output for image/video models.
https://replicate.com/pricing ↗ - [13]
OpenRouter free plan (25+ models, 50 req/day), pay-as-you-go with 5.5% platform fee, and explicit no-markup policy on provider rates.
https://openrouter.ai/pricing ↗ - [14]
Perplexity Sonar pricing — Sonar ($1/$1), Sonar Pro ($3/$15), Sonar Reasoning Pro and Deep Research ($2/$8 + extras), plus per-request fees.
https://docs.perplexity.ai/guides/pricing ↗ - [15]
Hugging Face Inference Endpoints hourly GPU rates (T4 ~$0.50/hr through B200 ~$9.25/hr) and SOC 2 reference.
https://huggingface.co/pricing ↗ - [16]
Modal per-GPU-second pricing (A100 80GB ~$0.000694, H100 ~$0.001097, B200 ~$0.001736), $30/month Starter credits, SOC 2 and HIPAA-on-Enterprise.
https://modal.com/pricing ↗ - [17]
Anyscale per-GPU-hour pricing (T4 ~$0.57, A100 ~$4.96, H100 ~$9.29, H200 ~$10.68) with $100 starter credits.
https://www.anyscale.com/pricing ↗ - [18]
Lambda's Inference API is being wound down; the GPU instance rental product remains active.
https://lambda.ai/inference ↗ - [19]
Azure OpenAI Service hosts the OpenAI GPT family under Microsoft's Azure compliance envelope (HIPAA-eligible with Azure BAA).
https://azure.microsoft.com/en-us/products/ai-services/openai-service/ ↗ - [20]
AWS HIPAA eligibility and BAA framework, which governs Bedrock usage for regulated workloads.
https://aws.amazon.com/compliance/hipaa-compliance/ ↗