Inference providers — who hosts what, and at what price

A working index of where you can rent a model — Anthropic, OpenAI, Google, the open-weight hosts, and the cloud platforms behind them

There are two kinds of inference provider, and confusing them costs real money.

The first kind hosts its own models. Anthropic serves Claude. OpenAI serves GPT. Google serves Gemini. These are first-party endpoints: the model maker sells directly, the pricing page is the source of truth, and a Business Associate Agreement (when offered) comes directly from the lab.

The second kind hosts other people's open-weight models. Together, Fireworks, Groq, Cerebras, SambaNova, Replicate, Modal, and Lambda all do some version of this: take a published checkpoint (Llama, Qwen, DeepSeek, GPT-OSS, Gemma, Mistral), put it on their hardware, expose an API. Same model, different host, different price, different throughput, different compliance posture. A Llama 4 token from Groq and a Llama 4 token from Together come from the same published checkpoint, billed by two different companies.

Cloud platforms add a third layer. AWS Bedrock and Google Vertex resell first-party models (Claude on Bedrock, Claude on Vertex, Gemini on Vertex) alongside open weights, with their own SLAs and BAAs. Azure does the same for OpenAI. OpenRouter sits a tier above everything as a single endpoint that routes to whichever provider you select. It explicitly does not mark up provider rates and charges only a platform fee on pay-as-you-go.

This page indexes the major providers as of June 2026, with pricing pulled from official pages. Pricing in this market moves quarterly. Always check the provider's pricing page before sizing a budget; the URLs in the citations below are the canonical sources. Where a number is missing from public docs (e.g., per-region pricing details, current BAA scope), this page says so explicitly rather than guessing.

First-party vs third-party — the distinction that actually matters

First-party means the lab that trained the model is the one selling inference. Anthropic sells Claude. OpenAI sells GPT. Google sells Gemini through both AI Studio (the developer surface) and Vertex AI (the enterprise surface). The pricing page on the lab's domain is authoritative: if a third party quotes you a different number for the same first-party model, check whether a regional uplift or platform fee explains the gap before assuming the quote is right.

Third-party means a host that has taken a publicly released open-weight model (Meta's Llama, Alibaba's Qwen, DeepSeek's V3 and R1 lineages, Mistral's open releases, Google's Gemma, OpenAI's GPT-OSS) and is running it on its own hardware. The same model checkpoint can appear on five hosts with five different price points and five different latency profiles. Together, Fireworks, Groq, Cerebras, SambaNova, and DeepInfra are the most visible names here. There is no proprietary lab relationship; the host is competing on hardware, scheduler, quantization choices, and price.

A few hybrid cases are worth flagging. Replicate runs open-weight LLMs and also resells some closed-model APIs (Claude, for example, appears as a passthrough). Hugging Face Inference Endpoints lets you deploy any model from the Hub on rented GPUs, which is closer to managed hosting than to a token marketplace. Modal and Anyscale are general GPU compute platforms that people use to serve their own model deployments. AWS Bedrock and Vertex AI host first-party models from multiple labs under a single cloud contract. OpenRouter is purely a router.

Knowing which kind a provider is tells you who controls the model version, who signs the BAA, and whether 'the same Claude' really is the same Claude. (Spoiler: Claude tokens from the Claude API, Bedrock, and Vertex are all served by Anthropic infrastructure or an Anthropic-managed deployment. They bill differently, but the model is the same.)

The index — major inference providers as of June 2026

Pricing is per million tokens (MTok) in USD, pulled from each provider's public pricing page. First-party hosts list flagship and budget tiers. Third-party hosts list one or two representative open-weight prices — they all have full catalogs. Always verify on the provider's live pricing page before committing a budget; this market reprices quarterly.

Provider	Type	Representative model	Input $/MTok	Output $/MTok	Free tier	BAA / HIPAA
Anthropic API	First-party (Claude)	Claude Sonnet 4.6	$3.00	$15.00	Small starter credits	HIPAA-ready offering (enterprise)
Anthropic API	First-party (Claude)	Claude Opus 4.7	$5.00	$25.00	Same starter credits	HIPAA-ready offering (enterprise)
Anthropic API	First-party (Claude)	Claude Haiku 4.5	$1.00	$5.00	Same starter credits	HIPAA-ready offering (enterprise)
OpenAI API	First-party (GPT)	GPT-5.4	$2.50	$15.00	Limited eval credits historically	BAA available — check current scope
OpenAI API	First-party (GPT)	GPT-5.4-mini	$0.75	$4.50	Same	Same
Google AI Studio	First-party (Gemini)	Gemini 2.5 Flash	$0.30	$2.50	Free tier (data used for product improvement)	Use Vertex for BAA
Google AI Studio	First-party (Gemini)	Gemini 3.5 Flash	$1.50	$9.00	Free tier (data used for product improvement)	Use Vertex for BAA
Vertex AI	First-party + multi-lab	Gemini 2.5 Pro	$1.25 (≤200k)	$10.00	GCP free credits for new accounts	GCP HIPAA-eligible with BAA
Vertex AI	First-party + multi-lab	Claude Sonnet 4.5 (via Vertex)	Anthropic rates + regional uplift	Same	GCP free credits	GCP HIPAA-eligible with BAA
AWS Bedrock	Multi-lab resale	Claude Sonnet on Bedrock	Anthropic rates + regional uplift	Same	AWS Free Tier credits	AWS HIPAA-eligible with BAA
AWS Bedrock	Multi-lab resale	Mistral Large 3	$0.50	$1.50	AWS Free Tier credits	AWS HIPAA-eligible with BAA
Azure OpenAI	First-party resale (GPT)	GPT family via Azure	Azure-listed (varies by region/tier)	Same	Azure free account credits	Azure HIPAA-eligible with BAA
Groq	Third-party (open weights)	Llama 3.3 70B Versatile	$0.59	$0.79	Free API key, rate-limited	Not publicly advertised — ask sales
Groq	Third-party (open weights)	Llama 3.1 8B Instant	$0.05	$0.08	Same	Same
Cerebras	Third-party (open weights)	GPT-OSS-120B / Llama 4 Scout	See Cerebras pricing page	Same	Free trial API access	See Cerebras trust center
SambaNova Cloud	Third-party (open weights)	Llama 3.3 70B Instruct	Listed; tier varies	Same	Trial credits referenced	Not publicly advertised
Together AI	Third-party (open weights)	Llama 3 8B Instruct Lite	$0.10	$0.10	Trial credits ('start for free')	Not on pricing page — ask sales
Together AI	Third-party (open weights)	Premium tier (e.g. GLM family)	Up to $1.40	Up to $4.40	Same	Same
Fireworks AI	Third-party (open weights)	Open-weight LLMs (see docs)	Listed in docs; embeddings from $0.008	Same	$1 in free credits	Not on pricing page — ask sales
Replicate	Mixed (open + passthrough)	DeepSeek and others (token-billed)	Per-model (e.g., $3/MTok input on some)	Per-model	Limited free runs historically	Not publicly advertised
Perplexity API	Search-augmented LLM	Sonar Pro	$3.00	$15.00	Paid only	Not on standard pricing page
Perplexity API	Search-augmented LLM	Sonar (entry)	$1.00	$1.00	Paid only	Same
OpenRouter	Router (no hosting)	Pass-through to 60+ providers	Provider rate + 5.5% platform fee (PAYG)	Same	Free plan: 25+ models, 50 req/day	Inherits provider posture
Hugging Face Inference Endpoints	Managed hosting	Any Hub model on rented GPUs	GPU/hour (T4 ~$0.50/hr to B200 ~$9.25/hr)	Same	Free Spaces, Inference API trial	SOC 2 referenced for endpoints
Modal	Serverless GPU compute	Bring-your-own model	Per GPU-second (A100 ~$0.000694/sec)	Same	$30/month free credits (Starter)	SOC 2 (Starter); HIPAA on Enterprise
Anyscale	Ray-based compute	Bring-your-own deployments	Per GPU-hour (H100 ~$9.29/hr)	Same	$100 starter credits	Check Anyscale enterprise terms
Lambda	GPU cloud	On-demand GPU instances	Inference API winding down; GPU rental remains	Same	Varies	Check Lambda enterprise terms

Anthropic API — first-party Claude

Anthropic publishes a single pricing page for the Claude API. As of June 2026 the active tiers are Claude Opus (4.5, 4.6, 4.7, with 4.1 and 4 in deprecated state), Claude Sonnet (4.5 and 4.6), and Claude Haiku (4.5; 3.5 retired except on Bedrock and Vertex). Flagship Opus 4.5–4.7 is priced at $5 input / $25 output per million tokens. Sonnet 4.5–4.6 is $3 / $15. Haiku 4.5 is $1 / $5. Prompt caching is a real lever. A five-minute cache write costs 1.25x the base input rate, a one-hour write costs 2x, and a cache read costs 0.1x. Read once and the five-minute cache has paid for itself; read twice and the one-hour cache has paid for itself. Batch API gives 50% off both directions for asynchronous workloads. Data residency (US-only inference via the inference_geo parameter on Opus 4.6+) adds a 1.1x multiplier. Claude is also available through AWS Bedrock and Google Vertex. These are Anthropic-managed deployments billed by the cloud provider. Regional and multi-region endpoints carry a 10% premium versus the global endpoint. For HIPAA-bound workloads, Anthropic offers an enterprise HIPAA-ready configuration; for AWS- or GCP-bound compliance, the cloud platform's BAA governs.
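The break-even claim for prompt caching can be checked with a few lines of arithmetic. The sketch below works in multiples of the base input rate and uses only the multipliers quoted above (1.25x and 2x for writes, 0.1x for reads). It assumes every read lands while the cache entry is still alive and that the cached prefix is the same size on every call; the function names are illustrative, not part of any SDK.

```python
# Multipliers relative to the base input rate, as quoted in the text above.
CACHE_WRITE_5MIN = 1.25
CACHE_WRITE_1HR = 2.0
CACHE_READ = 0.1


def uncached_cost(reads: int) -> float:
    """Cost of sending the same prefix (1 + reads) times with no caching."""
    return 1.0 * (1 + reads)


def cached_cost(write_multiplier: float, reads: int) -> float:
    """Cost of one cache write followed by `reads` cache hits."""
    return write_multiplier + CACHE_READ * reads


def break_even_reads(write_multiplier: float) -> int:
    """Smallest number of cache reads at which caching beats not caching."""
    reads = 1
    while cached_cost(write_multiplier, reads) >= uncached_cost(reads):
        reads += 1
    return reads


if __name__ == "__main__":
    for label, mult in [("5-minute", CACHE_WRITE_5MIN), ("1-hour", CACHE_WRITE_1HR)]:
        print(f"{label} cache: cheaper after {break_even_reads(mult)} read(s)")
```

With one read, the five-minute cache costs 1.35x against 2x uncached. The one-hour cache costs 2.1x against 2x after one read and 2.2x against 3x after two, which matches the "read once, read twice" rule of thumb.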

OpenAI API and Azure OpenAI

OpenAI's first-party endpoint hosts the GPT family. As of June 2026 the visible flagship pricing on the developer pricing page lists GPT-5.5 at $5.00 input / $30.00 output per million tokens, GPT-5.4 at $2.50 / $15.00, GPT-5.4-mini at $0.75 / $4.50, and GPT-5.4-nano at $0.20 / $1.25. Cached input gets a 90% discount. Batch processing gets 50% off. Regional processing (data residency) carries a 10% uplift on models released on or after 5 March 2026. The Azure OpenAI Service is the Microsoft-hosted resale of the same model family, sold through Azure with Azure's compliance envelope (HIPAA-eligible under an Azure BAA). The pricing structure is similar but billed in Azure currency and with Azure region constraints; the canonical price reference is the Azure OpenAI Service pricing page. Treat the two surfaces as separate contracts: BAA via OpenAI directly for first-party use, BAA via Microsoft for Azure use.
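To see what a 90% cached-input discount does to the rate you actually pay, here is a minimal sketch. It assumes a known fraction of input tokens is served from cache and the rest is billed at the list rate. The function name and the 80% hit rate are hypothetical, and batch and regional adjustments are deliberately left out because this page does not say how they stack.

```python
def effective_input_rate(
    list_rate_per_mtok: float,
    cache_hit_fraction: float,
    cached_discount: float = 0.90,  # "Cached input gets a 90% discount"
) -> float:
    """Blend the list rate and the discounted cached rate by token share."""
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    uncached_share = 1.0 - cache_hit_fraction
    cached_share = cache_hit_fraction * (1.0 - cached_discount)
    return list_rate_per_mtok * (uncached_share + cached_share)


if __name__ == "__main__":
    # GPT-5.4 list input rate from the paragraph above; hit rate is an assumption.
    rate = effective_input_rate(2.50, 0.80)
    print(f"Effective input rate at 80% cache hits: ${rate:.2f}/MTok")
```

Under these assumptions the $2.50 list rate falls to about $0.70 per million input tokens, so the cache hit rate matters more to the bill than most model-tier choices.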

Google — AI Studio and Vertex AI

Google has two doors into the Gemini family. AI Studio is the developer surface: a free tier exists, but data submitted on it is used for product improvement, and paid-tier prices are listed per model. Vertex AI is the enterprise surface: GCP project, GCP IAM, GCP BAA. Vertex also resells Claude (with regional uplift) alongside Google's own Gemini line.

As of June 2026, public pricing for Gemini 2.5 Flash on AI Studio is $0.30 input / $2.50 output per million tokens (text/image/video) with audio input at $1.00. Gemini 3.5 Flash is $1.50 / $9.00. Gemini 2.5 Flash-Lite is $0.10 / $0.40. On Vertex, Gemini 2.5 Pro is $1.25 input (≤200k) / $10.00 output, with input doubling above the 200k context threshold. Batch API offers 50% off standard pricing.

For compliance, AI Studio is the wrong door for regulated workloads because free-tier data is used for product improvement. Vertex is the correct door, with GCP's standard HIPAA-eligible posture under a Google Cloud BAA.
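The 200k step on Gemini 2.5 Pro is easy to miss when sizing a long-context budget. The sketch below is a rough per-request estimator built on two assumptions that should be checked against the Vertex pricing page: that the doubled input rate applies to the whole prompt once it exceeds 200k tokens, and that the output rate stays at $10.00 (this page only states the input step). The function name is illustrative.

```python
INPUT_PER_MTOK = 1.25      # USD, prompts up to 200k tokens
OUTPUT_PER_MTOK = 10.00    # USD, held flat here (assumption)
CONTEXT_THRESHOLD = 200_000


def gemini_25_pro_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request under the stated assumptions."""
    # Assumption: the higher rate covers the entire prompt, not just the excess.
    if input_tokens <= CONTEXT_THRESHOLD:
        input_rate = INPUT_PER_MTOK
    else:
        input_rate = 2 * INPUT_PER_MTOK
    total = input_tokens * input_rate + output_tokens * OUTPUT_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    for prompt_size in (100_000, 300_000):
        cost = gemini_25_pro_request_cost(prompt_size, 1_000)
        print(f"{prompt_size:>7} input tokens: ${cost:.3f}")
```

Under these assumptions a prompt three times larger costs roughly six times as much on the input side, which is why trimming context to stay under the threshold can be worth the engineering effort.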

Third-party hosts of open-weight models

These providers host Llama, Qwen, DeepSeek, GPT-OSS, Gemma, and Mistral on their own hardware. The model checkpoint is the same across hosts; the price, latency, throughput, and compliance posture differ. Pricing below is illustrative — see the linked official pages in citations for full catalogs.

Groq

Llama 3.3 70B: $0.59 / $0.79 per MTok

LPU inference hardware. Optimized for very low time-to-first-token on open-weight LLMs. Public catalog includes Llama 3.x and Llama 4 variants, GPT-OSS 20B and 120B, Qwen3. Pricing is published per model. Free API key available with rate limits. Compliance certifications not publicly advertised on the pricing page — ask sales for the current posture.

Cerebras

Per-token rates on cerebras.ai/pricing

Wafer-scale inference. Hosts open-weight models including GPT-OSS-120B, Llama 4 Scout, and the GLM family. Tiered offering: free trial, self-serve Developer (add funds from $10), Enterprise. Per-token rates are listed on the pricing page rather than the inference landing page. Compliance posture is documented at the Cerebras trust center.

SambaNova Cloud

10 models on the pricing page

Reconfigurable dataflow hardware (RDU). Hosts DeepSeek (R1-Distill, V3.1, V3.2), Llama 3.3 70B, Llama 4 Maverick 17B, GPT-OSS-120B, Gemma 3 12B and Gemma 4 31B, and MiniMax-M2.7. Per-million-token pricing is published per model: entry tier around $0.15 / $0.75, premium DeepSeek V3.x around $3.00 / $4.50. Trial credits referenced.

Together AI

25+ chat models, full price list public

Broad catalog of open-weight chat, image, and embedding models on standard GPU infrastructure. Serverless inference starts around $0.10 / $0.10 per MTok (Llama 3 8B Instruct Lite tier) and runs up to ~$1.40 / $4.40 for premium-tier models like GLM-5.1. 'Start for free' messaging on the landing page; compliance not on the pricing page itself.

Fireworks AI

$1 starter credit; cache and batch built in

Open-weight LLM hosting with cache and batch discounts (50% off cached input and batch inference by default). Embeddings from $0.008/MTok for small models. Per-model text and vision prices live in their docs rather than the pricing landing page. $1 free credit on signup.

Replicate

Hybrid: GPU-second, per-token, per-output

Mixed model: most models bill by hardware-seconds (GPU time × duration), some LLMs bill by token, image and video models bill per output unit. Useful when you want pinned model versions and per-version reproducibility. Free tier and compliance certifications not stated on the public pricing page.

Hugging Face Inference Endpoints

$0.033/hr starting; SOC 2 for endpoints

Closer to managed model hosting than a token marketplace. You deploy any Hub model to a dedicated endpoint on AWS, Azure, or GCP hardware and pay per instance-hour. CPU instances from ~$0.033/hour. NVIDIA T4 from $0.50/hour. H100 from $4.50/hour. B200 from $9.25/hour. SOC 2 referenced for the Inference Endpoints product.

Modal

$30/mo free credits; SOC 2 (Starter), HIPAA (Enterprise)

Serverless GPU compute. You bring your own container; Modal handles cold starts, autoscaling, and per-second billing. A100 80GB at ~$0.000694/sec. H100 at ~$0.001097/sec. B200 at ~$0.001736/sec. Starter plan includes $30/month in free credits and SOC 2. Enterprise tier adds HIPAA and audit logs.
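GPU-time pricing is hard to compare with per-token pricing without a conversion. The sketch below turns the per-second rates quoted above into hourly rates, then into an effective $/MTok figure. The throughput and utilization values are hypothetical placeholders; real numbers depend on the model, batch size, and serving stack, so measure them on your own deployment.

```python
SECONDS_PER_HOUR = 3600

# Per-second list prices quoted in the Modal entry above.
MODAL_USD_PER_SECOND = {
    "A100 80GB": 0.000694,
    "H100": 0.001097,
    "B200": 0.001736,
}


def hourly_rate(usd_per_second: float) -> float:
    """Convert a per-second GPU price to a per-hour price."""
    return usd_per_second * SECONDS_PER_HOUR


def usd_per_mtok(hourly_usd: float, tokens_per_second: float,
                 utilization: float = 1.0) -> float:
    """Effective price per million tokens for a GPU billed by time.

    `utilization` is the share of billed time spent generating tokens.
    """
    if tokens_per_second <= 0 or not 0.0 < utilization <= 1.0:
        raise ValueError("need positive throughput and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * utilization * SECONDS_PER_HOUR
    return hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    for gpu, per_second in MODAL_USD_PER_SECOND.items():
        print(f"{gpu}: ${hourly_rate(per_second):.2f}/hr")
    # Hypothetical deployment: 1,000 tokens/s sustained, busy half the time.
    h100_hourly = hourly_rate(MODAL_USD_PER_SECOND["H100"])
    print(f"H100 effective: ${usd_per_mtok(h100_hourly, 1000, 0.5):.2f}/MTok")
```

The same two functions apply to any hourly GPU price on this page; only the throughput and utilization inputs change.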

Anyscale

$100 starter credit; pay-as-you-go GPU

Ray-native compute platform. Pay-as-you-go GPU instances: T4 ~$0.57/hr, A100 ~$4.96/hr, H100 ~$9.29/hr, H200 ~$10.68/hr. $100 in starter credits for new accounts. Anyscale-hosted endpoints product has evolved — confirm the current managed-endpoint offering before committing.

Lambda

Inference API winding down — GPU rental active

GPU cloud. As of mid-2026 Lambda's inference API product is being wound down; the on-demand GPU instance offering remains. If you were planning to use Lambda Inference API, verify current status before integrating; the GPU rental product is still active.

Cloud platforms — Bedrock, Vertex, Azure

The three hyperscalers wrap multiple labs' models under their own enterprise contract. The advantage is one paper of record: the same MSA, the same BAA, the same SSO, and the same VPC posture cover your usage of multiple model families.

AWS Bedrock hosts Anthropic Claude, Meta Llama, Mistral, Amazon Titan, AI21 Jurassic, Cohere Command, and Stability image models. Claude on Bedrock is the same Claude as the Claude API, billed via AWS Marketplace; the regional endpoint variants for Claude 4.5 and later carry a 10% premium over the global endpoint. Bedrock is HIPAA-eligible under an AWS BAA.

Google Vertex AI hosts Gemini (first-party), Claude (via the Anthropic relationship), Llama via Model Garden, and a long tail of open weights. Vertex is the enterprise door for HIPAA-bound Gemini workloads; AI Studio is not. The Gemini 2.5 Pro tier on Vertex is priced at $1.25 input / $10.00 output per million tokens with a step up above 200k context.

Azure OpenAI Service is the Microsoft-resold form of OpenAI's GPT family with Azure's compliance envelope. Use it when you are already an Azure customer with an existing BAA; pricing maps closely to OpenAI's first-party rates but is governed by Azure region availability and Azure currency.

OpenRouter — the router layer

OpenRouter is not a host. It is a unified endpoint that routes your request to whichever provider serves the model you ask for, with automatic fallbacks if a primary provider degrades. The economics: free plan gives access to 25+ free models and 50 requests per day with no credit card. Pay-as-you-go adds a 5.5% platform fee on top of the underlying provider's rate. OpenRouter explicitly does not mark up provider pricing — what you see in the model catalog is the provider's actual rate. This makes OpenRouter useful for two things: cross-provider price discovery without holding accounts at all of them, and graceful fallback when one provider has an outage. The trade-off is that the compliance posture inherits from whichever underlying provider serves your request, so for regulated workloads you still need to pin to providers whose BAA you trust.
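The pay-as-you-go economics reduce to one multiplication. This sketch assumes the 5.5% platform fee scales proportionally with what you spend at the provider's listed rate; how and when the fee is actually collected should be confirmed on OpenRouter's pricing page. The workload size is made up, and the rates are the Groq Llama 3.3 70B figures from the index above.

```python
PLATFORM_FEE = 0.055  # 5.5% pay-as-you-go fee quoted above


def provider_cost_usd(input_mtok: float, output_mtok: float,
                      input_rate: float, output_rate: float) -> float:
    """Cost at the provider's listed per-MTok rates."""
    return input_mtok * input_rate + output_mtok * output_rate


def openrouter_cost_usd(provider_cost: float) -> float:
    """Provider cost plus the platform fee (assumed proportional to spend)."""
    return provider_cost * (1 + PLATFORM_FEE)


if __name__ == "__main__":
    # Hypothetical month: 10 MTok in, 2 MTok out on Llama 3.3 70B via Groq.
    direct = provider_cost_usd(10, 2, 0.59, 0.79)
    routed = openrouter_cost_usd(direct)
    print(f"direct ${direct:.2f}, routed ${routed:.2f}, fee ${routed - direct:.2f}")
```

At this scale the fee is well under a dollar, so the decision to route or go direct usually turns on fallback behavior and compliance pinning rather than on the fee itself.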

Perplexity API — search-augmented inference

Perplexity is structurally different from the rest of this list. Their Sonar family of API models is search-augmented — the model executes web queries against Perplexity's search infrastructure as part of generating a response, and you are billed for both the token usage and the requests. Sonar is $1.00 input / $1.00 output per million tokens. Sonar Pro is $3.00 / $15.00. Sonar Reasoning Pro and Sonar Deep Research are $2.00 / $8.00 with additional citation and reasoning token charges. Per-request fees apply on top of token pricing — between $5 and $14 per 1,000 requests depending on search context size and model. Their Agent API also brokers access to third-party models from OpenAI, Anthropic, Google, and xAI, which is useful when the value of the call is the search rather than the underlying base model. Use Perplexity when retrieval is the core of the workload; use a generic first-party API and your own retrieval stack when you need fine-grained control over the search step.
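Because Sonar bills on two axes, a token-only estimate understates the bill. The sketch below adds the per-request fee to the token cost. The request volume, token counts, and the $10 per 1,000 requests figure are hypothetical values inside the $5 to $14 range quoted above, and the extra citation and reasoning token charges on the reasoning models are not modeled.

```python
def sonar_cost_usd(
    requests: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    input_per_mtok: float,
    output_per_mtok: float,
    request_fee_per_1k: float,
) -> float:
    """Token charges plus per-request search fees, in USD."""
    per_request_tokens_usd = (
        input_tokens_per_request * input_per_mtok
        + output_tokens_per_request * output_per_mtok
    ) / 1_000_000
    token_cost = requests * per_request_tokens_usd
    request_cost = requests / 1_000 * request_fee_per_1k
    return token_cost + request_cost


if __name__ == "__main__":
    # Sonar Pro rates from the paragraph above; workload shape is an assumption.
    total = sonar_cost_usd(10_000, 2_000, 500, 3.00, 15.00, 10.00)
    print(f"Estimated Sonar Pro cost for 10,000 requests: ${total:.2f}")
```

In this example the request fee is $100 of a $235 total, so for short prompts the search fee can rival the token charges.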

How to choose — a minimum-effective-dose rule

Most teams overspend on inference by defaulting to the most expensive flagship for every call. The dominant low-cost strategy is to route most traffic to a cheap tier and reserve the flagship for the small fraction of calls that actually need it. A working heuristic:

Default to the smallest model on the smallest budget that passes your eval. Haiku 4.5 ($1/$5), Gemini 2.5 Flash-Lite ($0.10/$0.40), or a small Llama variant on Groq ($0.05/$0.08) is the right first stop, not Opus.
Only escalate to flagship (Opus, GPT-5.5, Gemini 3.1 Pro) when the smaller tier visibly fails on real tasks, and only for the calls that need it. Route, don't replace.
Turn on prompt caching the moment your system prompt or shared context exceeds a few hundred tokens. Cache reads at 0.1x base input price are the single largest cost lever on long-context workloads.
Use batch APIs (50% off on Anthropic, OpenAI, Google, and Fireworks, per the sections above) for anything that can tolerate asynchronous turnaround: bulk classification, eval grading, content backfills.
For regulated workloads, pick the cloud envelope first, then the model. AWS BAA + Bedrock + Claude is one paper; Anthropic enterprise HIPAA-ready is another; Azure BAA + Azure OpenAI is a third. Pick one and don't sprawl.
For raw cost discovery across the open-weight ecosystem, route through OpenRouter for a quarter and read the per-model usage report. You'll learn which hosts your workload actually likes before signing direct contracts.
Verify pricing the week you sign a contract. Every number on this page is dated June 2026 and was pulled from official sources; this market reprices on a quarterly cadence.
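The "route, don't replace" rule can be made concrete with a blended-rate calculation. The sketch below uses the Haiku 4.5 and Opus 4.7 list prices from the index and assumes a fixed share of traffic is escalated to the flagship. The 10% escalation share and the monthly token volumes are hypothetical; the right share comes from your own eval results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_per_mtok: float
    output_per_mtok: float


# List prices from the index above.
HAIKU = Tier("Claude Haiku 4.5", 1.00, 5.00)
OPUS = Tier("Claude Opus 4.7", 5.00, 25.00)


def blended_rates(cheap: Tier, flagship: Tier, flagship_share: float) -> tuple[float, float]:
    """Average input/output $/MTok when `flagship_share` of tokens escalate."""
    if not 0.0 <= flagship_share <= 1.0:
        raise ValueError("flagship_share must be between 0 and 1")
    cheap_share = 1.0 - flagship_share
    blended_in = cheap_share * cheap.input_per_mtok + flagship_share * flagship.input_per_mtok
    blended_out = cheap_share * cheap.output_per_mtok + flagship_share * flagship.output_per_mtok
    return blended_in, blended_out


def monthly_cost(input_mtok: float, output_mtok: float, rates: tuple[float, float]) -> float:
    """Total USD for a month of traffic at the given (input, output) rates."""
    return input_mtok * rates[0] + output_mtok * rates[1]


if __name__ == "__main__":
    # Hypothetical month: 500 MTok in, 100 MTok out, 10% of traffic escalated.
    routed = monthly_cost(500, 100, blended_rates(HAIKU, OPUS, 0.10))
    all_flagship = monthly_cost(500, 100, blended_rates(HAIKU, OPUS, 1.0))
    print(f"routed ${routed:,.0f} vs all-flagship ${all_flagship:,.0f}")
```

With a 10% escalation share the blended rates come to about $1.40 / $7.00 per MTok against $5 / $25 for sending everything to the flagship.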

What this page does not promise

Three honest caveats.

First: BAA / HIPAA / SOC 2 status is more subtle than a yes/no column can capture. Several providers (Together, Fireworks, Groq, SambaNova, Replicate) do not display compliance certifications on their public pricing pages even when they hold them; you have to ask sales. Treat 'not publicly advertised' as 'unknown without a sales call,' not as 'does not exist.'

Second: this page does not list every available model at every provider. The catalogs are too large and they change weekly. The point of the index is to show the structure of the market and let you go to the canonical pricing page for the specific model you need.

Third: latency, throughput, time-to-first-token, and reliability matter as much as $/MTok and are not captured here. Two providers can serve the same Llama 4 checkpoint at the same price and one of them will be three times faster in practice. Benchmark on your actual workload before you commit.

Sources

[01]
Claude API per-token pricing for Opus 4.5–4.7 ($5/$25), Sonnet 4.5–4.6 ($3/$15), and Haiku 4.5 ($1/$5), plus caching and batch multipliers.
https://platform.claude.com/docs/en/about-claude/pricing ↗
[02]
Anthropic's public pricing landing page, source of truth for Claude API rates and enterprise HIPAA-ready offering.
https://claude.com/pricing ↗
[03]
OpenAI API per-token pricing for GPT-5.5, 5.4, 5.4-mini, 5.4-nano, cached input discount, batch and regional uplift policy.
https://developers.openai.com/api/docs/pricing ↗
[04]
Google AI Studio pricing for Gemini 2.5 Flash, 2.5 Flash-Lite, 3.1 Flash-Lite, and 3.5 Flash, plus free tier data-use terms.
https://ai.google.dev/pricing ↗
[05]
Vertex AI Gemini pricing tiers including 2.5 Pro ($1.25/$10.00, step up above 200k) and 3.x Flash variants.
https://cloud.google.com/vertex-ai/generative-ai/pricing ↗
[06]
AWS Bedrock foundation model pricing across Anthropic Claude, Meta Llama, Mistral, and Amazon Titan with batch discount notes.
https://aws.amazon.com/bedrock/pricing/ ↗
[07]
Together AI serverless inference pricing range from $0.10/$0.10 (Llama 3 8B Lite tier) to $1.40/$4.40 (GLM premium tier).
https://www.together.ai/pricing ↗
[08]
Groq per-model pricing for Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout, GPT-OSS 20B and 120B, Qwen3 32B.
https://groq.com/pricing ↗
[09]
Fireworks pricing structure with 50% cache and batch discounts and embedding tier pricing; $1 free credit on signup.
https://fireworks.ai/pricing ↗
[10]
SambaNova Cloud catalog of 10 open-weight models with separate input/output per-million-token pricing per model.
https://cloud.sambanova.ai/pricing ↗
[11]
Cerebras inference tiers (free trial, Developer self-serve from $10, Enterprise) and hosted models including GPT-OSS-120B and Llama 4 Scout.
https://www.cerebras.ai/inference/ ↗
[12]
Replicate's hybrid pricing — hardware-time, per-token for LLMs, per-output for image/video models.
https://replicate.com/pricing ↗
[13]
OpenRouter free plan (25+ models, 50 req/day), pay-as-you-go with 5.5% platform fee, and explicit no-markup policy on provider rates.
https://openrouter.ai/pricing ↗
[14]
Perplexity Sonar pricing — Sonar ($1/$1), Sonar Pro ($3/$15), Sonar Reasoning Pro and Deep Research ($2/$8 + extras), plus per-request fees.
https://docs.perplexity.ai/guides/pricing ↗
[15]
Hugging Face Inference Endpoints hourly GPU rates (T4 ~$0.50/hr through B200 ~$9.25/hr) and SOC 2 reference.
https://huggingface.co/pricing ↗
[16]
Modal per-GPU-second pricing (A100 80GB ~$0.000694, H100 ~$0.001097, B200 ~$0.001736), $30/month Starter credits, SOC 2 and HIPAA-on-Enterprise.
https://modal.com/pricing ↗
[17]
Anyscale per-GPU-hour pricing (T4 ~$0.57, A100 ~$4.96, H100 ~$9.29, H200 ~$10.68) with $100 starter credits.
https://www.anyscale.com/pricing ↗
[18]
Lambda's Inference API is being wound down; the GPU instance rental product remains active.
https://lambda.ai/inference ↗
[19]
Azure OpenAI Service hosts the OpenAI GPT family under Microsoft's Azure compliance envelope (HIPAA-eligible with Azure BAA).
https://azure.microsoft.com/en-us/products/ai-services/openai-service/ ↗
[20]
AWS HIPAA eligibility and BAA framework, which governs Bedrock usage for regulated workloads.
https://aws.amazon.com/compliance/hipaa-compliance/ ↗
