From laptop notebook to global inference: the boring infrastructure that makes AI real

MLOps: How models actually live in production

Training a model is the easy part. Keeping it serving millions of users without melting GPUs or hallucinating into a lawsuit is MLOps.

The lifecycle, in plain English

A model's life has roughly five stages, and MLOps tooling exists for each one.

**Experimentation.** Someone trains a model on a laptop or a small cluster, tries different hyperparameters, and tracks what works. Weights & Biases (commonly called "wandb") and MLflow are among the most widely used tools here. They log the details of every training run (loss curves, GPU utilization, sample outputs, the exact git commit, the random seed) so that six months later, when someone asks "why does the v3 model behave differently from v2," you can answer. Without this, ML teams drown in irreproducible results.

**Model registry.** Once a model works, it gets versioned and stored somewhere durable with metadata: what data trained it, what evaluation scores it has, who approved it, what its known failure modes are. Hugging Face Hub is the closest thing to a public registry. Internal registries often live in wandb, MLflow, or Vertex AI's model registry. The point is that "the model" stops being a file on someone's disk and becomes an artifact with a lineage.

**Serving.** This is where the rubber meets the GPU. You need to take that registered model and expose it as an API that handles concurrent requests, batches them efficiently, manages memory across requests, and doesn't crash when traffic spikes.

**Monitoring.** Once live, the model needs eyes on it constantly. That means the usual operational metrics (latency, throughput, error rates, cost per request) but also ML-specific ones: are inputs starting to look different from the training data? Are outputs starting to drift in unexpected ways?

**Retraining and rollback.** Every model eventually goes stale. You retrain on fresh data, A/B test the new version against the old, and either promote it or roll back.
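To make the experimentation stage concrete, here is a minimal sketch of what a tracking tool records per run. It uses only the standard library and is not the wandb or MLflow API; `RunRecord`, `fake_training_loop`, the commit hash, and the hyperparameter values are all hypothetical names and data invented for illustration.

```python
import hashlib
import json
import random
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """One training run, with enough metadata to reproduce it later (hypothetical)."""
    run_name: str
    git_commit: str
    seed: int
    hyperparams: dict
    metrics: dict = field(default_factory=dict)

    def log(self, step: int, **values: float) -> None:
        # Append each metric value under its name, tagged with the training step.
        for name, value in values.items():
            self.metrics.setdefault(name, []).append((step, value))

    def fingerprint(self) -> str:
        # Hash only the inputs that determine the run, not its name or results.
        payload = json.dumps(
            {"commit": self.git_commit, "seed": self.seed, "hp": self.hyperparams},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


def fake_training_loop(record: RunRecord, steps: int = 5) -> None:
    # Stand-in for real training: a noisy, decreasing "loss" driven by the seed.
    rng = random.Random(record.seed)
    for step in range(steps):
        record.log(step, loss=1.0 / (step + 1) + rng.uniform(0.0, 0.01))


# Two made-up runs that differ only in learning rate.
v2 = RunRecord("v2", git_commit="abc1234", seed=0, hyperparams={"lr": 3e-4})
v3 = RunRecord("v3", git_commit="abc1234", seed=0, hyperparams={"lr": 1e-4})
fake_training_loop(v2)
fake_training_loop(v3)

# Answering "why does v3 differ from v2" starts with diffing the recorded inputs.
changed = {
    key: (v2.hyperparams[key], v3.hyperparams.get(key))
    for key in v2.hyperparams
    if v2.hyperparams[key] != v3.hyperparams.get(key)
}
print("changed hyperparameters:", changed)
print("fingerprints:", v2.fingerprint(), v3.fingerprint())
```

Real trackers add storage, dashboards, and artifact uploads on top, but the core idea is the same: the run's inputs are captured at training time so they can be compared later.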

Inference serving — the hot center of modern MLOps

For large language models specifically, three open-source inference engines dominate.

**vLLM** came out of Berkeley in 2023 and introduced "PagedAttention," a technique that manages the KV cache (the memory storing intermediate attention state during generation) the way operating systems manage virtual memory: in pages that can be reused, swapped, and shared across requests. The result was a dramatic throughput improvement. The vLLM paper reports 2-4x over the best serving systems of the time at comparable latency, with larger gains over naive implementations. vLLM is now a common default inference engine for self-hosted LLM deployments.

**TGI** (Text Generation Inference) is Hugging Face's serving stack. It supports continuous batching, quantization, and streaming, and integrates tightly with the Hugging Face model ecosystem.

**llama.cpp** is the lightweight champion, written in C++ with minimal dependencies and designed to run quantized LLMs on consumer hardware, including Apple Silicon. It's why someone with a MacBook Pro and enough unified memory can run a quantized 70B-parameter model at home. The GGUF file format that llama.cpp uses has become a de facto standard for quantized model distribution.

For teams that would rather not run their own serving stack, the big cloud providers offer managed inference (Amazon Bedrock, Google Vertex AI, Azure OpenAI Service), but a new class of specialist inference platforms has also emerged.
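The paging idea behind PagedAttention can be illustrated with a toy block allocator. This is a sketch under simplifying assumptions, not vLLM's actual data structures or API: `PagedKVCache` and its methods are hypothetical, blocks hold no real tensors, and running out of blocks simply raises instead of swapping or preempting.

```python
class PagedKVCache:
    """Toy illustration of paged KV-cache bookkeeping (not vLLM's implementation)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens per physical block
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = {}                    # physical block id -> number of users
        self.block_tables = {}                  # request id -> list of physical block ids
        self.num_tokens = {}                    # request id -> tokens stored so far

    def _allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks; a real engine would preempt or swap")
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count % self.block_size == 0:
            # Last block is full (or there is none yet): take a fresh block.
            table.append(self._allocate())
        elif self.ref_counts[table[-1]] > 1:
            # Last block is partially filled and shared: copy-on-write.
            self.ref_counts[table[-1]] -= 1
            table[-1] = self._allocate()
        self.num_tokens[request_id] = count + 1

    def fork(self, parent_id: str, child_id: str) -> None:
        # Share the parent's blocks (for example a common prompt) instead of copying.
        table = list(self.block_tables[parent_id])
        for block in table:
            self.ref_counts[block] += 1
        self.block_tables[child_id] = table
        self.num_tokens[child_id] = self.num_tokens[parent_id]

    def free(self, request_id: str) -> None:
        # Return blocks to the pool once no request references them.
        for block in self.block_tables.pop(request_id):
            self.ref_counts[block] -= 1
            if self.ref_counts[block] == 0:
                del self.ref_counts[block]
                self.free_blocks.append(block)
        del self.num_tokens[request_id]


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):                      # request "a" stores 6 tokens in 2 blocks
    cache.append_token("a")
cache.fork("a", "b")                    # "b" shares both blocks, no new memory used
free_after_fork = len(cache.free_blocks)
cache.append_token("b")                 # "b" diverges: only its last block is copied
table_a = list(cache.block_tables["a"])
table_b = list(cache.block_tables["b"])
cache.free("a")
cache.free("b")
print(table_a, table_b, free_after_fork, len(cache.free_blocks))
```

Because memory is handed out one fixed-size block at a time, a request never reserves space for tokens it has not generated yet, and two requests with a common prefix keep a single copy of it until they diverge.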

The new inference stack: Modal, Replicate, Anyscale, Together

These companies exist because rolling your own GPU infrastructure is genuinely hard, while the big-three clouds are clunky for ML workloads.

**Modal** lets you write a Python function, decorate it with GPU requirements, and have it autoscale across a fleet of GPUs with cold-start times measured in seconds rather than minutes. It's particularly popular for batch inference and irregular workloads.

**Replicate** focuses on running pre-trained models as APIs with a per-second billing model. Its open-source Cog tool popularized packaging ML models as reproducible containers.

**Anyscale** is the commercial home of Ray, the distributed computing framework that has reportedly been used in OpenAI's training infrastructure. Their Anyscale Endpoints product offers managed vLLM-style serving.

**Together AI** specializes in fast, cheap inference for popular open models, often beating the model creators on price by aggressively optimizing the serving stack.

**Weights & Biases** sits above all of these as the experiment-tracking and model-registry layer that many serious teams use regardless of where the model actually runs.

Drift, the silent killer

The most distinctive MLOps problem is drift. A search ranking model trained on 2024 query patterns will silently degrade as user behavior shifts in 2025. The model still returns answers. Latency still looks fine. But the answers slowly get worse, and by the time anyone notices, business metrics are already bleeding.

Drift comes in two flavors. **Data drift** is when the distribution of inputs changes: new slang, new product categories, new languages your training set didn't cover. **Concept drift** is when the relationship between inputs and correct outputs changes: what counted as spam in 2020 is not what counts as spam now.

Detection tools like Evidently, Arize, and Fiddler watch distributions of inputs and outputs over time and alert when they shift outside thresholds. For LLMs specifically, the harder problem is detecting quality drift in generated text. There is no clean numerical "is this response good" signal, so teams resort to LLM-as-judge evaluation, human review samples, and proxy metrics like user thumbs-down rates.
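One common way to put a number on data drift for a single numeric feature is the Population Stability Index (PSI), which compares binned distributions of a reference sample and a live sample. The sketch below is a plain standard-library version, not the API of Evidently, Arize, or Fiddler; the synthetic Gaussian data and the 0.25 alert threshold (a widely quoted rule of thumb) are assumptions for illustration. It covers data drift only; concept drift needs labels or quality signals.

```python
import math
import random


def psi(reference, current, num_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / num_bins or 1.0   # fall back to 1.0 if the reference is constant

    def bin_fractions(sample):
        counts = [0] * num_bins
        for x in sample:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((x - lo) / width), 0), num_bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(sample), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff, tune per feature

rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # synthetic "training" inputs
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]       # live traffic, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]    # live traffic, mean has moved

psi_same = psi(training, same)
psi_shifted = psi(training, shifted)
print(f"no drift: {psi_same:.4f}  drift: {psi_shifted:.4f}")
print("alert" if psi_shifted > ALERT_THRESHOLD else "ok")
```

In practice a check like this runs per feature on a schedule, with the training set (or a recent healthy window) as the reference.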

A/B testing and progressive rollout

You never just replace a model. You deploy the new version alongside the old, route a small percentage of traffic to it (often starting around 1%), watch the metrics, and ramp up only if everything stays green. If anything regresses, whether quality, latency, cost, or error rate, you roll back immediately. Feature-flag systems like LaunchDarkly and Statsig got pulled into ML workflows for exactly this reason.
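The rollout logic described above can be sketched in a few lines. This is an illustrative stand-in, not how LaunchDarkly or Statsig work internally: the function names, the salt string, the ramp schedule, the 5% regression tolerance, and the metric values are all assumptions, and the guard only handles "lower is better" metrics such as latency and error rate.

```python
import hashlib


def bucket(user_id: str, salt: str = "ranker-v3-rollout") -> float:
    """Map a user id to a stable number in [0, 100) so assignment is sticky."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100


def choose_model(user_id: str, rollout_percent: float) -> str:
    # Users whose bucket falls below the rollout percentage get the new model.
    return "candidate" if bucket(user_id) < rollout_percent else "baseline"


def next_rollout_percent(current, baseline, candidate,
                         max_regression=0.05, ramp=(1, 5, 25, 50, 100)):
    """Advance one ramp step if the candidate is healthy, otherwise roll back to 0."""
    for metric, base_value in baseline.items():
        # Every metric here is "lower is better" (latency, error rate, cost).
        if candidate[metric] > base_value * (1 + max_regression):
            return 0
    later_steps = [p for p in ramp if p > current]
    return later_steps[0] if later_steps else current


users = [f"user-{i}" for i in range(10_000)]
share_at_1 = sum(choose_model(u, 1) == "candidate" for u in users) / len(users)
print(f"share routed to candidate at 1%: {share_at_1:.3%}")

# Hypothetical metrics observed during the 1% stage.
baseline = {"p95_latency_ms": 200.0, "error_rate": 0.010}
healthy = {"p95_latency_ms": 204.0, "error_rate": 0.010}
regressed = {"p95_latency_ms": 260.0, "error_rate": 0.010}
print(next_rollout_percent(1, baseline, healthy))
print(next_rollout_percent(1, baseline, regressed))
```

Hashing the user id rather than sampling randomly per request keeps each user on one model version, and it means anyone in the 1% group stays in the candidate group as the percentage ramps up.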

The honest summary

MLOps is the part of AI that doesn't get written up in research papers. It's deployment scripts, Kubernetes manifests, Grafana dashboards, on-call rotations at 3am when inference latency spikes, and the slow grind of keeping a model honest as the world around it changes. The labs that ship reliable AI products (OpenAI, Anthropic, Google DeepMind) invest heavily in this work, not just in the modeling itself. The models are the engine. MLOps is everything else that lets the car drive.
