A black sphere, a black cube, and a black cylinder on dark slate — three modalities, one form factor.

Diffusion · the atlas

How image + video + audio actually get generated.

Every press photo on atomeons.com was generated by a diffusion-lineage model (Nano Banana Pro). This page walks through the underlying mechanics: forward noise process → learned reverse process → conditioning → latent diffusion → classifier-free guidance → flow matching. It then covers the 2026 ecosystem of who makes what, and the cost and quality answers that matter for builders.

How they actually work

Six mechanisms, one model family.

Forward process

Take a clean image. Add a tiny bit of Gaussian noise. Repeat 1,000+ times. After enough steps the image becomes indistinguishable from pure noise. This is the forward (corruption) process: the noise itself is random, but the schedule is fixed and nothing is learned; it just defines a path from data to noise.
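A minimal sketch of the forward process on a toy 4-value "image", using only the Python standard library. The linear schedule and its endpoints (`beta_start`, `beta_end`) are assumed DDPM-style defaults, not values taken from any model named on this page, and the function names are illustrative. It relies on the standard shortcut that many small Gaussian steps collapse into one closed-form jump to step t.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed DDPM-style defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha_bar(betas):
    """alpha_bar[t] = product of (1 - beta) over steps 0..t: the signal variance left."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_to_step(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    return [signal * x + sigma * e for x, e in zip(x0, eps)], eps

betas = linear_beta_schedule()
alpha_bars = cumulative_alpha_bar(betas)
rng = random.Random(0)
pixels = [0.8, -0.2, 0.5, -0.9]  # toy 4-pixel "image" scaled to [-1, 1]
x_mid, _ = noise_to_step(pixels, 250, alpha_bars, rng)
x_end, _ = noise_to_step(pixels, 999, alpha_bars, rng)
print("signal scale at t=250:", math.sqrt(alpha_bars[250]))
print("signal scale at t=999:", math.sqrt(alpha_bars[999]))
```

The signal scale shrinks toward zero as t grows, which is the sense in which the final step is indistinguishable from pure noise.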

Reverse process (the learned part)

Train a neural network to predict what noise was added at each step. At inference, you start with pure noise and iteratively call the model to subtract its predicted noise — taking you back through the forward path in reverse, from noise to image. This reverse network is the model.
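The sampling loop can be sketched in a few lines, under a loud assumption: there is no trained network here. `oracle_noise_predictor` is a hypothetical stand-in that is exact only because the toy "dataset" is a single known number, so the added noise can be solved for directly. The update rule is the standard DDPM reverse step; the schedule values are assumed defaults.

```python
import math
import random

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule plus its cumulative signal fractions."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def oracle_noise_predictor(x_t, t, alpha_bars, data_point):
    """Hypothetical stand-in for the trained network (exact for one-point data)."""
    return (x_t - math.sqrt(alpha_bars[t]) * data_point) / math.sqrt(1.0 - alpha_bars[t])

def sample(data_point, num_steps=1000, seed=0):
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise_predictor(x, t, alpha_bars, data_point)
        # Subtract this step's share of the predicted noise (DDPM posterior mean).
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(1.0 - betas[t])
        # Re-inject a little fresh noise on every step except the last.
        fresh = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * fresh
    return x

print(sample(data_point=0.7))
```

In a real model the oracle is replaced by a network trained to regress the noise returned by the forward process, and `x` is an image-shaped tensor rather than a single float.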

Conditioning

Modern diffusion models are conditional — you don't just sample any image; you sample an image given a text prompt (or another image, or a depth map, or a sketch, etc.). The conditioning signal is passed into the reverse network at every step, biasing the noise prediction toward outputs matching the prompt.

Latent diffusion (Rombach et al. 2021, the unlock)

Don't diffuse in pixel space (megabytes per image). Diffuse in a small latent space encoded by a VAE: in Stable Diffusion the latent is 8× smaller along each spatial dimension, so ~64× fewer spatial positions. That makes high-resolution generation tractable on consumer GPUs. Stable Diffusion was the first widely deployed latent diffusion model.
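To make the size reduction concrete, here is the arithmetic for a 512×512 RGB image, assuming a Stable Diffusion 1.x-style VAE (8× downsampling per side, 4 latent channels). Other latent models use different factors, so treat these numbers as one worked case.

```python
def tensor_size(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

image_hw, image_channels = 512, 3
downsample, latent_channels = 8, 4  # assumption: SD 1.x-style VAE

pixel_values = tensor_size(image_hw, image_hw, image_channels)      # 786432
latent_hw = image_hw // downsample                                  # 64
latent_values = tensor_size(latent_hw, latent_hw, latent_channels)  # 16384

spatial_reduction = (image_hw * image_hw) // (latent_hw * latent_hw)  # 64
value_reduction = pixel_values / latent_values                        # 48.0

print(f"{spatial_reduction}x fewer spatial positions, {value_reduction:.0f}x fewer values")
```

The denoising network runs once per sampling step, so shrinking its input this much applies to every step of generation, not just once.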

Classifier-free guidance (Ho + Salimans 2021)

At inference, run the model twice — once with the prompt and once unconditional. Steer the output toward the prompt by linearly amplifying the difference. The 'guidance scale' parameter every diffusion UI exposes is this. High guidance = more prompt-faithful but lower diversity.
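The guidance step itself is one line of arithmetic. A minimal sketch on plain Python lists follows; the function name is illustrative, and real pipelines apply the same formula to tensors at every sampling step.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Push the noise prediction away from the unconditional one, toward the prompt.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the plain
    conditional prediction, and values above 1 extrapolate past it.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 0.5]  # prediction with an empty prompt
eps_cond = [1.0, 1.0]    # prediction with the text prompt
print(classifier_free_guidance(eps_uncond, eps_cond, 7.5))
```

Because the result is an extrapolation, large scales can push predictions outside the range the model saw in training, which is one reason very high guidance tends to look oversaturated as well as less diverse.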

Flow matching (Lipman et al. 2022, the 2024 successor)

Reformulate diffusion as learning a continuous velocity field that transports noise to data. Mathematically related to denoising diffusion, but with a cleaner training objective and typically faster sampling (fewer steps). Stable Diffusion 3 + Flux are flow-matching-based, and the approach underpins much of the current state of the art.
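A toy sketch of the idea, assuming the straight-line (rectified-flow style) path with noise at t=0 and data at t=1. As with any toy, there is no trained network: `oracle_velocity` is a hypothetical stand-in that is exact only for a single-point "dataset". Training would regress a network onto `target_velocity` at random times t; sampling integrates the learned field with a few Euler steps.

```python
import random

def interpolate(noise, data, t):
    """Straight-line path: pure noise at t=0, data at t=1."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Training regression target: constant along the straight path."""
    return [d - n for n, d in zip(noise, data)]

def oracle_velocity(x, t, data):
    """Hypothetical stand-in for the trained network (exact for one-point data)."""
    return [(d - xi) / (1.0 - t) for xi, d in zip(x, data)]

def euler_sample(data, num_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in data]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = oracle_velocity(x, t, data)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

print(euler_sample([0.8, -0.2, 0.5]))
```

Straight paths are what allow coarse step counts: the straighter the learned trajectories, the less error each Euler step introduces.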

The 2026 ecosystem

Eight systems shipping right now.

Stable Diffusion lineage (Stability AI + community)

Open weights · Hugging Face · ComfyUI / Auto1111

SD 1.5 (2022, latent diffusion), SDXL (2023, larger), SD 3 (2024, flow matching, MMDiT). Open-weights. The reason there's a vibrant local-image-generation community on consumer hardware. ComfyUI + Automatic1111 + Forge are the UI ecosystem.

Flux (Black Forest Labs · 2024+)

Open weights for dev/schnell · API for pro

Founded by several of Stable Diffusion's original authors. Flux.1-dev (non-commercial dev license), Flux.1-schnell (Apache 2.0), Flux.1-pro (API only). Among the best open-weights image quality of 2024-2025. Heavy flow-matching architecture (MMDiT-style transformer, 12B params).

DALL-E 3 (OpenAI · 2023)

API only · ChatGPT consumer

OpenAI's third-generation image model. Closed weights; available through ChatGPT + API. Strong prompt adherence + text rendering. Safety filters can be more restrictive than with open-weights options.

Imagen 4 family (Google · 2025)

API only · Google AI Studio + Vertex

Three variants: Imagen 4 (standard), Imagen 4 Fast (lower latency), Imagen 4 Ultra (highest quality). Google AI Studio + Vertex AI access. Strong on text-in-image rendering. Note: different model than Nano Banana Pro below.

Nano Banana Pro · Gemini 3 Pro Image (Google · 2025)

API only · Google AI Studio · generativelanguage.googleapis.com

Google's multimodal Gemini-family image model — used as the image generation engine on atomeons.com. Different lineage from Imagen 4: Nano Banana Pro is the image branch of the Gemini transformer family, not a dedicated diffusion model. Strong on prompt adherence + brand-consistent style.

Sora (OpenAI · video · 2024)

ChatGPT consumer · API limited

OpenAI's video diffusion model. Multi-second video clips from text. Announced February 2024; released as a ChatGPT consumer feature in December 2024. Quality leader for short-clip text-to-video at announcement. Heavy compute per generation.

Veo 2 / Veo 3 (Google · video · 2024+)

API · Google Vertex AI

Google DeepMind's video model. Multi-second clips from text. Strong on physical-world coherence. Available through Google Vertex AI. Often paired with Imagen 4 for stills + Veo for motion.

MusicLM / MusicGen / Suno / Udio (audio · 2023+)

Mostly API · some open-weights (Stable Audio, MusicGen)

Audio generation is a mixed model family: MusicLM and MusicGen are autoregressive token models, while Stable Audio is latent diffusion. Suno + Udio are the consumer-facing music generators. Stable Audio (Stability AI) and MusicGen (Meta) are the open-weights alternatives. Less attention than image/video in 2025-2026 but actively shipping.

Practical Q+A

Four questions builders actually ask.

Which diffusion model do I use for what?

Brand-consistent product photography → Nano Banana Pro / Imagen 4. Creative illustration → Flux.1-pro or Midjourney. Open-weights local generation → Flux.1-schnell (Apache) or SDXL. Video clips → Sora or Veo 3. Music → Suno (consumer) / Udio (consumer) / Stable Audio (open).

Why does the same prompt produce different images on different models?

Different training data, different conditioning encoders (CLIP vs T5 vs Gemini), different architecture (U-Net vs MMDiT), different sampling schedulers, different guidance scales, different safety filters. Two models with the 'same' prompt can produce wildly different outputs.

What's the cost difference?

Open-weights local inference: ~$0.001-$0.01 per image on a consumer GPU (electricity cost only). API generation: $0.04-$0.40+ per image depending on model + resolution + provider. Video and music generation costs are 10-100× higher per second of output.
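A quick back-of-envelope comparison using the per-image ranges quoted above. The monthly volume is a made-up example, the API upper bound is treated as $0.40 even though the text says "$0.40+", and hardware purchase cost is ignored, as it is in the electricity-only figure.

```python
LOCAL_PER_IMAGE = (0.001, 0.01)  # USD, electricity only (range from the text above)
API_PER_IMAGE = (0.04, 0.40)     # USD, upper bound is open-ended in the text

def monthly_cost_range(images_per_month, per_image_range):
    """Return (low, high) monthly spend in USD for a given image volume."""
    low, high = per_image_range
    return images_per_month * low, images_per_month * high

volume = 10_000  # hypothetical monthly volume
local_low, local_high = monthly_cost_range(volume, LOCAL_PER_IMAGE)
api_low, api_high = monthly_cost_range(volume, API_PER_IMAGE)
print(f"local: ${local_low:,.0f} to ${local_high:,.0f} per month")
print(f"API:   ${api_low:,.0f} to ${api_high:,.0f} per month")
```

At low volumes the absolute gap is small enough that API convenience usually wins; the gap only starts to justify dedicated hardware as volume grows.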

What's the realistic quality floor in 2026?

Flux.1-pro and Nano Banana Pro both produce images that pass casual inspection as photography for most subjects. Failure modes are still concentrated around hands, faces (uncanny-valley risk), readable text, and physical plausibility. Physics errors are more pronounced in video than in stills.
