What is it about?

LLM inference pricing treats every token as if it costs the same: input tokens, output tokens, a flat rate or a linear combination. That is too simple. The Transformer's autoregressive structure makes energy per token deeply nonlinear. A long input incurs a quadratic attention cost at prefill; a very short output never amortizes that cost, so you pay a lot for very little. We mapped this out properly. The model, SweetSpot, predicts the energy-per-output-token curve as a function of (input tokens, output tokens) from first principles (FLOP and memory-access complexity), and achieves 1.79% MAPE across 13 LLMs (1B to 9B parameters; OPT, LLaMA, Gemma, Falcon, Qwen2, Granite) on NVIDIA H100 GPUs. The sweet spot: short-to-moderate inputs with medium outputs. The nightmare: a 4096-token prompt followed by a 64-token reply, up to 33x less efficient than the optimum. We also find that GQA models consistently beat MHA models at the same scale: architecture matters, not just parameter count.
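To see where the nonlinearity comes from, here is a back-of-envelope sketch (not the paper's actual SweetSpot model, and with illustrative constants like a 7B-parameter model, `d_model=4096`, and 32 layers) of Transformer inference FLOPs: a term linear in total tokens from the weight matmuls, plus attention terms that grow with the square of the prompt length at prefill.

```python
# Illustrative FLOP counts for Transformer inference (NOT the paper's model):
# shows why cost per OUTPUT token is nonlinear in (input_tokens, output_tokens).

def inference_flops(n_in, n_out, n_params=7e9, d_model=4096, n_layers=32):
    """Approximate total FLOPs: ~2 * params per token for the weight matmuls,
    plus attention over the growing context."""
    linear = 2 * n_params * (n_in + n_out)
    # Prefill attention: every prompt token attends to the whole prompt.
    attn_prefill = n_layers * 2 * d_model * n_in * n_in
    # Decode attention: each generated token attends to everything so far.
    attn_decode = sum(n_layers * 2 * d_model * (n_in + t) for t in range(n_out))
    return linear + attn_prefill + attn_decode

def flops_per_output_token(n_in, n_out):
    return inference_flops(n_in, n_out) / n_out

# A long prompt with a tiny reply amortizes the prefill cost badly:
cost_bad = flops_per_output_token(4096, 64)    # 4096-token prompt -> 64 tokens
cost_good = flops_per_output_token(256, 256)   # moderate prompt, medium output
print(f"4096->64 uses {cost_bad / cost_good:.0f}x more FLOPs per output token")
```

Even this crude count shows the long-prompt/short-reply pattern costing an order of magnitude more per output token; the paper's model adds memory-access costs and hardware calibration on top of this kind of accounting.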

Why is it important?

Current energy estimates for LLM inference assume a simple linear relationship with sequence length, but the actual Transformer architecture makes this fundamentally wrong. SweetSpot matters because it:

- Corrects a broken assumption baked into how the entire industry estimates inference costs
- Quantifies the stakes: up to a 33x energy difference between efficient and inefficient usage patterns, which adds up at datacenter scale
- Is actionable: the model is accurate enough (1.79% MAPE) to directly inform real production decisions like prompt truncation, summarization, and adaptive generation strategies

Essentially, it gives engineers a principled tool to stop wasting energy they didn't even know they were wasting.

Read the Original

This page is a summary of: SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference, May 2026, ACM (Association for Computing Machinery),
DOI: 10.1145/3777884.3797011.
