What is it about?
Large language models can generate text much faster by using a small helper model to predict future words before a larger model verifies them. However, no single helper model works best for every task. For example, a model that performs well on programming questions may not be the best choice for general conversations. Dynamically choosing different helper models can improve performance, but this is difficult on edge devices because memory is limited and loading models takes time. In this work, we show that simply choosing the best helper model is not enough. A model may provide the best predictions, but if it is not already loaded in memory, the loading delay can eliminate the performance benefit. We therefore developed MemSpec, a runtime system that jointly decides which helper model to use and which models should remain in memory. Our results show that this coordinated approach significantly improves text-generation speed on memory-constrained edge devices while achieving performance close to an ideal system with perfect knowledge of future workloads. The work highlights the importance of memory-aware runtime management for efficient on-device AI.
Featured Image
Photo by Zulfugar Karimov on Unsplash
Why is it important?
A common assumption in adaptive speculative decoding is that better draft selection leads to better performance. Our work shows that this assumption does not always hold on memory-constrained edge devices. Even if the optimal draft model is identified, the cost of loading that model can outweigh its benefit. We show that memory management is a critical but often overlooked factor in adaptive LLM acceleration. By coordinating draft selection with model residency decisions, MemSpec substantially improves throughput while operating under the same memory constraints. This insight may influence the design of future LLM runtime systems for smartphones, robots, and other edge platforms.
Perspectives
Working on this project reminded me that improving AI systems is not only about building better models. As we explored adaptive speculative decoding on edge devices, we repeatedly encountered situations where the theoretically best model was not the fastest choice in practice because of memory limitations. I found this observation particularly interesting because it highlights how system constraints can fundamentally change the behavior of AI algorithms. It also reinforced my belief that many important challenges in AI lie at the intersection of machine learning and computer systems. I hope this work encourages researchers to look beyond model accuracy and consider practical deployment constraints when designing future AI systems. As more LLMs move from cloud servers to smartphones, robots, and other edge devices, these system-level considerations will become increasingly important.
Myeonggyun Han
Kyungpook National University
Read the Original
This page is a summary of: MemSpec: Memory-Aware Runtime for Adaptive Draft Scheduling in Speculative Decoding on Edge Devices, June 2026, ACM (Association for Computing Machinery),
DOI: 10.1145/3814943.3816174.
You can read the full text:
Contributors
The following have contributed to this page







