Benchmarking Serverless AI Architectures: Modular RAG, Serverless RAG, and Long Context Inference

Florian Alexandru Serb Petrusel; Pedro Antonio García López

doi:10.1145/3809481.3816479

What is it about?

This research compares different architectural methods for helping Artificial Intelligence (AI) read, search, and answer questions from massive technical documents up to 300 pages long. We built and tested three distinct systems in a cloud environment: two systems that chop documents into smaller, searchable pieces (Modular RAG and Serverless RAG), and a third approach that feeds the entire document into the AI's memory at once (Long-Context Inference). Our experiments reveal that while feeding a whole document to an AI works beautifully for medium-sized texts, scaling it up to giant files causes the AI to suffer from "Context Fatigue": it becomes slow, prone to errors, and nearly 100 times more expensive than traditional lookup methods.


Why is it important?

With recent breakthroughs, many AI providers claim their models can read millions of words instantly, leading people to believe that older data-retrieval systems are obsolete. Our study proves that this is a misconception. By testing these architectures under real-world serverless cloud conditions (AWS Lambda), we uncovered hidden bottlenecks like extreme latency spikes and massive computing bills. This work is highly valuable for software developers and cloud architects because it provides concrete thresholds showing exactly when to use each architecture, preventing companies from wasting budgets on inefficient AI setups.

Perspectives

Building this automated benchmarking platform allowed us to see the massive gap between theoretical AI capabilities and practical cloud deployment realities. It proved that a "one-size-fits-all" approach does not exist. The future of processing heavy corporate documentation efficiently will rely on hybrid routers that analyze file sizes and query types in real time to pick the smartest, cheapest path.
FLORIAN ALEXANDRU SERB PETRUSEL
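To make the hybrid-router idea above concrete, here is a minimal sketch in Python. It is not the routing logic from the paper: the names `Request`, `Path`, and `route`, the rule itself, and the 50-page cutoff are all hypothetical placeholders. The summary only states that long-context inference works well for medium-sized texts and degrades on very large files, so the actual thresholds should be taken from the full text.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    """The two families of architectures compared in the study."""
    LONG_CONTEXT = "long-context inference"
    RAG = "retrieval-augmented generation"


@dataclass(frozen=True)
class Request:
    pages: int                  # document size, in pages
    needs_whole_document: bool  # True for e.g. a full summary, False for a lookup


# ASSUMPTION: placeholder cutoff, not a value reported by the authors.
MAX_LONG_CONTEXT_PAGES = 50


def route(request: Request, max_pages: int = MAX_LONG_CONTEXT_PAGES) -> Path:
    """Pick a processing path from file size and query type.

    Hypothetical rule: send the whole document to the model only when the
    query needs it and the document is small enough; otherwise use retrieval.
    """
    if request.pages < 0:
        raise ValueError("pages must be non-negative")
    if request.needs_whole_document and request.pages <= max_pages:
        return Path.LONG_CONTEXT
    return Path.RAG


if __name__ == "__main__":
    print(route(Request(pages=20, needs_whole_document=True)).value)
    print(route(Request(pages=300, needs_whole_document=True)).value)
    print(route(Request(pages=20, needs_whole_document=False)).value)
```

In this sketch a 300-page file always falls back to retrieval, which mirrors the finding that giant files are where long-context inference becomes slow and costly. A production router would likely also weigh latency and per-request cost.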

This page is a summary of: Benchmarking Serverless AI Architectures: Modular RAG, Serverless RAG, and Long Context Inference, June 2026, ACM (Association for Computing Machinery), DOI: 10.1145/3809481.3816479.

Contributors

The following have contributed to this page

FLORIAN ALEXANDRU SERB PETRUSEL

Finding the best way to process massive documents using AI without exploding cloud costs.

Resources

WoAIS 2026 Workshop Presentation

Live Demo, Source Code & Extended Thesis
