What is it about?
This research compares architectural approaches for helping Artificial Intelligence (AI) read, search, and answer questions about massive technical documents of up to 300 pages. We built and tested three systems in a cloud environment: two that chop documents into smaller, searchable pieces (Modular RAG and Serverless RAG), and a third that feeds the entire document into the AI's memory at once (Long-Context Inference). Our experiments show that feeding a whole document to an AI works well for medium-sized texts, but scaling it up to giant files causes the AI to suffer from "Context Fatigue": it becomes slow, prone to errors, and nearly 100 times more expensive than traditional lookup methods.
Why is it important?
Following recent breakthroughs, many AI providers claim their models can read millions of words instantly, leading people to believe that older data-retrieval systems are obsolete. Our study shows that this is a misconception. By testing these architectures under real-world serverless cloud conditions (AWS Lambda), we uncovered hidden bottlenecks such as extreme latency spikes and massive computing bills. This work is valuable for software developers and cloud architects because it provides concrete thresholds showing when to use each architecture, helping companies avoid wasting budgets on inefficient AI setups.
Perspectives
Building this automated benchmarking platform allowed us to see the massive gap between theoretical AI capabilities and practical cloud deployment realities. It proved that a "one-size-fits-all" approach does not exist. The future of processing heavy corporate documentation efficiently will rely on hybrid routers that analyze file sizes and query types in real time to pick the smartest, cheapest path.
FLORIAN ALEXANDRU SERB PETRUSEL
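To make the hybrid router idea from the Perspectives section concrete, here is a minimal sketch of how such a router might decide between the two families of architecture. Everything in it is an illustrative assumption rather than the paper's implementation: the `Request` type, the `choose_path` function, and especially the `MAX_LONG_CONTEXT_PAGES` cut-off are hypothetical, and this summary does not state the thresholds the study actually measured.

```python
from dataclasses import dataclass

# Hypothetical cut-off, chosen only for illustration.
# The thresholds reported in the paper are not given in this summary.
MAX_LONG_CONTEXT_PAGES = 50


@dataclass
class Request:
    pages: int                  # document length in pages
    needs_whole_document: bool  # True for summary-style questions, False for targeted lookups


def choose_path(request: Request) -> str:
    """Return 'long_context' or 'rag' for a request (illustrative rule only)."""
    # Whole-document inference is only considered for smaller files
    # and for questions that need to see the whole text at once.
    if request.pages <= MAX_LONG_CONTEXT_PAGES and request.needs_whole_document:
        return "long_context"
    # Everything else goes to the cheaper retrieval-based path.
    return "rag"
```

A real router would likely weigh more signals than these two (for example, latency budgets or per-token pricing), but the shape of the decision is the same: inspect the file and the question first, then pick the path.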
Read the Original
This page is a summary of: Benchmarking Serverless AI Architectures: Modular RAG, Serverless RAG, and Long Context Inference, June 2026, ACM (Association for Computing Machinery),
DOI: 10.1145/3809481.3816479.
Resources
WoAIS 2026 Workshop Presentation
Watch the official video presentation delivered at the WoAIS 2026 workshop. This brief session summarizes our methodology, the AWS Lambda serverless deployments, and the core findings comparing RAG architectures against Long-Context inference for massive technical documents.
Live Demo, Source Code & Extended Thesis
Visit my portfolio to interact with a live demo of the serverless architecture in action. This project page also provides direct access to the open-source GitHub repositories containing the deployment code, as well as the full, extended PDF version of the Bachelor's Thesis for deeper technical context.