What is it about?
This study compares two types of artificial intelligence (AI) models used to analyze images for cybersecurity purposes, such as spotting fake (phishing) websites, detecting hidden triggers planted in images, or identifying malware. One approach relies on general-purpose models that can handle both text and images and are guided with natural-language instructions (called prompts). The other uses specialized models, fine-tuned Vision Transformers, trained specifically for each task. The research compares how well these models perform across the different challenges. It finds that while the specialized models are more accurate, especially for complex problems like malware classification, the general-purpose models are much easier to use and still perform quite well on simpler tasks. This means that in situations where time, expertise, or computing resources are limited, the general-purpose models can be a practical alternative.
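To make the contrast concrete, here is a minimal sketch of what the prompt-engineered approach can look like in practice, assuming the OpenAI Python SDK. The model name, prompt wording, and file name are illustrative assumptions, not the exact setup used in the paper:

# Minimal sketch: asking a general-purpose multimodal model to judge a
# website screenshot. Model name, prompt text, and file path are
# illustrative, not the paper's exact configuration.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the screenshot so it can be sent inline with the prompt.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Is this website screenshot a phishing page or a "
                     "legitimate one? Answer with one word: "
                     "phishing or legitimate."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)  # e.g., "phishing"

The specialized alternative would instead involve collecting a labeled dataset and fine-tuning a vision model on it, which is where the extra expertise, time, and computing resources come in.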
Why is it important?
As powerful AI models that understand both text and images (called large multimodal models) become more widely available, many are starting to wonder whether these general-purpose tools can replace highly specialized models in fields like cybersecurity. This study is the first to systematically compare the two approaches across several real-world security tasks, including detecting backdoor triggers in images, spotting phishing attempts, and classifying malware. What makes this work timely is the rapid rise of tools like GPT-4o and Gemini, which are easy to use and require no task-specific training, unlike traditional models that demand machine-learning expertise and computing resources. Our research helps clarify when these new tools are good enough and when more specialized solutions are still needed. This can save time and money for cybersecurity teams and researchers by guiding smarter choices in model selection.
Perspectives
Working on this paper was an exciting opportunity to explore the intersection of cutting-edge AI and real-world cybersecurity challenges. I’ve always been fascinated by how general-purpose AI models like GPT-4o are reshaping what’s possible, and it was eye-opening to put them to the test against highly specialized systems. What made this project especially rewarding was seeing how prompt engineering alone could push these models to perform surprisingly well on certain tasks. It reminded me that accessibility and adaptability can be just as important as raw performance, especially when time and resources are limited. I hope this work sparks more discussion around how we choose and use AI tools, and encourages others to test these models in creative, practical ways.
Fouad Trad
American University of Beirut
Read the Original
This page is a summary of: Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications, ACM Transactions on Intelligent Systems and Technology, May 2025, ACM (Association for Computing Machinery). DOI: 10.1145/3735648.