What is it about?

Traditional NLP methods require thousands of labeled examples to train classification models, which is impractical in many business settings where data are limited. Instead of building models in full-data settings, one can build models that need only a few examples per class (few-shot settings). There are many approaches to few-shot classification, such as (a) contrastive learning with older (but inexpensive) Masked Language Models (MLMs) like BERT, which needs roughly 10-20 examples per class, and (b) prompting modern (but expensive) Large Language Models (LLMs) like GPT-4, which needs only 1-5 examples per class. However, the performance-cost trade-offs of these methods remain underexplored. Our work addresses this gap by studying both approaches on the popular Banking77 financial intent detection dataset, including the evaluation of cutting-edge LLMs such as OpenAI's GPT-4, Anthropic's Claude, and Cohere's Command-nightly. We complete the picture with two additional methods: first, a cost-effective prompting method for LLMs that selects dynamic few-shot examples via retrieval-augmented generation (RAG) and reduces operational costs several-fold compared to classic few-shot prompting; and second, a data augmentation method using GPT-4 that improves performance in data-limited scenarios.
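
To make the "dynamic" few-shot idea concrete, here is a minimal sketch of retrieval-based prompt construction. The encoder checkpoint, toy example pool, and prompt template are our own illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of dynamic few-shot prompting via retrieval (RAG).
# Instead of packing all labeled examples into every prompt, we embed the
# labeled pool once, retrieve the top-k examples most similar to each
# incoming query, and build a much shorter prompt from those alone.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # assumed encoder choice

# A small labeled pool of (utterance, intent) pairs, a few per class.
pool = [
    ("I lost my card, what should I do?", "lost_or_stolen_card"),
    ("Why was my card payment declined?", "declined_card_payment"),
    ("How do I top up my account?", "top_up"),
]
pool_embeddings = encoder.encode([u for u, _ in pool], convert_to_tensor=True)

def build_prompt(query: str, k: int = 5) -> str:
    """Retrieve the k most similar labeled examples and format a prompt."""
    query_embedding = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, pool_embeddings, top_k=k)[0]
    demos = "\n".join(
        f"Utterance: {pool[h['corpus_id']][0]}\nIntent: {pool[h['corpus_id']][1]}"
        for h in hits
    )
    return (
        "Classify the banking utterance into one of the known intents.\n\n"
        f"{demos}\n\nUtterance: {query}\nIntent:"
    )

print(build_prompt("My card got stolen yesterday"))
```

The GPT-4 data augmentation method can be sketched in a similar hedged spirit; the prompt wording and the `augment` helper below are hypothetical, not the paper's exact prompt.

```python
# Hypothetical sketch of GPT-4-based data augmentation: ask the model to
# paraphrase each seed utterance so a small training set grows larger.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment(utterance: str, intent: str, n: int = 5) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} paraphrases, one per line, of this banking "
                f"utterance, preserving the intent '{intent}':\n{utterance}"
            ),
        }],
    )
    return response.choices[0].message.content.strip().splitlines()
```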

Why is it important?

To the best of our knowledge, this is the first study of the performance-cost trade-off between LLMs and MLMs. Many companies adopt the most modern models, such as proprietary LLMs (OpenAI's GPT-4), which come at a hefty cost, without comparing their performance against cheaper, older models that might perform equally well. We also present a simple but effective retrieval-based few-shot method for text classification (built on RAG) that can reduce LLM costs by more than 3x in real-life business settings and save a company hundreds of dollars.

Perspectives

Several insights emerge from this study. Chiefly, a company is better off picking MLMs (like BERT) when more than 5 samples per class are available (and the more, the better, of course!). With fewer examples per class, it's better to pick a state-of-the-art LLM (like OpenAI's GPT-4). While GPT-4 is expensive, competitor LLMs such as Anthropic's Claude 2 come at a fraction of its price with comparable performance. Lastly, at inference time, instead of providing all examples and their classes to the LLM (231 examples in our case), it can be better to provide only the top 5 most relevant samples, cutting costs by up to 3x, since the API charges per token (a rough cost illustration follows this quote).

Lefteris Loukas
Helvia.ai
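
A back-of-the-envelope calculation shows where the savings come from. The token counts and price below are illustrative assumptions, not the paper's measured figures; the realized ratio depends on the fixed prompt overhead (instructions plus the list of 77 intent names).

```python
# Rough cost comparison: classic few-shot (all 231 demos in every prompt)
# vs. dynamic few-shot (only the top-5 retrieved demos). All numbers are
# illustrative assumptions.
FIXED_TOKENS = 1500        # assumed: instructions + the 77 intent names
TOKENS_PER_DEMO = 15       # assumed average tokens per demonstration
USD_PER_1K_INPUT_TOKENS = 0.03  # assumed GPT-4-class input price

def cost(num_demos: int, queries: int = 10_000) -> float:
    """Total input-token cost of classifying `queries` utterances."""
    tokens_per_query = FIXED_TOKENS + num_demos * TOKENS_PER_DEMO
    return tokens_per_query * queries / 1000 * USD_PER_1K_INPUT_TOKENS

classic = cost(231)  # every labeled example in every prompt
dynamic = cost(5)    # only the 5 most relevant examples per query
print(f"classic: ${classic:,.0f}, dynamic: ${dynamic:,.0f}, "
      f"saving: {classic / dynamic:.1f}x")
```

With these assumed numbers the saving works out to roughly 3x, consistent with the figure quoted above, and the dollar gap over ten thousand queries lands in the hundreds.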

Our work provides a practical rule of thumb for text classification in settings with many classes, such as intent detection in chatbot use cases:
- If you have fewer than 5 examples per class, use "dynamic" few-shot prompting (employing RAG), which performs better and costs a fraction of regular few-shot prompting.
- If you have more than 5 examples per class, fine-tune a pretrained model such as MPNet with a contrastive learning technique such as SetFit (a minimal sketch follows this quote).

Stavros Vassos
Helvia.ai
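
As a concrete illustration of the second recommendation in the quote above, here is a minimal SetFit fine-tuning sketch. The checkpoint name and the tiny toy dataset are illustrative assumptions; the paper's exact hyperparameters may differ.

```python
# Minimal SetFit sketch: contrastive fine-tuning of an MPNet sentence
# encoder plus a lightweight classification head, from a handful of
# labeled examples per class. Requires the `setfit` and `datasets` packages.
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

train_dataset = Dataset.from_dict({
    "text": [
        "I lost my card, what should I do?",
        "My card was stolen yesterday",
        "Why was my card payment declined?",
        "The shop rejected my card payment",
    ],
    "label": [
        "lost_or_stolen_card", "lost_or_stolen_card",
        "declined_card_payment", "declined_card_payment",
    ],
})

model = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_dataset)
trainer.train()  # generates contrastive pairs, then fits the head

print(model.predict(["Someone took my card"]))
```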

Read the Original

This page is a summary of: Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking, November 2023, ACM (Association for Computing Machinery).
DOI: 10.1145/3604237.3626891.
