Chapter 2.12 of the report discusses how to improve model performance through techniques such as prompting, fine-tuning, and optimization of the attention mechanism.
1. Prompting
Prompting is a key component of the LLM workflow: it involves giving the model natural-language instructions that describe the task it should perform. Mastering the art of writing effective prompts can significantly enhance LLM performance without requiring any improvements to the underlying model itself.
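As a purely illustrative example (the wording below is mine, not from the report), a prompt simply packages the instruction and the input into plain text before it is sent to the model:

```python
# A minimal, hypothetical task prompt: clearer instructions often mean better outputs,
# with no change to the model itself.
prompt = (
    "Task: Summarize the following customer review in one sentence and label its "
    "sentiment as positive, negative, or neutral.\n\n"
    "Review: The battery lasts two days, but the screen scratches easily.\n"
    "Summary:"
)
# `prompt` would then be passed to whatever LLM API you are using.
```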
The report includes some new information I had not come across before, so here I will use it to supplement my understanding.
1.1 Graph of Thoughts Prompting
"Chain of Thought" (CoT) and "Tree of Thoughts" (ToT) are prompting methods that can improve the performance of LLMs on reasoning tasks. In 2023, European researchers introduced another prompting method called "Graph of Thoughts" (GoT), which also showed potential. GoT allows LLMs to organize their thinking in a more flexible, graph-like structure, which is closer to the way humans actually reason.
The researchers then designed a framework to implement GoT and found that, compared to ToT, it improved output quality by 62% on a sorting task while reducing costs by about 31%.
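To make the graph structure concrete, here is a minimal Python sketch of a graph-of-thoughts data structure, assuming a generic `llm` call and a `scorer` function; the class and method names are illustrative, not the authors' actual framework. The key difference from a tree is the `aggregate` operation, which merges several thoughts back into a single node.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Thought:
    """One node in the graph: a partial solution plus its parents and a quality score."""
    content: str
    parents: List["Thought"] = field(default_factory=list)
    score: float = 0.0

class GraphOfThoughts:
    """Illustrative graph-of-thoughts container (names are assumptions, not the paper's API)."""

    def __init__(self, llm: Callable[[str], str], scorer: Callable[[str], float]):
        self.llm = llm          # any text-in / text-out model call
        self.scorer = scorer    # estimates the quality of a partial solution
        self.nodes: List[Thought] = []

    def generate(self, parent: Thought, k: int) -> List[Thought]:
        """Branch: create k new thoughts from one parent (the ToT-style expansion)."""
        children = []
        for _ in range(k):
            text = self.llm(f"Improve or extend this partial solution:\n{parent.content}")
            child = Thought(text, parents=[parent], score=self.scorer(text))
            self.nodes.append(child)
            children.append(child)
        return children

    def aggregate(self, parents: List[Thought]) -> Thought:
        """Merge: combine several thoughts into one node; this is what makes the
        structure a graph rather than a tree."""
        joined = "\n---\n".join(p.content for p in parents)
        text = self.llm(f"Merge these partial solutions into a single better one:\n{joined}")
        merged = Thought(text, parents=parents, score=self.scorer(text))
        self.nodes.append(merged)
        return merged
```

Roughly speaking, on a sorting task branching can split the input into sub-lists and aggregation can merge the sorted sub-lists back together, which is where the graph shape pays off.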
1.2 Optimization by PROmpting (OPRO)
A paper published by DeepMind introduced "Optimization by PROmpting" (OPRO), a method that uses LLMs to iteratively generate prompts that improve task performance. OPRO guides the LLM with natural language to create new prompts based on the problem description and previously generated solutions and their scores.
For example, the instructions OPRO generated improved over successive optimization steps:
- Step 2: "Let's carefully consider the problem and solve it together." (training accuracy 63.2%)
- Step 4: "Let's break it down!" (training accuracy 71.3%)
- Step 5: "Let's calculate the solution!" (training accuracy 73.9%)
- Step 6: "Let's do the math!" (training accuracy 78.2%)
These generated prompts are designed to improve the performance of AI systems on specific benchmarks. Compared with other prompting approaches, such as "let's think step by step" or starting from an empty prompt, OPRO delivered significantly higher accuracy on almost all 23 BIG-bench Hard tasks.
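To show roughly what that loop looks like, here is a sketch of an OPRO-style optimizer in Python. The `llm` and `evaluate` callables, the meta-prompt wording, and the seed instruction are assumptions made for illustration, not DeepMind's implementation.

```python
from typing import Callable, List, Tuple

def opro_optimize(
    llm: Callable[[str], str],         # proposes new instructions (placeholder for a real model call)
    evaluate: Callable[[str], float],  # returns the training accuracy of an instruction
    task_description: str,
    seed_instruction: str = "Let's think step by step.",
    steps: int = 8,
) -> List[Tuple[str, float]]:
    """OPRO-style loop (sketch): keep a scored history of instructions, show it to the
    LLM in a meta-prompt, and ask for a better instruction at each optimization step."""
    history: List[Tuple[str, float]] = [(seed_instruction, evaluate(seed_instruction))]

    for _ in range(steps):
        # The meta-prompt contains the problem description plus previous
        # (instruction, accuracy) pairs, with the best-scoring ones listed last.
        scored = "\n".join(
            f"Instruction: {inst}\nAccuracy: {acc:.1f}"
            for inst, acc in sorted(history, key=lambda pair: pair[1])
        )
        meta_prompt = (
            f"{task_description}\n\n"
            f"Here are previous instructions with their training accuracies:\n{scored}\n\n"
            "Write a new instruction that achieves a higher accuracy."
        )
        candidate = llm(meta_prompt).strip()
        history.append((candidate, evaluate(candidate)))

    # Return the instructions sorted from best to worst.
    return sorted(history, key=lambda pair: pair[1], reverse=True)
```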
2. Fine-Tuning
Fine-tuning has become an increasingly popular way to enhance the performance of LLMs; it involves further training (or adjusting) a model on a smaller, task-specific dataset. Fine-tuning not only improves a model's overall performance but also sharpens its capabilities on specific tasks and allows more precise control over its behavior.
Today, let's take a look at QLoRA:
2.1 QLoRA
QLoRA, a method developed by researchers at the University of Washington in 2023, aims to make model fine-tuning more efficient. It dramatically reduces memory usage, making it possible to fine-tune a 65-billion-parameter model on a single 48 GB GPU while preserving full 16-bit fine-tuning performance. For comparison, fine-tuning a leading open-source LLM of similar scale, such as the 65B Llama model, typically requires around 780 GB of GPU memory, so QLoRA is roughly 16 times more memory-efficient (780 GB / 48 GB ≈ 16).
QLoRA achieves this efficiency through techniques such as the 4-bit NormalFloat (NF4) data type, double quantization, and paged optimizers. QLoRA was used to train the Guanaco models, which matched or even surpassed models like ChatGPT on the Vicuna benchmark (a benchmark for evaluating LLM outputs).
Notably, the Guanaco model was successfully created after only 24 hours of fine-tuning on a single GPU. QLoRA highlights how methods for optimizing and further improving models are becoming more efficient, meaning that creating more capable models will require fewer resources.
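For readers who want to see what these pieces look like in practice, the sketch below shows how NF4, double quantization, and a paged optimizer are typically configured with the Hugging Face transformers, peft, and bitsandbytes libraries. The model name, LoRA hyperparameters, and training settings are placeholders for illustration, not the exact Guanaco recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "huggyllama/llama-7b"  # placeholder; substitute the model you want to fine-tune

# 4-bit NormalFloat quantization with double quantization, in the spirit of QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # do the actual matmuls in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Freeze the 4-bit base model and attach small trainable LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# A paged optimizer avoids out-of-memory spikes during training.
training_args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
)
# model and training_args would then be handed to a Trainer together with a dataset.
```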
3. Attention Mechanism
Although LLMs can flexibly handle a wide variety of tasks, they usually require substantial computational resources to train. As mentioned earlier, the high cost of training can hinder the broader adoption of AI. The optimization methods discussed here aim to make LLMs more efficient, for example by reducing memory usage and speeding up inference, thereby making them more accessible and practical.
3.1 Flash-Decoding
Flash-Decoding, developed by Stanford University researchers, addresses the inefficiency of LLM inference on long-sequence tasks by speeding up the attention mechanism. It splits the keys and values into smaller chunks, loads and processes those chunks in parallel, and then rescales and combines the partial results so that the final attention output remains exactly correct.
In various tests, Flash-Decoding outperformed other leading methods such as PyTorch Eager and FlashAttention-2, delivering faster inference: for example, with a batch size of 256 and a sequence length of 256, Flash-Decoding was 48 times faster than PyTorch Eager and 6 times faster than FlashAttention-2.
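The rescale-and-combine step can be checked numerically with a short NumPy sketch; this is only an illustration of the arithmetic behind Flash-Decoding, not the optimized CUDA kernel. Each chunk of keys and values produces a local softmax output plus its log-sum-exp, and the partial outputs are then reweighted so the result matches full attention exactly.

```python
import numpy as np

def attention_reference(q, K, V):
    """Standard single-query attention, used here only to verify correctness."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def split_kv_attention(q, K, V, num_splits=4):
    """Flash-Decoding-style combination (sketch): process K/V in chunks,
    then merge the partial results with log-sum-exp rescaling."""
    d = q.shape[-1]
    partial_out, partial_lse = [], []
    for idx in np.array_split(np.arange(K.shape[0]), num_splits):  # chunks run in parallel in the real kernel
        s = K[idx] @ q / np.sqrt(d)              # attention scores for this chunk
        m = s.max()
        e = np.exp(s - m)
        denom = e.sum()
        partial_out.append(e @ V[idx] / denom)   # chunk-local softmax output
        partial_lse.append(m + np.log(denom))    # log-sum-exp of this chunk's scores
    lse = np.array(partial_lse)
    g = lse.max()
    global_lse = g + np.log(np.exp(lse - g).sum())
    weights = np.exp(lse - global_lse)           # each chunk's share of the total softmax mass
    return sum(w * o for w, o in zip(weights, partial_out))

# Sanity check: the split computation reproduces full attention.
rng = np.random.default_rng(0)
q = rng.normal(size=64)
K = rng.normal(size=(1024, 64))
V = rng.normal(size=(1024, 64))
assert np.allclose(split_kv_attention(q, K, V), attention_reference(q, K, V))
```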