
Layerwise Importance Sampled AdamW (LISA): A Machine Learning Optimization Algorithm that Randomly Freezes Layers of LLM Based on a Given Probability

Mar 31, 2024

Tasks like drafting documents, writing complex code, answering queries, and holding human-like conversations are where large language models (LLMs) like ChatGPT shine. As LLMs find uses across an ever-wider range of tasks, fine-tuning them for specific domains has become an important tactic for improving their capabilities. However, full fine-tuning is quite costly, which makes it difficult to adapt models at scale. Parameter-efficient fine-tuning (PEFT) methods, including adapter weights, prompt weights, and LoRA, have been proposed to reduce the number of trainable parameters and lower this cost.

Among them, LoRA is one of the most widely adopted PEFT techniques, since its adapter can be merged back into the base model parameters. But LoRA still has some way to go before it can match full-parameter fine-tuning in every fine-tuning scenario. For instance, there are concerns over LoRA’s efficacy on large-scale datasets, as it has been observed to fail during continual pre-training. The likely reason is that LoRA, with far fewer trainable parameters, has less representational capacity than the base model.
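
To make that contrast concrete, here is a minimal, hypothetical PyTorch sketch of the idea behind LoRA: the base weight stays frozen, only a small low-rank update B·A is trained, and that update can later be merged back into the base parameters. The class name, rank, and scaling below are illustrative assumptions, not the paper’s or the `peft` library’s implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal LoRA-style linear layer: a frozen base weight plus a trainable
    low-rank update B @ A that can be merged back into the base weight.
    Illustrative sketch only; real implementations add dropout, scaling
    conventions, and dtype handling."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Base projection plus the low-rank correction x @ (B A)^T
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    def merge(self) -> nn.Linear:
        """Fold the low-rank update into the frozen base weight, so inference
        uses a single plain linear layer again."""
        with torch.no_grad():
            self.base.weight += self.scale * (self.B @ self.A)
        return self.base
```

The trainable parameters are only A and B (roughly r × (in + out) values per layer instead of in × out), which is exactly why the technique is cheap but also why its representational capacity is limited.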

To address this limitation, researchers from the Hong Kong University of Science and Technology and the University of Illinois investigated LoRA’s training statistics in every layer to bridge the gap between LoRA and full-parameter fine-tuning. The team found that LoRA’s layerwise weight norms are surprisingly skewed: most of the update weight is assigned to the bottom or top layer, with very little assigned to the other self-attention layers. This indicates that different layers carry very different importance during the update.
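
The kind of diagnostic behind this observation can be approximated with a short, hypothetical snippet that compares parameter snapshots taken before and after fine-tuning and averages the update norms per layer. The `layers.<idx>.` naming convention is an assumption borrowed from common transformer implementations, not the authors’ code.

```python
import re
from collections import defaultdict


def layerwise_update_norms(params_before, params_after):
    """Mean Frobenius norm of the weight change, grouped by layer.

    `params_before` / `params_after` are {name: tensor} snapshots of the
    model's parameters before and after fine-tuning. Purely an illustrative
    diagnostic for reproducing the kind of layerwise skew described above.
    """
    norms = defaultdict(list)
    for name, before in params_before.items():
        delta = params_after[name].float() - before.float()
        match = re.search(r"layers\.(\d+)\.", name)
        key = f"layer_{match.group(1)}" if match else name.split(".")[0]
        norms[key].append(delta.norm().item())
    return {key: sum(vals) / len(vals) for key, vals in norms.items()}
```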

In keeping with the concept of importance sampling, this crucial finding motivated them to “sample” different layers according to their relative importance. As a result, the team introduced the Layerwise Importance Sampled AdamW (LISA) algorithm, which enables training of large-scale language models (≥ 65B parameters) with the same or lower memory consumption than LoRA by selectively updating only the essential LLM layers while leaving the others frozen.
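
The core mechanic is easy to sketch: every few optimization steps, sample a small number of layers to keep trainable, freeze the rest, and run AdamW only on the active parameters. The PyTorch snippet below is a rough illustration under those assumptions; the function names, sampling interval, and the Hugging Face-style `model(**batch).loss` call are hypothetical, and the paper additionally keeps the embedding and head layers trainable throughout, which is omitted here for brevity.

```python
import random

import torch


def resample_active_layers(layers, num_active=2, probs=None):
    """Freeze all transformer blocks except `num_active` randomly sampled ones.

    `layers` is a list of nn.Module blocks; `probs` gives per-layer sampling
    weights (uniform if None). Sampling is done with replacement here for
    simplicity. Illustrative sketch, not the authors' reference implementation.
    """
    if probs is None:
        probs = [1.0] * len(layers)
    active = set(random.choices(range(len(layers)), weights=probs, k=num_active))
    for idx, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = idx in active


def lisa_finetune(model, layers, dataloader, interval=20, lr=1e-5, num_active=2):
    """Every `interval` steps, re-sample which layers are trainable and rebuild
    AdamW over only those parameters."""
    optimizer = None
    for step, batch in enumerate(dataloader):
        if step % interval == 0:
            resample_active_layers(layers, num_active=num_active)
            optimizer = torch.optim.AdamW(
                [p for p in model.parameters() if p.requires_grad], lr=lr
            )
        loss = model(**batch).loss  # assumes a Hugging Face-style model output
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Because AdamW’s first- and second-moment states exist only for the currently trainable parameters in this setup, optimizer memory scales with the handful of active layers rather than with the full model, which is where LISA’s memory savings over full-parameter fine-tuning come from.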

Upon fine-tuning for downstream tasks, LISA outperformed both LoRA and traditional full-parameter fine-tuning methods. This significant performance gap suggests that LISA could be a promising alternative to LoRA, demonstrating its superiority in the field of large-scale language model training.

This research demonstrates that LISA improves convergence behavior and surpasses LoRA by 8–36% on MT-Bench, making it a compelling choice for fine-tuning current LLMs. Moreover, LISA’s performance is not limited to specific tasks or model sizes: it consistently delivers improved results across tasks including instruction following, medical QA, and math problems, for models ranging from 7B to 70B parameters.

The team highlights that, similar to LoRA, LISA’s main drawback is the memory consumed by the forward pass, which still requires the full model to be held in memory. In the future, they plan additional experiments to verify performance with QLoRA, which could help compensate for this shortcoming.


Check out the Paper. All credit for this research goes to the researchers of this project.


