Training Large Language Models (LLMs) involves two main phases: pre-training on extensive datasets and fine-tuning for specific tasks. While pre-training requires significant computational resources, fine-tuning adds comparatively little new information to the model, which makes the fine-tuning delta far more compressible. This pretrain-finetune paradigm has greatly advanced machine learning, allowing LLMs to excel at various tasks and adapt to individual needs, promising a future of highly specialized models tailored to specific requirements.
Various quantization techniques, such as rescaling activations, decomposing matrix multiplications, and iterative weight rounding, aim to reduce memory usage and latency in LLMs. Additionally, pruning methods induce sparsity by zeroing certain parameter values. Parameter-efficient fine-tuning (PEFT) approaches, like adapter layers and Low-Rank Adaptation (LoRA), reduce trainable parameters during fine-tuning, enhancing efficiency without sacrificing accuracy. These methods offer significant potential for compression-aware training and multi-tenant serving systems.
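Since the paragraph above leans on LoRA as the flagship PEFT method, here is a minimal sketch of a LoRA-style adapter, assuming PyTorch; the LoRALinear class, rank, and scaling values are illustrative choices, not taken from the paper or any specific library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pretrained linear layer with a trainable low-rank update."""
    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weight (and bias) stay frozen
        # The low-rank factors A and B are the only trainable parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, base_linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base_linear.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # y = x W_base^T + x (B A)^T * scaling; the update starts at zero because B is zero.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Because only the small A and B matrices receive gradients, the number of trainable parameters drops from in_features × out_features to rank × (in_features + out_features) per layer.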
Researchers from the Massachusetts Institute of Technology, Princeton University, and Together AI have proposed BitDelta, which quantizes fine-tuning deltas down to 1 bit without sacrificing performance. This finding suggests that fine-tuning adds relatively redundant information and has important implications for multi-tenant serving and storage. By pairing a single high-precision base model with multiple 1-bit deltas, BitDelta reduces GPU memory requirements by more than 10× and improves generation latency in multi-tenant settings.
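To make the memory argument concrete, here is a back-of-the-envelope sketch in Python; the 7B parameter count, 16-bit precision, and ten-tenant scenario are illustrative assumptions, not figures reported in the paper.

```python
# Rough memory estimate for serving many fine-tuned variants of one base model.
PARAMS = 7e9            # assumed 7B-parameter model
BYTES_FP16 = 2          # 16-bit weights
BYTES_1BIT = 1 / 8      # 1-bit binary delta (per-matrix scale factors are negligible)
NUM_FINETUNES = 10      # assumed number of fine-tuned variants served together

per_delta_fp16 = PARAMS * BYTES_FP16   # ~14.0 GB if each delta is stored in 16 bits
per_delta_1bit = PARAMS * BYTES_1BIT   # ~0.9 GB as a 1-bit delta, i.e. >10x smaller

naive = NUM_FINETUNES * PARAMS * BYTES_FP16
bitdelta = PARAMS * BYTES_FP16 + NUM_FINETUNES * PARAMS * BYTES_1BIT

print(f"naive:    {naive / 1e9:.1f} GB")     # ~140.0 GB for ten full copies
print(f"bitdelta: {bitdelta / 1e9:.1f} GB")  # ~22.8 GB for one base + ten 1-bit deltas
```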
BitDelta employs a two-stage process to quantize fine-tuning deltas in LLMs. First, it quantizes each weight matrix's delta into a binary sign matrix multiplied by a scaling factor, initialized as the average absolute value of the delta. Second, it calibrates the scaling factors via model distillation over a small dataset while keeping the binary matrices frozen. BitDelta's efficiency allows models to be compressed rapidly, making it practical to serve many fine-tuned variants from one shared server while significantly reducing GPU memory consumption and inference latency.
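In code, the two stages look roughly like the following PyTorch sketch; binarize_delta, calibrate_scales, reconstruct_logits, teacher_logits_fn, and calib_loader are hypothetical names, and the logit-matching MSE loss is an assumption about the distillation objective rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def binarize_delta(w_base: torch.Tensor, w_finetuned: torch.Tensor):
    """Stage 1: compress a weight delta to sign bits plus one scaling factor."""
    delta = w_finetuned - w_base
    scale = delta.abs().mean()   # scale initialized as the mean absolute delta
    sign = torch.sign(delta)     # 1-bit matrix (stored bit-packed in practice)
    return scale, sign

def calibrate_scales(scales, reconstruct_logits, teacher_logits_fn,
                     calib_loader, steps=200, lr=1e-4):
    """Stage 2: distillation — train only the scales so the compressed model
    matches the fine-tuned (teacher) model's outputs; the sign matrices stay frozen."""
    opt = torch.optim.Adam(scales, lr=lr)
    for _, batch in zip(range(steps), calib_loader):
        with torch.no_grad():
            target = teacher_logits_fn(batch)        # fine-tuned model's logits
        pred = reconstruct_logits(batch, scales)     # forward pass with base + scale * sign
        loss = F.mse_loss(pred, target)              # assumed logit-matching objective
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Because only one scalar per weight matrix is trained in stage 2, the calibration is fast and needs only a small calibration set.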
BitDelta is evaluated against the original uncompressed models as well as 8-bit RTN and 4-bit GPTQ quantization baselines. Across the Llama-2 and Mistral model families, BitDelta consistently performs well on high-margin metrics, often matching or outperforming the baselines. It accurately preserves fine-tuned information and even surpasses GPTQ when applied on top of quantized base models, showcasing its effectiveness and versatility across model sizes and fine-tuning techniques.
In conclusion, researchers from the Massachusetts Institute of Technology, Princeton University, and Together AI have proposed BitDelta, a simple yet powerful method for quantizing weight deltas in LLMs down to 1 bit, efficiently representing multiple fine-tuned models with one base model and multiple deltas. BitDelta achieves minimal performance degradation through distillation-based calibration while significantly reducing GPU memory requirements and improving generation latency. This approach paves the way for more efficient model deployment and resource utilization in machine learning applications.