LLMs, trained on extensive public datasets, have shown remarkable success across various fields, but high-quality public data is projected to be exhausted by around 2026. In response to this scarcity, researchers combine existing datasets or generate synthetic, model-created data. Meanwhile, abundant high-quality data remains untapped because of privacy or logistical constraints; BloombergGPT, for instance, excels in finance thanks to proprietary financial data spanning 40 years. Collaborative training on decentralized private data, without directly sharing it, therefore emerges as a critical approach to supporting the development of modern LLMs amid data scarcity and privacy concerns.
Researchers from Shanghai Jiao Tong University, Zhejiang University, and Shanghai AI Laboratory have developed OpenFedLLM, which enables collaborative, privacy-preserving training of LLMs on distributed private data through federated learning (FL). OpenFedLLM integrates federated instruction tuning, federated value alignment, and diverse FL algorithms, offering a user-friendly interface for both the LLM and FL communities. Empirical studies demonstrate FL’s superiority over individual local training, especially in resource-constrained scenarios, with promising applications in domains such as finance.
In recent years, LLMs such as GPT-3.5/4 and Llama 2 have shown success across various domains. They are typically trained in three stages: pre-training on large corpora, instruction tuning, and value alignment. However, the projected exhaustion of high-quality public data by 2026 has prompted exploration of training LLMs on privately held data. FL offers a solution by enabling collaborative training without sharing raw data. Various FL algorithms have been proposed to improve performance, though their efficacy for LLM training remains underexplored. Previous works have combined FL with LLMs but are limited in scope. This study provides a comprehensive exploration of FL for LLMs, covering instruction tuning, value alignment, and multiple FL algorithms, with extensive empirical evaluation.
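To make the federated setup concrete, the canonical FL algorithm, FedAvg, builds the global model by averaging client updates weighted by local data size, and most FL algorithms are variants of this scheme. The snippet below is a minimal, illustrative sketch; the parameter-dictionary format and names are assumptions, not OpenFedLLM’s actual API.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """FedAvg: average client parameter dicts, weighted by local dataset size.

    client_params: list of {name: np.ndarray} dicts, one per client
    client_sizes:  list of local training-set sizes, one per client
    """
    total = float(sum(client_sizes))
    global_params = {}
    for name in client_params[0]:
        # Each client's tensor contributes in proportion to its share of the data
        global_params[name] = sum(
            (n / total) * params[name]
            for n, params in zip(client_sizes, client_params)
        )
    return global_params

# Toy usage: three clients holding 100, 300, and 600 examples
clients = [{"w": np.full((2, 2), v)} for v in (1.0, 2.0, 3.0)]
merged = fedavg(clients, [100, 300, 600])  # weighted mean -> 2.5 everywhere
```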
The OpenFedLLM framework focuses on training LLMs via FL while preserving privacy. Two key components are introduced: federated instruction tuning, which enhances LLMs’ ability to follow instructions, and federated value alignment, which injects human values into the models. Parameter-efficient fine-tuning techniques such as LoRA are integrated to keep computation and communication efficient. The framework follows standard FL protocols, enabling seamless integration with various FL algorithms and facilitating collaborative model training across distributed parties.
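To illustrate one communication round under such a protocol, the hedged sketch below assumes each client fine-tunes only small LoRA adapter tensors on its private instruction data and sends those adapters (never raw data or the full model) to the server for FedAvg-style merging. `local_finetune_lora` and the adapter layout are hypothetical placeholders standing in for a real trainer, not OpenFedLLM’s actual code.

```python
import numpy as np

def local_finetune_lora(frozen_base, adapter, local_data, lr=1e-4):
    """Hypothetical client-side step: update only the LoRA adapter on the
    client's private (instruction, response) pairs; the base model stays frozen."""
    # A real trainer would run gradient steps here; the sketch returns the adapter as-is.
    return adapter

def federated_instruction_tuning_round(frozen_base, global_adapter, client_datasets):
    """One FL round: broadcast adapter -> local LoRA tuning -> weighted merge."""
    local_adapters, sizes = [], []
    for data in client_datasets:
        # Each client starts from a copy of the current global adapter
        start = {name: t.copy() for name, t in global_adapter.items()}
        local_adapters.append(local_finetune_lora(frozen_base, start, data))
        sizes.append(len(data))
    total = float(sum(sizes))
    # The server merges only the small adapter tensors (FedAvg-style average)
    return {
        name: sum((n / total) * a[name] for n, a in zip(sizes, local_adapters))
        for name in global_adapter
    }

# Toy usage: a rank-4 LoRA adapter for one attention projection, two clients
adapter = {"q_proj.lora_A": np.zeros((4, 64)), "q_proj.lora_B": np.zeros((64, 4))}
new_adapter = federated_instruction_tuning_round(None, adapter, [["ex1", "ex2"], ["ex3"]])
```

Exchanging only adapter weights keeps per-round communication to a small fraction of the full model’s parameters, which is what makes federated fine-tuning practical for billion-parameter LLMs.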
Data management in FedLLM becomes intricate due to decentralized data distribution, necessitating nuanced data-selection methods. Heterogeneous preferences pose challenges for federated value alignment (FedVA), suggesting the need to group clients with similar values. Personalized FL emerges as a direction for tailoring models to individual tasks or values. Robustness, security, privacy preservation, and efficiency are crucial concerns in FedLLM, especially given the possibility of malicious or poisoned client data and the demands of large-scale model training. Adapting FedLLM to cross-silo and cross-device FL settings presents both challenges and opportunities, with advances in model compression and efficient training strategies offering promising paths toward deployment on resource-constrained devices.
In the study, the researchers outline a holistic approach to training LLMs with FL on distributed private data, offering a promising avenue amid diminishing public data. The OpenFedLLM framework integrates instruction tuning, value alignment, FL algorithms, datasets, and evaluation metrics, facilitating comprehensive exploration. Empirical analyses showcase the superiority of FL over local training, with FL-fine-tuned LLMs surpassing even state-of-the-art models such as GPT-4 on certain benchmarks. The work contributes valuable insights and methodologies for leveraging decentralized data in LLM training.