Large language models (LLMs) have attracted significant attention in recent years, but their safety in multilingual contexts remains a critical concern. Mitigating toxicity in non-English languages has been largely overlooked despite substantial investment in LLM safety, and studies revealing high toxicity levels in multilingual LLMs underscore the urgency of effective mitigation strategies. Current approaches to reducing toxicity in open-ended generation for non-English languages face significant hurdles, chiefly the resource-intensive nature of existing solutions: they typically require large datasets of toxic and non-toxic samples in the target language, which are often scarce or nonexistent, forcing researchers to fall back on translated English data as a substitute.
Researchers have explored various approaches to address the challenges of multilingual toxicity mitigation in LLMs. Cross-lingual generalization of Reinforcement Learning with Human Feedback (RLHF) or AI Feedback (RLAIF) has shown mixed results across different tasks. For question-answering, preference tuning on English-dominant data negatively impacts multilingual capabilities, necessitating multilingual training data. Conversely, summarization tasks demonstrate effective zero-shot cross-lingual generalization with English reward models. In the realm of LLM safety, efforts to develop safeguards against malicious instructions have shown limited zero-shot cross-lingual generalization to both low-resource and high-resource languages. Current solutions for multilingual toxicity mitigation often rely on translating toxic and non-toxic data from English to target languages, extending existing detoxification methods to multilingual settings. However, these approaches remain resource-intensive and may not fully address the complexities of multilingual toxicity.
Researchers from the Department of Computer Science at Brown University study cross-lingual detoxification of LLMs using English preference tuning, without any translation into target languages. They observe that Direct Preference Optimization (DPO) with only English training data significantly reduces toxicity in LLM generations across 17 different languages. This zero-shot cross-lingual generalization contradicts prior assumptions about limited cross-lingual transfer in LLM safety tasks. The method proves effective for multilingual LLMs of different sizes and pretraining compositions, including mGPT, Llama3, and Aya-23. This discovery opens new avenues for efficient multilingual toxicity mitigation, addressing a critical challenge in LLM safety across diverse linguistic contexts.
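To make the setup concrete, here is a minimal, hypothetical sketch of English-only DPO detoxification using the Hugging Face trl library. The dataset file, preference columns, and hyperparameters are illustrative assumptions, not the paper's exact configuration, and the trl API has shifted across versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "ai-forever/mGPT"  # one of the multilingual LLMs studied
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# English-only preference pairs: each record has "prompt", "chosen"
# (non-toxic continuation), and "rejected" (toxic continuation).
# The file name is a placeholder, not the paper's actual dataset.
train_dataset = load_dataset("json", data_files="english_toxicity_pairs.json")["train"]

config = DPOConfig(output_dir="dpo-detox", beta=0.1, per_device_train_batch_size=4)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older trl versions take tokenizer= instead
)
trainer.train()  # per the paper's finding, toxicity then drops in other languages too
```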
The method’s analysis localizes toxicity within the LLM using probes and causal interventions. A linear probe for binary toxicity classification is trained on the Jigsaw dataset, taking the average residual stream from the last layer as input. MLP value vectors are then ranked by cosine similarity to the probe, and the top 100 are flagged as potential sources of toxicity. Actual sources are identified by collecting average neuron activations over 20 tokens on English prompts from the RTP-LX dataset. Causal interventions then edit these neuron activations, amplifying or suppressing them during the forward pass, and measure the resulting change in toxicity across languages, verifying that the identified neurons explain the toxic behavior beyond English.
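As a rough illustration of the probe-and-rank step, here is a minimal PyTorch sketch; the dimensions, training data, and value-vector matrix are synthetic stand-ins (assumptions for illustration) for the Jigsaw representations and the model's stacked MLP down-projection rows.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_samples = 512, 1024

# Synthetic stand-ins for the mean last-layer residual streams of Jigsaw
# comments and their binary toxicity labels.
hidden = torch.randn(n_samples, d_model)
labels = torch.randint(0, 2, (n_samples,)).float()

# 1) Train the linear toxicity probe on the averaged residual stream.
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    loss = F.binary_cross_entropy_with_logits(probe(hidden).squeeze(-1), labels)
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Rank MLP value vectors by cosine similarity to the probe direction.
# In the real model these are the stacked rows of every layer's MLP
# down-projection (196,608 vectors in total); random stand-ins here.
value_vectors = torch.randn(4096, d_model)
direction = probe.weight.squeeze(0).detach()
sims = F.cosine_similarity(value_vectors, direction.unsqueeze(0), dim=-1)
top100 = sims.topk(100).indices  # candidate sources of toxicity
```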
Results demonstrate the dual multilinguality of MLPs in LLMs: value vectors consistently promote toxic tokens across languages, while key vectors respond to multilingual input prompts designed to elicit toxic continuations. Of the top 100 sub-updates identified as potential toxicity sources, 36 were confirmed as actual sources; these value vectors promote multilingual tokens grouped by concepts such as sexual content, corruption, or political issues. Causal intervention experiments confirm that manipulating these toxic neuron activations significantly affects content toxicity across languages: modifying just 36 of the model's 196,608 neuron activations reduced average toxicity across 17 languages from 0.175 to 0.032. The study also shows that toxic key vectors are multilingual, activating positively across many languages before DPO training and showing reduced activation across all languages afterward. This explains the cross-lingual generalization of DPO detoxification: it suppresses these multilingual neurons.
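The intervention itself can be pictured as a forward hook that rescales the identified neurons during the forward pass. The sketch below is a hedged illustration; the module path, neuron indices, and demo module are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def make_intervention_hook(neuron_indices, scale):
    """Rescale selected MLP neuron activations during the forward pass.
    scale=0.0 suppresses the identified toxic neurons; scale>1 amplifies them."""
    def hook(module, inputs, output):
        output = output.clone()
        output[..., neuron_indices] *= scale
        return output
    return hook

# Tiny demo on a stand-in activation module. In a real model this would be
# registered on the MLP activation of a transformer block (for instance
# model.transformer.h[i].mlp.act in a GPT-2-style model; the exact path is
# an assumption and varies by architecture).
act = nn.GELU()
toxic_idx = torch.tensor([3, 17, 42])        # illustrative neuron indices
handle = act.register_forward_hook(make_intervention_hook(toxic_idx, 0.0))
out = act(torch.randn(2, 5, 64))             # [batch, seq, d_mlp]
assert out[..., toxic_idx].abs().sum() == 0  # suppressed neurons are zeroed
handle.remove()
```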
In this study, the researchers show that safety preference tuning with DPO achieves effective zero-shot cross-lingual generalization in detoxifying LLMs, and that the approach is robust across multilingual LLMs, offering a practical solution for multilingual toxicity mitigation. Their mechanistic explanation, the dual multilinguality of toxic neurons, accounts for this generalization: the method's effectiveness is rooted in shared multilingual representations, which allow cross-lingual transfer of safety preferences. Importantly, the research establishes that bilingual sentence retrieval accuracy predicts how well English safety preference tuning generalizes to a given language, offering a practical tool for assessing potential effectiveness across language pairs.
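A minimal sketch of that predictor, assuming row-aligned parallel sentences embedded with the model's hidden states (mean pooling and layer choice are illustrative assumptions), might look like this:

```python
import torch
import torch.nn.functional as F

def retrieval_accuracy(eng_emb: torch.Tensor, tgt_emb: torch.Tensor) -> float:
    """Nearest-neighbor retrieval accuracy over [n, d] embeddings of
    parallel sentence pairs, where row i of each matrix is a translation pair."""
    eng = F.normalize(eng_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    preds = (eng @ tgt.T).argmax(dim=-1)  # closest target sentence per English one
    gold = torch.arange(eng.size(0))
    return (preds == gold).float().mean().item()

# Synthetic demo: correlated embeddings mimic a language pair with strong
# shared representations.
eng = torch.randn(100, 512)
tgt = eng + 0.1 * torch.randn(100, 512)
print(retrieval_accuracy(eng, tgt))  # close to 1.0
```

On the paper's account, higher retrieval accuracy for a language pair would predict stronger transfer of English DPO detoxification to that language.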
Check out the Paper. All credit for this research goes to the researchers of this project.