Transformer-based generative Large Language Models (LLMs) have shown considerable strength across a broad range of Natural Language Processing (NLP) tasks. Many applications benefit from their versatility; however, for most developers, the cost of training and deploying these models is prohibitive. To address this, leading AI firms such as OpenAI, Google, and Baidu offer language model-as-a-service (LMaaS), granting access to their LLMs through APIs.
In an LMaaS scenario, application developers send user input messages along with application-specific instructions to the LLM service. To provide better quality of service (QoS) and support more customers, service providers strive to reduce response times and increase throughput. However, current systems such as TensorFlow Serving and Triton Inference Server handle queries inefficiently: they process requests in a first-come, first-served (FCFS) fashion with a fixed batch size. To avoid out-of-memory (OOM) errors, these systems keep batch sizes small, which leaves much of the GPUs' capacity for parallel computation unused.
Continuous batching has been proposed to address this; it dynamically removes finished requests from a batch and adds new ones during processing. In practice, however, this approach relies on conservative GPU memory management, which limits throughput by not fully exploiting the GPUs' parallel processing capacity. Other strategies, such as model quantization and pruning, promise to reduce memory usage but may degrade the quality of the generated output.
It has been observed that in many applications, the length of the generated text correlates positively with the length of the user's input. This is particularly true for tasks such as code translation, bug patching, text detoxification, grammatical error correction, multilingual machine translation, and code commenting. Analyzing the requests made by these applications confirms a strong positive correlation between the length of the user's input and the length of the generated output. This correlation can be exploited to predict generation lengths and make batching more efficient, as sketched below.
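As a quick illustration, one could verify such a correlation on an application's request log with a few lines of Python. This is a minimal sketch under assumed field names (`input_tokens`, `output_tokens`) and toy data, not an artifact of the paper itself.

```python
# A minimal sketch of how one might check the input/output length correlation
# on a log of requests. The field names and the toy data are illustrative
# assumptions, not details taken from the paper.
from scipy.stats import pearsonr, spearmanr

def length_correlation(requests):
    """requests: iterable of dicts with hypothetical 'input_tokens' and
    'output_tokens' counts taken from an application's request log."""
    input_lens = [r["input_tokens"] for r in requests]
    output_lens = [r["output_tokens"] for r in requests]
    return {
        "pearson": pearsonr(input_lens, output_lens)[0],
        "spearman": spearmanr(input_lens, output_lens)[0],
    }

# Toy data: longer prompts tend to yield longer completions.
log = [{"input_tokens": n, "output_tokens": int(0.8 * n) + 5}
       for n in range(20, 200, 10)]
print(length_correlation(log))
```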
A team of AI researchers from China has proposed Magnus, a system that combines application-level and user-level semantic information with the length of the user's input to predict request generation lengths accurately. Magnus consists of four components: a generation length predictor, an adaptive batcher, a serving time estimator, and a batch scheduler. The generation length predictor uses a random forest regressor to estimate a request's generation length from the user input, application-level semantic features, and user-level semantic features. To minimize wasted computation, the adaptive batcher groups requests with similar predicted lengths and chooses an appropriate batch size.
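The sketch below shows what such a random-forest-based length predictor could look like with scikit-learn. The feature construction (input length concatenated with application- and user-level embedding vectors), the embedding dimensions, and the synthetic training data are assumptions for illustration; the paper's exact feature engineering may differ.

```python
# A minimal sketch of a generation length predictor in the spirit of Magnus.
# Features and training data are synthetic and purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def build_features(input_len, app_embedding, user_embedding):
    # Concatenate the input length with semantic feature vectors.
    return np.concatenate(([input_len], app_embedding, user_embedding))

# Toy training data: each row describes one historical request.
rng = np.random.default_rng(0)
X = np.stack([
    build_features(rng.integers(10, 500), rng.normal(size=8), rng.normal(size=8))
    for _ in range(1000)
])
y = 0.8 * X[:, 0] + rng.normal(scale=10, size=1000)  # synthetic generation lengths

predictor = RandomForestRegressor(n_estimators=100, random_state=0)
predictor.fit(X, y)

# Predict the generation length for a new request before batching it.
new_request = build_features(120, rng.normal(size=8), rng.normal(size=8))
print(predictor.predict(new_request.reshape(1, -1)))
```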
To further improve QoS, the batch scheduler selects batches using the highest response ratio next (HRRN) policy, which shortens request queueing times and thus reduces response times, while the serving time estimator uses KNN regression to predict how long each batch will take to serve.
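The following sketch pairs a KNN serving-time estimator with HRRN batch selection, where the response ratio is (waiting time + estimated serving time) / estimated serving time. The `Batch` structure, its features (batch size and maximum predicted generation length), and the historical data are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch of HRRN batch selection driven by a KNN serving-time estimator.
# Batch features and the training history below are illustrative assumptions.
import time
from dataclasses import dataclass, field
from sklearn.neighbors import KNeighborsRegressor

@dataclass
class Batch:
    size: int
    max_predicted_len: int
    enqueue_time: float = field(default_factory=time.time)

# KNN regressor fit on historical (features -> observed serving time) pairs.
history_X = [[4, 128], [8, 256], [16, 512], [8, 128], [4, 512]]
history_y = [0.9, 2.1, 5.0, 1.6, 2.4]  # seconds, synthetic for illustration
estimator = KNeighborsRegressor(n_neighbors=3).fit(history_X, history_y)

def estimate_serving_time(batch: Batch) -> float:
    return float(estimator.predict([[batch.size, batch.max_predicted_len]])[0])

def pick_next_batch(queue: list[Batch]) -> Batch:
    now = time.time()
    def response_ratio(b: Batch) -> float:
        service = estimate_serving_time(b)
        waiting = now - b.enqueue_time
        # HRRN favors batches that have waited long and are quick to serve.
        return (waiting + service) / service
    return max(queue, key=response_ratio)
```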
When a prototype of Magnus was evaluated with ChatGLM-6B instances on NVIDIA V100 GPUs, it showed notable gains over the baselines in serving latency, request throughput, and serving efficiency. Experimental results on the testbed show that, compared with baseline approaches, Magnus increases request throughput by up to 234% and reduces response times by up to 89.7%. These results demonstrate how effectively batch serving in LMaaS can be optimized by exploiting generation length estimates.
Check out the Paper. All credit for this research goes to the researchers of this project.