Top 12 Trending LLM Leaderboards: A Guide to Leading AI Models’ Evaluation

Jun 3, 2024

Here is a guide to twelve trending LLM leaderboards and how each one evaluates leading AI models.

Open LLM Leaderboard

With numerous LLMs and chatbots emerging weekly, it’s challenging to discern genuine advancements from hype. The Open LLM Leaderboard addresses this by using the EleutherAI Language Model Evaluation Harness to benchmark models across six tasks: AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8k. These benchmarks test a range of reasoning and general-knowledge skills. Detailed numerical results and model specifics are available on Hugging Face.
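
As a rough illustration, here is a minimal sketch of how such an evaluation can be reproduced locally with the open-source lm-evaluation-harness; the task names, model ID, and few-shot settings below are assumptions and may differ from the leaderboard’s exact configuration and from your harness version.

```python
# Minimal sketch: evaluating a Hugging Face model on the six Open LLM Leaderboard
# tasks with EleutherAI's lm-evaluation-harness (pip install lm-eval).
# Task names and API details can vary between harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # placeholder model ID
    tasks=["arc_challenge", "hellaswag", "mmlu",
           "truthfulqa_mc2", "winogrande", "gsm8k"],
    num_fewshot=None,   # the leaderboard fixes per-task few-shot counts; None uses task defaults
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```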

MTEB Leaderboard

Text embeddings are often evaluated on a limited set of datasets from a single task, failing to account for their applicability to other tasks like clustering or reranking. This lack of comprehensive evaluation hinders progress tracking in the field. The Massive Text Embedding Benchmark (MTEB) addresses this issue by spanning eight embedding tasks across 58 datasets and 112 languages. Benchmarking 33 models, MTEB offers the most extensive evaluation of text embeddings. The findings reveal that no single text embedding method excels across all tasks, indicating the need for further development toward a universal text embedding method.
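
For illustration, here is a minimal sketch of scoring one embedding model on a single MTEB task with the open-source mteb package; the model ID and task choice are placeholders, and the API may vary between package versions.

```python
# Sketch: scoring one embedding model on a single MTEB task
# (pip install mteb sentence-transformers). Model and task are illustrative only.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")          # placeholder embedding model
evaluation = MTEB(tasks=["Banking77Classification"])     # one of MTEB's classification tasks
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```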

Big Code Models Leaderboard

Inspired by the 🤗 Open LLM Leaderboard, this leaderboard compares multilingual code generation models on the HumanEval and MultiPL-E benchmarks. HumanEval measures functional correctness with 164 Python problems, while MultiPL-E translates these problems into 18 languages. Additionally, throughput is measured on batch sizes of 1 and 50. The evaluation uses original benchmark prompts, specific prompts for base and instruction models, and various evaluation parameters. The average pass@1 score and win rate across languages determine rankings, with memory usage assessed by Optimum-Benchmark.
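
Rankings on HumanEval-style benchmarks rest on the pass@k metric; the snippet below sketches the standard unbiased estimator from the HumanEval paper for n samples per problem, c of which pass the unit tests (the example numbers are made up).

```python
# Unbiased pass@k estimator (Chen et al., 2021): the probability that at least one
# of k samples drawn from n generated solutions (c of them correct) passes the tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples per problem, c = correct samples, k = sampling budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 of them pass; report pass@1 and pass@10.
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```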

SEAL Leaderboards

The SEAL Leaderboards use Elo-scale rankings to compare model performance across datasets. Human evaluators rate model responses to prompts, and these ratings determine which model wins, loses, or ties. Bradley-Terry (BT) coefficients are then fit by maximum likelihood, which amounts to minimizing a binary cross-entropy loss over the pairwise outcomes. Rankings are based on average scores and win rates across multiple languages, with bootstrapping applied to estimate confidence intervals. This methodology ensures comprehensive and reliable evaluation of model performance. Key models are queried through their respective APIs, keeping comparisons up to date and relevant.
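
To make the ranking procedure concrete, here is a minimal sketch (not SEAL’s actual pipeline) of fitting Bradley-Terry coefficients from pairwise preferences by minimizing binary cross-entropy and mapping them onto an Elo-like scale; the model names and battle outcomes are hypothetical, and ties and bootstrapped confidence intervals are omitted.

```python
# Sketch: Bradley-Terry fit from pairwise human preferences via logistic regression.
# Model names and outcomes are made up; ties and bootstrapping are left out.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model_a", "model_b", "model_c"]
battles = [("model_a", "model_b", 1),   # (left, right, 1 if left wins else 0)
           ("model_b", "model_c", 1),
           ("model_c", "model_a", 1),
           ("model_a", "model_b", 1),
           ("model_a", "model_c", 1)]

X = np.zeros((len(battles), len(models)))
y = np.zeros(len(battles))
for i, (left, right, left_wins) in enumerate(battles):
    X[i, models.index(left)] = 1.0
    X[i, models.index(right)] = -1.0
    y[i] = left_wins

# No intercept: P(left wins) = sigmoid(beta_left - beta_right); fitting this logistic
# model by minimizing binary cross-entropy is the Bradley-Terry maximum likelihood fit.
bt = LogisticRegression(fit_intercept=False).fit(X, y)   # default light L2 keeps the fit stable
ratings = 400.0 / np.log(10.0) * bt.coef_[0] + 1000.0    # rescale onto an Elo-like scale
print(dict(zip(models, np.round(ratings, 1))))
```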

Berkeley Function-Calling Leaderboard

The Berkeley Function-Calling Leaderboard (BFCL) evaluates LLMs on their ability to call functions and tools, a critical capability for powering applications like LangChain and AutoGPT. BFCL features a diverse dataset of 2,000 question-function-answer pairs spanning multiple programming languages and scenarios, from simple calls to complex, parallel ones. It measures models’ performance on function relevance detection, execution, and accuracy, with detailed metrics on cost and latency. Current leaders include GPT-4, OpenFunctions-v2, and Mistral-medium. The leaderboard provides insights into models’ strengths and common errors, guiding improvements in function-calling capabilities.
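
As a simplified illustration of what such grading involves, the sketch below parses a model’s emitted call as JSON and compares the function name and arguments against a reference answer; BFCL’s real evaluator is richer (AST matching, executable checks, relevance detection), and the example call is hypothetical.

```python
# Sketch: grading one function-calling example by comparing the model's emitted call
# (as JSON) against a reference call. Purely illustrative; not BFCL's actual checker.
import json

def grade_call(model_output: str, expected: dict) -> bool:
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False                      # unparseable output counts as a failure
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])

expected = {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}'
print(grade_call(model_output, expected))   # True
```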

Occiglot Euro LLM Leaderboard

The Occiglot Euro LLM Leaderboard is a copy of Hugging Face’s Open LLM Leaderboard, extended with translated benchmarks. With numerous LLMs and chatbots emerging weekly, filtering genuine progress from the hype is challenging. The leaderboard evaluates models using a fork of the EleutherAI Language Model Evaluation Harness on five benchmarks: AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA, and Belebele. These benchmarks test models’ performance across diverse tasks and languages. Detailed results and model specifics are available on Hugging Face, and flagged models should be treated with caution.

LMSYS Chatbot Arena Leaderboard

LMSYS Chatbot Arena is a crowdsourced open platform for evaluating LLMs. With over 1,000,000 human pairwise comparisons, models are ranked using the Bradley-Terry model and displayed on an Elo scale. The leaderboard includes 102 models and 1,149,962 votes as of May 27, 2024. New leaderboard categories, such as coding and long user queries, are available for preview. Users can contribute their votes at chat.lmsys.org. Model rankings account for statistical confidence intervals, with the detailed methodology described in the team’s paper.
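
For intuition, here is a minimal sketch of the sequential Elo update that the arena format popularized; the published leaderboard now derives its ratings from a Bradley-Terry fit over all votes, and the K-factor, starting ratings, and model names below are arbitrary choices.

```python
# Sketch: online Elo update for a single pairwise battle. Shown for intuition only;
# the live leaderboard fits Bradley-Terry coefficients over the full vote set.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

ratings = {"model_a": 1000.0, "model_b": 1000.0}   # hypothetical models
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"], 1.0)
print(ratings)
```

Unlike this sequential update, a Bradley-Terry fit does not depend on the order in which votes arrive, which is one reason the full-vote fit is preferred for the published rankings.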

Artificial Analysis LLM Performance Leaderboard

Artificial Analysis benchmarks LLMs on serverless API endpoints, measuring quality and performance from a customer perspective. Serverless endpoints are priced per token, with different rates for input and output tokens. Performance benchmarking includes Time to First Token (TTFT), throughput (tokens per second), and total response time for 100 output tokens. Quality is assessed using a weighted average of normalized scores from MMLU, MT-Bench, and Chatbot Arena Elo Score. Tests are conducted daily on various prompt lengths and load scenarios. Results reflect real-world customer experiences across proprietary and open weights models.
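
A hedged sketch of how TTFT and output speed can be measured against a streaming, OpenAI-compatible chat endpoint follows; the model name is a placeholder, streamed chunks only approximate tokens, and Artificial Analysis runs its own harness with larger prompt and load scenarios.

```python
# Sketch: measuring Time to First Token (TTFT) and streaming output speed against an
# OpenAI-compatible endpoint (pip install openai). Model name is a placeholder.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Summarize what an LLM leaderboard is."}],
    max_tokens=100,       # mirror the 100-output-token scenario described above
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1
total = time.perf_counter() - start

ttft = first_token_at - start
gen_time = max(total - ttft, 1e-6)
print(f"TTFT: {ttft:.2f}s | ~{chunks / gen_time:.1f} chunks/s | total: {total:.2f}s")
```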

Open Medical LLM Leaderboard

The Open Medical LLM Leaderboard tracks, ranks, and evaluates LLMs on medical question-answering tasks. It assesses models using diverse medical datasets, including MedQA (USMLE), PubMedQA, MedMCQA, and MMLU subsets related to medicine and biology. These datasets cover medical aspects like clinical knowledge, anatomy, and genetics, featuring multiple-choice and open-ended questions requiring medical reasoning.

The primary evaluation metric is Accuracy (ACC). Models can be submitted for automated evaluation via the “Submit” page. The leaderboard uses the EleutherAI Language Model Evaluation Harness. GPT-4 and Med-PaLM-2 results are taken from their official papers, with Med-PaLM-2 reported as 5-shot accuracy for comparison. Gemini’s results come from a recent Clinical-NLP (NAACL 2024) paper. More details on datasets and technical information are available on the leaderboard’s “About” page and discussion forum.
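
Since the headline metric is plain accuracy over multiple-choice answers, a tiny sketch with hypothetical predictions and gold labels is enough to show how it is computed.

```python
# Sketch: Accuracy (ACC) on multiple-choice medical questions.
# Predictions and reference answers below are made up for illustration.
def accuracy(predictions, references):
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["B", "C", "A", "D", "C"]    # model-chosen options (hypothetical)
golds = ["B", "C", "B", "D", "C"]    # gold answers (hypothetical)
print(f"ACC = {accuracy(preds, golds):.2f}")   # 0.80
```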

Hughes Hallucination Evaluation Model (HHEM) Leaderboard

The Hughes Hallucination Evaluation Model (HHEM) Leaderboard evaluates the frequency of hallucinations in document summaries generated by LLMs. Hallucinations are instances where a model introduces factually incorrect or unrelated content in its summaries. Using Vectara’s HHEM, the leaderboard assigns a hallucination score from 0 to 1, based on 1006 documents from datasets like CNN/Daily Mail Corpus. Metrics include Hallucination Rate (percentage of summaries scoring below 0.5), Factual Consistency Rate, Answer Rate (non-empty summaries), and Average Summary Length. Models not hosted on Hugging Face, such as GPT variants, are evaluated and uploaded by the HHEM team. 
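
To show how the aggregate metrics relate to the per-summary scores, here is a small sketch using made-up HHEM scores and summaries; only the 0.5 hallucination threshold and the metric definitions come from the leaderboard description above.

```python
# Sketch: deriving the leaderboard's aggregate metrics from per-summary HHEM
# consistency scores. All scores and summaries are made up; HHEM assigns each
# document/summary pair a score in [0, 1], and scores below 0.5 count as hallucinations.
summaries = ["summary A", "", "summary C text", "summary D"]   # "" = model returned no summary
scores    = [0.93, None, 0.12, 0.76]                           # hypothetical HHEM scores

answered = [(s, t) for s, t in zip(scores, summaries) if t.strip()]
hallucination_rate = sum(s < 0.5 for s, _ in answered) / len(answered)

print(f"Answer rate:              {len(answered) / len(summaries):.0%}")
print(f"Hallucination rate:       {hallucination_rate:.0%}")
print(f"Factual consistency rate: {1 - hallucination_rate:.0%}")
print(f"Avg summary length:       {sum(len(t.split()) for _, t in answered) / len(answered):.1f} words")
```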

OpenVLM Leaderboard

This platform presents evaluation results for 63 Vision-Language Models (VLMs) obtained with the open-source framework VLMEvalKit. Covering 23 multimodal benchmarks, the leaderboard includes models like GPT-4V, Gemini, QwenVLPlus, and LLaVA, and was last updated on May 27, 2024.

Metrics:

  • Avg Score: The average score across all VLM Benchmarks (normalized to 0-100; higher is better).
  • Avg Rank: The average rank across all VLM Benchmarks (lower is better).

The main results are based on eight benchmarks: MMBench_V11, MMStar, MMMU_VAL, MathVista, OCRBench, AI2D, HallusionBench, and MMVet. Subsequent tabs provide detailed evaluation results for each dataset. 
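
As a small illustration of the two headline metrics, the sketch below computes an average normalized score and an average rank from a made-up per-benchmark table; the real leaderboard aggregates 23 benchmarks for 63 models.

```python
# Sketch: deriving Avg Score and Avg Rank from a per-benchmark score table.
# Numbers and model names are made up; scores are normalized to 0-100.
import pandas as pd

scores = pd.DataFrame(
    {"MMBench_V11": [80.1, 74.3, 69.8],
     "MMStar":      [56.0, 49.2, 45.7],
     "OCRBench":    [65.6, 61.2, 51.0]},
    index=["VLM-A", "VLM-B", "VLM-C"],   # hypothetical models
)

avg_score = scores.mean(axis=1)                               # higher is better
avg_rank = scores.rank(axis=0, ascending=False).mean(axis=1)  # lower is better
print(pd.DataFrame({"Avg Score": avg_score.round(1), "Avg Rank": avg_rank.round(2)}))
```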

🤗 LLM-Perf Leaderboard 🏋

The 🤗 LLM-Perf Leaderboard benchmarks LLMs in latency, throughput, memory, and energy consumption across various hardware, backends, and optimizations using Optimum-Benchmark. Community members can request evaluations for new base models through the 🤗 Open LLM Leaderboard and hardware/backend/optimization configurations via the 🤗 LLM-Perf Leaderboard or Optimum-Benchmark repository.

Evaluations use a single GPU to ensure consistency, with each LLM run at a batch size of 1 on a 256-token prompt, generating 64 tokens over at least ten iterations and 10 seconds. Energy consumption is measured in kWh using CodeCarbon, and memory metrics include Max Allocated Memory, Max Reserved Memory, and Max Used Memory. All benchmarks are performed with the benchmark_cuda_pytorch.py script to guarantee reproducibility.
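
For orientation, here is a hedged sketch of the measurement loop described above using plain transformers, PyTorch CUDA memory counters, and CodeCarbon; the model ID is a placeholder, and the leaderboard itself relies on Optimum-Benchmark rather than an ad-hoc script like this.

```python
# Sketch: batch size 1, a 256-token dummy prompt, 64 generated tokens, with decoding
# throughput, peak CUDA memory, and energy tracked via CodeCarbon. Placeholder model;
# the leaderboard uses Optimum-Benchmark (benchmark_cuda_pytorch.py) for its runs.
import time
import torch
from codecarbon import EmissionsTracker
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda").eval()

prompt = torch.randint(0, tokenizer.vocab_size, (1, 256), device="cuda")  # dummy 256-token prompt
torch.cuda.reset_peak_memory_stats()

tracker = EmissionsTracker()  # logs energy consumption (kWh) to CodeCarbon's output file
tracker.start()
start = time.perf_counter()
with torch.no_grad():
    model.generate(prompt, max_new_tokens=64, min_new_tokens=64, do_sample=False)
elapsed = time.perf_counter() - start
tracker.stop()

print(f"Decode throughput:    {64 / elapsed:.1f} tokens/s")
print(f"Max allocated memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```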
