Tailor-made LLM evaluations: custom evaluations for your LLM

cohenlinoy's Collections

updated Jul 22

A collection of articles and resources on automatic evaluation of LLMs and their role as unbiased judges in assessing other LLMs' outputs.

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Paper • 2306.05685 • Published Jun 9, 2023 • 29
Generative Judge for Evaluating Alignment

Paper • 2310.05470 • Published Oct 9, 2023 • 1
Humans or LLMs as the Judge? A Study on Judgement Biases

Paper • 2402.10669 • Published Feb 16
JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Paper • 2310.17631 • Published Oct 26, 2023 • 32
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Paper • 2310.08491 • Published Oct 12, 2023 • 53
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

Paper • 2402.14016 • Published Feb 21
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Paper • 2406.12624 • Published Jun 18 • 36
Calibrating LLM-Based Evaluator

Paper • 2309.13308 • Published Sep 23, 2023 • 11
BAAI/JudgeLM-33B-v1.0

Text Generation • Updated Oct 28, 2023 • 660 • 23
BAAI/JudgeLM-13B-v1.0

Text Generation • Updated Oct 27, 2023 • 30 • 5
BAAI/JudgeLM-7B-v1.0

Text Generation • Updated Oct 27, 2023 • 2.86k • 13
Benchmarking Cognitive Biases in Large Language Models as Evaluators

Paper • 2309.17012 • Published Sep 29, 2023 • 1
Evaluating Large Language Models: A Comprehensive Survey

Paper • 2310.19736 • Published Oct 30, 2023 • 2
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models

Paper • 2305.13711 • Published May 23, 2023 • 2
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Paper • 2303.16634 • Published Mar 29, 2023 • 3
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers

Paper • 2403.02839 • Published Mar 5 • 1
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Paper • 2405.01535 • Published May 2 • 116