# llm-as-a-judge

- Agenta-AI/agenta: The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place. (Python, 3.16k stars, updated 3 days ago)
- prometheus-eval (Python, 986 stars, updated 5 months ago)
- metauto-ai: ⚖️ The First Coding Agent-as-a-Judge. (Python, 627 stars, updated 4 months ago)
- haizelabs (Jupyter Notebook, 295 stars, updated 15 days ago)
- IAAR-Shanghai (Python, 177 stars, updated 7 months ago)
- IAAR-Shanghai (Python, 128 stars, updated 5 months ago)
- martin-wey: CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025). (Python, 72 stars, updated 1 year ago)
- MJ-Bench: Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?" (Jupyter Notebook, 47 stars, updated 3 months ago)
- lupantech: Solving Inequality Proofs with Large Language Models. (Python, 44 stars, updated 20 days ago)
- whitecircle-ai: First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards). (Python, 41 stars, updated 2 months ago)
- docling-project: A set of tools to create synthetically generated data from documents. (Python, 27 stars, updated 1 month ago)
- zhaochen0110: Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024). (Python, 24 stars, updated 1 year ago)
- minnesotanlp: Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators". (Jupyter Notebook, 21 stars, updated 2 years ago)
- PKU-ONELab: The official repository for our EMNLP 2024 paper "Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability". (Python, 20 stars, updated 7 months ago)
- OussamaSghaier: Harnessing Large Language Models for Curated Code Reviews. (Python, 15 stars, updated 6 months ago)
- aws-samples: A set of examples demonstrating how to evaluate Generative-AI-augmented systems using traditional information retrieval and LLM-As-A-Judge validation techniques. (Jupyter Notebook, 9 stars, updated 1 year ago)
- PKU-ONELab: The official repository for our ACL 2024 paper "Are LLM-based Evaluators Confusing NLG Quality Criteria?" (Python, 8 stars, updated 7 months ago)
- HillPhelmuth (C#, 8 stars, updated 25 days ago)
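
The repositories above all build on the same basic pattern: prompting one model to grade the output of another. Below is a minimal, illustrative Python sketch of that LLM-as-a-judge loop. It is not taken from any of the projects listed; `judge_model` is a hypothetical callable standing in for whatever LLM client you actually use.

```python
# Minimal LLM-as-a-judge sketch (illustrative only, not tied to any repository above).
import json
import re
from typing import Callable

JUDGE_PROMPT = """You are an impartial evaluator.

Question:
{question}

Candidate answer:
{answer}

Rate the answer for correctness and helpfulness on a 1-5 scale.
Reply with JSON only, e.g. {{"score": 4, "reason": "..."}}."""


def judge_answer(question: str, answer: str,
                 judge_model: Callable[[str], str]) -> dict:
    """Send one question/answer pair to a judge LLM and parse its JSON verdict."""
    reply = judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # tolerate extra prose around the JSON
    if not match:
        return {"score": None, "reason": "unparseable judge reply"}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"score": None, "reason": "invalid JSON from judge"}


if __name__ == "__main__":
    # Stub judge so the sketch runs without an API key; swap in a real LLM call.
    fake_judge = lambda prompt: '{"score": 5, "reason": "Accurate and concise."}'
    print(judge_answer("What is 2 + 2?", "4", fake_judge))
```

Frameworks like the ones listed above typically layer rubric templates, pairwise comparison, bias checks, and retry/validation logic on top of this core loop.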