llm-as-a-judge

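The repositories on this page share one pattern: a second LLM call scores a candidate response against a rubric. Below is a minimal sketch of that pattern; `call_judge_model` is a hypothetical placeholder for whichever LLM client you use, and the rubric, JSON schema, and 1-5 scale are illustrative defaults, not any specific project's API.

```python
import json

# Illustrative rubric; real projects ship their own prompts and scales.
JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE to the QUESTION
for factual accuracy and helpfulness on a 1-5 scale.
Reply with JSON only: {{"score": <integer 1-5>, "rationale": "<one sentence>"}}

QUESTION: {question}
RESPONSE: {response}
"""

def call_judge_model(prompt: str) -> str:
    """Hypothetical placeholder: replace with a call to your LLM client
    (hosted API, vLLM server, local model, ...) returning the raw completion."""
    raise NotImplementedError

def judge(question: str, response: str) -> dict:
    """Score a single (question, response) pair with an LLM judge."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, response=response))
    verdict = json.loads(raw)  # expects the JSON shape requested in the prompt
    if not 1 <= verdict["score"] <= 5:
        raise ValueError(f"judge returned out-of-range score: {verdict['score']}")
    return verdict
```

Frameworks listed below (Agenta, verdict, prometheus-eval, and others) wrap this loop with prompt management, calibration, and batched evaluation, but the core judge call looks roughly like this.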
Agenta-AI / agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

llm-tools · prompt-engineering · prompt-management · llm-evaluation · llm-framework · rag-evaluation · llm-observability · llm-as-a-judge · llm-monitoring · llm-platform · llm-playground · llmops-platform
Python · 2.83k
2 days ago
prometheus-eval / prometheus-eval

Evaluate your LLM's response with Prometheus and GPT4 💯

evaluation · LLM · llmops · Python · vllm · gpt4 · llm-as-a-judge
Python · 954
2 months ago
metauto-ai / agent-as-a-judge

⚖️ The First Coding Agent-as-a-Judge

llm-as-a-judge · LLM
Python · 550
1 month ago
haizelabs / verdict

Scale your LLM-as-a-judge.

LLM · llm-as-a-judge
Jupyter Notebook · 239
9 days ago
IAAR-Shanghai / xFinder

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

evaluation · gpt · LLM · large-language-models · Regular expression · reliability · benchmark · dataset · chatglm · phi · qwen · llm-as-a-judge
Python · 173
4 months ago
IAAR-Shanghai / xVerify

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

llm-as-a-judge · benchmark · evaluation · Regular expression · reliability · ChatGPT · LLM · open-r1
Python · 109
2 months ago
martin-wey / CodeUltraFeedback

CodeUltraFeedback: aligning large language models to coding preferences

alignment · code-generation · dpo · large-language-models · llm-as-a-judge
Python · 71
1 year ago
KID-22 / LLM-IR-Bias-Fairness-Survey

The repository for the survey of Bias and Fairness in IR with LLMs.

bias · fairness · information-retrieval · large-language-models · recommender-systems · ChatGPT · LLM · llm-as-a-judge
53
2 months ago
MJ-Bench / MJ-Bench

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

llm-as-a-judge
Jupyter Notebook · 44
13 days ago
whitecircle-ai / circle-guard-bench

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

AI · benchmark · LLM · large-language-models · llm-eval · llm-evaluation · guardrails · benchmarking · guardrail · jailbreak · llm-as-a-judge · llm-security
Python · 38
10 days ago
zhaochen0110 / Timo

Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

llm-as-a-judge · LLM · rlhf
Python · 21
8 months ago
lupantech / ineqmath

Solving Inequality Proofs with Large Language Models.

llm-as-a-judge · LLM · theorem-proving
Python · 20
3 days ago
PKU-ONELab / Themis

The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.

evaluation · llm-as-a-judge · nlg
Python · 20
4 months ago
minnesotanlp / cobbler

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

bias · evaluation · LLM · NLP · bias-detection · llm-as-a-judge · llm-evaluation
Jupyter Notebook · 20
1 year ago
docling-project / docling-sdg

A set of tools to create synthetically-generated data from documents

AI · documents · llm-as-a-judge · question-answering · sdg
Python · 15
6 days ago
OussamaSghaier / CuREV

Harnessing Large Language Models for Curated Code Reviews

code-review · large-language-models · llm-as-a-judge
Python · 13
3 months ago
root-signals / rs-python-sdk

Root Signals Python SDK

evaluation · LLM · llm-as-a-judge · observability · evals
Python · 12
4 days ago
UMass-Meta-LLM-Eval / llm_eval

A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.

large-language-models · llm-as-a-judge · NLP
Python · 8
8 months ago
aws-samples / genai-system-evaluation

A set of examples demonstrating how to evaluate Generative AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques

genai · generative-ai · information-retrieval · llm-as-a-judge · llm-evaluation
Jupyter Notebook · 8
9 months ago
PKU-ONELab / LLM-evaluator-reliability

The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?

evaluation · llm-as-a-judge · nlg
Python · 7
4 months ago