# llm-as-a-judge

- Agenta-AI/agenta: The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place. (Python, 3.16k stars, updated 3 days ago)
- prometheus-eval (Python, 986 stars, updated 5 months ago)
- metauto-ai: ⚖️ The First Coding Agent-as-a-Judge. (Python, 627 stars, updated 4 months ago)
- haizelabs (Jupyter Notebook, 295 stars, updated 15 days ago)
- IAAR-Shanghai (Python, 177 stars, updated 7 months ago)
- IAAR-Shanghai (Python, 128 stars, updated 5 months ago)
- martin-wey: CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025). (Python, 72 stars, updated 1 year ago)
- MJ-Bench: Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?" (Jupyter Notebook, 47 stars, updated 3 months ago)
- lupantech: Solving Inequality Proofs with Large Language Models. (Python, 44 stars, updated 20 days ago)
- whitecircle-ai: First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards). (Python, 41 stars, updated 2 months ago)
- docling-project: A set of tools to create synthetically generated data from documents. (Python, 27 stars, updated 1 month ago)
- zhaochen0110: Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024). (Python, 24 stars, updated 1 year ago)
- minnesotanlp: Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators". (Jupyter Notebook, 21 stars, updated 2 years ago)
- PKU-ONELab: The official repository for our EMNLP 2024 paper "Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability". (Python, 20 stars, updated 7 months ago)
- OussamaSghaier: Harnessing Large Language Models for Curated Code Reviews. (Python, 15 stars, updated 6 months ago)
- aws-samples: A set of examples demonstrating how to evaluate Generative-AI-augmented systems using traditional information retrieval and LLM-As-A-Judge validation techniques. (Jupyter Notebook, 9 stars, updated 1 year ago)
- PKU-ONELab: The official repository for our ACL 2024 paper "Are LLM-based Evaluators Confusing NLG Quality Criteria?" (Python, 8 stars, updated 7 months ago)
- HillPhelmuth (C#, 8 stars, updated 25 days ago)
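
The repositories above all build on the same basic pattern: prompting one model to grade the output of another. Below is a minimal, illustrative Python sketch of that LLM-as-a-judge loop. It is not taken from any of the projects listed; `judge_model` is a hypothetical callable standing in for whatever LLM client you actually use.

```python
# Minimal LLM-as-a-judge sketch (illustrative only, not tied to any repository above).
import json
import re
from typing import Callable

JUDGE_PROMPT = """You are an impartial evaluator.

Question:
{question}

Candidate answer:
{answer}

Rate the answer for correctness and helpfulness on a 1-5 scale.
Reply with JSON only, e.g. {{"score": 4, "reason": "..."}}."""


def judge_answer(question: str, answer: str,
                 judge_model: Callable[[str], str]) -> dict:
    """Send one question/answer pair to a judge LLM and parse its JSON verdict."""
    reply = judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # tolerate extra prose around the JSON
    if not match:
        return {"score": None, "reason": "unparseable judge reply"}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"score": None, "reason": "invalid JSON from judge"}


if __name__ == "__main__":
    # Stub judge so the sketch runs without an API key; swap in a real LLM call.
    fake_judge = lambda prompt: '{"score": 5, "reason": "Accurate and concise."}'
    print(judge_answer("What is 2 + 2?", "4", fake_judge))
```

Frameworks like the ones listed above typically layer rubric templates, pairwise comparison, bias checks, and retry/validation logic on top of this core loop.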