The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Evaluate your LLM's responses with Prometheus and GPT-4 💯
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
CodeUltraFeedback: aligning large language models to coding preferences
The repository for the survey on Bias and Fairness in IR with LLMs.
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Harnessing Large Language Models for Curated Code Reviews
Root Signals Python SDK
A set of examples demonstrating how to evaluate Generative AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques (see the sketch after this list)
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup, revealing new results about its strengths and weaknesses.
A set of tools to create synthetically generated data from documents
The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?
LLM-as-a-judge evals as Semantic Kernel plugins
Model Context Protocol (MCP) server for the Root Signals Evaluation Platform
Controversial Questions for Argumentation and Retrieval
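Several of the entries above (the IR + LLM-as-a-judge examples, the Semantic Kernel plugins, Prometheus, Themis) revolve around the same LLM-as-a-judge pattern: prompt a judge model with a question, a reference, and a candidate answer, then parse a score out of its reply. The sketch below is a minimal, generic illustration of that pattern only; it assumes the OpenAI Python SDK with an `OPENAI_API_KEY` in the environment, and the `gpt-4o-mini` model name, the 1-5 rubric, and the `judge` helper are illustrative choices, not the API of any repository listed here.

```python
# Minimal LLM-as-a-judge sketch. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; the judge model name and the 1-5
# rubric below are illustrative, not tied to any repo in this list.
import re
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Rate the candidate answer from 1 (wrong or unhelpful) to 5 (fully correct
and complete) against the reference. Reply with the rating first, e.g. "4 - ...".
"""


def judge(question: str, reference: str, candidate: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to grade a candidate answer; return the 1-5 score."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    text = response.choices[0].message.content
    match = re.search(r"[1-5]", text)  # pull the first rating digit out of the reply
    if match is None:
        raise ValueError(f"Judge reply had no 1-5 score: {text!r}")
    return int(match.group())


if __name__ == "__main__":
    score = judge(
        question="What year was the EMNLP 2024 conference held?",
        reference="2024",
        candidate="EMNLP 2024 was held in 2024.",
    )
    print("judge score:", score)
```

Pinning the temperature to 0 and extracting only the leading digit keeps scores easy to aggregate across a test set; the evaluator models listed above (Prometheus, xFinder, xVerify, Themis) replace the general-purpose judge with fine-tuned models to make this grading step more reliable.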