A framework for few-shot evaluation of language models.
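This tagline matches EleutherAI's lm-evaluation-harness; assuming that is the project being described, the sketch below shows roughly how a few-shot run looks through its Python entry point. The model id, task name, and shot count are illustrative, and exact argument names should be checked against the library's documentation.

```python
# Minimal sketch, assuming EleutherAI's lm-evaluation-harness (`pip install lm-eval`).
# Model, task, and shot count below are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any HF model id
    tasks=["hellaswag"],           # benchmark task(s) to run
    num_fewshot=5,                 # number of in-context examples per prompt
)
print(results["results"])          # per-task metric scores
```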
DeepEval is an evaluation framework for large language models, built for evaluating and testing LLM systems. It is similar to Pytest, but focused on unit testing LLM outputs.
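To make the Pytest comparison concrete, here is a minimal sketch in the style of DeepEval's test-case API; the example strings and the 0.7 threshold are arbitrary, and the relevancy metric calls an LLM judge under the hood, so a model backend must be configured.

```python
# Minimal sketch of a Pytest-style DeepEval test (`pip install deepeval`).
# Strings and threshold are illustrative; AnswerRelevancyMetric uses an LLM judge.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    )
    metric = AnswerRelevancyMetric(threshold=0.7)  # passes if the score is >= 0.7
    assert_test(test_case, [metric])               # run via: deepeval test run <file>.py
```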
#LLM#Test your prompts, agents, and RAGs. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with comma...
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
#Computer Science#This is the repository of our article published in RecSys 2019, "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches", and of several follow-up studies.
Data-Driven Evaluation for LLM-Powered Applications
#LLM#AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
#LLM#Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
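As a hedged illustration of what such RAG response metrics look like in code, the sketch below uses the Ragas library's older Dataset-based API (which may or may not be the project this entry refers to); the sample strings are made up, and `evaluate` calls an LLM judge, so an API key for the configured backend is required.

```python
# Minimal sketch, assuming Ragas ~0.1 (`pip install ragas datasets`); not necessarily
# the project described above. Sample data is illustrative only.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris has been the capital of France since 508 AD."]],
})
scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)  # aggregate faithfulness and answer relevancy scores
```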
Python SDK for running evaluations on LLM generated responses
#LLM#Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
The official evaluation suite and dynamic data release for MixEval.
A research library for automating experiments on Deep Graph Networks
#Computer Science#AI Data Management & Evaluation Platform
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.
#LLM#A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
#LLM#Test and evaluate LLMs and model configurations across all the scenarios that matter for your application.
#NLP#Multilingual Large Language Models Evaluation Benchmark
Evaluation suite for large-scale language models.