evaluation-framework · GitHub Topics

EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

evaluation-framework language-model transformer

Python 9.69 k

1 day ago

confident-ai / deepeval

DeepEval is an LLM evaluation framework designed for evaluating and testing LLM systems. It is similar to Pytest, but specialized for unit testing LLM outputs.

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Python 9.66 k

6 hours ago

promptfoo / promptfoo

#LLM# Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with comma...

llm prompt-engineering prompts llmops prompt-testing Testing rag evaluation evaluation-framework llm-eval llm-evaluation llm-evaluation-framework continuous-integration CI/CD pentesting red-teaming vulnerability-scanners

TypeScript 7.8 k

8 hours ago

huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

evaluation evaluation-framework evaluation-metrics huggingface

Python 1.76 k

3 days ago

MaurizioFD / RecSys2019_DeepLearning_Evaluation

#Computer science# This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.

recommender-system recommendation-system recommendation-algorithms deep-learning evaluation-framework neural-network collaborative-filtering content-based-recommendation hybrid-recommender-system reproducibility reproducible-research knn matrix-factorization

Python 987

2 years ago

relari-ai / continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

evaluation-framework evaluation-metrics information-retrieval llm-evaluation llmops rag retrieval-augmented-generation

Python 501

6 months ago

ServiceNow / AgentLab

#LLM# AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

agents benchmark evaluation-framework llm llm-agents prompting agent lab

Python 368

3 days ago

TonicAI / tonic_validate

#LLM# Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.

evaluation-metrics large-language-models llm llmops rag retrieval-augmented-generation evaluation-framework

Python 315

21 days ago

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-framework evaluation-metrics llm-eval llm-evaluation llm-ops llmops

Python 289

2 months ago

aiverify-foundation / moonshot

#LLM# Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

benchmarking evaluation-framework llm red-teaming

Python 261

8 days ago

JinjieNi / MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation evaluation-framework foundation-models llm large-language-models large-multimodal-models llm-evaluation llm-evaluation-framework llm-inference

Python 242

9 months ago

diningphil / PyDGN

A research library for automating experiments on Deep Graph Networks

evaluation-framework

Python 223

1 year ago

zeno-ml / zeno

#Computer science# AI Data Management & Evaluation Platform

data-science machine-learning Python artificial-intelligence evaluation evaluation-framework

Svelte 214

2 years ago

symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

evaluation evaluation-framework llm software-quality software-engineering

Go 179

3 months ago

lartpang / PySODEvalToolkit

PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection

Python monitoring metrics-visualization saliency saliency-detection salient-object-detection LaTeX evaluation evaluation-metrics evaluation-framework evaluator camouflaged-object-detection

Python 178

10 months ago

bijington / expressive

Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.

evaluation evaluation-framework Parsing cross-platform expression-evaluator expression-parser netstandard Xamarin Hacktoberfest

C# 172

10 months ago