
llm-evaluation-framework

confident-ai / deepeval

The LLM Evaluation Framework

Tags: evaluation-metrics, evaluation-framework, llm-evaluation, llm-evaluation-framework, llm-evaluation-metrics
Python · 8k stars · updated 3 days ago
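
For orientation, here is a minimal sketch of a deepeval check. It assumes `pip install deepeval` and a configured OpenAI API key (deepeval's built-in metrics use an LLM judge); the question/answer strings are invented placeholders.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case: the input your app received and the output it produced
# (both strings are made-up placeholders).
test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days of purchase.",
)

# Scores how relevant the output is to the input; fails below the threshold.
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric over the test case and prints a pass/fail report.
evaluate(test_cases=[test_case], metrics=[metric])
```
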
https://static.github-zh.com/github_avatars/promptfoo?size=40
promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command ...

Tags: LLM, prompt-engineering, prompts, llmops, prompt-testing, testing, rag, evaluation, evaluation-framework, llm-eval, llm-evaluation, llm-evaluation-framework, continuous-integration, CI/CD, pentesting, red-teaming, vulnerability-scanners
TypeScript · 7.2k stars · updated 17 hours ago
msoedov / agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

Tags: llm-security, ai-red-team, llm-evaluation, llm-evaluation-framework, prompt-testing, agent-framework
Python · 1.47k stars · updated 5 days ago
JinjieNi / MixEval

The official evaluation suite and dynamic data release for MixEval.

Tags: benchmark, evaluation, evaluation-framework, foundation-models, LLM, large-language-models, large-multimodal-models, llm-evaluation, llm-evaluation-framework, llm-inference
Python · 242 stars · updated 7 months ago
cvs-health / langfair

LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.

Tags: artificial-intelligence, bias, bias-detection, fairness, fairness-ai, fairness-ml, fairness-testing, large-language-models, LLM, responsible-ai, Python, ai-safety, llm-evaluation, llm-evaluation-framework, llm-evaluation-metrics
Python · 215 stars · updated 4 days ago
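
To make "use-case-level bias assessment" concrete, here is a generic counterfactual probe in plain Python. This illustrates the technique only and is not LangFair's API; `generate`, `score`, and the prompt template are stand-ins you would supply.

```python
from typing import Callable

def counterfactual_gap(
    generate: Callable[[str], str],  # your LLM call (stand-in)
    score: Callable[[str], float],   # any scalar scorer, e.g. sentiment (stand-in)
    template: str,                   # prompt with a {group} placeholder
    group_a: str,
    group_b: str,
) -> float:
    """Score gap between outputs for two prompts that differ only in group."""
    output_a = generate(template.format(group=group_a))
    output_b = generate(template.format(group=group_b))
    return abs(score(output_a) - score(output_b))

# Hypothetical usage: a consistently large gap across many templates
# signals counterfactual bias for this use case.
# gap = counterfactual_gap(
#     my_llm, my_sentiment_scorer,
#     "Write a short reference letter for a {group} engineer.",
#     "male", "female",
# )
```
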
parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Tags: LLM, llm-evaluation, llm-tools, llmops, llm-eval, llm-evaluation-framework, prompt-engineering, generative-ai, good-first-issue, monitoring
Python · 78 stars · updated 4 months ago
Addepto / contextcheck

MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

Tags: LLM, llm-evaluation, rag, testing, chatbot-framework, open-source, ai-chat, ai-testing-tool, large-language-models, continuous-integration, llm-evaluation-framework
Python · 74 stars · updated 6 months ago
multinear / multinear

Develop reliable AI apps.

Tags: evaluation, LLM, reliability, llm-eval, llm-evaluation, llm-evaluation-framework
Svelte · 39 stars · updated 2 months ago
flexpa / llm-fhir-eval

Benchmarking Large Language Models for FHIR.

Tags: evals, fhir, LLM, llm-evaluation-framework
TypeScript · 37 stars · updated 16 days ago
zhuohaoyu / KIEval

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models.

Tags: explainable-ai, LLM, llm-evaluation, llm-evaluation-framework, llm-evaluation-metrics, machine-learning
Python · 36 stars · updated 1 year ago
aws-samples / fm-leaderboarder

FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.

Tags: llm-evaluation, llm-evaluation-framework
Python · 18 stars · updated 8 months ago
honeyhiveai / realign

Realign is a testing and simulation framework for AI applications.

Tags: artificial-intelligence, alignment, evaluation, LLM, prompt-engineering, red-teaming, simulation, llm-eval, llm-evaluation, llm-evaluation-framework, llmops, rag
Python · 16 stars · updated 6 months ago
Networks-Learning / prediction-powered-ranking

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

Tags: llm-eval, llm-evaluation, llm-evaluation-framework, ranking-algorithm
Jupyter Notebook · 9 stars · updated 8 months ago
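
For context, prediction-powered ranking builds on the prediction-powered inference (PPI) estimator: many cheap LLM-judge labels are debiased using a small set of human labels. A minimal sketch of that base estimator follows (general PPI, not this repository's code; all numbers are invented).

```python
import numpy as np

def ppi_mean(y_labeled, yhat_labeled, yhat_unlabeled):
    """Prediction-powered estimate of E[Y]: the mean of cheap predictions on a
    large unlabeled set, plus a bias correction measured on a small gold set.
    yhat_labeled must be the predictor's output on the same items as y_labeled."""
    y = np.asarray(y_labeled, dtype=float)
    f = np.asarray(yhat_labeled, dtype=float)
    f_u = np.asarray(yhat_unlabeled, dtype=float)
    return f_u.mean() + (y - f).mean()

# Invented example: estimate a model's pairwise win rate from 50 human-judged
# comparisons plus 5,000 LLM-judged ones.
rng = np.random.default_rng(0)
human = rng.binomial(1, 0.62, size=50)          # gold labels Y
judge_small = rng.binomial(1, 0.58, size=50)    # judge's calls on the gold items
judge_large = rng.binomial(1, 0.57, size=5000)  # judge's calls on unlabeled items
print(ppi_mean(human, judge_small, judge_large))
```
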
pyladiesams / eval-llm-based-apps-jan2025

Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

Tags: LLM, llmops, workshop, llm-eval, llm-evaluation-framework, llm-evaluation-metrics, llm-monitoring
Jupyter Notebook · 7 stars · updated 1 month ago
yukinagae / genkitx-promptfoo

Community plugin for Genkit to use Promptfoo.

Tags: artificial-intelligence, evaluation, evaluation-framework, Firebase, genkit, LLM, llm-eval, llm-evaluation, llm-evaluation-framework, llmops, plugin, prompt, prompt-testing, prompts, testing
TypeScript · 4 stars · updated 5 months ago
parea-ai / parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Tags: LLM, llm-evaluation, llm-evaluation-framework, llm-tools, llm-eval, prompt-engineering
TypeScript · 4 stars · updated 5 months ago
stair-lab / melt

Multilingual Evaluation Toolkits

Tags: llm-evaluation-framework, multilingual
Python · 4 stars · updated 7 months ago
yuzu-ai / ShinRakuda

Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across div...

Tags: LLM, llm-eval, llm-evaluation, llm-evaluation-framework, japanese
Python · 3 stars · updated 9 months ago
ronniross / llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by large language models.

Tags: LLM, llm-evaluation, llm-evaluation-framework, llm-evaluation-metrics, llm-training, dataset
Python · 2 stars · updated 21 days ago
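
As background on what such a confidence measure can look like, one common proxy is the mean token log-probability of the generated answer, exponentiated to land in [0, 1]. A generic sketch (not this repository's code; the log-prob values are made up):

```python
import math

def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Exponentiated average log-probability, i.e. the geometric mean of the
    per-token probabilities; values near 1.0 indicate high model confidence."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Made-up per-token log-probs, e.g. as returned by an LLM API with logprobs enabled:
print(mean_logprob_confidence([-0.05, -0.20, -0.10, -0.60]))  # ~0.79
```
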
jaaack-wang / multi-problem-eval-llm

Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities.

Tags: explainable-ai, large-language-models, LLM, llm-eval, llm-evaluation-framework, llm-prompting
Jupyter Notebook · 2 stars · updated 1 year ago