DeepEval is an LLM evaluation framework designed for evaluating and testing large language model systems. It is similar to Pytest, but specializes in unit testing LLM outputs.
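A minimal sketch of that Pytest-style workflow, assuming DeepEval's documented `LLMTestCase`, `AnswerRelevancyMetric`, and `assert_test` API; the hard-coded output and the 0.7 threshold are placeholder examples, not part of the original description:

```python
# Minimal DeepEval-style unit test for an LLM output.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_refund_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # In practice, actual_output would come from your LLM application.
        actual_output="You can return them within 30 days for a full refund.",
    )
    metric = AnswerRelevancyMetric(threshold=0.7)
    # Fails the test, like a Pytest assertion, if the relevancy score is below 0.7.
    assert_test(test_case, [metric])
```

The test runs like any other Pytest test, so it slots into an existing test suite.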
The one-stop repository for large language model (LLM) unlearning. Supports TOFU, MUSE, WMDP, and many unlearning methods. All features (benchmarks, methods, evaluations, models, etc.) are easily extensible.
#LLM#LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments
#LLM#[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance statistics and model info. All in a single Bash shell script.
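A rough Python sketch of the same workflow (the repo itself is a single Bash script), assuming Ollama's local REST API at http://localhost:11434 with its /api/tags and /api/generate endpoints; the prompt text is an arbitrary example:

```python
# Send one prompt to every locally installed Ollama model and print each
# response with basic timing stats (Python sketch; the repo uses Bash).
import requests

OLLAMA = "http://localhost:11434"
PROMPT = "Summarize what an LLM evaluation framework does in one sentence."

# /api/tags lists the models currently available to the local Ollama server.
models = [m["name"] for m in requests.get(f"{OLLAMA}/api/tags").json()["models"]]

for name in models:
    resp = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": name, "prompt": PROMPT, "stream": False},
    ).json()
    # Ollama reports total_duration in nanoseconds and eval_count in tokens.
    seconds = resp.get("total_duration", 0) / 1e9
    print(f"=== {name}: {seconds:.1f}s, {resp.get('eval_count', 0)} tokens ===")
    print(resp.get("response", ""))
```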
#LLM#Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
#Data Warehouse#Estimates a confidence measure that outputs generated by large language models are not hallucinations.
This repo contains a Streamlit application that provides a user-friendly interface for evaluating large language models (LLMs) using the beyondllm package.
#LLM#Tools for systematic large language model evaluations
Evaluates LLM responses and measures their accuracy.