# llm-evaluation

mlflow/mlflow

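#Computer Science# MLflow is an open-source framework designed to manage the entire machine learning lifecycle. It can train and serve models on different platforms, letting you use the same set of tools whether experiments run locally on your computer, on a remote compute target, on a virtual machine...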

Python 22.05 k
38 minutes ago
langfuse/langfuse

#Large Language Models# 🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

TypeScript 16.17 k
44 minutes ago
https://static.github-zh.com/github_avatars/comet-ml?size=40

#Large Language Models# Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Python 13.92 k
1 hour ago
https://static.github-zh.com/github_avatars/confident-ai?size=40

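DeepEval is an LLM evaluation framework designed for evaluating and testing large language model systems. It is similar to Pytest but specialized for unit testing LLM outputs.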

Python 10.8 k
2 hours ago
https://static.github-zh.com/github_avatars/promptfoo?size=40

#Large Language Models# Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with comma...

TypeScript 8.39 k
2 hours ago
https://static.github-zh.com/github_avatars/jeinlee1991?size=40

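ReLE Chinese LLM capability benchmark (continuously updated): currently covers 291 large models, including commercial models such as chatgpt, gpt-5, o4-mini, Google gemini-2.5, Claude4, Zhipu GLM-Z1, ERNIE Bot (文心一言), qwen-max, Baichuan, iFlytek Spark, SenseTime senseChat, and minimax, as well as kimi-k2, ernie4.5, minimax-M1, DeepSeek-R1-0528, deepseek-v3.1, qwen...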

4.86 k
17 hours ago
https://static.github-zh.com/github_avatars/PacktPublishing?size=40

#Large Language Models# The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

Python 4.11 k
6 months ago
Agenta-AI/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

Python 3.16 k
3 days ago
https://static.github-zh.com/github_avatars/lmnr-ai?size=40

Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

TypeScript 2.29 k
14 hours ago
msoedov/agentic_security
Python 1.67 k
17 hours ago
https://static.github-zh.com/github_avatars/huggingface?size=40

Build, enrich, and transform datasets using AI models with no code

TypeScript 1.43 k
31 minutes ago
https://static.github-zh.com/github_avatars/microsoft?size=40

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandab...

Python 1.03 k
3 days ago
https://static.github-zh.com/github_avatars/cvs-health?size=40

#Large Language Models# UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

Python 1.02 k
4 days ago
JudgmentLabs/judgeval
Python 1.02 k
5 hours ago