llm-evaluation · GitHub Topics

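#Computer Science# MLflow is an open-source framework designed to manage the entire machine learning lifecycle. It can train and serve models on different platforms, letting you use the same set of tools regardless of whether experiments run locally on your computer, on a remote compute target, on a virtual machine...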

Machine Learning Artificial Intelligence mlflow Apache Spark model-management agentops agents evaluation langchain llm-evaluation llmops observability Open Source openai prompt-engineering mlops

Python 21.4 k

4 hours ago

langfuse / langfuse

#Large Language Models# 🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

analytics Large Language Models llmops large-language-models openai self-hosted ycombinator monitoring observability Open Source langchain llama-index evaluation prompt-engineering prompt-management playground llm-evaluation llm-observability autogen

TypeScript 14.47 k

1 hour ago

comet-ml / opik

#Large Language Models# Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Open Source langchain openai playground prompt-engineering llama-index Large Language Models llm-evaluation llm-observability llmops

Python 11.93 k

11 hours ago

confident-ai / deepeval

DeepEval is an LLM evaluation framework designed for evaluating and testing LLM systems. It is similar to Pytest, but focused on unit testing LLM outputs.

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Python 9.66 k

5 hours ago

promptfoo / promptfoo

#Large Language Models# Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with comma...

Large Language Models prompt-engineering prompts llmops prompt-testing Testing rag evaluation evaluation-framework llm-eval llm-evaluation llm-evaluation-framework Continuous Integration CI/CD pentesting red-teaming vulnerability-scanners

TypeScript 7.8 k

7 hours ago

Arize-ai / phoenix

#Data Warehouse# AI Observability & Evaluation

llmops ai-monitoring ai-observability llm-eval datasets agents Large Language Models prompt-engineering anthropic evals llm-evaluation openai langchain llamaindex smolagents

Jupyter Notebook 6.49 k

5 hours ago

NVIDIA / garak

the LLM vulnerability scanner

Artificial Intelligence llm-evaluation llm-security security-scanners vulnerability-assessment

Python 4.87 k

6 days ago

Giskard-AI / giskard

#Large Language Models# 🐢 Open-Source Evaluation & Testing for AI & LLM systems

mlops ml-validation ml-testing llmops responsible-ai fairness-ai llm-eval llm-evaluation rag-evaluation ai-security llm-security ai-red-team red-team-tools Large Language Models

Python 4.74 k

23 days ago

Helicone / helicone

#Large Language Models# 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

large-language-models prompt-engineering agent-monitoring analytics evaluation gpt langchain llama-index Large Language Models llm-cost llm-evaluation llm-observability llmops monitoring Open Source openai playground prompt-management ycombinator

TypeScript 4.25 k

10 hours ago

Marker-Inc-Korea / AutoRAG

#Large Language Models# AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

analysis automl benchmarking document-parser embeddings evaluation Large Language Models llm-evaluation llm-ops Open Source ops optimization pipeline Python qa rag rag-evaluation retrieval-augmented-generation

Python 4.15 k

1 month ago

PacktPublishing / LLM-Engineers-Handbook

#Large Language Models# The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

genai Large Language Models llmops mlops rag Amazon Web Services fine-tuning-llm llm-evaluation ml-system-design

Python 3.77 k

5 months ago

Agenta-AI / agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

llm-tools prompt-engineering prompt-management llm-evaluation llm-framework rag-evaluation llm-observability llm-as-a-judge llm-monitoring llm-platform llm-playground llmops-platform

Python 3.03 k

1 day ago

truera / trulens

#Computer Science# Evaluation and Tracking for LLM Experiments and AI Agents

Machine Learning neural-networks explainable-ml llmops ai-monitoring ai-observability evals llm-evaluation Large Language Models ai-agents llm-eval agentops

Python 2.68 k

2 days ago

lmnr-ai / lmnr

Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

aiops developer-tools observability agents Artificial Intelligence Rust analytics llm-evaluation llm-observability monitoring Open Source self-hosted ai-observability llmops evals evaluation TypeScript ts

TypeScript 2.2 k

19 hours ago

msoedov / agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

llm-security ai-red-team llm-evaluation llm-evaluation-framework prompt-testing agent-framework

Python 1.57 k

3 days ago

microsoft / prompty

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandab...

generative-ai llm-evaluation Large Language Models promptengineering

Python 976

4 days ago

cvs-health / uqlm

#Large Language Models# UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

ai-safety hallucination Large Language Models llm-evaluation uncertainty-estimation uncertainty-quantification

Python 821

21 hours ago

cyberark / FuzzyAI

#Large Language Models# A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.

jailbreak jailbreaking Large Language Models Artificial Intelligence Security Fuzzing/Fuzz testing llm-evaluation llm-security ai-red-team

Jupyter Notebook 664

18 days ago

JudgmentLabs / judgeval

#Large Language Models# The open source post-building layer for agents. Our traces + evals power agent post-training (RL, SFT), monitoring, and regression testing.

langchain langgraph llama-index Large Language Models llm-evaluation llm-observability Open Source openai prompt-engineering agent agentic-ai agents

Python 619

1 day ago

onejune2018 / Awesome-LLM-Eval

#Natural Language Processing# Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluating foundation LLMs, aiming to explore the technical boundaries of generative AI.

benchmark bert chatglm ChatGPT dataset evaluation gpt3 Large Language Models leaderboard Machine Learning Natural Language Processing openai llama llm-evaluation qwen rag

554

9 months ago