llm-eval · GitHub Topics

#LLM# Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with comma...

llm prompt-engineering prompts llmops prompt-testing Testing rag evaluation evaluation-framework llm-eval llm-evaluation llm-evaluation-framework continuous-integration CI/CD pentesting red-teaming vulnerability-scanners

TypeScript 8.37 k

11 hours ago

Arize-ai / phoenix

#Data warehouse# AI Observability & Evaluation

llmops ai-monitoring ai-observability llm-eval datasets agents llm prompt-engineering anthropic evals llm-evaluation openai langchain llamaindex smolagents

Jupyter Notebook 6.95 k

16 hours ago

Giskard-AI / giskard-oss

#LLM# 🐢 Open-Source Evaluation & Testing library for LLM Agents

mlops ml-validation ml-testing llmops responsible-ai fairness-ai llm-eval llm-evaluation rag-evaluation ai-security llm-security ai-red-team red-team-tools llm

Python 4.86 k

3 days ago

truera / trulens

#Computer science# Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks explainable-ml llmops ai-monitoring ai-observability evals llm-evaluation llm ai-agents llm-eval agentops

Python 2.77 k

2 days ago

iterative / datachain

#LLM# ETL, Analytics, Versioning for Unstructured Data

artificial-intelligence cv data-wrangling llm llm-eval multimodal data-analytics embeddings mlops machine-learning

Python 2.65 k

8 hours ago

uptrain-ai / uptrain

#Computer science# UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform ro...

machine-learning experimentation llm-prompting llmops monitoring prompt-engineering evaluation llm-eval

Python 2.32 k

1 year ago

AI-QL / tuui

#LLM# A desktop MCP client designed as a tool unitary utility integration, accelerating AI adoption through the Model Context Protocol (MCP) and enabling cross-vendor LLM API orchestration.

agent agentic-ai artificial-intelligence deepseek llm mcp openai-api qwen mcp-client mcp-host model-context-protocol ai-playground llm-eval prompt Testing anthropic claude

TypeScript 1.07 k

1 day ago

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-framework evaluation-metrics llm-eval llm-evaluation llm-ops llmops

Python 292

3 months ago

Re-Align / just-eval

#LLM# A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

evaluation gpt4 llm llm-eval llm-evaluation

Python 87

2 years ago

parea-ai / parea-sdk-py

#LLM# Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

llm llm-evaluation llm-tools llmops llm-eval llm-evaluation-framework prompt-engineering generative-ai good-first-issue monitoring

Python 78

7 months ago

kuk / rulm-sbs2

A benchmark comparing Russian analogs of ChatGPT: Saiga, YandexGPT, Gigachat

llm-eval

Jupyter Notebook 60

2 years ago

grigio / llm-eval-simple

#LLM# llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection

llm llm-eval

Python 46

6 days ago

multinear / multinear

#LLM# Develop reliable AI apps

evaluation llm reliability llm-eval llm-evaluation llm-evaluation-framework

Python 44

12 days ago

whitecircle-ai / circle-guard-bench

#LLM# First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

artificial-intelligence benchmark llm large-language-models llm-eval llm-evaluation guardrails benchmarking guardrail jailbreak llm-as-a-judge llm-security

Python 41

2 months ago

Auto-Playground / ragrank

#LLM# 🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well your LLM understands context, its tone, and more. This helps you see how good your LLM applications are.

evaluation language-model llm llm-eval llmops machine-learning prompt-engineering rag

Python 40

3 months ago

alan-turing-institute / prompto

#Natural language processing# An open source library for asynchronous querying of LLM endpoints

hut23 large-language-models llm-eval llm-evaluation llm transformers deep-learning machine-learning natural-language-processing Python transformer

Python 32

2 months ago

genia-dev / vibraniumdome

#LLM# LLM Security Platform.

adversarial-attacks ChatGPT llm openai prompt-injection security llm-agent llm-security llmops prompt-engineering prompts llm-framework llm-inference llm-serving llm-evaluation llm-eval

Python 22

1 year ago

Supahands / llm-comparison-backend

#LLM# This is an open-source project for comparing two LLMs head to head on a given prompt. This section covers the backend of the project, allowing LLM APIs to be incorporated ...

artificial-intelligence ChatGPT llm llm-eval

Python 21

2 months ago

honeyhiveai / realign

Realign is a testing and simulation framework for AI applications.

artificial-intelligence alignment evaluation llm prompt-engineering red-teaming Simulation llm-eval llm-evaluation llm-evaluation-framework llmops rag

Python 17

9 months ago

attogram / ollama-multirun

Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance statistics and model info. All in a single Bash shell script.

artificial-intelligence ollama llm-evaluation ollama-interface Bash ollama-app llm-evaluation-metrics llm-eval static-site-generator

Shell 10

15 days ago