FlashInfer: Kernel Library for LLM Serving
2023-07-22
2025-09-10T06:10:38Z
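For orientation, a decode-attention call through FlashInfer's Python bindings might look like the sketch below. It assumes the `flashinfer` package's `single_decode_with_kv_cache` entry point and the shapes used for single-request decoding (query `[num_qo_heads, head_dim]`, KV cache `[kv_len, num_kv_heads, head_dim]`); treat it as a sketch, not a verified example.

```python
# Hedged sketch: assumes flashinfer exposes single_decode_with_kv_cache
# for single-request decode attention on CUDA, half precision.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 32, 128, 2048
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Attention of one new query token against its cached keys/values.
o = flashinfer.single_decode_with_kv_cache(q, k, v)  # -> [num_qo_heads, head_dim]
print(o.shape)
```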
This project aims to reproduce Sora (OpenAI's T2V model), and we hope the open-source community will contribute to it.
Open-Sora: a fully open-source, efficient solution for reproducing Sora-style video generation
#Large Language Models# Code examples and resources for DBRX, a large language model developed by Databricks
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Devika is an Agentic AI Software Engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to achieve the given objective. D...
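As a purely illustrative toy (none of these names come from Devika's codebase), the plan → research → write-code loop described above could be modeled roughly like this:

```python
# Toy agent loop with hypothetical helpers; not Devika's actual API.
from dataclasses import dataclass, field

@dataclass
class ToyAgent:
    objective: str
    notes: list = field(default_factory=list)

    def plan(self):
        # A real agent would ask an LLM to decompose the objective into steps.
        return [("research", self.objective), ("code", self.objective)]

    def research(self, topic):
        # Placeholder for web search / documentation lookup.
        self.notes.append(f"notes on {topic}")

    def write_code(self, task):
        # Placeholder for LLM-driven code generation, conditioned on the notes.
        return f"# solution for: {task} (informed by {len(self.notes)} note(s))"

    def run(self):
        outputs = []
        for kind, payload in self.plan():
            if kind == "research":
                self.research(payload)
            else:
                outputs.append(self.write_code(payload))
        return "\n".join(outputs)

print(ToyAgent("build a URL shortener").run())
```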
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
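The core idea, projecting each 2-D gradient into a low-rank subspace before running the optimizer, can be sketched as follows. This is a simplified illustration (plain momentum instead of GaLore's Adam statistics, made-up hyperparameters), not the library's API.

```python
import torch

def galore_step(param, grad, state, rank=4, update_proj_gap=200, lr=1e-3, beta=0.9):
    """One GaLore-style update for a 2-D weight, illustrative only: project the
    gradient into a rank-`rank` subspace, keep optimizer state there, project back."""
    step = state.setdefault("step", 0)
    # Periodically refresh the projection basis from an SVD of the current gradient.
    if "P" not in state or step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                       # [m, r] orthonormal columns
    P = state["P"]
    g_low = P.T @ grad                                 # [r, n] projected gradient
    # Momentum kept in the low-rank space, so optimizer state is [r, n], not [m, n].
    mom = state.setdefault("mom", torch.zeros_like(g_low))
    mom.mul_(beta).add_(g_low)
    param.data.add_(P @ mom, alpha=-lr)                # map the update back to [m, n]
    state["step"] = step + 1

# Usage on a toy weight:
W = torch.nn.Parameter(torch.randn(256, 128))
loss = (W @ torch.randn(128)).pow(2).sum()
loss.backward()
galore_step(W, W.grad, state={})
```

The memory saving comes from storing optimizer state of shape [rank, n] instead of the full [m, n].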
Training LLMs with QLoRA + FSDP
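A minimal QLoRA setup with Hugging Face transformers + peft might look like the sketch below; the model id and LoRA hyperparameters are placeholders, and the FSDP sharding step (the repo's actual focus) is only noted in a comment because wrapping a 4-bit model needs that repo's specific recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Trainable low-rank adapters on the attention projections (the "LoRA" part).
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Sharding the quantized base model across GPUs with FSDP needs extra care;
# in practice it is driven by an Accelerate/FSDP config rather than manual wrapping.
```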
[CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Fast and memory-efficient exact attention
Flash Attention in ~100 lines of CUDA (forward pass only)
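Both of the last two entries implement the same core trick: tile the keys/values and maintain a running softmax (row max and normalizer) so the full attention matrix is never materialized. A single-head PyTorch sketch of that online-softmax recurrence, illustrative only (no masking, no dropout; real kernels do this per tile in GPU SRAM):

```python
import torch

def tiled_attention(q, k, v, block_size=64):
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)                        # running (unnormalized) output
    row_max = torch.full((seq_len,), float("-inf"))  # running row-wise max of scores
    row_sum = torch.zeros(seq_len)                   # running softmax normalizer
    for start in range(0, seq_len, block_size):
        k_blk, v_blk = k[start:start + block_size], v[start:start + block_size]
        scores = (q @ k_blk.T) * scale               # [seq_len, block]
        new_max = torch.maximum(row_max, scores.max(dim=-1).values)
        correction = torch.exp(row_max - new_max)    # rescale the old accumulators
        p = torch.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ v_blk
        row_sum = row_sum * correction + p.sum(dim=-1)
        row_max = new_max
    return out / row_sum[:, None]

q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

The result matches the naive `softmax(qkᵀ)·v` up to floating-point error, which is why this family of kernels is "exact" attention rather than an approximation.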
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-L...
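Assuming the description refers to the high-level `tensorrt_llm.LLM` / `SamplingParams` API found in recent releases, usage might look roughly like this sketch; the checkpoint name and sampling parameters are placeholders, and exact names may differ by version.

```python
# Hedged sketch of TensorRT-LLM's high-level Python API; names and return
# structure are assumed from recent releases and may differ in your install.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")   # HF checkpoint, built into a TensorRT engine
params = SamplingParams(max_tokens=64, temperature=0.8)

for output in llm.generate(["Summarize paged KV caching in one sentence."], params):
    print(output.outputs[0].text)
```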