flash-attention

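FlashAttention, the topic these repositories share, computes exact softmax attention in tiles so the full (T × T) score matrix never has to sit in GPU memory. As a minimal sketch of using it through PyTorch's built-in SDPA dispatcher, assuming PyTorch ≥ 2.3 and a CUDA GPU (shapes and sizes below are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim); flash kernels require fp16/bf16 on CUDA.
q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Restrict the SDPA dispatcher to the FlashAttention backend for this call.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

The repositories below are implementations, ports, and applications of this kernel family.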
QwenLM / Qwen

#NLP# Qwen-7B (Tongyi Qianwen 7B) is the 7-billion-parameter model in the Tongyi Qianwen large model series developed by Alibaba Cloud.

chinese · large-language-models · nlp · flash-attention · llm · pretrained-models
Python 18.49k
2 months ago
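For context, Hugging Face transformers exposes a generic switch for the FlashAttention-2 kernels that model repos like this one advertise. A hedged sketch, assuming transformers plus the flash-attn package and a supported GPU; the model id is just the one from this entry, and older trust_remote_code models such as first-generation Qwen may gate flash attention through their own custom flag instead:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat"  # illustrative; any flash-attn-capable model works
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # flash-attn kernels need fp16/bf16
    attn_implementation="flash_attention_2",  # transformers' generic switch
    device_map="auto",
    trust_remote_code=True,
)
```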
ymcui / Chinese-LLaMA-Alpaca-2

#NLP# Chinese LLaMA-2 & Alpaca-2 large models, phase two of the project, plus 64K long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models).

alpaca · llama · llm · llama-2 · large-language-models · nlp · alpaca-2 · flash-attention · llama2 · alpaca2 · yarn · rlhf
Python 7.16k
9 months ago
InternLM / InternLM

#LLM# Official release of the InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).

chatbot · gpt · llm · long-context · rlhf · fine-tuning-llm · chinese · flash-attention · pretrained-models
Python 6.94k
4 months ago
xlite-dev / LeetCUDA

📚 LeetCUDA: 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA.

cuda · cuda-kernels · flash-attention · cuda-library · cuda-cpp
Cuda 4.76k
5 days ago
xlite-dev / Awesome-LLM-Inference

📚 A curated list of awesome LLM inference papers with code.

flash-attention · tensorrt-llm · vllm · llm-inference · deepseek · deepseek-v3 · deepseek-r1 · qwen3
Python 4.12k
7 days ago
MoonshotAI / MoBA

#LLM# MoBA: Mixture of Block Attention for Long-Context LLMs

flash-attention · llm · llm-serving · llm-training · moe · pytorch · transformer
Python 1.8k
2 months ago
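Not MoBA's actual code, but a minimal PyTorch sketch of the block-attention gating idea the description names: represent each key block by its mean, score blocks per query, and keep only the top-k blocks (the function and parameter names are mine):

```python
import torch

def topk_block_mask(q, k, block_size=64, topk=4):
    # q: (B, H, Tq, D); k: (B, H, Tk, D) with Tk divisible by block_size.
    B, H, Tq, D = q.shape
    nb = k.shape[2] // block_size
    # Represent each key block by its mean vector: (B, H, nb, D).
    block_repr = k.reshape(B, H, nb, block_size, D).mean(dim=3)
    # Score every query against every block: (B, H, Tq, nb).
    scores = torch.einsum("bhqd,bhnd->bhqn", q, block_repr)
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep.scatter_(-1, scores.topk(topk, dim=-1).indices, True)
    # Expand block-level decisions to a per-token mask: (B, H, Tq, Tk).
    return keep.repeat_interleave(block_size, dim=-1)
```

The boolean mask can be passed to F.scaled_dot_product_attention as attn_mask; MoBA itself fuses the gating with flash-attention kernels rather than materializing a dense mask.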
InternLM / InternEvo

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without extensive dependencies.

gemma · internlm · internlm2 · llama3 · llava · llm-framework · llm-training · multi-modal · pipeline-parallelism · flash-attention · pytorch
Python 392
4 days ago
DAMO-NLP-SG / Inf-CLIP

[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A super memory-efficient CLIP training scheme.

contrastive-learning · flash-attention · memory-efficient · clip
Python 250
5 months ago
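This is not Inf-CL's tiled algorithm, only a sketch of the row-chunking idea behind it: never materialize the full B × B similarity matrix, and checkpoint each slice so its activations are rematerialized in backward. The function and its arguments are illustrative names, and only the image-to-text direction is shown:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def chunked_clip_loss(img, txt, chunk=256):
    # img, txt: (B, D) L2-normalized embeddings; matching pairs share an index.
    B = img.shape[0]
    labels = torch.arange(B, device=img.device)

    def row_block(rows, lbl):
        # Only a (chunk, B) slice of the similarity matrix exists at a time.
        return F.cross_entropy(rows @ txt.t(), lbl, reduction="sum")

    loss = img.new_zeros(())
    for s in range(0, B, chunk):
        # Checkpointing frees each slice after forward and recomputes it in
        # backward, keeping peak memory O(chunk * B) instead of O(B^2).
        loss = loss + checkpoint(row_block, img[s:s + chunk],
                                 labels[s:s + chunk], use_reentrant=False)
    return loss / B
```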
xlite-dev / ffpa-attn

📚 FFPA: extends FA-2 with Split-D for large head dims; roughly 2x faster than SDPA.

attention · cuda · flash-attention · mlsys · deepseek · deepseek-r1 · deepseek-v3
Cuda 186
1 month ago
alexzhang13 / flashattention2-custom-mask

#Computer Science# Triton implementation of FlashAttention-2 that adds custom masks.

attention · attention-mechanism · cuda-kernels · deep-learning · flash-attention · triton
Python 119
10 months ago
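For comparison, PyTorch's stock SDPA accepts arbitrary boolean masks, but its built-in flash backend only handles the causal case and falls back to a slower backend otherwise; that gap is what a Triton custom-mask kernel like this repo addresses. A small sketch of a non-standard mask (causal plus four globally visible tokens, chosen arbitrarily):

```python
import torch
import torch.nn.functional as F

B, H, T, D = 2, 8, 256, 64
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Boolean attn_mask: True means "this query may attend to this key".
mask = torch.ones(T, T, dtype=torch.bool).tril()  # causal...
mask[:, :4] = True                                # ...plus 4 global tokens

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```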
CoinCheung / gdGPT

#NLP# Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode; faster than ZeRO/ZeRO++/FSDP.

deepspeed · llm · pipeline · nlp · pytorch · bloom · flash-attention · baichuan2-7b · mixtral-8x7b · llama2
Python 96
1 year ago
Bruce-Lee-LY / flash_attention_inference

#LLM# Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.

cuda · flash-attention · gpu · inference · llm · nvidia · cutlass · mha
C++ 38
4 months ago
Bruce-Lee-LY / decoding_attention

#LLM# Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.

cuda · gpu · inference · llm · mha · nvidia · flash-attention
C++ 37
4 days ago
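As background for the MHA/MQA/GQA distinction in this entry: grouped-query attention shares each KV head across a group of query heads, shrinking the KV cache during decoding. A minimal PyTorch sketch of the semantics, assuming the head counts divide evenly (the function name is mine):

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v):
    # q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq a multiple of Hkv.
    groups = q.shape[1] // k.shape[1]
    # Replicate each KV head across its query-head group, then attend as usual.
    k = k.repeat_interleave(groups, dim=1)
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

out = gqa(torch.randn(1, 32, 128, 64),   # 32 query heads
          torch.randn(1, 8, 128, 64),    # 8 KV heads -> groups of 4
          torch.randn(1, 8, 128, 64))
```

Dedicated decoding kernels avoid the explicit replication; this sketch only shows what is computed.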
kklemon / FlashPerceiver

#NLP# Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.

attention-mechanism · deep-learning · flash-attention · nlp · transformer
Python 26
7 months ago
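The Perceiver's core trick, sketched below in plain PyTorch: a small learned latent array queries the long input through cross-attention, so the score matrix is (L × T) rather than (T × T). Sizes and names here are illustrative, not the repo's API:

```python
import torch
import torch.nn.functional as F

B, T, D, L, H = 2, 4096, 256, 64, 8
x = torch.randn(B, T, D)                        # long input sequence
latents = torch.randn(1, L, D).expand(B, L, D)  # small (learned) latent array

def split_heads(t):
    # (B, S, D) -> (B, H, S, D // H)
    return t.reshape(t.shape[0], t.shape[1], H, D // H).transpose(1, 2)

# Latents are the queries: cost grows linearly in T instead of quadratically.
out = F.scaled_dot_product_attention(
    split_heads(latents), split_heads(x), split_heads(x))
```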
RulinShao / FastCkpt

Python package for rematerialization-aware gradient checkpointing

flash-attention
Python 25
2 years ago
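Plain gradient checkpointing, which this package refines, trades compute for memory: activations inside a checkpointed segment are freed after forward and rematerialized during backward. A minimal PyTorch sketch (the Block module is a stand-in); FastCkpt's "rematerialization-aware" angle is avoiding recompute that flash-attention's backward pass already performs:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ff(x)

block = Block(256)
x = torch.randn(8, 128, 256, requires_grad=True)
# Intermediate activations inside `block` are not stored; they are recomputed
# when backward reaches this segment.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```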
erfanzar / jax-flash-attn2

A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).

flash-attention · jax
Python 24
3 months ago
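For reference, the quadratic-memory attention that such a JAX kernel replaces fits in a few lines; a flash implementation returns the same values without ever materializing the scores array (shapes below are illustrative):

```python
import jax
import jax.numpy as jnp

def reference_attention(q, k, v):
    # q, k, v: (batch, heads, seq, dim). Plain O(T^2)-memory attention;
    # a flash kernel computes the same result tile by tile.
    scores = jnp.einsum("bhqd,bhkd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
    return jnp.einsum("bhqk,bhkd->bhqd", jax.nn.softmax(scores, axis=-1), v)

q, k, v = (jnp.ones((1, 4, 128, 64)) for _ in range(3))
out = reference_attention(q, k, v)
```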
Naman-ntc / FastCode

Utilities for efficient fine-tuning, inference and evaluation of code generation models

code-generation · efficient · finetuning · inference · transformers · flash-attention
Python 21
2 years ago
kyegomez / FlashMHA

A simple PyTorch implementation of flash multi-head attention.

ai · artificial-neural-networks · attention · attention-mechanisms · gpt4 · transformer · flash-attention
Jupyter Notebook 21
1 year ago
AI-DarwinLabs / amd-mi300-ml-stack

#Computer Science# 🚀 Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations.

conda · deep-learning · deepspeed · flash-attention · gpu-computing · hpc · machine-learning · slurm · rocm
Shell 11
7 months ago
pxl-th / NNop.jl

Flash Attention & friends in pure Julia

gpgpu · gpu · julia · amdgpu · cuda · flash-attention
Julia 10
1 month ago