[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
2023-06-12
2024-08-13T09:45:30Z
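SqueezeLLM's core idea is to decompose each weight matrix into a dense low-bit part and a small sparse matrix holding the outlier weights. A minimal sketch of that decomposition, assuming a simple magnitude-based outlier criterion (the actual method additionally uses sensitivity-based non-uniform quantization; `dense_and_sparse_split` and all sizes here are illustrative):

```python
import torch

def dense_and_sparse_split(W: torch.Tensor, outlier_frac: float = 0.005):
    """Split a weight matrix into a sparse FP16 outlier part and a
    dense remainder quantized to 4 bits. Conceptual sketch only; the
    SqueezeLLM paper uses sensitivity-weighted non-uniform centroids
    rather than the uniform grid used here."""
    k = max(1, int(outlier_frac * W.numel()))
    thresh = W.abs().flatten().topk(k).values.min()
    outlier_mask = W.abs() >= thresh
    sparse = (W * outlier_mask).to_sparse()     # few large-magnitude weights, kept full precision
    dense = W * (~outlier_mask)                 # narrow-range remainder, easy to quantize
    scale = dense.abs().max() / 7.0             # uniform INT4 grid for the sketch
    q = torch.clamp(torch.round(dense / scale), -8, 7).to(torch.int8)
    return q, scale, sparse

W = torch.randn(256, 256)
q, scale, sparse = dense_and_sparse_split(W)
recon = q.float() * scale + sparse.to_dense()
print((W - recon).abs().max())                  # reconstruction error
```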
#Natural Language Processing#[ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling
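LLMCompiler's speedup comes from planning tool calls as a dependency graph and executing independent calls concurrently instead of one at a time. A minimal sketch of that execution pattern with `asyncio` (the `search` and `fetch_weather` tools are hypothetical; the actual planner and executor are much richer):

```python
import asyncio

# Two hypothetical tools with no dependency between them.
async def search(query: str) -> str:
    await asyncio.sleep(0.1)          # stand-in for a real API call
    return f"results for {query!r}"

async def fetch_weather(city: str) -> str:
    await asyncio.sleep(0.1)
    return f"weather in {city}"

async def main() -> None:
    # Both calls are independent in the plan's dependency graph,
    # so they are dispatched in parallel rather than sequentially.
    results = await asyncio.gather(
        search("ICML 2024 quantization papers"),
        fetch_weather("Berkeley"),
    )
    print(results)

asyncio.run(main())
```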
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
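A quantization flow along the lines of AutoAWQ's documented example (the model path, output directory, and config values are illustrative; check the project docs for current options):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"        # illustrative model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
```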
Open-Sora: a fully open-source solution for efficiently reproducing Sora-like video generation
#Large Language Models#LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
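A minimal offline-inference sketch following LMDeploy's quick start (the model name is illustrative):

```python
from lmdeploy import pipeline

# Build an inference pipeline around a chat model and run a batch of prompts.
pipe = pipeline("internlm/internlm2-chat-7b")
response = pipe(["What is post-training quantization?"])
print(response)
```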
#Natural Language Processing#[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
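One observation behind KVQuant is that key outliers are aligned per channel, so keys quantize better with per-channel scales (the paper also quantizes keys before RoPE and isolates outliers separately). A toy sketch of per-channel low-bit key quantization:

```python
import torch

def quantize_keys_per_channel(keys: torch.Tensor, n_bits: int = 4):
    """Quantize cached keys with one scale per channel, the axis along
    which key outliers cluster. Conceptual sketch only; KVQuant adds
    pre-RoPE quantization, non-uniform datatypes, and outlier handling.
    keys: [seq_len, head_dim]."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = keys.abs().amax(dim=0, keepdim=True) / qmax   # one scale per channel
    q = torch.clamp(torch.round(keys / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

keys = torch.randn(1024, 128)
q, scale = quantize_keys_per_channel(keys)
print((keys - q.float() * scale).abs().max())             # per-channel error
```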
Training LLMs with QLoRA + FSDP
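The QLoRA half of this recipe loads the base model in 4-bit NF4 and trains LoRA adapters on top; FSDP then shards the quantized model across GPUs, which the training config wires up. A sketch of the QLoRA side with `transformers` + `peft` (the model name and LoRA hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base weights with bf16 compute, the standard QLoRA setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)

# Trainable low-rank adapters on the attention projections.
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```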
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
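GPTQ rounds weights one at a time and redistributes each rounding error over the not-yet-quantized weights using second-order (Hessian) information from calibration data. A toy single-row sketch of that update (the real code processes columns in blocks via a Cholesky factorization; all sizes here are illustrative):

```python
import torch

def gptq_row_quantize(w: torch.Tensor, H_inv: torch.Tensor, scale: float) -> torch.Tensor:
    """Error-compensated rounding for one weight row: after quantizing
    weight i, its rounding error is spread over the remaining weights
    via the inverse Hessian of the layer reconstruction loss."""
    w = w.clone()
    q = torch.empty_like(w)
    for i in range(w.numel()):
        q[i] = torch.clamp(torch.round(w[i] / scale), -8, 7) * scale   # INT4 round-trip
        err = (w[i] - q[i]) / H_inv[i, i]
        w[i + 1:] -= err * H_inv[i, i + 1:]                            # compensate the rest
    return q

X = torch.randn(512, 64)                                  # calibration activations (hypothetical)
H_inv = torch.linalg.inv(2 * X.T @ X + 0.01 * torch.eye(64))  # damped inverse Hessian
w = torch.randn(64)
q = gptq_row_quantize(w, H_inv, scale=w.abs().max().item() / 7)
```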
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
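AWQ scales up the weight channels that matter most to activations before quantizing, folding the inverse scale into the previous operation so the full-precision output is unchanged. A toy sketch of the scale search, simplified from the paper (which also searches clipping thresholds and fuses the scales):

```python
import torch

def fake_quantize(W: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Per-tensor INT4 round-trip used by the search below."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().amax() / qmax
    return torch.clamp(torch.round(W / scale), -qmax - 1, qmax) * scale

def awq_scale_search(W: torch.Tensor, X: torch.Tensor, n_grid: int = 20) -> torch.Tensor:
    """Grid-search an exponent alpha so that s = mean|X|^alpha, scale
    weights up by s (with 1/s folded into the previous op), and keep
    the s that minimizes the quantized layer's output error.
    W: [c_out, c_in], X: [tokens, c_in]."""
    act_mag = X.abs().mean(dim=0)                      # per-channel activation magnitude
    ref = X @ W.T
    best_err, best_s = float("inf"), torch.ones_like(act_mag)
    for i in range(n_grid):
        s = act_mag.pow(i / n_grid).clamp(min=1e-4)
        Wq = fake_quantize(W * s) / s                  # quantize scaled weights, undo the scale
        err = (ref - X @ Wq.T).pow(2).mean().item()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```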
#Large Language Models#Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs
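GEAR combines coarse quantization of the KV cache with a low-rank approximation of the quantization residual, plus a small sparse outlier matrix that is omitted here. A toy sketch of the quantize-plus-low-rank step:

```python
import torch

def gear_like_compress(kv: torch.Tensor, n_bits: int = 4, rank: int = 4):
    """Coarse uniform quantization of a KV tensor plus a rank-r SVD
    approximation of the residual. Conceptual sketch; the real recipe
    also keeps a sparse outlier matrix. kv: [seq_len, hidden]."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = kv.abs().amax() / qmax
    q = torch.clamp(torch.round(kv / scale), -qmax - 1, qmax)
    residual = kv - q * scale
    U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]
    return q.to(torch.int8), scale, low_rank

kv = torch.randn(1024, 128)
q, scale, low_rank = gear_like_compress(kv)
approx = q.float() * scale + low_rank
print((kv - approx).abs().max())   # near-lossless reconstruction error
```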
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
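GaLore projects the full gradient into a low-rank subspace, takes the optimizer step there, and projects back, so the memory-heavy optimizer state lives in the small space. A heavily simplified single-step sketch (the real method keeps Adam statistics in the subspace and refreshes the projector by periodic SVD):

```python
import torch

def galore_step(param: torch.Tensor, grad: torch.Tensor, P: torch.Tensor, lr: float = 1e-3):
    """One simplified GaLore update: compress the gradient into the
    rank-r subspace spanned by P, then project the step back to the
    full parameter shape. Optimizer state (omitted) would live in the
    small [r, n] space."""
    g_low = P.T @ grad          # [r, n] low-rank gradient
    param -= lr * (P @ g_low)   # project back and apply

W = torch.randn(512, 512)
G = torch.randn(512, 512)                        # stand-in for a real gradient
U, _, _ = torch.linalg.svd(G, full_matrices=False)
P = U[:, :8]                                     # rank-8 projector from the gradient's SVD
galore_step(W, G, P)
```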
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
An easy-to-use package for implementing SmoothQuant for LLMs
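SmoothQuant migrates quantization difficulty from activations to weights with a per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha); dividing activations by s and multiplying weights by s leaves the layer's output mathematically unchanged. A sketch of just that equation:

```python
import torch

def smoothquant_scales(X: torch.Tensor, W: torch.Tensor, alpha: float = 0.5):
    """Per-channel smoothing factors from the SmoothQuant paper.
    X: [tokens, c_in] activations, W: [c_out, c_in] weights."""
    act_max = X.abs().amax(dim=0)          # per input channel
    w_max = W.abs().amax(dim=0)
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)
    return s.clamp(min=1e-5)

X = torch.randn(256, 768) * 10             # activations with large outliers
W = torch.randn(768, 768)
s = smoothquant_scales(X, W)
X_smooth, W_smooth = X / s, W * s          # X @ W.T == X_smooth @ W_smooth.T
print((X @ W.T - X_smooth @ W_smooth.T).abs().max())
```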
#Large Language Models#Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
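Medusa adds extra decoding heads on the base model's last hidden state, each proposing a token several steps ahead; the proposals are then verified in a single base-model forward pass. A conceptual sketch of the drafting step (tree attention and acceptance rules omitted; all sizes are hypothetical and the heads here are untrained):

```python
import torch

hidden = torch.randn(1, 4096)                               # base model's last hidden state
heads = [torch.nn.Linear(4096, 32000) for _ in range(4)]    # 4 Medusa heads over the vocab

# Each head speculates one of the next 4 tokens from the same hidden
# state; the real system builds a candidate tree from top-k proposals
# and verifies it in one pass of the base model.
draft = torch.cat([head(hidden).argmax(dim=-1) for head in heads])
print(draft)                                                # candidate tokens t+1 .. t+4
```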