GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving
Compile time kernels fusion and expression trees as Alpaka boost.odeint backend. This is my team project developed in collaboration with and under the supervision of HZDR.
#计算机科学#High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp...