GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving
Compile-time kernel fusion and expression trees as an Alpaka backend for boost.odeint. This is my team project, developed in collaboration with and under the supervision of HZDR.
High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp...
Mabor is a cutting-edge deep learning framework built for flexibility, efficiency, and portability, without compromise.
Parallel and Distributed Systems - Exercise 3