Hamza Elshafie (HamzaElshafie)

gpt-oss-20B Public

A PyTorch implementation of the GPT-OSS-20B architecture. All components are coded from scratch: RoPE with YaRN, RMSNorm, SwiGLU with clamping and residual connection, Mixture-of-Experts (MoE), Sel…

Python · 216 stars · 15 forks
attn-arena Public

PyTorch implementations and benchmarks of hardware-efficient attention variants for LLM inference on single- and multi-GPU consumer setups, with a Flash Attention 2 backend.

Python
h100_gemm Public

A series of high-performance GEMM (General Matrix Multiply) implementations, iteratively optimised for H100 GPUs in pure CUDA.

Cuda · 71 stars · 10 forks
CUDA_Kernels Public

Random ML CUDA Kernels.

Cuda
vllm Public

Forked from vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Python
flash-attention-2-triton Public

Python