Emre's Blog

02 Jul, 2026 Writing A Megakernel For LLM Decode - A Worklog
19 May, 2026 Unleashing Blackwell's 4-bit: a surgical look at MXFP4 and NVFP4
13 May, 2026 From 429 GB/s to the DRAM wall: writing an FP8 quantizer on an RTX 5080
07 May, 2026 8.5x Faster Speech-to-Text: From 429ms to 50ms on a Single GPU
18 Oct, 2025 W8A16 Quantization with LLM.int8-Style Outlier Handling
06 Sep, 2025 How Critical Are Outliers in Transformer Models? A Live Experiment - Phase 1
24 Mar, 2025 Transformer Architecture: Building Blocks Explained