- Writing A Megakernel For LLM Decode - A Worklog
- Unleashing Blackwell's 4-bit: a surgical look at MXFP4 and NVFP4
- From 429 GB/s to the DRAM wall: writing an FP8 quantizer on an RTX 5080
- 8.5x Faster Speech-to-Text: From 429ms to 50ms on a Single GPU
- W8A16 Quantization with LLM.int8-Style Outlier Handling
- How Critical Are Outliers in Transformer Models? A Live Experiment - Phase 1
- Transformer Architecture: Building Blocks Explained