Emre's Blog

8.5x Faster Speech-to-Text: From 429ms to 50ms on a Single GPU

My STT journey started with Whisper. One of Freya's STT models was a fine-tuned Whisper Large-V3. I optimized it with TensorRT, got it to 130ms, around 94x realtime. Solid, but we hit a wall. The architecture has a heavy encoder and a lightweight decoder, and TRT could only help with the encoder side. We needed something better.

STT is the first link in our voice agent chain: STT → LLM → TTS. When your target is sub-second round-trip for a real-time conversation, every millisecond here sits in front of everything else.

This post is the full breakdown: what we changed, what we measured, what we killed, and what actually moved the needle.

Voice Agent E2E Pipeline


Where We Started

Before Freya-STT, Freya was running a fine-tuned Whisper Large-V3. But Freya-Whisper had an accuracy problem. The fine-tuned variant was hitting 33% WER on Turkish, and the architecture made it hard to push further. The encoder is heavy (637M params), the decoder is light (172M params), and the accuracy ceiling just wasn't moving.

When Alibaba released Qwen3-ASR in January 2026, we tested it immediately. 1.7B parameters, 52 languages, and it's built on a decoder-heavy architecture similar to LLMs. On Freya's Turkish test set, it scored 8.26% WER and 1.93% CER out of the box. That's a different league from the existing setup on accuracy. But it ran around 429ms per transcription, slower than optimized Freya-Whisper.

We benchmarked on 25 in-house Turkish recordings, 5 seconds each — our team simulating call-center scenarios, with background noise mixed in afterwards. Accented speakers, crosstalk. Not clean lab audio.

Baseline Metrics

429ms just for transcription. In a voice agent pipeline, STT is the first thing that runs; the LLM and TTS come after. At 429ms, you're already past the point where a conversation feels natural before you've even started generating a response.

The model was accurate but wasn't fast enough. So the question became: can we keep this accuracy and make it fast?


The Framework Swap (429ms → 76ms)

The single biggest win was changing the inference engine. HF Transformers is great for prototyping, but we needed something that could handle concurrent customer requests, scale under load, and not fall apart when multiple audio streams hit the server at the same time. A bare model.generate() call in a FastAPI endpoint doesn't cut it when you have real users waiting.

The problem with Transformers for serving is fundamental. Every call to model.generate() holds the GPU exclusively for the full duration of generation. If you have 10 concurrent requests, 9 of them are waiting in a Python queue while 1 runs on the GPU. The GPU is idle between requests, busy during requests, and there's no way to interleave work from different requests within a single forward pass.

We switched to vLLM, which is built for exactly this kind of serving workload.

Why vLLM for an ASR model?

Freya-STT's decoder is architecturally identical to a standard LLM decoder. Same attention layers, same KV cache pattern, same autoregressive generation loop. vLLM's audio model support treats it like any other language model and brings everything that comes with that: continuous batching, a paged KV cache, and CUDA graphs.
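As a rough sketch of what the serving side looks like through vLLM's offline API (the model id and prompt template below are placeholders, and the exact multimodal input format depends on the model; this is not the production serving code):

    # Minimal sketch of audio inference through vLLM's offline API.
    # Model id and prompt template are placeholders; vLLM's documented
    # multimodal format passes audio as a (waveform, sample_rate) tuple.
    import numpy as np
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen3-ASR", max_model_len=4096)   # placeholder model id

    waveform = np.zeros(16000 * 5, dtype=np.float32)         # 5s of audio at 16kHz
    sampling = SamplingParams(temperature=0.0, max_tokens=128)

    outputs = llm.generate(
        {
            "prompt": "<transcribe>",                         # model-specific template
            "multi_modal_data": {"audio": (waveform, 16000)},
        },
        sampling,
    )
    print(outputs[0].outputs[0].text)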

The result is promising: 76ms, down from 429ms. 5.6x faster.


FP8 Quantization (76ms → 68ms)

Before diving in, I would like to explain what FP8 actually is and why it matters for inference speed.

How floating point numbers work

Every floating point number is stored as three parts: a sign bit, an exponent, and a mantissa. The exponent determines the range (how big or small the number can be), and the mantissa determines the precision (how many decimal places you get).

Floating Point Formats

FP32 gives you 23 bits of mantissa, which is extremely precise but costs 4 bytes per weight. BF16 keeps the same 8-bit exponent (same range as FP32) but drops the mantissa to 7 bits, cutting storage in half. FP8 E4M3 goes further: 4-bit exponent, 3-bit mantissa, just 1 byte per weight.

3 bits of mantissa means you only get 8 distinct values per exponent range. The maximum representable value is 448. That sounds rough, but for neural network weights that cluster around zero, it's enough.

Why fewer bytes means faster inference

At batch size 1 (autoregressive decode), every decode step does a matrix multiplication: one row of activations times the full weight matrix. The GPU loads the entire weight matrix from HBM once per step, but it only does a tiny amount of math with it. This is called being memory-bound: the bottleneck is how fast you can read weights from memory, not how fast you can multiply them.

H100 has 3.35 TB/s of HBM bandwidth. If your model weights are 3.4GB in BF16, that's roughly 1ms just to read them. In FP8, the same weights are 1.7GB, so it takes 0.5ms. Multiply that by ~89 decode steps and the savings add up.
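Here is that back-of-the-envelope arithmetic spelled out, using the rough figures quoted above:

    # Rough memory-bound decode math for the figures quoted above.
    HBM_BANDWIDTH = 3.35e12          # H100 HBM3, bytes/s
    DECODE_STEPS = 89                # ~tokens generated per transcription

    for name, weight_bytes in [("BF16", 3.4e9), ("FP8", 1.7e9)]:
        per_step_ms = weight_bytes / HBM_BANDWIDTH * 1e3
        total_ms = per_step_ms * DECODE_STEPS
        print(f"{name}: {per_step_ms:.2f} ms/step, ~{total_ms:.0f} ms just reading weights")

    # BF16: 1.01 ms/step, ~90 ms just reading weights
    # FP8:  0.51 ms/step, ~45 ms just reading weights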

Enabling FP8 in vLLM

Hopper GPUs (H100, SM90) and Ada Lovelace (RTX 4090, SM89) have native FP8 tensor cores, dedicated silicon that runs FP8 matrix multiplications at 2x the throughput of BF16. vLLM supports FP8 weight quantization out of the box, with kernel backends such as FlashInfer, Marlin, and CUTLASS. We auto-detect GPU capability at startup so the same container runs optimally on any hardware, no manual configuration needed.
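The detection logic is simple; a minimal sketch (the real startup code handles more GPU families and fallbacks, and the model id is a placeholder):

    # Sketch: enable FP8 weight quantization only on GPUs with native FP8 tensor cores.
    import torch
    from vllm import LLM

    major, minor = torch.cuda.get_device_capability()
    compute_capability = major * 10 + minor

    # SM89 (Ada) and SM90 (Hopper) have FP8 tensor cores; older GPUs stay in BF16.
    quantization = "fp8" if compute_capability >= 89 else None

    llm = LLM(
        model="Qwen/Qwen3-ASR",          # placeholder model id
        quantization=quantization,        # None -> keep weights in BF16
        gpu_memory_utilization=0.3,
    )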

BF16 vs FP8 Comparison

Result: 76ms → 68ms. A free 10% on top of the framework swap. CER went from 1.93% to 1.81%; it actually improved slightly. FP8 quantization can act as a mild regularizer, a known effect in the literature.


Static FP8 Scaling (68ms → 55ms)

This is where I had to look inside vLLM to understand what was actually happening.

After enabling FP8, Nsight Systems showed something unexpected. 13.4% of total GPU time was spent on vLLM's FP8 dynamic quantization scaling kernels. Not the GEMMs themselves, not attention, just the scaling reductions that happen before every GEMM. Over 63,000 of these tiny reduction kernels per transcription, each taking ~3 microseconds. We had to understand why.

When vLLM runs FP8 inference, it doesn't just store the weights in FP8. It also quantizes the activations on the fly before every matrix multiplication.

Before each GEMM, scan the activation tensor to find the absolute maximum value, compute a scale factor (amax / 448.0, where 448 is the max representable value in FP8 E4M3), then quantize the tensor using that scale.

vLLM's default is dynamic per-token scaling. It computes a separate scale for every row (token) in the activation matrix. That's an abs().amax() reduction across the hidden dimension for every single token.

At decode time, M=1: one token, one row. But vLLM still runs the per-token reduction kernel ~63,000 times per transcription. Each reduction is a full pass over 1,536 elements (Freya-STT's hidden dim). Individually cheap, collectively expensive.

Dynamic vs Static FP8 Scaling

We replaced it with per-tensor static scaling: one amax across the entire tensor instead of per-row.

We intercept vLLM's quantization function and force per-tensor mode.
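Conceptually, the two modes differ like this (a PyTorch sketch of the math only, not vLLM's CUDA kernels or the actual patch):

    import torch

    FP8_MAX = 448.0  # largest representable value in FP8 E4M3

    def quantize_dynamic_per_token(x: torch.Tensor):
        # vLLM's default: one abs-max reduction per row, on every call
        amax = x.abs().amax(dim=-1, keepdim=True)            # shape (M, 1)
        scale = (amax / FP8_MAX).clamp(min=1e-12)
        return (x / scale).to(torch.float8_e4m3fn), scale

    def quantize_static_per_tensor(x: torch.Tensor, scale: torch.Tensor):
        # static mode: 'scale' was computed once from calibration data,
        # so decode does no reduction at all, just scale + cast
        return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn), scale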

A caveat: per-token scaling exists for a reason. When activation distributions vary significantly across tokens, a single scale for the whole tensor can clip outliers or waste precision on rows with small values. Per-token gives each row its own dynamic range. In theory, it's strictly better for accuracy.

We tested it on our full evaluation set and saw zero CER regression. Freya-STT's activation distributions are well-behaved after training; the per-token scales end up very similar to each other anyway. Not every model will behave this way, so measure your accuracy before and after.

Result: 68ms → 55ms. 19% latency reduction with no accuracy cost.


Fused CUDA Kernels — Bifrost (55ms → ~50ms)

At this point, the easy wins were done. Framework swap, quantization, scaling patch — all config-level or Python-level changes. To go further, I had to look at what the GPU was actually doing. And honestly, this was the most important step for me.

Profiling: 60,000 kernel launches

I profiled the full pipeline using PyTorch Profiler and NVIDIA Nsight Systems. The results were surprising:

Nsight Systems Kernel Breakdown

~60,000 kernel launches per transcription. Average kernel duration: 0.97 microseconds. Kernel launch overhead on H100: 0.5–1.5 microseconds.

The kernels were faster than the overhead of launching them. We were launch-bound, not compute-bound.

The target: residual + RMSNorm + FP8 quantize

The worst offender was a three-operation sequence that runs between every GEMM pair in the decoder. In Qwen3-ASR, that's 28 layers, ~89 decode steps, twice per layer:

  1. Residual add: x = hidden + residual
  2. RMSNorm: normalize with root mean square, reduction across D=1,536
  3. FP8 quantize: find absmax, compute scale, quantize to FP8 E4M3

Three separate kernels. Each one writes its output to HBM, the next one reads it back. Six HBM round-trips for what should be a single pass through the data.
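In PyTorch terms, the unfused sequence is roughly this (a reference sketch; tensor shapes and eps are illustrative):

    import torch

    def decoder_glue_unfused(hidden, residual, weight, eps=1e-6):
        # Each line below is at least one separate kernel launch + HBM round-trip.
        x = hidden + residual                                          # 1) residual add
        inv_rms = torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + eps)
        y = (x.float() * inv_rms).to(x.dtype) * weight                 # 2) RMSNorm
        amax = y.abs().amax()                                          # 3) FP8 quantize
        scale = (amax.float() / 448.0).clamp(min=1e-12)
        q = (y / scale).to(torch.float8_e4m3fn)
        return q, scale, x       # x feeds the next residual connection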

Kernel Fusion: 3 Ops to 1

Writing the fused kernel

I created 'Bifrost', Freya's own kernel library, and wrote a fused Triton kernel that does all three operations in one launch.

The key question for any fused kernel is: do we have enough registers to hold all the intermediate data without spilling to local memory? If the data doesn't fit in registers, the kernel spills to L2/HBM and you lose the whole point of fusion.

For Qwen3-ASR, D=1,536 (hidden dimension). Each element is 2 bytes in BF16. The kernel needs to hold the hidden-state row, the residual row, the RMSNorm weight vector, and the intermediate values in registers at the same time.

Peak register usage works out to around 18-24 KB per SM. H100 has 256 KB of register file per SM. We're using less than 10% of the register budget, so there's no spill, no pressure, and the kernel can process one full row per thread block in a single pass without tiling.

One HBM read (hidden + residual + weights), all math in registers, one HBM write (FP8 output + scale). FP8 quantization is gated by a compile-time flag, so on GPUs without FP8 support, that branch is dead-code-eliminated by the Triton compiler.
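A stripped-down version of what that fusion looks like in Triton, as a sketch of the idea rather than the actual Bifrost kernel (the compile-time FP8 flag and the residual write-back are omitted for brevity, and it assumes a row fits in one block):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fused_add_rmsnorm_fp8(hidden_ptr, residual_ptr, weight_ptr,
                              out_ptr, scale_ptr, D, eps,
                              BLOCK_D: tl.constexpr):
        # One program per row (token). For decode, that's a single row (M=1).
        row = tl.program_id(0)
        offs = tl.arange(0, BLOCK_D)
        mask = offs < D

        h = tl.load(hidden_ptr + row * D + offs, mask=mask, other=0.0).to(tl.float32)
        r = tl.load(residual_ptr + row * D + offs, mask=mask, other=0.0).to(tl.float32)
        w = tl.load(weight_ptr + offs, mask=mask, other=0.0).to(tl.float32)

        x = h + r                                            # 1) residual add
        inv_rms = 1.0 / tl.sqrt(tl.sum(x * x, axis=0) / D + eps)
        y = x * inv_rms * w                                  # 2) RMSNorm
        amax = tl.max(tl.abs(y), axis=0)                     # 3) FP8 quantize
        scale = tl.maximum(amax, 1e-12) / 448.0
        q = (y / scale).to(tl.float8e4nv)                    # E4M3

        tl.store(out_ptr + row * D + offs, q, mask=mask)
        tl.store(scale_ptr + row, scale)

    def fused_glue(hidden, residual, weight, eps=1e-6):
        M, D = hidden.shape
        out = torch.empty((M, D), dtype=torch.float8_e4m3fn, device=hidden.device)
        scale = torch.empty((M,), dtype=torch.float32, device=hidden.device)
        fused_add_rmsnorm_fp8[(M,)](hidden, residual, weight, out, scale,
                                    D, eps, BLOCK_D=triton.next_power_of_2(D))
        return out, scale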

5.3x faster in microbenchmarks (0.121ms → 0.023ms at M=1, D=1536). In the full pipeline the savings are smaller, because this fusion replaces only ~56 of the 60,000 launches. But every microsecond counts when you're hunting for the last 5ms.

End-to-end latency dropped from ~55ms to ~50ms.

Same approach, different pipeline: CUTLASS Fused Conv + Snake Activation

I applied the same fusion methodology to Freya's TTS vocoder. The Snake activation (x + (1/alpha) * sin(alpha*x)^2) is a non-standard activation function used in neural vocoders. Unlike ReLU or GELU, it's periodic, which helps the model generate smooth waveforms. PyTorch has no native Snake op, so it decomposes into 5 separate elementwise kernels, each bouncing through HBM.
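In plain PyTorch it is just the one-liner below, which is exactly why eager mode fragments it into multiple elementwise kernels (the (N, C, L) layout and per-channel alpha shape are the usual vocoder convention, assumed here):

    import torch

    def snake(x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # x: (N, C, L) activations, alpha: (C, 1) learned per-channel frequency.
        # Eager PyTorch launches roughly one kernel per op below:
        # mul, sin, square, div, add — every one a trip through HBM.
        return x + (1.0 / alpha) * torch.sin(alpha * x) ** 2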

Snake Activation Fusion

I started with a standalone Triton kernel that fuses all 5 ops into one (3x faster than PyTorch, 91% of H100 memory bandwidth). Then I wrote a CUTLASS variant that pushed it further: 1.4x faster than Triton, using vectorized loads and avoiding Triton's code-generation overhead for this particular access pattern.
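A sketch of what the standalone Triton fusion looks like (simplified relative to the real kernel; one flat pass over the (N, C, L) tensor with the per-channel alpha recovered from the flattened index):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def snake_fused(x_ptr, alpha_ptr, out_ptr, n_elements, C, L,
                    BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        # recover the channel index from the flattened (N, C, L) layout
        ch = (offs // L) % C
        x = tl.load(x_ptr + offs, mask=mask, other=0.0).to(tl.float32)
        alpha = tl.load(alpha_ptr + ch, mask=mask, other=1.0).to(tl.float32)
        s = tl.sin(alpha * x)
        y = x + (1.0 / alpha) * s * s        # all five ops, one pass through registers
        tl.store(out_ptr + offs, y.to(out_ptr.dtype.element_ty), mask=mask)

    def snake(x, alpha):                      # x: (N, C, L), alpha: (C,)
        out = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        snake_fused[grid](x, alpha, out, n, x.shape[1], x.shape[2], BLOCK=1024)
        return out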

But the real win came from going one level deeper. The vocoder's inner loop is Snake(Conv1d(x) + bias), a convolution followed by bias-add followed by Snake. We can express the Conv1d as a GEMM (with im2col), and CUTLASS lets you define custom epilogues that run fused with the GEMM output. So I built a custom CUTLASS Epilogue Visitor Tree (EVT) that computes Snake directly on the GEMM accumulator, still in registers, before writing to HBM.

CUTLASS had no support for Snake, so I wrote a custom SnakeOp for the CUTLASS EVT framework, the same way CUTLASS ships Sm90Compute nodes for standard activations like GELU and ReLU. The SnakeOp takes the accumulator and alpha as inputs and computes x + (1/alpha) * sin(alpha*x)^2 inline. Since Snake activation is becoming common in audio/speech models, I rewrote it from scratch on a clean CUTLASS fork and opened a PR upstream. Review pending.

Conv7+Snake CUTLASS Fusion

Vocoder decoder: 2.29ms → 1.66ms (-28%). I added Bifrost runtime autotuning. It benchmarks multiple CUTLASS tile and cluster configurations (if supported) at startup and picks the fastest config per layer shape. Different ResidualUnits in the vocoder have different channel counts, so the optimal tile size varies.
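The autotuning pattern itself is simple: time each candidate config with CUDA events at startup and cache the winner per shape. A sketch of the idea (the config space and kernel entry points are Bifrost internals, so the names here are illustrative):

    import torch

    def pick_best_config(run_kernel, configs, shape, warmup=5, iters=20):
        # Benchmark each candidate config (e.g. a tuple of tile/cluster sizes)
        # for a given layer shape and return the fastest one.
        timings = {}
        for cfg in configs:
            for _ in range(warmup):
                run_kernel(shape, cfg)
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            for _ in range(iters):
                run_kernel(shape, cfg)
            end.record()
            torch.cuda.synchronize()
            timings[cfg] = start.elapsed_time(end) / iters   # ms per call
        return min(timings, key=timings.get)

    # cached once at startup, keyed by layer shape:
    # best[("conv7_snake", channels, length)] = pick_best_config(run, tile_configs, shape)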


The Scorecard

Optimization Scorecard

93% of the speedup came from the first two changes (framework swap + FP8). The last 15ms was all kernel profiling, Triton/CUTLASS development, and numerical validation.


At Scale

Single-request latency is one thing. Production is another. A system that's fast at concurrency=1 but falls apart at concurrency=8 is useless for real traffic.

We load-tested with real 5-second Turkish audio files, streaming mode, sustained for 30 seconds per concurrency level on a single H100:

Load Test Results

160 RPS on a single GPU. Zero errors, zero rejections across all concurrency levels.

At concurrency 1, each request gets the full GPU. At concurrency 16, vLLM's continuous batching groups the decode steps of all 16 requests into a single forward pass, M=16 instead of M=1. The GEMMs become slightly more compute-efficient at higher M (better arithmetic intensity), which partially offsets the increased load. That's why latency only doubles (71ms → 100ms) while throughput goes up 11x (14 → 160 RPS).

VRAM holds steady at ~11GB out of 80GB. The model weights are ~3.4GB in FP8, the KV cache and CUDA graphs take another ~7GB, and the rest is headroom. On our production H200s, this leaves over 120GB free for running other models on the same card.


Accuracy

Every optimization was validated against the same ground-truth transcripts. Speed means nothing if you break the output.

Accuracy Results

CER improved from 1.93% to 1.81% after FP8. WER moved by 0.07 percentage points. Both are within noise. The model produces identical transcriptions for the vast majority of inputs; the differences show up on edge cases where the model was already uncertain.

We run the accuracy suite on every deployment. If CER drifts above 2%, we can fall back: both the static FP8 patch and the fused kernels have their own kill switch. One env var and you're back to baseline, no redeployment needed.
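The kill switches are just environment-variable gates read at startup; something along these lines, with made-up variable names for illustration:

    import os

    # Hypothetical flag names, for illustration only.
    def _flag(name: str, default: str = "1") -> bool:
        return os.environ.get(name, default) == "1"

    USE_STATIC_FP8_SCALES = _flag("FREYA_STT_STATIC_FP8")
    USE_BIFROST_FUSED_KERNELS = _flag("FREYA_STT_BIFROST")

    # Setting either to "0" falls back to stock dynamic scaling / unfused
    # PyTorch ops without touching the deployed image.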


Closing

429ms to 50ms. Same model, same weights, same accuracy.

Looking back, the journey had a clear shape. The first layer was choosing the right serving engine, that alone got us 5.6x. The second was turning on FP8, a flag flip for another 10%. The third required reading vLLM's source code and understanding the quantization internals well enough to patch them. The fourth required profiling every CUDA kernel, understanding HBM bandwidth math, and writing fused kernels from scratch.

Each layer went deeper into the stack. Each one gave diminishing returns in absolute milliseconds. But in a voice agent pipeline where STT, LLM, and TTS run in sequence, every millisecond you save in STT is a millisecond the user hears sooner. At 50ms, STT is no longer the bottleneck. The conversation feels instant.

The methodology was the same at every layer: measure first, understand the bottleneck, build the smallest thing that kills it, measure again. I killed several approaches along the way (TensorRT for the encoder gave only 2ms, not worth the complexity), and kept the ones that moved the needle. The profiler was always the starting point, never intuition.

If there's one thing I'd want someone to take from this post: don't guess where your latency is. Profile it. The answer is almost never where you think.

Emre Albayrak