Conversation

@jjmaldonis
Contributor

  • minor change to stream_audio_file.py to track audio cursor
  • large refactor to print_transcript.py to calculate interim result and EOT message latency for both Nova and Flux
  • updated README to reflect changes

We've needed to update our STT latency scripts for a while. This code is meant to do that. It tracks two types of STT latency: interim result latency and EOT latency.

  • Interim result latency is calculated (for Nova) for interim_result=true messages or (for Flux) for Update messages. The code performs the typical audio cursor - transcript cursor calculation.
  • EOT latency is calculated (for Nova) for speech_final, is_final, and UtteranceEnd messages or (for Flux) for EndOfTurn and EagerEndOfTurn messages. The calculation is very basic and it's not as accurate a measurement as we want, but I think it's the best we can do. More details below:
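For both models, the interim result latency is the audio cursor minus the transcript cursor. A minimal sketch of that calculation (function and parameter names here are illustrative, not the actual script's API):

```python
def interim_latency(audio_cursor: float, transcript_end: float) -> float:
    """Latency estimate for an interim_result / Update message.

    audio_cursor: seconds of audio sent to the API so far (tracked by the sender).
    transcript_end: the end timestamp of the transcript in the message payload.
    """
    return audio_cursor - transcript_end
```

For example, if 34.48s of audio has been sent and the latest Update covers audio up to 34.26s, the latency estimate is 0.22s.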

How this script measures EOT latency:
When an EOT message is received (speech_final, is_final, UtteranceEnd, EndOfTurn, or EagerEndOfTurn), the calculation is simple: find the prior interim_result / Update message and subtract the received times. The result is the amount of wall clock time it took to receive an EOT signal after the prior interim result was received; said another way, it's the amount of time it took the EOT to trigger after Deepgram finished processing the most recent non-EOT message.

There are better ways to calculate EOT latency, but all of them (to my knowledge) require ground truth timestamps and careful labeling of the audio. Since we don't have that information, I believe the current calculation is a reasonable approximation.
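As a rough sketch of the approximation described above (the class and names are hypothetical, not the actual print_transcript.py implementation):

```python
# Message types that signal an end of turn, per the description above.
EOT_TYPES = {"speech_final", "is_final", "UtteranceEnd", "EndOfTurn", "EagerEndOfTurn"}

class EotLatencyTracker:
    """Approximates EOT latency as the wall-clock gap between an EOT message
    and the most recent non-EOT (interim_result / Update) message."""

    def __init__(self):
        self.last_interim_received_at = None

    def on_message(self, msg_type: str, received_at: float):
        """Returns EOT latency in seconds for EOT messages, else None."""
        if msg_type in EOT_TYPES:
            if self.last_interim_received_at is None:
                return None  # no prior interim result to measure against
            return received_at - self.last_interim_received_at
        # Non-EOT message: remember when we received it.
        self.last_interim_received_at = received_at
        return None
```

Note that consecutive EOT messages (e.g. an EagerEndOfTurn followed by an EndOfTurn) are both measured against the same prior interim result, which matches the output shown below.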


Below is a snippet of the output from the print_transcript.py script for a Flux transcript:

...
[04:05:29.592302] [latency=0.180s] [00:01:29.84 - 00:01:33.92] [Update]: The effective date is June first twenty
[04:05:29.776400] [latency=0.140s] [00:01:29.84 - 00:01:34.15] [Update]: The effective date is June first twenty twenty
[04:05:30.074540] [latency=0.200s] [00:01:29.84 - 00:01:34.40] [Update]: The effective date is June first twenty twenty three
[04:05:30.168314] [latency=0.220s] [00:01:29.84 - 00:01:34.48] [Update]: The effective date is June first twenty twenty three.
[04:05:30.194915] [eot_latency=0.027s] [00:01:29.84 - 00:01:34.56] [EagerEndOfTurn]: The effective date is June first twenty twenty three.
[04:05:30.236265] [eot_latency=0.068s] [00:01:29.84 - 00:01:34.56] [EndOfTurn]: The effective date is June first twenty twenty three.
[04:05:30.477548] [latency=0.200s] [00:01:34.56 - 00:01:34.79] [Update]:
[04:05:30.563035] [latency=0.220s] [00:01:34.56 - 00:01:34.87] [Update]:
...

Message Latency: min=0.120s, p50=0.180s, p95=0.240s, p99=0.340s, max=0.420s (510 measurements)

EOT Latency: min=0.021s, p50=0.116s, p95=0.322s, p99=0.360s, max=0.360s (27 events)
  EagerEndOfTurn: min=0.021s, p50=0.113s, p95=0.322s, p99=0.322s, max=0.322s (17 events)
  EndOfTurn: min=0.021s, p50=0.138s, p95=0.360s, p99=0.360s, max=0.360s (10 events)
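For reference, summary lines like the above can be produced with a simple nearest-rank percentile over the collected measurements; this is a sketch, and the actual script may interpolate differently:

```python
import math

def summarize(latencies):
    """Nearest-rank min/p50/p95/p99/max summary of a list of latencies."""
    xs = sorted(latencies)
    def pct(p):
        k = max(0, math.ceil(p / 100 * len(xs)) - 1)  # nearest-rank index
        return xs[k]
    return {"min": xs[0], "p50": pct(50), "p95": pct(95),
            "p99": pct(99), "max": xs[-1]}
```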

And below are the same sections for a Nova transcript:

...
[04:06:38.247087] [latency=0.156s] [00:01:26.45 - 00:01:30.74] [InterimResult]: 1901718. The
[04:06:39.247024] [latency=0.135s] [00:01:26.45 - 00:01:31.76] [InterimResult]: 1901718. The effective date
[04:06:40.250576] [latency=0.113s] [00:01:26.45 - 00:01:32.78] [InterimResult]: 1901718. The effective date is June
[04:06:41.353355] [latency=0.191s] [00:01:26.45 - 00:01:33.80] [InterimResult]: 1901718. The effective date is June first
[04:06:42.269397] [latency=0.116s] [00:01:26.45 - 00:01:34.78] [InterimResult]: 1901718. The effective date is 06/01/2023.
[04:06:42.652210] [eot_latency=0.383s] [00:01:26.45 - 00:01:35.70] [IsFinal] [SpeechFinal]: 1901718. The effective date is 06/01/2023.
[04:06:43.550564] [latency=0.116s] [00:01:35.60 - 00:01:36.80] [InterimResult]: Okay.
...

Message Latency: min=0.103s, p50=0.150s, p95=0.194s, p99=0.199s, max=0.199s (91 measurements)

EOT Latency: min=0.104s, p50=0.699s, p95=1.099s, p99=1.103s, max=1.103s (25 events)
  speech_final: min=0.104s, p50=0.602s, p95=1.099s, p99=1.103s, max=1.103s (21 events)
  is_final: min=0.899s, p50=0.909s, p95=0.998s, p99=0.998s, max=0.998s (4 events)

You'll notice that the format of the messages/transcript is similar between Flux and Nova, and that the summarized latency data at the end reflects the significant improvements that Flux provides, particularly for EOT latency.

@jjmaldonis jjmaldonis requested a review from a team as a code owner January 15, 2026 04:55
@nkaimakis

@jjmaldonis you are correct in your description of how hard it is to actually define EOT latency (and also latency in general). for the most accurate downstream benchmarking script assuming no ground truth data, the best approach would likely be to use SileroVAD and measure the delta between the last SileroVAD activity and the Flux EOT message, but even that relies on SileroVAD accuracy as a dependency.

the current approach is actually fairly favorable to Flux re: "the amount of time it took the EOT to trigger after Deepgram finished processing the most recent non-EOT message." in related customer documentation we should call this out explicitly.

actual EOT benchmarking shows a p50 EOT latency of closer to 320ms, compared to the above p50 of 116ms.

Flux processes audio in 80ms increments, regularly decodes every 240ms, and triggers additional decodes at 80ms increments when EOT thresholds have been reached.
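a quick back-of-the-envelope on that cadence (uniform-arrival assumption, ignoring model processing time):

```python
# Numbers from the cadence described above.
AUDIO_CHUNK_MS = 80       # audio is processed in 80ms increments
REGULAR_DECODE_MS = 240   # regular decodes happen every 240ms
EOT_DECODE_STEP_MS = AUDIO_CHUNK_MS  # extra decodes at 80ms steps near EOT

# audio arriving just after a regular decode waits up to a full interval
worst_case_wait_ms = REGULAR_DECODE_MS
# with uniformly distributed arrival, the expected wait is half an interval
average_wait_ms = REGULAR_DECODE_MS / 2
```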

I think this setup is generally on par/useful for customers and is probably not worth getting into the SileroVAD EOT stuff, though one thing that stands out is looking at 'Message Latency' across Flux vs Nova:
Flux: Message Latency: min=0.120s, p50=0.180s, p95=0.240s, p99=0.340s, max=0.420s (510 measurements)
Nova: Message Latency: min=0.103s, p50=0.150s, p95=0.194s, p99=0.199s, max=0.199s (91 measurements)

knowing how Nova operates via endpointing vs Flux via regular Updates and EOT-triggered decodes, I know this to be accurate. that said, it still feels a bit misleading in that the customer experience with Flux vs. Nova will be that Flux is generally faster. because this benchmarking is relative to audio_cursor - transcript_cursor, it benefits Nova operating on 1s chunks + endpointing (which I assume is the default ultra-low 10ms) vs Flux's standard intervals + EOT. perhaps an additional metric to highlight here to better capture the full picture is transcript updates per time frame (i.e. transcript updates per second). also in this vein, a time-to-first-transcript metric would be meaningful.
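a sketch of what those two metrics could look like (function and parameter names are illustrative):

```python
def updates_per_second(message_times: list) -> float:
    """message_times: sorted wall-clock receive times of transcript messages."""
    if len(message_times) < 2:
        return 0.0
    span = message_times[-1] - message_times[0]
    return (len(message_times) - 1) / span if span > 0 else 0.0

def time_to_first_transcript(stream_start: float, first_transcript_at: float) -> float:
    """Wall-clock delay from opening the stream to the first transcript message."""
    return first_transcript_at - stream_start
```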
