Conversation

@jjmaldonis
Contributor

  • minor change to stream_audio_file.py to track audio cursor
  • large refactor to print_transcript.py to calculate interim result and EOT message latency for both Nova and Flux
  • updated README to reflect changes

We've needed to update our STT latency scripts for a while. This code is meant to do that. It tracks two types of STT latency: interim result latency and EOT latency.

  • Interim result latency is calculated (for Nova) for interim_result=true messages or (for Flux) for Update messages. The code performs the typical audio cursor - transcript cursor calculation.
  • EOT latency is calculated (for Nova) for speech_final, is_final, and UtteranceEnd messages or (for Flux) for EndOfTurn and EagerEndOfTurn messages. The calculation is very basic and it's not as accurate a measurement as we want, but I think it's the best we can do. More details below:
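For both models, the interim result latency is the audio cursor minus the transcript cursor. A minimal sketch of that calculation (function and parameter names here are illustrative, not the actual script's API):

```python
def interim_latency(audio_cursor: float, transcript_end: float) -> float:
    """Latency estimate for an interim_result / Update message.

    audio_cursor: seconds of audio sent to the API so far (tracked by the sender).
    transcript_end: the end timestamp of the transcript in the message payload.
    """
    return audio_cursor - transcript_end
```

For example, if 34.48s of audio has been sent and the latest Update covers audio up to 34.26s, the latency estimate is 0.22s.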

How this script measures EOT latency:
When an EOT message is received (speech_final, is_final, UtteranceEnd, EndOfTurn, or EagerEndOfTurn), the calculation is simple: find the prior interim_result / Update message and subtract the received times. The result is the amount of wall clock time it took to receive an EOT signal after the prior interim result was received; said another way, it's the amount of time it took the EOT to trigger after Deepgram finished processing the most recent non-EOT message.

There are better ways to calculate EOT latency, but all of them (to my knowledge) require ground truth timestamps and careful labeling of the audio. Since we don't have that information, I believe the current calculation is a reasonable approximation.
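As a rough sketch of the approximation described above (the class and names are hypothetical, not the actual print_transcript.py implementation):

```python
# Message types that signal an end of turn, per the description above.
EOT_TYPES = {"speech_final", "is_final", "UtteranceEnd", "EndOfTurn", "EagerEndOfTurn"}

class EotLatencyTracker:
    """Approximates EOT latency as the wall-clock gap between an EOT message
    and the most recent non-EOT (interim_result / Update) message."""

    def __init__(self):
        self.last_interim_received_at = None

    def on_message(self, msg_type: str, received_at: float):
        """Returns EOT latency in seconds for EOT messages, else None."""
        if msg_type in EOT_TYPES:
            if self.last_interim_received_at is None:
                return None  # no prior interim result to measure against
            return received_at - self.last_interim_received_at
        # Non-EOT message: remember when we received it.
        self.last_interim_received_at = received_at
        return None
```

Note that consecutive EOT messages (e.g. an EagerEndOfTurn followed by an EndOfTurn) are both measured against the same prior interim result, which matches the output shown below.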


Below is a snippet of the output from the print_transcript.py script for a Flux transcript:

...
[04:05:29.592302] [latency=0.180s] [00:01:29.84 - 00:01:33.92] [Update]: The effective date is June first twenty
[04:05:29.776400] [latency=0.140s] [00:01:29.84 - 00:01:34.15] [Update]: The effective date is June first twenty twenty
[04:05:30.074540] [latency=0.200s] [00:01:29.84 - 00:01:34.40] [Update]: The effective date is June first twenty twenty three
[04:05:30.168314] [latency=0.220s] [00:01:29.84 - 00:01:34.48] [Update]: The effective date is June first twenty twenty three.
[04:05:30.194915] [eot_latency=0.027s] [00:01:29.84 - 00:01:34.56] [EagerEndOfTurn]: The effective date is June first twenty twenty three.
[04:05:30.236265] [eot_latency=0.068s] [00:01:29.84 - 00:01:34.56] [EndOfTurn]: The effective date is June first twenty twenty three.
[04:05:30.477548] [latency=0.200s] [00:01:34.56 - 00:01:34.79] [Update]:
[04:05:30.563035] [latency=0.220s] [00:01:34.56 - 00:01:34.87] [Update]:
...

Message Latency: min=0.120s, p50=0.180s, p95=0.240s, p99=0.340s, max=0.420s (510 measurements)

EOT Latency: min=0.021s, p50=0.116s, p95=0.322s, p99=0.360s, max=0.360s (27 events)
  EagerEndOfTurn: min=0.021s, p50=0.113s, p95=0.322s, p99=0.322s, max=0.322s (17 events)
  EndOfTurn: min=0.021s, p50=0.138s, p95=0.360s, p99=0.360s, max=0.360s (10 events)
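For reference, summary lines like the above can be produced with a simple nearest-rank percentile over the collected measurements; this is a sketch, and the actual script may interpolate differently:

```python
import math

def summarize(latencies):
    """Nearest-rank min/p50/p95/p99/max summary of a list of latencies."""
    xs = sorted(latencies)
    def pct(p):
        k = max(0, math.ceil(p / 100 * len(xs)) - 1)  # nearest-rank index
        return xs[k]
    return {"min": xs[0], "p50": pct(50), "p95": pct(95),
            "p99": pct(99), "max": xs[-1]}
```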

And below are the same sections for a Nova transcript:

...
[04:06:38.247087] [latency=0.156s] [00:01:26.45 - 00:01:30.74] [InterimResult]: 1901718. The
[04:06:39.247024] [latency=0.135s] [00:01:26.45 - 00:01:31.76] [InterimResult]: 1901718. The effective date
[04:06:40.250576] [latency=0.113s] [00:01:26.45 - 00:01:32.78] [InterimResult]: 1901718. The effective date is June
[04:06:41.353355] [latency=0.191s] [00:01:26.45 - 00:01:33.80] [InterimResult]: 1901718. The effective date is June first
[04:06:42.269397] [latency=0.116s] [00:01:26.45 - 00:01:34.78] [InterimResult]: 1901718. The effective date is 06/01/2023.
[04:06:42.652210] [eot_latency=0.383s] [00:01:26.45 - 00:01:35.70] [IsFinal] [SpeechFinal]: 1901718. The effective date is 06/01/2023.
[04:06:43.550564] [latency=0.116s] [00:01:35.60 - 00:01:36.80] [InterimResult]: Okay.
...

Message Latency: min=0.103s, p50=0.150s, p95=0.194s, p99=0.199s, max=0.199s (91 measurements)

EOT Latency: min=0.104s, p50=0.699s, p95=1.099s, p99=1.103s, max=1.103s (25 events)
  speech_final: min=0.104s, p50=0.602s, p95=1.099s, p99=1.103s, max=1.103s (21 events)
  is_final: min=0.899s, p50=0.909s, p95=0.998s, p99=0.998s, max=0.998s (4 events)

You'll notice that the format of the messages/transcript is similar between Flux and Nova, and that the summarized latency data at the end reflects the significant improvements that Flux provides, particularly for EOT latency.

@jjmaldonis jjmaldonis requested a review from a team as a code owner January 15, 2026 04:55
@nkaimakis

@jjmaldonis you are correct in your description of how hard it is to actually define EOT latency (and also latency in general). for the most accurate downstream benchmarking script assuming no ground truth data, the best approach would likely be to use SileroVAD and measure the delta between the last SileroVAD activity and the Flux EOT message, but even that relies on SileroVAD accuracy as a dependency.

the current approach is actually fairly favorable to Flux re: "the amount of time it took the EOT to trigger after Deepgram finished processing the most recent non-EOT message." in related customer documentation we should call this out explicitly.

actual EOT benchmarking shows a p50 EOT latency of closer to 320ms, compared to the above p50 of 116ms.

Flux processes audio in 80ms increments, regularly decodes every 240ms, and triggers additional decodes at 80ms increments when EOT thresholds have been reached.
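a quick back-of-the-envelope on that cadence (uniform-arrival assumption, ignoring model processing time):

```python
# Numbers from the cadence described above.
AUDIO_CHUNK_MS = 80       # audio is processed in 80ms increments
REGULAR_DECODE_MS = 240   # regular decodes happen every 240ms
EOT_DECODE_STEP_MS = AUDIO_CHUNK_MS  # extra decodes at 80ms steps near EOT

# audio arriving just after a regular decode waits up to a full interval
worst_case_wait_ms = REGULAR_DECODE_MS
# with uniformly distributed arrival, the expected wait is half an interval
average_wait_ms = REGULAR_DECODE_MS / 2
```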

I think this setup is generally on par/useful for customers and is probably not worth getting into the SileroVAD EOT stuff, though one thing that stands out is looking at 'Message Latency' across Flux vs Nova:
Flux: Message Latency: min=0.120s, p50=0.180s, p95=0.240s, p99=0.340s, max=0.420s (510 measurements)
Nova: Message Latency: min=0.103s, p50=0.150s, p95=0.194s, p99=0.199s, max=0.199s (91 measurements)

knowing how Nova operates via endpointing vs Flux via regular Updates and EOT-triggered decodes, I know this to be accurate. that said, it still feels a bit misleading in that the customer experience with Flux vs. Nova will be that Flux is generally faster. because this benchmarking is relative to audio_cursor - transcript_cursor, it benefits Nova operating on 1s chunks + endpointing (which I assume is the default ultra-low 10ms) vs Flux's standard intervals + EOT. perhaps an additional metric to highlight here to better capture the full picture is transcript updates per time frame (i.e. transcript updates per second). also in this vein, a time-to-first-transcript metric would be meaningful.
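a sketch of what those two metrics could look like (function and parameter names are illustrative):

```python
def updates_per_second(message_times: list) -> float:
    """message_times: sorted wall-clock receive times of transcript messages."""
    if len(message_times) < 2:
        return 0.0
    span = message_times[-1] - message_times[0]
    return (len(message_times) - 1) / span if span > 0 else 0.0

def time_to_first_transcript(stream_start: float, first_transcript_at: float) -> float:
    """Wall-clock delay from opening the stream to the first transcript message."""
    return first_transcript_at - stream_start
```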
