
New Feature: standardized latency telemetry for chat completion operations (.NET) #13387

@Cozmopolit

Feature Request: standardized latency telemetry for chat completion operations (.NET)

Summary

I’d like to request standardized latency telemetry for chat completion operations in the .NET Semantic Kernel, across all connectors (Azure OpenAI / OpenAI, Gemini, Mistral, etc.).

The goal is to have one consistent way to observe:

  • Total response time per chat completion call, and
  • For streaming scenarios, time‑to‑first‑token (TTFT) vs. time‑to‑last‑token,

without every host application having to roll its own stopwatch‑based timing and correlation logic around SK.

Motivation / Use Cases

In our host application (CIT – an internal conversation & telemetry system) we want to:

  • Monitor LLM performance per endpoint/model (P50/P90+ latencies),
  • Compare providers/models (e.g., Gemini 2.5 vs. Mistral Medium vs. Azure GPT‑4o),
  • Combine cost (token usage) and latency (response time) in the same analysis,
  • Detect regressions, throttling, or overloaded endpoints early.

Today SK already exposes token usage telemetry (via OTel Counters), which is extremely helpful. For latency, however, the only option is to:

  • Start/stop a stopwatch in the host application around each GetChatMessageContent[s]Async call,
  • Try to correlate those timings with SK’s token metrics, often across async boundaries and retries.

In our case we actually implemented a connector‑independent latency layer (a stopwatch in our ConversationRunManager, a queue‑based registry, a bounded wait in our MeterListener), but we had to disable it again because the cross‑component correlation (multiple calls per run, async flows, streaming, retries) proved too fragile and produced inconsistent or incorrect values.
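For illustration, this is roughly the shape of the timing wrapper every host currently has to write itself. This is a minimal sketch with hypothetical host‑side names (TimedChatCaller, RecordLatency); only GetChatMessageContentsAsync is actual SK API:

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public sealed class TimedChatCaller
{
    private readonly IChatCompletionService _chat;

    public TimedChatCaller(IChatCompletionService chat) => _chat = chat;

    public async Task<IReadOnlyList<ChatMessageContent>> GetWithTimingAsync(
        ChatHistory history, CancellationToken ct = default)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return await _chat.GetChatMessageContentsAsync(history, cancellationToken: ct);
        }
        finally
        {
            sw.Stop();
            // Correlating this number with SK's token-usage counters is the fragile part:
            // the counters are emitted inside the connector, on a different code path,
            // with no shared per-call identifier.
            RecordLatency(sw.Elapsed.TotalMilliseconds);
        }
    }

    private static void RecordLatency(double ms) { /* host-specific sink */ }
}
```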

This feels like something that SK itself could do much more robustly and uniformly inside the connector implementations.

Proposed behavior (high‑level, not prescriptive)

At a high level, I’m asking for:

  • A per‑call latency signal for chat completions emitted by SK,
  • With well‑defined semantics for:
    • Non‑streaming calls, and
    • Streaming calls (time‑to‑first‑token vs. time‑to‑last‑token).

For example (one possible design, not a hard requirement):

  • Emit an Activity or a Histogram‑style metric per call, such as:

    • Microsoft.SemanticKernel.Connectors.OpenAI.ChatCompletion (Activity)
      with Activity.Duration representing total time from “request sent” to “last token / end of stream”.

    • And optionally:

      • A metric or tag capturing time‑to‑first‑token for streaming calls,
      • Tags like provider, model, is_streaming, status (success/failed/cancelled), etc.
  • Apply the same pattern consistently across .NET connectors:

    • OpenAI / Azure OpenAI
    • Google Gemini
    • Mistral / OpenRouter / other HTTP‑based connectors exposed via SK
    • (or at least define the pattern once and roll it out over time)
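To make this concrete, here is a rough sketch of what such instrumentation could look like inside a connector's streaming path. The concrete names below (the metric names, the "chat.completion" activity name, the tag keys) are placeholders I made up for illustration, not a proposal for the final naming scheme:

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.SemanticKernel;

// All names below (metric names, activity name, tag keys) are placeholders.
internal static class ChatCompletionTelemetry
{
    private static readonly ActivitySource s_activitySource =
        new("Microsoft.SemanticKernel.Connectors.OpenAI.ChatCompletion");

    private static readonly Meter s_meter =
        new("Microsoft.SemanticKernel.Connectors.OpenAI");

    private static readonly Histogram<double> s_duration =
        s_meter.CreateHistogram<double>("semantic_kernel.chat_completion.duration", unit: "ms");

    private static readonly Histogram<double> s_timeToFirstToken =
        s_meter.CreateHistogram<double>("semantic_kernel.chat_completion.time_to_first_token", unit: "ms");

    public static async IAsyncEnumerable<StreamingChatMessageContent> MeasureStreamingAsync(
        IAsyncEnumerable<StreamingChatMessageContent> inner, string provider, string model)
    {
        var tags = new TagList { { "provider", provider }, { "model", model }, { "is_streaming", true } };
        using var activity = s_activitySource.StartActivity("chat.completion");
        var sw = Stopwatch.StartNew();
        var firstToken = true;

        await foreach (var chunk in inner)
        {
            if (firstToken)
            {
                // Time-to-first-token: request start until the first streamed chunk arrives.
                s_timeToFirstToken.Record(sw.Elapsed.TotalMilliseconds, tags);
                activity?.SetTag("time_to_first_token_ms", sw.Elapsed.TotalMilliseconds);
                firstToken = false;
            }
            yield return chunk;
        }

        // Total duration: request start until the last token / end of stream.
        // Activity.Duration covers the same window via the enclosing using.
        s_duration.Record(sw.Elapsed.TotalMilliseconds, tags);
    }
}
```

A combination like this (Activity for correlation, Histogram for aggregation) is what I have in mind with the "combination" option under the open questions further below.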

I’m not asking for SK to log anything to a database or to introduce host‑specific correlation IDs (like our internal RunId). The main ask is:

For each logical chat completion operation that SK executes, expose a standard latency measurement in telemetry, with clear semantics documented for streaming vs. non‑streaming.

This would allow hosts to:

  • Plug in their OpenTelemetry exporter / monitoring solution of choice, and
  • Do all latency analysis (per model, per endpoint, per environment) on top of SK’s telemetry, without having to duplicate timing logic inside each application.
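Consuming such a signal on the host side would then be a standard OpenTelemetry SDK setup, roughly like the sketch below (assuming SK keeps its ActivitySources and Meters under the Microsoft.SemanticKernel prefix, and using the OTLP exporter purely as an example):

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

// Host-side consumption sketch; the "Microsoft.SemanticKernel*" names assume
// SK keeps its ActivitySources/Meters under that prefix.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("Microsoft.SemanticKernel*")
    .AddOtlpExporter()   // or Application Insights, Prometheus, etc.
    .Build();

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("Microsoft.SemanticKernel*")
    .AddOtlpExporter()
    .Build();
```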

Out of scope / non‑goals

  • No requirements on where or how hosts consume telemetry (Grafana, Application Insights, etc. are host concerns).
  • No requirement that SK propagates host‑level correlation IDs (like our internal run IDs) – that would be a separate discussion.
  • No request to change existing token usage telemetry semantics; this is specifically about latency.

Prior attempts and why host‑side timing is fragile

We did try to keep this “outside” of SK:

  • Stopwatch around GetChatMessageContentsAsync in our ConversationRunManager,
  • Queue‑based registry keyed by (conversationId:endpointId:runId),
  • Bounded wait (up to ~2s) in our token usage MeterListener to match latency and token metrics,
  • AsyncLocal fallbacks when the registry wasn’t populated in time.
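Heavily simplified, the correlation part looked roughly like the sketch below (hypothetical names; the real implementation is more involved). The guessed key and the bounded wait are exactly where it falls apart under concurrency:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics.Metrics;
using System.Threading;

// Simplified sketch of the fragile host-side correlation pattern described above.
internal sealed class LatencyCorrelator
{
    // Populated by the code that wraps GetChatMessageContentsAsync with a Stopwatch.
    private readonly ConcurrentDictionary<string, double> _latencyByKey = new();
    private MeterListener? _listener;

    public void RegisterLatency(string key, double elapsedMs) => _latencyByKey[key] = elapsedMs;

    public void Start()
    {
        _listener = new MeterListener();
        _listener.InstrumentPublished = (instrument, l) =>
        {
            // Subscribe to SK's token-usage counters (assuming the Microsoft.SemanticKernel meter prefix).
            if (instrument.Meter.Name.StartsWith("Microsoft.SemanticKernel"))
                l.EnableMeasurementEvents(instrument);
        };
        _listener.SetMeasurementEventCallback<int>((instrument, measurement, tags, state) =>
        {
            // Problem: nothing in the measurement says *which* call it belongs to,
            // so we guess a key and wait (bounded) for the stopwatch side to show up.
            var key = GuessKeyFromAmbientContext();
            SpinWait.SpinUntil(() => _latencyByKey.ContainsKey(key), TimeSpan.FromSeconds(2));
            _latencyByKey.TryRemove(key, out var latencyMs);
            Persist(instrument.Name, measurement, latencyMs); // latencyMs may be missing => NULLs, mis-correlation
        });
        _listener.Start();
    }

    private static string GuessKeyFromAmbientContext() => "conversationId:endpointId:runId"; // AsyncLocal in practice
    private static void Persist(string metric, int tokens, double latencyMs) { /* host-specific sink */ }
}
```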

This worked for simple cases, but under more realistic, multi‑call / multi‑run scenarios we repeatedly saw:

  • NULL or missing latency values,
  • Monotonically increasing ResponseTimeMs over logically separate calls (mis‑correlation),
  • Race conditions due to the “measure here, consume there, hope they meet within N ms” architecture.

In the end we disabled latency tracking in the host application to avoid logging misleading data. This experience was the trigger for this feature request: SK is already the place where the actual connector calls and streaming loops live; it seems like the most natural and robust place to measure and expose these timings once, uniformly.

Open questions for maintainers

I’d very much appreciate guidance on:

  • Whether this fits into Semantic Kernel’s long‑term observability / telemetry strategy,
  • Whether you’d prefer:
    • Activities,
    • Metrics (Histograms),
    • Or a combination (e.g., Activity for correlation + Histogram for aggregation),
  • What a good, stable metric/Activity naming scheme and tag set would look like.

I’m happy to adjust to whatever design you consider appropriate for SK, and to contribute a PR once there is agreement on the desired semantics.
