
New Feature: standardized latency telemetry for chat completion operations (.NET) #13387

@Cozmopolit

Feature Request: standardized latency telemetry for chat completion operations (.NET)

Summary

I’d like to request standardized latency telemetry for chat completion operations in the .NET Semantic Kernel, across all connectors (Azure OpenAI / OpenAI, Gemini, Mistral, etc.).

The goal is to have one consistent way to observe:

  • Total response time per chat completion call, and
  • For streaming scenarios, time‑to‑first‑token (TTFT) vs. time‑to‑last‑token,

without every host application having to roll its own stopwatch‑based timing and correlation logic around SK.

Motivation / Use Cases

In our host application (CIT – an internal conversation & telemetry system) we want to:

  • Monitor LLM performance per endpoint/model (P50/P90+ latencies),
  • Compare providers/models (e.g., Gemini 2.5 vs. Mistral Medium vs. Azure GPT‑4o),
  • Combine cost (token usage) and latency (response time) in the same analysis,
  • Detect regressions, throttling, or overloaded endpoints early.

Today SK already exposes token usage telemetry (via OTel Counters), which is extremely helpful. For latency, however, the only option is to:

  • Start/stop a stopwatch in the host application around each GetChatMessageContent[s]Async call,
  • Try to correlate those timings with SK’s token metrics, often across async boundaries and retries.

In our case we actually implemented a connector‑independent latency layer (a stopwatch in our ConversationRunManager, a queue‑based registry, a bounded wait in our MeterListener), but we had to disable it again because the cross‑component correlation (multiple calls per run, async flows, streaming, retries) proved too fragile and produced inconsistent or incorrect values.
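For illustration, this is roughly the shape of the timing wrapper every host currently has to write itself. This is a minimal sketch with hypothetical host‑side names (TimedChatCaller, RecordLatency); only GetChatMessageContentsAsync is actual SK API:

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public sealed class TimedChatCaller
{
    private readonly IChatCompletionService _chat;

    public TimedChatCaller(IChatCompletionService chat) => _chat = chat;

    public async Task<IReadOnlyList<ChatMessageContent>> GetWithTimingAsync(
        ChatHistory history, CancellationToken ct = default)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return await _chat.GetChatMessageContentsAsync(history, cancellationToken: ct);
        }
        finally
        {
            sw.Stop();
            // Correlating this number with SK's token-usage counters is the fragile part:
            // the counters are emitted inside the connector, on a different code path,
            // with no shared per-call identifier.
            RecordLatency(sw.Elapsed.TotalMilliseconds);
        }
    }

    private static void RecordLatency(double ms) { /* host-specific sink */ }
}
```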

This feels like something that SK itself could do much more robustly and uniformly inside the connector implementations.

Proposed behavior (high‑level, not prescriptive)

At a high level, I’m asking for:

  • A per‑call latency signal for chat completions emitted by SK,
  • With well‑defined semantics for:
    • Non‑streaming calls, and
    • Streaming calls (time‑to‑first‑token vs. time‑to‑last‑token).

For example (one possible design, not a hard requirement):

  • Emit an Activity or a Histogram‑style metric per call, such as:

    • Microsoft.SemanticKernel.Connectors.OpenAI.ChatCompletion (Activity)
      with Activity.Duration representing total time from “request sent” to “last token / end of stream”.

    • And optionally:

      • A metric or tag capturing time‑to‑first‑token for streaming calls,
      • Tags like provider, model, is_streaming, status (success/failed/cancelled), etc.
  • Apply the same pattern consistently across .NET connectors:

    • OpenAI / Azure OpenAI
    • Google Gemini
    • Mistral / OpenRouter / other HTTP‑based connectors exposed via SK
    • (or at least define the pattern once and roll it out over time)
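To make this concrete, here is a rough sketch of what such instrumentation could look like inside a connector's streaming path. The concrete names below (the metric names, the "chat.completion" activity name, the tag keys) are placeholders I made up for illustration, not a proposal for the final naming scheme:

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.SemanticKernel;

// All names below (metric names, activity name, tag keys) are placeholders.
internal static class ChatCompletionTelemetry
{
    private static readonly ActivitySource s_activitySource =
        new("Microsoft.SemanticKernel.Connectors.OpenAI.ChatCompletion");

    private static readonly Meter s_meter =
        new("Microsoft.SemanticKernel.Connectors.OpenAI");

    private static readonly Histogram<double> s_duration =
        s_meter.CreateHistogram<double>("semantic_kernel.chat_completion.duration", unit: "ms");

    private static readonly Histogram<double> s_timeToFirstToken =
        s_meter.CreateHistogram<double>("semantic_kernel.chat_completion.time_to_first_token", unit: "ms");

    public static async IAsyncEnumerable<StreamingChatMessageContent> MeasureStreamingAsync(
        IAsyncEnumerable<StreamingChatMessageContent> inner, string provider, string model)
    {
        var tags = new TagList { { "provider", provider }, { "model", model }, { "is_streaming", true } };
        using var activity = s_activitySource.StartActivity("chat.completion");
        var sw = Stopwatch.StartNew();
        var firstToken = true;

        await foreach (var chunk in inner)
        {
            if (firstToken)
            {
                // Time-to-first-token: request start until the first streamed chunk arrives.
                s_timeToFirstToken.Record(sw.Elapsed.TotalMilliseconds, tags);
                activity?.SetTag("time_to_first_token_ms", sw.Elapsed.TotalMilliseconds);
                firstToken = false;
            }
            yield return chunk;
        }

        // Total duration: request start until the last token / end of stream.
        // Activity.Duration covers the same window via the enclosing using.
        s_duration.Record(sw.Elapsed.TotalMilliseconds, tags);
    }
}
```

A combination like this (Activity for correlation, Histogram for aggregation) is what I have in mind with the "combination" option under the open questions further below.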

I’m not asking for SK to log anything to a database or to introduce host‑specific correlation IDs (like our internal RunId). The main ask is:

For each logical chat completion operation that SK executes, expose a standard latency measurement in telemetry, with clear semantics documented for streaming vs. non‑streaming.

This would allow hosts to:

  • Plug in their OpenTelemetry exporter / monitoring solution of choice, and
  • Do all latency analysis (per model, per endpoint, per environment) on top of SK’s telemetry, without having to duplicate timing logic inside each application.
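Consuming such a signal on the host side would then be a standard OpenTelemetry SDK setup, roughly like the sketch below (assuming SK keeps its ActivitySources and Meters under the Microsoft.SemanticKernel prefix, and using the OTLP exporter purely as an example):

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

// Host-side consumption sketch; the "Microsoft.SemanticKernel*" names assume
// SK keeps its ActivitySources/Meters under that prefix.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("Microsoft.SemanticKernel*")
    .AddOtlpExporter()   // or Application Insights, Prometheus, etc.
    .Build();

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("Microsoft.SemanticKernel*")
    .AddOtlpExporter()
    .Build();
```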

Out of scope / non‑goals

  • No requirements on where or how hosts consume telemetry (Grafana, Application Insights, etc. are host concerns).
  • No requirement that SK propagates host‑level correlation IDs (like our internal run IDs) – that would be a separate discussion.
  • No request to change existing token usage telemetry semantics; this is specifically about latency.

Prior attempts and why host‑side timing is fragile

We did try to keep this “outside” of SK:

  • Stopwatch around GetChatMessageContentsAsync in our ConversationRunManager,
  • Queue‑based registry keyed by (conversationId:endpointId:runId),
  • Bounded wait (up to ~2s) in our token usage MeterListener to match latency and token metrics,
  • AsyncLocal fallbacks when the registry wasn’t populated in time.
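Heavily simplified, the correlation part looked roughly like the sketch below (hypothetical names; the real implementation is more involved). The guessed key and the bounded wait are exactly where it falls apart under concurrency:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics.Metrics;
using System.Threading;

// Simplified sketch of the fragile host-side correlation pattern described above.
internal sealed class LatencyCorrelator
{
    // Populated by the code that wraps GetChatMessageContentsAsync with a Stopwatch.
    private readonly ConcurrentDictionary<string, double> _latencyByKey = new();
    private MeterListener? _listener;

    public void RegisterLatency(string key, double elapsedMs) => _latencyByKey[key] = elapsedMs;

    public void Start()
    {
        _listener = new MeterListener();
        _listener.InstrumentPublished = (instrument, l) =>
        {
            // Subscribe to SK's token-usage counters (assuming the Microsoft.SemanticKernel meter prefix).
            if (instrument.Meter.Name.StartsWith("Microsoft.SemanticKernel"))
                l.EnableMeasurementEvents(instrument);
        };
        _listener.SetMeasurementEventCallback<int>((instrument, measurement, tags, state) =>
        {
            // Problem: nothing in the measurement says *which* call it belongs to,
            // so we guess a key and wait (bounded) for the stopwatch side to show up.
            var key = GuessKeyFromAmbientContext();
            SpinWait.SpinUntil(() => _latencyByKey.ContainsKey(key), TimeSpan.FromSeconds(2));
            _latencyByKey.TryRemove(key, out var latencyMs);
            Persist(instrument.Name, measurement, latencyMs); // latencyMs may be missing => NULLs, mis-correlation
        });
        _listener.Start();
    }

    private static string GuessKeyFromAmbientContext() => "conversationId:endpointId:runId"; // AsyncLocal in practice
    private static void Persist(string metric, int tokens, double latencyMs) { /* host-specific sink */ }
}
```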

This worked for simple cases, but under more realistic, multi‑call / multi‑run scenarios we repeatedly saw:

  • NULL or missing latency values,
  • Monotonically increasing ResponseTimeMs over logically separate calls (mis‑correlation),
  • Race conditions due to the “measure here, consume there, hope they meet within N ms” architecture.

In the end we disabled latency tracking in the host application to avoid logging misleading data. This experience was the trigger for this feature request: SK is already the place where the actual connector calls and streaming loops live; it seems like the most natural and robust place to measure and expose these timings once, uniformly.

Open questions for maintainers

I’d very much appreciate guidance on:

  • Whether this fits into Semantic Kernel’s long‑term observability / telemetry strategy,
  • Whether you’d prefer:
    • Activities,
    • Metrics (Histograms),
    • Or a combination (e.g., Activity for correlation + Histogram for aggregation),
  • What a good, stable metric/Activity naming scheme and tag set would look like.

I’m happy to adjust to whatever design you consider appropriate for SK, and to contribute a PR once there is agreement on the desired semantics.
