High-Observability Scenario

An operational overlay for when you need to understand what the adapter is doing in production, not just whether it is up.

What this scenario is for

This profile targets deployments where an up/down signal is not enough and operators must be able to explain the adapter's runtime behavior.

Typical examples:

a shared internal platform with on-call ownership
a production deployment that needs metrics, trace-adjacent visibility, or export to an OTel collector
environments where upstream instability must be diagnosed quickly
deployments where queue pressure, cleanup behavior, and session churn need to be visible
systems where health checks alone are not enough for operations

This is best understood as an observability overlay. You usually apply it on top of either:

a single-node durable deployment, or
a distributed production deployment

It is not about changing business behavior. It is about making behavior visible.

What this scenario assumes

A typical high-observability setup assumes:

the adapter is important enough to monitor deliberately
there is an OpenTelemetry collector or vendor endpoint available
operators want more than a binary healthy/unhealthy signal
production debugging speed matters
some additional operational complexity is acceptable in exchange for better visibility

If you expect to answer questions like "Why are uploads failing?", "Why is this upstream degrading?", or "Why is storage pressure climbing?", this is the right scenario.

Recommended knobs and values

These are the highest-value settings to make explicit when observability matters.

Core

core:
  log_level: "info"
  max_start_wait_seconds: 60
  cleanup_interval_seconds: 60
  upstream_metadata_cache_ttl_seconds: 60

log_level: "info" gives operators more useful runtime signals than quieter defaults. Startup and cleanup timings become easier to reason about when they are explicitly set. A shorter metadata cache TTL can make upstream changes visible faster during operations and debugging.

This profile is not about noisy debugging logs all the time. It is about having enough signal to operate the service confidently.

Telemetry

telemetry:
  enabled: true
  transport: "http"
  endpoint: "https://otel-collector.internal/v1/metrics"
  logs_endpoint: "https://otel-collector.internal/v1/logs"
  service_name: "remote-mcp-adapter"
  service_namespace: "mcp"
  export_interval_seconds: 10
  export_timeout_seconds: 5
  max_queue_size: 2048
  queue_batch_size: 256
  periodic_flush_seconds: 5
  shutdown_drain_timeout_seconds: 10
  emit_logs: true
  flush_on_shutdown: true
  drop_on_queue_full: true

enabled: true is the core switch that turns telemetry export on. service_name and service_namespace make environments and services easier to tell apart in observability backends. Explicit export cadence and queue sizing make telemetry behavior predictable under load. flush_on_shutdown: true helps preserve buffered signals during restarts and rollouts.

If your collector requires headers, add them explicitly:

telemetry:
  headers:
    Authorization: "Bearer ${OTEL_TOKEN}"

Telemetry queue behavior

telemetry:
  max_queue_size: 2048
  queue_batch_size: 256
  periodic_flush_seconds: 5
  drop_on_queue_full: true

In production, blocking core request handling because the telemetry pipeline is slow is usually the wrong tradeoff. Dropping excess telemetry under sustained pressure is often preferable to coupling service latency to exporter latency. Batch sizing and flush cadence let you tune exporter pressure without relying on unsupported worker-count knobs.
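
As an illustration of the drop-instead-of-block tradeoff, here is a minimal Python sketch of a bounded buffer. It is not the adapter's implementation: the TelemetryQueue class and its methods are hypothetical, and only the parameter names max_queue_size and queue_batch_size are borrowed from the config above. It models the drop_on_queue_full: true case only.

```python
from collections import deque


class TelemetryQueue:
    """Bounded in-memory buffer that never blocks the caller (illustrative only)."""

    def __init__(self, max_queue_size=2048, queue_batch_size=256):
        self.max_queue_size = max_queue_size
        self.queue_batch_size = queue_batch_size
        self.items = deque()
        self.dropped = 0

    def offer(self, item):
        """Enqueue if there is room; otherwise count a drop and return False."""
        if len(self.items) >= self.max_queue_size:
            self.dropped += 1  # keep the loss visible instead of waiting
            return False
        self.items.append(item)
        return True

    def next_batch(self):
        """Remove and return up to queue_batch_size items for one export call."""
        size = min(self.queue_batch_size, len(self.items))
        return [self.items.popleft() for _ in range(size)]


# Tiny sizes so the behavior is easy to follow.
queue = TelemetryQueue(max_queue_size=4, queue_batch_size=2)
accepted = [queue.offer(n) for n in range(6)]
print(accepted)            # [True, True, True, True, False, False]
print(queue.dropped)       # 2
print(queue.next_batch())  # [0, 1]
```

In this sketch, offer never waits: a full queue costs a counter increment, not request latency. Counting drops matters, because a silent drop would hide exactly the exporter pressure this profile is meant to expose.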

Health and upstream behavior

core:
  max_start_wait_seconds: 60

  upstream_ping:
    interval_seconds: 10
    timeout_seconds: 3
    failure_threshold: 3
    open_cooldown_seconds: 15
    half_open_probe_allowance: 3

Upstream visibility matters more once multiple dependencies are involved. Explicit ping and cooldown settings make degraded behavior easier to interpret from metrics and logs. Operators need to understand whether an upstream is down, flapping, or recovering.
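
To make the down, flapping, and recovering states concrete, the sketch below models a small circuit breaker in Python. The exact semantics of these fields are not defined here, so everything in the sketch is an assumed, conventional reading: failure_threshold consecutive failed pings open the breaker, open_cooldown_seconds must pass before probing resumes, and up to half_open_probe_allowance failed probes are tolerated before it reopens. The UpstreamBreaker class is hypothetical.

```python
class UpstreamBreaker:
    """Toy circuit breaker; field semantics are assumptions, not the adapter's code."""

    def __init__(self, failure_threshold=3, open_cooldown_seconds=15,
                 half_open_probe_allowance=3):
        self.failure_threshold = failure_threshold
        self.open_cooldown_seconds = open_cooldown_seconds
        self.half_open_probe_allowance = half_open_probe_allowance
        self.state = "closed"
        self.failures = 0
        self.opened_at = None
        self.probes_left = 0

    def _open(self, now):
        self.state, self.opened_at, self.failures = "open", now, 0

    def record_ping(self, ok, now):
        """Feed one ping result taken at time `now` (seconds); return the new state."""
        cooled = (self.state == "open"
                  and now - self.opened_at >= self.open_cooldown_seconds)
        if cooled:
            self.state = "half_open"
            self.probes_left = self.half_open_probe_allowance
        if self.state == "open":
            return self.state  # still cooling down; the result is ignored
        if ok:
            self.state, self.failures = "closed", 0
            return self.state
        if self.state == "half_open":
            self.probes_left -= 1
            if self.probes_left <= 0:
                self._open(now)  # all allowed probes failed
            return self.state
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open(now)
        return self.state


# Pings every 10 seconds: three failures, one ping during cooldown, then recovery.
breaker = UpstreamBreaker()
pings = [(0, False), (10, False), (20, False), (30, False), (40, True)]
states = [breaker.record_ping(ok, now) for now, ok in pings]
print(states)  # ['closed', 'closed', 'open', 'open', 'closed']
```

In this model, with interval_seconds: 10, three consecutive failures span 20 seconds from the first failed ping, so that is the soonest the breaker can open. A shorter cooldown recovers faster but makes a flapping upstream look noisier in metrics.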

State persistence and topology

This profile works with both durable single-node and distributed deployments.

for one-node setups, pair it with disk persistence
for multi-replica setups, pair it with Redis persistence

Observability does not replace a sound topology. It helps you understand the topology you already chose.

Auth and security

core:
  auth:
    enabled: true

Observability is not a replacement for auth. Production services that are important enough to monitor are usually also important enough to protect.

Storage and limits

This overlay should usually be paired with explicit limits from either the durable or restricted-limits profiles. Observability tells you when pressure is rising — limits tell the system what to do before pressure becomes catastrophic. Both together are much more useful than either one alone.

Full example

This example focuses on the observability-specific settings you would layer onto a real deployment.

core:
  log_level: "info"
  max_start_wait_seconds: 60
  cleanup_interval_seconds: 60
  upstream_metadata_cache_ttl_seconds: 60

  auth:
    enabled: true

  upstream_ping:
    interval_seconds: 10
    timeout_seconds: 3
    failure_threshold: 3
    open_cooldown_seconds: 15
    half_open_probe_allowance: 3

telemetry:
  enabled: true
  transport: "http"
  endpoint: "https://otel-collector.internal/v1/metrics"
  logs_endpoint: "https://otel-collector.internal/v1/logs"
  service_name: "remote-mcp-adapter"
  service_namespace: "mcp"
  export_interval_seconds: 10
  export_timeout_seconds: 5
  max_queue_size: 2048
  queue_batch_size: 256
  periodic_flush_seconds: 5
  shutdown_drain_timeout_seconds: 10
  emit_logs: true
  drop_on_queue_full: true
  flush_on_shutdown: true

  headers:
    Authorization: "Bearer ${OTEL_TOKEN}"
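
Before rolling out values like these, it can help to check that they are consistent with each other. The Python sketch below applies a few rules of thumb to numbers taken from the example. The rules and the check_observability_values helper are assumptions for illustration; the adapter may validate differently or not at all.

```python
def check_observability_values(core, telemetry):
    """Return warnings for value combinations that look suspicious (rules of thumb)."""
    warnings = []
    if telemetry["export_timeout_seconds"] >= telemetry["export_interval_seconds"]:
        warnings.append("export timeout is not shorter than the export interval")
    if telemetry["queue_batch_size"] > telemetry["max_queue_size"]:
        warnings.append("batch size exceeds queue capacity")
    ping = core["upstream_ping"]
    if ping["timeout_seconds"] >= ping["interval_seconds"]:
        warnings.append("ping timeout is not shorter than the ping interval")
    if telemetry["enabled"] and not telemetry.get("service_name"):
        warnings.append("telemetry is enabled without a service_name")
    return warnings


# Values copied from the example above, as plain dictionaries.
core = {"upstream_ping": {"interval_seconds": 10, "timeout_seconds": 3}}
telemetry = {
    "enabled": True,
    "service_name": "remote-mcp-adapter",
    "export_interval_seconds": 10,
    "export_timeout_seconds": 5,
    "max_queue_size": 2048,
    "queue_batch_size": 256,
}
print(check_observability_values(core, telemetry))  # []
```

A check like this could run in CI against the rendered config so that a later edit to one value does not silently contradict another.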

What this profile improves

Compared to a minimally instrumented deployment, this profile gives you better visibility into:

upstream health and recovery behavior
exporter backlog and telemetry pressure
per-instance runtime behavior in multi-node environments
whether restarts and shutdowns lose operational signal
whether configuration changes have measurable effect

This profile improves your ability to explain and diagnose behavior. It does not replace careful limits, auth, or topology design.

Common high-observability mistakes

Turning on telemetry without naming the service clearly

Metrics become much less useful when multiple environments and services all report under ambiguous names.

Letting telemetry backpressure affect request handling

Observability should help the service, not become a new availability risk. Queue and drop behavior should be chosen intentionally.

Using observability as a substitute for limits

Metrics can tell you that storage or sessions are growing, but they do not enforce any ceiling on their own.

Leaving collector auth implicit

If your exporter requires headers or vendor tokens, define them explicitly instead of assuming the environment around the process will always inject them correctly.
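
One way to catch a missing token early is a preflight check that scans the config text for ${NAME} placeholders and reports any that are unset. This Python sketch assumes only the placeholder syntax shown in the headers example; the missing_env_vars helper is hypothetical, and how the adapter itself resolves placeholders is not covered here.

```python
import os
import re

# Matches ${NAME} placeholders such as ${OTEL_TOKEN}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(config_text, environ=None):
    """Return the sorted placeholder names that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    names = PLACEHOLDER.findall(config_text)
    return sorted({name for name in names if not environ.get(name)})


config_text = 'Authorization: "Bearer ${OTEL_TOKEN}"'
print(missing_env_vars(config_text, environ={}))                     # ['OTEL_TOKEN']
print(missing_env_vars(config_text, environ={"OTEL_TOKEN": "abc"}))  # []
```

Running a check like this before startup turns a quietly failing exporter into an explicit deployment error.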

Not reviewing the signals

Instrumentation only helps if somebody actually knows what the important dashboards, alerts, and failure patterns look like.

When to apply this profile

Use this overlay when:

the adapter has operational owners
incident response speed matters
upstream instability needs to be diagnosed quickly
production behavior needs to be measured rather than guessed

If your bigger concern is strict risk reduction or strict fairness, the high-security and restricted-limits profiles are the more immediately relevant overlays.

Next steps

Back to: Configuration — overview and scenario index.
Previous scenario: Restricted-Limits Scenario — stricter resource ceilings.
See also: Telemetry — exporter setup, field meanings, and operational details.
See also: Health — health endpoint behavior and degraded states.
Next scenario: Agent-Optimized Code Mode Scenario — compact discovery for coding agents and smaller models.