Monitoring & Observability

QBITEL Bridge exposes Prometheus metrics, OpenTelemetry distributed traces, and structured JSON logs, and ships with pre-built Grafana dashboards and Prometheus alerting rules.

Observability Stack

Component                Role                             Port
Prometheus               Metrics collection and alerting  9090
Grafana                  Dashboards and visualization     3000
OpenTelemetry Collector  Telemetry data pipeline          4317 (gRPC)
Tempo                    Distributed tracing backend      3200
Sentry                   Error tracking and alerting      --

Deploy Monitoring Stack

# Deploy the full observability stack via Kustomize
kubectl apply -k kubernetes/observability/

# Or via Helm with monitoring enabled
helm install qbitel-bridge ./helm/qbitel-bridge \
  --set monitoring.enabled=true

# Access Grafana
kubectl port-forward -n qbitel-monitoring svc/grafana 3000:3000
# Open http://localhost:3000 and log in with the default credentials (admin / qbitel-admin)

Key Metrics

AI Engine Metrics

Metric                            Description
qbitel_discovery_requests_total   Total protocol discovery requests
qbitel_discovery_latency_seconds  Discovery processing latency histogram
qbitel_classification_accuracy    Protocol classification accuracy gauge
qbitel_threat_alerts_total        Total threat alerts by severity
qbitel_agent_decisions_total      Autonomous agent decisions by type
qbitel_llm_inference_seconds      LLM inference latency
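
Because qbitel_discovery_latency_seconds is a histogram, percentiles are estimated from cumulative bucket counts rather than stored directly. The sketch below shows the kind of linear interpolation that PromQL's histogram_quantile applies. It is a simplified model rather than Prometheus's own code, and the bucket boundaries and counts are invented for illustration; the real bucket layout is not documented here.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs, sorted by
    upper bound and ending with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical bucket data: 100 discovery requests in total.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = estimate_quantile(0.95, example_buckets)
print(f"estimated p95 latency: {p95:.3f}s")
```

The estimate is only as precise as the bucket boundaries, so a percentile that falls in a wide bucket carries a correspondingly wide error.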

System Metrics

Metric                                 Description
qbitel_http_requests_total             HTTP request count by method and status
qbitel_http_request_duration_seconds   HTTP request duration histogram
qbitel_active_connections              Active connection count
qbitel_pqc_handshake_duration_seconds  PQC-TLS handshake duration
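
Since qbitel_http_requests_total is broken down by method and status, an error ratio can be derived from it. The sketch below parses a few lines in the Prometheus text exposition format and computes the share of 5xx responses. The label names method and status and all sample values are assumptions made for illustration; check the actual /metrics output for the real label keys.

```python
import re

# Hypothetical scrape output; label names and values are illustrative only.
SAMPLE = """\
# TYPE qbitel_http_requests_total counter
qbitel_http_requests_total{method="GET",status="200"} 940
qbitel_http_requests_total{method="POST",status="201"} 50
qbitel_http_requests_total{method="GET",status="500"} 8
qbitel_http_requests_total{method="POST",status="503"} 2
"""

# Matches `name{labels} value`; timestamps and unlabelled series are ignored
# to keep the sketch short.
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def error_ratio(text, metric="qbitel_http_requests_total"):
    """Return the fraction of counted requests whose status starts with 5."""
    total = errors = 0.0
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        value = float(match.group("value"))
        total += value
        if labels.get("status", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0


print(f"error ratio: {error_ratio(SAMPLE):.2%}")
```

This works on cumulative counter values from a single scrape. For alerting or SLO tracking, the same ratio would normally be computed over a time window from the counters' rate of increase.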

Pre-Built Dashboards

QBITEL Bridge ships with six Grafana dashboards:

  • Operational Overview -- system health, request rates, error rates
  • Protocol Analysis -- discovery success rates, classification metrics
  • Threat Detection -- alert volumes, severity distribution, response times
  • LLM Operations -- inference latency, token usage, cache hit rates
  • SLO Dashboard -- service level objective tracking
  • Explainability -- model decision explanations and drift monitoring

Distributed Tracing

OpenTelemetry provides end-to-end distributed tracing across all components. Configure each service's exporter and sampler through environment variables:

# Configure OpenTelemetry exporter
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_SERVICE_NAME=qbitel-ai-engine
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1  # 10% sampling rate
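
To make the sampler settings concrete, the sketch below models what parentbased_traceidratio with an argument of 0.1 means: a span with a parent inherits the parent's sampling decision, and a root span is sampled when the low 64 bits of its trace ID fall below a bound derived from the ratio. This is a simplified model written for illustration, loosely following the approach of the OpenTelemetry Python SDK; the function name and signature are hypothetical and not part of any OpenTelemetry API.

```python
import random

TRACE_ID_SPACE = 1 << 64  # the ratio check looks at the low 64 bits only


def should_sample(trace_id, ratio, parent_sampled=None):
    """Decide whether a span is sampled.

    parent_sampled is None for a root span; otherwise it is the parent's
    sampled flag, which the child simply inherits.
    """
    if parent_sampled is not None:
        return parent_sampled
    bound = round(ratio * TRACE_ID_SPACE)
    return (trace_id & (TRACE_ID_SPACE - 1)) < bound


# Simulate root spans with random 128-bit trace IDs (seeded for repeatability).
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
sampled = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"sampled {sampled / len(trace_ids):.1%} of root traces")
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, so sampled traces stay complete across components.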

Alerting Rules

Pre-configured Prometheus alerting rules are defined in ops/monitoring/prometheus/alerts.yml:

  • HighErrorRate -- fires when error rate exceeds 5% for 5 minutes
  • ServiceDown -- fires when health check fails for 2 minutes
  • HighLatency -- fires when p99 latency exceeds thresholds
  • LLMFailure -- fires when LLM inference fails repeatedly
  • ComplianceDrift -- fires when compliance status degrades
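
The "for 5 minutes" qualifier on HighErrorRate means the condition must hold continuously across rule evaluations before the alert fires; until then the alert is only pending. The sketch below models that behaviour using the 5% threshold and 5-minute hold stated above. It is an illustration of the semantics, not the contents of alerts.yml, and the function name and sample data are hypothetical.

```python
def alert_state(samples, threshold=0.05, hold_seconds=300):
    """Return 'inactive', 'pending', or 'firing' after the last evaluation.

    samples: chronological list of (unix_timestamp, error_ratio) pairs, one
    per rule evaluation.
    """
    breach_start = None
    state = "inactive"
    for timestamp, ratio in samples:
        if ratio > threshold:
            if breach_start is None:
                breach_start = timestamp  # first evaluation above threshold
            held_for = timestamp - breach_start
            state = "firing" if held_for >= hold_seconds else "pending"
        else:
            # Any evaluation below the threshold resets the hold timer.
            breach_start = None
            state = "inactive"
    return state


# Evaluations every 60 seconds with an 8% error ratio throughout.
steady_breach = [(t, 0.08) for t in range(0, 301, 60)]
# Same series, but the ratio briefly recovers at t=120.
brief_recovery = [(t, 0.02 if t == 120 else 0.08) for t in range(0, 301, 60)]

print(alert_state(steady_breach))
print(alert_state(brief_recovery))
```

The hold period trades detection speed for stability: short spikes never page anyone, but a sustained failure takes at least five minutes to surface.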

Structured Logging

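All components emit structured JSON logs suitable for aggregation. Each entry includes a trace_id field, which allows a log line to be matched to its distributed trace: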

{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "service": "ai_engine",
  "trace_id": "abc123",
  "message": "Protocol discovery completed",
  "protocol_count": 3,
  "processing_time_ms": 245
}
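
A log line of this shape can be produced with the Python standard library alone. The sketch below is one possible formatter, not the formatter QBITEL Bridge itself uses; the class and function names, and the convention of passing extra fields under a "fields" key, are assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object (hypothetical helper)."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        # Extra structured fields are passed via `extra={"fields": {...}}`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


# Write to an in-memory buffer here; a real service would use sys.stdout.
buffer = io.StringIO()
log = build_logger("ai_engine", buffer)
log.info(
    "Protocol discovery completed",
    extra={"trace_id": "abc123", "fields": {"protocol_count": 3, "processing_time_ms": 245}},
)
print(buffer.getvalue().strip())
```

Emitting one JSON object per line keeps the output easy for log shippers to parse without multi-line handling.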

Service Level Objectives

SLI                     Target    Window
Availability            99.9%     30-day rolling
API Latency (p99)       < 500 ms  30-day rolling
Error Rate              < 0.1%    30-day rolling
Discovery Success Rate  > 95%     7-day rolling
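
Each target implies an error budget: the amount of failure the window can absorb before the objective is missed. The sketch below works through the arithmetic for the availability target; the function names and the 10-minute outage figure are hypothetical and used only to show the calculation.

```python
def error_budget_minutes(target_percent, window_days):
    """Minutes of unavailability allowed within the rolling window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - target_percent / 100)


def budget_remaining(target_percent, window_days, downtime_minutes):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(target_percent, window_days)
    return 1 - downtime_minutes / budget


# 99.9% availability over 30 days: 43,200 minutes * 0.001, roughly 43.2 minutes.
budget = error_budget_minutes(99.9, 30)
print(f"availability budget: {budget:.1f} minutes per 30 days")

# Hypothetical example: a 10-minute outage earlier in the window.
print(f"budget remaining: {budget_remaining(99.9, 30, 10):.0%}")
```

Tracking the remaining fraction rather than raw uptime makes it clear how much room is left for risky changes within the current window.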

Next Steps