Monitoring & Observability
QBITEL Bridge provides comprehensive observability through metrics, distributed tracing, structured logging, and pre-built dashboards.
Observability Stack
| Component | Role | Port |
|---|---|---|
| Prometheus | Metrics collection and alerting | 9090 |
| Grafana | Dashboards and visualization | 3000 |
| OpenTelemetry Collector | Telemetry data pipeline | 4317 (gRPC) |
| Tempo | Distributed tracing backend | 3200 |
| Sentry | Error tracking and alerting | -- |
Deploy Monitoring Stack
# Deploy the full observability stack via Kustomize
kubectl apply -k kubernetes/observability/
# Or via Helm with monitoring enabled
helm install qbitel-bridge ./helm/qbitel-bridge \
  --set monitoring.enabled=true
# Access Grafana
kubectl port-forward -n qbitel-monitoring svc/grafana 3000:3000
# Open http://localhost:3000 (admin / qbitel-admin)
Key Metrics
AI Engine Metrics
| Metric | Description |
|---|---|
| qbitel_discovery_requests_total | Total protocol discovery requests |
| qbitel_discovery_latency_seconds | Discovery processing latency histogram |
| qbitel_classification_accuracy | Protocol classification accuracy gauge |
| qbitel_threat_alerts_total | Total threat alerts by severity |
| qbitel_agent_decisions_total | Autonomous agent decisions by type |
| qbitel_llm_inference_seconds | LLM inference latency |
System Metrics
| Metric | Description |
|---|---|
| qbitel_http_requests_total | HTTP request count by method and status |
| qbitel_http_request_duration_seconds | HTTP request duration histogram |
| qbitel_active_connections | Active connection count |
| qbitel_pqc_handshake_duration_seconds | PQC-TLS handshake duration |
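Several of the metrics above are histograms, and percentile figures such as p99 latency are estimated from their cumulative bucket counts rather than measured directly. The following is a minimal sketch of that estimation using linear interpolation within a bucket, similar in spirit to Prometheus's `histogram_quantile` function. The function name `estimate_quantile` and the sample bucket values are illustrative assumptions, not part of QBITEL Bridge.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound; the last bucket must have an upper bound of +Inf.
    This is a simplified, hypothetical helper for illustration only.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("the last bucket must have an upper bound of +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful estimate

    rank = q * total  # position of the requested quantile among all observations
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Cannot interpolate inside an open-ended or empty bucket;
                # fall back to the highest known finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Illustrative (made-up) buckets for a request duration histogram.
sample_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100), (math.inf, 100)]
print(f"estimated p99: {estimate_quantile(0.99, sample_buckets):.3f}s")
```

Because the estimate interpolates inside a bucket, its precision depends on where the bucket boundaries sit relative to the latencies of interest.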
Pre-Built Dashboards
QBITEL Bridge ships with the following Grafana dashboards:
- Operational Overview -- system health, request rates, error rates
- Protocol Analysis -- discovery success rates, classification metrics
- Threat Detection -- alert volumes, severity distribution, response times
- LLM Operations -- inference latency, token usage, cache hit rates
- SLO Dashboard -- service level objective tracking
- Explainability -- model decision explanations and drift monitoring
Distributed Tracing
OpenTelemetry provides end-to-end distributed tracing across all components:
# Configure OpenTelemetry exporter
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_SERVICE_NAME=qbitel-ai-engine
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1 # 10% sampling rate
Alerting Rules
Pre-configured Prometheus alerting rules are defined in ops/monitoring/prometheus/alerts.yml:
- HighErrorRate -- fires when error rate exceeds 5% for 5 minutes
- ServiceDown -- fires when health check fails for 2 minutes
- HighLatency -- fires when p99 latency exceeds its configured threshold
- LLMFailure -- fires when LLM inference fails repeatedly
- ComplianceDrift -- fires when compliance status degrades
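Rules such as HighErrorRate fire only after a condition has held for a sustained period, which filters out short spikes. The sketch below models that behavior in plain Python, loosely mirroring how a Prometheus rule with a `for` duration stays pending until the condition has been continuously true for that long. The class name `SustainedThresholdAlert` and the sample readings are hypothetical; the actual rule expressions live in alerts.yml and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThresholdAlert:
    """Fires once a value has stayed above a threshold for a hold period."""

    threshold: float      # e.g. 0.05 for a 5% error rate
    hold_seconds: float   # e.g. 300 for 5 minutes
    breach_started: Optional[float] = None  # timestamp of the first breach

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one evaluation and return True if the alert is firing."""
        if value <= self.threshold:
            # Condition cleared: reset, so the hold period starts over.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = timestamp  # alert becomes pending
        return timestamp - self.breach_started >= self.hold_seconds


# Illustrative evaluations: (seconds since start, observed error rate).
alert = SustainedThresholdAlert(threshold=0.05, hold_seconds=300)
for ts, error_rate in [(0, 0.08), (150, 0.07), (300, 0.06), (330, 0.01)]:
    state = "firing" if alert.observe(ts, error_rate) else "not firing"
    print(f"t={ts}s error_rate={error_rate:.2f} -> {state}")
```

Resetting on the first healthy reading is a deliberate simplification: a single recovery clears the pending state, so an intermittent fault that keeps dipping below the threshold never fires under this model.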
Structured Logging
All components emit structured JSON logs for aggregation:
{
"timestamp": "2025-01-15T10:30:00Z",
"level": "INFO",
"service": "ai_engine",
"trace_id": "abc123",
"message": "Protocol discovery completed",
"protocol_count": 3,
"processing_time_ms": 245
}
Service Level Objectives
| SLI | Target | Window |
|---|---|---|
| Availability | 99.9% | 30-day rolling |
| API Latency (p99) | < 500ms | 30-day rolling |
| Error Rate | < 0.1% | 30-day rolling |
| Discovery Success Rate | > 95% | 7-day rolling |
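Each target in the table implies an error budget: the amount of failure the window can absorb before the objective is missed. A 99.9% availability target over 30 days, for example, allows about 43.2 minutes of downtime. The following is a small sketch of that arithmetic; the function names and the request counts are illustrative assumptions rather than anything QBITEL Bridge provides.

```python
def error_budget_minutes(target_percent: float, window_days: int) -> float:
    """Minutes of allowed unavailability for a target over a rolling window."""
    if not 0.0 < target_percent < 100.0:
        raise ValueError("target_percent must be between 0 and 100 (exclusive)")
    allowed_fraction = 1.0 - target_percent / 100.0
    return allowed_fraction * window_days * 24 * 60


def budget_remaining(target_percent: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    Returns 1.0 when nothing has been spent, 0.0 when the budget is exactly
    used up, and a negative value once the objective has been missed.
    """
    if not 0.0 < target_percent < 100.0:
        raise ValueError("target_percent must be between 0 and 100 (exclusive)")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so no budget has been consumed
    allowed_failures = total_requests * (1.0 - target_percent / 100.0)
    return 1.0 - failed_requests / allowed_failures


# Availability row: 99.9% over a 30-day rolling window.
print(f"downtime budget: {error_budget_minutes(99.9, 30):.1f} minutes")

# Error-rate row (< 0.1%), with made-up traffic numbers.
remaining = budget_remaining(99.9, total_requests=1_000_000, failed_requests=250)
print(f"error budget remaining: {remaining:.0%}")
```

Tracking the remaining fraction rather than the raw error rate makes it easier to see how quickly a budget is being spent within the window.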
Next Steps
- Troubleshooting -- use monitoring data to diagnose issues
- Production Checklist -- monitoring setup verification
- Compliance Frameworks -- compliance monitoring dashboards