$

[REPLACE] noc --watch production

Services: 42 OK, 3 warning, 1 critical | SLA this month: 99.94%
09:41:00
service-matrix --all    46 services
api-core      OK     38ms
api-edge      OK     21ms
auth          OK     17ms
search        WARN   412ms
payment       OK     52ms
billing       OK     44ms
notify        OK     29ms
media-conv    DOWN
user-svc      OK     33ms
order-svc     OK     41ms
inventory     OK     36ms
gateway       OK     12ms
report-gen    WARN   1.2s
scheduler     OK     8ms
webhook       OK     26ms
cdn-sync      OK     64ms
geo-ip        OK     9ms
ml-infer      WARN   688ms
latency --p99 --15m    ms
gateway → api         p99 86ms
api → db-primary      p99 24ms
api → cache           p99 4ms
api → search          p99 412ms
api → media-conv      timeout
tail -f /var/log/prod.log    live
09:40:58 CRIT [REPLACE] media-conv pod 3/3 CrashLoopBackOff
09:40:41 WARN [REPLACE] search p99 412ms exceeds the 300ms threshold
09:40:22 INFO [REPLACE] deploy api-core v2.18.3 rolling release complete
09:39:57 INFO [REPLACE] autoscaler scaled workers 4 → 6
09:39:30 WARN [REPLACE] report-gen queue depth 1,840
09:39:04 INFO [REPLACE] backup job db-primary complete (42GB)
09:38:46 INFO [REPLACE] cert-manager renewed *.example.com successfully
09:38:12 WARN [REPLACE] ml-infer GPU utilization 96%
09:37:55 INFO [REPLACE] canary traffic 5% → 25% (api-edge)
09:37:31 INFO [REPLACE] healthcheck all passed (46/46)
09:37:08 INFO [REPLACE] watching 1,284 targets
uptime --30d    SLO 99.9%
99.98%    API cluster
99.95%    data layer
99.41%    media pipeline
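
To put the uptime panel in perspective, the percentages can be converted into minutes of downtime and compared with the budget that the 99.9% SLO allows. The following is a minimal sketch, not part of the dashboard itself. It assumes the figures are time-based availability over a 30-day window, and the names `WINDOW_MINUTES`, `downtime_minutes`, and `over_budget` are hypothetical.

```python
# Sketch only: assumes uptime is measured as time-based availability
# over a 30-day window (the dashboard does not say how it is computed).
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in 30 days
SLO_PERCENT = 99.9             # target shown in the uptime panel

# 30-day uptime figures copied from the panel.
uptime_percent = {
    "API cluster": 99.98,
    "data layer": 99.95,
    "media pipeline": 99.41,
}


def downtime_minutes(percent: float) -> float:
    """Minutes of downtime implied by an uptime percentage over the window."""
    return (100.0 - percent) / 100.0 * WINDOW_MINUTES


def over_budget(figures: dict[str, float], slo: float) -> list[str]:
    """Names of groups whose implied downtime exceeds the SLO's budget."""
    budget = downtime_minutes(slo)
    return [name for name, pct in figures.items() if downtime_minutes(pct) > budget]


if __name__ == "__main__":
    budget = downtime_minutes(SLO_PERCENT)
    print(f"budget at {SLO_PERCENT}%: {budget:.2f} min")
    for name, pct in uptime_percent.items():
        print(f"{name}: {downtime_minutes(pct):.2f} min down")
    print("over budget:", over_budget(uptime_percent, SLO_PERCENT))
```

Under that assumption, the 99.9% SLO allows about 43 minutes of downtime in 30 days, while 99.41% corresponds to roughly 255 minutes, so only the media pipeline exceeds its budget.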
queues --depth    jobs
events-ingest     3,214
email-out         912
report-gen        1,840
media-conv        7,468
webhook-retry     486