---
name: logging-monitoring
description: Sets up a logging and monitoring stack with alerting rules and dashboards. Use when adding observability, configuring alerts, setting up log aggregation, or building operational dashboards.
metadata:
  category: product-building
  author: skillar
  version: "1.0"
---

# Logging and Monitoring

> **Usage:** Copy this skill into Claude → replace [BRACKETS] with your details → get polished output.

## What You Get
A complete observability setup: structured logging configuration, monitoring tool selection, alerting rules, and dashboard designs that together give you full visibility into your application's health and performance.

## Instructions

You are an observability engineer who has built monitoring stacks for high-traffic applications. You have learned that monitoring is not about collecting every metric — it is about surfacing the signals that predict problems before users notice. You design systems where the team trusts their dashboards and acts on alerts instead of ignoring them.

Design a logging and monitoring stack for the following application:

- **Application type:** [APP_TYPE — e.g., web API with background workers and scheduled jobs]
- **Tech stack:** [TECH_STACK — e.g., Node.js, PostgreSQL, Redis, deployed on AWS ECS]
- **Current observability:** [CURRENT — e.g., console.log only, no monitoring, basic CloudWatch]
- **Traffic and scale:** [SCALE — e.g., 10k requests/hour, 50k daily active users]
- **Budget for tools:** [BUDGET — e.g., $0-200/month for monitoring services]
- **Team size and on-call:** [TEAM — e.g., 3 devs, informal on-call rotation]
- **Compliance requirements:** [COMPLIANCE — e.g., audit log required, log retention 1 year, none]

1. TOOL SELECTION
   - Recommend specific tools for logging, metrics, tracing, and error tracking
   - Compare 2-3 options per category, with pricing at the stated scale
   - Justify managed service vs self-hosted based on team size and budget
   - Design the data flow: application → collection → storage → visualization
   - Ensure all tools integrate cleanly with the existing tech stack
   - Calculate the total monthly cost at the stated scale and project how per-unit cost changes as traffic grows

2. STRUCTURED LOGGING
   - Define the standard log entry schema with required and optional fields
   - Specify log levels (debug, info, warn, error, fatal) with usage guidelines
   - Implement request-scoped context propagation (request ID, user ID, trace ID)
   - Configure log output format (JSON for production, human-readable for development)
   - Set up log rotation and retention policies by environment
   - Design PII filtering to prevent sensitive data from reaching log storage
   - Provide code configuration for the logging library
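
For illustration only, the kind of logging configuration this section asks for might look like the sketch below. It uses only the Python standard library; the names `request_context`, `redact`, `JsonFormatter`, and `build_logger` are hypothetical, and the email-only PII pattern is an assumption that a real filter would need to extend. The output should be adapted to the stack named in [TECH_STACK].

```python
import contextvars
import json
import logging
import re
from datetime import datetime, timezone

# Request-scoped context; assumed to be set once per request by middleware.
request_context = contextvars.ContextVar("request_context", default={})

# Illustrative PII pattern: email addresses only. Real filters need more rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": redact(record.getMessage()),
        }
        # Merge request ID, user ID, and trace ID if they were set.
        entry.update(request_context.get())
        return json.dumps(entry)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request handler would call `request_context.set({...})` once at the start of the request; every log line emitted afterwards in that context carries the same IDs without passing them through function arguments.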

3. APPLICATION METRICS
   - Define the RED metrics (Rate, Errors, Duration) for every service
   - Add business metrics: signups, transactions, key feature usage
   - Implement custom metrics for queue depths, cache hit rates, connection pool usage
   - Set up histogram buckets for response time distribution
   - Configure metric labels and cardinality limits to prevent cost explosion
   - Design the metrics collection pipeline with appropriate flush intervals
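
As a rough sketch of the RED metrics and histogram buckets described above, the in-memory recorder below shows the shape of the data. It is not a real metrics client: `RedMetrics`, `observe`, and the `BUCKETS` boundaries are hypothetical, and a production setup would use the client library of whichever tool is selected in section 1.

```python
import bisect
from collections import defaultdict

# Assumed bucket upper bounds in seconds; tune to the service's real latencies.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0)


class RedMetrics:
    """Tracks Rate, Errors, and Duration per (route, method) pair."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        # One counter per bucket plus a final overflow bucket.
        self.durations = defaultdict(lambda: [0] * (len(BUCKETS) + 1))

    def observe(self, route: str, method: str, status: int, seconds: float) -> None:
        # Pass the route template (e.g. "/users/:id"), not the raw path,
        # so the number of distinct label values stays bounded.
        key = (route, method)
        self.requests[key] += 1
        if status >= 500:
            self.errors[key] += 1
        # bisect_left returns the first bucket whose upper bound is >= seconds.
        self.durations[key][bisect.bisect_left(BUCKETS, seconds)] += 1

    def error_rate(self, key) -> float:
        total = self.requests[key]
        return self.errors[key] / total if total else 0.0
```

Keying on the route template rather than the raw URL is the main cardinality control here: one series per endpoint instead of one per user ID.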

4. ALERTING RULES
   - Define alerts for the four golden signals: latency, traffic, errors, saturation
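   - Set thresholds from baseline data where it exists; otherwise propose starting values and state the assumption behind each
   - Implement multi-window alerting (a short and a long evaluation window must both breach) to reduce false positives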
   - Design escalation paths: Slack notification → PagerDuty page → phone call
   - Create alert suppression rules for maintenance windows and known issues
   - Write alert descriptions that include impact assessment and first response steps
   - Set up heartbeat alerts for critical background jobs and scheduled tasks
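
The multi-window and heartbeat rules above can be sketched in a few lines. This is a minimal illustration under assumed names (`WindowStats`, `should_page`, `heartbeat_missed`) and an assumed 50% grace period; real rules would be written in the query language of the chosen alerting tool.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Error and request counts over one evaluation window."""

    errors: int
    requests: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_page(short: WindowStats, long: WindowStats, threshold: float) -> bool:
    """Fire only when both the short and the long window breach the threshold."""
    return short.error_rate > threshold and long.error_rate > threshold


def heartbeat_missed(
    last_seen: float, now: float, expected_interval: float, grace: float = 0.5
) -> bool:
    """True when a job has been silent longer than its interval plus a grace margin."""
    return now - last_seen > expected_interval * (1 + grace)
```

Requiring both windows means a brief spike (short window only) does not page, and a problem that has already recovered (long window only) does not keep paging.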

5. DASHBOARD DESIGN
   - Design an executive dashboard: system health at a glance with traffic-light status indicators
   - Create a service-level dashboard: request rate, error rate, latency percentiles
   - Build a database dashboard: query duration, connection pool, slow queries
   - Add an infrastructure dashboard: CPU, memory, disk, network per service
   - Design a business metrics dashboard: signups, conversions, revenue
   - Use consistent time ranges, color coding, and annotation practices

6. INCIDENT RESPONSE INTEGRATION
   - Design the alert-to-investigation workflow with tool links
   - Create log search templates for common investigation patterns
   - Set up distributed tracing to follow requests across services
   - Configure automated correlation from each alert to the relevant log entries
   - Build runbook links directly into alert notifications
   - Establish a post-incident review process using monitoring data
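
One way to picture the alert-to-log correlation step is the filter below. It is a sketch only: the `Alert` type, the `correlate` function, the five-minute lookback, and the log entry fields (`service`, `level`, `timestamp` as Unix seconds) are all assumptions, and in practice this would be a saved query in the log tool rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    fired_at: float  # Unix seconds


def correlate(alert: Alert, log_entries: list, lookback_seconds: float = 300.0) -> list:
    """Return error-level entries from the alerting service inside the lookback window."""
    start = alert.fired_at - lookback_seconds
    return [
        entry
        for entry in log_entries
        if entry.get("service") == alert.service
        and entry.get("level") in ("error", "fatal")
        and start <= entry.get("timestamp", 0) <= alert.fired_at
    ]
```

The same three filters (service, level, time window) are a reasonable starting point for the log search templates mentioned above.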

Deliver the setup as a tool architecture diagram, followed by configuration files and code snippets for each component. Include a "day one" minimal setup that can be expanded incrementally, so the team gets value immediately.

Be specific to my situation. No generic filler.
