---
name: error-handling-strategy
description: Designs error handling patterns for user-facing errors, logging, alerting, and recovery. Use when standardizing error handling, reducing noisy logs, improving user experience on failures, or debugging production issues.
metadata:
  category: product-building
  author: skillar
  version: "1.0"
---

# Error Handling Strategy

> **Usage:** Copy this skill into Claude → replace [BRACKETS] with your details → get polished output.

## What You Get
A complete error handling architecture covering error classification, user-facing messages, structured logging, alerting rules, and recovery patterns — with code examples ready for implementation.

## Instructions

You are a reliability engineer who has been paged at 3 AM enough times to know that good error handling is the difference between "we caught it before users noticed" and "the CEO is on a war room call." You design error handling that serves three audiences: users who need clarity, developers who need debuggability, and ops teams who need actionable alerts.

Design an error handling strategy for the following application:

- **Application type:** [APP_TYPE — e.g., web app with REST API, background workers, and third-party integrations]
- **Tech stack:** [TECH_STACK — e.g., React frontend, Node.js Express backend, PostgreSQL, Redis]
- **Current error handling:** [CURRENT — e.g., inconsistent try/catch, generic 500 errors, console.log only]
- **Error pain points:** [PAIN_POINTS — e.g., users see cryptic errors, logs are useless for debugging, no alerts]
- **External dependencies:** [DEPENDENCIES — e.g., Stripe API, SendGrid, AWS S3, OpenAI API]
- **User sensitivity:** [SENSITIVITY — e.g., fintech with zero tolerance for data errors, or consumer app where retry is fine]

1. ERROR CLASSIFICATION TAXONOMY
   - Define error categories: validation, authentication, authorization, not found, conflict, external service, internal
   - Map each category to HTTP status code and user-facing behavior
   - Classify errors as retryable vs terminal for each category
   - Define severity levels: critical (data loss risk), error (feature broken), warning (degraded), info (expected, no action needed)
   - Create error code format (e.g., ERR_AUTH_001) for programmatic handling
   - Document which errors should page on-call vs create a ticket vs log silently
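For reference, the taxonomy above might be encoded roughly as follows. This is a minimal TypeScript sketch that assumes the example Node.js stack. The category names (including `FORBIDDEN` for authorization), `TAXONOMY`, `ROUTE_BY_SEVERITY`, and `formatErrorCode` are hypothetical. The specific policies, such as treating 409 as terminal or mapping external failures to 502, are placeholders to adapt rather than prescriptions.

```typescript
// Illustrative taxonomy: category names, statuses, and policies are assumptions to adapt.
type ErrorCategory =
  | "VALIDATION"
  | "AUTH"
  | "FORBIDDEN"
  | "NOT_FOUND"
  | "CONFLICT"
  | "EXTERNAL"
  | "INTERNAL";

type Severity = "critical" | "error" | "warning" | "info";
type AlertRoute = "page" | "ticket" | "log";

interface CategoryPolicy {
  httpStatus: number;
  retryable: boolean;
  defaultSeverity: Severity;
}

// One row per category: HTTP status, retry policy, and default severity.
// "critical" is assigned per error (e.g. a failed ledger write), not per category.
const TAXONOMY: Record<ErrorCategory, CategoryPolicy> = {
  VALIDATION: { httpStatus: 400, retryable: false, defaultSeverity: "info" },
  AUTH: { httpStatus: 401, retryable: false, defaultSeverity: "info" },
  FORBIDDEN: { httpStatus: 403, retryable: false, defaultSeverity: "warning" },
  NOT_FOUND: { httpStatus: 404, retryable: false, defaultSeverity: "info" },
  CONFLICT: { httpStatus: 409, retryable: false, defaultSeverity: "warning" },
  EXTERNAL: { httpStatus: 502, retryable: true, defaultSeverity: "error" },
  INTERNAL: { httpStatus: 500, retryable: false, defaultSeverity: "error" },
};

// Severity decides who hears about it: page on-call, open a ticket, or log only.
const ROUTE_BY_SEVERITY: Record<Severity, AlertRoute> = {
  critical: "page",
  error: "ticket",
  warning: "log",
  info: "log",
};

// Builds codes in the ERR_<CATEGORY>_<NNN> format, e.g. ERR_AUTH_001.
function formatErrorCode(category: ErrorCategory, n: number): string {
  if (!Number.isInteger(n) || n < 1 || n > 999) {
    throw new RangeError(`error number must be 1-999, got ${n}`);
  }
  return `ERR_${category}_${String(n).padStart(3, "0")}`;
}
```

Keeping the policy in one table means the HTTP layer, the retry logic, and the alert router all read the same source of truth instead of each hard-coding its own rules.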

2. USER-FACING ERROR DESIGN
   - Write user-facing error message templates for each category
   - Ensure messages explain what happened in plain language, without exposing technical details
   - Include actionable next steps in every error message
   - Design error response envelope with code, message, and details fields
   - Handle field-level validation errors with specific guidance per field
   - Create friendly fallback messages for unexpected errors
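A minimal TypeScript sketch of such an envelope follows. It shows per-field details and a fallback for unknown errors. The error codes, message wording, `traceId` field, and `toEnvelope` helper are all hypothetical.

```typescript
// Illustrative envelope; codes, copy, and the traceId field are assumptions.
interface FieldError {
  field: string;   // matches the form field name the frontend renders
  message: string; // what to change, not why the validator failed
}

interface ErrorEnvelope {
  error: {
    code: string;           // stable and machine-readable, e.g. ERR_VALIDATION_001
    message: string;        // what happened plus a next step, in plain language
    details?: FieldError[]; // present only for field-level validation failures
    traceId?: string;       // lets support match a user report to server logs
  };
}

const FALLBACK_MESSAGE =
  "Something went wrong on our side. Please try again in a few minutes, and contact support if it keeps happening.";

// User-facing templates keyed by error code (hypothetical codes and wording).
const MESSAGES = new Map<string, string>([
  ["ERR_VALIDATION_001", "Some fields need attention. Fix the highlighted fields and submit again."],
  ["ERR_AUTH_001", "Your session has expired. Please sign in again."],
  ["ERR_NOT_FOUND_001", "We couldn't find that item. It may have been moved or deleted."],
]);

function toEnvelope(code: string, details?: FieldError[], traceId?: string): ErrorEnvelope {
  const template = MESSAGES.get(code);
  return {
    error: {
      // Unknown codes collapse to a generic internal code so nothing leaks to the user.
      code: template !== undefined ? code : "ERR_INTERNAL_001",
      message: template ?? FALLBACK_MESSAGE,
      ...(details && details.length > 0 ? { details } : {}),
      ...(traceId ? { traceId } : {}),
    },
  };
}
```

Using a `Map` rather than a plain object avoids accidental matches on inherited keys such as `toString`.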

3. BACKEND ERROR HANDLING PATTERNS
   - Design a custom error class hierarchy with base error and category subclasses
   - Implement global error handler middleware with consistent formatting
   - Define try/catch patterns for database, external API, and business logic errors
   - Handle async errors and unhandled promise rejections
   - Design circuit breaker pattern for external service failures
   - Implement graceful degradation when non-critical services are down
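The class hierarchy and global handler could look roughly like the following TypeScript sketch. It is deliberately framework-agnostic. The class names, the `internalCause` field, and `toHttpResponse` are assumptions. Only the Express error-middleware signature mentioned in the comment reflects a real API.

```typescript
// Illustrative hierarchy: class names, codes, and status choices are assumptions.
// `message` is always user-safe; internal detail travels in `internalCause` and reaches logs only.
abstract class AppError extends Error {
  abstract readonly httpStatus: number;
  abstract readonly retryable: boolean;
  readonly code: string;
  readonly details?: unknown;
  readonly internalCause?: unknown;

  constructor(code: string, message: string, opts: { details?: unknown; cause?: unknown } = {}) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype); // keeps instanceof working if compiled to ES5
    this.name = new.target.name;
    this.code = code;
    this.details = opts.details;
    this.internalCause = opts.cause;
  }
}

class ValidationError extends AppError {
  readonly httpStatus = 400;
  readonly retryable = false;
}

class NotFoundError extends AppError {
  readonly httpStatus = 404;
  readonly retryable = false;
}

class ExternalServiceError extends AppError {
  readonly httpStatus = 502;
  readonly retryable = true;
}

interface HttpErrorResponse {
  status: number;
  body: { error: { code: string; message: string; details?: unknown } };
}

// Framework-agnostic core of a global error handler. An Express error middleware,
// (err, req, res, next) => { const r = toHttpResponse(err); res.status(r.status).json(r.body); },
// could delegate to it.
function toHttpResponse(err: unknown, log: (entry: object) => void = console.error): HttpErrorResponse {
  if (err instanceof AppError) {
    // 4xx are expected client mistakes; only server-side failures are logged as errors here.
    if (err.httpStatus >= 500) {
      log({
        level: "error",
        code: err.code,
        name: err.name,
        ...(err.internalCause !== undefined ? { cause: String(err.internalCause) } : {}),
      });
    }
    const body: HttpErrorResponse["body"] = { error: { code: err.code, message: err.message } };
    if (err.details !== undefined) body.error.details = err.details;
    return { status: err.httpStatus, body };
  }
  // Anything that is not an AppError is a bug: log everything, reveal nothing.
  log({
    level: "error",
    code: "ERR_INTERNAL_001",
    error: String(err),
    stack: err instanceof Error ? err.stack : undefined,
  });
  return {
    status: 500,
    body: { error: { code: "ERR_INTERNAL_001", message: "Something went wrong on our side. Please try again later." } },
  };
}
```

Putting `httpStatus` and `retryable` on the class lets the handler, retry logic, and circuit breaker branch on error type without string matching.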

4. FRONTEND ERROR HANDLING
   - Design a global API error interceptor with retry logic limited to retryable errors
   - Implement error boundary components for React crash isolation
   - Create toast/notification patterns for transient errors
   - Design full-page error states for critical failures
   - Handle offline mode and network connectivity errors
   - Implement form validation error display patterns
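A minimal sketch of the retry core of an API interceptor follows. It assumes a `fetch`-based client. `HttpError`, `withRetry`, and `fetchJson` are hypothetical names. The set of retryable statuses and the delay values are placeholders. Error boundaries and toasts are left to the React layer.

```typescript
// Illustrative client-side retry wrapper; status codes, delays, and names are assumptions.
class HttpError extends Error {
  readonly status: number;
  constructor(status: number, message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "HttpError";
    this.status = status;
  }
}

interface RetryOptions {
  maxAttempts: number; // total attempts, including the first
  baseDelayMs: number; // delay cap before the first retry
  maxDelayMs: number;  // upper bound on any single delay
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Transient statuses; validation and auth failures (400/401/403/422) are never retried.
const RETRYABLE_STATUSES = new Set([408, 429, 502, 503, 504]);

function isRetryable(err: unknown): boolean {
  if (err instanceof HttpError) return RETRYABLE_STATUSES.has(err.status);
  // fetch() rejects with a TypeError on network failure (offline, DNS failure, connection reset).
  return err instanceof TypeError;
}

const defaultSleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(op: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt >= opts.maxAttempts || !isRetryable(err)) throw err;
      // Exponential backoff with full jitter so clients don't retry in lockstep.
      const cap = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1));
      await sleep(Math.random() * cap);
    }
  }
}

// Turns non-2xx responses into HttpError so the retry policy can inspect the status.
async function fetchJson<T>(url: string, init?: RequestInit): Promise<T> {
  const res = await fetch(url, init);
  if (!res.ok) throw new HttpError(res.status, `HTTP ${res.status} for ${url}`);
  return (await res.json()) as T;
}
```

Only wrap idempotent requests this way: GETs, or writes that carry an idempotency key. Retrying a plain POST can double-charge or double-submit. Treating every `TypeError` as a network failure is a simplification. A stricter client might check `navigator.onLine` or inspect the error further before retrying.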

5. STRUCTURED LOGGING
   - Define log entry schema with required fields (timestamp, level, service, trace_id, error_code)
   - Establish correlation ID propagation from request through all service calls
   - Specify what context to include in each log level (error, warn, info, debug)
   - Define PII scrubbing rules to prevent sensitive data in logs
   - Set log retention policies by environment and severity
   - Design structured log format (JSON) for machine parsing
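As a rough illustration, a per-request JSON logger with PII scrubbing might look like the TypeScript sketch below. The field names follow the schema above. `makeLogger`, `scrub`, the redaction key list, and the email pattern are assumptions to extend for your data. In Node.js, `AsyncLocalStorage` from `node:async_hooks` is one way to make the per-request logger reachable without threading it through every call.

```typescript
// Illustrative structured logger; the redaction list and email pattern are assumptions to extend.
type Level = "error" | "warn" | "info" | "debug";

// Required fields; error_code and other context ride along as extra keys.
interface LogEntry {
  timestamp: string;
  level: Level;
  service: string;
  trace_id: string;
  message: string;
  [extra: string]: unknown;
}

// Keys are compared lowercased with "_" and "-" removed, so "api_key" matches "apikey".
const REDACT_KEYS = new Set(["password", "token", "authorization", "cookie", "apikey", "cardnumber", "ssn"]);
const EMAIL_RE = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

// Recursively masks sensitive keys and email addresses before anything is written.
function scrub(value: unknown, depth = 0): unknown {
  if (depth > 8) return "[TRUNCATED]";
  if (typeof value === "string") return value.replace(EMAIL_RE, "[EMAIL]");
  if (Array.isArray(value)) return value.map((v) => scrub(v, depth + 1));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value)) {
      const normalized = k.toLowerCase().replace(/[_-]/g, "");
      out[k] = REDACT_KEYS.has(normalized) ? "[REDACTED]" : scrub(v, depth + 1);
    }
    return out;
  }
  return value;
}

// One logger per request: trace_id is bound once and stamped on every line.
function makeLogger(service: string, traceId: string, write: (line: string) => void = console.log) {
  const emit = (level: Level, message: string, fields: Record<string, unknown> = {}) => {
    const entry: LogEntry = {
      ...(scrub(fields) as Record<string, unknown>),
      // Required fields go last so caller-supplied fields cannot overwrite them.
      timestamp: new Date().toISOString(),
      level,
      service,
      trace_id: traceId,
      message: scrub(message) as string,
    };
    write(JSON.stringify(entry));
  };
  return {
    error: (msg: string, fields?: Record<string, unknown>) => emit("error", msg, fields),
    warn: (msg: string, fields?: Record<string, unknown>) => emit("warn", msg, fields),
    info: (msg: string, fields?: Record<string, unknown>) => emit("info", msg, fields),
    debug: (msg: string, fields?: Record<string, unknown>) => emit("debug", msg, fields),
  };
}
```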

6. ALERTING AND MONITORING
   - Define alerting rules for each error category and severity level
   - Set error rate thresholds that trigger alerts (absolute count and percentage)
   - Configure alert routing: who gets paged for what, and escalation paths
   - Design error dashboard with key metrics and drill-down capability
   - Implement anomaly detection for unusual error patterns
   - Create runbook templates for the 5 most likely error scenarios
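To make the "absolute count and percentage" rule concrete, here is a small TypeScript sketch of an alert check over a sliding window. `AlertRule`, `SlidingErrorRate`, `shouldAlert`, and all thresholds are hypothetical. In production, this logic usually lives as a query in the monitoring tool rather than in application code.

```typescript
// Illustrative alert evaluation; thresholds, window size, and route names are placeholders.
interface AlertRule {
  name: string;
  windowMs: number;     // how far back to look
  minErrors: number;    // absolute floor: 2 failures out of 4 requests should not page anyone
  maxErrorRate: number; // fraction of requests, e.g. 0.05 for 5%
  route: "page" | "ticket";
}

interface WindowStats {
  total: number;
  errors: number;
  rate: number;
}

// Keeps raw request outcomes for one time window; adequate for a sketch, not for high volume.
class SlidingErrorRate {
  private readonly windowMs: number;
  private events: Array<{ at: number; isError: boolean }> = [];

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  record(at: number, isError: boolean): void {
    this.events.push({ at, isError });
  }

  // Drops outcomes older than the window, then summarizes the rest.
  stats(now: number): WindowStats {
    const cutoff = now - this.windowMs;
    this.events = this.events.filter((e) => e.at > cutoff);
    const total = this.events.length;
    const errors = this.events.filter((e) => e.isError).length;
    return { total, errors, rate: total === 0 ? 0 : errors / total };
  }
}

// Both conditions must hold: the count guards against low-traffic noise, the rate against scale effects.
function shouldAlert(rule: AlertRule, s: WindowStats): boolean {
  return s.errors >= rule.minErrors && s.rate >= rule.maxErrorRate;
}
```

Requiring both conditions is the key design choice. A percentage alone pages on two failures at 3 AM when traffic is near zero. A count alone never fires for a small service and fires constantly for a large one.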

Deliver the strategy as an architecture document first, then provide implementation code examples for the error class hierarchy, middleware, and frontend patterns. Include a decision tree for "I just got an error — what do I do?" that new developers can follow.

Be specific to my situation. No generic filler.
