📊

CloudWatch - Monitoring and Observability

Metrics, logs, alarms and dashboards

⏱️ Estimated reading time: 20 minutes

CloudWatch Metrics

CloudWatch Metrics collects time-series data points for monitoring AWS resources and applications.

Key Concepts:
- Metric: Variable to monitor (CPUUtilization, NetworkIn, etc.)
- Namespace: Container for metrics (AWS/EC2, AWS/RDS, Custom)
- Dimension: Metric attribute (InstanceId, DBInstanceIdentifier)
- Timestamp: When the data point was recorded
- Statistic: Aggregation (Average, Sum, Min, Max, SampleCount)

Default Metrics (Basic Monitoring):
- EC2: CPU, disk I/O, network (every 5 minutes, free)
- RDS: DatabaseConnections, CPU, FreeableMemory
- ELB: RequestCount, TargetResponseTime, HealthyHostCount
- Lambda: Invocations, Duration, Errors, Throttles

Detailed Monitoring:
- Metrics every 1 minute (additional cost)
- Finer granularity for faster response
- Lets Auto Scaling react faster to load changes
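
Detailed monitoring is switched on per instance. A minimal sketch, assuming an installed and credentialed AWS CLI; the function name is hypothetical and the instance ID in the usage line is a placeholder:

```shell
# Hypothetical helper: enable 1-minute (detailed) monitoring on one or more instances.
enable_detailed_monitoring() {
  aws ec2 monitor-instances --instance-ids "$@"
}

# Usage (placeholder instance ID):
#   enable_detailed_monitoring i-0123456789abcdef0
```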

Custom Metrics:
- Send your own metrics using PutMetricData API
- Examples: memory (RAM) usage, used disk space, active processes
- Resolution: Standard (1 min) or High-Resolution (1 second)
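
Since memory usage is not a default EC2 metric, a script has to compute the value and publish it. The sketch below assumes a Linux host whose /proc/meminfo has MemTotal and MemAvailable lines, plus an installed and credentialed AWS CLI. Both function names and the MyApp/Backend namespace are illustrative, not part of any AWS tooling.

```shell
# Compute used memory as a percentage from a meminfo-style file.
# Assumes the Linux /proc/meminfo layout (values in kB).
mem_used_percent() {
  meminfo_file="${1:-/proc/meminfo}"
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END {
         if (total > 0) { printf "%.1f\n", (total - avail) * 100 / total }
         else           { exit 1 }
       }' "$meminfo_file"
}

# Hypothetical helper: publish the value as a high-resolution (1-second) custom metric.
publish_memory_metric() {
  instance_id="$1"
  value="$(mem_used_percent)" || return 1
  aws cloudwatch put-metric-data \
    --namespace "MyApp/Backend" \
    --metric-name "MemoryUsed" \
    --value "$value" \
    --unit Percent \
    --storage-resolution 1 \
    --dimensions InstanceId="$instance_id"
}
```

Dropping --storage-resolution 1 publishes the metric at standard (1-minute) resolution instead.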

🎯 Key Points

  • ✓ Default EC2 metrics do NOT include memory (RAM) usage
  • ✓ Custom metrics need the CloudWatch Agent or API calls
  • ✓ Detailed monitoring (1 min) has additional cost
  • ✓ Metrics retained: sub-minute (3h), 1m (15d), 5m (63d), 1h (455d)
  • ✓ High-resolution metrics allow alarms with 10s or 30s periods

💻 Working with CloudWatch metrics

# Send custom metric
aws cloudwatch put-metric-data \
  --namespace "MyApp/Backend" \
  --metric-name "MemoryUsed" \
  --value 85 \
  --unit Percent \
  --dimensions Instance=i-0123456789abcdef0

# Get metric statistics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T23:59:59Z \
  --period 3600 \
  --statistics Average

CloudWatch Alarms

CloudWatch Alarms enable automatic actions based on metrics.

Alarm States:
- OK: Metric is within the defined threshold
- ALARM: Metric breached the threshold (above or below, depending on the comparison operator)
- INSUFFICIENT_DATA: Not enough data to evaluate

Alarm Components:
- Metric: Which metric to monitor
- Threshold: Limit value (e.g., CPU > 80%)
- Evaluation Periods: How many of the most recent periods are evaluated
- Datapoints to Alarm: How many of those periods must breach the threshold to trigger ALARM
- Actions: What to do when state changes
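
The components above map directly onto put-metric-alarm flags. A sketch, assuming an installed and credentialed AWS CLI and an existing SNS topic; the function name and alarm name are hypothetical:

```shell
# Hypothetical helper: ALARM when average CPU exceeds 80% in 3 of the last 5
# five-minute periods, then notify an SNS topic.
create_high_cpu_alarm() {
  instance_id="$1"
  sns_topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "HighCPU-$instance_id" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$instance_id" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 5 \
    --datapoints-to-alarm 3 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$sns_topic_arn"
}

# Usage (placeholder IDs):
#   create_high_cpu_alarm i-0123456789abcdef0 arn:aws:sns:us-east-1:123456789012:ops-alerts
```

A 300-second period matches basic monitoring; a 60-second period only makes sense once detailed monitoring is enabled.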

Available Actions:
- SNS Notification: Send email, SMS, Lambda, etc.
- Auto Scaling Action: Scale ASG
- EC2 Action: Stop, Terminate, Reboot, Recover instance
- Systems Manager Action: Create an OpsItem (OpsCenter) or an incident (Incident Manager)

Composite Alarms:
- Combine multiple alarms with AND/OR/NOT
- Reduces alarm noise
- Example: High CPU AND High Memory = ALARM
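
The "High CPU AND High Memory" example could be expressed as a composite alarm rule. A sketch, assuming two metric alarms named HighCPU and HighMemory already exist (both names, and the function name, are hypothetical):

```shell
# Hypothetical helper: notify only when both child alarms are in ALARM at once.
create_cpu_and_memory_alarm() {
  sns_topic_arn="$1"
  aws cloudwatch put-composite-alarm \
    --alarm-name "HighCPUAndMemory" \
    --alarm-rule 'ALARM("HighCPU") AND ALARM("HighMemory")' \
    --alarm-actions "$sns_topic_arn"
}
```

Leaving actions off the child alarms and attaching them only to the composite alarm is what cuts the notification noise.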

🎯 Key Points

  • ✓ Alarms act on state changes (OK → ALARM → OK)
  • ✓ Evaluation periods and datapoints help avoid false positives
  • ✓ Billing alarms must be created in us-east-1, where billing metrics are published
  • ✓ Test alarms with set-alarm-state to verify actions
  • ✓ Composite alarms reduce unnecessary notifications
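
Testing an alarm's actions with set-alarm-state might look like the sketch below (the function name is hypothetical). The forced state is temporary: the alarm returns to its real state at the next evaluation.

```shell
# Hypothetical helper: force an alarm into ALARM so its actions fire once.
test_alarm_actions() {
  alarm_name="$1"
  aws cloudwatch set-alarm-state \
    --alarm-name "$alarm_name" \
    --state-value ALARM \
    --state-reason "Manual test of alarm actions"
}
```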

CloudWatch Logs

CloudWatch Logs enables centralization of application, system, and AWS service logs.

Hierarchy:
- Log Groups: Log container (e.g., /aws/lambda/my-function)
- Log Streams: Event sequence from one source (e.g., specific instance)
- Log Events: Individual record with timestamp and message

Log Sources:
- SDK/Agent: Applications send logs with SDK or CloudWatch Agent
- Elastic Beanstalk: Automatic application logs
- ECS/Fargate: Container logs
- Lambda: Automatic (console.log, print, etc.)
- VPC Flow Logs: Network traffic
- API Gateway, CloudTrail, Route53

Retention:
- Default: indefinite (forever)
- Configurable: 1 day to 10 years
- Cost per GB stored and GB ingested

Export:
- S3: Batch export (up to 12 hours delay)
- Kinesis Data Firehose: Near real-time streaming
- Lambda subscriptions: Real-time processing
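
A batch export to S3 could be started as sketched below. It assumes the bucket already exists in the same Region and that its bucket policy lets CloudWatch Logs write to it. The function names and the "exports" prefix are hypothetical.

```shell
# create-export-task expects timestamps in milliseconds since the Unix epoch.
to_millis() {
  echo "$(( $1 * 1000 ))"
}

# Hypothetical helper: export the last 24 hours of a log group to an S3 bucket.
export_last_day_to_s3() {
  log_group="$1"
  bucket="$2"
  now="$(date +%s)"
  aws logs create-export-task \
    --task-name "export-$now" \
    --log-group-name "$log_group" \
    --from "$(to_millis $(( now - 86400 )))" \
    --to "$(to_millis "$now")" \
    --destination "$bucket" \
    --destination-prefix "exports"
}

# Usage (placeholder bucket name):
#   export_last_day_to_s3 /my-app/production my-log-archive-bucket
```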

🎯 Key Points

  • ✓ CloudWatch Logs Insights enables SQL-like queries
  • ✓ Metric filters convert logs to CloudWatch metrics
  • ✓ Subscription filters send logs to Kinesis Data Streams, Lambda, or Firehose
  • ✓ Log groups can be encrypted with KMS
  • ✓ CloudWatch Agent unifies custom metrics and logs

💻 CloudWatch Logs management

# Create log group
aws logs create-log-group --log-group-name /my-app/production

# Configure retention (7 days)
aws logs put-retention-policy \
  --log-group-name /my-app/production \
  --retention-in-days 7

# Create metric filter (counts log events containing the term ERROR;
# square brackets would instead declare a space-delimited field pattern)
aws logs put-metric-filter \
  --log-group-name /my-app/production \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=AppErrors,metricNamespace=MyApp,metricValue=1

CloudWatch Logs Insights

CloudWatch Logs Insights is an interactive query feature for searching and analyzing log data stored in CloudWatch Logs.

Features:
- Query language: SQL-like for complex searches
- Visualizations: Time series and bar charts
- Saved queries: Reuse frequent queries
- Multiple log groups: Query across multiple groups simultaneously
- Pay per use: Per GB of data scanned

Common Commands:
- fields: Select fields to display
- filter: Filter logs by condition
- stats: Aggregations (count, avg, sum, min, max)
- sort: Sort results
- limit: Limit number of results

Query Examples:
- Search errors: filter @message like /ERROR/
- Top IPs: stats count(*) as requests by sourceIP | sort requests desc
- Average latency: stats avg(duration) by bin(5m)
- User logs: filter userId = "user123" | fields @timestamp, @message
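
Queries like these can also be run from the CLI, where a query is started and then polled for results. A sketch, assuming an installed and credentialed AWS CLI; the function name and the one-hour window are arbitrary choices:

```shell
# Hypothetical helper: run a Logs Insights query over the last hour and print the results.
run_insights_query() {
  log_group="$1"
  query_string="$2"
  end_time="$(date +%s)"              # start-query takes epoch seconds
  start_time="$(( end_time - 3600 ))"

  query_id="$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start_time" \
    --end-time "$end_time" \
    --query-string "$query_string" \
    --query queryId --output text)" || return 1

  # Poll until the query leaves the Scheduled/Running states.
  while :; do
    status="$(aws logs get-query-results --query-id "$query_id" \
      --query status --output text)" || return 1
    case "$status" in
      Scheduled|Running) sleep 1 ;;
      *) break ;;
    esac
  done

  if [ "$status" != "Complete" ]; then
    echo "query ended with status: $status" >&2
    return 1
  fi
  aws logs get-query-results --query-id "$query_id"
}

# Usage:
#   run_insights_query /my-app/production \
#     'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'
```

Billing is per GB scanned, so a narrower time window also keeps the query cheaper.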

🎯 Key Points

  • ✓ Logs Insights charges per GB scanned, not time
  • ✓ Queries can span up to 50 log groups
  • ✓ Results limited to 10,000 rows
  • ✓ Auto-detects JSON fields in logs
  • ✓ Ideal for troubleshooting and ad-hoc analysis

Amazon EventBridge (CloudWatch Events)

EventBridge is a serverless event bus connecting application data with AWS services.

Key Concepts:
- Event: State change in system (e.g., EC2 instance terminated)
- Event Bus: Channel receiving events (default, custom, partner)
- Rule: Filters events and routes to targets
- Target: Event destination (Lambda, SNS, SQS, Step Functions)

Event Sources:
- AWS Services: EC2, Auto Scaling, CodePipeline, 90+ services
- Schedule: Cron expressions or rate expressions
- Custom Applications: Your app sends events
- SaaS Partners: Datadog, Auth0, Shopify, etc.

Event Patterns:
- Filters based on JSON event content
- Supports prefix matching, numeric matching, exists checks
- Example: {"source": ["aws.ec2"], "detail-type": ["EC2 Instance State-change Notification"]}
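
A pattern can be checked against a sample event before any rule is created. The sketch below uses test-event-pattern and adds a prefix match on the instance ID as an illustration. The function name is hypothetical, and the sample event in the usage line is made up.

```shell
# Hypothetical helper: report whether an event matches a termination pattern
# that also requires the instance ID to start with "i-".
matches_termination_pattern() {
  event_json="$1"
  pattern='{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {
      "state": ["terminated"],
      "instance-id": [{"prefix": "i-"}]
    }
  }'
  aws events test-event-pattern \
    --event-pattern "$pattern" \
    --event "$event_json" \
    --query Result --output text
}

# Usage (the event must carry id, account, source, time, region, resources, detail-type):
#   matches_termination_pattern '{"id":"1","account":"123456789012","source":"aws.ec2","time":"2024-01-01T00:00:00Z","region":"us-east-1","resources":[],"detail-type":"EC2 Instance State-change Notification","detail":{"instance-id":"i-0123456789abcdef0","state":"terminated"}}'
```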

Use Cases:
- React to infrastructure changes
- Scheduled tasks (cloud cron jobs)
- Microservices integration
- Receive and process SaaS events

🎯 Key Points

  • ✓ EventBridge is the evolution of CloudWatch Events (more features)
  • ✓ Can filter events before invoking targets (cost savings)
  • ✓ One rule can have up to 5 targets
  • ✓ Schema Registry automatically discovers event structure
  • ✓ Archive and Replay enable debugging and reprocessing

💻 Create EventBridge rules

# Create rule to detect EC2 termination
aws events put-rule \
  --name DetectEC2Termination \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]}
  }'

# Add Lambda as target
aws events put-targets \
  --rule DetectEC2Termination \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:NotifyTermination"

# Create scheduled rule (daily at 9am UTC)
aws events put-rule \
  --name DailyBackup \
  --schedule-expression "cron(0 9 * * ? *)"
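
When a rule and its Lambda target are created from the CLI rather than the console, the function's resource policy must also allow EventBridge to invoke it. A sketch of that extra step; the function name and statement ID are hypothetical:

```shell
# Hypothetical helper: let a specific EventBridge rule invoke a Lambda function.
allow_eventbridge_to_invoke() {
  function_name="$1"
  rule_arn="$2"
  aws lambda add-permission \
    --function-name "$function_name" \
    --statement-id "AllowEventBridgeInvoke" \
    --action "lambda:InvokeFunction" \
    --principal events.amazonaws.com \
    --source-arn "$rule_arn"
}

# Usage (placeholder account and Region):
#   allow_eventbridge_to_invoke NotifyTermination \
#     arn:aws:events:us-east-1:123456789012:rule/DetectEC2Termination
```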

CloudWatch Agent

CloudWatch Agent is a unified agent that collects OS and application metrics and logs.

Collected System Metrics:
- Memory: Utilization, available, used
- Disk: Used space, inodes
- CPU: Per core, states (user, system, idle)
- Processes: Counts, states
- Network: Connections, errors
- Swap: Utilization

Configuration:
- A JSON file defines what to collect
- Created with the interactive wizard or written by hand
- Can be stored in Systems Manager Parameter Store for reuse
- Can be deployed to multiple instances with Systems Manager
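
A small agent configuration might look like the sketch below, which collects memory, disk, and swap metrics plus one log file. The log file path, log group name, and function names are assumptions for illustration. The parameter name follows the AmazonCloudWatch- prefix that CloudWatchAgentServerPolicy is allowed to read.

```shell
# Hypothetical helper: write a minimal CloudWatch Agent config to the given path.
write_agent_config() {
  config_path="$1"
  # The quoted heredoc delimiter keeps ${aws:InstanceId} literal for the agent to resolve.
  cat > "$config_path" <<'EOF'
{
  "agent": { "metrics_collection_interval": 60 },
  "metrics": {
    "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent", "inodes_free"], "resources": ["/"] },
      "swap": { "measurement": ["swap_used_percent"] }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my-app/app.log",
            "log_group_name": "/my-app/production",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
EOF
}

# Hypothetical helper: store the config in Parameter Store so many instances can fetch it.
store_agent_config_in_ssm() {
  config_path="$1"
  aws ssm put-parameter \
    --name "AmazonCloudWatch-linux" \
    --type String \
    --overwrite \
    --value "file://$config_path"
}
```

On a Linux instance, the agent could then load the stored configuration with: sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c ssm:AmazonCloudWatch-linux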

Two Versions:
- CloudWatch Logs Agent: Logs only (legacy)
- CloudWatch Agent: Metrics + Logs (recommended)

Required Permissions:
- IAM role with CloudWatchAgentServerPolicy
- Attached to EC2 instance or ECS task
- Don't use access keys (security best practice)

🎯 Key Points

  • ✓ Agent needed for memory and disk-space metrics on EC2
  • ✓ StatsD and collectd protocols supported
  • ✓ Can aggregate metrics before sending (cost savings)
  • ✓ JSON-formatted logs are parsed automatically
  • ✓ Use Systems Manager for at-scale deployment