CloudWatch - Monitoring and Observability
Metrics, logs, alarms and dashboards
Estimated reading time: 20 minutes
CloudWatch Metrics
CloudWatch Metrics collects time-series data points that let you monitor AWS resources and applications.
Key Concepts:
- Metric: Variable to monitor (CPUUtilization, NetworkIn, etc.)
- Namespace: Container for metrics (AWS/EC2, AWS/RDS, Custom)
- Dimension: Metric attribute (InstanceId, DBInstanceIdentifier)
- Timestamp: Time at which the data point was recorded
- Statistic: Aggregation (Average, Sum, Min, Max, SampleCount)
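To make the statistic concept concrete, here is a minimal Python sketch of how the raw data points of one period collapse into the five standard statistics. The function name summarize and the sample values are illustrative assumptions, not part of any AWS SDK.

```python
from statistics import mean

def summarize(values):
    """Collapse the raw data points of one period into the standard statistics."""
    if not values:
        raise ValueError("a period needs at least one data point")
    return {
        "SampleCount": len(values),
        "Sum": sum(values),
        "Average": mean(values),
        "Minimum": min(values),
        "Maximum": max(values),
    }

# Hypothetical CPUUtilization samples (percent) reported within one period.
cpu_samples = [40.0, 60.0, 80.0, 20.0]
stats = summarize(cpu_samples)
print(stats)
```

Requesting a longer period simply means more raw data points feed each call to a function like this one.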
Default Metrics (Basic Monitoring):
- EC2: CPU, disk I/O, network (every 5 minutes, free)
- RDS: DatabaseConnections, CPU, FreeableMemory
- ELB: RequestCount, TargetResponseTime, HealthyHostCount
- Lambda: Invocations, Duration, Errors, Throttles
Detailed Monitoring:
- Metrics every 1 minute (has cost)
- More granularity for fast response
- Lets Auto Scaling react faster to load changes
Custom Metrics:
- Send your own metrics using PutMetricData API
- Examples: RAM memory, used disk space, active processes
- Resolution: Standard (1 min) or High-Resolution (1 second)
Key Points
- Default EC2 metrics do NOT include memory (RAM) usage
- Custom metrics require the CloudWatch Agent or PutMetricData API calls
- Detailed monitoring (1 min) has additional cost
- Metric retention by period: under 1 min (3 hours), 1 min (15 days), 5 min (63 days), 1 hour (455 days)
- High-resolution metrics support alarms with a period of 10 or 30 seconds
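The retention tiers can be read as a lookup from period to lifetime. The following Python sketch encodes exactly the four tiers listed above; the table and function names are hypothetical, and the lookup is a simplification that ignores how CloudWatch rolls older data up into coarser periods.

```python
from datetime import timedelta

# Retention tiers from the list above: (minimum period in seconds, retention).
RETENTION_TIERS = [
    (3600, timedelta(days=455)),
    (300, timedelta(days=63)),
    (60, timedelta(days=15)),
    (1, timedelta(hours=3)),
]

def retention_for_period(period_seconds):
    """Return how long data points with the given period remain available."""
    if period_seconds < 1:
        raise ValueError("period must be at least 1 second")
    for min_period, retention in RETENTION_TIERS:
        if period_seconds >= min_period:
            return retention

# A high-resolution (10 s) metric versus a standard 1-minute metric.
print(retention_for_period(10))
print(retention_for_period(60))
```

In practice this means a high-resolution data point is only queryable at full resolution for a few hours, so dashboards that look back further need a longer period.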
Working with CloudWatch metrics
# Send custom metric
aws cloudwatch put-metric-data \
  --namespace "MyApp/Backend" \
  --metric-name "MemoryUsed" \
  --value 85 \
  --unit Percent \
  --dimensions Instance=i-0123456789abcdef0
# Get metric statistics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T23:59:59Z \
  --period 3600 \
  --statistics Average
CloudWatch Alarms
CloudWatch Alarms watch a metric and trigger automatic actions when it breaches a threshold.
Alarm States:
- OK: Metric is within the defined threshold
- ALARM: Metric has breached the threshold
- INSUFFICIENT_DATA: Not enough data to evaluate
Alarm Components:
- Metric: Which metric to monitor
- Threshold: Limit value (e.g., CPU > 80%)
- Evaluation Periods: Number of most recent periods to evaluate (N)
- Datapoints to Alarm: How many of those periods must breach the threshold to enter ALARM (M out of N)
- Actions: What to do when state changes
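The interplay between evaluation periods and datapoints to alarm is easier to see in code. This is a simplified Python sketch for a greater-than threshold; the function name evaluate_alarm is an assumption, and the sketch deliberately ignores the configurable treatment of missing data that real alarms offer.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for a greater-than threshold.

    datapoints: most recent values, oldest first; None marks a missing period.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
    # Only the last N periods take part in the evaluation.
    window = datapoints[-evaluation_periods:]
    present = [v for v in window if v is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for v in present if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# Hypothetical CPU readings (percent), one per period, oldest first.
cpu = [70, 85, 90, 60, 95]
print(evaluate_alarm(cpu, 80, 5, 3))  # 3 of the last 5 points exceed 80
print(evaluate_alarm(cpu, 80, 3, 3))  # only 2 of the last 3 points exceed 80
```

Requiring every period in the window to breach (3 out of 3) suppresses the alarm that a looser 3 out of 5 setting would raise on the same data.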
Available Actions:
- SNS Notification: Publish to an SNS topic (email, SMS, Lambda subscribers, etc.)
- Auto Scaling Action: Scale ASG
- EC2 Action: Stop, Terminate, Reboot, Recover instance
- Systems Manager Action: Execute automation runbook
Composite Alarms:
- Combine multiple alarms with AND/OR/NOT
- Reduces alarm noise
- Example: High CPU AND High Memory = ALARM
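A composite alarm is a boolean expression over the states of other alarms. The Python sketch below models that idea with nested tuples; the tuple representation, the function name in_alarm, and the alarm names are assumptions made for illustration (real composite alarms take a rule string rather than tuples).

```python
def in_alarm(rule, states):
    """Evaluate a composite rule against a mapping of alarm name -> state.

    rule is either an alarm name or a tuple: ("AND", a, b, ...),
    ("OR", a, b, ...), or ("NOT", a).
    """
    if isinstance(rule, str):
        return states[rule] == "ALARM"
    op, *operands = rule
    if op == "AND":
        return all(in_alarm(r, states) for r in operands)
    if op == "OR":
        return any(in_alarm(r, states) for r in operands)
    if op == "NOT":
        (operand,) = operands
        return not in_alarm(operand, states)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical child alarms.
rule = ("AND", "HighCPU", "HighMemory")
print(in_alarm(rule, {"HighCPU": "ALARM", "HighMemory": "OK"}))     # False
print(in_alarm(rule, {"HighCPU": "ALARM", "HighMemory": "ALARM"}))  # True
```

Only the second case would notify anyone, which is how composite alarms cut noise from a single metric spiking on its own.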
Key Points
- Alarm actions fire on state changes (OK → ALARM → OK)
- Evaluation periods and datapoints to alarm help avoid false positives
- Billing alarms can only be created in us-east-1, where billing metrics are stored
- Test alarms with set-alarm-state to verify actions
- Composite alarms reduce unnecessary notifications
CloudWatch Logs
CloudWatch Logs centralizes logs from applications, operating systems, and AWS services.
Hierarchy:
- Log Groups: Log container (e.g., /aws/lambda/my-function)
- Log Streams: Event sequence from one source (e.g., specific instance)
- Log Events: Individual record with timestamp and message
Log Sources:
- SDK/Agent: Applications send logs with SDK or CloudWatch Agent
- Elastic Beanstalk: Automatic application logs
- ECS/Fargate: Container logs
- Lambda: Automatic (console.log, print, etc.)
- VPC Flow Logs: Network traffic
- Other AWS services: API Gateway, CloudTrail, Route 53
Retention:
- Default: indefinite (forever)
- Configurable: 1 day to 10 years
- Cost per GB stored and GB ingested
Export:
- S3: Batch export (up to 12 hours delay)
- Kinesis Data Firehose: Near real-time streaming
- Lambda subscriptions: Real-time processing
Key Points
- CloudWatch Logs Insights enables SQL-like queries
- Metric filters convert logs to CloudWatch metrics
- Subscription filters send logs to Kinesis/Lambda/Firehose
- Log groups can have KMS encryption
- CloudWatch Agent unifies custom metrics and logs
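Conceptually, a metric filter scans incoming log events and emits a metric value for each match. The Python sketch below imitates that with a plain substring test and per-minute buckets; the function name, the event shape, and the substring matching are simplifying assumptions, since real filter patterns have their own syntax.

```python
from collections import Counter

def metric_from_logs(log_events, term, metric_value=1):
    """Emit metric_value per matching event, summed into one-minute buckets.

    log_events: dicts with 'timestamp' (epoch milliseconds) and 'message'.
    Returns a dict mapping bucket start (epoch milliseconds) to the summed value.
    """
    per_minute = Counter()
    for event in log_events:
        if term in event["message"]:
            bucket_start = event["timestamp"] // 60_000 * 60_000
            per_minute[bucket_start] += metric_value
    return dict(per_minute)

# Hypothetical log events from one log stream.
events = [
    {"timestamp": 0, "message": "ERROR db timeout"},
    {"timestamp": 30_000, "message": "INFO request ok"},
    {"timestamp": 45_000, "message": "ERROR upstream 502"},
    {"timestamp": 61_000, "message": "ERROR db timeout"},
]
error_counts = metric_from_logs(events, "ERROR")
print(error_counts)
```

The resulting per-minute counts behave like any other metric, so an alarm can watch them.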
CloudWatch Logs management
# Create log group
aws logs create-log-group --log-group-name /my-app/production
# Configure retention (7 days)
aws logs put-retention-policy \
  --log-group-name /my-app/production \
  --retention-in-days 7
# Create metric filter that counts events containing the term ERROR
# (a bracketed pattern such as "[ERROR]" would be parsed as a space-delimited field pattern)
aws logs put-metric-filter \
  --log-group-name /my-app/production \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=AppErrors,metricNamespace=MyApp,metricValue=1
CloudWatch Logs Insights
CloudWatch Logs Insights is an interactive query feature for searching and analyzing log data stored in CloudWatch Logs.
Features:
- Query language: SQL-like for complex searches
- Visualizations: Time series and bar charts
- Saved queries: Reuse frequent queries
- Multiple log groups: Query across multiple groups simultaneously
- Pay per use: Per GB of data scanned
Common Commands:
- fields: Select fields to display
- filter: Filter logs by condition
- stats: Aggregations (count, avg, sum, min, max)
- sort: Sort results
- limit: Limit number of results
Query Examples:
- Search errors:
  filter @message like /ERROR/
- Top IPs:
  stats count(*) as hits by sourceIP | sort hits desc
- Average latency:
  stats avg(duration) by bin(5m)
- User logs:
  filter userId = "user123" | fields @timestamp, @message
Key Points
- Logs Insights charges per GB of data scanned, not per query time
- A single query can span multiple log groups (up to 50)
- Results limited to 10,000 rows
- Auto-detects JSON fields in logs
- Ideal for troubleshooting and ad-hoc analysis
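For readers more familiar with code than with the query language, the "Top IPs" and "Average latency" examples correspond roughly to the aggregations below. This is a Python sketch over in-memory records, not how Logs Insights is invoked; the function names and record fields are assumptions that mirror the field names used in the queries.

```python
from collections import Counter, defaultdict

def top_source_ips(records, limit=10):
    """Count records per sourceIP and return the most frequent first."""
    counts = Counter(r["sourceIP"] for r in records if "sourceIP" in r)
    return counts.most_common(limit)

def avg_duration_by_bin(records, bin_seconds=300):
    """Average 'duration' per time bin; timestamps are epoch seconds."""
    totals = defaultdict(lambda: [0.0, 0])  # bin start -> [sum, count]
    for r in records:
        bin_start = r["timestamp"] - r["timestamp"] % bin_seconds
        totals[bin_start][0] += r["duration"]
        totals[bin_start][1] += 1
    return {start: total / count for start, (total, count) in sorted(totals.items())}

# Hypothetical parsed log records.
requests = [
    {"sourceIP": "10.0.0.1", "timestamp": 0, "duration": 100},
    {"sourceIP": "10.0.0.2", "timestamp": 120, "duration": 300},
    {"sourceIP": "10.0.0.1", "timestamp": 310, "duration": 50},
]
print(top_source_ips(requests))
print(avg_duration_by_bin(requests))
```

The stats command plays the role of the grouping step, and bin(5m) corresponds to rounding each timestamp down to a 300-second boundary.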
Amazon EventBridge (CloudWatch Events)
EventBridge is a serverless event bus connecting application data with AWS services.
Key Concepts:
- Event: State change in system (e.g., EC2 instance terminated)
- Event Bus: Channel receiving events (default, custom, partner)
- Rule: Filters events and routes to targets
- Target: Event destination (Lambda, SNS, SQS, Step Functions)
Event Sources:
- AWS Services: EC2, Auto Scaling, CodePipeline, 90+ services
- Schedule: Cron expressions or rate expressions
- Custom Applications: Your app sends events
- SaaS Partners: Datadog, Auth0, Shopify, etc.
Event Patterns:
- Filters based on JSON event content
- Supports prefix matching, numeric matching, exists checks
- Example:
  {"source": ["aws.ec2"], "detail-type": ["EC2 Instance State-change Notification"]}
Use Cases:
- React to infrastructure changes
- Scheduled tasks (cloud cron jobs)
- Microservices integration
- Receive and process SaaS events
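Event pattern matching can be pictured as a recursive comparison between the pattern and the event. The Python sketch below handles only exact-value lists and nested objects; prefix, numeric, and exists matching are left out, and the function name matches and the sample event are assumptions for illustration.

```python
def matches(pattern, event):
    """Return True if the event satisfies every field named in the pattern.

    Each pattern value is either a list of accepted values or a nested
    pattern (dict). Fields absent from the pattern are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]},
}
# Hypothetical event, trimmed to the fields that matter here.
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "terminated"},
}
print(matches(pattern, event))  # True
```

Because unmatched events never reach a target, narrowing the pattern (for example by adding the detail.state field) directly reduces target invocations.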
Key Points
- EventBridge is the successor to CloudWatch Events and adds more features
- Rules filter events before invoking targets, which avoids paying for unnecessary invocations
- One rule can have up to 5 targets
- Schema Registry can automatically discover event structure
- Archive and Replay enable debugging and reprocessing
Create EventBridge rules
# Create rule to detect EC2 termination
aws events put-rule \
  --name DetectEC2Termination \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]}
  }'
# Add Lambda as target (the function also needs a resource-based permission
# that lets events.amazonaws.com invoke it)
aws events put-targets \
  --rule DetectEC2Termination \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:NotifyTermination"
# Create scheduled rule (daily at 9am UTC)
aws events put-rule \
  --name DailyBackup \
  --schedule-expression "cron(0 9 * * ? *)"
CloudWatch Agent
CloudWatch Agent is a unified agent that collects OS and application metrics and logs.
Collected System Metrics:
- Memory: Utilization, available, used
- Disk: Used space, inodes
- CPU: Per core, states (user, system, idle)
- Processes: Counts, states
- Network: Connections, errors
- Swap: Utilization
Configuration:
- JSON file defines what to collect
- Created with the interactive wizard or written manually
- Can be stored in Systems Manager Parameter Store
- Can be deployed to multiple instances with Systems Manager
Two Versions:
- CloudWatch Logs Agent: Logs only (legacy)
- CloudWatch Agent: Metrics + Logs (recommended)
Required Permissions:
- IAM role with CloudWatchAgentServerPolicy
- Attached to EC2 instance or ECS task
- Don't use access keys (security best practice)
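As an illustration of what the JSON configuration might contain, the Python sketch below builds a small config that collects memory and disk usage plus one log file, then serializes it. The key names follow the agent configuration schema as far as known here and should be checked against the official reference; the log file path and the variable names are placeholders.

```python
import json

# Hypothetical minimal agent configuration; path and group name are placeholders.
agent_config = {
    "agent": {"metrics_collection_interval": 60},
    "metrics": {
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        },
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/my-app/app.log",
                        "log_group_name": "/my-app/production",
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

# Serialize to the JSON text that would be saved to a file or Parameter Store.
config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

Generating the file from code like this is optional; the same JSON can be produced by the wizard or written by hand.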
Key Points
- Agent needed for memory and disk metrics in EC2
- StatsD and collectd protocols supported
- Can aggregate metrics before sending (cost savings)
- Logs automatically parsed if JSON format
- Use Systems Manager for at-scale deployment