Analytics - Redshift, Athena, QuickSight
Redshift, Athena, QuickSight and related analytics services
⏱️ Estimated reading time: 20 minutes
Amazon Redshift - Data Warehouse
Amazon Redshift is a fully managed, petabyte-scale data warehouse based on PostgreSQL, optimized for OLAP (analytics).
Features:
- Columnar storage: Optimized for analytical queries
- Massive compression: Reduces storage and I/O
- MPP (Massively Parallel Processing): Distributes queries across multiple nodes
- Up to 10x better performance than traditional data warehouses
- Cost: starts at about $0.25/hour, roughly $1,000 per TB per year at scale (a fraction of the cost of traditional on-premises data warehouses)
Architecture:
- Leader Node: Plans queries, coordinates execution
- Compute Nodes: Execute queries and store data
- Node Slices: Memory and disk partitions in compute node
Node Types:
- RA3 (recommended): Independent compute and storage, S3-backed
- DC2 (Dense Compute): SSD, better for performance
- DS2 (Dense Storage): HDD, more capacity, more economical
Redshift Serverless:
- No manual cluster provisioning
- Automatic auto-scaling
- Pay per RPU (Redshift Processing Units) used
🎯 Key Points
- ✅ Redshift is OLAP (analytics), NOT for OLTP (transactional)
- ✅ Bulk data loading more efficient than individual inserts
- ✅ Snapshots are incremental, can be copied to other regions
- ✅ Enhanced VPC Routing forces traffic through VPC (more secure)
- ✅ Spectrum enables S3 queries without loading to Redshift
💻 Redshift operations
-- Create table with KEY distribution
CREATE TABLE sales (
sale_id INT,
product_id INT,
date DATE,
quantity INT,
price DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (product_id)
SORTKEY (date);
-- Load data from S3
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;
-- Query with Redshift Spectrum (data in S3)
SELECT product_id, SUM(quantity)
FROM spectrum.external_sales
WHERE date >= '2024-01-01'
GROUP BY product_id;
Redshift Performance Optimization
Redshift offers multiple techniques to optimize queries and reduce costs.
Distribution Styles:
- AUTO: Redshift chooses the style automatically (default for new tables)
- EVEN: Round-robin distribution across slices
- KEY: Distribute rows by column value (co-location)
- ALL: Copy full table to all nodes (small tables)
Sort Keys:
- Compound: Multi-column order (e.g., date, region)
- Interleaved: Optimizes filters on any key column
- Accelerates queries with WHERE, JOIN, ORDER BY on sorted columns
Compression:
- Automatic encoding during COPY
- Types: AZ64, LZO, Runlength, Delta, Mostly
- Analyze with ANALYZE COMPRESSION
Best Practices:
- Use COPY instead of INSERT for bulk loading
- Run VACUUM to reorganize data
- ANALYZE updates statistics for query planner
- Workload Management (WLM) to prioritize queries
- Materialized views for frequent queries
🎯 Key Points
- ✅ DISTKEY should be column frequently used in JOINs
- ✅ SORTKEY should be column in WHERE and ORDER BY
- ✅ VACUUM reclaims space from deleted/updated rows
- ✅ Deep Copy (CREATE TABLE AS) can be faster than VACUUM
- ✅ Concurrency Scaling auto-scales for query peaks
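💻 Redshift tuning operations
A minimal sketch of the tuning commands discussed above, reusing the sales table from the earlier example; the dimension table, column, and view names are illustrative.
-- Small lookup table replicated to every node (DISTSTYLE ALL suits small, frequently joined tables)
CREATE TABLE dim_product (
    product_id INT,
    category VARCHAR(50)
)
DISTSTYLE ALL;
-- Suggest column encodings based on a sample of the table's data
ANALYZE COMPRESSION sales;
-- Reclaim space from deleted/updated rows and re-sort
VACUUM sales;
-- Refresh statistics used by the query planner
ANALYZE sales;
-- Materialized view for a frequent aggregation
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT date, SUM(price * quantity) AS revenue
FROM sales
GROUP BY date;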
Amazon Athena - Serverless Analysis
Amazon Athena is a serverless interactive query service to analyze S3 data using standard SQL.
Features:
- Serverless: No infrastructure to provision
- Standard SQL: Based on Presto, supports ANSI SQL
- Pay per query: $5 per TB of data scanned
- S3 Integration: Reads data directly from S3
- Multiple formats: CSV, JSON, Parquet, ORC, Avro
Use Cases:
- Ad-hoc log analysis (VPC Flow Logs, CloudTrail, ALB logs)
- Business intelligence with QuickSight
- Data lake analysis in S3
- Queries on data without prior ETL
Cost Optimization:
- Use columnar formats: Parquet/ORC (80-90% less data scanned; see the CTAS sketch after this list)
- Compress data: GZIP or SNAPPY reduces the bytes scanned
- Partition data: by date, region, etc. (only the needed partitions are scanned)
- Use Glue Crawlers to automatically catalog data
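As a concrete illustration of the columnar-format and partitioning points above, here is a minimal CTAS sketch; it assumes the alb_logs table defined later in this section, and the output bucket path and table name are illustrative.
-- Convert raw, row-oriented log data into partitioned, SNAPPY-compressed Parquet
-- so that later queries scan far fewer bytes
CREATE TABLE alb_logs_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://my-bucket/alb-logs-parquet/',
    partitioned_by = ARRAY['year', 'month']
) AS
SELECT client_ip, target_status_code, year, month
FROM alb_logs;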
Federated Query:
- Query data in relational, NoSQL, custom data sources
- Uses Lambda Data Source Connectors
- Join S3 data with RDS, DynamoDB, on-premises databases
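A hedged sketch of what a federated query can look like, assuming a MySQL connector has already been deployed and registered as a catalog named mysql; the catalog, database, table, and column names are hypothetical.
-- Query a table living in RDS MySQL directly from Athena
-- through a Lambda data source connector
SELECT customer_id, COUNT(*) AS orders
FROM "mysql"."sales_db"."orders"
WHERE order_date >= DATE '2024-01-01'
GROUP BY customer_id;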
🎯 Key Points
- ✅ Athena charges for data SCANNED, not data returned
- ✅ Parquet/ORC dramatically reduces costs vs CSV/JSON
- ✅ Partitioning is key for efficient queries
- ✅ Glue Data Catalog stores table metadata
- ✅ CTAS (Create Table As Select) creates optimized new tables
💻 Athena queries with partitions
-- Create external table pointing to S3
CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
`timestamp` string,
elb_name string,
client_ip string,
target_ip string,
request_processing_time double,
target_status_code int
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)')
LOCATION 's3://my-bucket/alb-logs/';
-- Add partitions
ALTER TABLE alb_logs ADD
PARTITION (year='2024', month='01', day='15')
LOCATION 's3://my-bucket/alb-logs/2024/01/15/';
-- Optimized query with partitions
SELECT client_ip, COUNT(*) as requests
FROM alb_logs
WHERE year='2024' AND month='01' AND target_status_code >= 500
GROUP BY client_ip
ORDER BY requests DESC
LIMIT 10;
Amazon QuickSight - Business Intelligence
Amazon QuickSight is a serverless business intelligence service to create interactive dashboards and visualizations.
Features:
- Serverless: Automatic scaling, no infrastructure
- Machine Learning: Anomaly detection, forecasting
- Embeddable: Integrate dashboards in applications
- In-memory engine (SPICE): Fast cached queries
- Collaborative: Share analysis and dashboards
Data Sources:
- AWS: RDS, Aurora, Redshift, Athena, S3, OpenSearch
- SaaS: Salesforce, Jira, ServiceNow
- Databases: MySQL, PostgreSQL, SQL Server, Oracle (on-prem)
- Files: Excel, CSV, JSON
SPICE (Super-fast, Parallel, In-memory Calculation Engine):
- Imports data to in-memory engine
- Ultra-fast queries without touching source
- 10GB per user included
- Programmable automatic refresh
ML Insights:
- Anomaly detection: Automatically detects outliers
- Forecasting: Predictions based on historical data
- Auto-narratives: Generates textual data descriptions
🎯 Key Points
- ✅ QuickSight is serverless alternative to Tableau/Power BI
- ✅ Pricing per user/session, not infrastructure
- ✅ SPICE accelerates queries but has capacity limit
- ✅ Row-Level Security (RLS) controls what data each user sees
- ✅ Can connect to VPC with ENI for private data
Redshift vs Athena - When to use each
Both services analyze data, but have different use cases:
Use Redshift when:
- Complex and frequent queries on same data
- Need complex JOINs and sub-queries
- Predictable workloads with constant use
- Need consistent performance and low latency
- Structured and well-defined data
- Traditional OLAP workloads
Use Athena when:
- Ad-hoc and intermittent analysis
- Data already in S3 (logs, exports)
- Don't want to manage infrastructure
- Simple to medium queries
- Data lake explorations
- Variable costs and occasional queries
Comparison:
- Cost: Athena pay-per-query, Redshift pay-per-hour (cluster)
- Setup: Athena zero setup, Redshift requires provisioning
- Performance: Redshift faster for repetitive queries
- Scaling: Athena scales automatically, Redshift requires a cluster resize (or Concurrency Scaling / Serverless)
- Use cases: Redshift = traditional DW, Athena = flexible analysis
🎯 Key Points
- ✅ Athena better for exploration and occasional analysis
- ✅ Redshift better for predictable and constant analytics
- ✅ Can use together: Athena for S3, Redshift for DW
- ✅ Redshift Spectrum enables S3 queries from Redshift
- ✅ QuickSight can connect to both
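💻 Redshift Spectrum setup
A minimal sketch of how the spectrum.external_sales table queried earlier could be defined; the Glue database name, IAM role ARN, and S3 path are illustrative.
-- Map a Glue Data Catalog database into Redshift as an external schema
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- External table whose data stays in S3 and is queried via Spectrum, never loaded into the cluster
CREATE EXTERNAL TABLE spectrum.external_sales (
    product_id INT,
    quantity INT,
    date DATE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/external-sales/';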