Analytics - Redshift, Athena, QuickSight
Redshift, Athena, QuickSight and related analytics services
⏱️ Estimated reading time: 20 minutes
Amazon Redshift - Data Warehouse
Amazon Redshift is a fully managed, petabyte-scale data warehouse based on PostgreSQL, optimized for OLAP (analytics).
Features:
- Columnar storage: Optimized for analytical queries
- Massive compression: Reduces storage and I/O
- MPP (Massively Parallel Processing): Distributes queries across multiple nodes
- Up to 10x better performance than traditional data warehouses
- Cost: starts at about $0.25/hour, roughly $1,000 per TB per year at scale (a fraction of the cost of traditional on-premises data warehouses)
Architecture:
- Leader Node: Plans queries, coordinates execution
- Compute Nodes: Execute queries and store data
- Node Slices: Memory and disk partitions in compute node
Node Types:
- RA3 (recommended): Independent compute and storage, S3-backed
- DC2 (Dense Compute): SSD, better for performance
- DS2 (Dense Storage): HDD, more capacity, more economical
Redshift Serverless:
- No manual cluster provisioning
- Automatic auto-scaling
- Pay per RPU (Redshift Processing Units) used
🎯 Key Points
- ✅ Redshift is OLAP (analytics), NOT for OLTP (transactional)
- ✅ Bulk data loading more efficient than individual inserts
- ✅ Snapshots are incremental, can be copied to other regions
- ✅ Enhanced VPC Routing forces traffic through VPC (more secure)
- ✅ Spectrum enables S3 queries without loading to Redshift
💻 Redshift operations
-- Create table with KEY distribution
CREATE TABLE sales (
sale_id INT,
product_id INT,
date DATE,
quantity INT,
price DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (product_id)
SORTKEY (date);
-- Load data from S3
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;
-- Query with Redshift Spectrum (data in S3)
SELECT product_id, SUM(quantity)
FROM spectrum.external_sales
WHERE date >= '2024-01-01'
GROUP BY product_id;
Redshift Performance Optimization
Redshift offers multiple techniques to optimize queries and reduce costs.
Distribution Styles:
- AUTO: Redshift chooses the style automatically (default for new tables)
- EVEN: Round-robin distribution across slices
- KEY: Distribute rows by column value (co-location)
- ALL: Copy full table to all nodes (small tables)
Sort Keys:
- Compound: Multi-column order (e.g., date, region)
- Interleaved: Optimizes filters on any key column
- Accelerates queries with WHERE, JOIN, ORDER BY on sorted columns
Compression:
- Automatic encoding during COPY
- Types: AZ64, LZO, Runlength, Delta, Mostly
- Analyze with ANALYZE COMPRESSION
Best Practices:
- Use COPY instead of INSERT for bulk loading
- Run VACUUM to reorganize data
- ANALYZE updates statistics for query planner
- Workload Management (WLM) to prioritize queries
- Materialized views for frequent queries
🎯 Key Points
- ✅ DISTKEY should be column frequently used in JOINs
- ✅ SORTKEY should be column in WHERE and ORDER BY
- ✅ VACUUM reclaims space from deleted/updated rows
- ✅ Deep Copy (CREATE TABLE AS) can be faster than VACUUM
- ✅ Concurrency Scaling auto-scales for query peaks
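💻 Redshift tuning operations
A minimal sketch of the tuning commands discussed above, reusing the sales table from the earlier example; the dimension table, column, and view names are illustrative.
-- Small lookup table replicated to every node (DISTSTYLE ALL suits small, frequently joined tables)
CREATE TABLE dim_product (
    product_id INT,
    category VARCHAR(50)
)
DISTSTYLE ALL;
-- Suggest column encodings based on a sample of the table's data
ANALYZE COMPRESSION sales;
-- Reclaim space from deleted/updated rows and re-sort
VACUUM sales;
-- Refresh statistics used by the query planner
ANALYZE sales;
-- Materialized view for a frequent aggregation
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT date, SUM(price * quantity) AS revenue
FROM sales
GROUP BY date;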
Amazon Athena - Serverless Analysis
Amazon Athena is a serverless interactive query service to analyze S3 data using standard SQL.
Features:
- Serverless: No infrastructure to provision
- Standard SQL: Based on Presto, supports ANSI SQL
- Pay per query: $5 per TB of data scanned
- S3 Integration: Reads data directly from S3
- Multiple formats: CSV, JSON, Parquet, ORC, Avro
Use Cases:
- Ad-hoc log analysis (VPC Flow Logs, CloudTrail, ALB logs)
- Business intelligence with QuickSight
- Data lake analysis in S3
- Queries on data without prior ETL
Cost Optimization:
- Use columnar formats: Parquet/ORC (80-90% less data scanned; see the CTAS sketch after this list)
- Compress data: GZIP or SNAPPY reduces the bytes scanned
- Partition data: by date, region, etc. (only the needed partitions are scanned)
- Use Glue Crawlers to automatically catalog data
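As a concrete illustration of the columnar-format and partitioning points above, here is a minimal CTAS sketch; it assumes the alb_logs table defined later in this section, and the output bucket path and table name are illustrative.
-- Convert raw, row-oriented log data into partitioned, SNAPPY-compressed Parquet
-- so that later queries scan far fewer bytes
CREATE TABLE alb_logs_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://my-bucket/alb-logs-parquet/',
    partitioned_by = ARRAY['year', 'month']
) AS
SELECT client_ip, target_status_code, year, month
FROM alb_logs;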
Federated Query:
- Query data in relational, NoSQL, custom data sources
- Uses Lambda Data Source Connectors
- Join S3 data with RDS, DynamoDB, on-premises databases
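A hedged sketch of what a federated query can look like, assuming a MySQL connector has already been deployed and registered as a catalog named mysql; the catalog, database, table, and column names are hypothetical.
-- Query a table living in RDS MySQL directly from Athena
-- through a Lambda data source connector
SELECT customer_id, COUNT(*) AS orders
FROM "mysql"."sales_db"."orders"
WHERE order_date >= DATE '2024-01-01'
GROUP BY customer_id;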
🎯 Key Points
- ✅ Athena charges for data SCANNED, not data returned
- ✅ Parquet/ORC dramatically reduces costs vs CSV/JSON
- ✅ Partitioning is key for efficient queries
- ✅ Glue Data Catalog stores table metadata
- ✅ CTAS (Create Table As Select) creates optimized new tables
💻 Athena queries with partitions
-- Create external table pointing to S3
CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
`timestamp` string,
elb_name string,
client_ip string,
target_ip string,
request_processing_time double,
target_status_code int
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)')
LOCATION 's3://my-bucket/alb-logs/';
-- Add partitions
ALTER TABLE alb_logs ADD
PARTITION (year='2024', month='01', day='15')
LOCATION 's3://my-bucket/alb-logs/2024/01/15/';
-- Optimized query with partitions
SELECT client_ip, COUNT(*) as requests
FROM alb_logs
WHERE year='2024' AND month='01' AND target_status_code >= 500
GROUP BY client_ip
ORDER BY requests DESC
LIMIT 10;
Amazon QuickSight - Business Intelligence
Amazon QuickSight is a serverless business intelligence service to create interactive dashboards and visualizations.
Features:
- Serverless: Automatic scaling, no infrastructure
- Machine Learning: Anomaly detection, forecasting
- Embeddable: Integrate dashboards in applications
- In-memory engine (SPICE): Fast cached queries
- Collaborative: Share analysis and dashboards
Data Sources:
- AWS: RDS, Aurora, Redshift, Athena, S3, OpenSearch
- SaaS: Salesforce, Jira, ServiceNow
- Databases: MySQL, PostgreSQL, SQL Server, Oracle (on-prem)
- Files: Excel, CSV, JSON
SPICE (Super-fast, Parallel, In-memory Calculation Engine):
- Imports data to in-memory engine
- Ultra-fast queries without touching source
- 10GB per user included
- Programmable automatic refresh
ML Insights:
- Anomaly detection: Automatically detects outliers
- Forecasting: Predictions based on historical data
- Auto-narratives: Generates textual data descriptions
🎯 Key Points
- ✅ QuickSight is serverless alternative to Tableau/Power BI
- ✅ Pricing per user/session, not infrastructure
- ✅ SPICE accelerates queries but has capacity limit
- ✅ Row-Level Security (RLS) controls what data each user sees
- ✅ Can connect to VPC with ENI for private data
Redshift vs Athena - When to use each
Both services analyze data, but have different use cases:
Use Redshift when:
- Complex and frequent queries on same data
- Need complex JOINs and sub-queries
- Predictable workloads with constant use
- Need consistent performance and low latency
- Structured and well-defined data
- Traditional OLAP workloads
Use Athena when:
- Ad-hoc and intermittent analysis
- Data already in S3 (logs, exports)
- Don't want to manage infrastructure
- Simple to medium queries
- Data lake explorations
- Variable costs and occasional queries
Comparison:
- Cost: Athena pay-per-query, Redshift pay-per-hour (cluster)
- Setup: Athena zero setup, Redshift requires provisioning
- Performance: Redshift faster for repetitive queries
- Scaling: Athena scales automatically, Redshift requires a cluster resize (or Concurrency Scaling / Serverless)
- Use cases: Redshift = traditional DW, Athena = flexible analysis
🎯 Key Points
- ✅ Athena better for exploration and occasional analysis
- ✅ Redshift better for predictable and constant analytics
- ✅ Can use together: Athena for S3, Redshift for DW
- ✅ Redshift Spectrum enables S3 queries from Redshift
- ✅ QuickSight can connect to both
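💻 Redshift Spectrum setup
A minimal sketch of how the spectrum.external_sales table queried earlier could be defined; the Glue database name, IAM role ARN, and S3 path are illustrative.
-- Map a Glue Data Catalog database into Redshift as an external schema
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- External table whose data stays in S3 and is queried via Spectrum, never loaded into the cluster
CREATE EXTERNAL TABLE spectrum.external_sales (
    product_id INT,
    quantity INT,
    date DATE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/external-sales/';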