10 hacks to get more out of Amazon Redshift
Whether you’re using Amazon Redshift to store data, move data, speed up your database environment, or run big data analytics, there are ten things you need to know to get the most out of it. Are you missing out on any of these secrets?
#1: Precompute results with materialized views
For repeated or predictable queries, materialised views can give you a big performance boost. Applications can quickly query the data in the view instead of crunching through a big table. If the data in the underlying table changes, just run the SQL statement REFRESH MATERIALIZED VIEW. Where possible, Redshift performs an incremental refresh, so it doesn't have to recompute the whole view.
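A minimal sketch of the pattern; the table, view, and column names here are illustrative, not from the original article:

```sql
-- Hypothetical example: precompute a daily revenue rollup.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS total_revenue
FROM sales
GROUP BY order_date;

-- Dashboards query the small precomputed view instead of scanning sales.
SELECT total_revenue FROM daily_revenue WHERE order_date >= '2023-01-01';

-- After new rows land in sales, refresh the view; Redshift applies an
-- incremental refresh when the view's definition allows it.
REFRESH MATERIALIZED VIEW daily_revenue;
```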
#2: Use elastic resize and concurrency scaling to handle bursts of workload
Concurrency scaling and elastic resize let you resize clusters on the fly to cope with sudden rises and falls in demand.
With elastic resize, you can double or halve the number of compute nodes in a cluster or change the node type. It doesn’t require you to restart the cluster, and only takes a few minutes. If you enable concurrency scaling, your cluster can resize itself in response to the incoming workload.
#3: Use the Amazon Redshift Advisor to lighten the admin load
Amazon Redshift Advisor will run tests on your cluster and give you recommendations to help you boost its performance and save on operating costs. For instance, it can recommend the best distribution key or sort key for your tables, show you where you can save space with compression encodings, and spot where table statistics are stale or missing.
#4: Increase throughput with Auto WLM
Auto WLM (workload management) uses machine learning to maximise throughput. You can set query priorities to make sure the most important work gets priority, and set query monitoring rules that let you change priorities dynamically. You can also use short query acceleration to let small jobs jump the queue, and concurrency scaling to bring in extra clusters when needed.
#5: Make use of data lake integration
Amazon Redshift is integrated with other AWS services such as Amazon S3. You can query data from files on Amazon S3, bring in extra processing power from the Amazon Redshift Spectrum compute layer, or export data to the data lake by writing to external tables or using the UNLOAD command.
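A hedged sketch of both directions, query and export; the schema, role ARN, bucket, and table names are placeholders:

```sql
-- Register a Glue Data Catalog database as an external (Spectrum) schema.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'analytics_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';

-- Query files sitting on S3 as if they were local tables; the Spectrum
-- compute layer does the scanning.
SELECT event_type, COUNT(*)
FROM spectrum.clickstream
GROUP BY event_type;

-- Export query results back to the data lake as Parquet.
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2023-01-01''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
FORMAT AS PARQUET;
```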
#6: Make ETL operations a picnic with temporary tables
Amazon Redshift offers temporary tables that last for the duration of a single SQL session. Used correctly, they can give you a significant performance boost on some ETL operations. The trick is to avoid creating them with SELECT…INTO, and instead use CREATE TABLE, which lets you set the column encoding, distribution style, and sort keys. A temporary table created without these gets default encoding and distribution, which wastes space and can force expensive data redistribution when it is joined to other tables.
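One way this might look for a staging step; the table names, keys, and role ARN are assumptions for illustration:

```sql
-- CREATE TEMP TABLE lets you choose encoding, distribution, and sort keys,
-- which SELECT...INTO would not.
CREATE TEMP TABLE stage_orders (
    order_id    BIGINT        ENCODE az64,
    customer_id BIGINT        ENCODE az64,
    amount      DECIMAL(12,2) ENCODE az64,
    order_date  DATE          ENCODE az64
)
DISTKEY (customer_id)
SORTKEY (order_date);

-- Load the staging table, then merge into the target. Matching the
-- target's distribution key on customer_id avoids redistribution.
COPY stage_orders FROM 's3://my-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS CSV;

INSERT INTO orders SELECT * FROM stage_orders;
```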
#7: Use federated queries to connect OLAP and OLTP with the data lake
Skip the ETL operations and table-making altogether with federated queries, which let you run analytics on data in the data lake or on your OLAP/OLTP databases. Or use federated queries to simplify ETL and data ingestion: instead of moving data via Amazon S3 using COPY, you can just ingest it straight into a table.
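A sketch of the ingestion shortcut against a hypothetical Aurora PostgreSQL database; the endpoint, credentials secret, and table names are placeholders:

```sql
-- Expose a live OLTP database as an external schema.
CREATE EXTERNAL SCHEMA oltp
FROM POSTGRES
DATABASE 'orders_db' SCHEMA 'public'
URI 'orders-cluster.cluster-xyz.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:orders-db-creds';

-- Ingest OLTP rows straight into a Redshift table, with no S3 staging
-- and no COPY step.
INSERT INTO orders_history
SELECT * FROM oltp.orders WHERE created_at > '2023-01-01';
```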
#8: Use COPY to maintain efficient data loads
The COPY command performs loads of file-based data as quickly as possible, using the minimum of resources. It loads data in parallel across every compute node in your cluster, from an SSH connection or from sources like Amazon DynamoDB, Amazon S3, and Amazon EMR HDFS file systems. For best results, split your input into multiple compressed files of roughly 1MB-1GB each.
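A typical invocation might look like this; the bucket prefix, role ARN, and delimiter are illustrative:

```sql
-- Parallel load of gzip-compressed, pipe-delimited files from S3.
-- Splitting the input into multiple ~1MB-1GB files lets every slice
-- in the cluster load in parallel.
COPY sales
FROM 's3://my-bucket/sales/2023/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
GZIP
DELIMITER '|'
TIMEFORMAT 'auto';
```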
#9: Get extra performance insights through Amazon CloudWatch and QMR
As well as the Amazon Redshift Advisor, check out CloudWatch metrics, which are data points you can use with Amazon CloudWatch monitoring. Between these and QMR (query monitoring rules), you shouldn't need to write your own metrics. Even if you haven't set query monitoring rules, Redshift automatically collects QMR data. You can query it through system views such as SVL_QUERY_METRICS_SUMMARY, which exposes metrics like:
query_cpu_time (CPU time for an SQL statement)
query_temp_blocks_to_disk (temporary disk space needed for a job)
spectrum_scan_row_count (number of Spectrum rows scanned by a query)
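For example, a quick look at which recent queries spilled the most temporary data to disk (the view and column names are from the Redshift system tables; the ordering choice is just one useful slice):

```sql
-- Surface the heaviest recent queries by temp-disk usage.
SELECT query,
       query_cpu_time,
       query_temp_blocks_to_disk,
       spectrum_scan_row_count
FROM svl_query_metrics_summary
ORDER BY query_temp_blocks_to_disk DESC
LIMIT 10;
```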
#10: Use the latest Amazon Redshift drivers from AWS
If you haven’t switched to the new Amazon Redshift-specific JDBC and ODBC drivers yet, it’s time to do that. If you want to change parameters (though you’ll rarely need to), you’ll need these drivers.
For JDBC, to avoid out-of-memory errors on large result sets, use a LIMIT or OFFSET clause to restrict results. For ODBC, use a cursor so that rows are retrieved in batches rather than all at once.
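The cursor approach can be sketched in SQL like this; the cursor and table names are illustrative, and the batch size is an assumption:

```sql
-- Fetch a large result set in batches instead of pulling everything
-- into client memory at once. Cursors must run inside a transaction.
BEGIN;
DECLARE big_result CURSOR FOR
    SELECT * FROM sales ORDER BY sale_date;
FETCH FORWARD 1000 FROM big_result;  -- repeat until no rows are returned
CLOSE big_result;
COMMIT;
```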