How AQUA For Amazon Redshift Performs Its Queries Up To 10X Faster Than Before

By Patrick Moorhead - May 26, 2021

I love to research and write about semiconductors and chips! Many industry pundits were crowning software king about a decade ago, with semiconductors reduced to commodity serfdom. What I hope you are seeing right now is that when you combine the two and then add cloud, the real magic happens.

Today that is true, as innovative companies like Amazon leverage silicon, software, and the cloud to differentiate. Last year I wrote an article on how Amazon was decommoditizing compute with EC2. Today, I have another exciting story on how Amazon achieved performance at scale for analytics by pushing computation into storage using innovative silicon.


The genesis of AQUA (Advanced Query Accelerator)

Today there are two ways to achieve performance at scale for analytics. A shared-nothing, local storage Massively Parallel Processing (MPP) model can give you good performance. Still, the tradeoff is rigidity in scaling, because each node you add brings a fixed ratio of disk and memory.

Another approach is shared storage over the network, which gives you the flexibility of storage and compute decoupling. But the disadvantage is the penalty on the network transfer of data. As data warehouses continue to grow, the network bandwidth needed in the shared storage model becomes a bottleneck impacting query performance.

Another key trend is the massive increase in disk bandwidth from NVMe (Non-Volatile Memory Express) drives. NVMe is a storage access and transport protocol for flash and next-generation solid-state drives (SSDs). It supports thousands of parallel command queues, which makes it much faster than the single command queue used by hard disks and traditional all-flash architectures. The performance is such that drives can now deliver data faster than CPUs can consume it.

Analytic engines typically perform "needle in a haystack" type queries ("find things that match x") or summarization queries ("summarize a trillion raw events over the past year"). The resulting computation returns only a small amount of data.

The genesis of AQUA came from moving computation closer to storage to take advantage of that bandwidth. Filtering and aggregating next to the drives leaves far less data to send onward, an amount the CPU can handle without slowing down.

AQUA is all custom silicon 

AQUA is a hardware-accelerated cache with a multi-tenant distributed architecture. AQUA will scale out with push-down processing. Push-down processing leaves most of the data in place by extracting a small set of results and then, if needed, manipulating it.
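
To make the idea of push-down processing concrete, here is a minimal Python sketch. It is only an illustration of the general technique, not AQUA's implementation: the Event record, the toy table, and both function names are hypothetical. It compares shipping every row to the compute layer against filtering and summing next to the data and shipping a single result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: int
    category: str
    amount: float


def make_events(n: int) -> list[Event]:
    """Build a deterministic toy table standing in for data held near storage."""
    categories = ["books", "games", "music", "tools"]
    return [Event(i % 100, categories[i % 4], float(i % 7)) for i in range(n)]


def without_pushdown(storage: list[Event], category: str) -> tuple[float, int]:
    """Ship every row to the compute layer, then filter and sum there."""
    shipped = list(storage)  # the whole table crosses the network
    total = sum(e.amount for e in shipped if e.category == category)
    return total, len(shipped)


def with_pushdown(storage: list[Event], category: str) -> tuple[float, int]:
    """Filter and sum next to the data, then ship only the result."""
    partial = sum(e.amount for e in storage if e.category == category)
    return partial, 1  # a single aggregate value crosses the network


events = make_events(10_000)
total_a, rows_a = without_pushdown(events, "games")
total_b, rows_b = with_pushdown(events, "games")
print(f"same answer: {total_a == total_b}, rows shipped: {rows_a} vs {rows_b}")
```

Both paths compute the same total. The difference is how much data has to move to get it.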

AQUA is available with Redshift RA3 compute nodes, and both existing and new Redshift RA3 clusters can take advantage of AQUA. For those unfamiliar with AWS terminology: Redshift is Amazon’s analytical data warehouse that can handle petabyte-scale data. The RA3 node is the latest generation compute node targeted at analytics workloads. RA3 nodes work with cached data in the cluster and “cooler” data on S3 object storage.

AQUA requires no code changes and no additional costs. When turned on, the Redshift query optimizer is smart enough to figure out if AQUA can help, and if so, it will push the query down into the AQUA service layer.

Within an AQUA node, there are multiple hardware modules: custom silicon for compression and encryption, along with custom FPGAs (field-programmable gate arrays) for analytics operations like scanning, filtering, and aggregation.

Stream processors are implemented in FPGAs, operating right up against the disk controllers, between the disks and the CPUs, for query acceleration. Nitro hardware accelerates compression and encryption.

With the RA3 nodes, Redshift storage is on S3 and then cached as needed and kept in the AQUA layer. AQUA scales out and processes data in parallel across many nodes.

How does AQUA (Advanced Query Accelerator) work?

When the Redshift cluster receives a query, it builds a query tree to work out how best to execute it. It knows which operations it can send down to AQUA. Once it has figured out that it can ship portions of the query to AQUA, it does so and then waits to get results back.

In AQUA, the subquery arrives, and the load balancer splits the subquery across multiple nodes to process in parallel. AQUA nodes with the relevant data already cached will be biased towards receiving the queries to avoid unnecessary re-hydration (data extraction from storage blocks).
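
The cache-aware routing described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual AQUA load balancer: the route_blocks function, the node names, and the block ids are all hypothetical, and the policy (prefer nodes that already cache a block, then pick the least-loaded one) is just one plausible way to express the bias.

```python
def route_blocks(
    blocks: list[str], node_caches: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Assign each data block to a node, preferring nodes that already cache it."""
    assignment: dict[str, list[str]] = {node: [] for node in node_caches}
    for block in blocks:
        # Nodes that already hold the block avoid a re-hydration from storage.
        warm = [n for n, cache in node_caches.items() if block in cache]
        candidates = warm if warm else list(node_caches)
        # Among the candidates, pick the least-loaded node (ties broken by name).
        target = min(candidates, key=lambda n: (len(assignment[n]), n))
        assignment[target].append(block)
    return assignment


caches = {
    "node-a": {"b1", "b2"},
    "node-b": {"b2", "b3"},
    "node-c": set(),
}
plan = route_blocks(["b1", "b2", "b3", "b4"], caches)
print(plan)
```

In this toy run, blocks b1 through b3 land on nodes that already cache them, and the uncached block b4 goes to the idle node, which is the only one that would need to hydrate it.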

Operations like compression, encryption, filtering, and aggregation run at the line rate of the SSD, which is a considerable advantage because data is processed as quickly as the drives can deliver it, eliminating the bottleneck between storage bandwidth and CPU bandwidth.

AQUA nodes process the query, including hydrating data from S3 if it's not already in the cache, and send the results back up to the Redshift cluster, where the subsequent processing happens.
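
The flow in the last few paragraphs (hydrate on a cache miss, filter and aggregate on the node, merge results at the cluster) can be sketched as follows. Everything here is a hypothetical stand-in: the S3 dictionary, the Node class, and run_subquery are illustrative names, and the nodes are visited one after another, whereas the real service processes them in parallel.

```python
from collections import Counter

# Hypothetical stand-in for S3: block id -> rows of (status_code, bytes_sent).
S3 = {
    "b1": [(200, 512), (404, 128), (200, 256)],
    "b2": [(500, 64), (200, 1024)],
    "b3": [(404, 32), (404, 16)],
}


class Node:
    """A toy accelerator node with a local cache of data blocks."""

    def __init__(self) -> None:
        self.cache: dict[str, list[tuple[int, int]]] = {}
        self.hydrations = 0

    def scan(self, block_id: str, min_status: int) -> Counter:
        """Filter rows and count them per status code, hydrating on a cache miss."""
        if block_id not in self.cache:
            self.cache[block_id] = S3[block_id]  # fetch from "S3" only when needed
            self.hydrations += 1
        partial = Counter()
        for status, _bytes_sent in self.cache[block_id]:
            if status >= min_status:
                partial[status] += 1
        return partial


def run_subquery(plan: dict[Node, list[str]], min_status: int) -> Counter:
    """Gather the small per-node partial counts and merge them at the cluster."""
    merged = Counter()
    for node, block_ids in plan.items():
        for block_id in block_ids:
            merged += node.scan(block_id, min_status)
    return merged


n1, n2 = Node(), Node()
errors = run_subquery({n1: ["b1", "b2"], n2: ["b3"]}, min_status=400)
print(errors, n1.hydrations, n2.hydrations)
```

Only the per-status counts travel back to the cluster. The raw rows stay on the nodes, and a second scan of the same block is served from the cache without another hydration.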

With computation attached to storage, typically less than 5% of the data scanned is returned.  Sending less data back makes network bandwidth much less of an issue regarding overall end-to-end performance. 

Over time, customers will be able to run smaller clusters because more processing can be pushed down to AQUA, resulting in an improved total cost of ownership.

AQUA is achieving up to 10x better query performance than other enterprise data warehouses. 

Staying secure without sacrificing performance

AQUA supports authentication, encryption, isolation, and compliance to keep both data at rest and data in transit secure. The same permissions to access Amazon Redshift are used for AQUA. Queries processed on AQUA servers are run in isolated processes to ensure data protection.

Getting started

The customer can enable AQUA from the console, the command-line interface, or the API. Upon activation, Redshift will automatically push queries down to AQUA, and system tables keep track of everything that was done. It is easy to turn on and off between sessions to benchmark what effect AQUA has on performance.
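
For the API route, a minimal sketch might look like the following. It assumes the Redshift client in boto3 (the AWS SDK for Python) and its modify_aqua_configuration call. The set_aqua_status wrapper and the cluster name are hypothetical, and it is worth checking the current AWS documentation before relying on this call, since the service's behavior may have changed since this article was written.

```python
def set_aqua_status(redshift_client, cluster_id: str, status: str) -> dict:
    """Request an AQUA configuration change for one cluster.

    `redshift_client` is expected to behave like boto3.client("redshift").
    """
    allowed = {"enabled", "disabled", "auto"}
    if status not in allowed:
        raise ValueError(f"status must be one of {sorted(allowed)}, got {status!r}")
    return redshift_client.modify_aqua_configuration(
        ClusterIdentifier=cluster_id,
        AquaConfigurationStatus=status,
    )


# Example wiring (needs boto3 and AWS credentials; the cluster name is made up):
#   import boto3
#   client = boto3.client("redshift")
#   set_aqua_status(client, "my-ra3-cluster", "enabled")
```

Passing the client in as an argument keeps the wrapper easy to exercise with a stub before pointing it at a real cluster.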

Common use cases

Everyday use cases include dashboard speedups, search, text and sentiment analysis, clickstream, and log analytics.  

Multiple databases that feed data warehouses with normalized relational data are very efficient to query. But we are evolving toward semi-structured data, such as JSON (JavaScript Object Notation), log events, and clickstream events, coming into the analytic warehouse, and that is where AQUA excels.

Finding articles that include terms like "NASCAR" or finding product recommendations that match a particular category are the kinds of queries that scream with AQUA.

Wrapping up

Amazon has found a way to push computation into storage resulting in computational storage optimized for analytics. AQUA brings compute to the storage, so data does not have to move back and forth. 

Kudos to Amazon for an innovation that improves analytics performance, running queries up to ten times faster than before. The good news is that this order-of-magnitude leap in performance is available to Amazon Redshift users at no additional cost.

Note: Moor Insights & Strategy writers and editors may have contributed to this article. 

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences and his understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience, including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.