Correlated Subqueries (CSQs) in Greenplum

Correlated Subqueries

Correlated Subqueries provide an efficient way to express results in terms of results of another query. The Greenplum Database supports correlated subqueries (CSQs) that provide compatibility with many existing applications. A CSQ is an ordinary SELECT  query with a WHERE clause or target list that contains references to the parent outer clause. A CSQ may be a scalar subquery (Example 1) or a table subquery (Example 2) depending on whether it returns a single row or multiple rows. Correlated subqueries with skip-level correlations are not supported.

Correlated Subquery Examples

Example 1 – Scalar correlated subquery

SELECT * FROM t1 WHERE t1.x > (SELECT MAX(t2.x) FROM t2 WHERE t2.y = t1.y);

Example 2 – Correlated EXISTS subquery

SELECT * FROM t1 WHERE 

EXISTS (SELECT 1 FROM t2 WHERE t2.x = t1.x);

Greenplum Database executes CSQs in two ways, as follows: 

(a) The CSQ might be unnested into join operations, which is efficient  OR

(b) The CSQ might be executed for every row of the outer query in a naïve manner. 

A large class of queries falls in category (a), including all queries from the TPC-H benchmark. 

Queries that contain CSQs in the SELECT list or are connected by OR conditions fall in category (b). Example 3 and Example 4 illustrate how some of these queries can be rewritten to improve performance.

Example 3 - CSQ in the Select List

Original Query

SELECT T1.a,

(SELECT COUNT(DISTINCT T2.z) FROM t2 WHERE t1.x = t2.y) dt2 

FROM t1;

Rewrite this query to performing an inner join with t1 first, and then to perform a left join with t1 again. The rewrite applies only for an equijoin in the correlated condition. 

Rewritten Query

SELECT t1.a, dt2 FROM t1 

LEFT JOIN 

(SELECT t2.y AS csq_y, COUNT(DISTINCT t2.z) AS dt2 

FROM t1, t2 WHERE t1.x = t2.y GROUP BY t1.x) 

ON (t1.x = csq_y);

Example 4 - CSQs connected by OR Clauses

Original Query

SELECT * FROM t1 

WHERE 

x > (SELECT COUNT(*) FROM t2 WHERE t1.x = t2.x) 

OR x < (SELECT COUNT(*) FROM t3 WHERE t1.y = t3.y)

Rewrite this query to separate it into two parts with a union on the OR conditions. 

Rewritten Query

SELECT * FROM t1 

WHERE x > (SELECT count(*) FROM t2 WHERE t1.x = t2.x) 

UNION 

SELECT * FROM t1 

WHERE x < (SELECT count(*) FROM t3 WHERE t1.y = t3.y)

To determine if the given CSQ is in category (a) or (b), use EXPLAIN SELECT or EXPLAIN ANALYZE SELECT statement to view the query plan. The existence of SubPlan nodes in the plan indicates that the query falls in category (b).