Correlated Subqueries (CSQs) in Greenplum
Correlated Subqueries
Correlated Subqueries provide an efficient way to express results in terms of results of another query. The Greenplum Database supports correlated subqueries (CSQs) that provide compatibility with many existing applications. A CSQ is an ordinary SELECT query with a WHERE clause or target list that contains references to the parent outer clause. A CSQ may be a scalar subquery (Example 1) or a table subquery (Example 2) depending on whether it returns a single row or multiple rows. Correlated subqueries with skip-level correlations are not supported.
Correlated Subquery Examples
Example 1 – Scalar correlated subquery
SELECT * FROM t1 WHERE t1.x > (SELECT MAX(t2.x) FROM t2 WHERE t2.y = t1.y);
Example 2 – Correlated EXISTS subquery
SELECT * FROM t1 WHERE
EXISTS (SELECT 1 FROM t2 WHERE t2.x = t1.x);
Greenplum Database executes CSQs in two ways, as follows:
(a) The CSQ might be unnested into join operations, which is efficient OR
(b) The CSQ might be executed for every row of the outer query in a naïve manner.
A large class of queries falls in category (a), including all queries from the TPC-H benchmark.
Queries that contain CSQs in the SELECT list or are connected by OR conditions fall in category (b). Example 3 and Example 4 illustrate how some of these queries can be rewritten to improve performance.
Example 3 - CSQ in the Select List
Original Query
SELECT T1.a,
(SELECT COUNT(DISTINCT T2.z) FROM t2 WHERE t1.x = t2.y) dt2
FROM t1;
Rewrite this query to performing an inner join with t1 first, and then to perform a left join with t1 again. The rewrite applies only for an equijoin in the correlated condition.
Rewritten Query
SELECT t1.a, dt2 FROM t1
LEFT JOIN
(SELECT t2.y AS csq_y, COUNT(DISTINCT t2.z) AS dt2
FROM t1, t2 WHERE t1.x = t2.y GROUP BY t1.x)
ON (t1.x = csq_y);
Example 4 - CSQs connected by OR Clauses
Original Query
SELECT * FROM t1
WHERE
x > (SELECT COUNT(*) FROM t2 WHERE t1.x = t2.x)
OR x < (SELECT COUNT(*) FROM t3 WHERE t1.y = t3.y)
Rewrite this query to separate it into two parts with a union on the OR conditions.
Rewritten Query
SELECT * FROM t1
WHERE x > (SELECT count(*) FROM t2 WHERE t1.x = t2.x)
UNION
SELECT * FROM t1
WHERE x < (SELECT count(*) FROM t3 WHERE t1.y = t3.y)
To determine if the given CSQ is in category (a) or (b), use EXPLAIN SELECT or EXPLAIN ANALYZE SELECT statement to view the query plan. The existence of SubPlan nodes in the plan indicates that the query falls in category (b).