Query execution process is greenplum

Post date: Nov 26, 2013 8:11:2 PM

When a user runs a SQL query in greenplum, it is the master who receives the SQL. Master then parse the query , optimize the query, creates execution plan of the query also called query plan. Query plan can me either parallel or targeted Master then dispatched the plan to the segments for execution. Each segment is then responsible for executing local database operations on its own particular set of data. Most database operations—such as table scans, joins, aggregations, and sorts—execute in parallel across the segments simultaneously. Each operation is performed on a segment database independent of the data associated with the other segment databases.

There are certain queries that may only access data on a single segment, such as single-row INSERT, UPDATE, DELETE or SELECT operations, or queries that return a small set of rows and filter on the table distribution key column(s). In queries such as these, the query plan is not dispatched to all segments, but is targeted to the segment that contains the affected row(s).

A query plan is the set of operations that Greenplum Database will perform to produce the answer to a given query. Each node or step in the plan represents a database operation such as a table scan, join, aggregation or sort. Plans are read and executed from bottom to top.

In addition to the typical database operations (tables scans, joins, etc.), Greenplum Database has an additional operation type called a motion

A motion operation involves moving tuples between the segments during query processing. Note that not every query requires a motion. For example, a query of the system catalog tables on the master does not require data to move across the interconnect.

In order to achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. 

A slice is a portion of the plan that can be worked on independently at the segment-level. A query plan is sliced wherever a motion operation occurs in the plan, one slice on each side of the motion.

Some commonly used terms in greenplum query execution plan.

1. Gather motion: A gather motion is when the segments send results back up to the master for presentation to the client.

2. Redistribute motion:  redistribute motion moves tuples between the segments in order to complete the join.

3. Query dispatcher (QD): This process is created in the master.  Query dispatcher also called query worker process is responsible for creating and dispatching the query plan, and for accumulating and presenting the final results. 

4. Query executor (QE): This process is created on the segments.  Query executor also called query worker process is responsible for completing its portion of work and communicating its intermediate results to the other worker processes. For each slice of the query plan there is at least one worker process assigned. A worker process works on its assigned portion of the query plan independently. During query execution, each segment will have a number of processes working on the query in parallel.

5. Gangs: Related processes that are working on the same portion of the query plan are referred to as gangs.

6. Interconnect: As a portion of work is completed, tuples flow up the query plan from one gang of processes to the next. This inter-process communication between the segments is what is referred to as the interconnect component of Greenplum Database.